Evaluation Guide
Overview
This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.
🎯 Evaluation Process
1. Basic Evaluation
Evaluate the best trained model:
python eval.py
This will:
- Automatically find the best model from runs/train/
- Load the test dataset
- Run evaluation on the test set
- Save results to runs/val/test_evaluation/
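Under the hood, this amounts to a handful of Ultralytics API calls. The sketch below only illustrates that flow and is not the project's actual eval.py; the checkpoint path and the test split name in data/data.yaml are assumptions.

```python
# Minimal sketch of the evaluation flow, assuming the Ultralytics API.
# The checkpoint path and the "test" split in data/data.yaml are assumptions.
from ultralytics import YOLO

# Load the best checkpoint produced by training
model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

# Evaluate on the test split and save results under runs/val/test_evaluation/
metrics = model.val(
    data="data/data.yaml",
    split="test",
    project="runs/val",
    name="test_evaluation",
)

print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
```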
2. Custom Evaluation
Evaluate Specific Model
python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt
Custom Thresholds
python eval.py --conf 0.3 --iou 0.5
Different Model Size
python eval.py --model-size m
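If you want to see how these flags plausibly reach the evaluator, the sketch below maps them onto Ultralytics val() arguments with argparse. It is an assumption about how eval.py is structured, not a copy of it; the run-directory naming used for --model-size is also assumed.

```python
# Hypothetical mapping of eval.py's CLI flags onto model.val() arguments.
# The flag names mirror the commands above; eval.py's internals may differ.
import argparse
from ultralytics import YOLO

parser = argparse.ArgumentParser(description="Evaluate a YOLOv8 French ID card detector")
parser.add_argument("--model", default="runs/train/yolov8_n_french_id_card/weights/best.pt")
parser.add_argument("--model-size", default=None, help="n, s, m or l (assumed run-dir naming)")
parser.add_argument("--conf", type=float, default=0.25, help="confidence threshold")
parser.add_argument("--iou", type=float, default=0.7, help="IoU threshold")
args = parser.parse_args()

weights = args.model
if args.model_size:
    # Assumed convention: runs/train/yolov8_<size>_french_id_card/
    weights = f"runs/train/yolov8_{args.model_size}_french_id_card/weights/best.pt"

metrics = YOLO(weights).val(data="data/data.yaml", split="test",
                            conf=args.conf, iou=args.iou)
```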
📊 Evaluation Metrics
Key Metrics Explained
- mAP50 (Mean Average Precision at IoU=0.5)
  - Measures precision across different recall levels
  - IoU threshold of 0.5 (50% overlap)
  - Range: 0-1 (higher is better)
- mAP50-95 (Mean Average Precision across IoU thresholds)
  - Average of mAP at IoU thresholds from 0.5 to 0.95
  - More comprehensive than mAP50
  - Range: 0-1 (higher is better)
- Precision
  - Ratio of correct detections to total detections
  - Measures accuracy of positive predictions
  - Range: 0-1 (higher is better)
- Recall
  - Ratio of correct detections to total ground truth objects
  - Measures ability to find all objects
  - Range: 0-1 (higher is better)
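To make these definitions concrete, here is a small worked example with made-up detection counts and one IoU computation (all numbers are illustrative):

```python
# Toy example of the metric definitions above (numbers are illustrative).
# Precision = TP / (TP + FP), Recall = TP / (TP + FN).
tp, fp, fn = 95, 5, 10           # correct detections, false alarms, missed cards

precision = tp / (tp + fp)       # 0.95   -> 95% of detections are correct
recall = tp / (tp + fn)          # ~0.905 -> ~90% of ground-truth cards are found
print(f"precision={precision:.3f} recall={recall:.3f}")

# A detection only counts as a true positive if its IoU with a ground-truth
# box exceeds the threshold (0.5 for mAP50).
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 60), (10, 5, 110, 65)))  # ~0.70, counts as a match at IoU=0.5
```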
Expected Performance
For French ID Card detection:
| Metric    | Target | Good | Excellent |
|-----------|--------|------|-----------|
| mAP50     | >0.8   | >0.9 | >0.95     |
| mAP50-95  | >0.6   | >0.8 | >0.9      |
| Precision | >0.8   | >0.9 | >0.95     |
| Recall    | >0.8   | >0.9 | >0.95     |
📈 Understanding Results
Sample Output
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 14/14
all 212 209 1 0.99 0.995 0.992
Interpretation:
- Images: 212 test images
- Instances: 209 ground truth objects
- Box(P): Precision = 1.0 (100% accurate detections)
- R: Recall = 0.99 (99% of objects found)
- mAP50: 0.995 (excellent performance)
- mAP50-95: 0.992 (excellent across IoU thresholds)
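These summary values can also be read programmatically from the object returned by Ultralytics' val() call; the checkpoint path below is an assumption.

```python
# Reading the summary metrics programmatically (checkpoint path is an assumption).
from ultralytics import YOLO

metrics = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt").val(
    data="data/data.yaml", split="test"
)

print(f"Precision (P): {metrics.box.mp:.3f}")    # mean precision over classes
print(f"Recall (R):    {metrics.box.mr:.3f}")    # mean recall over classes
print(f"mAP50:         {metrics.box.map50:.3f}")
print(f"mAP50-95:      {metrics.box.map:.3f}")
```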
Confidence vs IoU Thresholds
Confidence Threshold Impact
# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7
# Low confidence (more detections, lower precision)
python eval.py --conf 0.1
IoU Threshold Impact
# Strict IoU (higher precision requirements)
python eval.py --iou 0.7
# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
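To see the trade-off empirically rather than one threshold at a time, you can sweep the confidence threshold in a loop. This is a sketch using the Ultralytics API directly; the checkpoint path is an assumption and each iteration re-runs the full evaluation.

```python
# Confidence-threshold sweep to observe the precision/recall trade-off.
# Each val() call re-runs evaluation on the test split, so this is slow but simple.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")  # assumed path
for conf in (0.1, 0.3, 0.5, 0.7):
    m = model.val(data="data/data.yaml", split="test", conf=conf, verbose=False)
    print(f"conf={conf:.1f}  P={m.box.mp:.3f}  R={m.box.mr:.3f}  mAP50={m.box.map50:.3f}")
```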
📁 Evaluation Outputs
Results Directory Structure
runs/val/test_evaluation/
├── predictions.json # Detailed predictions
├── results.png # Performance plots
├── confusion_matrix.png # Confusion matrix
├── BoxR_curve.png # Precision-Recall curve
├── labels/ # Predicted labels
└── images/ # Visualization images
Key Output Files
- predictions.json

  {
    "metrics": {
      "metrics/mAP50": 0.995,
      "metrics/mAP50-95": 0.992,
      "metrics/precision": 1.0,
      "metrics/recall": 0.99
    }
  }

- results.png
  - Training curves
  - Loss plots
  - Metric evolution
- confusion_matrix.png
  - True vs predicted classifications
  - Error analysis
🔍 Advanced Evaluation
Batch Evaluation
Evaluate multiple models:
# Evaluate different model sizes
for size in n s m l; do
python eval.py --model-size $size
done
Cross-Validation
# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml
Performance Analysis
Speed vs Accuracy Trade-off
| Model Size | Inference Time | mAP50 | Use Case      |
|------------|----------------|-------|---------------|
| n (nano)   | ~2ms           | 0.995 | Real-time     |
| s (small)  | ~4ms           | 0.998 | Balanced      |
| m (medium) | ~8ms           | 0.999 | High accuracy |
| l (large)  | ~12ms          | 0.999 | Best accuracy |
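The timing column depends heavily on hardware, image size, and batch size, so measure it on your own machine. A rough sketch is below; the sample image path is a placeholder.

```python
# Rough per-image latency check for one model (sample image path is a placeholder).
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")
results = model.predict("path/to/sample_card.jpg", imgsz=640, verbose=False)

# Ultralytics reports per-image timings in milliseconds.
speed = results[0].speed
print(f"preprocess:  {speed['preprocess']:.1f} ms")
print(f"inference:   {speed['inference']:.1f} ms")
print(f"postprocess: {speed['postprocess']:.1f} ms")
```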
📊 Visualization
Generated Plots
- Precision-Recall Curve
  - Shows precision vs recall at different confidence thresholds
  - Area under the curve gives the average precision (AP); with a single class this equals mAP50
- Confusion Matrix
  - True positives, false positives, false negatives
  - Helps identify error patterns
- Training Curves
  - Loss evolution during training
  - Metric progression
Custom Visualizations
# Load evaluation results
import json
with open('runs/val/test_evaluation/predictions.json', 'r') as f:
results = json.load(f)
# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']
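Building on the snippet above, a quick bar chart of the headline metrics can be produced with matplotlib. This assumes predictions.json has the structure shown earlier; adjust the keys if your file differs.

```python
# Bar chart of the headline metrics (assumes the predictions.json structure above).
import json
import matplotlib.pyplot as plt

with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    metrics = json.load(f)['metrics']

names = ['metrics/mAP50', 'metrics/mAP50-95', 'metrics/precision', 'metrics/recall']
values = [metrics[n] for n in names]

plt.bar([n.split('/')[-1] for n in names], values)
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('Test-set evaluation summary')
plt.savefig('runs/val/test_evaluation/metrics_summary.png', dpi=150)
```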
🔧 Troubleshooting
Common Evaluation Issues
1. Model Not Found
# Check available models
ls runs/train/*/weights/
# Specify model path explicitly
python eval.py --model path/to/model.pt
2. Test Data Not Found
# Validate data structure
python train.py --validate-only
# Check data.yaml paths
cat data/data.yaml
3. Memory Issues
# Reduce batch size
python eval.py --batch-size 8
# Use smaller model
python eval.py --model-size n
Debug Commands
# Check model file
python -c "import torch; ckpt = torch.load('model.pt', map_location='cpu', weights_only=False); print(ckpt.keys())"
# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"
# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"
📋 Evaluation Checklist
- [ ] Model trained successfully
- [ ] Test dataset available
- [ ] GPU memory sufficient
- [ ] Correct model path
- [ ] Appropriate thresholds set
- [ ] Results directory writable
🎯 Best Practices
1. Threshold Selection
# Start with default thresholds
python eval.py
# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5 # Balanced
python eval.py --conf 0.7 --iou 0.7 # High precision
python eval.py --conf 0.3 --iou 0.3 # High recall
2. Model Comparison
# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m
# Compare results
diff runs/val/test_evaluation_n/predictions.json \
runs/val/test_evaluation_s/predictions.json
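diff only shows raw text differences; for a cleaner side-by-side comparison you can load both files in Python. The per-size output directories below follow the naming used above and are assumptions about your local setup.

```python
# Side-by-side comparison of two evaluation runs (directory names are assumptions).
import json

def load_metrics(path):
    with open(path) as f:
        return json.load(f)['metrics']

nano = load_metrics('runs/val/test_evaluation_n/predictions.json')
small = load_metrics('runs/val/test_evaluation_s/predictions.json')

for key in ('metrics/mAP50', 'metrics/mAP50-95', 'metrics/precision', 'metrics/recall'):
    print(f"{key.split('/')[-1]:<10} n={nano[key]:.3f}  s={small[key]:.3f}  "
          f"delta={small[key] - nano[key]:+.3f}")
```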
3. Performance Monitoring
# Regular evaluation
python eval.py --model-size n
# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt
📈 Continuous Evaluation
Automated Evaluation
#!/bin/bash
# eval_script.sh
MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}
echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size $MODEL_SIZE --conf $THRESHOLD
# Save results
mkdir -p results
cp runs/val/test_evaluation/predictions.json \
   results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json
Integration with CI/CD
# .github/workflows/evaluate.yml
name: Model Evaluation
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n
Note: Regular evaluation helps ensure model performance remains consistent over time.