# Evaluation Guide

## Overview

This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.

## 🎯 Evaluation Process

### 1. Basic Evaluation

Evaluate the best trained model:

```bash
python eval.py
```

This will:
- Automatically find the best model from `runs/train/`
- Load the test dataset
- Run evaluation on the test set
- Save results to `runs/val/test_evaluation/`

### 2. Custom Evaluation

#### Evaluate a Specific Model

```bash
python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt
```

#### Custom Thresholds

```bash
python eval.py --conf 0.3 --iou 0.5
```

#### Different Model Size

```bash
python eval.py --model-size m
```

## 📊 Evaluation Metrics

### Key Metrics Explained

1. **mAP50 (Mean Average Precision at IoU=0.5)**
   - Measures precision across different recall levels
   - IoU threshold of 0.5 (50% overlap)
   - Range: 0-1 (higher is better)

2. **mAP50-95 (Mean Average Precision across IoU thresholds)**
   - Average of mAP at IoU thresholds from 0.5 to 0.95
   - More comprehensive than mAP50
   - Range: 0-1 (higher is better)

3. **Precision**
   - Ratio of correct detections to total detections
   - Measures accuracy of positive predictions
   - Range: 0-1 (higher is better)

4. **Recall**
   - Ratio of correct detections to total ground-truth objects
   - Measures ability to find all objects
   - Range: 0-1 (higher is better)

### Expected Performance

For French ID Card detection:

| Metric    | Target | Good | Excellent |
|-----------|--------|------|-----------|
| mAP50     | >0.8   | >0.9 | >0.95     |
| mAP50-95  | >0.6   | >0.8 | >0.9      |
| Precision | >0.8   | >0.9 | >0.95     |
| Recall    | >0.8   | >0.9 | >0.95     |

## 📈 Understanding Results

### Sample Output

```
Class   Images  Instances   Box(P       R    mAP50  mAP50-95): 100%|██████████| 14/14
  all      212        209       1    0.99    0.995     0.992
```

**Interpretation:**
- **Images**: 212 test images
- **Instances**: 209 ground-truth objects
- **Box(P)**: Precision = 1.0 (100% accurate detections)
- **R**: Recall = 0.99 (99% of objects found)
- **mAP50**: 0.995 (excellent performance)
- **mAP50-95**: 0.992 (excellent across IoU thresholds)

### Confidence vs IoU Thresholds

#### Confidence Threshold Impact

```bash
# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7

# Low confidence (more detections, lower precision)
python eval.py --conf 0.1
```

#### IoU Threshold Impact

```bash
# Strict IoU (higher precision requirements)
python eval.py --iou 0.7

# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
```

## 📁 Evaluation Outputs

### Results Directory Structure

```
runs/val/test_evaluation/
├── predictions.json       # Detailed predictions
├── results.png            # Performance plots
├── confusion_matrix.png   # Confusion matrix
├── BoxR_curve.png         # Precision-Recall curve
├── labels/                # Predicted labels
└── images/                # Visualization images
```

### Key Output Files

1. **predictions.json**

   ```json
   {
     "metrics": {
       "metrics/mAP50": 0.995,
       "metrics/mAP50-95": 0.992,
       "metrics/precision": 1.0,
       "metrics/recall": 0.99
     }
   }
   ```

2. **results.png**
   - Training curves
   - Loss plots
   - Metric evolution

3. **confusion_matrix.png**
   - True vs predicted classifications
   - Error analysis
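As a quick sanity check, the metrics saved in `predictions.json` can be compared against the targets from the Expected Performance table. The following is a minimal sketch assuming the JSON layout shown above; exact key names can vary between Ultralytics versions, so adjust them to match your file.

```python
import json

# Minimal sketch: compare saved metrics against the "Expected Performance" targets.
# Assumes the predictions.json layout shown in this guide; adjust the keys if your
# evaluation writes a different structure.
TARGETS = {
    "metrics/mAP50": 0.8,
    "metrics/mAP50-95": 0.6,
    "metrics/precision": 0.8,
    "metrics/recall": 0.8,
}

with open("runs/val/test_evaluation/predictions.json") as f:
    metrics = json.load(f)["metrics"]

for key, target in TARGETS.items():
    value = metrics.get(key)
    status = "OK" if value is not None and value > target else "BELOW TARGET"
    print(f"{key}: {value} (target > {target}) -> {status}")
```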
## 🔍 Advanced Evaluation

### Batch Evaluation

Evaluate multiple models:

```bash
# Evaluate different model sizes
for size in n s m l; do
    python eval.py --model-size $size
done
```

### Cross-Validation

```bash
# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml
```

### Performance Analysis

#### Speed vs Accuracy Trade-off

| Model Size | Inference Time | mAP50 | Use Case      |
|------------|----------------|-------|---------------|
| n (nano)   | ~2ms           | 0.995 | Real-time     |
| s (small)  | ~4ms           | 0.998 | Balanced      |
| m (medium) | ~8ms           | 0.999 | High accuracy |
| l (large)  | ~12ms          | 0.999 | Best accuracy |

## 📊 Visualization

### Generated Plots

1. **Precision-Recall Curve**
   - Shows precision vs recall at different thresholds
   - Area under the curve = mAP

2. **Confusion Matrix**
   - True positives, false positives, false negatives
   - Helps identify error patterns

3. **Training Curves**
   - Loss evolution during training
   - Metric progression

### Custom Visualizations

```python
import json

# Load evaluation results
with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']
```

## 🔧 Troubleshooting

### Common Evaluation Issues

**1. Model Not Found**

```bash
# Check available models
ls runs/train/*/weights/

# Specify the model path explicitly
python eval.py --model path/to/model.pt
```

**2. Test Data Not Found**

```bash
# Validate the data structure
python train.py --validate-only

# Check data.yaml paths
cat data/data.yaml
```

**3. Memory Issues**

```bash
# Reduce the batch size
python eval.py --batch-size 8

# Use a smaller model
python eval.py --model-size n
```

### Debug Commands

```bash
# Check the model file
python -c "import torch; model = torch.load('model.pt'); print(model.keys())"

# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"

# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"
```

## 📋 Evaluation Checklist

- [ ] Model trained successfully
- [ ] Test dataset available
- [ ] GPU memory sufficient
- [ ] Correct model path
- [ ] Appropriate thresholds set
- [ ] Results directory writable

## 🎯 Best Practices

### 1. Threshold Selection

```bash
# Start with default thresholds
python eval.py

# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5   # Balanced
python eval.py --conf 0.7 --iou 0.7   # High precision
python eval.py --conf 0.3 --iou 0.3   # High recall
```

### 2. Model Comparison

```bash
# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m

# Compare results
diff runs/val/test_evaluation_n/predictions.json \
     runs/val/test_evaluation_s/predictions.json
```
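Because `diff` only highlights textual differences between two files, a short script is often more convenient for viewing several runs side by side. The sketch below assumes one results directory per model size, following the `runs/val/test_evaluation_n/` and `_s/` paths used in the diff example (the `_m/` path is an assumed extension of the same pattern), and the `predictions.json` layout shown earlier; adjust paths and keys to match your setup.

```python
import json
from pathlib import Path

# Minimal sketch: tabulate key metrics from several evaluation runs side by side.
# Assumes per-size result directories (runs/val/test_evaluation_<size>/) and the
# predictions.json layout shown earlier in this guide.
SIZES = ["n", "s", "m"]
KEYS = ["metrics/mAP50", "metrics/mAP50-95", "metrics/precision", "metrics/recall"]

print(f"{'size':<6}" + "".join(f"{key.split('/')[-1]:>12}" for key in KEYS))
for size in SIZES:
    path = Path(f"runs/val/test_evaluation_{size}/predictions.json")
    if not path.exists():
        print(f"{size:<6}(no results found)")
        continue
    metrics = json.loads(path.read_text())["metrics"]
    print(f"{size:<6}" + "".join(f"{metrics.get(key, float('nan')):>12.4f}" for key in KEYS))
```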
### 3. Performance Monitoring

```bash
# Regular evaluation
python eval.py --model-size n

# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt
```

## 📈 Continuous Evaluation

### Automated Evaluation

```bash
#!/bin/bash
# eval_script.sh

MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}

echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size $MODEL_SIZE --conf $THRESHOLD

# Save results
cp runs/val/test_evaluation/predictions.json \
   results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json
```

### Integration with CI/CD

```yaml
# .github/workflows/evaluate.yml
name: Model Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n
```

---

**Note**: Regular evaluation helps ensure model performance remains consistent over time.