# Evaluation Guide

## Overview

This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.

## 🎯 Evaluation Process

### 1. Basic Evaluation

Evaluate the best trained model:

```bash
python eval.py
```

This will:
- Automatically find the best model from `runs/train/`
- Load the test dataset
- Run evaluation on the test set
- Save results to `runs/val/test_evaluation/`

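For reference, `eval.py` presumably wraps Ultralytics' built-in validation. If you want to run the same step directly from Python, the sketch below uses the Ultralytics API; the checkpoint path, data file, and thresholds are illustrative assumptions rather than the script's actual defaults.

```python
# Minimal sketch of a test-set evaluation with the Ultralytics API.
# Paths and thresholds are assumptions; adapt them to your project layout.
from ultralytics import YOLO

# Load the best checkpoint from a finished training run (example path).
model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

# Evaluate on the test split declared in data.yaml and save outputs.
metrics = model.val(
    data="data/data.yaml",
    split="test",
    conf=0.25,               # confidence threshold
    iou=0.7,                 # IoU threshold used by NMS during validation
    project="runs/val",
    name="test_evaluation",
)

print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
```
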
### 2. Custom Evaluation

#### Evaluate Specific Model
```bash
python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt
```

#### Custom Thresholds
```bash
python eval.py --conf 0.3 --iou 0.5
```

#### Different Model Size
```bash
python eval.py --model-size m
```

## 📊 Evaluation Metrics

### Key Metrics Explained

1. **mAP50 (Mean Average Precision at IoU=0.5)**
   - Measures precision across different recall levels
   - IoU threshold of 0.5 (50% overlap)
   - Range: 0-1 (higher is better)

2. **mAP50-95 (Mean Average Precision across IoU thresholds)**
   - Average of mAP at IoU thresholds from 0.5 to 0.95
   - More comprehensive than mAP50
   - Range: 0-1 (higher is better)

3. **Precision**
   - Ratio of correct detections to total detections
   - Measures accuracy of positive predictions
   - Range: 0-1 (higher is better)

4. **Recall**
   - Ratio of correct detections to total ground truth objects
   - Measures ability to find all objects
   - Range: 0-1 (higher is better)

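To make the precision and recall definitions above concrete, here is a toy calculation; the counts are made up for illustration only.

```python
# Toy precision/recall arithmetic (illustrative counts, not real results).
# Suppose the test set contains 209 ground-truth ID cards and the model produces
# 210 detections, 207 of which overlap a ground-truth box with IoU >= 0.5.
true_positives = 207   # detections that match a ground-truth box
false_positives = 3    # detections that match nothing
false_negatives = 2    # ground-truth boxes that were missed (209 - 207)

precision = true_positives / (true_positives + false_positives)  # accuracy of the detections
recall = true_positives / (true_positives + false_negatives)     # coverage of the ground truth
print(f"precision={precision:.3f}  recall={recall:.3f}")          # ≈0.986 and ≈0.990
```

mAP50 goes one step further: it sweeps the confidence threshold, traces the resulting precision-recall curve, and averages the area under that curve over all classes; mAP50-95 repeats this at IoU thresholds from 0.5 to 0.95.
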
### Expected Performance

For French ID Card detection:

| Metric    | Target | Good | Excellent |
|-----------|--------|------|-----------|
| mAP50     | >0.8   | >0.9 | >0.95     |
| mAP50-95  | >0.6   | >0.8 | >0.9      |
| Precision | >0.8   | >0.9 | >0.95     |
| Recall    | >0.8   | >0.9 | >0.95     |

## 📈 Understanding Results

### Sample Output

```
    Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 14/14
      all        212        209          1       0.99      0.995      0.992
```

**Interpretation:**
- **Images**: 212 test images
- **Instances**: 209 ground truth objects
- **Box(P)**: Precision = 1.0 (100% accurate detections)
- **R**: Recall = 0.99 (99% of objects found)
- **mAP50**: 0.995 (excellent performance)
- **mAP50-95**: 0.992 (excellent across IoU thresholds)

### Confidence vs IoU Thresholds

#### Confidence Threshold Impact
```bash
# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7

# Low confidence (more detections, lower precision)
python eval.py --conf 0.1
```

#### IoU Threshold Impact
```bash
# Strict IoU (higher precision requirements)
python eval.py --iou 0.7

# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
```

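To quantify these effects on your own model, you can re-run validation over a range of confidence thresholds and watch precision and recall move in opposite directions. The sketch below calls the Ultralytics API directly; the checkpoint and data paths are assumptions.

```python
# Sweep the confidence threshold and print the resulting precision/recall trade-off.
# Checkpoint and data paths are examples; adjust them to your setup.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

for conf in (0.1, 0.3, 0.5, 0.7):
    metrics = model.val(data="data/data.yaml", split="test", conf=conf, verbose=False)
    print(f"conf={conf:.1f}  precision={metrics.box.mp:.3f}  "
          f"recall={metrics.box.mr:.3f}  mAP50={metrics.box.map50:.3f}")
```
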
## 📁 Evaluation Outputs

### Results Directory Structure

```
runs/val/test_evaluation/
├── predictions.json       # Detailed predictions
├── results.png            # Performance plots
├── confusion_matrix.png   # Confusion matrix
├── BoxR_curve.png         # Precision-Recall curve
├── labels/                # Predicted labels
└── images/                # Visualization images
```

### Key Output Files

1. **predictions.json**
   ```json
   {
     "metrics": {
       "metrics/mAP50": 0.995,
       "metrics/mAP50-95": 0.992,
       "metrics/precision": 1.0,
       "metrics/recall": 0.99
     }
   }
   ```

2. **results.png**
   - Training curves
   - Loss plots
   - Metric evolution

3. **confusion_matrix.png**
   - True vs predicted classifications
   - Error analysis

## 🔍 Advanced Evaluation

### Batch Evaluation

Evaluate multiple models:

```bash
# Evaluate different model sizes
for size in n s m l; do
    python eval.py --model-size $size
done
```

### Cross-Validation

```bash
# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml
```

### Performance Analysis

#### Speed vs Accuracy Trade-off

| Model Size | Inference Time | mAP50 | Use Case      |
|------------|----------------|-------|---------------|
| n (nano)   | ~2ms           | 0.995 | Real-time     |
| s (small)  | ~4ms           | 0.998 | Balanced      |
| m (medium) | ~8ms           | 0.999 | High accuracy |
| l (large)  | ~12ms          | 0.999 | Best accuracy |

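The timings above are indicative and vary with hardware. The validator also reports per-image speed alongside the accuracy metrics, so you can measure the trade-off yourself; a sketch (checkpoint paths are assumptions):

```python
# Compare accuracy and per-image inference time for two checkpoints on the same test split.
# Checkpoint paths are examples; point them at your own trained weights.
from ultralytics import YOLO

checkpoints = {
    "nano":  "runs/train/yolov8_n_french_id_card/weights/best.pt",
    "small": "runs/train/yolov8_s_french_id_card/weights/best.pt",
}

for name, weights in checkpoints.items():
    metrics = YOLO(weights).val(data="data/data.yaml", split="test", verbose=False)
    print(f"{name}: mAP50={metrics.box.map50:.3f}  "
          f"inference={metrics.speed['inference']:.1f} ms/image")
```
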
## 📊 Visualization

### Generated Plots

1. **Precision-Recall Curve**
   - Shows precision vs recall at different thresholds
   - Area under curve = mAP

2. **Confusion Matrix**
   - True positives, false positives, false negatives
   - Helps identify error patterns

3. **Training Curves**
   - Loss evolution during training
   - Metric progression

### Custom Visualizations

```python
# Load evaluation results
import json

with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']
```

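From there you can build any custom plot you need, for example a quick bar chart of the headline metrics (this assumes matplotlib is installed and that `predictions.json` has the layout shown earlier):

```python
# Bar chart of the headline test metrics stored in predictions.json.
# Assumes matplotlib is installed and the JSON layout shown in "Key Output Files".
import json
import matplotlib.pyplot as plt

with open("runs/val/test_evaluation/predictions.json") as f:
    metrics = json.load(f)["metrics"]

names = ["mAP50", "precision", "recall"]
values = [metrics[f"metrics/{name}"] for name in names]

plt.bar(names, values)
plt.ylim(0, 1)
plt.ylabel("score")
plt.title("Test-set metrics")
plt.savefig("runs/val/test_evaluation/metric_summary.png")
```
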
## 🔧 Troubleshooting

### Common Evaluation Issues

**1. Model Not Found**
```bash
# Check available models
ls runs/train/*/weights/

# Specify model path explicitly
python eval.py --model path/to/model.pt
```

**2. Test Data Not Found**
```bash
# Validate data structure
python train.py --validate-only

# Check data.yaml paths
cat data/data.yaml
```

**3. Memory Issues**
```bash
# Reduce batch size
python eval.py --batch-size 8

# Use smaller model
python eval.py --model-size n
```

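If the CLI flags are not enough, the same memory-saving options exist at the API level. Whether `eval.py` exposes them is project-specific; the sketch below passes them straight to the Ultralytics `val()` call (paths are assumptions):

```python
# Lower-memory evaluation: smaller batch, half precision, explicit device.
# Whether eval.py exposes these options is an assumption; the Ultralytics API accepts them directly.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")
metrics = model.val(
    data="data/data.yaml",
    split="test",
    batch=8,      # smaller batches reduce peak GPU memory
    half=True,    # FP16 inference where the hardware supports it
    device=0,     # or "cpu" if the GPU is too small
)
print(f"mAP50: {metrics.box.map50:.3f}")
```
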
### Debug Commands

```bash
# Check model file
python -c "import torch; model = torch.load('model.pt'); print(model.keys())"

# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"

# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"
```

## 📋 Evaluation Checklist

- [ ] Model trained successfully
- [ ] Test dataset available
- [ ] GPU memory sufficient
- [ ] Correct model path
- [ ] Appropriate thresholds set
- [ ] Results directory writable

## 🎯 Best Practices

### 1. Threshold Selection

```bash
# Start with default thresholds
python eval.py

# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5  # Balanced
python eval.py --conf 0.7 --iou 0.7  # High precision
python eval.py --conf 0.3 --iou 0.3  # High recall
```

### 2. Model Comparison

```bash
# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m

# Compare results
diff runs/val/test_evaluation_n/predictions.json \
     runs/val/test_evaluation_s/predictions.json
```

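A raw `diff` of two JSON files is hard to read. A small helper like the one below (hypothetical, assuming per-size output directories and the `predictions.json` layout shown earlier) prints the metrics side by side instead:

```python
# Print the key metrics of several evaluation runs side by side.
# Directory names are assumptions based on the per-size runs above.
import json

runs = {
    "n": "runs/val/test_evaluation_n/predictions.json",
    "s": "runs/val/test_evaluation_s/predictions.json",
    "m": "runs/val/test_evaluation_m/predictions.json",
}

print(f"{'size':>4} {'mAP50':>7} {'mAP50-95':>9} {'precision':>10} {'recall':>7}")
for size, path in runs.items():
    m = json.load(open(path))["metrics"]
    print(f"{size:>4} {m['metrics/mAP50']:7.3f} {m['metrics/mAP50-95']:9.3f} "
          f"{m['metrics/precision']:10.3f} {m['metrics/recall']:7.3f}")
```
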
### 3. Performance Monitoring

```bash
# Regular evaluation
python eval.py --model-size n

# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt
```

## 📈 Continuous Evaluation

### Automated Evaluation

```bash
#!/bin/bash
# eval_script.sh

MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}

echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size $MODEL_SIZE --conf $THRESHOLD

# Save results
cp runs/val/test_evaluation/predictions.json \
   results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json
```

### Integration with CI/CD

```yaml
# .github/workflows/evaluate.yml
name: Model Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n
```

---

**Note**: Regular evaluation helps ensure model performance remains consistent over time.