# Evaluation Guide
## Overview
This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.
## 🎯 Evaluation Process
### 1. Basic Evaluation
Evaluate the best trained model:
```bash
python eval.py
```
This will:
- Automatically find the best model from `runs/train/`
- Load the test dataset
- Run evaluation on test set
- Save results to `runs/val/test_evaluation/`
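For reference, this step is roughly what a direct call to the Ultralytics validation API looks like. The sketch below is a minimal approximation, not the exact implementation of `eval.py`; the checkpoint path is an example and the `data/data.yaml` split names are assumed:
```python
from ultralytics import YOLO

# Load the best checkpoint produced by training (example path).
model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

# Validate on the test split defined in data.yaml and save plots + JSON results.
metrics = model.val(
    data="data/data.yaml",
    split="test",
    project="runs/val",
    name="test_evaluation",
    save_json=True,
)

print(metrics.box.map50)  # mAP50
print(metrics.box.map)    # mAP50-95
```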
### 2. Custom Evaluation
#### Evaluate Specific Model
```bash
python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt
```
#### Custom Thresholds
```bash
python eval.py --conf 0.3 --iou 0.5
```
#### Different Model Size
```bash
python eval.py --model-size m
```
## 📊 Evaluation Metrics
### Key Metrics Explained
1. **mAP50 (Mean Average Precision at IoU=0.5)**
   - Measures precision across different recall levels
   - IoU threshold of 0.5 (50% overlap)
   - Range: 0-1 (higher is better)
2. **mAP50-95 (Mean Average Precision across IoU thresholds)**
   - Average of mAP at IoU thresholds from 0.5 to 0.95
   - More comprehensive than mAP50
   - Range: 0-1 (higher is better)
3. **Precision**
   - Ratio of correct detections to total detections
   - Measures accuracy of positive predictions
   - Range: 0-1 (higher is better)
4. **Recall**
   - Ratio of correct detections to total ground truth objects
   - Measures ability to find all objects
   - Range: 0-1 (higher is better)
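As a quick sanity check on the definitions above, precision and recall can be computed directly from detection counts. The counts below are invented for illustration only:
```python
# Hypothetical counts for one evaluation run.
true_positives = 198   # correct detections (right class, IoU above threshold)
false_positives = 4    # detections with no matching ground-truth box
false_negatives = 11   # ground-truth boxes that were never detected

precision = true_positives / (true_positives + false_positives)  # ~0.98
recall = true_positives / (true_positives + false_negatives)     # ~0.95

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```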
### Expected Performance
For French ID Card detection:
| Metric    | Target | Good | Excellent |
|-----------|--------|------|-----------|
| mAP50     | >0.8   | >0.9 | >0.95     |
| mAP50-95  | >0.6   | >0.8 | >0.9      |
| Precision | >0.8   | >0.9 | >0.95     |
| Recall    | >0.8   | >0.9 | >0.95     |
## 📈 Understanding Results
### Sample Output
```
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 14/14
all 212 209 1 0.99 0.995 0.992
```
**Interpretation:**
- **Images**: 212 test images
- **Instances**: 209 ground truth objects
- **Box(P)**: Precision = 1.0 (100% accurate detections)
- **R**: Recall = 0.99 (99% of objects found)
- **mAP50**: 0.995 (excellent performance)
- **mAP50-95**: 0.992 (excellent across IoU thresholds)
### Confidence vs IoU Thresholds
#### Confidence Threshold Impact
```bash
# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7
# Low confidence (more detections, lower precision)
python eval.py --conf 0.1
```
#### IoU Threshold Impact
```bash
# Strict IoU (higher precision requirements)
python eval.py --iou 0.7
# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
```
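To see these trade-offs systematically, you can sweep the confidence threshold and record the resulting metrics. A minimal sketch using the Ultralytics Python API (the checkpoint path is an example):
```python
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")  # example path

# Evaluate the same model at several confidence thresholds.
for conf in (0.1, 0.3, 0.5, 0.7):
    metrics = model.val(data="data/data.yaml", split="test", conf=conf, verbose=False)
    print(f"conf={conf:.1f}  precision={metrics.box.mp:.3f}  "
          f"recall={metrics.box.mr:.3f}  mAP50={metrics.box.map50:.3f}")
```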
## 📁 Evaluation Outputs
### Results Directory Structure
```
runs/val/test_evaluation/
├── predictions.json # Detailed predictions
├── results.png # Performance plots
├── confusion_matrix.png # Confusion matrix
├── BoxR_curve.png # Precision-Recall curve
├── labels/ # Predicted labels
└── images/ # Visualization images
```
### Key Output Files
1. **predictions.json**
   ```json
   {
     "metrics": {
       "metrics/mAP50": 0.995,
       "metrics/mAP50-95": 0.992,
       "metrics/precision": 1.0,
       "metrics/recall": 0.99
     }
   }
   ```
2. **results.png**
   - Training curves
   - Loss plots
   - Metric evolution
3. **confusion_matrix.png**
   - True vs predicted classifications
   - Error analysis
## 🔍 Advanced Evaluation
### Batch Evaluation
Evaluate multiple models:
```bash
# Evaluate different model sizes
for size in n s m l; do
python eval.py --model-size $size
done
```
### Cross-Validation
```bash
# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml
```
### Performance Analysis
#### Speed vs Accuracy Trade-off
| Model Size | Inference Time | mAP50 | Use Case |
|------------|----------------|-------|----------|
| n (nano) | ~2ms | 0.995 | Real-time |
| s (small) | ~4ms | 0.998 | Balanced |
| m (medium) | ~8ms | 0.999 | High accuracy |
| l (large) | ~12ms | 0.999 | Best accuracy |
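The timings above depend heavily on hardware, batch size, and input resolution, so it is worth measuring latency on your own setup. A rough sketch (the checkpoint path and test image are placeholders):
```python
import time
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")  # example path
image = "data/test/images/sample.jpg"                               # placeholder image

# Warm-up run so model loading / CUDA initialisation is not timed.
model.predict(image, verbose=False)

# Average over repeated runs for a more stable estimate.
runs = 50
start = time.perf_counter()
for _ in range(runs):
    model.predict(image, verbose=False)
elapsed = (time.perf_counter() - start) / runs
print(f"Average inference time: {elapsed * 1000:.1f} ms")
```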
## 📊 Visualization
### Generated Plots
1. **Precision-Recall Curve**
   - Shows precision vs recall at different thresholds
   - Area under the curve gives AP (averaged over classes for mAP); see the sketch after this list
2. **Confusion Matrix**
   - True positives, false positives, false negatives
   - Helps identify error patterns
3. **Training Curves**
   - Loss evolution during training
   - Metric progression
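The statement that AP is the area under the precision-recall curve can be checked numerically. The curve values below are invented for illustration; Ultralytics computes an interpolated version of the same integral internally:
```python
import numpy as np

# Invented precision/recall pairs from a hypothetical detector as the
# confidence threshold is lowered (recall grows, precision drops).
recall = np.array([0.00, 0.25, 0.50, 0.75, 0.90, 0.99])
precision = np.array([1.00, 1.00, 0.99, 0.97, 0.95, 0.90])

# AP ~ area under the precision-recall curve (trapezoidal rule).
ap = float(np.sum((recall[1:] - recall[:-1]) * (precision[1:] + precision[:-1]) / 2))
print(f"AP ~ {ap:.3f}")
```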
### Custom Visualizations
```python
# Load evaluation results
import json

with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']

print(f"mAP50={mAP50:.3f}  precision={precision:.3f}  recall={recall:.3f}")
```
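Building on the snippet above, several evaluation runs can be compared at once by globbing their `predictions.json` files. This sketch assumes the directory layout and JSON keys shown earlier:
```python
import json
from pathlib import Path

# Collect metrics from every evaluation run under runs/val/.
rows = []
for path in sorted(Path("runs/val").glob("*/predictions.json")):
    metrics = json.load(path.open())["metrics"]
    rows.append((path.parent.name, metrics["metrics/mAP50"], metrics["metrics/mAP50-95"]))

# Print a simple comparison table.
print(f"{'run':<25} {'mAP50':>8} {'mAP50-95':>10}")
for name, map50, map5095 in rows:
    print(f"{name:<25} {map50:>8.3f} {map5095:>10.3f}")
```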
## 🔧 Troubleshooting
### Common Evaluation Issues
**1. Model Not Found**
```bash
# Check available models
ls runs/train/*/weights/
# Specify model path explicitly
python eval.py --model path/to/model.pt
```
**2. Test Data Not Found**
```bash
# Validate data structure
python train.py --validate-only
# Check data.yaml paths
cat data/data.yaml
```
**3. Memory Issues**
```bash
# Reduce batch size
python eval.py --batch-size 8
# Use smaller model
python eval.py --model-size n
```
### Debug Commands
```bash
# Check model file
python -c "import torch; model = torch.load('model.pt'); print(model.keys())"
# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"
# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"
```
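As a complement to the raw checks above, the checkpoint can also be inspected through the Ultralytics API, which reports the class names and a layer summary. A minimal sketch (the path is an example):
```python
from ultralytics import YOLO

# Load the checkpoint the same way the evaluation would (example path).
model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

print(model.names)   # class index -> name mapping
print(model.device)  # device the model is currently on
model.info()         # layer / parameter summary
```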
## 📋 Evaluation Checklist
- [ ] Model trained successfully
- [ ] Test dataset available
- [ ] GPU memory sufficient
- [ ] Correct model path
- [ ] Appropriate thresholds set
- [ ] Results directory writable
## 🎯 Best Practices
### 1. Threshold Selection
```bash
# Start with default thresholds
python eval.py
# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5 # Balanced
python eval.py --conf 0.7 --iou 0.7 # High precision
python eval.py --conf 0.3 --iou 0.3 # High recall
```
### 2. Model Comparison
```bash
# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m
# Compare results
diff runs/val/test_evaluation_n/predictions.json \
runs/val/test_evaluation_s/predictions.json
```
### 3. Performance Monitoring
```bash
# Regular evaluation
python eval.py --model-size n
# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt
```
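The `grep` approach above is fragile because the metrics live in JSON; parsing the file and appending a structured log row is more robust. A minimal sketch assuming the `predictions.json` layout shown earlier:
```python
import csv
import json
from datetime import date

# Read the metrics written by the latest evaluation run.
with open("runs/val/test_evaluation/predictions.json") as f:
    metrics = json.load(f)["metrics"]

# Append one row per evaluation to a CSV log.
with open("eval_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([
        date.today().isoformat(),
        metrics["metrics/mAP50"],
        metrics["metrics/mAP50-95"],
        metrics["metrics/precision"],
        metrics["metrics/recall"],
    ])
```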
## 📈 Continuous Evaluation
### Automated Evaluation
```bash
#!/bin/bash
# eval_script.sh
MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}

echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size "$MODEL_SIZE" --conf "$THRESHOLD"

# Save results (create the results directory if it does not exist yet)
mkdir -p results
cp runs/val/test_evaluation/predictions.json \
   "results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json"
```
### Integration with CI/CD
```yaml
# .github/workflows/evaluate.yml
name: Model Evaluation
on: [push, pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n
```
---
**Note**: Regular evaluation helps ensure model performance remains consistent over time.