# Evaluation Guide

## Overview

This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.

## 🎯 Evaluation Process

### 1. Basic Evaluation

Evaluate the best trained model:

```bash
python eval.py
```

This will:
- Automatically find the best model from `runs/train/`
- Load the test dataset
- Run evaluation on the test set
- Save results to `runs/val/test_evaluation/`

### 2. Custom Evaluation

#### Evaluate a Specific Model

```bash
python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt
```

#### Custom Thresholds

```bash
python eval.py --conf 0.3 --iou 0.5
```

#### Different Model Size

```bash
python eval.py --model-size m
```

## 📊 Evaluation Metrics

### Key Metrics Explained

1. **mAP50 (Mean Average Precision at IoU=0.5)**
   - Measures precision across different recall levels
   - IoU threshold of 0.5 (50% overlap)
   - Range: 0-1 (higher is better)

2. **mAP50-95 (Mean Average Precision across IoU thresholds)**
   - Average of mAP at IoU thresholds from 0.5 to 0.95
   - More comprehensive than mAP50
   - Range: 0-1 (higher is better)

3. **Precision**
   - Ratio of correct detections to total detections
   - Measures accuracy of positive predictions
   - Range: 0-1 (higher is better)

4. **Recall**
   - Ratio of correct detections to total ground-truth objects
   - Measures ability to find all objects
   - Range: 0-1 (higher is better)

### Expected Performance

For French ID Card detection:

| Metric    | Target | Good | Excellent |
|-----------|--------|------|-----------|
| mAP50     | >0.8   | >0.9 | >0.95     |
| mAP50-95  | >0.6   | >0.8 | >0.9      |
| Precision | >0.8   | >0.9 | >0.95     |
| Recall    | >0.8   | >0.9 | >0.95     |

## 📈 Understanding Results

### Sample Output

```
Class   Images  Instances   Box(P       R    mAP50  mAP50-95): 100%|██████████| 14/14
  all      212        209       1    0.99    0.995     0.992
```

**Interpretation:**
- **Images**: 212 test images
- **Instances**: 209 ground-truth objects
- **Box(P)**: Precision = 1.0 (100% accurate detections)
- **R**: Recall = 0.99 (99% of objects found)
- **mAP50**: 0.995 (excellent performance)
- **mAP50-95**: 0.992 (excellent across IoU thresholds)

### Confidence vs IoU Thresholds

#### Confidence Threshold Impact

```bash
# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7

# Low confidence (more detections, lower precision)
python eval.py --conf 0.1
```

#### IoU Threshold Impact

```bash
# Strict IoU (higher precision requirements)
python eval.py --iou 0.7

# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
```

## 📁 Evaluation Outputs

### Results Directory Structure

```
runs/val/test_evaluation/
├── predictions.json       # Detailed predictions
├── results.png            # Performance plots
├── confusion_matrix.png   # Confusion matrix
├── BoxR_curve.png         # Precision-Recall curve
├── labels/                # Predicted labels
└── images/                # Visualization images
```

### Key Output Files

1. **predictions.json**

   ```json
   {
     "metrics": {
       "metrics/mAP50": 0.995,
       "metrics/mAP50-95": 0.992,
       "metrics/precision": 1.0,
       "metrics/recall": 0.99
     }
   }
   ```

2. **results.png**
   - Training curves
   - Loss plots
   - Metric evolution

3. **confusion_matrix.png**
   - True vs predicted classifications
   - Error analysis
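As a quick sanity check, the metrics saved in `predictions.json` can be compared against the targets from the Expected Performance table. The following is a minimal sketch assuming the JSON layout shown above; exact key names can vary between Ultralytics versions, so adjust them to match your file.

```python
import json

# Minimal sketch: compare saved metrics against the "Expected Performance" targets.
# Assumes the predictions.json layout shown in this guide; adjust the keys if your
# evaluation writes a different structure.
TARGETS = {
    "metrics/mAP50": 0.8,
    "metrics/mAP50-95": 0.6,
    "metrics/precision": 0.8,
    "metrics/recall": 0.8,
}

with open("runs/val/test_evaluation/predictions.json") as f:
    metrics = json.load(f)["metrics"]

for key, target in TARGETS.items():
    value = metrics.get(key)
    status = "OK" if value is not None and value > target else "BELOW TARGET"
    print(f"{key}: {value} (target > {target}) -> {status}")
```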
## 🔍 Advanced Evaluation

### Batch Evaluation

Evaluate multiple models:

```bash
# Evaluate different model sizes
for size in n s m l; do
    python eval.py --model-size $size
done
```

### Cross-Validation

```bash
# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml
```

### Performance Analysis

#### Speed vs Accuracy Trade-off

| Model Size | Inference Time | mAP50 | Use Case      |
|------------|----------------|-------|---------------|
| n (nano)   | ~2ms           | 0.995 | Real-time     |
| s (small)  | ~4ms           | 0.998 | Balanced      |
| m (medium) | ~8ms           | 0.999 | High accuracy |
| l (large)  | ~12ms          | 0.999 | Best accuracy |

## 📊 Visualization

### Generated Plots

1. **Precision-Recall Curve**
   - Shows precision vs recall at different thresholds
   - Area under the curve = mAP

2. **Confusion Matrix**
   - True positives, false positives, false negatives
   - Helps identify error patterns

3. **Training Curves**
   - Loss evolution during training
   - Metric progression

### Custom Visualizations

```python
import json

# Load evaluation results
with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']
```

## 🔧 Troubleshooting

### Common Evaluation Issues

**1. Model Not Found**

```bash
# Check available models
ls runs/train/*/weights/

# Specify the model path explicitly
python eval.py --model path/to/model.pt
```

**2. Test Data Not Found**

```bash
# Validate the data structure
python train.py --validate-only

# Check data.yaml paths
cat data/data.yaml
```

**3. Memory Issues**

```bash
# Reduce the batch size
python eval.py --batch-size 8

# Use a smaller model
python eval.py --model-size n
```

### Debug Commands

```bash
# Check the model file
python -c "import torch; model = torch.load('model.pt'); print(model.keys())"

# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"

# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"
```

## 📋 Evaluation Checklist

- [ ] Model trained successfully
- [ ] Test dataset available
- [ ] GPU memory sufficient
- [ ] Correct model path
- [ ] Appropriate thresholds set
- [ ] Results directory writable

## 🎯 Best Practices

### 1. Threshold Selection

```bash
# Start with default thresholds
python eval.py

# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5   # Balanced
python eval.py --conf 0.7 --iou 0.7   # High precision
python eval.py --conf 0.3 --iou 0.3   # High recall
```

### 2. Model Comparison

```bash
# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m

# Compare results
diff runs/val/test_evaluation_n/predictions.json \
     runs/val/test_evaluation_s/predictions.json
```
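Because `diff` only highlights textual differences between two files, a short script is often more convenient for viewing several runs side by side. The sketch below assumes one results directory per model size, following the `runs/val/test_evaluation_n/` and `_s/` paths used in the diff example (the `_m/` path is an assumed extension of the same pattern), and the `predictions.json` layout shown earlier; adjust paths and keys to match your setup.

```python
import json
from pathlib import Path

# Minimal sketch: tabulate key metrics from several evaluation runs side by side.
# Assumes per-size result directories (runs/val/test_evaluation_<size>/) and the
# predictions.json layout shown earlier in this guide.
SIZES = ["n", "s", "m"]
KEYS = ["metrics/mAP50", "metrics/mAP50-95", "metrics/precision", "metrics/recall"]

print(f"{'size':<6}" + "".join(f"{key.split('/')[-1]:>12}" for key in KEYS))
for size in SIZES:
    path = Path(f"runs/val/test_evaluation_{size}/predictions.json")
    if not path.exists():
        print(f"{size:<6}(no results found)")
        continue
    metrics = json.loads(path.read_text())["metrics"]
    print(f"{size:<6}" + "".join(f"{metrics.get(key, float('nan')):>12.4f}" for key in KEYS))
```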
### 3. Performance Monitoring

```bash
# Regular evaluation
python eval.py --model-size n

# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt
```

## 📈 Continuous Evaluation

### Automated Evaluation

```bash
#!/bin/bash
# eval_script.sh

MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}

echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size $MODEL_SIZE --conf $THRESHOLD

# Save results
cp runs/val/test_evaluation/predictions.json \
   results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json
```

### Integration with CI/CD

```yaml
# .github/workflows/evaluate.yml
name: Model Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n
```

---

**Note**: Regular evaluation helps ensure model performance remains consistent over time.