# Evaluation Guide

## Overview

This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.

## 🎯 Evaluation Process

### 1. Basic Evaluation

Evaluate the best trained model:

```bash
python eval.py
```

This will:
- Automatically find the best model from `runs/train/`
- Load the test dataset
- Run evaluation on the test set
- Save results to `runs/val/test_evaluation/`

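For reference, `eval.py` presumably wraps Ultralytics' built-in validation. If you want to run the same step directly from Python, the sketch below uses the Ultralytics API; the checkpoint path, data file, and thresholds are illustrative assumptions rather than the script's actual defaults.

```python
# Minimal sketch of a test-set evaluation with the Ultralytics API.
# Paths and thresholds are assumptions; adapt them to your project layout.
from ultralytics import YOLO

# Load the best checkpoint from a finished training run (example path).
model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

# Evaluate on the test split declared in data.yaml and save outputs.
metrics = model.val(
    data="data/data.yaml",
    split="test",
    conf=0.25,               # confidence threshold
    iou=0.7,                 # IoU threshold used by NMS during validation
    project="runs/val",
    name="test_evaluation",
)

print(f"mAP50:    {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")
```
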
### 2. Custom Evaluation

#### Evaluate Specific Model
```bash
python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt
```

#### Custom Thresholds
```bash
python eval.py --conf 0.3 --iou 0.5
```

#### Different Model Size
```bash
python eval.py --model-size m
```

## 📊 Evaluation Metrics

### Key Metrics Explained

1. **mAP50 (Mean Average Precision at IoU=0.5)**
   - Measures precision across different recall levels
   - IoU threshold of 0.5 (50% overlap)
   - Range: 0-1 (higher is better)

2. **mAP50-95 (Mean Average Precision across IoU thresholds)**
   - Average of mAP at IoU thresholds from 0.5 to 0.95
   - More comprehensive than mAP50
   - Range: 0-1 (higher is better)

3. **Precision**
   - Ratio of correct detections to total detections
   - Measures accuracy of positive predictions
   - Range: 0-1 (higher is better)

4. **Recall**
   - Ratio of correct detections to total ground truth objects
   - Measures ability to find all objects
   - Range: 0-1 (higher is better)

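To make the precision and recall definitions above concrete, here is a toy calculation; the counts are made up for illustration only.

```python
# Toy precision/recall arithmetic (illustrative counts, not real results).
# Suppose the test set contains 209 ground-truth ID cards and the model produces
# 210 detections, 207 of which overlap a ground-truth box with IoU >= 0.5.
true_positives = 207   # detections that match a ground-truth box
false_positives = 3    # detections that match nothing
false_negatives = 2    # ground-truth boxes that were missed (209 - 207)

precision = true_positives / (true_positives + false_positives)  # accuracy of the detections
recall = true_positives / (true_positives + false_negatives)     # coverage of the ground truth
print(f"precision={precision:.3f}  recall={recall:.3f}")          # ≈0.986 and ≈0.990
```

mAP50 goes one step further: it sweeps the confidence threshold, traces the resulting precision-recall curve, and averages the area under that curve over all classes; mAP50-95 repeats this at IoU thresholds from 0.5 to 0.95.
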
### Expected Performance

For French ID Card detection:

| Metric    | Target | Good | Excellent |
|-----------|--------|------|-----------|
| mAP50     | >0.8   | >0.9 | >0.95     |
| mAP50-95  | >0.6   | >0.8 | >0.9      |
| Precision | >0.8   | >0.9 | >0.95     |
| Recall    | >0.8   | >0.9 | >0.95     |

## 📈 Understanding Results

### Sample Output

```
    Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 14/14
      all        212        209          1       0.99      0.995      0.992
```

**Interpretation:**
- **Images**: 212 test images
- **Instances**: 209 ground truth objects
- **Box(P)**: Precision = 1.0 (100% accurate detections)
- **R**: Recall = 0.99 (99% of objects found)
- **mAP50**: 0.995 (excellent performance)
- **mAP50-95**: 0.992 (excellent across IoU thresholds)

### Confidence vs IoU Thresholds

#### Confidence Threshold Impact
```bash
# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7

# Low confidence (more detections, lower precision)
python eval.py --conf 0.1
```

#### IoU Threshold Impact
```bash
# Strict IoU (higher precision requirements)
python eval.py --iou 0.7

# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
```

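To quantify these effects on your own model, you can re-run validation over a range of confidence thresholds and watch precision and recall move in opposite directions. The sketch below calls the Ultralytics API directly; the checkpoint and data paths are assumptions.

```python
# Sweep the confidence threshold and print the resulting precision/recall trade-off.
# Checkpoint and data paths are examples; adjust them to your setup.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")

for conf in (0.1, 0.3, 0.5, 0.7):
    metrics = model.val(data="data/data.yaml", split="test", conf=conf, verbose=False)
    print(f"conf={conf:.1f}  precision={metrics.box.mp:.3f}  "
          f"recall={metrics.box.mr:.3f}  mAP50={metrics.box.map50:.3f}")
```
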
## 📁 Evaluation Outputs

### Results Directory Structure

```
runs/val/test_evaluation/
├── predictions.json       # Detailed predictions
├── results.png            # Performance plots
├── confusion_matrix.png   # Confusion matrix
├── BoxR_curve.png         # Precision-Recall curve
├── labels/                # Predicted labels
└── images/                # Visualization images
```

### Key Output Files

1. **predictions.json**
   ```json
   {
     "metrics": {
       "metrics/mAP50": 0.995,
       "metrics/mAP50-95": 0.992,
       "metrics/precision": 1.0,
       "metrics/recall": 0.99
     }
   }
   ```

2. **results.png**
   - Training curves
   - Loss plots
   - Metric evolution

3. **confusion_matrix.png**
   - True vs predicted classifications
   - Error analysis

## 🔍 Advanced Evaluation

### Batch Evaluation

Evaluate multiple models:

```bash
# Evaluate different model sizes
for size in n s m l; do
    python eval.py --model-size $size
done
```

### Cross-Validation

```bash
# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml
```

### Performance Analysis

#### Speed vs Accuracy Trade-off

| Model Size | Inference Time | mAP50 | Use Case      |
|------------|----------------|-------|---------------|
| n (nano)   | ~2ms           | 0.995 | Real-time     |
| s (small)  | ~4ms           | 0.998 | Balanced      |
| m (medium) | ~8ms           | 0.999 | High accuracy |
| l (large)  | ~12ms          | 0.999 | Best accuracy |

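The timings above are indicative and vary with hardware. The validator also reports per-image speed alongside the accuracy metrics, so you can measure the trade-off yourself; a sketch (checkpoint paths are assumptions):

```python
# Compare accuracy and per-image inference time for two checkpoints on the same test split.
# Checkpoint paths are examples; point them at your own trained weights.
from ultralytics import YOLO

checkpoints = {
    "nano":  "runs/train/yolov8_n_french_id_card/weights/best.pt",
    "small": "runs/train/yolov8_s_french_id_card/weights/best.pt",
}

for name, weights in checkpoints.items():
    metrics = YOLO(weights).val(data="data/data.yaml", split="test", verbose=False)
    print(f"{name}: mAP50={metrics.box.map50:.3f}  "
          f"inference={metrics.speed['inference']:.1f} ms/image")
```
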
## 📊 Visualization

### Generated Plots

1. **Precision-Recall Curve**
   - Shows precision vs recall at different thresholds
   - Area under curve = mAP

2. **Confusion Matrix**
   - True positives, false positives, false negatives
   - Helps identify error patterns

3. **Training Curves**
   - Loss evolution during training
   - Metric progression

### Custom Visualizations

```python
# Load evaluation results
import json

with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']
```

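From there you can build any custom plot you need, for example a quick bar chart of the headline metrics (this assumes matplotlib is installed and that `predictions.json` has the layout shown earlier):

```python
# Bar chart of the headline test metrics stored in predictions.json.
# Assumes matplotlib is installed and the JSON layout shown in "Key Output Files".
import json
import matplotlib.pyplot as plt

with open("runs/val/test_evaluation/predictions.json") as f:
    metrics = json.load(f)["metrics"]

names = ["mAP50", "precision", "recall"]
values = [metrics[f"metrics/{name}"] for name in names]

plt.bar(names, values)
plt.ylim(0, 1)
plt.ylabel("score")
plt.title("Test-set metrics")
plt.savefig("runs/val/test_evaluation/metric_summary.png")
```
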
## 🔧 Troubleshooting

### Common Evaluation Issues

**1. Model Not Found**
```bash
# Check available models
ls runs/train/*/weights/

# Specify model path explicitly
python eval.py --model path/to/model.pt
```

**2. Test Data Not Found**
```bash
# Validate data structure
python train.py --validate-only

# Check data.yaml paths
cat data/data.yaml
```

**3. Memory Issues**
```bash
# Reduce batch size
python eval.py --batch-size 8

# Use smaller model
python eval.py --model-size n
```

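If the CLI flags are not enough, the same memory-saving options exist at the API level. Whether `eval.py` exposes them is project-specific; the sketch below passes them straight to the Ultralytics `val()` call (paths are assumptions):

```python
# Lower-memory evaluation: smaller batch, half precision, explicit device.
# Whether eval.py exposes these options is an assumption; the Ultralytics API accepts them directly.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")
metrics = model.val(
    data="data/data.yaml",
    split="test",
    batch=8,      # smaller batches reduce peak GPU memory
    half=True,    # FP16 inference where the hardware supports it
    device=0,     # or "cpu" if the GPU is too small
)
print(f"mAP50: {metrics.box.map50:.3f}")
```
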
### Debug Commands

```bash
# Check model file
python -c "import torch; model = torch.load('model.pt'); print(model.keys())"

# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"

# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"
```

## 📋 Evaluation Checklist

- [ ] Model trained successfully
- [ ] Test dataset available
- [ ] GPU memory sufficient
- [ ] Correct model path
- [ ] Appropriate thresholds set
- [ ] Results directory writable

## 🎯 Best Practices

### 1. Threshold Selection

```bash
# Start with default thresholds
python eval.py

# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5  # Balanced
python eval.py --conf 0.7 --iou 0.7  # High precision
python eval.py --conf 0.3 --iou 0.3  # High recall
```

### 2. Model Comparison

```bash
# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m

# Compare results
diff runs/val/test_evaluation_n/predictions.json \
     runs/val/test_evaluation_s/predictions.json
```

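A raw `diff` of two JSON files is hard to read. A small helper like the one below (hypothetical, assuming per-size output directories and the `predictions.json` layout shown earlier) prints the metrics side by side instead:

```python
# Print the key metrics of several evaluation runs side by side.
# Directory names are assumptions based on the per-size runs above.
import json

runs = {
    "n": "runs/val/test_evaluation_n/predictions.json",
    "s": "runs/val/test_evaluation_s/predictions.json",
    "m": "runs/val/test_evaluation_m/predictions.json",
}

print(f"{'size':>4} {'mAP50':>7} {'mAP50-95':>9} {'precision':>10} {'recall':>7}")
for size, path in runs.items():
    m = json.load(open(path))["metrics"]
    print(f"{size:>4} {m['metrics/mAP50']:7.3f} {m['metrics/mAP50-95']:9.3f} "
          f"{m['metrics/precision']:10.3f} {m['metrics/recall']:7.3f}")
```
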
### 3. Performance Monitoring

```bash
# Regular evaluation
python eval.py --model-size n

# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt
```

## 📈 Continuous Evaluation

### Automated Evaluation

```bash
#!/bin/bash
# eval_script.sh

MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}

echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size $MODEL_SIZE --conf $THRESHOLD

# Save results
cp runs/val/test_evaluation/predictions.json \
   results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json
```

### Integration with CI/CD

```yaml
# .github/workflows/evaluate.yml
name: Model Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n
```

---

**Note**: Regular evaluation helps ensure model performance remains consistent over time.