
Evaluation Guide

Overview

This guide covers model evaluation procedures for YOLOv8 French ID Card Detection models.

🎯 Evaluation Process

1. Basic Evaluation

Evaluate the best trained model:

python eval.py

This will:

  • Automatically find the best model from runs/train/ (a minimal version of this lookup is sketched after this list)
  • Load the test dataset
  • Run evaluation on test set
  • Save results to runs/val/test_evaluation/
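
The sketch below shows how such a lookup could work; it is illustrative only and not necessarily how eval.py implements the search:

# Illustrative sketch: pick the most recently modified best.pt under runs/train/.
# Not guaranteed to match eval.py's own logic.
from pathlib import Path

def find_best_model(train_dir: str = "runs/train") -> Path:
    candidates = sorted(
        Path(train_dir).glob("*/weights/best.pt"),
        key=lambda p: p.stat().st_mtime,  # newest training run wins
        reverse=True,
    )
    if not candidates:
        raise FileNotFoundError(f"No best.pt found under {train_dir}")
    return candidates[0]

print(find_best_model())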

2. Custom Evaluation

Evaluate Specific Model

python eval.py --model runs/train/yolov8_n_french_id_card/weights/best.pt

Custom Thresholds

python eval.py --conf 0.3 --iou 0.5

Different Model Size

python eval.py --model-size m
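
If you prefer to run the same evaluation from Python instead of the eval.py wrapper, the Ultralytics API can be called directly. The checkpoint path, dataset config, split, and thresholds below are assumptions about this project's layout; this is a sketch, not the project's own entry point:

# Sketch: evaluate a trained checkpoint directly with the Ultralytics API.
# Paths, split name, and thresholds are assumptions; adapt them to your setup.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")
metrics = model.val(
    data="data/data.yaml",  # assumed dataset config
    split="test",           # evaluate on the test split
    conf=0.25,              # confidence threshold
    iou=0.7,                # NMS IoU threshold
)
print(metrics.box.map50, metrics.box.map)  # mAP50 and mAP50-95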

📊 Evaluation Metrics

Key Metrics Explained

  1. mAP50 (Mean Average Precision at IoU=0.5)

    • Measures precision across different recall levels
    • IoU threshold of 0.5 (50% overlap)
    • Range: 0-1 (higher is better)
  2. mAP50-95 (Mean Average Precision across IoU thresholds)

    • Average of mAP at IoU thresholds from 0.5 to 0.95
    • More comprehensive than mAP50
    • Range: 0-1 (higher is better)
  3. Precision

    • Ratio of correct detections to total detections
    • Measures accuracy of positive predictions
    • Range: 0-1 (higher is better)
  4. Recall

    • Ratio of correct detections to total ground truth objects
    • Measures the ability to find all ground-truth objects (how precision and recall follow from detection counts is sketched after this list)
    • Range: 0-1 (higher is better)
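
As a concrete reminder of how precision and recall relate to detection counts, here is a minimal sketch; the counts are hypothetical and only illustrate the definitions above:

# Minimal sketch: precision and recall from matched-detection counts.
# The counts are hypothetical and only illustrate the definitions above.
true_positives = 207   # detections matched to a ground-truth box (IoU above threshold)
false_positives = 3    # detections with no matching ground-truth box
false_negatives = 2    # ground-truth boxes that were never detected

precision = true_positives / (true_positives + false_positives)  # 0.986
recall = true_positives / (true_positives + false_negatives)     # 0.990
print(f"precision={precision:.3f}, recall={recall:.3f}")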

Expected Performance

For French ID Card detection (a simple pass/fail check against these targets is sketched after the table):

Metric       Target   Good    Excellent
mAP50        >0.8     >0.9    >0.95
mAP50-95     >0.6     >0.8    >0.9
Precision    >0.8     >0.9    >0.95
Recall       >0.8     >0.9    >0.95
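
A small script can turn these targets into a pass/fail check; it assumes predictions.json has the structure shown under "Key Output Files" below:

# Sketch: compare saved evaluation metrics against the minimum targets above.
# Assumes the predictions.json layout shown later in this guide.
import json

TARGETS = {
    "metrics/mAP50": 0.8,
    "metrics/mAP50-95": 0.6,
    "metrics/precision": 0.8,
    "metrics/recall": 0.8,
}

with open("runs/val/test_evaluation/predictions.json") as f:
    metrics = json.load(f)["metrics"]

for name, target in TARGETS.items():
    status = "OK  " if metrics[name] >= target else "FAIL"
    print(f"{status} {name}: {metrics[name]:.3f} (target > {target})")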

📈 Understanding Results

Sample Output

Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 14/14
  all        212        209          1       0.99      0.995      0.992

Interpretation:

  • Images: 212 test images
  • Instances: 209 ground truth objects
  • Box(P): Precision = 1.0 (100% accurate detections)
  • R: Recall = 0.99 (99% of objects found)
  • mAP50: 0.995 (excellent performance)
  • mAP50-95: 0.992 (excellent across IoU thresholds)

Confidence vs IoU Thresholds

Confidence Threshold Impact

# High confidence (fewer detections, higher precision)
python eval.py --conf 0.7

# Low confidence (more detections, lower precision)
python eval.py --conf 0.1

IoU Threshold Impact

# Strict IoU (higher precision requirements)
python eval.py --iou 0.7

# Lenient IoU (easier to match detections)
python eval.py --iou 0.3
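
To see this trade-off numerically, you can sweep the confidence threshold in a loop; the sketch below uses the Ultralytics API directly, and the checkpoint and data paths are assumptions:

# Sketch: sweep the confidence threshold and watch precision rise as recall falls.
# Checkpoint, dataset config, and split are assumptions; adapt them to your setup.
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")
for conf in (0.1, 0.3, 0.5, 0.7):
    metrics = model.val(data="data/data.yaml", split="test", conf=conf, verbose=False)
    print(f"conf={conf:.1f}  precision={metrics.box.mp:.3f}  recall={metrics.box.mr:.3f}")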

📁 Evaluation Outputs

Results Directory Structure

runs/val/test_evaluation/
├── predictions.json      # Detailed predictions
├── results.png          # Performance plots
├── confusion_matrix.png  # Confusion matrix
├── BoxR_curve.png      # Precision-Recall curve
├── labels/             # Predicted labels
└── images/             # Visualization images

Key Output Files

  1. predictions.json

    {
      "metrics": {
        "metrics/mAP50": 0.995,
        "metrics/mAP50-95": 0.992,
        "metrics/precision": 1.0,
        "metrics/recall": 0.99
      }
    }
    
  2. results.png

    • Training curves
    • Loss plots
    • Metric evolution
  3. confusion_matrix.png

    • True vs predicted classifications
    • Error analysis

🔍 Advanced Evaluation

Batch Evaluation

Evaluate multiple models:

# Evaluate different model sizes
for size in n s m l; do
    python eval.py --model-size $size
done

Cross-Validation

# Evaluate with different data splits
python eval.py --data data/data_val1.yaml
python eval.py --data data/data_val2.yaml

Performance Analysis

Speed vs Accuracy Trade-off

Model Size   Inference Time   mAP50   Use Case
n (nano)     ~2ms             0.995   Real-time
s (small)    ~4ms             0.998   Balanced
m (medium)   ~8ms             0.999   High accuracy
l (large)    ~12ms            0.999   Best accuracy
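
The inference times above depend heavily on hardware, so measure them on your own machine; a rough latency sketch (the image path and run count are placeholders):

# Sketch: rough single-image latency for a trained checkpoint.
# Image path and iteration count are placeholders; the warm-up run avoids counting model setup.
import time
from ultralytics import YOLO

model = YOLO("runs/train/yolov8_n_french_id_card/weights/best.pt")
image = "data/test/images/sample.jpg"  # hypothetical test image

model.predict(image, verbose=False)  # warm-up
start = time.perf_counter()
runs = 50
for _ in range(runs):
    model.predict(image, verbose=False)
elapsed_ms = (time.perf_counter() - start) * 1000 / runs
print(f"average inference time: {elapsed_ms:.1f} ms")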

📊 Visualization

Generated Plots

  1. Precision-Recall Curve

    • Shows precision vs recall at different thresholds
    • Area under curve = mAP
  2. Confusion Matrix

    • True positives, false positives, false negatives
    • Helps identify error patterns
  3. Training Curves

    • Loss evolution during training
    • Metric progression

Custom Visualizations

# Load evaluation results
import json
with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

# Analyze specific metrics
mAP50 = results['metrics']['metrics/mAP50']
precision = results['metrics']['metrics/precision']
recall = results['metrics']['metrics/recall']
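
Building on that, a quick bar chart makes the headline metrics easy to compare at a glance; matplotlib is assumed to be installed:

# Sketch: plot the headline metrics as a bar chart (requires matplotlib).
import json
import matplotlib.pyplot as plt

with open('runs/val/test_evaluation/predictions.json', 'r') as f:
    results = json.load(f)

names = ["mAP50", "mAP50-95", "precision", "recall"]
values = [results["metrics"][f"metrics/{name}"] for name in names]

plt.bar(names, values)
plt.ylim(0, 1)
plt.ylabel("score")
plt.title("Test evaluation metrics")
plt.savefig("runs/val/test_evaluation/metrics_bar.png")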

🔧 Troubleshooting

Common Evaluation Issues

1. Model Not Found

# Check available models
ls runs/train/*/weights/

# Specify model path explicitly
python eval.py --model path/to/model.pt

2. Test Data Not Found

# Validate data structure
python train.py --validate-only

# Check data.yaml paths
cat data/data.yaml

3. Memory Issues

# Reduce batch size
python eval.py --batch-size 8

# Use smaller model
python eval.py --model-size n

Debug Commands

# Check model file
python -c "import torch; model = torch.load('model.pt'); print(model.keys())"

# Validate data paths
python -c "import yaml; data = yaml.safe_load(open('data/data.yaml')); print(data)"

# Test GPU availability
python -c "import torch; print(torch.cuda.is_available())"

📋 Evaluation Checklist

  • Model trained successfully
  • Test dataset available
  • GPU memory sufficient
  • Correct model path
  • Appropriate thresholds set
  • Results directory writable

🎯 Best Practices

1. Threshold Selection

# Start with default thresholds
python eval.py

# Adjust based on use case
python eval.py --conf 0.5 --iou 0.5  # Balanced
python eval.py --conf 0.7 --iou 0.7  # High precision
python eval.py --conf 0.3 --iou 0.3  # High recall

2. Model Comparison

# Compare different models
python eval.py --model-size n
python eval.py --model-size s
python eval.py --model-size m

# Compare results
diff runs/val/test_evaluation_n/predictions.json \
     runs/val/test_evaluation_s/predictions.json
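
A diff of raw JSON is hard to read; a short script can print the two result files side by side instead. The per-size output directory names below are assumptions and should match however your runs are actually named:

# Sketch: side-by-side comparison of two evaluation result files.
# The directory names are assumptions; point them at your actual runs.
import json

def load_metrics(path):
    with open(path) as f:
        return json.load(f)["metrics"]

nano = load_metrics("runs/val/test_evaluation_n/predictions.json")
small = load_metrics("runs/val/test_evaluation_s/predictions.json")

for key in sorted(nano):
    print(f"{key:25s} n={nano[key]:.3f}  s={small[key]:.3f}  diff={small[key] - nano[key]:+.3f}")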

3. Performance Monitoring

# Regular evaluation
python eval.py --model-size n

# Log results
echo "$(date): mAP50=$(grep 'mAP50' runs/val/test_evaluation/predictions.json)" >> eval_log.txt

📈 Continuous Evaluation

Automated Evaluation

#!/bin/bash
# eval_script.sh

MODEL_SIZE=${1:-n}
THRESHOLD=${2:-0.25}

echo "Evaluating model size: $MODEL_SIZE"
python eval.py --model-size $MODEL_SIZE --conf $THRESHOLD

# Save results
cp runs/val/test_evaluation/predictions.json \
   results/eval_${MODEL_SIZE}_$(date +%Y%m%d).json

Integration with CI/CD

# .github/workflows/evaluate.yml
name: Model Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Evaluate Model
        run: |
          pip install -r requirements.txt
          python eval.py --model-size n

Note: Regular evaluation helps ensure model performance remains consistent over time.