
Document Field Extraction Evaluation Results

Overview

This document presents the evaluation results for document field extraction using different preprocessing approaches. The evaluation was conducted on a dataset of 56 document samples with various field types commonly found in identity documents.

Evaluation Metrics

The evaluation uses standard information extraction metrics; a short computational sketch follows the list:

  • Precision: Ratio of correctly extracted fields to total extracted fields
  • Recall: Ratio of correctly extracted fields to total ground truth fields
  • F1-Score: Harmonic mean of precision and recall
  • Accuracy: Overall field-level accuracy (fraction of ground-truth fields extracted correctly)
  • TP: True Positives (correctly extracted fields)
  • FP: False Positives (incorrectly extracted fields)
  • FN: False Negatives (missed fields)
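
A minimal sketch of how these counts combine into the micro-averaged scores reported below. The example counts are illustrative, chosen to roughly reproduce the baseline row, and are not actual evaluation output:

```python
def micro_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute micro precision, recall, and F1 from pooled field counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts that roughly reproduce the baseline row
# (79.0% precision, 68.7% recall, 73.5% F1):
print(micro_metrics(tp=618, fp=164, fn=282))
```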

Preprocessing Approaches

1. No Preprocessing (Baseline)

  • Configuration: Raw images without any preprocessing
  • Performance:
    • Micro Precision: 79.0%
    • Micro Recall: 68.7%
    • Micro F1: 73.5%
    • Field Accuracy: 68.7%

2. Crop

  • Configuration: Content-aware cropping (no shadow removal)
  • Performance:
    • Micro Precision: 94.8%
    • Micro Recall: 89.9%
    • Micro F1: 92.3% (+18.8 pts over baseline)
    • Field Accuracy: 89.9%

3. Crop + PaddleOCR + Shadow Removal

  • Configuration: Cropping with PaddleOCR document processing and shadow removal
  • Performance:
    • Micro Precision: 93.6%
    • Micro Recall: 89.4%
    • Micro F1: 91.5% (+18.0 pts over baseline)
    • Field Accuracy: 89.4%

4. Crop + PaddleOCR + Shadow Removal + Cache

  • Configuration: Cropping with PaddleOCR, shadow removal, and caching
  • Performance:
    • Micro Precision: 92.5%
    • Micro Recall: 88.3%
    • Micro F1: 90.3% (+16.8 pts over baseline)
    • Field Accuracy: 88.3%

5. Crop + Shadow Removal + Cache

  • Configuration: Cropping with shadow removal and caching
  • Performance:
    • Micro Precision: 93.6%
    • Micro Recall: 88.5%
    • Micro F1: 91.0% (+17.5 pts over baseline)
    • Field Accuracy: 88.5%
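
These five variants differ only in which pipeline stages are enabled. A hedged sketch of how such configurations could be expressed; the flag names are hypothetical, not taken from the actual evaluation harness:

```python
# Hypothetical stage flags for the five evaluated variants.
VARIANTS = {
    "baseline":                 dict(crop=False, paddleocr=False, shadow_removal=False, cache=False),
    "crop":                     dict(crop=True,  paddleocr=False, shadow_removal=False, cache=False),
    "crop_paddle_shadow":       dict(crop=True,  paddleocr=True,  shadow_removal=True,  cache=False),
    "crop_paddle_shadow_cache": dict(crop=True,  paddleocr=True,  shadow_removal=True,  cache=True),
    "crop_shadow_cache":        dict(crop=True,  paddleocr=False, shadow_removal=True,  cache=True),
}
```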

Field-Level Performance Analysis

High-Performance Fields

Fields that consistently perform well across all approaches:

Field         | Best F1 | Best Approach    | Performance Trend
Gender        | 85.1%   | Crop + PaddleOCR | Consistent improvement
Birth Date    | 80.5%   | Crop + PaddleOCR | Strong improvement
Document Type | 85.4%   | Crop + PaddleOCR | Significant improvement
Surname       | 82.9%   | Crop + PaddleOCR | Consistent improvement

Medium-Performance Fields

Fields with moderate improvement:

Field       | Best F1 | Best Approach         | Performance Trend
Birth Place | 83.4%   | Crop Only             | Good improvement
Expiry Date | 78.5%   | Crop + PaddleOCR      | Moderate improvement
Issue Date  | 69.3%   | Crop + Shadow + Cache | Variable performance
Address     | 44.4%   | Crop + PaddleOCR      | Limited improvement

Low-Performance Fields

Fields that remain challenging:

Field           | Best F1 | Best Approach            | Notes
MRZ Lines       | 41.8%   | Crop + Shadow + Cache    | Complex OCR patterns
Personal Number | 40.0%   | Crop + PaddleOCR + Cache | Small text, variable format
Issue Place     | 50.0%   | Crop + PaddleOCR + Cache | Handwritten text challenges

Zero-Performance Fields

Fields that consistently fail across all approaches:

  • Recto/Verso: Document side detection
  • Code: Encoded information
  • Height: Physical measurements
  • Type: Document classification

Key Findings

1. Preprocessing Impact

  • Cropping alone delivers the strongest overall boost (+18.8 F1 pts vs. baseline)
  • PaddleOCR + Shadow Removal is highly competitive (up to +18.0 F1 pts)
  • Caching has minimal to moderate impact on accuracy

2. Field Type Sensitivity

  • Structured fields (dates, numbers) benefit most from preprocessing
  • Text fields (names, addresses) show moderate improvement
  • Complex fields (MRZ, codes) remain challenging

3. Processing Pipeline Efficiency

  • Crop currently provides the best overall F1 in this evaluation
  • Crop + PaddleOCR + Shadow Removal is close and benefits some fields
  • Caching shows minimal accuracy gains; use it for speed, not accuracy (see the sketch below)
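
Assuming preprocessing is deterministic for a given image, its output can be cached and reused across runs. A minimal sketch of content-addressed caching, with hypothetical names (`preprocess` stands in for whichever pipeline variant is configured):

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".preproc_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_preprocess(image_path: str, preprocess):
    """Reuse preprocessing output keyed by the image's content hash."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.pkl"
    if cache_file.exists():                       # cache hit: skip recomputation
        return pickle.loads(cache_file.read_bytes())
    result = preprocess(image_path)               # cache miss: run the pipeline
    cache_file.write_bytes(pickle.dumps(result))
    return result
```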

Recommendations

For Production Use

  1. Use Crop as the primary preprocessing step
  2. Focus optimization on high-value fields (dates, document types, names)
  3. Consider field-specific preprocessing strategies for challenging fields

For Further Research

  1. Investigate MRZ line extraction techniques
  2. Explore advanced OCR methods for handwritten text
  3. Develop specialized preprocessing for low-performance fields

Performance Targets

  • Overall F1 (macro-averaged per field type): Target 65%+ (currently 60.7% best; zero-performance fields pull this well below the 92.3% micro F1)
  • Field Accuracy (macro-averaged per field type): Target 50%+ (currently 43.5% best)
  • Critical Fields: Ensure 80%+ F1 for dates and document types

Technical Details

Dataset Characteristics

  • Total Samples: 56 documents
  • Field Types: 25+ different field categories
  • Document Types: Identity documents, permits, certificates
  • Image Quality: Variable (scanned, photographed, digital)

Evaluation Methodology

  • Ground Truth: Manually annotated field boundaries and text
  • Evaluation: Field-level precision, recall, and F1 calculation
  • Aggregation: Micro-averaging across all fields and samples, as sketched below
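
A minimal sketch of field-level matching and micro-aggregation, assuming predictions and ground truth are both dicts mapping field name to text. Whether the real evaluation uses exact or fuzzy matching is not stated, so exact matching after light normalization is shown as an assumption:

```python
def match_fields(predicted: dict, truth: dict) -> tuple:
    """Count TP/FP/FN for one document by exact field-value comparison."""
    norm = lambda s: " ".join(s.split()).lower()   # illustrative normalization
    tp = sum(1 for k, v in predicted.items()
             if k in truth and norm(v) == norm(truth[k]))
    fp = len(predicted) - tp    # extracted but wrong or spurious
    fn = len(truth) - tp        # ground-truth fields missed or mismatched
    return tp, fp, fn

def pooled_counts(documents) -> tuple:
    """Pool counts over all documents before computing micro P/R/F1."""
    tp = fp = fn = 0
    for predicted, truth in documents:
        d_tp, d_fp, d_fn = match_fields(predicted, truth)
        tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn
    return tp, fp, fn
```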

Preprocessing Pipeline

  1. Image Input: Raw document images
  2. Cropping: Content area detection and extraction
  3. Document Processing: PaddleOCR unwarping and orientation
  4. Shadow Removal: Optional DocShadow processing
  5. Field Extraction: OCR-based text extraction
  6. Post-processing: Field validation and formatting (an end-to-end sketch follows)
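
An end-to-end sketch of these six steps, assuming OpenCV for the content crop and treating the PaddleOCR unwarping, DocShadow, and field-extraction stages as injected helpers; all helper names are hypothetical:

```python
import cv2
import numpy as np

def crop_content(image: np.ndarray) -> np.ndarray:
    """Step 2: content-aware crop via the largest foreground contour."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return image
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return image[y:y + h, x:x + w]

def run_pipeline(path: str, unwarp, remove_shadow, extract_fields) -> dict:
    """Steps 1-6; unwarp, remove_shadow, and extract_fields stand in for
    PaddleOCR document processing, DocShadow, and the OCR extractor."""
    image = cv2.imread(path)                  # 1. image input
    image = crop_content(image)               # 2. content crop
    image = unwarp(image)                     # 3. unwarping + orientation
    image = remove_shadow(image)              # 4. optional shadow removal
    fields = extract_fields(image)            # 5. OCR-based field extraction
    return {k: v.strip() for k, v in fields.items()}  # 6. light post-processing
```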

Conclusion

The evaluation demonstrates that preprocessing significantly improves document field extraction performance. Cropping alone provides the best balance of performance and complexity, achieving an 18.8-point F1 improvement over the baseline (73.5% to 92.3% micro F1), with the PaddleOCR and shadow-removal variants close behind. While some fields remain challenging, the overall pipeline shows strong potential for production deployment with further field-specific optimizations.


Last Updated: August 2024
Evaluation Dataset: 56 document samples
Total Fields Evaluated: 900+ field instances