# Document Field Extraction Evaluation Results

## Overview

This document presents the evaluation results for document field extraction using different preprocessing approaches. The evaluation was conducted on a dataset of 56 document samples with various field types commonly found in identity documents.

## Evaluation Metrics

The evaluation uses standard information extraction metrics:

- **Precision**: Ratio of correctly extracted fields to total extracted fields
- **Recall**: Ratio of correctly extracted fields to total ground-truth fields
- **F1-Score**: Harmonic mean of precision and recall
- **Accuracy**: Overall field-level accuracy
- **TP**: True Positives (correctly extracted fields)
- **FP**: False Positives (incorrectly extracted fields)
- **FN**: False Negatives (missed fields)

## Preprocessing Approaches

### 1. No Preprocessing (Baseline)

- **Configuration**: Raw images without any preprocessing
- **Performance**:
  - Micro Precision: 79.0%
  - Micro Recall: 68.7%
  - Micro F1: 73.5%
  - Field Accuracy: 68.7%

### 2. Crop

- **Configuration**: Content-aware cropping (no shadow removal)
- **Performance**:
  - Micro Precision: 94.8%
  - Micro Recall: 89.9%
  - Micro F1: 92.3% (+18.8 pts over baseline)
  - Field Accuracy: 89.9%

### 3. Crop + PaddleOCR + Shadow Removal

- **Configuration**: Cropping with PaddleOCR document processing and shadow removal
- **Performance**:
  - Micro Precision: 93.6%
  - Micro Recall: 89.4%
  - Micro F1: 91.5% (+18.0 pts over baseline)
  - Field Accuracy: 89.4%

### 4. Crop + PaddleOCR + Shadow Removal + Cache

- **Configuration**: Cropping with PaddleOCR, shadow removal, and caching
- **Performance**:
  - Micro Precision: 92.5%
  - Micro Recall: 88.3%
  - Micro F1: 90.3% (+16.8 pts over baseline)
  - Field Accuracy: 88.3%

### 5. Crop + Shadow Removal + Cache

- **Configuration**: Cropping with shadow removal and caching
- **Performance**:
  - Micro Precision: 93.6%
  - Micro Recall: 88.5%
  - Micro F1: 91.0% (+17.5 pts over baseline)
  - Field Accuracy: 88.5%

## Field-Level Performance Analysis

### High-Performance Fields

Fields that consistently perform well across all approaches:

| Field | Best F1 | Best Approach | Performance Trend |
|-------|---------|---------------|-------------------|
| **Gender** | 85.1% | Crop + PaddleOCR | Consistent improvement |
| **Birth Date** | 80.5% | Crop + PaddleOCR | Strong improvement |
| **Document Type** | 85.4% | Crop + PaddleOCR | Significant improvement |
| **Surname** | 82.9% | Crop + PaddleOCR | Consistent improvement |

### Medium-Performance Fields

Fields with moderate improvement:

| Field | Best F1 | Best Approach | Performance Trend |
|-------|---------|---------------|-------------------|
| **Birth Place** | 83.4% | Crop Only | Good improvement |
| **Expiry Date** | 78.5% | Crop + PaddleOCR | Moderate improvement |
| **Issue Date** | 69.3% | Crop + Shadow + Cache | Variable performance |
| **Address** | 44.4% | Crop + PaddleOCR | Limited improvement |

### Low-Performance Fields

Fields that remain challenging:

| Field | Best F1 | Best Approach | Notes |
|-------|---------|---------------|-------|
| **MRZ Lines** | 41.8% | Crop + Shadow + Cache | Complex OCR patterns |
| **Personal Number** | 40.0% | Crop + PaddleOCR + Cache | Small text, variable format |
| **Issue Place** | 50.0% | Crop + PaddleOCR + Cache | Handwritten text challenges |

### Zero-Performance Fields

Fields that consistently fail across all approaches:

- **Recto/Verso**: Document side detection
- **Code**: Encoded information
- **Height**: Physical measurements
- **Type**: Document classification

## Key Findings

### 1. Preprocessing Impact

- **Cropping alone** delivers the strongest overall boost (+18.8 F1 pts vs. baseline)
- **PaddleOCR + shadow removal** is highly competitive (up to +18.0 F1 pts)
- **Caching** has minimal to moderate impact on accuracy

### 2. Field Type Sensitivity

- **Structured fields** (dates, numbers) benefit most from preprocessing
- **Text fields** (names, addresses) show moderate improvement
- **Complex fields** (MRZ, codes) remain challenging

### 3. Processing Pipeline Efficiency

- **Crop** currently provides the best overall F1 in this evaluation
- **Crop + PaddleOCR + Shadow Removal** is close behind and benefits some fields
- **Caching** shows minimal accuracy gains; use it for speed, not accuracy

## Recommendations

### For Production Use

1. **Use Crop** as the primary preprocessing step
2. **Focus optimization** on high-value fields (dates, document types, names)
3. **Consider field-specific** preprocessing strategies for challenging fields

### For Further Research

1. **Investigate MRZ line** extraction techniques
2. **Explore advanced OCR** methods for handwritten text
3. **Develop specialized** preprocessing for low-performance fields

### Performance Targets

- **Overall F1**: Target 95%+ (currently 92.3% best)
- **Field Accuracy**: Target 90%+ (currently 89.9% best)
- **Critical Fields**: Ensure 80%+ F1 for dates and document types

## Technical Details

### Dataset Characteristics

- **Total Samples**: 56 documents
- **Field Types**: 25+ different field categories
- **Document Types**: Identity documents, permits, certificates
- **Image Quality**: Variable (scanned, photographed, digital)

### Evaluation Methodology

- **Ground Truth**: Manually annotated field boundaries and text
- **Evaluation**: Field-level precision, recall, and F1 calculation
- **Aggregation**: Micro-averaging across all fields and samples

### Preprocessing Pipeline

1. **Image Input**: Raw document images
2. **Cropping**: Content area detection and extraction
3. **Document Processing**: PaddleOCR unwarping and orientation correction
4. **Shadow Removal**: Optional DocShadow processing
5. **Field Extraction**: OCR-based text extraction
6. **Post-processing**: Field validation and formatting
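To make these stages concrete, the sketch below wires them together. It is illustrative only: `crop_content` and `remove_shadows` are hypothetical stand-ins for the pipeline's actual cropping and DocShadow steps, and the OCR call assumes the PaddleOCR 2.x Python API.

```python
from paddleocr import PaddleOCR  # real package; 2.x-style API assumed below


def crop_content(image_path: str) -> str:
    """Stand-in for stages 1-2: detect the document's content area and crop."""
    return image_path


def remove_shadows(image_path: str) -> str:
    """Stand-in for stage 4: optional DocShadow shadow removal."""
    return image_path


def extract_fields(image_path: str, shadow_removal: bool = True) -> dict:
    """Run the sketched pipeline on one document image."""
    image = crop_content(image_path)
    if shadow_removal:
        image = remove_shadows(image)
    # Stages 3 and 5: PaddleOCR orientation classification plus text
    # extraction; document unwarping is elided in this sketch.
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    lines = ocr.ocr(image, cls=True)
    # Stage 6: post-processing would map OCR lines to named fields
    # (dates, surname, MRZ, ...); here we return the raw OCR output.
    return {"raw_ocr": lines}
```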
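For reference, the micro-averaged metrics used throughout this report (see Evaluation Methodology above) reduce to a few lines of arithmetic: TP, FP, and FN are summed over all fields before precision, recall, and F1 are computed. The counts below are illustrative, not the actual evaluation data.

```python
from collections import Counter


def micro_metrics(per_field_counts):
    """Micro-average: sum TP/FP/FN over all fields, then compute P/R/F1."""
    totals = Counter()
    for counts in per_field_counts.values():
        totals.update(counts)
    tp, fp, fn = totals["tp"], totals["fp"], totals["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts only, not the actual evaluation data.
counts = {
    "surname":    {"tp": 45, "fp": 6, "fn": 5},
    "birth_date": {"tp": 42, "fp": 8, "fn": 9},
}
p, r, f1 = micro_metrics(counts)
print(f"micro P={p:.1%}  R={r:.1%}  F1={f1:.1%}")
```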
## Conclusion

The evaluation demonstrates that preprocessing significantly improves document field extraction performance. Content-aware cropping alone delivers the largest gain, lifting micro F1 by 18.8 points over the baseline (73.5% to 92.3%), with **Crop + PaddleOCR + Shadow Removal** close behind at +18.0 points. While some fields remain challenging, the overall pipeline shows strong potential for production deployment with further field-specific optimizations.

---

*Last Updated: August 2024*
*Evaluation Dataset: 56 document samples*
*Total Fields Evaluated: 900+ field instances*