# Document Field Extraction Evaluation Results
## Overview
This document presents the evaluation results for document field extraction using different preprocessing approaches. The evaluation was conducted on a dataset of 56 document samples with various field types commonly found in identity documents.
## Evaluation Metrics
The evaluation uses standard information extraction metrics:
- **Precision**: Ratio of correctly extracted fields to total extracted fields
- **Recall**: Ratio of correctly extracted fields to total ground truth fields
- **F1-Score**: Harmonic mean of precision and recall
- **Accuracy**: Overall field-level accuracy
- **TP**: True Positives (correctly extracted fields)
- **FP**: False Positives (incorrectly extracted fields)
- **FN**: False Negatives (missed fields)
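
The metrics above follow directly from the TP/FP/FN counts. A minimal sketch of the computation (the example counts are illustrative, not taken from this evaluation):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from field-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for one field type:
p, r, f = prf(tp=90, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.900 R=0.750 F1=0.818
```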
## Preprocessing Approaches
### 1. No Preprocessing (Baseline)
- **Configuration**: Raw images without any preprocessing
- **Performance**:
  - Micro Precision: 79.0%
  - Micro Recall: 68.7%
  - Micro F1: 73.5%
  - Field Accuracy: 68.7%
### 2. Crop
- **Configuration**: Content-aware cropping (no shadow removal)
- **Performance**:
  - Micro Precision: 94.8%
  - Micro Recall: 89.9%
  - Micro F1: 92.3% (+18.8 pts over baseline)
  - Field Accuracy: 89.9%
### 3. Crop + PaddleOCR + Shadow Removal
- **Configuration**: Cropping with PaddleOCR document processing and shadow removal
- **Performance**:
  - Micro Precision: 93.6%
  - Micro Recall: 89.4%
  - Micro F1: 91.5% (+18.0 pts over baseline)
  - Field Accuracy: 89.4%
### 4. Crop + PaddleOCR + Shadow Removal + Cache
- **Configuration**: Cropping with PaddleOCR, shadow removal, and caching
- **Performance**:
  - Micro Precision: 92.5%
  - Micro Recall: 88.3%
  - Micro F1: 90.3% (+16.8 pts over baseline)
  - Field Accuracy: 88.3%
### 5. Crop + Shadow Removal + Cache
- **Configuration**: Cropping with shadow removal and caching
- **Performance**:
  - Micro Precision: 93.6%
  - Micro Recall: 88.5%
  - Micro F1: 91.0% (+17.5 pts over baseline)
  - Field Accuracy: 88.5%
## Field-Level Performance Analysis
### High-Performance Fields
Fields that consistently perform well across all approaches:
| Field | Best F1 | Best Approach | Performance Trend |
|-------|----------|---------------|-------------------|
| **Gender** | 85.1% | Crop + PaddleOCR | Consistent improvement |
| **Birth Date** | 80.5% | Crop + PaddleOCR | Strong improvement |
| **Document Type** | 85.4% | Crop + PaddleOCR | Significant improvement |
| **Surname** | 82.9% | Crop + PaddleOCR | Consistent improvement |
### Medium-Performance Fields
Fields with moderate improvement:
| Field | Best F1 | Best Approach | Performance Trend |
|-------|----------|---------------|-------------------|
| **Birth Place** | 83.4% | Crop Only | Good improvement |
| **Expiry Date** | 78.5% | Crop + PaddleOCR | Moderate improvement |
| **Issue Date** | 69.3% | Crop + Shadow + Cache | Variable performance |
| **Address** | 44.4% | Crop + PaddleOCR | Limited improvement |
### Low-Performance Fields
Fields that remain challenging:
| Field | Best F1 | Best Approach | Notes |
|-------|----------|---------------|-------|
| **MRZ Lines** | 41.8% | Crop + Shadow + Cache | Complex OCR patterns |
| **Personal Number** | 40.0% | Crop + PaddleOCR + Cache | Small text, variable format |
| **Issue Place** | 50.0% | Crop + PaddleOCR + Cache | Handwritten text challenges |
### Zero-Performance Fields
Fields that consistently fail across all approaches:
- **Recto/Verso**: Document side detection
- **Code**: Encoded information
- **Height**: Physical measurements
- **Type**: Document classification
## Key Findings
### 1. Preprocessing Impact
- **Cropping alone** delivers the strongest overall boost (+18.8 F1 pts vs. baseline)
- **PaddleOCR + Shadow Removal** is highly competitive (up to +18.0 F1 pts)
- **Caching** has minimal to moderate impact on accuracy
### 2. Field Type Sensitivity
- **Structured fields** (dates, numbers) benefit most from preprocessing
- **Text fields** (names, addresses) show moderate improvement
- **Complex fields** (MRZ, codes) remain challenging
### 3. Processing Pipeline Efficiency
- **Crop** currently provides the best overall F1 in this evaluation
- **Crop + PaddleOCR + Shadow Removal** is close and benefits some fields
- **Caching** shows minimal gains; use for speed, not accuracy
## Recommendations
### For Production Use
1. **Use Crop** as the primary preprocessing step
2. **Focus optimization** on high-value fields (dates, document types, names)
3. **Consider field-specific** preprocessing strategies for challenging fields
### For Further Research
1. **Investigate MRZ line** extraction techniques
2. **Explore advanced OCR** methods for handwritten text
3. **Develop specialized** preprocessing for low-performance fields
### Performance Targets
- **Overall F1**: Maintain 90%+ (currently 92.3% best, with Crop)
- **Field Accuracy**: Maintain 85%+ (currently 89.9% best, with Crop)
- **Critical Fields**: Ensure 80%+ F1 for dates and document types
## Technical Details
### Dataset Characteristics
- **Total Samples**: 56 documents
- **Field Types**: 25+ different field categories
- **Document Types**: Identity documents, permits, certificates
- **Image Quality**: Variable (scanned, photographed, digital)
### Evaluation Methodology
- **Ground Truth**: Manually annotated field boundaries and text
- **Evaluation**: Field-level precision, recall, and F1 calculation
- **Aggregation**: Micro-averaging across all fields and samples
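
Micro-averaging pools the TP/FP/FN counts across all fields and samples before computing the metrics, so frequent fields carry more weight than rare ones. A minimal sketch (field counts are illustrative):

```python
# Per-field (tp, fp, fn) counts pooled over all samples; numbers are illustrative.
counts = {
    "surname":    (50, 5, 8),
    "birth_date": (48, 6, 10),
    "mrz_lines":  (12, 9, 20),
}

# Sum the raw counts first, then compute the metrics once.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())

micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
print(f"micro P={micro_p:.3f} R={micro_r:.3f} F1={micro_f1:.3f}")
# micro P=0.846 R=0.743 F1=0.791
```

Note that micro F1 simplifies to `2*tp / (2*tp + fp + fn)`, which is why it equals field accuracy's harmonic counterpart here rather than an unweighted average of per-field F1 scores.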
### Preprocessing Pipeline
1. **Image Input**: Raw document images
2. **Cropping**: Content area detection and extraction
3. **Document Processing**: PaddleOCR unwarping and orientation
4. **Shadow Removal**: Optional DocShadow processing
5. **Field Extraction**: OCR-based text extraction
6. **Post-processing**: Field validation and formatting
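
The six stages above can be sketched as a configurable sequence. The stage callables (cropping, PaddleOCR document processing, DocShadow, OCR extraction) are stand-ins here, not the real implementations:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    use_paddle_doc: bool = True      # stage 3: PaddleOCR unwarping + orientation
    use_shadow_removal: bool = True  # stage 4: DocShadow processing

def run_pipeline(image, cfg: PipelineConfig, steps):
    """Apply the preprocessing stages in order; `steps` maps stage
    names to callables (real implementations not shown here)."""
    image = steps["crop"](image)                 # 2. content-area crop
    if cfg.use_paddle_doc:
        image = steps["doc_process"](image)      # 3. unwarp + orient
    if cfg.use_shadow_removal:
        image = steps["shadow_removal"](image)   # 4. shadow removal
    fields = steps["extract_fields"](image)      # 5. OCR field extraction
    return steps["postprocess"](fields)          # 6. validate + format
```

Toggling the two optional flags reproduces the approach matrix evaluated above (e.g. both off with caching disabled corresponds to the Crop-only configuration).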
## Conclusion
The evaluation demonstrates that preprocessing significantly improves document field extraction performance. The **Crop** approach provides the best balance of performance and complexity, achieving an 18.8-point improvement in micro F1 over the baseline, with **Crop + PaddleOCR + Shadow Removal** close behind. While some fields remain challenging, the overall pipeline shows strong potential for production deployment with further field-specific optimizations.
---
*Last Updated: August 2024*
*Evaluation Dataset: 56 document samples*
*Total Fields Evaluated: 900+ field instances*