Merge pull request #22 from Yuliang-Liu/dev

Dev
2024-01-18 21:28:34 +08:00
parent 9f0ffdf13c 35f7cc99cb
commit fecf79eb32
501 changed files with 479803 additions and 4082266 deletions
@@ -0,0 +1,50 @@
+# On the Hidden Mystery of OCR in Large Multimodal Models 
+<img src="./images/all_data.png" width="96%" height="96%">
+
+> Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. Our study encompasses 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques.
+
+**[Project Page [This Page]](https://github.com/Yuliang-Liu/MultimodalOCR)** | **[Paper](https://arxiv.org/abs/2305.07895)** | **[OCRBench Leaderboard](http://27.18.7.167:7682/)** |
+
+# Data
+| Data | Link | Description |
+| --- | --- | --- |
+| Full Test Json | [Full Test](./OCRBench/FullTest.json) | This file contains the test data used in Table 1 and Table 2 from [Paper](https://arxiv.org/abs/2305.07895). |
+| OCRBench Json | [OCRBench](./OCRBench/OCRBench.json) | This file contains the test data in OCRBench used in Table 3 from [Paper](https://arxiv.org/abs/2305.07895). |
+| All Test Images |[All Images](https://drive.google.com/file/d/1U5AtLoJ7FrJe9yfcbssfeLmlKb7dTosc/view?usp=drive_link) | This file contains all the testing images used in [Paper](https://arxiv.org/abs/2305.07895), including OCRBench Images.|
+| OCRBench Images | [OCRBench Images](https://drive.google.com/file/d/1a3VRJx3V3SdOmPr7499Ky0Ug8AwqGUHO/view?usp=drive_link) | This file only contains the images used in OCRBench. |
+| Test Results | [Test Results](https://drive.google.com/drive/folders/15XlHCuNTavI1Ihqm4G7u3J34BHpkaqyE?usp=drive_link) | This folder contains the result files for the tested models. |
+
+# OCRBench
+
+OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation. 
+
+You can find the results of Large Multimodal Models on the **[OCRBench Leaderboard](http://27.18.7.167:7682/)**. If you would like to include your model in the OCRBench leaderboard, please follow the evaluation instructions provided below and feel free to contact us via email at zhangli123@hust.edu.cn. We will update the leaderboard promptly.
+
+<img src="./images/GPT4V_Gemini.png" width="96%" height="96%">
+
+# Evaluation
+The test code for evaluating the models in the paper can be found in [scripts](./scripts). To evaluate other models, fill in the parts marked "TODO" in [example](./example.py).
+
+Example evaluation scripts:
+```Shell
+python ./scripts/monkey.py --image_folder ./data --OCRBench_file ./OCRBench/OCRBench.json --save_name Monkey_OCRBench --num_workers GPU_Nums # Test on OCRBench
+python ./scripts/monkey.py --image_folder ./data --OCRBench_file ./OCRBench/FullTest.json --save_name Monkey_FullTest --num_workers GPU_Nums # Full Test
+```
+
+# Citation
+If you wish to refer to the baseline results published here, please use the following BibTeX entry:
+```BibTeX
+@misc{liu2024hidden,
+      title={On the Hidden Mystery of OCR in Large Multimodal Models}, 
+      author={Yuliang Liu and Zhang Li and Biao Yang and Chunyuan Li and Xucheng Yin and Cheng-lin Liu and Lianwen Jin and Xiang Bai},
+      year={2024},
+      eprint={2305.07895},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+```
+
+
+
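The example commands in the new README's Evaluation section pass four flags to `./scripts/monkey.py`: `--image_folder`, `--OCRBench_file`, `--save_name`, and `--num_workers`. As a rough illustration only, a script accepting those flags could parse them as sketched below. The flag names come from the README; the defaults, help strings, and the `build_parser` helper are assumptions, and the real scripts may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser for the four flags shown in the README commands."""
    parser = argparse.ArgumentParser(description="OCRBench evaluation (illustrative sketch)")
    # Defaults below are assumptions, not values taken from the repository.
    parser.add_argument("--image_folder", type=str, default="./data",
                        help="Folder containing the test images")
    parser.add_argument("--OCRBench_file", type=str, default="./OCRBench/OCRBench.json",
                        help="Benchmark JSON (OCRBench.json or FullTest.json)")
    parser.add_argument("--save_name", type=str, default="results",
                        help="Name under which the results are saved")
    parser.add_argument("--num_workers", type=int, default=1,
                        help="Number of workers (the README passes GPU_Nums here)")
    return parser


# Parse an explicit argument list that mirrors the README's full-test command,
# with GPU_Nums replaced by a concrete number.
args = build_parser().parse_args([
    "--image_folder", "./data",
    "--OCRBench_file", "./OCRBench/FullTest.json",
    "--save_name", "Monkey_FullTest",
    "--num_workers", "8",
])
print(args.OCRBench_file, args.save_name, args.num_workers)
```

Note that `argparse` keeps the capitalization of the flag, so the value is read back as `args.OCRBench_file`.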
@@ -1,102 +0,0 @@
-[[arXiv 2305.07895]](https://arxiv.org/pdf/2305.07895.pdf) On the Hidden Mystery of OCR in Large Multimodal Models.
-
-We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts) and handwritten mathematical expression recognition. The baseline results showcased in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. Online evaluation DEMO is available at this [link](http://124.220.17.244:7860/). 
-
-# Results
-
-Results are available in the answer_save folder. It should be noted that for BLIP2OPT, when using the inference code on Hugging Face, the accuracy of text recognition is high, but the model outputs nothing for the VQA tasks. Conversely, when using the LAVIS library for inference, the accuracy of text recognition is low, while the VQA accuracy is normal. We believe that the inference process of BLIP2OPT still needs to be optimized. In our experiments, we take the maximum value of the two methods as the final result.
-
-![table](https://github.com/echo840/MultimodalOCR/assets/87795401/b7cb6ab7-2e6c-462c-84ae-41b9d209ce48)
-
-Visualization results
-![Modified](https://github.com/echo840/MultimodalOCR/assets/87795401/b74ff847-534c-49ca-a31e-8f8854380a34)
-
-![Multilingualism](https://github.com/echo840/MultimodalOCR/assets/87795401/8bf5c8ab-bec7-4b77-b2bb-7a319975a762)
-
-
-# Data Download
-| Data file | Size |
-| --- | ---: |
-|[text recognition](https://pan.baidu.com/s/1Ba950d94u8RQmtqvkLBk-A) code:iwyn | 1.37GB |
-|[STVQA](https://rrc.cvc.uab.es/?ch=11&com=downloads) End-to-End Task-3 and Training images|1.88GB|
-|[ocrVQA](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_)|--|
-|[textVQA](https://textvqa.org/dataset/) val set|6.6GB|
-|[docVQA](https://rrc.cvc.uab.es/?ch=17&com=downloads) Task 1 Validation set|0.8GB|
-|[ESTVQA](https://cloudstor.aarnet.edu.au/plus/s/LSishuuSE5DBKJp)|5.2GB|
-|[SROIE](https://rrc.cvc.uab.es/?ch=13&com=downloads)|0.19GB|
-|[FUNSD](https://guillaumejaume.github.io/FUNSD/download/)|16MB|
-|[POIE](https://drive.google.com/file/d/1eEMNiVeLlD-b08XW_GfAGfPmmII-GDYs/view)|0.43GB|
-|[HME100K](https://ai.100tal.com/openData/formulaRecognition)|0.69GB|
-|[Google cloud](https://drive.google.com/drive/folders/1plgZf4XIuiOGjx4b17E1rvTA2UKpZRe1?usp=drive_link)|9.38GB|
-
-
-We assume that your symlinked `data` directory has the following structure:
-
-```
-data
-|_ IC13_857
-|_ IC15_1811
-|_ ...
-|_ ESTVQA
-|_ textVQA
-|_ ...
-|_ FUNSD
-|_ POIE
-```
-
-
-# Usage
-
-Evaluate on all datasets:
-```Shell
-python eval.py --model_name LLaVA --eval_all
-```
-
-Evaluate on one dataset:
-```Shell
-python eval.py --model_name LLaVA --eval_textVQA
-```
-```Shell
-python eval.py --model_name LLaVA --eval_ocr --ocr_dataset_name "ct80 IIIT5K"
-```
-The results will be saved in the answer folder.
-
-If you want to add a new model, please write its inference function under the folder "models", and update the get_model function in eval.py. An example of inference code is as follows:
-
-```python
-import torch
-from PIL import Image
-from lavis.models import load_model_and_preprocess
-from ..process import pad_image, resize_image
-class lavis:
-    def __init__(self, model_name, model_type, device) -> None:
-        model, vis_processors, txt_processors = load_model_and_preprocess(name = model_name, model_type = model_type, is_eval=True, device=device)
-        self.model_name = model_name
-        self.model = model
-        self.vis_processors = vis_processors
-        self.txt_processors = txt_processors
-        self.device = device
-    def generate(self, image, question, name='resize'):
-        if 'opt' in self.model_name:
-            prompt = f'Question: {question} Answer:'
-        elif 't5' in self.model_name:
-            prompt = f'Question: {question} Short answer:'
-        else:
-            prompt = f'Question: {question} Answer:'
-        image = Image.open(image).convert("RGB")
-        if name == "pad":
-            image = pad_image(image, (224,224))
-        elif name == "resize":
-            image = resize_image(image, (224,224))
-        image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
-        prompt = self.txt_processors["eval"](prompt)
-        answer = self.model.predict_answers(samples={"image": image, "text_input": prompt}, inference_method="generate", max_len=48, min_len=1)[0]
-        return answer
-```
-
-# Related Projects
- [LLaVA](https://github.com/haotian-liu/LLaVA.git)
- [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4.git)
- [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl.git)
- [OpenFlamingo](https://github.com/mlfoundations/open_flamingo.git)
- [LAVIS](https://github.com/salesforce/LAVIS.git)
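The removed README assumed a `data` directory with one subfolder per dataset (`IC13_857`, `IC15_1811`, `ESTVQA`, `textVQA`, `FUNSD`, `POIE`, and others elided with `...`). A quick pre-flight check of that layout might look like the sketch below. Only the folder names listed explicitly in the README are checked; the `missing_subdirs` helper and the `EXPECTED_SUBDIRS` constant are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Only the subfolders named explicitly in the removed README; the entries it
# elides with "..." are deliberately not guessed here.
EXPECTED_SUBDIRS = ["IC13_857", "IC15_1811", "ESTVQA", "textVQA", "FUNSD", "POIE"]


def missing_subdirs(data_root, expected=EXPECTED_SUBDIRS):
    """Return the expected dataset folders that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_subdirs("./data")` before an evaluation run returns an empty list when every listed folder is present.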
@@ -1,26 +0,0 @@
-{
-    "textVQA": 0.2886,
-    "docVQA": 0.044868199663488505,
-    "ocrVQA": 0.1136,
-    "STVQA": 0.2208,
-    "ESTVQA_EN": 0.3348,
-    "ESTVQA_CN": 0.0016,
-    "SROIE":0.0011985617259288853,
-    "IIIT5K": 0.641,
-    "svt": 0.6769706336939721,
-    "IC13_857": 0.7071178529754959,
-    "IC15_1811": 0.5897294312534511,
-    "svtp": 0.6294573643410852,
-    "ct80": 0.6111111111111112,
-    "cocotext": 0.4171382376717866,
-    "ctw": 0.5089058524173028,
-    "totaltext": 0.5243071331213085,
-    "HOST": 0.47392384105960267,
-    "WOST": 0.5525662251655629,
-    "WordArt": 0.6260754467240238,
-    "FUNSD": 0.01020408163265306,
-    "HME": 0.0004,
-    "POIE": 0.0208827717133365,
-    "IAM": 0.504,
-    "ReCTS": 0.0
-}
@@ -1,43 +0,0 @@
-{
-    "IIIT5K": 0.4186666666666667,
-    "svt": 0.42040185471406494,
-    "IC13_857": 0.40956826137689617,
-    "IC15_1811": 0.36554389839867474,
-    "svtp": 0.4232558139534884,
-    "ct80": 0.4965277777777778,
-    "cocotext": 0.22685933710590137,
-    "ctw": 0.3816793893129771,
-    "totaltext": 0.3493866424352567,
-    "HOST": 0.29387417218543044,
-    "WOST": 0.34644039735099336,
-    "WordArt": 0.47319655857048315,
-    "ESTVQA_CN": 0.001,
-    "ESTVQA_EN": 0.2836,
-    "SROIE": 0.00039952057530962844,
-    "ocrVQA": 0.1152,
-    "STVQA": 0.1402,
-    "textVQA": 0.1872,
-    "docVQA": 0.029725182277061134,
-    "FUNSD": 0.011904761904761904,
-    "HME": 0.0,
-    "POIE": 0.013130833728840373,
-    "IAM": 0.23933333333333334,
-    "ReCTS": 0.0
-}
-
-{
-    "IIIT5K": 0.48,
-    "svt": 0.5038639876352395,
-    "IC13_857": 0.48891481913652274,
-    "IC15_1811": 0.4218663721700718,
-    "svtp": 0.5038759689922481,
-    "ct80": 0.5729166666666666,
-    "cocotext": 0.2625303152789006,
-    "ctw": 0.41857506361323155,
-    "totaltext": 0.4057246706042708,
-    "HOST": 0.34519867549668876,
-    "WOST": 0.4105960264900662,
-    "WordArt": 0.514228987425546,
-    "IAM": 0.289,
-    "help":"we replace all special chars with \" \" instead of \"\", which is useful for MiniGPT4."
-}
@@ -1,16 +0,0 @@
-{
-    "IIIT5K": 0.48,
-    "svt": 0.5038639876352395,
-    "IC13_857": 0.48891481913652274,
-    "IC15_1811": 0.4218663721700718,
-    "svtp": 0.5038759689922481,
-    "ct80": 0.5729166666666666,
-    "cocotext":0.2625303152789006,
-    "ctw": 0.41857506361323155,
-    "totaltext": 0.4057246706042708,
-    "HOST": 0.34519867549668876,
-    "WOST": 0.4105960264900662,
-    "WordArt":0.514228987425546,
-    "IAM": 0.289,
-    "help":"we replace all special chars with \" \" instead of \"\", which is useful for MiniGPT4."
-}
@@ -1,25 +0,0 @@
-{   "IIIT5K":0.682,
-    "svt":0.741885626,
-    "IC13_857":0.740956826,
-    "IC15_1811":0.636112645,
-    "svtp":0.734883721,
-    "ct80":0.677083333,
-    "cocotext":0.455234438,
-    "ctw":0.539440204,
-    "totaltext":0.578373467,
-    "HOST":0.481788079,
-    "WOST": 0.605546357615894,
-    "WordArt": 0.6062210456651225,
-    "textVQA": 0.2908,
-    "docVQA": 0.050476724621424565,
-    "ocrVQA": 0.2782,
-    "STVQA": 0.1932,
-    "ESTVQA_CN": 0.0026,
-    "ESTVQA_EN": 0.282,
-    "SROIE": 0.0011985617259288853,
-    "FUNSD": 0.008503401360544218,
-    "POIE": 0.021199177345356746,
-    "HME": 0.0,
-    "IAM": 0.4553333333333333,
-    "ReCTS": 0.0
-}
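The removed README states that for BLIP2OPT the final result is the maximum of two inference methods (Hugging Face and LAVIS). With per-dataset accuracy files shaped like the JSON objects above, that merge could be sketched as follows. The function names are hypothetical, and the sample values are borrowed from two of the result files above purely as inputs; this page does not say which model each file belongs to.

```python
def numeric_scores(results):
    """Keep only numeric entries (some result files also carry a "help" note)."""
    return {k: float(v) for k, v in results.items() if isinstance(v, (int, float))}


def merge_best(run_a, run_b):
    """Per-dataset maximum of two accuracy dicts.

    A dataset present in only one run keeps that run's value.
    """
    a, b = numeric_scores(run_a), numeric_scores(run_b)
    merged = {}
    for name in sorted(a.keys() | b.keys()):
        candidates = [run[name] for run in (a, b) if name in run]
        merged[name] = max(candidates)
    return merged


# Sample inputs only; see the note above about where the numbers come from.
run_a = {"textVQA": 0.2886, "IIIT5K": 0.641, "HME": 0.0004}
run_b = {"textVQA": 0.2908, "IIIT5K": 0.682, "HME": 0.0, "help": "free-text note"}
best = merge_best(run_a, run_b)
print(best)
```

Filtering out non-numeric entries first keeps string fields such as `"help"` from reaching `max`.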