
The paper is available on [arXiv](https://arxiv.org/pdf/2305.07895.pdf).
# Results
Results are available in the `answer_save` folder.
![image](https://github.com/echo840/MultimodalOCR/assets/87795401/523e0421-7eca-4d15-89f1-3f7348321055)
Visualization results:
![rvk](https://github.com/echo840/MultimodalOCR/assets/87795401/21982aba-d063-4a52-a045-8d16e0e98f71)
# Data Download
| Data file | Size |
| --- | ---: |
|[text recognition](https://pan.baidu.com/s/1Ba950d94u8RQmtqvkLBk-A) code:iwyn | 1.37GB |
The TextVQA, KIE, and HME data will be added soon.
We assume that your symlinked `data` directory has the following structure:
```
data
|_ IC13_857
|_ IC15_1811
|_ ...
|_ ESTVQA
|_ textVQA
|_ ...
|_ FUNSD
|_ POIE
```
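Assuming the datasets were downloaded and extracted elsewhere, the symlinked `data` directory can be created like this (the source path is a placeholder; substitute your own location):

```Shell
# Placeholder path — point this at wherever the datasets were extracted.
DATASETS=/path/to/ocr_datasets
# Create (or refresh) the `data` symlink in the repository root.
ln -sfn "$DATASETS" data
```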
# Usage
Evaluate on all datasets:
```Shell
python eval.py --model_name LLaVA --eval_all
```
Evaluate on a single dataset:
```Shell
python eval.py --model_name LLaVA --eval_textVQA
```
Evaluate on specific OCR recognition datasets:
```Shell
python eval.py --model_name LLaVA --eval_ocr --ocr_dataset_name "ct80 IIIT5K"
```
The results will be saved in the `answer` folder.
If you want to add a new model, write its inference wrapper under the `models` folder and register it in the `get_model` function in `eval.py`. An example inference class is shown below:
```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
from ..process import pad_image, resize_image

class lavis:
    def __init__(self, model_name, model_type, device) -> None:
        # Load the LAVIS model together with its image/text preprocessors.
        model, vis_processors, txt_processors = load_model_and_preprocess(
            name=model_name, model_type=model_type, is_eval=True, device=device
        )
        self.model_name = model_name
        self.model = model
        self.vis_processors = vis_processors
        self.txt_processors = txt_processors
        self.device = device

    def generate(self, image, question, name='resize'):
        # Build the prompt in the format expected by the underlying LLM.
        if 'opt' in self.model_name:
            prompt = f'Question: {question} Answer:'
        elif 't5' in self.model_name:
            prompt = f'Question: {question} Short answer:'
        else:
            prompt = f'Question: {question} Answer:'
        # Load the image and bring it to the 224x224 input size.
        image = Image.open(image).convert("RGB")
        if name == "pad":
            image = pad_image(image, (224, 224))
        elif name == "resize":
            image = resize_image(image, (224, 224))
        image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
        prompt = self.txt_processors["eval"](prompt)
        answer = self.model.predict_answers(
            samples={"image": image, "text_input": prompt},
            inference_method="generate", max_len=48, min_len=1
        )[0]
        return answer
```
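Registering the new wrapper could then look like the following minimal sketch. The dispatch style and the `DummyModel` class here are illustrative assumptions, not the repository's actual `get_model` code:

```python
class DummyModel:
    """Stand-in for an inference wrapper such as the `lavis` class above."""
    def __init__(self, model_name):
        self.model_name = model_name

    def generate(self, image, question, name='resize'):
        # A real wrapper would run model inference here.
        return f"answer from {self.model_name}"


def get_model(model_name):
    # Map --model_name values to wrapper constructors (hypothetical registry).
    registry = {
        'Dummy': lambda: DummyModel('Dummy'),
        # 'BLIP2': lambda: lavis('blip2_t5', 'pretrain_flant5xl', 'cuda'),
    }
    if model_name not in registry:
        raise ValueError(f'unknown model: {model_name}')
    return registry[model_name]()


model = get_model('Dummy')
print(model.generate('image.jpg', 'What does the sign say?'))  # answer from Dummy
```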
# Related Projects
- [LLaVA](https://github.com/haotian-liu/LLaVA.git)
- [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4.git)
- [mPLUG-Owl](https://github.com/X-PLUG/mPLUG-Owl.git)
- [OpenFlamingo](https://github.com/mlfoundations/open_flamingo.git)
- [Lavis](https://github.com/salesforce/LAVIS.git)