Paper. The online evaluation pipeline is scheduled for release.

Results

Results are available in the answer_save folder. Note that for BLIP2OPT, inference with the Hugging Face code gives high text recognition accuracy, but the model outputs nothing for the VQA tasks. Conversely, inference with the LAVIS library gives low text recognition accuracy, while the VQA accuracy is normal. We believe that the inference process of BLIP2OPT still needs to be optimized. In our experiments, we take the maximum of the two methods' results as the final result.
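
The "maximum of the two methods" rule can be sketched as follows. This is only an illustration, assuming that scores are compared per dataset and stored as plain dictionaries; the function name merge_best and the numbers are hypothetical placeholders, not measured results.

```python
def merge_best(hf_scores, lavis_scores):
    """Return, for every dataset, the higher of the two accuracies.

    A dataset missing from one run is treated as 0.0 for that run.
    """
    datasets = set(hf_scores) | set(lavis_scores)
    return {
        name: max(hf_scores.get(name, 0.0), lavis_scores.get(name, 0.0))
        for name in sorted(datasets)
    }


# Placeholder numbers for illustration only; these are NOT real results.
hf_run = {"IIIT5K": 0.80, "textVQA": 0.0}
lavis_run = {"IIIT5K": 0.30, "textVQA": 0.25}
best = merge_best(hf_run, lavis_run)
```

With these placeholder inputs, best keeps the Hugging Face value for the text recognition dataset and the LAVIS value for the VQA dataset.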

Visualization results

[Figure: visualization results]

Data Download

Data file                                      Size
text recognition code:iwyn                     1.37GB
STVQA End-to-End Task-3 and Training images    1.88GB
ocrVQA                                         --
textVQA val set                                6.6GB
docVQA Task 1 Validation set                   0.8GB
ESTVQA                                         5.2GB
SROIE Task 3 test set                          0.19GB
FUNSD                                          16MB
POIE                                           0.43GB
HME100K                                        0.69GB

TextVQA, KIE and HME will be updated soon.

We assume that your symlinked data directory has the following structure:

data
|_ IC13_857
|_ IC15_1811
|_ ...
|_ ESTVQA
|_ textVQA
|_ ...
|_ FUNSD
|_ POIE
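
Before running an evaluation, it can help to confirm that the data directory matches the layout above. The following is a minimal sketch, not part of the repository: the helper missing_datasets is hypothetical, and it only checks the folder names that are spelled out in the tree (entries hidden behind "..." are not guessed).

```python
from pathlib import Path

# Folder names copied from the tree above; elided entries ("...") are omitted.
EXPECTED = ["IC13_857", "IC15_1811", "ESTVQA", "textVQA", "FUNSD", "POIE"]


def missing_datasets(data_root="data", expected=EXPECTED):
    """Return the expected dataset folders that are absent under data_root.

    Path.is_dir() follows symlinks, so a symlinked data directory works too.
    """
    root = Path(data_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    absent = missing_datasets("data")
    print("Missing dataset folders:", absent if absent else "none")
```

An empty list means every named folder was found.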

Usage

Evaluate on all datasets

python eval.py --model_name LLaVA --eval_all

Evaluate on one dataset

python eval.py --model_name LLaVA --eval_textVQA
python eval.py --model_name LLaVA --eval_ocr --ocr_dataset_name "ct80 IIIT5K"

The results will be saved in the answer folder.
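
To script several runs, the commands above can be assembled programmatically. This is a sketch under assumptions: build_eval_command is a hypothetical helper, and it uses only the flags shown in the usage examples (--model_name, --eval_all, --eval_ocr, --ocr_dataset_name).

```python
import shlex
import sys


def build_eval_command(model_name, ocr_datasets=None, eval_all=False):
    """Build the argument list for one eval.py run.

    Only flags that appear in the usage examples are emitted. The OCR dataset
    names are joined into one space-separated argument, as in the example.
    """
    cmd = [sys.executable, "eval.py", "--model_name", model_name]
    if eval_all:
        cmd.append("--eval_all")
    elif ocr_datasets:
        cmd += ["--eval_ocr", "--ocr_dataset_name", " ".join(ocr_datasets)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is executed here.
    print(shlex.join(build_eval_command("LLaVA", ocr_datasets=["ct80", "IIIT5K"])))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the evaluation.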

To add a new model, write its inference function under the "models" folder and update the get_model function in eval.py. An example of the inference code is as follows:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

from ..process import pad_image, resize_image


class lavis:
    """Wrapper around a LAVIS model that answers a question about an image."""

    def __init__(self, model_name, model_type, device) -> None:
        # Load the model together with its image and text preprocessors.
        model, vis_processors, txt_processors = load_model_and_preprocess(
            name=model_name, model_type=model_type, is_eval=True, device=device
        )
        self.model_name = model_name
        self.model = model
        self.vis_processors = vis_processors
        self.txt_processors = txt_processors
        self.device = device

    def generate(self, image, question, name='resize'):
        # `image` is a path to an image file; `name` selects "pad" or "resize".
        # T5 variants use a "Short answer:" prompt; all others use "Answer:".
        if 'opt' in self.model_name:
            prompt = f'Question: {question} Answer:'
        elif 't5' in self.model_name:
            prompt = f'Question: {question} Short answer:'
        else:
            prompt = f'Question: {question} Answer:'
        image = Image.open(image).convert("RGB")
        # Bring the image to 224x224 by padding or resizing.
        if name == "pad":
            image = pad_image(image, (224, 224))
        elif name == "resize":
            image = resize_image(image, (224, 224))
        # Preprocess, add a batch dimension, and move to the target device.
        image = self.vis_processors["eval"](image).unsqueeze(0).to(self.device)
        prompt = self.txt_processors["eval"](prompt)
        answer = self.model.predict_answers(
            samples={"image": image, "text_input": prompt},
            inference_method="generate", max_len=48, min_len=1,
        )[0]
        return answer
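
The second step, updating get_model in eval.py, is not shown in this README. One possible shape is a name-to-builder lookup, sketched below. Everything here is an assumption for illustration: the actual get_model in eval.py may be written differently, and MODEL_BUILDERS, register, EchoModel and build_echo are hypothetical names. The stand-in model only mirrors the generate(image, question) interface of the example above.

```python
# Hypothetical sketch; the real get_model in eval.py may look different.
MODEL_BUILDERS = {}


def register(name):
    """Decorator that records a builder function under a model name."""
    def decorator(builder):
        MODEL_BUILDERS[name] = builder
        return builder
    return decorator


class EchoModel:
    """Stand-in model exposing the same generate() interface as the wrapper above."""

    def __init__(self, device):
        self.device = device

    def generate(self, image, question, name="resize"):
        # A real wrapper would load the image and run inference here.
        return f"[{self.device}] {question}"


@register("Echo")
def build_echo(device):
    return EchoModel(device)


def get_model(model_name, device="cpu"):
    """Look up the builder for model_name and construct the model."""
    try:
        builder = MODEL_BUILDERS[model_name]
    except KeyError:
        known = sorted(MODEL_BUILDERS)
        raise ValueError(f"Unknown model {model_name!r}; known models: {known}") from None
    return builder(device)
```

With this layout, adding a model means writing one wrapper class under "models" and one registered builder; the lookup itself stays unchanged.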

Related Projects
