pipeline

VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation

001 VLM model → feature extraction

cd extract

run the clustering_example_qwen notebook to extract and save the embeddings
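
For orientation, a minimal sketch of what this step produces, assuming a generic Hugging Face vision encoder (CLIP as a stand-in for the Qwen-VL vision tower), a hypothetical `images/` folder, and a `{image_path: embedding}` JSON layout; the actual logic lives in the notebook:

```python
# sketch_extract.py -- illustrative stand-in for the extraction notebook.
# Assumptions: CLIP instead of the Qwen-VL vision tower, an images/ folder,
# and a {image_path: embedding} JSON layout for the output file.
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # stand-in encoder
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

embeddings = {}
for img_path in sorted(Path("images/").glob("*.jpg")):  # hypothetical image folder
    image = Image.open(img_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)  # (1, 512) pooled embedding
    embeddings[str(img_path)] = feat.squeeze(0).tolist()

with open("embeddings_example.json", "w") as f:
    json.dump(embeddings, f)
```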

002 clustering grid search

cd cluster

# dbscan
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan

# gmm
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
  • Clustering results are saved to:
    • cluster/dbscan_results.json
    • cluster/gmm_final_results.json
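
For reference, a minimal sketch of the grid-search idea with scikit-learn (the parameter grids, scoring, and output layout are assumptions, not the exact behaviour of auto_cluster.py or gmm_extensive.py):

```python
# sketch_cluster.py -- illustrative DBSCAN/GMM sweep over an embeddings JSON.
# The hyperparameter grids and scoring are assumptions, not the repo's settings.
import json

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

with open("embeddings_example.json") as f:  # embeddings from step 001
    data = json.load(f)
paths = list(data.keys())
X = np.array([data[p] for p in paths])

# DBSCAN: sweep eps / min_samples, keep the configuration with the best silhouette.
best = None
for eps in (0.3, 0.5, 0.8):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        if len(set(labels)) - (-1 in labels) < 2:
            continue  # need at least two real clusters to score
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, eps, min_samples, labels)

# GMM: sweep the number of components, keep the model with the lowest BIC.
best_gmm = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(2, 9)),
    key=lambda g: g.bic(X),
)
gmm_labels = best_gmm.predict(X)

if best:
    with open("dbscan_results_example.json", "w") as f:
        json.dump({"paths": paths, "labels": best[3].tolist()}, f)
with open("gmm_results_example.json", "w") as f:
    json.dump({"paths": paths, "labels": gmm_labels.tolist()}, f)
```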

003 filter data

cd filter
bash run_filter.sh
  • EMBEDDINGS_PATH: path to the embeddings file generated in step 001
  • CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
  • OUTPUT_PATH: path where the retained files are saved
  • selection_ratio: proportion of the data passed to the filter
  • center_ratio: proportion of center points to keep
  • border_ratio: proportion of boundary points to keep (a sketch of the center/border selection follows this list)
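
A minimal sketch of the center/border selection idea under these parameters (variable names, file names, and the distance-to-centroid criterion are assumptions; the actual logic lives in the filter scripts):

```python
# sketch_filter.py -- illustrative center/border selection per cluster.
# The ratios and the distance-to-centroid criterion are assumptions about the filter.
import json

import numpy as np

selection_ratio, center_ratio, border_ratio = 0.5, 0.7, 0.3  # example values, assumed to sum to 1 for the two ratios

emb = json.load(open("embeddings_example.json"))            # EMBEDDINGS_PATH
clusters = json.load(open("dbscan_results_example.json"))   # CLUSTERING_RESULTS_PATH
paths = clusters["paths"]
labels = np.array(clusters["labels"])
X = np.array([emb[p] for p in paths])

kept = []
for label in set(labels) - {-1}:                 # skip DBSCAN noise
    idx = np.where(labels == label)[0]
    centroid = X[idx].mean(axis=0)
    dist = np.linalg.norm(X[idx] - centroid, axis=1)
    order = idx[np.argsort(dist)]                # nearest-to-centroid first
    n_keep = max(1, int(selection_ratio * len(idx)))
    n_center = int(round(n_keep * center_ratio))
    n_border = int(round(n_keep * border_ratio))
    kept.extend(order[:n_center])                # densest "center" points
    if n_border:
        kept.extend(order[-n_border:])           # farthest "border" points

with open("filtered_paths_example.json", "w") as f:   # OUTPUT_PATH
    json.dump([paths[i] for i in kept], f)
```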

004 create label data from the full data and the filtered data

cd filter

python3 create_label_data.py
  • dbscan_results_path: the OUTPUT_PATH file produced in step 003
  • label_data_path: path to the ground-truth labels for the filtered subset; same format as the full dataset, used for fine-tuning the VLM model
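
A minimal sketch of the idea behind create_label_data.py, assuming the label file is a list of records with an "image" field (the file names here are hypothetical):

```python
# sketch_create_label_data.py -- illustrative version of the step-004 script.
# Assumptions: the ground truth is a list of records with an "image" field that
# matches the paths kept by the filter; file names are hypothetical.
import json

filtered_paths = set(json.load(open("filtered_paths_example.json")))  # OUTPUT_PATH from step 003
full_labels = json.load(open("full_label_data.json"))                 # hypothetical full ground truth

# Keep only the ground-truth records whose image survived the filter.
label_data = [rec for rec in full_labels if rec["image"] in filtered_paths]

with open("label_data_filtered.json", "w") as f:
    json.dump(label_data, f, ensure_ascii=False, indent=2)
```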

005 visualize and check data

cd check_filter
bash run.sh
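
One possible way such a visual check could look, assuming a t-SNE projection colored by cluster with the kept points highlighted (this is an assumption about what check_filter/run.sh does, not its actual content):

```python
# sketch_check_filter.py -- illustrative visual check: project embeddings to 2-D
# with t-SNE, color by cluster, and mark the points kept by the filter.
import json

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

emb = json.load(open("embeddings_example.json"))
clusters = json.load(open("dbscan_results_example.json"))
kept = set(json.load(open("filtered_paths_example.json")))

paths = clusters["paths"]
X = np.array([emb[p] for p in paths])
labels = np.array(clusters["labels"])
is_kept = np.array([p in kept for p in paths])

xy = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=8, alpha=0.4, label="all")
plt.scatter(xy[is_kept, 0], xy[is_kept, 1], c=labels[is_kept], s=16,
            edgecolors="black", linewidths=0.5, label="kept by filter")
plt.legend()
plt.savefig("filter_check.png", dpi=150)
```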
Description
This repo clusters document embeddings using the vision encoders of pretrained VLM models (LayoutLM, QwenVL, etc.).