pipeline

VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation

001 VLM model → feature extraction

cd extract

run the clustering_example_qwen notebook to extract and save the embeddings
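
For orientation, a minimal sketch of what this step produces, assuming a generic Hugging Face vision encoder (CLIP as a stand-in for the Qwen-VL vision tower), a hypothetical `images/` folder, and a `{image_path: embedding}` JSON layout; the actual logic lives in the notebook:

```python
# sketch_extract.py -- illustrative stand-in for the extraction notebook.
# Assumptions: CLIP instead of the Qwen-VL vision tower, an images/ folder,
# and a {image_path: embedding} JSON layout for the output file.
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # stand-in encoder
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

embeddings = {}
for img_path in sorted(Path("images/").glob("*.jpg")):  # hypothetical image folder
    image = Image.open(img_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)  # (1, 512) pooled embedding
    embeddings[str(img_path)] = feat.squeeze(0).tolist()

with open("embeddings_example.json", "w") as f:
    json.dump(embeddings, f)
```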

002 clustering grid search

cd cluster

# dbscan
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan

# gmm
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
  • Clustering results are saved to:
    • cluster/dbscan_results.json
    • cluster/gmm_final_results.json
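
For reference, a minimal sketch of the grid-search idea with scikit-learn (the parameter grids, scoring, and output layout are assumptions, not the exact behaviour of auto_cluster.py or gmm_extensive.py):

```python
# sketch_cluster.py -- illustrative DBSCAN/GMM sweep over an embeddings JSON.
# The hyperparameter grids and scoring are assumptions, not the repo's settings.
import json

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

with open("embeddings_example.json") as f:  # embeddings from step 001
    data = json.load(f)
paths = list(data.keys())
X = np.array([data[p] for p in paths])

# DBSCAN: sweep eps / min_samples, keep the configuration with the best silhouette.
best = None
for eps in (0.3, 0.5, 0.8):
    for min_samples in (5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        if len(set(labels)) - (-1 in labels) < 2:
            continue  # need at least two real clusters to score
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, eps, min_samples, labels)

# GMM: sweep the number of components, keep the model with the lowest BIC.
best_gmm = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(2, 9)),
    key=lambda g: g.bic(X),
)
gmm_labels = best_gmm.predict(X)

if best:
    with open("dbscan_results_example.json", "w") as f:
        json.dump({"paths": paths, "labels": best[3].tolist()}, f)
with open("gmm_results_example.json", "w") as f:
    json.dump({"paths": paths, "labels": gmm_labels.tolist()}, f)
```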

003 filter data

cd filter
bash run_filter.sh
  • EMBEDDINGS_PATH: path to the embeddings file generated in step 001
  • CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
  • OUTPUT_PATH: path where the retained files are saved
  • selection_ratio: proportion of the data passed to the filter
  • center_ratio: proportion of center points to keep
  • border_ratio: proportion of boundary points to keep (a sketch of the center/border selection follows this list)
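
A minimal sketch of the center/border selection idea under these parameters (variable names, file names, and the distance-to-centroid criterion are assumptions; the actual logic lives in the filter scripts):

```python
# sketch_filter.py -- illustrative center/border selection per cluster.
# The ratios and the distance-to-centroid criterion are assumptions about the filter.
import json

import numpy as np

selection_ratio, center_ratio, border_ratio = 0.5, 0.7, 0.3  # example values, assumed to sum to 1 for the two ratios

emb = json.load(open("embeddings_example.json"))            # EMBEDDINGS_PATH
clusters = json.load(open("dbscan_results_example.json"))   # CLUSTERING_RESULTS_PATH
paths = clusters["paths"]
labels = np.array(clusters["labels"])
X = np.array([emb[p] for p in paths])

kept = []
for label in set(labels) - {-1}:                 # skip DBSCAN noise
    idx = np.where(labels == label)[0]
    centroid = X[idx].mean(axis=0)
    dist = np.linalg.norm(X[idx] - centroid, axis=1)
    order = idx[np.argsort(dist)]                # nearest-to-centroid first
    n_keep = max(1, int(selection_ratio * len(idx)))
    n_center = int(round(n_keep * center_ratio))
    n_border = int(round(n_keep * border_ratio))
    kept.extend(order[:n_center])                # densest "center" points
    if n_border:
        kept.extend(order[-n_border:])           # farthest "border" points

with open("filtered_paths_example.json", "w") as f:   # OUTPUT_PATH
    json.dump([paths[i] for i in kept], f)
```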

004 create label data from the full data and the filtered data

cd filter

python3 create_label_data.py
  • dbscan_results_path: the OUTPUT_PATH file produced in step 003
  • label_data_path: path to the ground-truth labels for the filtered subset; same format as the full dataset, used for fine-tuning the VLM model
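
A minimal sketch of the idea behind create_label_data.py, assuming the label file is a list of records with an "image" field (the file names here are hypothetical):

```python
# sketch_create_label_data.py -- illustrative version of the step-004 script.
# Assumptions: the ground truth is a list of records with an "image" field that
# matches the paths kept by the filter; file names are hypothetical.
import json

filtered_paths = set(json.load(open("filtered_paths_example.json")))  # OUTPUT_PATH from step 003
full_labels = json.load(open("full_label_data.json"))                 # hypothetical full ground truth

# Keep only the ground-truth records whose image survived the filter.
label_data = [rec for rec in full_labels if rec["image"] in filtered_paths]

with open("label_data_filtered.json", "w") as f:
    json.dump(label_data, f, ensure_ascii=False, indent=2)
```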

005 visualize and check data

cd check_filter
bash run.sh
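
One possible way such a visual check could look, assuming a t-SNE projection colored by cluster with the kept points highlighted (this is an assumption about what check_filter/run.sh does, not its actual content):

```python
# sketch_check_filter.py -- illustrative visual check: project embeddings to 2-D
# with t-SNE, color by cluster, and mark the points kept by the filter.
import json

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

emb = json.load(open("embeddings_example.json"))
clusters = json.load(open("dbscan_results_example.json"))
kept = set(json.load(open("filtered_paths_example.json")))

paths = clusters["paths"]
X = np.array([emb[p] for p in paths])
labels = np.array(clusters["labels"])
is_kept = np.array([p in kept for p in paths])

xy = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=8, alpha=0.4, label="all")
plt.scatter(xy[is_kept, 0], xy[is_kept, 1], c=labels[is_kept], s=16,
            edgecolors="black", linewidths=0.5, label="kept by filter")
plt.legend()
plt.savefig("filter_check.png", dpi=150)
```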
Description
This repo clusters document embeddings using the vision encoders of pretrained VLM models (LayoutLM, QwenVL, etc.).