# pipeline

VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation

# 001 VLM model → feature extraction

```bash
cd extract
# run the clustering_example_qwen notebook
```

# 002 clustering grid search

```bash
cd cluster

# dbscan
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan

# gmm
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
```

- Clustering results are saved to:
  - cluster/dbscan_results.json
  - cluster/gmm_final_results.json

# 003 filter data

```bash
cd filter
bash run_filter.sh
```

- EMBEDDINGS_PATH: path to the embedding file generated in step 001
- CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
- OUTPUT_PATH: path where the retained files are saved
- selection_ratio: proportion of the data passed into the filter
- center_ratio: proportion of center points to take
- border_ratio: proportion of boundary points to take

# 004 create label data from the full data and the filtered data

```bash
cd filter
python3 create_label_data.py
```

- dbscan_results_path: the OUTPUT_PATH produced in step 003
- label_data_path: ground truth for the filtered data, in the same format as the full dataset; used for fine-tuning the VLM model

# visualize and check data

```bash
cd check_filter
bash run.sh
```
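
For reference, here is a minimal sketch of how the step 003 parameters (`selection_ratio`, `center_ratio`, `border_ratio`) could drive a center/border selection. It is not the code behind `run_filter.sh`. The function name `select_center_and_border` is hypothetical, and the parameter semantics are an assumption: `selection_ratio` is treated as the fraction of each cluster to keep, and `center_ratio` / `border_ratio` as the shares of that budget taken from the points nearest to and farthest from the cluster centroid. Points labeled `-1` (DBSCAN noise) are skipped.

```python
import numpy as np


def select_center_and_border(embeddings, labels, selection_ratio=0.5,
                             center_ratio=0.5, border_ratio=0.5):
    """Return sorted indices of the points kept from each cluster.

    Assumed semantics (not confirmed by the repo):
      - selection_ratio: fraction of each cluster to keep
      - center_ratio / border_ratio: shares of that budget taken from the
        points nearest to / farthest from the cluster centroid
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    kept = []
    for cluster_id in np.unique(labels):
        if cluster_id == -1:  # DBSCAN marks noise points with -1; skip them
            continue
        idx = np.flatnonzero(labels == cluster_id)
        centroid = embeddings[idx].mean(axis=0)
        dist = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        order = idx[np.argsort(dist, kind="stable")]  # nearest to centroid first
        budget = max(1, int(round(selection_ratio * len(idx))))
        n_center = min(len(order), int(round(center_ratio * budget)))
        n_border = min(len(order) - n_center, int(round(border_ratio * budget)))
        kept.extend(order[:n_center].tolist())
        if n_border > 0:
            kept.extend(order[-n_border:].tolist())
    return sorted(kept)


if __name__ == "__main__":
    # Toy data: one 1-D cluster of five points plus one noise point.
    toy_embeddings = [[0.0], [1.0], [2.0], [3.0], [10.0], [100.0]]
    toy_labels = [0, 0, 0, 0, 0, -1]
    print(select_center_and_border(toy_embeddings, toy_labels, 0.8, 0.5, 0.5))
```

Keeping both center and border points is one way to retain typical samples as well as hard, atypical ones; the actual weighting used by the filter script may differ.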