embedding-clustering/README.md
2025-09-04 14:39:02 +00:00
# pipeline
VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation
# 001 VLM model → feature extraction
```bash
cd extract
# open the clustering_example_qwen notebook (e.g. in Jupyter) and run all cells
```
# 002 clustering grid search
```bash
cd cluster
# dbscan
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan
# gmm
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
```
- Clustering results are saved to:
  - cluster/dbscan_results.json
  - cluster/gmm_final_results.json
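
The idea behind a clustering grid search can be sketched as follows. This is a minimal, dependency-free illustration and not the code in `auto_cluster.py`, whose search space and scoring are not described here. The `dbscan` and `grid_search` helpers, the parameter grids, and the scoring rule (prefer the setting with the least noise among those that yield at least two clusters) are all assumptions made for the example. In practice one would more likely use scikit-learn's `DBSCAN` together with a metric such as the silhouette score.

```python
import math
from itertools import product


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual min_samples convention.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


def grid_search(points, eps_values, min_samples_values):
    """Try every (eps, min_samples) pair; return (score, eps, min_samples, labels) or None."""
    best = None
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = dbscan(points, eps, min_samples)
        n_clusters = len(set(labels) - {-1})
        if n_clusters < 2:
            continue  # skip degenerate results (all noise or a single cluster)
        score = 1.0 - labels.count(-1) / len(labels)  # placeholder score (assumption)
        if best is None or score > best[0]:
            best = (score, eps, min_samples, labels)
    return best


# Toy 2-D "embeddings": two tight groups plus one outlier.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]]
best = grid_search(points, eps_values=[0.5, 1.5, 100.0], min_samples_values=[2, 3])
print(best)
```

On ties the first setting tried wins, so the order of the grids matters with this placeholder score.
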
# 003 filter data
```bash
cd filter
bash run_filter.sh
```
- EMBEDDINGS_PATH: path to the embeddings file generated in step 001
- CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
- OUTPUT_PATH: path where the retained files are saved
- selection_ratio: proportion of the data passed into the filter
- center_ratio: proportion of cluster-center points to keep
- border_ratio: proportion of cluster-boundary points to keep
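
One plausible reading of `center_ratio` and `border_ratio` is sketched below: within each cluster, rank points by distance to the cluster centroid, then keep the closest fraction (center) and the farthest fraction (border). This is an assumption for illustration only. The actual logic in `run_filter.sh` may differ, `select_center_and_border` is a hypothetical helper, and `selection_ratio` is not modeled here.

```python
import math
from collections import defaultdict


def select_center_and_border(points, labels, center_ratio, border_ratio):
    """Return sorted indices of the points kept from each cluster.

    Points are ranked by Euclidean distance to their cluster centroid; the
    closest `center_ratio` fraction and the farthest `border_ratio` fraction
    are kept. Label -1 (DBSCAN noise) is skipped.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            clusters[label].append(idx)

    kept = set()
    for members in clusters.values():
        dim = len(points[members[0]])
        centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
        ranked = sorted(members, key=lambda i: math.dist(points[i], centroid))
        n_center = math.ceil(center_ratio * len(ranked))
        n_border = math.ceil(border_ratio * len(ranked))
        kept.update(ranked[:n_center])
        if n_border:  # guard: ranked[-0:] would select the whole cluster
            kept.update(ranked[-n_border:])
    return sorted(kept)


# Toy 1-D example: one cluster of four points plus one noise point.
points = [[0.0], [1.0], [2.0], [9.0], [50.0]]
labels = [0, 0, 0, 0, -1]
print(select_center_and_border(points, labels, center_ratio=0.25, border_ratio=0.25))
```

Keeping both ends of the distance ranking retains typical samples as well as hard, boundary-like ones, which is the usual motivation for this kind of filter.
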
# 004 create label data from the full and filtered data
```bash
cd filter
python3 create_label_data.py
```
- dbscan_results_path: the OUTPUT_PATH produced in step 003
- label_data_path: ground-truth labels restricted to the filtered samples, in the same format as the full dataset; used for fine-tuning the VLM model
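
Conceptually, this step restricts the full ground-truth set to the samples that survived filtering. The sketch below illustrates that idea under assumed data shapes: the record layout, the `file_name` key, and the `build_label_subset` helper are hypothetical and are not taken from `create_label_data.py`.

```python
def build_label_subset(full_records, retained_names, key="file_name"):
    """Keep only records whose `key` value is in `retained_names`, preserving order."""
    retained = set(retained_names)
    return [rec for rec in full_records if rec.get(key) in retained]


# Hypothetical full dataset and list of files retained by the filter step.
full = [
    {"file_name": "a.png", "label": "invoice A"},
    {"file_name": "b.png", "label": "invoice B"},
    {"file_name": "c.png", "label": "invoice C"},
]
retained_files = ["c.png", "a.png"]

subset = build_label_subset(full, retained_files)
print(subset)
```

Because the subset keeps the original record structure, it can be fed to the same fine-tuning code as the full dataset.
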
# visualize and check data
```bash
cd check_filter
bash run.sh
```