embedding-clustering/README.md

# pipeline

VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation

# 001 VLM model → feature extraction
```bash
cd extract

# open and run the clustering_example_qwen notebook
```

# 002 clustering grid search
```bash
cd cluster

# dbscan
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan

# gmm
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
```

- Clustering results are saved to:
    - cluster/dbscan_results.json
    - cluster/gmm_final_results.json
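
To illustrate what a DBSCAN grid search does, here is a minimal pure-Python sketch. It is not the code in `auto_cluster.py`: the function names, the `max_noise` threshold, and the "most clusters under a noise limit" scoring rule are all assumptions made for the example, and a real run would more likely use a library implementation and a proper quality metric.

```python
import math
from itertools import product


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Plain DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep growing the cluster
    return labels


def grid_search(points, eps_values, min_samples_values, max_noise=0.2):
    """Try every (eps, min_samples) pair and keep the run with the most clusters
    among those whose noise fraction is at most max_noise (hypothetical rule)."""
    best = None
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = dbscan(points, eps, min_samples)
        n_clusters = len({label for label in labels if label != -1})
        noise = labels.count(-1) / len(labels)
        if noise > max_noise:
            continue
        if best is None or n_clusters > best["n_clusters"]:
            best = {"eps": eps, "min_samples": min_samples,
                    "n_clusters": n_clusters, "noise": noise}
    return best


# Toy 2-D "embeddings": two tight groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
best = grid_search(points, eps_values=[0.2, 1.0, 8.0], min_samples_values=[2, 3])
print(best)
```

On this toy data the small `eps` values separate the two groups and leave the outlier as noise, while `eps=8.0` merges everything into a single cluster, so the search keeps a small-`eps` configuration.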


# 003 filter data

```bash
cd filter
bash run_filter.sh
```

- EMBEDDINGS_PATH: path to the embeddings file generated in step 001
- CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
- OUTPUT_PATH: path where the retained files are saved

- selection_ratio: proportion of the data passed to the filter
- center_ratio: proportion of cluster-center points to keep
- border_ratio: proportion of cluster-boundary points to keep
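
One way to read the three ratios above is sketched below. This is an assumption about their meaning, not the logic of `run_filter.sh`: here `selection_ratio` sets a per-cluster budget, and `center_ratio` / `border_ratio` split that budget between the points closest to and farthest from the cluster centroid. The function name `select_center_and_border` is hypothetical.

```python
import math


def select_center_and_border(embeddings, labels, selection_ratio=0.5,
                             center_ratio=0.5, border_ratio=0.5):
    """Return the sorted indices kept from each cluster (noise label -1 is skipped).

    Assumed interpretation: budget = cluster size * selection_ratio, of which
    center_ratio goes to the most central points and border_ratio to the
    most peripheral ones. Keep center_ratio + border_ratio <= 1 to stay in budget.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)

    kept = set()
    for members in clusters.values():
        dim = len(embeddings[members[0]])
        centroid = [sum(embeddings[i][d] for i in members) / len(members)
                    for d in range(dim)]
        # Closest to the centroid first, farthest last.
        ranked = sorted(members, key=lambda i: math.dist(embeddings[i], centroid))
        budget = max(1, int(len(members) * selection_ratio))
        n_center = int(budget * center_ratio)
        n_border = min(int(budget * border_ratio), len(ranked) - n_center)
        kept.update(ranked[:n_center])
        if n_border > 0:  # guard: ranked[-0:] would select the whole list
            kept.update(ranked[-n_border:])
    return sorted(kept)


# Toy 1-D embeddings: one cluster of eight points plus one noise point.
embeddings = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [20.0], [100.0]]
labels = [0, 0, 0, 0, 0, 0, 0, 0, -1]
selected = select_center_and_border(embeddings, labels)
print(selected)  # [0, 5, 6, 7]
```

With the defaults, the cluster of eight points gets a budget of four: the two points nearest the centroid (indices 5 and 6) and the two farthest from it (indices 0 and 7).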


# 004 create labeled data from the full data and the filtered data

```bash
cd filter

python3 create_label_data.py
```

- dbscan_results_path: the OUTPUT_PATH produced in step 003
- label_data_path: ground truth restricted to the filtered data; it has the same format as the full dataset and is used for fine-tuning the VLM model
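
The idea of this step can be sketched as follows. The data layout is an assumption: the example supposes the full dataset is a list of records with an `image` field and the filter output is a list of retained file paths, which may not match what `create_label_data.py` actually reads. `build_label_data` is a hypothetical helper.

```python
import os


def build_label_data(full_data, retained_paths, key="image"):
    """Keep only the records of full_data whose file was retained by the filter.

    Files are matched by base name, so the two sides may live in
    different directories (an assumption for this sketch).
    """
    retained = {os.path.basename(p) for p in retained_paths}
    return [rec for rec in full_data if os.path.basename(rec[key]) in retained]


# Hypothetical full dataset and filter output.
full_data = [
    {"image": "data/a.png", "label": "x"},
    {"image": "data/b.png", "label": "y"},
    {"image": "data/c.png", "label": "z"},
]
retained_paths = ["out/a.png", "out/c.png"]

label_data = build_label_data(full_data, retained_paths)
print([rec["image"] for rec in label_data])  # ['data/a.png', 'data/c.png']
```

Because the records are copied through unchanged, the result keeps the same format as the full dataset.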


# visualize and check data

```bash
cd check_filter
bash run.sh
```

update source code and pipeline 2025-09-04 14:39:02 +00:00