# pipeline

VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation

# 001 VLM model → feature extraction

```bash
cd extract

# Open and run the clustering_example_qwen notebook
```

# 002 clustering grid search

```bash
cd cluster

# dbscan
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan

# gmm
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
```

- Clustering results are saved to:
  - cluster/dbscan_results.json
  - cluster/gmm_final_results.json

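To make the idea of a DBSCAN grid search concrete, here is a minimal, dependency-free sketch. It is not the code in `auto_cluster.py`: the function names, the parameter grid, and the reported metrics (cluster count and noise ratio) are assumptions chosen for illustration, and the real script may search and score differently.

```python
import math
from itertools import product


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


def grid_search(points, eps_values, min_samples_values):
    """Run DBSCAN for every (eps, min_samples) pair and summarize each run."""
    results = []
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = dbscan(points, eps, min_samples)
        results.append({
            "eps": eps,
            "min_samples": min_samples,
            "n_clusters": len(set(labels) - {-1}),
            "noise_ratio": labels.count(-1) / len(labels),
        })
    return results
```

A typical use would be to call `grid_search(embeddings, [0.3, 0.5, 0.8], [3, 5, 10])` (hypothetical values) and then pick the setting with an acceptable cluster count and noise ratio.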
# 003 filter data

```bash
cd filter
bash run_filter.sh
```

- EMBEDDINGS_PATH: path to the embeddings file generated in step 001
- CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
- OUTPUT_PATH: path where the retained files are saved

- selection_ratio: proportion of the data passed to the filter
- center_ratio: proportion of center points to keep
- border_ratio: proportion of boundary points to keep

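One plausible reading of `center_ratio` and `border_ratio` is sketched below: within each cluster, points are ranked by distance to the cluster centroid, and the closest and farthest shares are kept. This is an assumption about the semantics, not the logic of `run_filter.sh`; the function name is hypothetical, and `selection_ratio` is left out because its exact behavior is not described here.

```python
import math


def select_center_and_border(embeddings, labels, center_ratio, border_ratio):
    """Return the sorted indices of points kept from every cluster.

    Points labeled -1 (noise) are skipped. Within each cluster, members are
    ranked by distance to the centroid; the closest `center_ratio` share and
    the farthest `border_ratio` share are kept.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)

    kept = set()
    for members in clusters.values():
        dim = len(embeddings[members[0]])
        centroid = [
            sum(embeddings[i][d] for i in members) / len(members)
            for d in range(dim)
        ]
        ranked = sorted(members, key=lambda i: math.dist(embeddings[i], centroid))
        n_center = math.ceil(center_ratio * len(ranked))
        n_border = math.ceil(border_ratio * len(ranked))
        kept.update(ranked[:n_center])
        if n_border:  # guard: ranked[-0:] would select the whole cluster
            kept.update(ranked[-n_border:])
    return sorted(kept)
```

Keeping both center and border points is a common way to retain typical examples as well as harder, less typical ones while shrinking the dataset.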
# 004 create label data from the full data and the filtered data

```bash
cd filter

python3 create_label_data.py
```

- dbscan_results_path: the OUTPUT_PATH produced in step 003
- label_data_path: ground-truth labels restricted to the filtered data; same format as the full dataset, used for fine-tuning the VLM model

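The step above amounts to restricting the full ground truth to the samples that survived filtering. A minimal sketch follows, assuming the full dataset is a list of records and each record identifies its file through an `image` field; both the function name and the field name are hypothetical, and `create_label_data.py` may use a different layout.

```python
def build_label_data(full_records, retained_names, key="image"):
    """Keep only the records whose `key` value is in `retained_names`.

    Record order and record contents are left untouched, so the output has
    the same format as the full dataset.
    """
    retained = set(retained_names)
    return [record for record in full_records if record.get(key) in retained]
```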
# visualize and check data

```bash
cd check_filter
bash run.sh
```