update source code and pipeline
61
README.md
@@ -0,0 +1,61 @@
# pipeline

VLM model → feature extraction → clustering → data filtering → fine-tuning → LoRA export → vLLM serve → inference → evaluation

# 001 VLM model → feature extraction

```bash
cd extract

# then open and run the clustering_example_qwen notebook
```

# 002 clustering grid search

```bash
cd cluster

# DBSCAN grid search
python auto_cluster.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json --method dbscan

# GMM grid search
python gmm_extensive.py --embeddings_path /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json
```

- Clustering results are saved to:
  - cluster/dbscan_results.json
  - cluster/gmm_final_results.json

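As a rough illustration of what a DBSCAN grid search does, the sketch below runs a small pure-Python DBSCAN over a grid of `eps` and `min_samples` values and keeps the setting with the lowest noise ratio among those that find at least two clusters. It is not the code in `auto_cluster.py`, whose internals are not shown here; the toy points, the grid values, the function names, and the selection criterion are all assumptions made for illustration.

```python
import math
from itertools import product

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def grid_search(points, eps_values, min_samples_values):
    """Return (noise_ratio, eps, min_samples, n_clusters) for the best setting.

    Assumed criterion: lowest noise ratio among settings with >= 2 clusters.
    """
    best = None
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = dbscan(points, eps, min_samples)
        n_clusters = len(set(labels) - {NOISE})
        if n_clusters < 2:
            continue
        noise_ratio = labels.count(NOISE) / len(labels)
        candidate = (noise_ratio, eps, min_samples, n_clusters)
        if best is None or candidate < best:
            best = candidate
    return best


# Toy 2-D "embeddings": two tight groups and one outlier.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
best = grid_search(points, eps_values=[0.5, 1.5, 20.0], min_samples_values=[2, 3])
print(best)
```

A real run would load the embeddings from the JSON file passed via `--embeddings_path` and would likely use a library implementation and a clustering quality score rather than this toy criterion.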
# 003 filter data

```bash
cd filter
bash run_filter.sh
```

- EMBEDDINGS_PATH: path to the embedding file generated in step 001
- CLUSTERING_RESULTS_PATH: path to the clustering results file generated in step 002
- OUTPUT_PATH: path where the retained files are saved

- selection_ratio: proportion of the data passed to the filter
- center_ratio: proportion of points to take near cluster centers
- border_ratio: proportion of points to take near cluster boundaries

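One possible reading of the three ratios above is sketched below. This is an interpretation, not the logic behind `run_filter.sh`: it assumes that `selection_ratio` sets a per-cluster budget, that `center_ratio` of that budget is taken from the points closest to the cluster centroid, and that `border_ratio` of it is taken from the points farthest away. The function name `select_points` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def select_points(embeddings, labels, selection_ratio, center_ratio, border_ratio):
    """Return the sorted indices of retained points (assumed semantics).

    Per cluster: budget = round(selection_ratio * cluster size); the closest
    round(center_ratio * budget) points and the farthest
    round(border_ratio * budget) points are kept. Noise (label -1) is dropped.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            clusters[label].append(idx)

    kept = set()
    for members in clusters.values():
        center = centroid([embeddings[i] for i in members])
        # Rank members from closest to farthest from the centroid.
        ranked = sorted(members, key=lambda i: math.dist(embeddings[i], center))
        budget = max(1, round(selection_ratio * len(members)))
        n_center = round(center_ratio * budget)
        n_border = min(round(border_ratio * budget), len(ranked) - n_center)
        kept.update(ranked[:n_center])
        if n_border > 0:  # guard: ranked[-0:] would select everything
            kept.update(ranked[-n_border:])
    return sorted(kept)


# Toy data: ten points on a line in one cluster, plus one noise point.
embeddings = [(float(x), 0.0) for x in range(10)] + [(100.0, 0.0)]
labels = [0] * 10 + [-1]
kept = select_points(
    embeddings, labels, selection_ratio=0.4, center_ratio=0.5, border_ratio=0.5
)
print(kept)
```

Keeping both center and border points is a common way to retain typical examples together with harder, atypical ones; the exact balance used by this repository is set in `run_filter.sh`.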
# 004 create label data from the full data and the filtered data

```bash
cd filter

python3 create_label_data.py
```

- dbscan_results_path: the OUTPUT_PATH produced in step 003
- label_data_path: ground-truth labels restricted to the filtered data; same format as the full dataset, used for fine-tuning the VLM model

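Conceptually, this step keeps only the ground-truth records whose files survived the filter. The sketch below shows that idea under stated assumptions: the full dataset is taken to be a list of records with an `image` key, and the filter output a list of retained file names. The real file formats and the internals of `create_label_data.py` are not shown here, so the key name, the function `build_label_data`, and the sample records are hypothetical.

```python
import json


def build_label_data(full_records, retained_files, key="image"):
    """Keep the records whose `key` value is in the retained set.

    Records are returned unchanged, so the output has the same format
    as the full dataset.
    """
    retained = set(retained_files)
    return [record for record in full_records if record[key] in retained]


# Toy stand-ins for the full dataset and the filter output (made up).
full_records = [
    {"image": "invoice_001.png", "label": {"total": "55.00"}},
    {"image": "invoice_002.png", "label": {"total": "60.00"}},
    {"image": "invoice_003.png", "label": {"total": "48.50"}},
]
retained_files = ["invoice_003.png", "invoice_001.png"]

label_data = build_label_data(full_records, retained_files)
print(json.dumps(label_data, indent=2))
```

The order of the full dataset is preserved, which makes the result easy to diff against the original file.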
# visualize and check data

```bash
cd check_filter
bash run.sh
```