update source code and pipeline
140
cluster/log_gmm_extensive_update.txt
Normal file
@@ -0,0 +1,140 @@
nohup: ignoring input
Loading embeddings from /home/nguyendc/sonnh/embedding-clustering/extract/embeddings_factures_osteopathie_1k_qwen.json...
Loaded 2800 samples with embedding dimension 2048

======================================================================
RUNNING GAUSSIAN MIXTURE MODEL CLUSTERING WITH OPTIMIZED GRID SEARCH
======================================================================
Optimized parameter combinations:
- n_components: 21 values [2, 3, 4, 5, 6, 8, 10, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50]
- covariance_types: 2 options ['tied', 'spherical']
- reg_covar: 3 values [1e-05, 0.0001, 0.001]
- n_init: 2 values [1, 5]
- init_params: 2 options ['kmeans', 'k-means++']
- max_iter: 2 values [100, 300]
Total combinations: 1008 (optimized for speed)
Estimated runtime: 8.4 minutes
This should be much faster...
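The total of 1008 is the Cartesian product of the six lists printed above (21 x 2 x 3 x 2 x 2 x 2). A minimal sketch, assuming the grid is enumerated with `itertools.product`; the `param_grid` name and dict layout are illustrative and not taken from `gmm_extensive.py`:

```python
from itertools import product

# Values copied from the log above; the dict layout is an assumption.
param_grid = {
    "n_components": [2, 3, 4, 5, 6, 8, 10, 11, 14, 17, 20, 23,
                     26, 29, 32, 35, 38, 41, 44, 47, 50],
    "covariance_type": ["tied", "spherical"],
    "reg_covar": [1e-05, 0.0001, 0.001],
    "n_init": [1, 5],
    "init_params": ["kmeans", "k-means++"],
    "max_iter": [100, 300],
}

# One dict per combination, in the order product() yields them.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total = len(combinations)

# Average time per fit implied by the logged estimate of 8.4 minutes.
seconds_per_fit = 8.4 * 60 / total
```

Because each result line in the log prints only three of the six parameters, runs that differ only in the remaining ones can show up as identical lines.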

n_components=2, cov=tied, init=kmeans: BIC=6521812.14, AIC=-5960170.38, silhouette=0.3692
n_components=3, cov=tied, init=kmeans: BIC=6511443.85, AIC=-5982704.34, silhouette=0.3756
n_components=3, cov=tied, init=kmeans: BIC=6511443.85, AIC=-5982704.34, silhouette=0.3756
n_components=3, cov=tied, init=kmeans: BIC=6511443.85, AIC=-5982704.34, silhouette=0.3756
n_components=3, cov=tied, init=kmeans: BIC=6511443.85, AIC=-5982704.34, silhouette=0.3756
n_components=4, cov=tied, init=kmeans: BIC=6514783.32, AIC=-5991530.55, silhouette=0.3110
Progress: 50/1008 (5.0%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
n_components=4, cov=tied, init=kmeans: BIC=6514783.32, AIC=-5991530.55, silhouette=0.3110
n_components=4, cov=tied, init=kmeans: BIC=6514783.32, AIC=-5991530.55, silhouette=0.3110
n_components=4, cov=tied, init=kmeans: BIC=6514783.32, AIC=-5991530.55, silhouette=0.3110
n_components=5, cov=tied, init=kmeans: BIC=6520503.08, AIC=-5997976.48, silhouette=0.3163
n_components=5, cov=tied, init=kmeans: BIC=6520503.08, AIC=-5997976.48, silhouette=0.3163
n_components=5, cov=tied, init=kmeans: BIC=6520503.08, AIC=-5997976.48, silhouette=0.3163
n_components=5, cov=tied, init=kmeans: BIC=6520503.08, AIC=-5997976.48, silhouette=0.3163
Progress: 100/1008 (9.9%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 150/1008 (14.9%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 200/1008 (19.8%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 250/1008 (24.8%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 300/1008 (29.8%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 350/1008 (34.7%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 400/1008 (39.7%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 450/1008 (44.6%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 500/1008 (49.6%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 550/1008 (54.6%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 600/1008 (59.5%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 650/1008 (64.5%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 700/1008 (69.4%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 750/1008 (74.4%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 800/1008 (79.4%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 850/1008 (84.3%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 900/1008 (89.3%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 950/1008 (94.2%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 1000/1008 (99.2%) - Best scores so far: BIC=6511443.85, Silhouette=0.376
Progress: 1008/1008 (100.0%) - Best scores so far: BIC=6511443.85, Silhouette=0.376

======================================================================
GAUSSIAN MIXTURE MODEL GRID SEARCH ANALYSIS
======================================================================
Total parameter combinations tested: 1008
Combinations with valid clustering: 1008

Model Selection Metrics:
Best BIC score: 6511443.85
Best AIC score: -6295231.48
Best Log-Likelihood: 1910.09

Clustering Quality Metrics:
Best silhouette score: 0.3757
Mean silhouette score: 0.0287
Best Calinski-Harabasz score: 1331.69
Best Davies-Bouldin score: 0.6762

Top 5 results by BIC (lower is better):
n_comp=3, cov=tied: BIC=6511443.85, AIC=-5982704.34
n_comp=3, cov=tied: BIC=6511443.85, AIC=-5982704.34
n_comp=3, cov=tied: BIC=6511443.85, AIC=-5982704.34
n_comp=3, cov=tied: BIC=6511443.85, AIC=-5982704.34
n_comp=4, cov=tied: BIC=6514783.32, AIC=-5991530.55

Top 5 results by AIC (lower is better):
n_comp=50, cov=tied: BIC=6770703.71, AIC=-6295231.48
n_comp=50, cov=tied: BIC=6770703.71, AIC=-6295231.48
n_comp=50, cov=tied: BIC=6779928.76, AIC=-6286006.43
n_comp=50, cov=tied: BIC=6779928.76, AIC=-6286006.43
n_comp=47, cov=tied: BIC=6755535.12, AIC=-6273903.03

Top 5 results by Silhouette Score:
n_comp=3, cov=spherical: silhouette=0.3757
n_comp=3, cov=spherical: silhouette=0.3757
n_comp=3, cov=spherical: silhouette=0.3757
n_comp=3, cov=spherical: silhouette=0.3757
n_comp=3, cov=spherical: silhouette=0.3757

Component count analysis (top 10 by BIC):
3.0 components: BIC=6511443.85, AIC=-5982704.34, silhouette=0.3757
4.0 components: BIC=6514783.32, AIC=-5991530.55, silhouette=0.3110
5.0 components: BIC=6520503.08, AIC=-5997976.48, silhouette=0.3163
2.0 components: BIC=6521812.14, AIC=-5960170.38, silhouette=0.3693
6.0 components: BIC=6526215.27, AIC=-6004429.97, silhouette=0.2485
8.0 components: BIC=6529704.08, AIC=-6025272.52, silhouette=0.2680
10.0 components: BIC=6538644.29, AIC=-6040663.67, silhouette=0.2706
11.0 components: BIC=6546208.81, AIC=-6045264.84, silhouette=0.2580
14.0 components: BIC=6563001.35, AIC=-6064969.34, silhouette=0.2241
17.0 components: BIC=6580862.17, AIC=-6083605.55, silhouette=0.2109
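BIC and AIC select very different models in this run (3 versus 50 components). Both criteria share the same likelihood term and differ only in the penalty, so BIC - AIC = p * (ln n - 2), where p is the number of free parameters and n the number of samples. The sketch below assumes the usual parameter count for a tied-covariance Gaussian mixture (k*d means, d(d+1)/2 shared covariance entries, k-1 weights) and checks it against the logged numbers; the function names are illustrative, not taken from the script.

```python
import math

N_SAMPLES = 2800   # from "Loaded 2800 samples"
DIM = 2048         # from "embedding dimension 2048"

def n_params_tied(k, d=DIM):
    """Free parameters of a k-component GMM sharing one full covariance."""
    return k * d + d * (d + 1) // 2 + (k - 1)

def bic_minus_aic(k, n=N_SAMPLES, d=DIM):
    """BIC - AIC = p * (ln n - 2); the likelihood term cancels out."""
    return n_params_tied(k, d) * (math.log(n) - 2)

# Logged values for n_components=3, cov=tied.
logged_gap = 6511443.85 - (-5982704.34)
predicted_gap = bic_minus_aic(3)

# Per-sample log-likelihood recovered from the best AIC (n_components=50),
# assuming AIC = -2 * n * mean_loglik + 2 * p.
mean_loglik_50 = (2 * n_params_tied(50) - (-6295231.48)) / (2 * N_SAMPLES)
```

With about 2.1 million parameters and only 2800 samples, each parameter costs roughly 7.94 under BIC but only 2 under AIC, which is consistent with BIC preferring few components and AIC preferring the largest model tried. The recovered `mean_loglik_50` agrees with the logged "Best Log-Likelihood: 1910.09", which suggests that figure is a per-sample average.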

📁 SAVING DETAILED RESULTS...
==============================
Detailed grid search results saved to: gmm_grid_search_detailed_20250805_150635.json
Grid search summary CSV saved to: gmm_grid_search_summary_20250805_150635.csv

Best GMM result by BIC:
Parameters: {'n_components': 3, 'covariance_type': 'tied', 'reg_covar': 1e-05, 'n_init': 1, 'init_params': 'kmeans', 'max_iter': 100}
BIC score: 6511443.85

Best GMM result by AIC:
Parameters: {'n_components': 50, 'covariance_type': 'tied', 'reg_covar': 1e-05, 'n_init': 5, 'init_params': 'kmeans', 'max_iter': 100}
AIC score: -6295231.48

Best GMM result by Silhouette:
Parameters: {'n_components': 3, 'covariance_type': 'spherical', 'reg_covar': 1e-05, 'n_init': 1, 'init_params': 'kmeans', 'max_iter': 100}
Silhouette score: 0.3757
Visualization saved as 'gmm_clustering_results.png'
Final clustering results (bic) saved to: gmm_final_results_bic_20250805_150636.json
Final clustering results (aic) saved to: gmm_final_results_aic_20250805_150636.json
Traceback (most recent call last):
  File "/home/nguyendc/sonnh/embedding-clustering/cluster/gmm_extensive.py", line 649, in <module>
    main()
  File "/home/nguyendc/sonnh/embedding-clustering/cluster/gmm_extensive.py", line 643, in main
    clustering.save_clustering_results(results)
  File "/home/nguyendc/sonnh/embedding-clustering/cluster/gmm_extensive.py", line 617, in save_clustering_results
    json.dump({
  File "/usr/lib/python3.10/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type float32 is not JSON serializable
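
The traceback shows `json.dump` rejecting a NumPy `float32` inside the results dict. `np.float32` is not a subclass of Python's `float`, so the standard encoder has no rule for it. One possible fix, sketched under the assumption that the offending values are NumPy scalars or arrays, is to pass a `default` hook that converts them to built-in types. The helper names `to_builtin` and `dumps_results` are hypothetical; the hook relies only on the `tolist()` method that NumPy scalars and arrays both provide, so the snippet itself does not import NumPy.

```python
import json

def to_builtin(obj):
    """Fallback for json.dump/json.dumps.

    json calls this only for objects it cannot serialize itself.
    NumPy scalars (e.g. float32) and arrays expose tolist(), which
    returns plain Python numbers or nested lists.
    """
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(
        f"Object of type {type(obj).__name__} is not JSON serializable"
    )

def dumps_results(results):
    """Serialize a results dict that may contain NumPy values."""
    return json.dumps(results, default=to_builtin, indent=2)
```

In `save_clustering_results`, the equivalent change would be to add `default=to_builtin` to the existing `json.dump(...)` call, or to cast the score with `float(...)` before building the dict. The BIC and AIC result files were written before the failure, so the `float32` value presumably appears only in the silhouette result.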