[Init] Init easy distill for Knowledge distillation

2025-08-07 08:38:26 +00:00
parent 2f21aaae17
commit 0637599c3a
19 changed files with 170614 additions and 3 deletions

easydistill/mmkd/infer.log Normal file

@@ -0,0 +1,174 @@
INFO 08-03 20:27:56 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 08-03 20:27:56 [__init__.py:239] Automatically detected platform cuda.
2025-08-03 20:27:58,078 - INFO - Generating distillation data from the teacher model!
2025-08-03 20:27:58,384 - INFO - Loading processor & vLLM model from Qwen/Qwen2.5-VL-32B-Instruct
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
2025-08-03 20:28:00,580 - INFO - Initial eos_token_id 151645 from tokenizer
2025-08-03 20:28:00,580 - INFO - processor.tokenizer eos_token: <|im_end|>, eos_token_id: 151645
INFO 08-03 20:28:09 [config.py:717] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 08-03 20:28:09 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 08-03 20:28:11 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='Qwen/Qwen2.5-VL-32B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-VL-32B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 08-03 20:28:12 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x72908ff5c0d0>
INFO 08-03 20:28:13 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 08-03 20:28:13 [cuda.py:221] Using Flash Attention backend on V1 engine.
WARNING 08-03 20:28:20 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 08-03 20:28:20 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen2.5-VL-32B-Instruct...
WARNING 08-03 20:28:20 [vision.py:93] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 08-03 20:28:20 [config.py:3614] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
INFO 08-03 20:28:21 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/18 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/18 [00:01<00:18, 1.10s/it]
Loading safetensors checkpoint shards: 11% Completed | 2/18 [00:01<00:12, 1.23it/s]
Loading safetensors checkpoint shards: 17% Completed | 3/18 [00:02<00:14, 1.02it/s]
Loading safetensors checkpoint shards: 22% Completed | 4/18 [00:04<00:14, 1.07s/it]
Loading safetensors checkpoint shards: 28% Completed | 5/18 [00:05<00:14, 1.12s/it]
Loading safetensors checkpoint shards: 33% Completed | 6/18 [00:06<00:13, 1.15s/it]
Loading safetensors checkpoint shards: 39% Completed | 7/18 [00:07<00:12, 1.17s/it]
Loading safetensors checkpoint shards: 44% Completed | 8/18 [00:08<00:11, 1.17s/it]
Loading safetensors checkpoint shards: 50% Completed | 9/18 [00:10<00:10, 1.18s/it]
Loading safetensors checkpoint shards: 56% Completed | 10/18 [00:11<00:09, 1.19s/it]
Loading safetensors checkpoint shards: 61% Completed | 11/18 [00:12<00:08, 1.19s/it]
Loading safetensors checkpoint shards: 67% Completed | 12/18 [00:13<00:07, 1.20s/it]
Loading safetensors checkpoint shards: 72% Completed | 13/18 [00:15<00:06, 1.23s/it]
Loading safetensors checkpoint shards: 78% Completed | 14/18 [00:16<00:04, 1.25s/it]
Loading safetensors checkpoint shards: 83% Completed | 15/18 [00:17<00:03, 1.26s/it]
Loading safetensors checkpoint shards: 89% Completed | 16/18 [00:18<00:02, 1.26s/it]
Loading safetensors checkpoint shards: 94% Completed | 17/18 [00:19<00:01, 1.17s/it]
Loading safetensors checkpoint shards: 100% Completed | 18/18 [00:21<00:00, 1.18s/it]
Loading safetensors checkpoint shards: 100% Completed | 18/18 [00:21<00:00, 1.17s/it]
INFO 08-03 20:28:42 [loader.py:458] Loading weights took 21.13 seconds
INFO 08-03 20:28:42 [gpu_model_runner.py:1347] Model loading took 62.4365 GiB and 21.912121 seconds
INFO 08-03 20:28:46 [gpu_model_runner.py:1620] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
INFO 08-03 20:29:09 [backends.py:420] Using cache directory: /home/nguyendc/.cache/vllm/torch_compile_cache/1fe259ecb1/rank_0_0 for vLLM's torch.compile
INFO 08-03 20:29:09 [backends.py:430] Dynamo bytecode transform time: 19.39 s
INFO 08-03 20:29:22 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 11.165 s
INFO 08-03 20:29:24 [monitor.py:33] torch.compile takes 19.39 s in total
INFO 08-03 20:29:29 [kv_cache_utils.py:634] GPU KV cache size: 38,016 tokens
INFO 08-03 20:29:29 [kv_cache_utils.py:637] Maximum concurrency for 16,000 tokens per request: 2.38x
INFO 08-03 20:30:08 [gpu_model_runner.py:1686] Graph capturing finished in 39 secs, took 0.96 GiB
INFO 08-03 20:30:08 [core.py:159] init engine (profile, create kv cache, warmup model) took 86.30 seconds
INFO 08-03 20:30:12 [core_client.py:439] Core engine process 0 ready.
2025-08-03 20:30:12,647 - INFO - Qwen2.5-VL vLLM model loaded successfully
Generating responses: 0%| | 0/40 [00:00<?, ?it/s]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.96s/it, est. speed input: 272.51 toks/s, output: 20.14 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.96s/it, est. speed input: 272.51 toks/s, output: 20.14 toks/s]
Generating responses: 2%|▎ | 1/40 [00:16<10:29, 16.13s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.37s/it, est. speed input: 333.81 toks/s, output: 20.48 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.37s/it, est. speed input: 333.81 toks/s, output: 20.48 toks/s]
Generating responses: 5%|▌ | 2/40 [00:23<06:58, 11.01s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.54s/it, est. speed input: 364.77 toks/s, output: 20.02 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.54s/it, est. speed input: 364.77 toks/s, output: 20.02 toks/s]
Generating responses: 8%|▊ | 3/40 [00:31<05:50, 9.47s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.43s/it, est. speed input: 343.57 toks/s, output: 20.31 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.43s/it, est. speed input: 343.57 toks/s, output: 20.31 toks/s]
Generating responses: 10%|█ | 4/40 [00:38<05:12, 8.69s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:08<00:00, 8.27s/it, est. speed input: 564.53 toks/s, output: 18.27 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:08<00:00, 8.27s/it, est. speed input: 564.53 toks/s, output: 18.27 toks/s]
Generating responses: 12%|█▎ | 5/40 [00:47<05:02, 8.64s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.35s/it, est. speed input: 307.93 toks/s, output: 20.56 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.35s/it, est. speed input: 307.93 toks/s, output: 20.56 toks/s]
Generating responses: 15%|█▌ | 6/40 [00:54<04:39, 8.21s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:08<00:00, 8.26s/it, est. speed input: 565.20 toks/s, output: 18.29 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:08<00:00, 8.26s/it, est. speed input: 565.20 toks/s, output: 18.29 toks/s]
Generating responses: 18%|█▊ | 7/40 [01:03<04:34, 8.32s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.53s/it, est. speed input: 363.87 toks/s, output: 20.05 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.53s/it, est. speed input: 363.87 toks/s, output: 20.05 toks/s]
Generating responses: 20%|██ | 8/40 [01:10<04:19, 8.10s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:08<00:00, 8.25s/it, est. speed input: 565.62 toks/s, output: 18.30 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:08<00:00, 8.25s/it, est. speed input: 565.62 toks/s, output: 18.30 toks/s]
Generating responses: 22%|██▎ | 9/40 [01:19<04:15, 8.24s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.63s/it, est. speed input: 395.25 toks/s, output: 19.80 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:07<00:00, 7.63s/it, est. speed input: 395.25 toks/s, output: 19.80 toks/s]
Generating responses: 25%|██▌ | 10/40 [01:27<04:02, 8.08s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:17<00:00, 17.76s/it, est. speed input: 293.12 toks/s, output: 20.05 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:17<00:00, 17.76s/it, est. speed input: 293.12 toks/s, output: 20.05 toks/s]
Generating responses: 28%|██▊ | 11/40 [01:45<05:22, 11.13s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.45s/it, est. speed input: 276.12 toks/s, output: 20.48 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.45s/it, est. speed input: 276.12 toks/s, output: 20.48 toks/s]
Generating responses: 30%|███ | 12/40 [01:57<05:24, 11.57s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:15<00:00, 15.51s/it, est. speed input: 226.26 toks/s, output: 20.76 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:15<00:00, 15.51s/it, est. speed input: 226.26 toks/s, output: 20.76 toks/s]
Generating responses: 32%|███▎ | 13/40 [02:13<05:45, 12.81s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.27s/it, est. speed input: 278.40 toks/s, output: 20.45 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.27s/it, est. speed input: 278.40 toks/s, output: 20.45 toks/s]
Generating responses: 35%|███▌ | 14/40 [02:25<05:29, 12.69s/it]
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]