# DistilQwen2.5-R1: Training Distilled Reasoning Models Based on CoTs Generated by DeepSeek-R1
## Brief Introduction
As large language models (LLMs) evolve toward deep reasoning capabilities, deploying them in resource-constrained environments (e.g., mobile devices, edge computing) remains challenging. The DistilQwen2.5-R1 series addresses this by transferring reasoning capabilities from ultra-large models (e.g., DeepSeek-R1) to compact models through innovative distillation techniques, achieving high performance while reducing computational costs.
## Data Generation Detailed Steps
### 1. Generate Thinking Dataset
DistilQwen2.5-R1 is trained on chain-of-thought data distilled from DeepSeek-R1. We provide the system prompt used for distilling the R1 data, which is also the system prompt used for training the Qwen2.5 student models. You can use this system prompt to call DeepSeek-R1 to generate your own data and train the model.
```json
{
"system": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\n\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution} <|end_of_solution|> Now, try to solve the following question through the above guidelines:"
}
```
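As a reference, the following is a minimal sketch of how the system prompt above could be used to generate CoT data through an OpenAI-compatible API. The endpoint, model identifier, and file names are placeholders, not part of this recipe; substitute the DeepSeek-R1 deployment you actually use.
```python
import json
from openai import OpenAI

# Placeholder endpoint and key; replace with your DeepSeek-R1 deployment.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

# The full distillation system prompt shown above.
SYSTEM_PROMPT = "Your role as an assistant involves thoroughly exploring questions ..."

def generate_cot(question: str) -> str:
    """Call DeepSeek-R1 with the distillation system prompt and return the raw CoT response."""
    response = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("questions.json") as f:  # e.g. [{"instruction": "..."}]
        questions = json.load(f)
    dataset = [
        {"instruction": q["instruction"], "output": generate_cot(q["instruction"])}
        for q in questions
    ]
    with open("thinking_dataset.json", "w") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)
```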
### 2. Determine the Difficulty Level
Next, critique the quality of each CoT according to the cognitive capabilities of smaller models. You can use the following system prompt with QwQ-32B to determine the difficulty level of each CoT.
```json
{
"system": "You are a highly capable evaluator. Your task is to assess the given reasoning process from the perspective of a small language model (e.g., 7B). Specifically, determine whether the reasoning process provides sufficient detail for a small model to solve the problem, or whether it is too simplistic (i.e., lacking critical details) or too complex (i.e., containing unnecessary or confusing steps). Difficulty Definitions (from the perspective of a small model): - Easy: The reasoning process is overly simplistic relative to the problem's difficulty; it omits essential details that a small model needs to solve the problem. - Medium: The reasoning process is appropriately balanced, offering enough detailed guidance. - Hard: The reasoning process is overly complex, with extraneous or convoluted steps that could hinder a small model's ability to follow it. Output Format: You must output exactly one word: easy, medium, or hard. Do NOT provide any additional text, explanation."
}
```
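A minimal sketch of this labeling step, assuming QwQ-32B is served behind an OpenAI-compatible endpoint (for example via vLLM) and reusing the `client` from the previous sketch; `JUDGE_PROMPT` stands for the evaluator system prompt above:
```python
VALID_LABELS = {"easy", "medium", "hard"}

def label_difficulty(problem: str, answer: str, cot: str) -> str:
    """Ask the judge model to grade one CoT; returns 'easy', 'medium', or 'hard'."""
    user_msg = f"Problem: {problem}\nAnswer: {answer}\nReasoning process: {cot}"
    response = client.chat.completions.create(
        model="Qwen/QwQ-32B",  # assumed identifier on your serving endpoint
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in VALID_LABELS else "medium"  # fall back on malformed output
```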
### 3. Rethinking and Refining the CoTs
Rethink and refine the CoTs based on the critiques, using the following prompts:
#### easy
```json
{
"system": "You are a helpful assistant who is highly skilled at extending reasoning processes. Given a problem, its answer, and its reasoning process, your task is to extend the reasoning process by adding necessary details and intermediate steps so that a small language model (e.g., a 7B model) can follow the extended reasoning process to solve the problem. If the original reasoning process is divided into multiple steps separated by two newline characters, your output must preserve this formatting. You must output ONLY the extended reasoning process with no additional explanation or commentary."
}
```
#### hard
```json
{
"system": "You are a helpful assistant who is highly skilled at simplifying reasoning processes. Given a problem, its answer, and its reasoning process, your task is to simplify the reasoning process so that a small language model (e.g., a 7B model) can reliably follow the steps to solve the problem. If the original reasoning process is divided into multiple steps separated by two newline characters, your output must preserve this formatting. You must output ONLY the simplified reasoning process with no additional explanation or commentary."
}
```
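Putting the two prompts together, the refinement step reduces to simple routing: CoTs judged easy are extended, CoTs judged hard are simplified, and medium CoTs are kept unchanged. The sketch below illustrates this; `EXTEND_PROMPT` and `SIMPLIFY_PROMPT` stand for the two system prompts above, and the rewriting model is an assumption.
```python
def refine_cot(problem: str, answer: str, cot: str, label: str) -> str:
    """Route one CoT through the extend/simplify prompts based on its difficulty label."""
    if label == "medium":
        return cot  # appropriately balanced; keep as-is
    system_prompt = EXTEND_PROMPT if label == "easy" else SIMPLIFY_PROMPT
    user_msg = f"Problem: {problem}\nAnswer: {answer}\nReasoning process: {cot}"
    response = client.chat.completions.create(
        model="Qwen/QwQ-32B",  # assumed rewriting model; substitute your own
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
    )
    return response.choices[0].message.content
```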
The training dataset is in JSON format, exemplified by entries such as:
```json
[
{
"instruction": "The ratio of the number of molar teeth in the human upper jaw at the age of 6 is 2:1 compared to number of incisors teeth. There are total 8 incisors in the human mouth...",
"output": "<|begin_of_thought|>## Step 1: Determine the total number of incisors in the upper jaw...\n<|end_of_thought|>\n<|begin_of_solution|>The final answer is: \\boxed{8}<|end_of_solution|>"
}
]
```
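To produce this format, each refined CoT and its final solution can be wrapped in the thought/solution tags shown above; a minimal sketch (the function name and argument split are assumptions):
```python
def to_training_record(instruction: str, thought: str, solution: str) -> dict:
    """Wrap a refined CoT and its solution in the tags used by the training data."""
    output = (
        f"<|begin_of_thought|>{thought}<|end_of_thought|>\n"
        f"<|begin_of_solution|>{solution}<|end_of_solution|>"
    )
    return {"instruction": instruction, "output": output}
```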
## Model Training Guidelines
### 1. Black-Box KD
The black-box KD process follows a supervised learning paradigm, utilizing enhanced instruction-response pairs as training samples. Through this approach, the student model can effectively absorb and understand the knowledge imparted by the larger model, even with a limited number of parameters. This method not only boosts the student model's ability to tackle tasks but also enables it to perform better in multi-task scenarios. Because the teacher's responses are already contained in the dataset, we only need to run the training job:
```bash
python easydistill/kd/train.py --config=distilqwen2.5-r1_stage1.json
```
Please refer to the config file `distilqwen2.5-r1_stage1.json` in the current folder. If you need to run the job in distributed mode, use `accelerate` to launch it, as shown below.
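For example, a multi-GPU launch might look like the following (the process count and `accelerate` config file are placeholders for your environment):
```bash
accelerate launch --num_processes 8 --config_file multi_gpu.yaml \
    easydistill/kd/train.py --config=distilqwen2.5-r1_stage1.json
```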
### 2. CogPO
CogPO (Cognitive Preference Optimization) is a novel algorithm designed to enhance the reasoning abilities of small language models by aligning their reasoning processes with their inherent cognitive capacities.
Key aspects of CogPO:
- Extends Direct Preference Optimization (DPO) with cognitive alignment
- Introduces three specialized "mini-tasks" with different preference gaps
- Dynamically adjusts optimization strength (β values) based on reasoning complexity (see the sketch after this list)
- Works synergistically with the CRV (Critique-Rethink-Verify) system
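CogPO itself is implemented in `cogpo.py`. As a rough illustration of the per-sample β idea, the sketch below shows a DPO-style loss in which β varies per training sample (matching the `beta` field in the dataset below) rather than being one global constant. This is an illustrative reconstruction under that assumption, not the repository's implementation.
```python
import torch
import torch.nn.functional as F

def cogpo_style_loss(policy_chosen_logps: torch.Tensor,
                     policy_rejected_logps: torch.Tensor,
                     ref_chosen_logps: torch.Tensor,
                     ref_rejected_logps: torch.Tensor,
                     betas: torch.Tensor) -> torch.Tensor:
    """DPO loss with a per-sample beta.

    All arguments are 1-D tensors with one entry per sample: summed token
    log-probs under the policy and reference models, plus each sample's beta.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    logits = betas * (chosen_rewards - rejected_rewards)  # beta scales the preference gap
    return -F.logsigmoid(logits).mean()
```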
You can run CogPO with:
```bash
accelerate launch --num_processes n --config_file multi_gpu.yaml cogpo.py --config distilqwen2.5-r1_stage2.json
```
The dataset is in JSON format, exemplified by entries such as:
```json
{
"prompt": "Ellie has 8 pairs of shoes. Riley has 3 fewer. How many pairs of shoes do they have in all?",
"chosen": "<think>Identify the number of pairs of shoes Ellie has. According to the problem statement, Ellie has 8 pairs of shoes.\n Next, determine the number of pairs of shoes Riley has. The problem states that Riley has 3 fewer pairs than Ellie. To find out how many pairs Riley has, subtract 3 from the number of pairs Ellie has: 8 - 3 = 5. So, Riley has 5 pairs of shoes.\n Now, calculate the total number of pairs of shoes both Ellie and Riley have together. To do this, add the number of pairs Ellie has to the number of pairs Riley has: 8 (Ellie's pairs) + 5 (Riley's pairs) = 13 pairs. This step is crucial because it combines the information about both individuals to give the overall total.\n The total number of pairs of shoes they have in all is 13. Thus, the final answer is 13. Each step in the reasoning process is designed to help understand and solve the problem effectively, showing how the information about each individual's shoe count leads to finding the combined total.</think>\boxed{13}",
"rejected": "<think>Identify the number of pairs of shoes Ellie has. Ellie has 8 pairs of shoes as stated in the problem. Determine how many pairs of shoes Riley has. Since Riley has 3 fewer pairs than Ellie, we mistakenly add 3 to Ellie's pairs instead of subtracting, giving us 8 + 3 = 11 pairs of shoes for Riley. Calculate the total number of pairs of shoes they both have. Add Ellie's and Riley's pairs together: 8 + 11. The total pairs of shoes is 19. The final answer is thus \boxed{19}.</think>\boxed{13}",
"beta": 0.5
}
```
## Model Download
We have open-sourced our distilled models on both HuggingFace and ModelScope. The available models are named `alibaba-pai/DistilQwen2.5-R1-3B`, `alibaba-pai/DistilQwen2.5-R1-7B`, `alibaba-pai/DistilQwen2.5-R1-14B`, and `alibaba-pai/DistilQwen2.5-R1-32B`.
For example, users can download these models from HuggingFace using the following code:
```python
from huggingface_hub import snapshot_download
# Download the 3B model
model_name = "alibaba-pai/DistilQwen2.5-R1-3B"
snapshot_download(repo_id=model_name, cache_dir="./DistilQwen2.5-R1-3B/")
# Download the 7B model
model_name = "alibaba-pai/DistilQwen2.5-R1-7B"
snapshot_download(repo_id=model_name, cache_dir="./DistilQwen2.5-R1-7B/")
# Download the 14B model
model_name = "alibaba-pai/DistilQwen2.5-R1-14B"
snapshot_download(repo_id=model_name, cache_dir="./DistilQwen2.5-R1-14B/")
# Download the 32B model
model_name = "alibaba-pai/DistilQwen2.5-R1-32B"
snapshot_download(repo_id=model_name, cache_dir="./DistilQwen2.5-R1-32B/")
```
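After downloading, the models can be loaded through the standard `transformers` API. A minimal inference sketch follows; the example prompt and generation length are illustrative, not prescribed by this recipe:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "alibaba-pai/DistilQwen2.5-R1-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=2048)  # leave room for the long CoT
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```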
## Performance
We compared the DistilQwen2.5-R1 series with leading reasoning models across four benchmarks:
### 7B Model Comparison
| Model | Training Data Size | AIME2024 | MATH-500 | GPQA Diamond | LiveCodeBench V2 |
|--------------------------------|--------------------|----------|----------|--------------|------------------|
| DeepSeek-R1-Distill-Qwen-7B | 800k | 55.5 | 92.8 | 49.1 | - |
| Bespoke-Stratos-7B | 17k | 20.0 | 82.0 | 37.8 | 36.1 |
| OpenThinker-7B | 114k | 31.3 | 83.0 | 42.4 | 39.9 |
| **DistilQwen2.5-R1-7B** | 105k | 43.33 | 88.4 | 42.93 | 46.38 |
### 32B Model Comparison
| Model | Training Data Size | AIME2024 | MATH-500 | GPQA Diamond | LiveCodeBench V2 |
|--------------------------------|--------------------|----------|----------|--------------|------------------|
| DeepSeek-R1-Distill-Qwen-32B | 800k | 72.6 | 94.3 | 62.1 | - |
| Sky-T1-32B-Preview | 17k | 43.3 | 86.4 | 56.8 | - |
| OpenThinker-32B | 114k | 66.0 | 90.6 | 61.6 | 68.9 |
| **DistilQwen2.5-R1-32B** | 105k | 70.0 | 93.8 | 62.12 | 65.95 |