# DistilQwen-100k/DistilQwen-1M: High-Quality Instruction-Tuning Datasets
## Overview
To empower community developers to enhance the **instruction-following capabilities** of large language models (LLMs), we open-source **`DistilQwen-100k`** and **`DistilQwen-1M`**, subsets of the training data used for the **DistilQwen model series**. The datasets provide diverse, high-quality samples that improve model performance in key areas.
## Dataset Features
- **Scale**: **100 thousand** (DistilQwen-100k) / **1 million** (DistilQwen-1M) meticulously distilled entries.
- **Coverage**: Balanced mix of:
- **Mathematics**
- **Code generation & understanding**
- **Knowledge-based QA**
- **Instruction following**
- **Creative generation**
- **Purpose**: Optimized for **instruction tuning**, helping models retain generalization while adapting to downstream tasks.
## Use Cases
- **Fine-tuning LLMs**: Mitigate *catastrophic forgetting* by combining with custom datasets (see the sketch after this list).
- **Multi-task learning**: Improve coherence in mathematical reasoning, coding, and creative tasks.
- **Research**: Study distillation techniques or instruction-tuning efficacy.
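
As a minimal sketch of the fine-tuning use case above, the snippet below interleaves `DistilQwen_100k` with a hypothetical custom instruction dataset so that general-purpose samples continue to appear during domain adaptation. The `custom.jsonl` file, the `train` split name, and the 30/70 mixing ratio are illustrative assumptions, not part of the released datasets.

```python
from datasets import load_dataset, interleave_datasets

# General instruction data from the open-sourced subset (train split assumed).
general = load_dataset("alibaba-pai/DistilQwen_100k", split="train")

# Hypothetical domain-specific data in JSON Lines format; replace with your own file.
# Both sources must share the same column schema for interleaving to work.
custom = load_dataset("json", data_files="custom.jsonl", split="train")

# Mix general and domain data (illustrative 30/70 ratio) so the model keeps
# seeing diverse instructions while adapting to the downstream task.
mixed = interleave_datasets([general, custom], probabilities=[0.3, 0.7], seed=42)
print(mixed)
```

Sampling by probability rather than simply concatenating keeps the two sources shuffled together throughout training, which is what helps reduce catastrophic forgetting.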
## Use the Datasets
```python
from datasets import load_dataset

# Log in first (e.g. `huggingface-cli login`) if needed to access these datasets.
ds_100k = load_dataset("alibaba-pai/DistilQwen_100k")
ds_1m = load_dataset("alibaba-pai/DistilQwen_1M")
```
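
If you want a small held-out set for monitoring fine-tuning, a quick split can be taken directly from the loaded dataset. The `train` split name and the 1% validation fraction below are assumptions for illustration.

```python
# Assumes a single "train" split; adjust the split name if it differs.
splits = ds_100k["train"].train_test_split(test_size=0.01, seed=42)
train_ds, valid_ds = splits["train"], splits["test"]
print(train_ds.num_rows, valid_ds.num_rows)
```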
## Reference
For more detailed information about the dataset construction process, we encourage you to refer to our paper:
- **DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models**
  Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
  [arXiv:2504.15027](https://arxiv.org/abs/2504.15027)
You can cite the paper using the following citation format:
```bibtex
@misc{wang2025distilqwen25industrialpracticestraining,
      title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models},
      author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
      year={2025},
      eprint={2504.15027},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15027}
}
```