DistilQwen-100k/DistilQwen-1M: High-Quality Instruction-Tuning Datasets
Overview
To empower community developers in enhancing the instruction-following capabilities of large language models (LLMs), we open-source DistilQwen-100k and DistilQwen-1M, subsets of the training data used for the DistilQwen model series. The datasets provide diverse, high-quality samples to improve model performance in key areas.
Dataset Features
- Scale: 100 thousand (DistilQwen-100k) and 1 million (DistilQwen-1M) meticulously distilled entries.
- Coverage: A balanced mix of:
  - Mathematics
  - Code generation & understanding
  - Knowledge-based QA
  - Instruction following
  - Creative generation
- Purpose: Optimized for instruction tuning, helping models retain generalization while adapting to downstream tasks.
Use Cases
- Fine-tuning LLMs: Mix with custom datasets to mitigate catastrophic forgetting during domain adaptation (see the mixing sketch after this list).
- Multi-task learning: Improve coherence in mathematical reasoning, coding, and creative tasks.
- Research: Study distillation techniques or instruction-tuning efficacy.
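To illustrate the fine-tuning use case, the sketch below mixes DistilQwen-100k with a domain-specific dataset using the `datasets` library. The file name `my_domain_data.jsonl`, the 70/30 mixing ratio, and the assumption that both datasets share the same column schema are hypothetical; adapt them to your setup.

from datasets import load_dataset, interleave_datasets

# General-purpose distilled data, which helps preserve broad
# instruction-following ability during domain adaptation.
distil = load_dataset("alibaba-pai/DistilQwen_100k", split="train")

# Hypothetical domain-specific data; its columns must match the
# DistilQwen schema (rename or map fields first if they differ).
custom = load_dataset("json", data_files="my_domain_data.jsonl", split="train")

# Sample roughly 70% general / 30% domain examples; tune the ratio for your task.
mixed = interleave_datasets([distil, custom], probabilities=[0.7, 0.3], seed=42)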
Use the Datasets
from datasets import load_dataset

# Log in first (e.g. `huggingface-cli login`) to access the datasets
ds_100k = load_dataset("alibaba-pai/DistilQwen_100k")
ds_1m = load_dataset("alibaba-pai/DistilQwen_1M")
Reference
For more detailed information about the dataset construction process, we encourage you to refer to our paper:
- Chengyu Wang, Junbing Yan, Yuanhao Yue, and Jun Huang. DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models. arXiv:2504.15027.
You can cite the paper with the following BibTeX entry:
@misc{wang2025distilqwen25industrialpracticestraining,
  title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models},
  author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
  year={2025},
  eprint={2504.15027},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.15027}
}