DistilQwen-100k/DistilQwen-1M: High-Quality Instruction-Tuning Datasets

Overview

To empower community developers in enhancing the instruction-following capabilities of large language models (LLMs), we open-source DistilQwen-100k and DistilQwen-1M, subsets of the training data used for the DistilQwen model series. The datasets provide diverse, high-quality samples to improve model performance in key areas.

Dataset Features

  • Scale: 100 thousand (DistilQwen-100k) and 1 million (DistilQwen-1M) meticulously distilled entries.
  • Coverage: Balanced mix of:
    • Mathematics
    • Code generation & understanding
    • Knowledge-based QA
    • Instruction following
    • Creative generation
  • Purpose: Optimized for instruction tuning, helping models retain generalization while adapting to downstream tasks.

Use Cases

  • Fine-tuning LLMs: Mitigate catastrophic forgetting by mixing the distilled data with custom datasets (see the sketch after this list).
  • Multi-task learning: Improve coherence in mathematical reasoning, coding, and creative tasks.
  • Research: Study distillation techniques or instruction-tuning efficacy.
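
A minimal sketch of the first use case: mixing the distilled data with a project-specific dataset before fine-tuning. The custom file name, the `train` split name, and the assumption that both datasets share the same schema are illustrative placeholders, not part of the official recipe.

from datasets import load_dataset, concatenate_datasets

# Open-sourced distilled instruction data (split name assumed to be "train").
distil = load_dataset("alibaba-pai/DistilQwen_100k", split="train")

# Hypothetical task-specific dataset; it must expose the same columns and
# types as the distilled data for concatenate_datasets to succeed.
custom = load_dataset("json", data_files="my_task_data.jsonl", split="train")

# Mix and shuffle so each training batch interleaves general-purpose and
# task-specific instructions, which helps mitigate catastrophic forgetting.
mixed = concatenate_datasets([distil, custom]).shuffle(seed=42)
print(mixed)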

Use the Datasets

from datasets import load_dataset

# Login first, e.g. via `huggingface-cli login`, to access these datasets
ds_100k = load_dataset("alibaba-pai/DistilQwen_100k")
ds_1m = load_dataset("alibaba-pai/DistilQwen_1M")

Reference

For more detailed information about the dataset construction process, we encourage you to refer to our paper:

  • DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models
    Chengyu Wang, Junbing Yan, Yuanhao Yue, Jun Huang
    arXiv:2504.15027

You can cite the paper using the following BibTeX entry:

@misc{wang2025distilqwen25industrialpracticestraining,
      title={DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models}, 
      author={Chengyu Wang and Junbing Yan and Yuanhao Yue and Jun Huang},
      year={2025},
      eprint={2504.15027},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15027}
}