OmniThought: A Large-Scale Chain-of-Thought Dataset for Advancing Large Reasoning Models
Overview
The rise of Large Reasoning Models (LRMs) has revolutionized Natural Language Processing (NLP), enabling breakthroughs in complex tasks like mathematical problem-solving and code generation. These models rely on Chain-of-Thought (CoT) processes to mimic human-like reasoning. However, progress in LRMs is limited by the scarcity of high-quality, large-scale CoT datasets—existing resources often lack:
- Diverse reasoning problems with well-structured CoT processes.
- Multi-teacher distillation to ensure reasoning quality.
- Fine-grained annotations describing CoT properties.
To bridge this gap, we introduce OmniThought, a 2-million-scale CoT dataset generated and validated by two powerful LRMs. Each CoT process is annotated with:
- Reasoning Verbosity (RV): rates how verbose a CoT process is, so training can target the optimal level of reasoning detail.
- Cognitive Difficulty (CD): rates how difficult a CoT process is for a model to comprehend.
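The two scores above can be used to curate training subsets matched to a student model's capacity. The sketch below illustrates the idea; the field names "RV" and "CD" and the score ranges are assumptions for illustration, not the confirmed dataset schema.

```python
def select_samples(samples, rv_range=(3, 7), cd_max=8):
    """Keep CoT samples whose verbosity falls in a target band and whose
    cognitive difficulty does not exceed the student model's capacity.
    Field names "RV" and "CD" are assumed, not taken from the dataset card."""
    lo, hi = rv_range
    return [s for s in samples if lo <= s["RV"] <= hi and s["CD"] <= cd_max]

# Toy illustration with hypothetical annotated records.
samples = [
    {"problem": "2+2",      "cot": "...", "RV": 1, "CD": 0},
    {"problem": "integral", "cot": "...", "RV": 5, "CD": 6},
    {"problem": "olympiad", "cot": "...", "RV": 9, "CD": 9},
]
print(len(select_samples(samples)))  # -> 1 (only the mid-verbosity sample)
```

In practice the same predicate could be applied with the `datasets` library's `Dataset.filter` after loading the dataset.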
We also propose a self-reliant pipeline for dataset curation, ensuring high-quality reasoning traces.
Key Features
✅ 2 million high-quality CoT processes covering diverse reasoning tasks.
✅ RV-CD scores to guide model training for better reasoning performance.
✅ Multi-teacher distillation for robust and coherent reasoning paths.
✅ Optimized for LRM training—improves reasoning ability and output quality.
Experiments & Results
Extensive experiments with Qwen2.5 models (various sizes) confirm that:
- Training with RV-CD scores enhances LRM reasoning effectiveness.
- Models trained on OmniThought achieve stronger reasoning abilities with optimal CoT length and difficulty.
Based on this dataset, we release a series of high-performance LRMs with superior reasoning capabilities.
Use the Dataset
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("alibaba-pai/OmniThought")
Reference
For more detailed information about the dataset construction process, we encourage you to refer to our paper:
- Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations
Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang. arXiv:2505.10937
You can cite the paper using the following citation format:
@misc{cai2025reasoningomnithoughtlargecot,
title={Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations},
author={Wenrui Cai and Chengyu Wang and Junbing Yan and Jun Huang and Xiangzhong Fang},
year={2025},
eprint={2505.10937},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.10937}
}