# OmniThought: A Large-Scale Chain-of-Thought Dataset for Advancing Large Reasoning Models

## Overview

The rise of **Large Reasoning Models (LRMs)** has revolutionized **Natural Language Processing (NLP)**, enabling breakthroughs in complex tasks like **mathematical problem-solving** and **code generation**. These models rely on **Chain-of-Thought (CoT)** processes to mimic human-like reasoning. However, progress in LRMs is limited by the scarcity of **high-quality, large-scale CoT datasets**—existing resources often lack:

- **Diverse reasoning problems** with well-structured CoT processes.
- **Multi-teacher distillation** to ensure reasoning quality.
- **Fine-grained annotations** describing CoT properties.

To bridge this gap, we introduce **`OmniThought`**, a **2-million-scale CoT dataset** generated and validated by **two powerful LRMs**. Each CoT process is annotated with:

- **Reasoning Verbosity (RV)**: Measures the optimal verbosity of reasoning steps.
- **Cognitive Difficulty (CD)**: Assesses the complexity of reasoning for model comprehension.

We also propose a **self-reliant pipeline** for dataset curation, ensuring high-quality reasoning traces.
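
To make these annotations concrete, the sketch below shows what a single annotated CoT record might conceptually look like; the field names, score values, and the 0–9 scale are illustrative assumptions, not the published schema.

```python
# Illustrative only: a conceptual view of one annotated CoT sample.
# Field names and the 0-9 scoring convention are assumptions, not the official schema.
sample = {
    "question": "If a train travels 120 km in 1.5 hours, what is its average speed?",
    "cot": "Average speed = distance / time = 120 km / 1.5 h = 80 km/h.",
    "answer": "80 km/h",
    "RV": 2,  # Reasoning Verbosity: a short, direct chain of thought suffices here
    "CD": 1,  # Cognitive Difficulty: easy for even small models to follow
}
print(f'RV={sample["RV"]}, CD={sample["CD"]}: {sample["cot"]}')
```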

## Key Features

✅ **2 million high-quality CoT processes** covering diverse reasoning tasks.

✅ **RV-CD scores** to guide model training for better reasoning performance.

✅ **Multi-teacher distillation** for robust and coherent reasoning paths.

✅ **Optimized for LRM training**—improves reasoning ability and output quality.

## Experiments & Results

Extensive experiments with **Qwen2.5 models** (various sizes) confirm that:

- Training with **RV-CD scores** enhances **LRM reasoning effectiveness**.
- Models trained on `OmniThought` achieve **stronger reasoning abilities** with **optimal CoT length and difficulty**.

Based on this dataset, we release **a series of high-performance LRMs** with superior reasoning capabilities.

## Use the Datasets

```python
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("alibaba-pai/OmniThought")
```
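
Once loaded, you can inspect the splits and columns and, if needed, narrow the data down by its annotations before training. Below is a minimal sketch using the standard `datasets` API; the `train` split name and the `RV`/`CD` column names (on a 0–9 scale) are assumptions about the released schema, so check `column_names` first.

```python
from datasets import load_dataset

ds = load_dataset("alibaba-pai/OmniThought")

# Inspect the available splits and columns before relying on any field names
print(ds)
print(ds["train"].column_names)  # assumes a "train" split exists

# Hypothetical selection step: keep CoT samples in a moderate verbosity/difficulty
# band, assuming numeric "RV" and "CD" columns scored on a 0-9 scale
subset = ds["train"].filter(lambda ex: 3 <= ex["RV"] <= 7 and ex["CD"] <= 6)
print(f"Selected {len(subset)} of {len(ds['train'])} samples")
```
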
## Reference

For more detailed information about the dataset construction process, we encourage you to refer to our paper:

- **Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations**  
  Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang  
  [arXiv:2505.10937](https://arxiv.org/abs/2505.10937)

You can cite the paper using the following citation format:

```bibtex
@misc{cai2025reasoningomnithoughtlargecot,
  title={Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations},
  author={Wenrui Cai and Chengyu Wang and Junbing Yan and Jun Huang and Xiangzhong Fang},
  year={2025},
  eprint={2505.10937},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.10937}
}
```