Credit: HIGHER EDUCATION PRESS
This study introduces CELLM (Chinese Education Large Language Model), a 1.5B-parameter open-source LLM designed specifically for Chinese educational applications. The research addresses two critical gaps in current LLM development: (1) the lack of transparency in the training processes of existing open-source models, and (2) the scarcity of high-quality Chinese educational datasets compared to their English counterparts.
The core innovation lies in a fully transparent training pipeline with two key components. First, the authors curated Chinese-fineweb-edu-v2, a domain-specific pretraining corpus combining multiple Chinese educational resources (including a 25.4% industry corpus and an 18.6% safety corpus). Second, they created a novel multi-turn dialogue translation framework that converted 258,000 English instructional entries into Chinese with 97.7% accuracy, significantly expanding the pool of available Chinese educational data.
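For a concrete picture of what such a framework must do, the turn-by-turn translation step can be sketched as follows. This is a minimal illustration assuming a ChatML-like turn format; translate_en_to_zh is a hypothetical stand-in for whatever translation model the authors actually used, and none of these names come from the paper.

    # A minimal sketch of turn-by-turn translation for multi-turn dialogue data.
    # Assumes entries in a ChatML-like format; `translate_en_to_zh` is a
    # hypothetical placeholder for a real MT model or LLM-based translator.
    def translate_dialogue(entry, translate_en_to_zh):
        """Translate each turn's content while preserving roles and turn order."""
        return {
            "dialogue": [
                {"role": turn["role"], "content": translate_en_to_zh(turn["content"])}
                for turn in entry["dialogue"]
            ]
        }

    # Toy usage with an identity "translator" standing in for a real MT system.
    entry = {"dialogue": [
        {"role": "user", "content": "Explain photosynthesis."},
        {"role": "assistant", "content": "Photosynthesis converts light energy..."},
    ]}
    print(translate_dialogue(entry, lambda text: text))

Preserving the role and turn structure, rather than translating a flattened transcript, is what keeps the converted entries usable as multi-turn instruction data.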
The technical implementation adopts a causal-decoder architecture with grouped-query attention (GQA) and rotary positional encoding (RoPE), optimized for educational contexts. The model demonstrates particular strength in the humanities (26.77% accuracy on C-Eval-humanities) and social sciences (26.35% on C-Eval-social-science), though it shows limitations in STEM domains (21.48% on C-Eval-stem) and programming tasks (a score of 0.6 on the MBPP benchmark).
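To make the architectural terms concrete, the following is a minimal sketch of grouped-query attention combined with rotary positional encoding, assuming PyTorch; the head counts and dimensions are illustrative toy values, not CELLM's actual configuration.

    # A minimal sketch of grouped-query attention (GQA) with rotary positional
    # encoding (RoPE). Shapes and head counts are toy values for illustration.
    import torch
    import torch.nn.functional as F

    def rope(x, base=10000.0):
        """Apply rotary positional encoding to x of shape (batch, heads, seq, head_dim)."""
        b, h, t, d = x.shape
        half = d // 2
        freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
        angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    def gqa(q, k, v, n_kv_heads):
        """Grouped-query attention: several query heads share one key/value head."""
        b, n_q_heads, t, d = q.shape
        group = n_q_heads // n_kv_heads
        # Expand each KV head so it is reused by `group` query heads.
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        return F.scaled_dot_product_attention(rope(q), rope(k), v, is_causal=True)

    # Toy shapes: 8 query heads sharing 2 KV heads (a 4:1 grouping).
    b, t, d = 1, 16, 64
    q = torch.randn(b, 8, t, d)
    k = torch.randn(b, 2, t, d)
    v = torch.randn(b, 2, t, d)
    out = gqa(q, k, v, n_kv_heads=2)
    print(out.shape)  # torch.Size([1, 8, 16, 64])

The practical appeal of GQA at this model scale is a smaller key/value cache during inference, which matters for deploying a 1.5B-parameter model on modest educational hardware.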
Notably, the paper provides complete architectural transparency, detailing everything from vocabulary size (151,936 tokens) to training scale (33.6B pretraining tokens and 16B fine-tuning tokens). This open approach, combined with the release of all models, data, and code, establishes CELLM as a foundational resource for Chinese educational LLM research, while setting performance baselines across 11 evaluation datasets, including C-Eval, CMMLU, and MMLU.
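Because the checkpoints are released openly, a typical way to try such a model would be through the Hugging Face transformers library, as in the hedged sketch below; the repository id shown is a placeholder, not the paper's actual release path.

    # A hedged sketch of loading an open-source causal LM with Hugging Face
    # transformers. The repo id below is a hypothetical placeholder; substitute
    # the checkpoint actually released by the authors.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "org/cellm-1.5b"  # placeholder, not the real release path
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)

    prompt = "请简要介绍光合作用。"  # "Briefly explain photosynthesis."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))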
The work represents a significant step toward democratizing educational LLM development in non-English contexts, though the authors acknowledge current limitations in model scale (1.5B parameters) relative to commercial counterparts. Future directions include expanding the pretraining data and exploring alignment techniques to improve STEM performance.
Journal
Frontiers of Digital Education
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
An Open-Source Large Language Model for Chinese Education Research
Article Publication Date
20-Jun-2025