WayraPPL: Accelerating Curriculum Learning with Knowledge-Distilled Perplexity Models
Abstract
Curriculum learning has re-emerged as a key strategy for data-efficient LLM pre-training, but it faces a bottleneck: probabilistic sequence difficulty is computationally intensive to evaluate at web scale. Teacher-model perplexity provides a faithful signal yet is prohibitively expensive, while n-gram proxies are fast but poorly aligned with modern autoregressive objectives. We introduce WayraPPL (wayra means "wind" in Quechua), a 55M-parameter multilingual student trained to approximate teacher perplexity in English, Spanish, and Portuguese. Distilled from a 1B-parameter LLaMA-3 teacher, WayraPPL achieves a Spearman correlation of 0.94 with teacher perplexity while operating at approximately 10.5 times higher throughput and using a quarter of the RAM (2 GB vs. 8 GB). This enables annotating 100 million documents in 0.94 days instead of the 10.3 days required by the teacher. When used to construct curricula, WayraPPL improves pre-training compute–quality trade-offs compared to classical n-gram filtering (KenLM) and sequence-length ordering. WayraPPL also retains the teacher's difficulty signal with high reliability, avoiding the misalignments introduced by n-gram or length-based heuristics while remaining practical for large-scale deployment. We release model weights, training data, and system optimizations to democratize perplexity-driven curriculum learning for budget-constrained teams building LLMs.