ENLIVEN-1000: A Comprehensive Revitalization Framework for 1,000+ Endangered Languages via Broad-Coverage LID and LLM-Augmented MT
Abstract
We present ENLIVEN-1000, a unified framework for endangered and low-resource language revitalization that integrates broad-coverage language identification (LID), machine translation (MT), and LLM-generated synthetic data, with the aim of expanding safe, equitable NLP support for communities historically excluded from mainstream tools. We compile a text corpus covering 1,154 languages (1,069 of them endangered or low-resource) from public sources and train a fastText-based LID model over this set. The LID system achieves high detection quality, with F1 ≈ 0.99 and a false-positive rate of ≈ 3×10⁻⁶, substantially broadening reliable coverage beyond existing solutions. Focusing on five diverse endangered languages—Carpathian Romani, Chuj, Sunwar, Kapingamarangi, and Inuktitut—we fine-tune a 600M-parameter NLLB-200 model for translation. Our fine-tuned models outperform both zero-shot baselines and proxy models trained on related high-resource languages in both directions (endangered → English and English → endangered). We also use GPT-4o to generate synthetic parallel data, showing that augmenting limited real data with LLM-generated text yields substantial MT improvements. Together, these results illustrate a practical path to scaling NLP support to hundreds of under-resourced languages. We conclude with implications for language revitalization and an ethics discussion on working with endangered-language communities.