The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon · Lucile Saulnier · Thomas Wang · Christopher Akiki · Albert Villanova del Moral · Teven Le Scao · Leandro Von Werra · Chenghao Mou · Eduardo González Ponferrada · Huu Nguyen · Jörg Frohberg · Mario Šaško · Quentin Lhoest · Angelina McMillan-Major · Gerard Dupont · Stella Biderman · Anna Rogers · Loubna Ben allal · Francesco De Toni · Giada Pistilli · Olivier Nguyen · Somaieh Nikpoor · Maraim Masoud · Pierre Colombo · Javier de la Rosa · Paulo Villegas · Tristan Thrush · Shayne Longpre · Sebastian Nagel · Leon Weber · Manuel Muñoz · Jian Zhu · Daniel Van Strien · Zaid Alyafeai · Khalid Almubarak · Minh Chien Vu · Itziar Gonzalez-Dios · Aitor Soroa · Kyle Lo · Manan Dey · Pedro Ortiz Suarez · Aaron Gokaslan · Shamik Bose · David Adelani · Long Phan · Hieu Tran · Ian Yu · Suhas Pai · Jenny Chim · Violette Lepercq · Suzana Ilic · Margaret Mitchell · Sasha Alexandra Luccioni · Yacine Jernite

Tue Nov 29 02:00 PM -- 04:00 PM (PST) @ Hall J #1012

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

Author Information

Hugo Laurençon (Hugging Face)
Lucile Saulnier (Hugging Face)
Thomas Wang (Hugging Face)
Christopher Akiki (Leipzig University)
Albert Villanova del Moral (CNRS)
Teven Le Scao (Hugging Face)
Leandro Von Werra (ETHZ - ETH Zurich)
Chenghao Mou (Docusign, Inc.)
Eduardo González Ponferrada (Apple)
Huu Nguyen
Jörg Frohberg (Humboldt Universität Berlin)
Mario Šaško
Quentin Lhoest (Hugging Face)
Angelina McMillan-Major (University of Washington)
Gerard Dupont (Mavenoid)
Stella Biderman (EleutherAI)
Anna Rogers (University of Copenhagen)
Loubna Ben allal (Hugging Face)
Francesco De Toni (University of Western Australia)
Giada Pistilli (Sorbonne Université & CNRS)
Olivier Nguyen (Twitch)
Somaieh Nikpoor (Government)
Maraim Masoud (NA)

A recent graduate in Machine Learning.

Pierre Colombo (MICS CentraleSupelec)
Javier de la Rosa (National Library of Norway)
Paulo Villegas (Telefonica Research)
Tristan Thrush (Hugging Face)

I'm a research engineer at Hugging Face. Previously, I was a research associate at Facebook AI Research, supervised by Douwe Kiela and then Adina Williams. Before that, I was a research associate at MIT Brain and Cognitive Sciences, supervised by Roger Levy. I received my MEng in Computer Science with a concentration in Artificial Intelligence under Patrick Winston at the MIT Computer Science and Artificial Intelligence Lab. I also received my BS at MIT in Computer Science, with a minor in Linguistics and a minor in Math. While I was a student, I did research with the Perception Systems Group at NASA's Jet Propulsion Lab. My topics of research include natural language processing, dataset creation, model evaluation, and multimodal models.

Shayne Longpre (Massachusetts Institute of Technology)
Sebastian Nagel
Leon Weber (LMU Munich)
Manuel Muñoz
Jian Zhu (University of British Columbia)
Daniel Van Strien (British Library)
Zaid Alyafeai (King Fahad University of Petroleum and Minerals)
Khalid Almubarak
Minh Chien Vu (DETOMO Inc.)
Itziar Gonzalez-Dios (Universidad del País Vasco)
Aitor Soroa (University of the Basque Country. UPV/EHU.)
Kyle Lo (Allen Institute for AI)
Manan Dey (SAP)
Pedro Ortiz Suarez (Universität Mannheim)

I'm a postdoctoral researcher at the [Data and Web Science Group](https://www.uni-mannheim.de/dws/) at the [University of Mannheim](https://www.uni-mannheim.de/en/). I am interested in [large corpora](https://oscar-corpus.com) for training [language models](https://camembert-model.fr), especially for under-resourced languages and historical languages. I am interested in tasks such as Named Entity Recognition (NER), Dependency Parsing and Part-of-Speech tagging, Machine Translation and Document structuration. I love coffee, cookies and maths.

Aaron Gokaslan (Cornell University)
Shamik Bose
David Adelani (University College London)

I am a Research Fellow (DeepMind Academic Fellow) at University College London, UK.

Long Phan (VietAI)
Hieu Tran (VietAI Research)
Ian Yu (Groupby Inc)
Suhas Pai
Jenny Chim (Queen Mary University London)
Violette Lepercq (Hugging Face)
Suzana Ilic (Universität Innsbruck)
Margaret Mitchell (Hugging Face)
Sasha Alexandra Luccioni (Hugging Face)
Yacine Jernite (Hugging Face)

More from the Same Authors