Keywords: [ low-resourced ] [ Lingua franca ] [ Latin based alphabet ]
South Sudan has 64 different ethnic groups, and probably more languages. English is the official language, but Juba Arabic functions as a lingua franca. Juba Arabic, however, is largely a spoken language and no standard orthography exists. The writings that do exist use a Latin based alphabet. The researchers, therefore, opted for Bari as a case study for a “low-resourced” languages because it is the language they are most familiar with. Machine learning technologies are increasingly being used in education, agriculture, medicine, engineering, bio-technology, and many other fields. Machine Translation, therefore, has many applications. But South Sudan, and other countries in the global south are severely under-represented and are largely not benefiting from new technological advancements. Our goal is a Bari language dataset. The dataset uses text as training data, primarily from the JW300 parallel corpus dataset (which includes 343 languages, one of which is Bari), supplemented with non-religious works so as to be more representative of different subject domains. This dataset could be useful in many ways. For example, uncovering implicit gender biases. Other applications are in language education, or scanning internet forums for hate speech.The process follows standard frameworks for creating language models for low-resource languages. That is collecting raw data, cleaning and documentation, annotating/tagging the data with linguistic tag. Following which comes evaluation procedures, performance measures (metrics). These can be used to create benchmarks (for a model) are made of on datasets and metrics, and a way to aggregate performance.