

Poster in Affinity Workshop: Black in AI Workshop

On the use of linguistic similarities to improve Neural Machine Translation for African Languages

Pascal Junior Tikeng Notsawo · Brice Yvan NANDA ASSOBJIO · James Assiene


Abstract:

In recent years, there has been a resurgence in research on empirical methods for machine translation. Most of this research has focused on high-resource European languages. Although around 30% of all languages spoken worldwide are African, African languages have been heavily under-investigated, partly because of the lack of public parallel corpora online. Furthermore, despite their large number (more than 2,000) and the similarities between them, there is currently no publicly available study on how to use this multilingualism (and the associated similarities) to improve the performance of machine translation systems on African languages. To address these issues, we propose a new dataset for African languages, drawn from a source that allows us to use and release it, which provides parallel data for vernaculars not present in commonly used datasets such as JW300. To exploit multilingualism, we first use a historical approach based on population migrations to identify similar vernaculars. We also propose a new metric to automatically evaluate similarities between languages. Unlike traditional methods, this metric does not require word-level parallelism, only paragraph-level parallelism. We then show that performing Masked Language Modeling and Translation Language Modeling, in addition to multi-task learning, on a cluster of similar languages leads to a strong boost in performance when translating individual pairs inside this cluster. In particular, we record an improvement of 29 BLEU on the Bafia-Ewondo pair using our approaches, compared to previous methods that did not exploit multilingualism in any way. Finally, we release the dataset and code of this work to ensure reproducibility and accelerate research in this domain.
