Affinity Workshop: Global South in AI

Tamil Kuthu: Method and Process for creating dataset for Madras Tamil

T Pranav · Suresh Lokiah

Keywords: [ machine language translation ] [ NLP ] [ Tamil Kuttu dataset ]

presentation: Global South in AI
Mon 28 Nov 12:30 p.m. PST — 4 p.m. PST


Language, in whatever form, is a fundamental prerequisite for human communication and interaction. Tamil is a classical language spoken primarily by Tamils in India, Sri Lanka, Malaysia, and Singapore, with minority communities in many other countries. The language has evolved to fit the cultural and socioeconomic groups of each region. One such evolution is Madras Tamil, which combines words from English, Telugu, and Hindi to form its own flavor and vocabulary. It exists primarily in spoken form among people with less formal education. With India being the home of Bollywood, this flavor of the language is now being infused into many movie songs, commonly referred to as Tamil Kuthu. Current AI systems and language translators only translate pure Tamil words and have not been developed to make sense of Madras Tamil (or Tamil Kuthu). This paper attempts to create a dataset for Tamil Kuthu that can later be leveraged to build a language model and translation systems. The text dataset would list each word with its meaning, derivation, variations, and source. As there are no automated methods, the researchers would gather data manually from the Web, including coding at least one Kuthu movie song. The dataset would be hosted and published on GitHub for further collaboration.
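The record layout described in the abstract (word, meaning, derivation, variations, source) could be represented along the following lines. This is only a minimal sketch and not the authors' published format: the class name `KuthuEntry`, the helper `entries_to_csv`, the CSV layout, and the `|` separator for variations are all assumptions made for illustration, and the sample values are placeholders rather than real dataset entries.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class KuthuEntry:
    """One hypothetical dataset row; field names follow the abstract."""
    word: str
    meaning: str
    derivation: str
    variations: List[str] = field(default_factory=list)
    source: str = ""


FIELDNAMES = ["word", "meaning", "derivation", "variations", "source"]


def entries_to_csv(entries: List[KuthuEntry]) -> str:
    """Serialize entries to CSV text, joining variations with '|' (an assumed convention)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for entry in entries:
        writer.writerow({
            "word": entry.word,
            "meaning": entry.meaning,
            "derivation": entry.derivation,
            "variations": "|".join(entry.variations),
            "source": entry.source,
        })
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values only; these are not real Madras Tamil entries.
    sample = KuthuEntry(
        word="<word>",
        meaning="<meaning>",
        derivation="<origin language>",
        variations=["<variant 1>", "<variant 2>"],
        source="<song or URL>",
    )
    print(entries_to_csv([sample]))
```

A flat CSV file is one plausible choice here because it renders directly on GitHub and is easy for collaborators to edit by hand; a JSON or TSV layout would serve equally well.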

