The Global South in AI workshop will run for one hour. Join us in person or virtually and ask questions in the chat.
1. Keynote by Bonaventure Dossou (McGill University; previously Mila Institute and Google AI) on their research on building language technologies for African languages.
2. Keynote by Jazmia Henry on operationalizing bias reduction in post-production (AI needs to keep learning continuously: how do we get a chatbot to take our input when it fails to understand us?).
3. Design thinking workshop led by Sudha Jamthe (Stanford University) and Susanna Raj (AI4Nomads), live plus a hybrid Zoom session including two of our accepted authors, Ogbuokiri Blessing (York University) and Ricardo Alanis (CodeandOMexico).
Join us as Global South innovators to arrive at actionable insights about common problems, potential solutions, and shared resources: open datasets for translation, transliteration, language model generation, benchmarks, and speech recognition for Global South languages. For more information, please visit the Global South in AI website.
Don't miss our posters at the joint poster session following this workshop at 4:30 p.m. in Hall J, featuring our 12 authors, selected through double-blind peer review and representing 8 languages and 12 countries.
Mon 12:30 p.m. - 12:35 p.m. | Kickoff (Intro)
Introduction to Global South in AI, a new affinity group focused on Language AI. See the list of 12 posters selected through Global South in AI's double-blind peer-review process. Get ready for the keynote and design thinking workshop that follow.
Sudha Jamthe · Susanna Raj
Mon 12:35 p.m. - 12:55 p.m. | Keynote: Bonaventure Dossou on Building Efficient Language Technologies for African Languages (Keynote with audience Q&A)
Research readout: AfroLM, a self-active learning-based multilingual pretrained language model. Hear from Bonaventure Dossou, co-founder of Lanfrica, former researcher at Google AI and the Mila Institute, and a research leader in Language AI inclusiveness, on their work and a research readout on AfroLM. Read the full paper at https://arxiv.org/abs/2211.03263 and come with questions.
Bonaventure F. P. Dossou
Mon 12:55 p.m. - 1:15 p.m. | Keynote: Jazmia Henry on operationalizing bias reduction in post-production (Keynote with audience Q&A)
Jazmia Henry will share her paper and research on keeping AI systems learning by showing us how to operationalize bias reduction in post-production for chatbots and conversational agents. Jazmia was one of our area chairs in the Global South in AI double-blind peer-review process and helped select the 12 authors who will present in the poster sessions, some in person at 4 p.m. CET following the workshop in Hall J and some in the virtual poster session on Dec 5th (check the schedule and join on Topia).
Jazmia Henry
Mon 1:10 p.m. - 1:30 p.m. | Design Thinking Workshop on Language AI (Discussion with audience in breakout groups)
Design thinking workshop led by Sudha Jamthe (Stanford University) and Susanna Raj (AI4Nomads), live plus a hybrid Zoom session including two of our accepted authors, Ogbuokiri Blessing (York University) and Ricardo Alanis (CodeandOMexico). Come find out how to solve local problems and build datasets for low-resource languages.
Sudha Jamthe · Pariya Sarin · Blessing Ogbuokiri
Mon 2:30 p.m. - 4:00 p.m. | Affinity Joint Poster Session (Poster Session)
Come join us in Hall J for the Joint Social Affinity Poster Sessions. Program chairs Susanna Raj, Sudha Jamthe, and Pariya Sarin will be there presenting: 1. Joint poster, Global South in AI: Making Language AI Inclusive - our three key learnings from organizing Global South in AI and how to make Language AI inclusive of low-resource languages from Global South countries. 2. Inclusive AI Construct: a concept for making language AI equitable across all languages, with a focus on defining gender across languages; a brainchild of Susanna Raj applied to Global South in AI.
Sudha Jamthe · Susanna Raj · Pariya Sarin
Mon 12:00 p.m. - | Virtual Affinity Poster Session: 12 authors selected by double-blind peer review will present their posters (Topia Poster Session)
The Virtual Affinity Poster Session will be held on Monday 5 Dec at 2 p.m. EDT (Tuesday 6 Dec for far eastern time zones; check the link for your local time). Yashaswini Viswanath (AI researcher from Bangalore and a Program Chair of Global South in AI) will lead the virtual poster session, opening with a summary poster followed by the 12 authors presenting their posters.
Yashaswini Viswanath · Yine Nyika · Chinmayi Ramasubramanian · Chandra Sekhar Gupta Aravapalli · Nishant Kumar · Ali Hussein · Klinsman Nang'ore Agam · Ivan Vladimir Meza Ruiz · Sundaraparipurnan Narayanan · T Pranav · Smital Lunawat · Smrithi Suresh · Suresh Lokiah · Sanjana Melkote · Ricardo Alanis Tamez · Rdouan Faizi · Youssef Hmamouche · Amal El Fallah-Seghrouchni · Amanuel Mersha · Muskan Mahajan · Blessing Ogbuokiri · Zahra Movahedi · Bruce Mellado · Jiahong Wu · James Orbinski · Jude Kong · Constance Muthuri
- | Machine Translation Model for Bari Language (Poster)
South Sudan has 64 different ethnic groups, and probably more languages. English is the official language, but Juba Arabic functions as a lingua franca. Juba Arabic, however, is largely a spoken language and no standard orthography exists; the writings that do exist use a Latin-based alphabet. The researchers therefore opted for Bari as a case study of a low-resourced language because it is the language they are most familiar with. Machine learning technologies are increasingly being used in education, agriculture, medicine, engineering, bio-technology, and many other fields, so machine translation has many applications. But South Sudan and other countries in the Global South are severely under-represented and are largely not benefiting from new technological advancements. Our goal is a Bari language dataset. The dataset uses text as training data, primarily from the JW300 parallel corpus (which includes 343 languages, one of which is Bari), supplemented with non-religious works so as to be more representative of different subject domains. This dataset could be useful in many ways, for example in uncovering implicit gender biases; other applications are in language education or scanning internet forums for hate speech. The process follows standard frameworks for creating language models for low-resource languages: collecting raw data, cleaning and documentation, and annotating/tagging the data with linguistic tags. This is followed by evaluation procedures and performance measures (metrics), which, together with the datasets and a way to aggregate performance, can be used to create benchmarks for a model.
Yine Nyika
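The abstract above outlines a standard low-resource pipeline: collect raw aligned text (primarily JW300), clean and document it, then annotate and evaluate. A minimal sketch of the cleaning-and-pairing step is below, assuming hypothetical line-aligned files `bari.txt` and `english.txt`; the authors' actual extraction and filtering steps are not described in detail, so this is illustrative only.

```python
import csv
import re

def clean(line: str) -> str:
    """Normalize whitespace; real pipelines would also strip verse/reference markers."""
    return re.sub(r"\s+", " ", line).strip()

# Line-aligned source files are an assumption; JW300 ships as aligned sentence pairs.
with open("bari.txt", encoding="utf-8") as f_bari, \
     open("english.txt", encoding="utf-8") as f_eng, \
     open("bari_english_parallel.tsv", "w", newline="", encoding="utf-8") as f_out:
    writer = csv.writer(f_out, delimiter="\t")
    writer.writerow(["bari", "english"])
    for bari_line, eng_line in zip(f_bari, f_eng):
        bari_clean, eng_clean = clean(bari_line), clean(eng_line)
        # Drop empty or wildly length-mismatched pairs -- a crude first-pass filter.
        if bari_clean and eng_clean and 0.3 < len(bari_clean) / len(eng_clean) < 3.0:
            writer.writerow([bari_clean, eng_clean])
```

A crude length-ratio filter like this is a common first pass for parallel corpora; a fuller pipeline would add deduplication and domain tags for the non-religious supplements.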
- | Machine Translation Model for Bari Language (Oral)
Yine Nyika
- | Mitigation of Gender Bias in NLP of Marathi Language (Poster)
A general human understanding of a teacher or homemaker being female and a professor or doctor being male has been prevalent for ages. This idea falls into the trap of gender roles that have often been unofficially defined without acknowledging that they have no concrete basis. The same analogy exists in the word embeddings studied for NLP models. Natural Language Processing (NLP) is a subset of AI that allows systems to understand spoken language and interpret it the way human beings do. Systematic research in NLP on overcoming this type of social (gender) bias is ongoing, yet there are not enough systems that can identify such prejudice to give a fair result. This paper focuses on the stereotypes present in the Marathi language and on properly training the dataset using the paired pronouns in Marathi (तो/ती/ते) and gender-neutral terms (सहभागी). Training systems not to infer gender roles unless they are defined is a major way to eliminate gender bias from written texts. A truly unbiased dataset would be possible by giving representation to individuals belonging to different demographic groups where there is a slight change in the way a language is spoken or interpreted.
Smital Lunawat · Yashaswini Viswanath
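The abstract mentions training on the paired Marathi pronouns तो/ती/ते and gender-neutral terms such as सहभागी. One common way to operationalize this is counterfactual data augmentation, where gendered pronouns are swapped so both variants appear equally often; the sketch below illustrates that idea and is not necessarily the authors' exact method.

```python
# Minimal counterfactual augmentation sketch for Marathi text:
# duplicate each sentence with the gendered pronouns तो (he) and ती (she) swapped,
# so the training data sees both variants equally often.
PRONOUN_SWAP = {"तो": "ती", "ती": "तो"}

def counterfactual(sentence: str) -> str:
    tokens = sentence.split()
    return " ".join(PRONOUN_SWAP.get(tok, tok) for tok in tokens)

def augment(corpus: list[str]) -> list[str]:
    augmented = []
    for sent in corpus:
        augmented.append(sent)
        flipped = counterfactual(sent)
        if flipped != sent:            # only add when a pronoun actually changed
            augmented.append(flipped)
    return augmented

print(augment(["ती डॉक्टर आहे"]))   # -> ['ती डॉक्टर आहे', 'तो डॉक्टर आहे']
```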
- | Mitigation of Gender Bias in NLP of Marathi Language (Oral)
Smital Lunawat · Yashaswini Viswanath
- | Translating Bharatanatyam Sign Language Into Text (Poster)
Bharatanatyam is a traditional Indian classical dance dating back to the second century. As an avid Bharatanatyam dancer for the past 16 years, I've come to understand dance and choreography to such an extent that some may even say I've learned a different language. Bharatanatyam is a combination of hand gestures (known as mudras) and facial expressions that form a type of sign language. Although many dance connoisseurs know how to read that language, showcasing Bharatanatyam to a global audience comes with its struggles, because viewers may never have seen the art form before. During performances it becomes clear that audiences new to the dance have a hard time following the storyline, prompting a look for a solution to this language barrier between artists and their audience. The proposed project is to build a dataset of specific hand gestures found in Bharatanatyam and their English interpretations, which could be leveraged to create an AI model that interprets the gestures' meaning, hopefully making Bharatanatyam a more understandable art form. The dataset will be made available on GitHub so others can contribute by submitting a picture and its correlating definition. To ensure the data is of good quality, we will take high-quality photos of experienced professional dancers, submitted with an associated description such as the pattern, fingers used, etc. By stringing pictures into a storyline, we are able to translate ancient Indian drama and mythology into transmissible stories. And it doesn't have to stop with Bharatanatyam: multiple other forms of storytelling exist in our world, and this would make it easier to connect different cultures and histories.
Smrithi Suresh · Suresh Lokiah · Sanjana Melkote
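Since contributions to the GitHub dataset are meant to be a photo plus an associated description (pattern, fingers used, etc.), a simple record schema makes submissions consistent. The sketch below shows one possible layout; the field names and the example entry are assumptions, not the authors' published schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class MudraEntry:
    """One contributed record in a hypothetical mudra dataset (field names are assumptions)."""
    image_path: str              # path to the high-quality photo of the gesture
    mudra_name: str              # e.g. "Pataka"
    english_meaning: str         # interpretation used when stringing gestures into a storyline
    fingers_used: str            # free-text description of the hand pattern
    dancer_experience_years: int

entry = MudraEntry(
    image_path="images/pataka_01.jpg",
    mudra_name="Pataka",
    english_meaning="flag; illustrative gloss only",
    fingers_used="all fingers extended and held together, thumb bent",
    dancer_experience_years=16,
)

# Contributions could be collected as JSON Lines so the dataset grows one record per pull request.
with open("mudras.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")
```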
- | Translating Bharatanatyam Sign Language Into Text (Oral)
Smrithi Suresh · Suresh Lokiah · Sanjana Melkote
- | Towards testing effectiveness of a Spanish Sarcasm Model in different Regions (Poster)
Language is vast, and the expressions found throughout the globe are as diverse as its cultural richness. With new interfaces that require understanding and generating responses to provide services, models are expected to be resilient and generalizable to be helpful. However, real models, especially ones formulated for text problems, are known to be brittle and difficult to generalize. Several hypotheses exist for improving such models, such as introducing more inductive bias, adding more context, and evaluating on unseen distributions and tasks. We aim to dissect the problem of generalization in sarcasm classification by evaluating texts from different regions of Latin America with different regional varieties of Spanish. This is done with a model and human annotation to assess the agreement between the two measurements. Results are then explored along the three dimensions of potential improvement mentioned above, providing a guideline for the next steps in improving the model's resilience.
Ricardo Alanis Tamez · Ivan Vladimir Meza Ruiz
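The abstract's evaluation hinges on measuring agreement between the model and human annotators across regions. Cohen's kappa is a standard chance-corrected agreement statistic for this; the sketch below uses scikit-learn with made-up labels and is illustrative rather than the paper's actual evaluation code.

```python
from sklearn.metrics import cohen_kappa_score, classification_report

# Hypothetical binary sarcasm labels for the same tweets from two sources:
# human annotators from one region and the model's predictions.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_preds  = [1, 0, 0, 1, 0, 1, 1, 0]

# Cohen's kappa corrects raw agreement for chance; values near 0 mean chance-level agreement.
kappa = cohen_kappa_score(human_labels, model_preds)
print(f"model-human agreement (Cohen's kappa): {kappa:.2f}")

# Per-class precision/recall helps see whether disagreements concentrate
# on the sarcastic class for a given regional variety.
print(classification_report(human_labels, model_preds, target_names=["not sarcastic", "sarcastic"]))
```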
- | Towards testing effectiveness of a Spanish Sarcasm Model in different Regions (Oral)
Ricardo Alanis Tamez · Ivan Vladimir Meza Ruiz
- | Automated Misinformation: Mistranslation of news feed using multi-lingual translation systems in Facebook (Poster)
Machine translations have evolved over the past decade and are increasingly used in multiple applications. Translation models increasingly aim to be multilingual, enabling translation across hundreds of languages, including many low-resource languages (e.g. Facebook's No Language Left Behind can translate text from and to 200 languages). Facebook, in its announcement, also mentioned that NLLB would support 25 billion translations served daily on Facebook News Feed, Instagram, and other platforms. A Facebook user receives auto-translated (machine-translated) content in their news feed based on the language setting and translation preferences set on the platform. Multilingual translation models are not free from errors. These errors are typically caused by a lack of adequate context or domain-specific words, ambiguity or sarcasm in the text, incorrect dialect, missing words, transliteration instead of translation, incorrect lexical choice, and differences in grammatical properties between languages. Such errors may, on occasion, lead to misinformation about the translated text. This paper examines instances of misinformation caused by mistranslations from English to Tamil in the Facebook news feed. For the purposes of the research, categories of news headlines were collected from multiple sources, including (a) general news headlines from a Kaggle dataset (30 samples), (b) sarcastic news headlines from Kaggle (10 samples), (c) domain-specific news headlines from Wired (10 samples), and (d) ambiguous headlines from a linguistics page (15 samples). News headlines in each of these datasets were filtered for politics as a topic, given the potential impact misinformation on it may cause. From the filtered headlines, samples were randomly selected for translation. Translations were produced using NLLB: a test code was created in Google Colab with a pre-trained NLLB model (available on Hugging Face). The translations were evaluated for mistranslations. Incomplete translations were eliminated (~27%) and translations that conveyed a complete meaning (~73%) were examined for misinformation. A translation was classified as misinformation if it gives false information in whole or in part of the news headline. For instance, "Trump
Sundaraparipurnan Narayanan
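The paper's test setup is described as a Google Colab notebook using a pre-trained NLLB model from Hugging Face to translate English headlines to Tamil. A minimal sketch of that step is below; the specific checkpoint (the distilled 600M model) and generation settings are assumptions, and the headlines are placeholders rather than samples from the paper's datasets.

```python
from transformers import pipeline

# facebook/nllb-200-distilled-600M is the smallest public NLLB checkpoint;
# the paper's exact checkpoint and generation settings are not stated, so this is an assumption.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",   # NLLB uses FLORES-200 language codes
    tgt_lang="tam_Taml",
)

headlines = [
    "Senate passes the bill after a late-night session",
    "Minister denies reports of a cabinet reshuffle",
]
for headline in headlines:
    tamil = translator(headline, max_length=128)[0]["translation_text"]
    print(f"{headline}\n  -> {tamil}\n")
```

The translated output would then be compared against the original headline's meaning and flagged when the Tamil version conveys false information in whole or in part.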
- | Automated Misinformation: Mistranslation of news feed using multi-lingual translation systems in Facebook (Oral)
Sundaraparipurnan Narayanan
- | Tamil Kuthu: Method and Process for creating dataset for Madras Tamil (Poster)
Language, in whatever form, is a fundamental prerequisite for human civilization to communicate and interact. Tamil is a classical language primarily spoken by Tamils in India, Sri Lanka, Malaysia, and Singapore, with minority groups in many other countries. The language has evolved to fit the cultural and socio-economic groups in its regions. One such evolution is Madras Tamil. It combines words from English, Telugu, and Hindi to form its own flavor and vocabulary, and exists primarily in spoken form among people with a lower education background. With India being the Bollywood capital, this language flavor is now being infused into many movie songs, commonly referred to as Tamil Kuthu. Current AI systems and language translators only perform translations for pure Tamil words and have not developed to make sense of Madras Tamil (or Tamil Kuthu). This paper attempts to create a dataset for Tamil Kuthu, which can later be leveraged to build language models and translations. The text dataset would list each word, its meaning, derivation, variations, and source. As there are no automation methods, the researchers will manually gather data from the Web, including coding from at least one Kuthu movie song. The dataset will be hosted and published on GitHub for further collaboration.
T Pranav · Suresh Lokiah
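The proposed text dataset lists each word with its meaning, derivation, variations, and source. A sketch of that record layout as a CSV is below; the column names mirror the abstract, while the example row and file name are illustrative assumptions.

```python
import csv

# Columns mirror the fields listed in the abstract; the example row is illustrative only.
FIELDS = ["word", "meaning", "derivation", "variations", "source"]

rows = [
    {
        "word": "machi",
        "meaning": "buddy, close friend (informal address)",
        "derivation": "Madras Tamil colloquialism",
        "variations": "machaan",
        "source": "manually collected from the Web",
    },
]

with open("madras_tamil_lexicon.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```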
- | Tamil Kuthu: Method and Process for creating dataset for Madras Tamil (Oral)
T Pranav · Suresh Lokiah
- | A deep learning Approach for the automatic detection of clickbait in Arabic (Poster)
With the advent of technology, everything has become digitized, including newspapers and magazines. Information is now accessible in an easy and fast manner. However, some content creators exploit this opportunity negatively by using unethical methods to attract users' attention, with the objective of increasing their ad income rather than providing trusted information. To address this clickbait phenomenon, we propose several approaches based on natural language processing and deep learning models to detect this type of content in Arabic. The results show that a fine-tuned BERT model combined with an attached neural network layer or with a self-attention network provides similar performance in terms of accuracy, 91.86% and 91% respectively, compared to RoBERTa, Word2vec, and TF-IDF with CNN, LSTM, and neural network classifiers. The collected dataset is sourced from multiple Arabic websites.
Jihad R'baiti · Rdouan Faizi · Youssef Hmamouche · Amal El Fallah-Seghrouchni
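The best-performing setup in the abstract is a fine-tuned BERT model with an attached classification layer. The sketch below shows what such fine-tuning could look like with the Hugging Face Trainer; the AraBERT checkpoint, hyperparameters, and two-row toy dataset are assumptions, not the authors' configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Checkpoint and hyperparameters are assumptions; the paper does not name its exact BERT variant here.
checkpoint = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny illustrative dataset: 1 = clickbait headline, 0 = regular headline.
data = Dataset.from_dict({
    "text": ["لن تصدق ما حدث بعد ذلك!", "البنك المركزي يرفع أسعار الفائدة"],
    "label": [1, 0],
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clickbait-bert", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```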
- | A deep learning Approach for the automatic detection of clickbait in Arabic (Oral)
Jihad R'baiti · Rdouan Faizi · Youssef Hmamouche · Amal El Fallah-Seghrouchni
- | DistillEmb: Distilling Word Embeddings via Contrastive Learning (Poster)
Word embeddings powered the early days of neural network-based NLP research. Their effectiveness in small data regimes makes them still relevant in low-resource environments. However, they are limited in two critical ways: memory requirements that grow linearly with the number of tokens, and out-of-vocabulary token handling. In this work, we present a technique for distilling word embeddings into a CNN network using contrastive learning. This method allows embeddings to be regressed given the characters of a token. Low-resource languages are the primary beneficiaries of this method, and hence we show the effectiveness of such a model on two morphologically complex Semitic languages and in a multilingual setting of 10 African languages. The resulting model requires drastically less memory and handles out-of-vocabulary tokens well.
Amanuel Mersha
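The core idea is to regress a token's pretrained embedding from its characters with a CNN, so unseen tokens still receive vectors. The PyTorch sketch below captures that idea in simplified form; it uses a plain cosine-distance objective and random "teacher" vectors for brevity, whereas the paper uses a contrastive loss and real pretrained embeddings.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    """Regress a word embedding from a token's characters (simplified sketch)."""
    def __init__(self, n_chars=256, char_dim=32, out_dim=300, max_len=20):
        super().__init__()
        self.max_len = max_len
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, 128, kernel_size=3, padding=1)
        self.proj = nn.Linear(128, out_dim)

    def encode(self, token: str) -> torch.Tensor:
        ids = [min(ord(c), 255) for c in token[: self.max_len]]
        ids += [0] * (self.max_len - len(ids))                   # pad to fixed length
        x = self.char_emb(torch.tensor([ids]))                   # (1, L, char_dim)
        h = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values  # max-pool over positions
        return self.proj(h).squeeze(0)                           # (out_dim,)

model = CharCNNEmbedder()
teacher = {"water": torch.randn(300), "waters": torch.randn(300)}  # stand-in for pretrained embeddings
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Distillation loop: pull the student's character-level embedding toward the teacher's vector.
# The paper uses a contrastive objective; plain cosine distance keeps the sketch short.
for _ in range(100):
    loss = sum(1 - torch.cosine_similarity(model.encode(w), v, dim=0) for w, v in teacher.items())
    opt.zero_grad()
    loss.backward()
    opt.step()

print(model.encode("watery").shape)   # an unseen token still gets a 300-d embedding
```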
- | DistillEmb: Distilling Word Embeddings via Contrastive Learning (Oral)
Amanuel Mersha
- | BELA: Bot for English Language Acquisition (Poster)
Our paper introduces BELA, the Bot for English Language Acquisition, an application of conversational agents (chatbots) for Hindi-speaking youth. BELA is developed for young underprivileged students at an Indian non-profit called Udayan Care. Hinglish, a way of writing Hindi words using English letters, is common among 350 million speakers in India (https://www.milestoneloc.com/guide-to-hinglish-language/); BELA's natural language understanding pipeline supports Hindi and Hinglish utterances by using a language identifier, an Indic-language transliterator, and a translator (https://pypi.org/project/google-transliteration-api/, https://huggingface.co/salesken/translation-hi-en). BELA has two modes: a retrieval-based ‘tutor’ mode to facilitate question-answering on classic English tasks like word meanings, translations, and reading comprehension, and a generative ‘buddy’ mode to facilitate open-domain chit-chat on general topics like movies, food, and school. Our dialogue management system routes user utterances between the two modes using a binary classifier. Three tenets have governed the design of BELA: support for Hindi utterances, reliability of answers to learners' queries, and graceful failure. We ensure that responses from BELA are accurate and reliable by using tested translation and thesaurus APIs (https://developer.oxforddictionaries.com/). The challenges in developing BELA included a lack of data for intent classification and dialogue management, and the lack of a database of reading passages and English videos levelled by learner proficiency (CEFR); we solved these by creating a custom dataset with text-augmentation techniques and building a CEFR level predictor for English passages scraped from the Web. Our future work will focus on extending BELA's support to more English learning tasks and using the mentees' Hinglish messages to adapt the transliterator pipeline to the mentees' regional variations of Hinglish.
Muskan Mahajan
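BELA's dialogue manager routes each (transliterated and translated) utterance to either the retrieval-based tutor mode or the generative buddy mode via a binary classifier. The sketch below shows only that routing skeleton, with keyword-based stubs standing in for the trained classifier, the transliteration/translation step, and the two responders.

```python
from typing import Callable

# Stubs standing in for BELA's real components (language ID, transliteration/translation,
# the mode classifier, and the two responders); only the routing logic is sketched here.
def is_tutor_query(utterance_en: str) -> bool:
    """Binary router stub: True -> tutor mode (word meanings, translations, comprehension)."""
    tutor_cues = ("meaning", "translate", "what is", "define")
    return any(cue in utterance_en.lower() for cue in tutor_cues)

def to_english(utterance: str) -> str:
    """Placeholder for the Hinglish/Hindi -> English transliteration and translation step."""
    return utterance

def tutor_answer(utterance_en: str) -> str:
    return f"[tutor] Looking up: {utterance_en!r}"

def buddy_reply(utterance_en: str) -> str:
    return f"[buddy] Chatting about: {utterance_en!r}"

def respond(utterance: str) -> str:
    utterance_en = to_english(utterance)
    handler: Callable[[str], str] = tutor_answer if is_tutor_query(utterance_en) else buddy_reply
    return handler(utterance_en)

print(respond("what is the meaning of resilient"))   # routed to tutor mode
print(respond("I watched a great movie yesterday"))  # routed to buddy mode
```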
- | BELA: Bot for English Language Acquisition (Oral)
Muskan Mahajan
- | Identifying COVID-19 Vaccine Hesitancy Hotspots in Nigeria: Analysis of Social Media Posts (Poster)
One of the major challenges faced by health policymakers in the fight against community-based infectious diseases, such as COVID-19, malaria, monkeypox, and Marburg, is vaccine hesitancy. In Nigeria, Twitter is one of the social media platforms used to promote anti-vaccination posts. Anti-vaccination posts or reactions on Twitter can compromise community confidence or willingness to take a vaccine during an outbreak. In this research, we collected 10,000 vaccine-related geotagged Twitter posts in Nigeria, from December 2020 to February 2022, to identify hotspots by clustering tweet sentiments. We used the pre-trained Natural Language Processing model VADER to classify the tweets into three sentiment classes (positive, negative, and neutral). The outputs were validated using machine learning classification algorithms, including Naïve Bayes with an accuracy of 66%, Logistic Regression (71%), Support Vector Machines (65%), Decision Tree (61%), and K-Nearest Neighbour (56%). Average Area Under the Curve scores of 78%, 85%, 83%, 67%, and 63%, respectively, were used to evaluate the quality of the multi-class outputs. The classified sentiments were visualised on a map of Nigeria using ArcGIS Online, with a point-based location technique used to calculate the hotspots; green, red, and grey identify the dominance of positive, negative, and neutral sentiments. The outcome of this research shows that social media data can be used to complement existing data in identifying hotspots during an outbreak, and to inform health policy in managing vaccine hesitancy.
Blessing Ogbuokiri · Ali Ahmadi · Zahra Movahedi · Bruce Mellado · Jiahong Wu · James Orbinski · Ali Asgary · Jude Kong
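The sentiment step uses VADER to place each tweet into positive, negative, or neutral classes. A minimal sketch with the vaderSentiment package is below, using the commonly recommended ±0.05 compound-score cut-offs; the paper's exact thresholds and preprocessing are not stated, and the tweets shown are invented examples.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(tweet: str) -> str:
    """Map VADER's compound score to three classes using the commonly used +/-0.05 cut-offs."""
    score = analyzer.polarity_scores(tweet)["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

tweets = [
    "Got my vaccine today, feeling grateful and safe!",
    "I don't trust this vaccine at all, nobody should take it.",
    "Vaccination centre opens at 9am tomorrow.",
]
for t in tweets:
    print(classify(t), "-", t)
```

In the study, the class labels would then be joined with each tweet's geotag for hotspot mapping in ArcGIS Online.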
- | Identifying COVID-19 Vaccine Hesitancy Hotspots in Nigeria: Analysis of Social Media Posts (Oral)
Blessing Ogbuokiri · Ali Ahmadi · Zahra Movahedi · Bruce Mellado · Jiahong Wu · James Orbinski · Ali Asgary · Jude Kong
- | Transformer Based Kenyan Election Misinformation and Hatespeech monitoring (Poster)
Kenyan presidential elections are a tense and problematic time, with documented cases of voter-directed social media manipulation campaigns and incidents of post-election violence during election season. We build and test a dashboard to monitor 2022 Kenyan election-related content on Twitter, utilizing a BERT-derived pre-trained model for hate speech detection, an XLM-T pre-trained transformer for sentiment analysis, a balanced random forest trained to detect Twitter bots, and a pre-trained "bag of tricks" model for language identification. A "bag of tricks" model is an optimized linear classifier that is comparable in accuracy to deep learning models while being orders of magnitude more efficient. These models can then be used to generate hourly and daily reports on hate speech, bot activity, and candidate sentiment on Twitter. Implementing and deploying the dashboard efficiently with low resources was a primary focus, although deployment was not possible due to election time constraints and deployment costs. The open-source code of the dashboard is provided to allow for easy and cost-effective replication and adaptation to closely related domains, with modifications implemented to allow for cost-effective deployment. It can act as an early warning system for stakeholders and policymakers to take prompt action in the case of misinformation and hate speech propagation.
Ali Hussein · Klinsman Nang'ore Agam · Constance Muthuri
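One building block of the dashboard is XLM-T-based sentiment analysis of tweets. The sketch below runs the publicly released Cardiff NLP XLM-T sentiment checkpoint through a Hugging Face pipeline; whether the dashboard uses this exact checkpoint, and how it aggregates scores into hourly reports, are assumptions, and the tweets are invented examples.

```python
from transformers import pipeline

# The XLM-T sentiment checkpoint below is the publicly released Cardiff NLP model;
# whether the dashboard uses this exact checkpoint is an assumption.
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

tweets = [
    "Turnout at my polling station was huge and peaceful today.",
    "These results are fake, the commission is lying to us!",
]
for tweet, result in zip(tweets, sentiment(tweets)):
    # Each result has a label (negative/neutral/positive) and a confidence score,
    # which a dashboard could aggregate into hourly candidate-sentiment reports.
    print(f"{result['label']:>8}  {result['score']:.2f}  {tweet}")
```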
- | Transformer Based Kenyan Election Misinformation and Hatespeech monitoring (Oral)
Ali Hussein · Klinsman Nang'ore Agam · Constance Muthuri
- | Novel method to preserve Sanskrit Shloka heritage using Transformer Language models and Semantic similarity (Oral)
Sanskrit is an ancient and classical Indo-Aryan language that has existed in South Asia since the Bronze Age. It is the sacred language of Hinduism and was used in ancient Vedic scriptures, Hindu philosophy, literature, mythological epics, and historical texts. Sanskrit's impact on India's culture is well known, but despite efforts at revival there are no first-language speakers of Sanskrit today. However, Sanskrit is still prevalent in many traditions of Hinduism, Jainism, and Buddhism in the form of the chanting of shlokas. The word shloka means 'song'; a shloka is a couplet of Sanskrit verses repeated for spiritual benefit. Recent research into the effects of chanting has identified a variety of benefits, including gaining peace, feeling calm, and becoming more focused with positive energy. There are specific shlokas for specific benefits such as removing obstacles, health, knowledge, and happiness. Unfortunately, the knowledge of which shloka is meant for which purpose is mostly not well documented and is passed on by word of mouth by elders in the family; for example, a person with health issues consults family priests or grandparents to know which shloka to chant to bring about healing. The methodology employed here consists of data collection, transformation, and building an API to compute semantic similarity. Sanskrit shloka samples and their corresponding benefits in English were collected from sources and stored in a database; currently no database maps a Sanskrit shloka to its benefit (existing databases offer translations of shlokas, but none map a shloka to its benefit). A benefit can have multiple shlokas associated with it (e.g. विद्या or ज्ञान related shlokas for "knowledge"), and some shlokas relate to more than one benefit, such as both "learning" and "knowledge". Embeddings of each of the benefits are computed using BERT, a pre-trained Transformer language model, and stored in the database. An API takes the benefit the user wants to gain as input (e.g. "knowledge") and maps it against the database: semantic similarity is computed using cosine similarity, and the system recommends one or more suitable shlokas (e.g. विद्या ददाति विनयं विनयाद् याति पात्रताम्।). In the future, inputs in different languages can be supported. Thus, this heritage can be preserved for many generations.
Chinmayi Ramasubramanian · Chandra Sekhar Gupta Aravapalli · Nishant Kumar
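The recommendation step embeds each benefit with a transformer model and picks the shloka whose benefit is most cosine-similar to the user's query. A minimal sketch with sentence-transformers is below; the checkpoint, the tiny benefit-to-shloka mapping (beyond the विद्या example quoted in the abstract), and the API shape are assumptions rather than the authors' implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative benefit -> shloka mapping; the paper's database is larger and curated from sources.
benefit_to_shloka = {
    "knowledge": "विद्या ददाति विनयं विनयाद् याति पात्रताम्।",
    "removing obstacles": "वक्रतुण्ड महाकाय सूर्यकोटि समप्रभ।",
    "health": "सर्वे भवन्तु सुखिनः सर्वे सन्तु निरामयाः।",
}

# The exact transformer checkpoint is not stated in the abstract; a multilingual
# sentence encoder is assumed here so English queries embed well.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
benefits = list(benefit_to_shloka)
benefit_embeddings = model.encode(benefits, convert_to_tensor=True)

def recommend(query: str) -> str:
    """Return the shloka whose benefit is most semantically similar to the query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, benefit_embeddings)[0]   # cosine similarity to each benefit
    best = benefits[int(scores.argmax())]
    return benefit_to_shloka[best]

print(recommend("learning"))   # expected to map to the "knowledge" shloka
```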
- | Novel method to preserve Sanskrit Shloka heritage using Transformer Language models and Semantic similarity (Poster)
Chinmayi Ramasubramanian · Chandra Sekhar Gupta Aravapalli · Nishant Kumar