Keywords: distillation, word embedding
Word embeddings powered the early days of neural network-based NLP research, and their effectiveness in small-data regimes keeps them relevant in low-resource environments. However, they are limited in two critical ways: their memory requirement grows linearly with the number of tokens in the vocabulary, and they do not handle out-of-vocabulary tokens. In this work, we present a technique for distilling word embeddings into a convolutional neural network (CNN) using contrastive learning. The distilled model regresses the embedding of a token from its characters. Low-resource languages are the primary beneficiaries of this method; hence, we show the effectiveness of such a model on two morphologically complex Semitic languages and in a multilingual setting of 10 African languages. The resulting model uses drastically less memory and handles out-of-vocabulary tokens adequately.
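
To make the idea concrete, the following is a minimal forward-pass sketch of a character-level CNN scored against teacher embeddings with a contrastive objective. It is an illustration under stated assumptions, not the model evaluated in this work: the InfoNCE-style loss, the single convolution layer, the layer sizes, the temperature, the example tokens, and the random stand-in "teacher" vectors are all hypothetical, and gradient updates are omitted.

```python
import math
import random

random.seed(0)

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_DIM, EMB_DIM, KERNEL = 8, 16, 3  # hypothetical sizes, chosen for illustration


def rand_vector(size, scale):
    return [random.gauss(0.0, scale) for _ in range(size)]


# Student parameters: a character lookup table and one bank of convolution filters.
# Their size depends on the alphabet, not on the number of words in the vocabulary.
char_table = {c: rand_vector(CHAR_DIM, 0.1) for c in ALPHABET}
conv_w = [rand_vector(KERNEL * CHAR_DIM, 0.1) for _ in range(EMB_DIM)]


def encode(token):
    """Character CNN: embed characters, convolve over windows, max-pool over time."""
    pad = [[0.0] * CHAR_DIM] * (KERNEL // 2)
    chars = pad + [char_table.get(c, [0.0] * CHAR_DIM) for c in token] + pad
    # Each window is the concatenation of KERNEL consecutive character vectors.
    windows = [sum(chars[i:i + KERNEL], []) for i in range(len(chars) - KERNEL + 1)]
    feats = [[sum(w * x for w, x in zip(row, win)) for row in conv_w] for win in windows]
    return [max(col) for col in zip(*feats)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def info_nce(tokens, teacher, temperature=0.1):
    """Mean InfoNCE-style loss (an assumed choice of contrastive objective).

    Each token's CNN output should be closer to its own teacher vector than
    to the teacher vectors of the other tokens in the batch.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        student = encode(tok)
        logits = [cosine(student, teacher[t]) / temperature for t in tokens]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(tokens)


# Random vectors stand in for pretrained word embeddings (hypothetical data).
tokens = ["walk", "walked", "walking", "talk"]
teacher = {t: rand_vector(EMB_DIM, 1.0) for t in tokens}

loss = info_nce(tokens, teacher)
oov_vector = encode("walks")  # a token with no entry in the teacher table
```

Because `encode` depends only on characters, it returns a vector for any token, including ones absent from the teacher vocabulary, and its parameter count does not grow with the number of words. Training would minimize `info_nce` with respect to `char_table` and `conv_w`.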