An Autoencoder Approach to Learning Bilingual Word Representations
Sarath Chandar Anbil Parthipan · Stanislas Lauly · Hugo Larochelle · Mitesh Khapra · Balaraman Ravindran · Vikas C Raykar · Amrita Saha

Mon Dec 08 04:00 PM -- 08:59 PM (PST) @ Level 2, room 210D

Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.
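The core idea in the abstract — encode the bag-of-words of a sentence in one language and reconstruct the bag-of-words of the same sentence in both languages — can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's actual model: the vocabulary sizes, weight names, and the use of a squared-error reconstruction loss are all assumptions made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabularies and hidden size (illustrative only).
VOCAB_EN, VOCAB_DE, HIDDEN = 6, 6, 4

# Per-language encoder/decoder weights mapping into a shared hidden space.
W_enc_en = rng.normal(0, 0.1, (HIDDEN, VOCAB_EN))
W_enc_de = rng.normal(0, 0.1, (HIDDEN, VOCAB_DE))
W_dec_en = rng.normal(0, 0.1, (VOCAB_EN, HIDDEN))
W_dec_de = rng.normal(0, 0.1, (VOCAB_DE, HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, W_enc):
    # Map a bag-of-words vector to the shared hidden representation.
    return sigmoid(W_enc @ x)

def reconstruct(h, W_dec):
    # Decode the hidden representation back into a bag-of-words vector.
    return sigmoid(W_dec @ h)

def reconstruction_loss(x_en, x_de):
    # Four reconstruction paths over an aligned sentence pair:
    # within each language and across languages. The cross-language
    # terms (en->de, de->en) are what align the two word spaces
    # without requiring any word-level alignment.
    h_en, h_de = encode(x_en, W_enc_en), encode(x_de, W_enc_de)
    residuals = [
        reconstruct(h_en, W_dec_en) - x_en,  # en -> en
        reconstruct(h_en, W_dec_de) - x_de,  # en -> de
        reconstruct(h_de, W_dec_de) - x_de,  # de -> de
        reconstruct(h_de, W_dec_en) - x_en,  # de -> en
    ]
    return sum(float(np.sum(r ** 2)) for r in residuals)

# Toy aligned sentence pair as binary bag-of-words vectors.
x_en = np.array([1, 0, 1, 0, 0, 1], dtype=float)
x_de = np.array([0, 1, 1, 0, 1, 0], dtype=float)
print(reconstruction_loss(x_en, x_de))
```

Minimizing a loss of this shape over many aligned sentence pairs pushes words that co-occur across the two languages toward similar hidden representations, which is what makes the learned embeddings usable for cross-language text classification.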

Author Information

Sarath Chandar Anbil Parthipan (Mila, University of Montreal)
Stanislas Lauly (NYU)
Hugo Larochelle (Twitter)
Mitesh Khapra (IBM India Research Lab)
Balaraman Ravindran (Indian Institute of Technology Madras)
Vikas C Raykar (IBM Research)
Amrita Saha (IBM India Research Lab)
