Affinity Workshop: Women in Machine Learning

Graph-Transformer for Cross-lingual Plagiarism Detection

Oumaima Hourrane


The vast amounts of multilingual textual data on the internet lead to the cross-lingual plagiarism phenomenon that becomes a severe problem in different areas such as education, literature, and science. Cross-lingual plagiarism refers to plagiarism by translation. It is plagiarism where the source text is in one language while the plagiarized text is in another. Because of the increasing menace to the academic world from this kind of plagiarism, it has become crucial to find techniques to detect cross-lingual plagiarism. Current approaches come up with different methods to estimate the similarities; they usually employ syntactic and lexical properties, external Machine Translation systems, or similarities with a multilingual set of documents. However, most of these methods are conceived for literal plagiarism, such as copy and paste, and their performance is diminished when handling complex cases of plagiarism, including paraphrasing.In this work, we propose a graph-based approach that represents text fragments in different languages using multilingual knowledge graphs. An effective way to represent knowledge graphs is by using Graph Neural Networks, and they usually compute the representation of each node based on its neighboring nodes. However, this local propagation interrupts efficient global communication, which becomes problematic at larger graph sizes. Regarding this limitation, we put forward a new graph representation method based on the Transformer architecture that uses explicit relation encoding and provides a more efficient way for global graph representation. Experimental results in Arabic-English, French–English, and Spanish–English plagiarism detection indicate that our graph transformer approach outperforms the state-of-the-art cross-lingual plagiarism detection approaches. Moreover, it proves effective in paraphrasing plagiarism cases and provides exciting insights on the use of knowledge graphs on a language-independent model.

Chat is not available.