Cross-Lingual Word Embeddings with Universal Concepts and their Applications

dc.contributor: Gray, Jeff
dc.contributor: Galloway, Michael
dc.contributor: Hong, Xiaoyan
dc.contributor.advisor: Vrbsky, Susan
dc.contributor.advisor: Musaev, Aibek
dc.contributor.author: Sheinidashtegol, Pezhman
dc.contributor.other: University of Alabama Tuscaloosa
dc.description: Electronic Thesis or Dissertation
dc.description.abstract: Enormous amounts of data are generated in many languages every day due to our increasing global connectivity, which increases the demand for the ability to read and classify data regardless of language. Word embedding is a popular Natural Language Processing (NLP) technique that uses language modeling and feature learning to map words to vectors of real numbers. However, these models require a significant amount of annotated data for training, and while the availability of labeled data is gradually increasing, most of it exists only in high-resource languages such as English. Researchers with different sets of proficient languages seek to address new problems with multilingual NLP applications. In this dissertation, I present multiple approaches to generating cross-lingual word embeddings (CWE) using universal concepts (UC) shared among languages, addressing the limitations of existing methods. My work consists of three approaches to building multilingual/bilingual word embeddings. The first approach includes two steps: pre-processing and processing. In the pre-processing step, we build a bilingual corpus containing both languages' knowledge in the form of sentences for the most frequent words in English and their translated pairs in the target language. In this step, knowledge of the source language is shared with the target language, and vice versa, by swapping one word per sentence with its corresponding translation. In the processing step, we use a monolingual embedding estimator to generate the CWE. The second approach generates multilingual word embeddings using UCs and consists of three parts. In part I, we introduce and build UCs from bilingual dictionaries using graph theory, defining words as nodes and translation pairs as edges. In part II, we explain the configuration used for word2vec to generate encoded-word embeddings. Finally, in part III, we decode the generated embeddings using the UCs. The final approach utilizes the supervised method of the MUSE project, but with the model trained on our UCs. Finally, we applied the last two proposed methods to several practical NLP applications: document classification, cross-lingual sentiment analysis, and code-switching sentiment analysis. Our proposed methods outperform the state-of-the-art MUSE method on the majority of these applications.
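The abstract's part I of the second approach builds universal concepts by treating words as nodes and dictionary translation pairs as edges, so each UC corresponds to one connected component of the translation graph. The following is a minimal sketch of that idea, not the dissertation's implementation; the toy dictionary pairs, the language-prefixed word labels, and the union-find grouping are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy bilingual dictionary: (word, translation) pairs.
# Words are nodes; each translation pair is an edge. A universal concept
# (UC) is one connected component of the resulting graph.
pairs = [
    ("en:dog", "fr:chien"),
    ("en:hound", "fr:chien"),
    ("en:cat", "fr:chat"),
]

def universal_concepts(pairs):
    """Group words into universal concepts via connected components (union-find)."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving keeps trees shallow
            w = parent[w]
        return w

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Collect every word under its component's representative.
    concepts = defaultdict(set)
    for w in parent:
        concepts[find(w)].add(w)
    return list(concepts.values())

ucs = universal_concepts(pairs)
# "en:dog", "en:hound", and "fr:chien" share one concept; "en:cat" and
# "fr:chat" form another. Each concept can then serve as a shared token
# when encoding a corpus for word2vec, as the abstract describes.
```

With real dictionaries the same grouping runs over all language pairs at once, which is why transitive translations (dog–chien–hound) collapse into a single concept.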
dc.publisher: University of Alabama Libraries
dc.relation.hasversion: born digital
dc.relation.ispartof: The University of Alabama Electronic Theses and Dissertations
dc.relation.ispartof: The University of Alabama Libraries Digital Collections
dc.rights: All rights reserved by the author unless otherwise indicated.
dc.subject: BUCWE - Bilingual Universal Concepts Word Embeddings
dc.subject: Cross-lingual word embeddings
dc.subject: ICE - Indirect correlation evaluation
dc.subject: Natural Language Processing
dc.title: Cross-Lingual Word Embeddings with Universal Concepts and their Applications
dc.type: text
etdms.degree.department: University of Alabama. Department of Computer Science
etdms.degree.discipline: Computer Science
etdms.degree.grantor: University of Alabama
Original bundle: 1 file, 1.25 MB, Adobe Portable Document Format