Cross-Lingual Word Embeddings with Universal Concepts and their Applications

dc.contributor: Gray, Jeff
dc.contributor: Galloway, Michael
dc.contributor: Hong, Xiaoyan
dc.contributor.advisor: Vrbsky, Susan
dc.contributor.advisor: Musaev, Aibek
dc.contributor.author: Sheinidashtegol, Pezhman
dc.contributor.other: University of Alabama Tuscaloosa
dc.date.accessioned: 2022-04-13T20:33:58Z
dc.date.available: 2027-09-01
dc.date.issued: 2020
dc.description: Electronic Thesis or Dissertation
dc.description.abstract: Enormous amounts of data are generated in many languages every day due to our increasing global connectivity, which raises the demand for the ability to read and classify data regardless of language. Word embedding is a popular Natural Language Processing (NLP) technique that uses language modeling and feature learning to map words to vectors of real numbers. However, these models need a significant amount of annotated data for training, and while the availability of labeled data is gradually increasing, most of it exists only in high-resource languages such as English. Researchers who are proficient in different sets of languages seek to address new problems with multilingual NLP applications. In this dissertation, I present multiple approaches to generating cross-lingual word embeddings (CWE) using universal concepts (UCs) shared among languages, addressing the limitations of existing methods. My work consists of three approaches to building multilingual/bilingual word embeddings. The first approach includes two steps: pre-processing and processing. In the pre-processing step, we build a bilingual corpus containing both languages' knowledge in the form of sentences for the most frequent words in English and their translated pairs in the target language; knowledge of the source language is shared with the target language, and vice versa, by swapping one word per sentence with its corresponding translation. In the processing step, we use a monolingual embedding estimator to generate the CWE. The second approach generates multilingual word embeddings using UCs and consists of three parts: in part I, we introduce and build UCs from bilingual dictionaries using graph theory, defining words as nodes and translation pairs as edges; in part II, we describe the word2vec configuration used to generate encoded-word embeddings; and in part III, we decode the generated embeddings using the UCs. The final approach utilizes the supervised method of the MUSE project, but with the model trained on our UCs. Finally, we applied the last two proposed methods to practical NLP applications: document classification, cross-lingual sentiment analysis, and code-switching sentiment analysis. Our proposed methods outperform the state-of-the-art MUSE method on the majority of these applications.
dc.format.medium: electronic
dc.format.mimetype: application/pdf
dc.identifier.other: http://purl.lib.ua.edu/182077
dc.identifier.other: u0015_0000001_0004230
dc.identifier.other: Sheinidashtegol_alatus_0004D_14203
dc.identifier.uri: https://ir.ua.edu/handle/123456789/8409
dc.language: English
dc.language.iso: en_US
dc.publisher: University of Alabama Libraries
dc.relation.hasversion: born digital
dc.relation.ispartof: The University of Alabama Electronic Theses and Dissertations
dc.relation.ispartof: The University of Alabama Libraries Digital Collections
dc.rights: All rights reserved by the author unless otherwise indicated.
dc.subject: BUCWE - Bilingual Universal Concepts Word Embeddings
dc.subject: Code-switching
dc.subject: Cross-lingual word embeddings
dc.subject: ICE - Indirect correlation evaluation
dc.subject: Natural Language Processing
dc.subject: UC-MUSE
dc.title: Cross-Lingual Word Embeddings with Universal Concepts and their Applications
dc.type: thesis
dc.type: text
etdms.degree.department: University of Alabama. Department of Computer Science
etdms.degree.discipline: Computer Science
etdms.degree.grantor: The University of Alabama
etdms.degree.level: doctoral
etdms.degree.name: Ph.D.
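
The pre-processing step of the first approach, as summarized in the abstract above, shares knowledge across languages by swapping one word per sentence with its translation. A minimal Python sketch of that idea, assuming whitespace tokenization, a plain word-to-translation dictionary, and a random choice of which word to swap (none of which this record specifies):

import random

def mix_sentence(sentence, dictionary, rng=random):
    # Replace one translatable word with its translation so that both
    # languages share context within a single training sentence.
    tokens = sentence.split()
    swappable = [i for i, tok in enumerate(tokens) if tok.lower() in dictionary]
    if swappable:
        i = rng.choice(swappable)
        tokens[i] = dictionary[tokens[i].lower()]
    return " ".join(tokens)

en_es = {"dog": "perro", "house": "casa"}  # hypothetical tiny dictionary
print(mix_sentence("the dog sleeps in the house", en_es))
# prints e.g. "the perro sleeps in the house"

A monolingual embedding estimator trained on a corpus mixed this way sees source-language and target-language words in shared contexts, which is what lets a single embedding space cover both languages.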
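Part I of the second approach builds universal concepts from bilingual dictionaries by treating words as nodes and translation pairs as edges. A sketch under the assumption (not stated in this record) that each connected component of the translation graph forms one universal concept:

from collections import defaultdict

def build_universal_concepts(translation_pairs):
    # Words are nodes and translation pairs are edges; each connected
    # component of the resulting graph is treated here as one universal
    # concept (an assumed grouping criterion).
    graph = defaultdict(set)
    for a, b in translation_pairs:
        graph[a].add(b)
        graph[b].add(a)

    concept_of = {}
    next_id = 0
    for start in graph:
        if start in concept_of:
            continue
        stack = [start]  # depth-first search over one component
        while stack:
            node = stack.pop()
            if node in concept_of:
                continue
            concept_of[node] = next_id
            stack.extend(n for n in graph[node] if n not in concept_of)
        next_id += 1
    return concept_of

pairs = [(("en", "dog"), ("es", "perro")),
         (("es", "perro"), ("fr", "chien")),
         (("en", "cat"), ("es", "gato"))]
print(build_universal_concepts(pairs))
# dog/perro/chien share one concept id; cat/gato share another

One property of this grouping: transitive translations (dog, perro, chien) land in the same concept even when no single dictionary links them directly.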
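Part II then trains word2vec on the concept-encoded corpus. An illustrative gensim call; the hyperparameters and the uc_<id> token naming are placeholders rather than the dissertation's actual configuration:

from gensim.models import Word2Vec

# Concept-encoded sentences: dictionary words replaced by their UC ids
# (the "uc_<id>" naming is hypothetical).
sentences = [["uc_12", "sleeps", "in", "the", "uc_7"],
             ["uc_12", "duerme", "en", "la", "uc_7"]]

# Illustrative skip-gram settings only; the record gives no hyperparameters.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["uc_12"]  # one shared vector per encoded concept

Decoding (part III) would then assign each concept's vector back to every member word in each language's vocabulary, yielding the cross-lingual embeddings.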
Files
Original bundle
Name: u0015_0000001_0004230.pdf
Size: 1.25 MB
Format: Adobe Portable Document Format