Sparse regression of textual analysis
Abstract
We consider sparse regression techniques as tools for classifying sentiment in Twitter posts. Analysis of Twitter usage poses several unique challenges. For example, the 140-character limit severely restricts the amount of information contained in each post; as a result, most tweets use only an extremely small subset of the dictionary, which presents challenges for learning schemes based on dictionary usage. To remedy this undersampling issue, we propose the use of penalized regression. Specifically, we employ regularized logistic regression to avoid the degeneracy caused by the sparse usage of the dictionary in each tweet, while simultaneously learning which terms are most associated with each sentiment. Accelerated sparse discriminant analysis is also used to combat degeneracy and overfitting of the training data while providing dimension reduction. As illustrative examples, we employ sparse logistic regression to classify tweets based on the users' perception of a connection between vaccination and autism, and we examine Twitter users' sentiment toward the use of autonomous cars.
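The idea of sparse logistic regression on short texts can be illustrated with a minimal sketch. It is not the authors' implementation: the toy tweets, the function names (`featurize`, `fit_sparse_logistic`, `predict_proba`), the penalty value `lam`, and the choice of a plain proximal-gradient solver are all assumptions made for illustration. Each tweet is encoded as a binary bag-of-words vector, and an L1 penalty drives the weights of uninformative words to exactly zero.

```python
import math

# Hypothetical toy "tweets" (1 = positive, 0 = negative); not data from the study.
tweets = [
    ("love the self driving car", 1),
    ("self driving car is great", 1),
    ("great ride love it", 1),
    ("hate the self driving car", 0),
    ("self driving car is scary", 0),
    ("scary ride hate it", 0),
]

def tokenize(text):
    return text.lower().split()

# Dictionary: every distinct word in the training tweets, mapped to a column index.
vocab = {w: j for j, w in enumerate(sorted({w for t, _ in tweets for w in tokenize(t)}))}

def featurize(text):
    """Binary bag-of-words vector; words outside the dictionary are ignored."""
    x = [0.0] * len(vocab)
    for w in tokenize(text):
        if w in vocab:
            x[vocab[w]] = 1.0
    return x

def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink toward zero, clipping small values to 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_sparse_logistic(X, y, lam=0.05, step=0.5, iters=1000):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient descent.

    The intercept b is left unpenalized. lam, step, and iters are arbitrary
    illustrative choices, not tuned values.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            r = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += r / n
            for j, xj in enumerate(xi):
                grad_w[j] += r * xj / n
        b -= step * grad_b
        # Gradient step on the smooth loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def predict_proba(text, w, b):
    z = b + sum(wj * xj for wj, xj in zip(w, featurize(text)))
    return 1.0 / (1.0 + math.exp(-z))

X = [featurize(t) for t, _ in tweets]
y = [label for _, label in tweets]
weights, bias = fit_sparse_logistic(X, y)

# Words that keep a nonzero weight are the terms the model associates with a sentiment.
selected = {w: round(weights[j], 3) for w, j in vocab.items() if weights[j] != 0.0}
print(selected)
print(predict_proba("love this car", weights, bias))
```

In this toy set, words such as "car" and "the" appear equally often in both classes, so the penalty should leave nonzero weights only on the sentiment-bearing words. In practice one would more likely use an established solver and choose the penalty by cross-validation.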