Online topic modeling for software maintenance using a changeset-based approach

dc.contributor Gray, Jeff
dc.contributor Smith, Randy K.
dc.contributor Atkison, Travis Levestis
dc.contributor.advisor Kraft, Nicholas A.
dc.contributor.advisor Carver, Jeffrey C.
dc.contributor.author Corley, Christopher Scott
dc.date.accessioned 2018-07-11T16:49:03Z
dc.date.available 2018-07-11T16:49:03Z
dc.date.issued 2018
dc.identifier.other u0015_0000001_0002925
dc.identifier.other Corley_alatus_0004D_13461
dc.description Electronic Thesis or Dissertation
dc.description.abstract Topic modeling is a machine learning technique for discovering thematic structure within a corpus. Topic models have been applied to several areas of software engineering, including bug localization, feature location, triaging change requests, and traceability link recovery. Many of these approaches train topic models on a source code snapshot -- a revision or state of the code at a particular point in time, such as a versioned release. However, source code evolution leads to model obsolescence and thus to the need to retrain the model from the latest snapshot, which incurs a non-trivial computational cost. This work proposes and investigates an approach that can remedy the obsolescence problem. Conventional wisdom in the software maintenance research community holds that the topic model training information must be the same information that is of interest for retrieval. The primary insight of this work is that topic models can infer the topics of any information, regardless of the information used to train the model. By pairing online topic modeling with mining software repositories, I can remove the need to retrain a model and achieve model persistence. To that end, I propose training topic models on the software repository history in the form of the changeset -- a textual representation of the changes that occur between two source code snapshots. To show the feasibility of this approach, I investigate two popular applications of text retrieval in software maintenance: feature location and developer identification. Feature location is a search activity for locating the source code entity that relates to a feature of interest. Developer identification is similar, but focuses on identifying the developer most apt for working on a feature of interest.
Further, to demonstrate the usability of changeset-based topic models, I investigate whether I can consolidate topic-modeling-based maintenance tasks onto a single model, rather than training a separate model for each task at hand. In sum, this work aims to show that training online topic models on software repositories removes retraining costs while maintaining the accuracy of a traditional snapshot-based topic model across different software maintenance problems.
dc.format.extent 195 p.
dc.format.medium electronic
dc.format.mimetype application/pdf
dc.language English
dc.language.iso en_US
dc.publisher University of Alabama Libraries
dc.relation.ispartof The University of Alabama Electronic Theses and Dissertations
dc.relation.ispartof The University of Alabama Libraries Digital Collections
dc.relation.hasversion born digital
dc.rights All rights reserved by the author unless otherwise indicated.
dc.subject.other Computer science
dc.title Online topic modeling for software maintenance using a changeset-based approach
dc.type thesis
dc.type text
University of Alabama. Dept. of Computer Science; Computer Science; The University of Alabama; doctoral; Ph.D.