Investigating the effect of corpus construction on Latent Dirichlet Allocation based feature location
The software maintenance community has adopted text retrieval techniques to aid program comprehension tasks such as feature location, the process of finding the source code entity or entities that implement a system feature. Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI) are two such text retrieval techniques. However, little work exists to inform the configuration of these techniques for software maintenance tasks. This work investigates the impact of highly configurable preprocessing decisions on an LDA based feature location technique (FLT). These decisions affect the composition and quality of the corpus and thus the accuracy of the text retrieval technique. The text extracted from source code depends on a researcher's understanding of how source code is used. We decompose source code into three distinct lexicons: identifiers, comments, and literals. Many researchers choose the aggregation of the lexicons; however, some choose specific subsets. This work finds that the chosen text source(s) do impact the accuracy of the LDA based FLT. Conventional wisdom holds that identifier splitting improves the performance of a text retrieval based FLT. However, the decision to retain or remove the original identifier after splitting is unexplored. This work finds that identifier splitting does impact the accuracy of the LDA based FLT, but retaining or removing the original identifier does not have a significant impact. Stop words, words with little semantic value, are often removed from natural language corpora. This work explores the impact of stop word removal on source code corpora. The observations indicate that few stop word configurations differ significantly from one another; even a null configuration (no stop word removal) is acceptable. The Porter stemming algorithm is a popular, lightweight, rule-based stemmer often used in software maintenance preprocessing applications. We investigate the effects of two heavy stemmers, two light stemmers, four blended stemmers, and a null configuration.
One light stemmer is morphological and the other stemmers are rule-based. The results indicate that no stemming algorithm yields significantly different FLT performance compared with any other. We therefore suggest basing preprocessing decisions on system structure and constraints. These recommendations, in turn, reduce the memory and/or processing time needed for LDA based feature location.
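
The preprocessing decisions discussed above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the `retain_original` flag, and the tiny stop list are hypothetical and chosen only for illustration. The sketch covers identifier splitting (with the option to retain or remove the original identifier) and stop word removal; stemming is omitted, which corresponds to a null stemming configuration.

```python
import re

# Hypothetical stop list for illustration only; not a list evaluated in the study.
STOP_WORDS = {"the", "a", "of", "to", "get", "set"}


def split_identifier(identifier):
    """Split an identifier on underscores, camelCase transitions, and digits."""
    parts = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Acronym runs (XML), capitalized or lowercase words (File, parse), digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def preprocess(identifiers, retain_original=False, stop_words=STOP_WORDS):
    """Build corpus terms from identifiers under one preprocessing configuration."""
    terms = []
    for ident in identifiers:
        parts = split_identifier(ident)
        terms.extend(parts)
        # Optionally keep the unsplit identifier alongside its parts.
        if retain_original and len(parts) > 1:
            terms.append(ident.lower())
    # Passing an empty set for stop_words gives the null stop word configuration.
    return [t for t in terms if t not in stop_words]


if __name__ == "__main__":
    print(preprocess(["getFileName", "parseXMLFile"]))
    print(preprocess(["getFileName"], retain_original=True))
    print(preprocess(["getFileName"], stop_words=set()))
```

Retaining the original identifier enlarges the vocabulary, which is one way such a decision can affect the memory and processing time of the resulting corpus.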