Three essays on improving ensemble models
Ensemble models, such as bagging (Breiman, 1996), random forests (Breiman, 2001a), and boosting (Freund and Schapire, 1997), have better predictive accuracy than single classifiers. These ensembles typically consist of hundreds of single classifiers, which makes future predictions and model interpretation much more difficult than for single classifiers. Breiman (2001b) gave random forests a grade of A+ in predictive performance, but a grade of F in interpretability. Breiman (2001a) also mentioned that the performance of an ensemble model depends on the strengths of the individual classifiers in the ensemble and the correlations among them. Reyzin and Schapire (2006) stated that "the margins explanation basically says that when all other factors are equal, higher margins result in lower error," which is referred to as the "large margin theory." Shen and Li (2010) showed that the performance of an ensemble model is related to the mean and the variance of the margins. In this research, we improve ensemble models from two perspectives, increasing the interpretability and/or decreasing the test error rate. We first propose a new method based on quadratic programming that uses information on the strengths of the individual classifiers in the ensemble and their correlations, to improve or maintain the predictive accuracy of an ensemble while significantly reducing its size. In the second essay, we improve the predictive accuracy of random forests by adding an AdaBoost-like improvement step to random forests. Finally, we propose a method to improve the strength of the individual classifiers by using fully-grown trees fitted on weighted resampling training data and then combining the trees by using the AdaBoost method.