Department of Information Systems, Statistics & Management Science
Browsing Department of Information Systems, Statistics & Management Science by Author "Albright, Thomas L."
Item: Model tree analysis with randomly generated and evolved trees (M-TARGET) (University of Alabama Libraries, 2010) Sasamoto, Mark Makoto; Gray, J. Brian; University of Alabama Tuscaloosa

Tree-structured modeling is a data mining technique that recursively partitions a data set into relatively homogeneous subgroups in order to make more accurate predictions on future observations. One of the earliest decision tree induction algorithms, CART (Classification and Regression Trees) (Breiman, Friedman, Olshen, and Stone 1984), has known weaknesses, including greediness, split selection bias, and simplistic formation of classification and prediction rules in the terminal leaf nodes. Improvements have been proposed in later algorithms, including Bayesian CART (Chipman, George, and McCulloch 1998), Bayesian Treed Regression (Chipman, George, and McCulloch 2002), TARGET (Tree Analysis with Randomly Generated and Evolved Trees) (Fan and Gray 2005; Gray and Fan 2008), and Treed Regression (Alexander and Grimshaw 2006). TARGET, Bayesian CART, and Bayesian Treed Regression introduced stochastically driven search methods that explore the tree space in a non-greedy fashion; these methods search the tree space with global optimality in mind, rather than following a series of locally optimal splits. Treed Regression and Bayesian Treed Regression add models in the leaf nodes to predict and classify new observations, instead of using the leaf mean or weighted majority vote as in traditional regression and classification trees, respectively. This dissertation proposes a new method, M-TARGET (Model Tree Analysis with Randomly Generated and Evolved Trees), which combines the stochastic search of TARGET with models in the leaf nodes to improve prediction and classification accuracy.
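The models-in-the-leaves idea can be illustrated with a minimal numpy sketch. This is not the M-TARGET implementation; the data, the single fixed split point, and the use of linear leaf models are all illustrative assumptions. It contrasts a traditional regression tree leaf (predict the leaf mean) with a model-tree leaf (fit a linear model per leaf):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two regimes with different linear trends (hypothetical).
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 2.0 * x, 30.0 - x) + rng.normal(0, 0.5, 200)

split = 5.0                      # a single, hand-chosen split point
left, right = x < split, x >= split

# Traditional regression tree: each leaf predicts its mean response.
mean_pred = np.where(left, y[left].mean(), y[right].mean())

# Model tree: fit a linear model within each leaf instead.
bl = np.polyfit(x[left], y[left], 1)
br = np.polyfit(x[right], y[right], 1)
model_pred = np.where(left, np.polyval(bl, x), np.polyval(br, x))

rmse = lambda p: np.sqrt(np.mean((y - p) ** 2))
print(f"leaf-mean RMSE:  {rmse(mean_pred):.3f}")
print(f"leaf-model RMSE: {rmse(model_pred):.3f}")
```

Because the underlying response is piecewise linear, the leaf models capture within-leaf trend that leaf means cannot, which is the accuracy gain the abstract attributes to leaf-node models.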
Comparisons with Treed Regression and Bayesian Treed Regression on real data sets show favorable results with regard to RMSE and tree size, which suggests that M-TARGET is a viable approach to decision tree modeling.

Item: Three essays on the use of margins to improve ensemble methods (University of Alabama Libraries, 2012) Martinez Cid, Waldyn Gerardo; Gray, J. Brian; University of Alabama Tuscaloosa

Ensemble methods, such as bagging (Breiman, 1996), boosting (Freund and Schapire, 1997), and random forests (Breiman, 2001), combine a large number of classifiers through (weighted) voting to produce strong classifiers. To explain the successful performance of ensembles, and of boosting in particular, Schapire, Freund, Bartlett, and Lee (1998) developed an upper bound on the generalization error of an ensemble based on the margins, from which they concluded that larger margins should lead to lower generalization error, everything else being equal (sometimes referred to as the "large margins theory"). This result has led many researchers to consider direct optimization of functions of the margins (see, e.g., Grove and Schuurmans, 1998; Breiman, 1999; Mason, Bartlett, and Baxter, 2000; and Shen and Li, 2010). In this research, we show that the large margins theory is not sufficient to explain the performance of AdaBoost. Shen and Li (2010) and Xu and Gray (2012) provide evidence suggesting that generalization error might be reduced by increasing the mean and decreasing the variance of the margins, which we refer to as "squeezing" the margins. We therefore propose several alternative techniques for squeezing the margins and evaluate their effectiveness through simulations with real and synthetic data sets.
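The margin quantities the essays study can be sketched in a few lines of numpy. For a binary ensemble with votes in {-1, +1} and voting weights alpha_t, the margin of example i is its normalized weighted vote in favor of the true class; the ensemble below is simulated toy data (classifier accuracies and weights are hypothetical, not taken from the dissertation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary ensemble: T base classifiers voting -1/+1 with weights alpha_t.
T, n = 25, 100
y = rng.choice([-1, 1], size=n)                    # true labels
# Simulated base-classifier votes, each roughly 70% accurate.
votes = np.where(rng.random((T, n)) < 0.7, y, -y)  # shape (T, n)
alpha = rng.uniform(0.5, 1.5, size=T)              # voting weights

# Margin of example i: weighted vote for the true class, normalized to [-1, 1].
margins = y * (alpha @ votes) / alpha.sum()

print(f"mean margin: {margins.mean():.3f}")
print(f"var margin:  {margins.var():.3f}")
```

"Squeezing" the margins means pushing this distribution toward a larger mean and a smaller variance, the combination that Shen and Li (2010) and Xu and Gray (2012) associate with lower generalization error.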
In addition to the margins being a determinant of ensemble performance, we know that AdaBoost, the most common boosting algorithm, can be very sensitive to outliers and noisy data, since it assigns higher weights to misclassified observations in subsequent iterations. We therefore propose several techniques to identify, and potentially delete, noisy observations in order to improve its performance.
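The sensitivity described above follows directly from AdaBoost's multiplicative reweighting. The toy sketch below (a hypothetical setup, not one of the proposed techniques) tracks the weight of a single "noisy" observation that is misclassified every round, while all other observations are classified correctly:

```python
import numpy as np

# AdaBoost-style reweighting on 10 observations; index 0 is "noisy"
# and gets misclassified every round, the rest are always correct.
n, rounds = 10, 8
w = np.full(n, 1.0 / n)

for _ in range(rounds):
    miss = np.zeros(n, dtype=bool)
    miss[0] = True                         # the noisy point is always missed
    err = w[miss].sum()                    # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)  # classifier weight
    w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
    w /= w.sum()                           # renormalize

print(f"weight on noisy point: {w[0]:.3f}")
print(f"typical clean weight:  {w[1]:.3f}")
```

The weight of the persistently misclassified point climbs until it dominates the distribution, which is why boosting can fixate on noise and why observations whose weights grow round after round are natural candidates for identification and removal.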