Department of Information Systems, Statistics & Management Science
Browsing Department of Information Systems, Statistics & Management Science by Author "Barrett, Bruce E."
Item: Contributions to joint monitoring of location and scale parameters: some theory and applications (University of Alabama Libraries, 2012). McCracken, Amanda Kaye; Chakraborti, Subhabrata; University of Alabama Tuscaloosa.

Since their invention in the 1920s, control charts have been popular tools for monitoring processes in fields as varied as manufacturing and healthcare. Most of these charts are designed to monitor a single process parameter, but recently a number of charts and schemes have been developed for jointly monitoring the location and scale of processes that follow two-parameter distributions. These joint monitoring charts are particularly relevant for processes in which special causes may result in a simultaneous shift in the location parameter and the scale parameter. Among the available schemes for jointly monitoring location and scale parameters, the vast majority are designed for normally distributed processes for which the in-control mean and variance are known rather than estimated from data. When the process data are non-normally distributed or the process parameters are unknown, alternative control charts are needed. This dissertation presents and compares several control schemes for jointly monitoring data from Laplace and shifted exponential distributions with known parameters, as well as a pair of charts for monitoring data from normal distributions with unknown mean and variance. The normal theory charts are adaptations of two existing procedures for the known-parameter case, Razmy's (2005) Distance chart and Chen and Cheng's (1998) Max chart, while the Laplace and shifted exponential charts are designed using an appropriate statistic for each parameter, such as the maximum likelihood estimators.
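To make the Max chart idea concrete: it reduces joint monitoring to a single plotting statistic by taking the maximum of two standardized statistics, one for location and one for scale. Below is a minimal sketch for the known-parameter normal case; the subgroup size, control limit, and all names are illustrative assumptions, not the dissertation's exact procedure.

```python
# Minimal sketch of a Max-type joint monitoring statistic for normal data
# with known in-control parameters mu0 and sigma0 (illustrative only).
import numpy as np
from scipy import stats

def max_chart_stat(subgroup, mu0, sigma0):
    """Return M = max(|Z|, |W|) for one subgroup."""
    n = len(subgroup)
    # Standardized location statistic
    z = (np.mean(subgroup) - mu0) / (sigma0 / np.sqrt(n))
    # Transform the sample variance to an equivalent standard normal score
    chi2 = (n - 1) * np.var(subgroup, ddof=1) / sigma0**2
    w = stats.norm.ppf(stats.chi2.cdf(chi2, df=n - 1))
    return max(abs(z), abs(w))

rng = np.random.default_rng(1)
ucl = stats.norm.ppf(1 - 0.0027 / 4)  # illustrative limit for two two-sided tests
for i in range(5):
    m = max_chart_stat(rng.normal(0.0, 1.0, size=5), mu0=0.0, sigma0=1.0)
    print(f"subgroup {i}: M = {m:.3f}, signal = {m > ucl}")
```

A large value of either the location score or the scale score pushes M over the limit, so a single chart signals shifts in either parameter or both.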
Item: Contributions to outlier detection methods: some theory and applications (University of Alabama Libraries, 2011). Dovoedo, Yinaze Herve; Chakraborti, Subhabrata; University of Alabama Tuscaloosa.

Tukey's traditional boxplot (Tukey, 1977) is a widely used Exploratory Data Analysis (EDA) tool, often applied to outlier detection with univariate data. In this dissertation, a modification of Tukey's boxplot is proposed in which the probability of at least one false alarm is controlled, as in Sim et al. (2005). The exact expression for that probability is derived and used to find the fence constants for observations from any specified location-scale distribution. The proposed procedure is compared with that of Sim et al. (2005) in a simulation study. Outlier detection and control charting are closely related. Using the preceding procedure, one- and two-sided boxplot-based Phase I control charts for individual observations are proposed for data from an exponential distribution, while controlling the overall false alarm rate. The proposed charts are compared with the charts of Jones and Champ (2002) in a simulation study. Sometimes the practitioner is unable or unwilling to make an assumption about the form of the underlying distribution but is confident that the distribution is skewed. In that case, it is well documented that applying Tukey's boxplot for outlier detection results in an increased number of false alarms. To this end, a modification of the adjusted boxplot for skewed distributions of Hubert and Vandervieren (2008) is proposed. The proposed procedure is compared to the adjusted boxplot and Tukey's procedure in a simulation study. In practice, the data are often multivariate. The concept of a (statistical) depth (or, equivalently, outlyingness) function provides a natural, nonparametric, "center-outward" ordering of a multivariate data point with respect to the data cloud. The deeper a point, the less outlying it is. It is then natural to use outlyingness functions as outlier identifiers. A simulation study is performed to compare the outlier detection capabilities of selected outlyingness functions available in the literature for multivariate skewed data. Recommendations are provided.
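The fences in Tukey's procedure, the starting point for the modifications above, are simple functions of the sample quartiles. A minimal sketch of the classical rule follows; the constant k = 1.5 is Tukey's conventional choice, whereas the dissertation derives fence constants that control the probability of at least one false alarm.

```python
# Classical Tukey fences: observations outside [Q1 - k*IQR, Q3 + k*IQR] are
# flagged as potential outliers. k = 1.5 is Tukey's conventional constant;
# the dissertation instead derives constants controlling the probability of
# at least one false alarm (this sketch shows only the standard rule).
import numpy as np

def tukey_fences(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

rng = np.random.default_rng(7)
x = np.append(rng.normal(size=50), [6.0, -5.5])  # two planted outliers
lo, hi = tukey_fences(x)
print("fences:", (round(lo, 2), round(hi, 2)))
print("flagged:", x[(x < lo) | (x > hi)])
```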
Item: The development of diagnostic tools for mixture modeling and model-based clustering (University of Alabama Libraries, 2016). Zhu, Xuwen; Melnykov, Volodymyr; University of Alabama Tuscaloosa.

Cluster analysis performs unsupervised partitioning of heterogeneous data and has applications in almost all fields of study. Model-based clustering, which is based on finite mixture models, is among the most popular clustering methods due to its flexibility and interpretability. However, the development of diagnostic and visualization tools for clustering procedures is limited. This dissertation is devoted to assessing different properties of the clustering procedure. The report has four chapters, summarized as follows. The first chapter provides practitioners with an approach to assess the certainty of a classification made in model-based clustering. The second chapter introduces a novel finite mixture model, the Manly mixture model, which is capable of modeling skewness in data and performs diagnostics on the normality of variables. The third chapter develops an extension of the traditional K-means procedure that is capable of modeling skewness in data. The fourth chapter contributes the ManlyMix R package, the software accompanying the work of the second chapter.

Item: GA-Boost: a genetic algorithm for robust boosting (University of Alabama Libraries, 2012). Oh, Dong-Yop; Gray, J. Brian; University of Alabama Tuscaloosa.

Many simple and complex methods have been developed to solve the classification problem. Boosting is one of the best-known techniques for improving the prediction accuracy of classification methods, but boosting is sometimes prone to overfitting, and the final model can be difficult to interpret. Some boosting methods, including AdaBoost, are very sensitive to outliers. Many researchers have contributed to resolving these problems, but they remain open. We introduce a new boosting algorithm, "GA-Boost," which directly optimizes weak learners and their associated weights using a genetic algorithm, along with three extended versions of GA-Boost. The genetic algorithm uses a new penalized fitness function with three parameters (a, b, and p): b limits the number of weak classifiers, a controls the effects of outliers, and the objective is to maximize an appropriately chosen p-th percentile of the margins. We evaluate GA-Boost's performance with an experimental design and compare it to AdaBoost using several artificial and real-world data sets from the UC-Irvine Machine Learning Repository. In experiments, GA-Boost was more resistant to outliers and resulted in simpler predictive models than AdaBoost. GA-Boost can be applied to data sets with three different weak classifier options. The three extended versions of GA-Boost performed very well on two simulated data sets and three real-world data sets.
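The abstract describes GA-Boost's fitness as a penalized function of a chosen p-th percentile of the ensemble margins. The exact functional form is not given in the abstract, so the sketch below is a hypothetical illustration of evaluating such a fitness for a fixed set of weak classifiers; the penalty form, parameter values, and all names are assumptions.

```python
# Hypothetical sketch of a margin-percentile fitness for an ensemble of
# binary weak classifiers (labels in {-1, +1}). The margin of observation i
# is its normalized weighted vote for the correct class; GA-Boost maximizes
# a chosen p-th percentile of these margins. The size penalty below is an
# assumed form for illustration, not the dissertation's actual fitness.
import numpy as np

def margins(predictions, alphas, y):
    # predictions: (T, n) array; row t holds weak classifier t's labels
    f = alphas @ predictions / np.sum(np.abs(alphas))  # normalized weighted vote
    return y * f                                       # positive margin = correct vote

def fitness(predictions, alphas, y, p=25, b=0.01):
    m = margins(predictions, alphas, y)
    return np.percentile(m, p) - b * len(alphas)       # assumed size penalty

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=100)
preds = np.where(rng.random((10, 100)) < 0.7, y, -y)   # 10 weak learners, ~70% accurate
alphas = rng.random(10)
print("fitness at p = 25:", round(fitness(preds, alphas, y), 3))
```

A genetic algorithm would then evolve the weak learners and their weights alphas to maximize this fitness; choosing a lower percentile p makes the objective less sensitive to a few badly misclassified (possibly outlying) points.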
Item: On robust estimation of multiple change points in multivariate and matrix processes (University of Alabama Libraries, 2017). Melnykov, Yana; Perry, Marcus B.; University of Alabama Tuscaloosa.

There are numerous areas of human activity in which processes are observed over time. If the conditions of the process change, this can be reflected in a shift in the observed response values. The detection and estimation of such shifts is commonly known as change point inference. While estimation helps us learn about the nature of the process, assess its parameters, and analyze identified change points, detection focuses on finding shifts in the real-time process flow. A wide variety of methods has been proposed in the literature for both settings. Unfortunately, the majority of procedures impose very restrictive assumptions, such as normality of the data, independence of observations, or independence of subjects in multisubject studies. In this dissertation, a new methodology relying on more realistic assumptions is developed. The dissertation includes three chapters, summarized as follows. In the first chapter, we develop methodology capable of estimating and detecting multiple change points in a multisubject, single-variable process observed over time. In the second chapter, we introduce methodology for the robust estimation of change points in multivariate processes observed over time. In the third chapter, we generalize the ideas of the first two chapters by developing methodology capable of identifying multiple change points in multisubject matrix processes observed over time.

Item: Some contributions to univariate nonparametric tests and control charts (University of Alabama Libraries, 2017). Zheng, Rong; Chakraborti, Subhabrata; University of Alabama Tuscaloosa.

In general, statistical methods fall into two categories: parametric and nonparametric. Parametric analysis relies on information about the probability distribution of the random variable, whereas a nonparametric method, also referred to as a distribution-free procedure, does not require prior knowledge of that distribution. In reality, practitioners rarely have full knowledge of a random variable's probability distribution. Hence, there are two choices: one can use parametric methods, assuming a parametric distribution on the basis of scientific evaluation or for simplicity, or one can apply nonparametric methods directly, without much knowledge of the distribution. Conclusions from parametric methods are valid as long as the assumptions are substantiated; such assumptions help solve problems, but they are also risky, because a wrong assumption can be dangerous. Nonparametric techniques can therefore be a preferable alternative, and their chief advantage lies in relaxing assumptions about the shape of the distribution, namely the distribution-free property. From a research point of view, new methodology built on nonparametric techniques, or further investigation of existing nonparametric techniques, is therefore interesting, informative, and valuable. All of the research in this dissertation contributes to univariate nonparametric tests and control charts.

Item: Three essays on improving ensemble models (University of Alabama Libraries, 2013). Xu, Jie; Gray, J. Brian; University of Alabama Tuscaloosa.

Ensemble models, such as bagging (Breiman, 1996), random forests (Breiman, 2001a), and boosting (Freund and Schapire, 1997), have better predictive accuracy than single classifiers. These ensembles typically consist of hundreds of single classifiers, which makes future predictions and model interpretation much more difficult than with a single classifier. Breiman (2001b) gave random forests a grade of A+ in predictive performance but a grade of F in interpretability. Breiman (2001a) also noted that the performance of an ensemble model depends on the strengths of the individual classifiers in the ensemble and the correlations among them. Reyzin and Schapire (2006) stated that "the margins explanation basically says that when all other factors are equal, higher margins result in lower error," which is referred to as the "large margin theory." Shen and Li (2010) showed that the performance of an ensemble model is related to the mean and the variance of the margins. In this research, we improve ensemble models from two perspectives: increasing interpretability and/or decreasing the test error rate. We first propose a new method based on quadratic programming that uses information on the strengths of the individual classifiers in the ensemble and their correlations to improve or maintain the predictive accuracy of an ensemble while significantly reducing its size. In the second essay, we improve the predictive accuracy of random forests by adding an AdaBoost-like improvement step to random forests. Finally, we propose a method to improve the strength of the individual classifiers by using fully grown trees fitted on weighted resamples of the training data and then combining the trees by using the AdaBoost method.
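The first essay's use of classifier strengths and correlations suggests an optimization of the following general shape. The sketch below is a hypothetical illustration, not the essay's actual quadratic program: it rewards accurate classifiers, penalizes weighting correlated ones together, and keeps only members with nonnegligible weight.

```python
# Hypothetical sketch of strength/correlation-based ensemble pruning: choose
# nonnegative classifier weights that favor strong (accurate) members while
# penalizing correlated ones, then keep only members with nonnegligible
# weight. Objective form and all names are assumptions for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
T, n = 8, 200
y = rng.choice([-1, 1], size=n)
preds = np.where(rng.random((T, n)) < 0.65, y, -y)   # 8 weak classifiers

strength = (preds == y).mean(axis=1)                 # individual accuracies
agree = (preds @ preds.T) / n                        # pairwise agreement (correlation proxy)

def objective(w, lam=0.5):
    # reward strength, penalize weighting correlated classifiers together
    return -strength @ w + lam * w @ agree @ w

res = minimize(objective, x0=np.full(T, 1 / T),
               bounds=[(0, None)] * T,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
keep = np.flatnonzero(res.x > 1e-3)
print("weights:", np.round(res.x, 3))
print("kept classifiers:", keep)
```

The quadratic term is what makes pruning possible: two classifiers that always agree contribute little beyond one of them, so the optimizer drives one of their weights toward zero.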
Item: Three essays on the use of margins to improve ensemble methods (University of Alabama Libraries, 2012). Martinez Cid, Waldyn Gerardo; Gray, J. Brian; University of Alabama Tuscaloosa.

Ensemble methods, such as bagging (Breiman, 1996), boosting (Freund and Schapire, 1997), and random forests (Breiman, 2001), combine a large number of classifiers through (weighted) voting to produce strong classifiers. To explain the successful performance of ensembles, and particularly of boosting, Schapire, Freund, Bartlett and Lee (1998) developed an upper bound on the generalization error of an ensemble based on the margins, from which it was concluded that larger margins should lead to lower generalization error, everything else being equal (sometimes referred to as the "large margins theory"). This result has led many researchers to consider direct optimization of functions of the margins (see, e.g., Grove and Schuurmans, 1998; Breiman, 1999; Mason, Bartlett and Baxter, 2000; and Shen and Li, 2010). In this research, we show that the large margins theory is not sufficient for explaining the performance of AdaBoost. Shen and Li (2010) and Xu and Gray (2012) provide evidence suggesting that generalization error might be reduced by increasing the mean and decreasing the variance of the margins, which we refer to as "squeezing" the margins. For that reason, we also propose several alternative techniques for squeezing the margins and evaluate their effectiveness through simulations with real and synthetic data sets. In addition to the margins being a determinant of the performance of ensembles, we know that AdaBoost, the most common boosting algorithm, can be very sensitive to outliers and noisy data, since it assigns higher weight to misclassified observations in subsequent iterations. Therefore, we propose several techniques to identify and potentially delete noisy samples in order to improve its performance.
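The outlier sensitivity noted above follows directly from AdaBoost's exponential reweighting, in which a misclassified observation has its weight multiplied by e^alpha each round. A minimal sketch of the standard update follows; the toy data and always-predict-+1 weak learner are illustrative assumptions.

```python
# Minimal sketch of AdaBoost's reweighting step (labels in {-1, +1}),
# showing why a persistently misclassified (noisy) point accumulates weight:
# w_i <- w_i * exp(-alpha * y_i * h(x_i)), then renormalize.
import numpy as np

n = 10
w = np.full(n, 1 / n)                    # uniform starting weights
y = np.ones(n); y[0] = -1                # point 0 acts as a mislabeled outlier
h = np.ones(n)                           # toy weak learner: predicts +1 for all,
                                         # so it misclassifies only point 0
err = np.sum(w[h != y])                  # weighted error = 0.1
alpha = 0.5 * np.log((1 - err) / err)    # classifier weight, about 1.10
w = w * np.exp(-alpha * y * h)           # exponential reweighting
w = w / w.sum()
print("weight of noisy point:", round(w[0], 3))   # 0.5: half of all weight
```

After a single round, the one mislabeled point already carries half of the total weight, which is why identifying and potentially deleting such noisy samples can improve performance.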