
# Department of Information Systems, Statistics & Management Science



### Browsing Department of Information Systems, Statistics & Management Science by Author "Adams, Benjamin Michael"

Now showing 1 - 8 of 8



#### Advances in mixture modeling and model-based clustering (University of Alabama Libraries, 2015)

Michael, Semhar K.; Melnykov, Volodymyr; University of Alabama Tuscaloosa

Cluster analysis is the part of unsupervised learning that deals with finding groups of similar observations in heterogeneous data. There are several clustering approaches, all with the goal of minimizing the within-cluster variance while maximizing the between-cluster variance. K-means and hierarchical clustering with different linkages can be thought of as distance-based approaches. Another approach, model-based clustering, relies on the idea of finite mixture models. This dissertation proposes new advances in the clustering area, mostly related to model-based clustering and its extension to the K-means algorithm. The dissertation has five chapters. The first chapter is a literature review of recent advances in model-based clustering and finite mixture modeling: the main advances and challenges are described in the methodology section, and some interesting and diverse applications of model-based clustering are presented in the application section. The second chapter deals with a simulation study conducted to analyze the factors that affect the complexity of model-based clustering. In the third chapter, we develop a methodology for model-based clustering of regression time series data and show its application to annual tree rings. In the fourth chapter, we utilize the relationship between model-based clustering and the K-means algorithm to develop a methodology for merging clusters formed by K-means to find a meaningful grouping. The final chapter is dedicated to the problem of initialization in model-based clustering. It is a well-known fact that the performance of model-based clustering is highly dependent on the initialization of the EM algorithm, and so far no method works comprehensively in all situations.
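The initialization sensitivity noted above is easy to demonstrate. The following is a minimal sketch on hypothetical data (not the dissertation's method or data): a two-component univariate Gaussian-mixture EM run from several random starting points, which can converge to different local maxima of the likelihood.

```python
# Minimal EM for a two-component univariate Gaussian mixture, run from
# several random starts to illustrate that the final log-likelihood can
# depend on the initialization. Purely illustrative synthetic data.
import math
import random

def em_gmm2(x, seed, n_iter=100):
    """Return (final log-likelihood, sorted fitted means)."""
    rng = random.Random(seed)
    mu = rng.sample(x, 2)                    # random initial means from the data
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    ll = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp, ll = [], 0.0
        for xi in x:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            tot = sum(dens)
            ll += math.log(tot)
            resp.append([d / tot for d in dens])
        # M-step: update mixing weights, means, variances (variance floored)
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(1e-6, sum(r[k] * (xi - mu[k]) ** 2
                                   for r, xi in zip(resp, x)) / nk)
    return ll, sorted(mu)

# two well-separated groups plus one stray point
data = [0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 9.0]
lls = {seed: round(em_gmm2(data, seed)[0], 3) for seed in range(5)}
# different seeds can end at different local maxima of the likelihood
```

Running many such restarts and keeping the best likelihood (or averaging across them) is the intuition behind restart-based strategies such as emEM.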
In this project, we use the idea of model averaging together with initialization by the emEM algorithm to address this problem.

#### Construction of estimation-equivalent second-order split-split-plot designs (University of Alabama Libraries, 2011)

Yuan, Fang; Perry, Marcus B.; University of Alabama Tuscaloosa

In many experimental settings, some factors are very hard or very expensive to change, some are merely hard to change, and some are easy to change; this usually leads to a split-split-plot design. Such designs impose randomization restrictions on the experiment, and if the data are analyzed as if they came from a completely randomized design, the results can be misleading. The analysis of split-split-plot designs is also more complicated than that of the completely randomized design, as generalized least squares (GLS) is recommended for estimating the factor effects and restricted maximum likelihood (REML) for estimating the variance components. As an alternative, one can consider estimation-equivalent designs, wherein the ordinary least squares (OLS) and GLS estimates of the factor effects are equivalent. These designs provide practical benefits from the perspective of design selection and estimation, and they are consistent with traditional response surface methods. Although much work has been done on estimation-equivalent second-order split-plot designs, less emphasis has been placed on split-split-plot (and higher strata) designs of this type.
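OLS-GLS equivalence can be checked numerically in a toy setting. The sketch below uses a hypothetical compound-symmetry error covariance (not the dissertation's designs): with an intercept in the model and V = I + rho*J, the column space of X is invariant under V, so the GLS estimate reduces to the OLS estimate.

```python
# Numerical sketch of OLS-GLS equivalence under compound symmetry
# (illustrative only, not the dissertation's construction).
import numpy as np

n = 8
x = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), x])      # intercept + one regressor
rng = np.random.default_rng(0)
y = 2.0 + 0.5 * x + rng.normal(size=n)

V = np.eye(n) + 0.4 * np.ones((n, n))     # compound-symmetric covariance
Vinv = np.linalg.inv(V)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

assert np.allclose(beta_ols, beta_gls)    # equivalence holds here
```

The assertion passes because V maps the column space of X into itself, which is exactly the kind of condition that equivalence results are built on.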
My research derives the general conditions for achieving OLS-GLS equivalence and uses these conditions to construct balanced and unbalanced estimation-equivalent second-order split-split-plot designs from the central composite design (CCD).

#### Contributions to joint monitoring of location and scale parameters: some theory and applications (University of Alabama Libraries, 2012)

McCracken, Amanda Kaye; Chakraborti, Subhabrata; University of Alabama Tuscaloosa

Since their invention in the 1920s, control charts have been popular tools for monitoring processes in fields as varied as manufacturing and healthcare. Most of these charts are designed to monitor a single process parameter, but recently a number of charts and schemes have been developed for jointly monitoring the location and scale of processes that follow two-parameter distributions. These joint monitoring charts are particularly relevant for processes in which special causes may shift the location parameter and the scale parameter simultaneously. Among the available schemes for jointly monitoring location and scale, the vast majority are designed for normally distributed processes whose in-control mean and variance are known rather than estimated from data. When the process data are non-normally distributed or the process parameters are unknown, alternative control charts are needed. This dissertation presents and compares several control schemes for jointly monitoring data from Laplace and shifted exponential distributions with known parameters, as well as a pair of charts for monitoring data from normal distributions with unknown mean and variance.
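One simple way to watch both parameters at once is a max-type statistic: standardize the subgroup mean and the subgroup variance separately and plot the larger absolute value. The sketch below is illustrative only, not the dissertation's charts; the Wilson-Hilferty cube-root approximation stands in for the exact chi-square-to-normal transformation, and the control limit is a placeholder.

```python
# Rough sketch of a max-type joint monitoring statistic for a normal
# process with known in-control mean mu0 and sd sigma0 (illustrative).
import math
import statistics

def max_statistic(subgroup, mu0, sigma0):
    n = len(subgroup)
    xbar = statistics.fmean(subgroup)
    s2 = statistics.variance(subgroup)
    z_mean = math.sqrt(n) * (xbar - mu0) / sigma0
    # (n-1)S^2/sigma0^2 ~ chi2_{n-1}; Wilson-Hilferty maps it to ~N(0,1)
    k = n - 1
    c = k * s2 / sigma0 ** 2
    z_var = ((c / k) ** (1 / 3) - (1 - 2 / (9 * k))) / math.sqrt(2 / (9 * k))
    return max(abs(z_mean), abs(z_var))

in_control = [9.8, 10.1, 10.0, 9.9, 10.2]
shifted = [11.0, 11.4, 11.2, 10.9, 11.3]     # mean shifted upward
M1 = max_statistic(in_control, mu0=10.0, sigma0=0.2)
M2 = max_statistic(shifted, mu0=10.0, sigma0=0.2)
# a signal is raised when M exceeds a control limit chosen for the
# desired in-control average run length
```

A single plotted statistic with a single limit is the practical appeal of max-type schemes such as the Max chart.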
The normal-theory charts are adaptations of two existing procedures for the known-parameter case, Razmy's (2005) Distance chart and Chen and Cheng's (1998) Max chart, while the Laplace and shifted exponential charts are designed using an appropriate statistic for each parameter, such as the maximum likelihood estimators.

#### Contributions to outlier detection methods: some theory and applications (University of Alabama Libraries, 2011)

Dovoedo, Yinaze Herve; Chakraborti, Subhabrata; University of Alabama Tuscaloosa

Tukey's traditional boxplot (Tukey, 1977) is a widely used exploratory data analysis (EDA) tool, often applied to outlier detection with univariate data. In this dissertation, a modification of Tukey's boxplot is proposed in which the probability of at least one false alarm is controlled, as in Sim et al. (2005). The exact expression for that probability is derived and is used to find the fence constants for observations from any specified location-scale distribution. The proposed procedure is compared with that of Sim et al. (2005) in a simulation study. Outlier detection and control charting are closely related. Using the preceding procedure, one- and two-sided boxplot-based Phase I control charts for individual observations are proposed for data from an exponential distribution, while controlling the overall false alarm rate. The proposed charts are compared with the charts of Jones and Champ (2002) in a simulation study. Sometimes the practitioner is unable or unwilling to make an assumption about the form of the underlying distribution but is confident that the distribution is skewed. In that case, it is well documented that applying Tukey's boxplot for outlier detection results in an increased number of false alarms. To this end, a modification of the so-called adjusted boxplot for skewed distributions of Hubert and Vandervieren (2008) is proposed.
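For reference, the classic Tukey fence rule that these procedures refine looks like this. This is the textbook k = 1.5 version on made-up numbers; the dissertation instead calibrates the fence constants so the overall false alarm probability is controlled.

```python
# Classic Tukey boxplot outlier rule: flag points outside
# [Q1 - k*IQR, Q3 + k*IQR] with the conventional k = 1.5.
import statistics

def tukey_outliers(data, k=1.5):
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

sample = [2.1, 2.4, 2.2, 2.3, 2.5, 2.2, 9.7]
print(tukey_outliers(sample))   # [9.7] -- the stray value is flagged
```

With a fixed k, the chance of at least one false flag grows with the sample size and with distributional skewness, which is exactly the issue the modified fences address.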
The proposed procedure is compared to the adjusted boxplot and Tukey's procedure in a simulation study. In practice, the data are often multivariate. The concept of a (statistical) depth (or, equivalently, outlyingness) function provides a natural, nonparametric, "center-outward" ordering of a multivariate data point with respect to the data cloud: the deeper a point, the less outlying it is. It is then natural to use outlyingness functions as outlier identifiers. A simulation study is performed to compare the outlier detection capabilities of selected outlyingness functions available in the literature for multivariate skewed data, and recommendations are provided.

#### On the detection and estimation of changes in a process mean based on kernel estimators (University of Alabama Libraries, 2012)

Mercado Velasco, Gary Ricardo; Perry, Marcus B.; University of Alabama Tuscaloosa

Parametric control charts are very attractive and have been used in industry for a very long time. In many applications, however, the underlying process distribution is not known well enough to assume a specific distribution function, and when the distributional assumptions underlying a parametric control chart are violated, the chart's performance can suffer. Since robustness to departures from normality is a desirable property for control charts, this dissertation reports three separate papers on the development and evaluation of robust Shewhart-type control charts for both the univariate and multivariate cases. In addition, a statistical procedure is developed for detecting step changes in the mean of the underlying process, given that Shewhart-type control charts are not very sensitive to smaller changes in the process mean. The estimator is intended to be applied following a control chart signal to aid in diagnosing the root cause of change.
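To sketch the idea of change point diagnostics after a signal, here is the classic normal-theory maximum-likelihood step-change estimator for a known in-control mean. This is illustrative only, not the kernel-based estimator developed in the dissertation: the estimated change point maximizes (n - t) times the squared deviation of the post-change average from mu0.

```python
# Classic MLE of a step-change point in a process mean with known
# in-control value mu0 (illustrative; the dissertation develops
# kernel-estimator-based versions of this diagnostic).
def step_change_mle(x, mu0):
    n = len(x)
    best_t, best_score = 0, float("-inf")
    for t in range(n):                 # candidate: change begins at index t
        tail = x[t:]
        diff = sum(tail) / len(tail) - mu0
        score = len(tail) * diff * diff
        if score > best_score:
            best_t, best_score = t, score
    return best_t                      # index of first post-change observation

x = [0.1, -0.2, 0.0, 0.1, 2.1, 1.9, 2.2, 2.0]
print(step_change_mle(x, mu0=0.0))     # 4: the shift starts at index 4
```

Applied right after a chart signals, such an estimator points the practitioner at the window of time where the special cause most likely entered the process.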
Results indicate that the methodologies proposed throughout this dissertation provide a robust in-control average run length, better detection performance than that offered by the traditional Shewhart control chart and/or Hotelling's control chart, and meaningful change point diagnostic statistics to aid in the search for the special cause.

#### Reduced bias prediction regions and estimators of the original response when using data transformations (University of Alabama Libraries, 2015)

Walker, Michael; Perry, Marcus B.; University of Alabama Tuscaloosa

Initially motivated by electron microscopy experiments, we develop an approximate prediction interval on the univariate response variable Y, where it is assumed that a normal-theory linear model is fit using a transformed version of Y and the transformation type is contained in the Box-Cox family. Further motivated by A-10 single-engine climb experiments, we then develop an approximate prediction interval on the univariate response Y, in which a linear model is fit using a transformed version of Y contained in the Manly exponential family. For each case, we derive a closed-form approximation to the kth moment of the original response variable Y, which is then used to estimate the mean and variance of Y given parameter estimates obtained from fitting the model in the transformed domain. Chebyshev's inequality is then used to construct a 100(1 − α)% prediction interval estimator on Y based on these mean and variance estimators. Extended data obtained from the A-10 single-engine climb experiments motivates the development of prediction regions in the original domain of a q-variate response vector Y through multivariate extensions of both the Box-Cox power transformation and the Manly exponential transformation.
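The last step of that construction is mechanical once the moments are in hand. A minimal sketch with made-up numbers (the dissertation derives the mean and variance estimates from the model fit in the transformed domain): Chebyshev's inequality P(|Y − μ| ≥ kσ) ≤ 1/k² gives k = 1/√α for a 100(1 − α)% interval.

```python
# Chebyshev-based 100(1 - alpha)% prediction interval, given estimates
# of E(Y) and Var(Y) in the original (untransformed) domain.
# Illustrative inputs; real estimates come from the transformed-model fit.
import math

def chebyshev_pi(mean_y, var_y, alpha=0.05):
    k = 1.0 / math.sqrt(alpha)          # from P(|Y-mu| >= k*sigma) <= 1/k^2
    half = k * math.sqrt(var_y)
    return mean_y - half, mean_y + half

lo, hi = chebyshev_pi(mean_y=40.0, var_y=9.0, alpha=0.05)
# k = 1/sqrt(0.05) ~ 4.47, so the interval is roughly (26.6, 53.4)
```

The interval is distribution-free, which is the point of reaching for Chebyshev rather than normal quantiles after transforming back to the original domain; the price is that it is conservative.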
For each transformation, we derive closed-form approximations to the kth moment of each original response Y_i, as well as a closed-form approximation to E(Y_i Y_i'), which are used to estimate the mean and variance of each Y_i and the covariance between them, given parameter estimates obtained from fitting the model in the transformed domain. Exploiting two multivariate analogs of Chebyshev's inequality, we construct an approximate 100(1 − α)% prediction sphere and ellipsoid on the original response vector Y.

#### Three essays on improving ensemble models (University of Alabama Libraries, 2013)

Xu, Jie; Gray, J. Brian; University of Alabama Tuscaloosa

Ensemble models, such as bagging (Breiman, 1996), random forests (Breiman, 2001a), and boosting (Freund and Schapire, 1997), have better predictive accuracy than single classifiers. These ensembles typically consist of hundreds of single classifiers, which makes future predictions and model interpretation much more difficult than for single classifiers. Breiman (2001b) gave random forests a grade of A+ in predictive performance but a grade of F in interpretability. Breiman (2001a) also noted that the performance of an ensemble model depends on the strengths of the individual classifiers in the ensemble and the correlations among them. Reyzin and Schapire (2006) stated that "the margins explanation basically says that when all other factors are equal, higher margins result in lower error," which is referred to as the "large margin theory." Shen and Li (2010) showed that the performance of an ensemble model is related to the mean and the variance of the margins. In this research, we improve ensemble models from two perspectives: increasing interpretability and/or decreasing the test error rate.
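The margin mentioned above has a concrete voting definition: the fraction of ensemble votes for the true class minus the largest fraction for any other class, so a positive margin means the ensemble votes correctly. A small sketch with hypothetical votes:

```python
# Voting margin of an ensemble for one observation: vote share of the
# true class minus the largest vote share of any other class.
from collections import Counter

def margin(votes, true_label):
    counts = Counter(votes)
    n = len(votes)
    correct = counts.get(true_label, 0) / n
    wrong = max((c / n for lab, c in counts.items() if lab != true_label),
                default=0.0)
    return correct - wrong

# hypothetical votes from an 8-classifier ensemble
print(margin(["a"] * 6 + ["b"] * 2, "a"))   # 0.5  (confident and correct)
print(margin(["a"] * 2 + ["b"] * 6, "a"))   # -0.5 (ensemble misclassifies)
```

Shifting the whole distribution of these margins upward (mean) while tightening it (variance) is the lens through which the essays evaluate ensemble improvements.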
We first propose a new method, based on quadratic programming, that uses information on the strengths of the individual classifiers in the ensemble and their correlations to improve or maintain the predictive accuracy of an ensemble while significantly reducing its size. In the second essay, we improve the predictive accuracy of random forests by adding an AdaBoost-like improvement step to random forests. Finally, we propose a method to improve the strength of the individual classifiers by using fully grown trees fitted on weighted resampled training data and then combining the trees with the AdaBoost method.

#### Three essays on the use of margins to improve ensemble methods (University of Alabama Libraries, 2012)

Martinez Cid, Waldyn Gerardo; Gray, J. Brian; University of Alabama Tuscaloosa

Ensemble methods, such as bagging (Breiman, 1996), boosting (Freund and Schapire, 1997), and random forests (Breiman, 2001), combine a large number of classifiers through (weighted) voting to produce strong classifiers. To explain the successful performance of ensembles, and particularly of boosting, Schapire, Freund, Bartlett and Lee (1998) developed an upper bound on the generalization error of an ensemble based on the margins, from which it was concluded that larger margins should lead to lower generalization error, everything else being equal (sometimes referred to as the "large margins theory"). This result has led many researchers to consider direct optimization of functions of the margins (see, e.g., Grove and Schuurmans, 1998; Breiman, 1999; Mason, Bartlett and Baxter, 2000; and Shen and Li, 2010). In this research, we show that the large margins theory is not sufficient for explaining the performance of AdaBoost. Shen and Li (2010) and Xu and Gray (2012) provide evidence suggesting that generalization error might be reduced by increasing the mean and decreasing the variance of the margins, which we refer to as "squeezing" the margins.
For that reason, we also propose several alternative techniques for squeezing the margins and evaluate their effectiveness through simulations with real and synthetic data sets. In addition to the margins being a determinant of ensemble performance, we know that AdaBoost, the most common boosting algorithm, can be very sensitive to outliers and noisy data, since it assigns misclassified observations a higher weight in subsequent runs. Therefore, we propose several techniques to identify and potentially delete noisy samples in order to improve its performance.
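The noise sensitivity comes directly from AdaBoost's reweighting step, which a short sketch makes concrete (standard discrete AdaBoost update on toy weights, not the dissertation's procedures): misclassified observations are multiplied by exp(alpha) and renormalized, so a persistently mislabeled point accumulates ever larger weight.

```python
# One round of the standard AdaBoost reweighting step on illustrative
# weights: misclassified points are upweighted, correct ones downweighted.
import math

def adaboost_reweight(weights, correct):
    """weights: current observation weights; correct: bool per observation."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    err = min(max(err, 1e-12), 1 - 1e-12)        # guard the log
    alpha = 0.5 * math.log((1 - err) / err)      # classifier vote weight
    new_w = [w * math.exp(-alpha if ok else alpha)
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha

w = [0.25, 0.25, 0.25, 0.25]
w, alpha = adaboost_reweight(w, [True, True, True, False])   # one miss
# the misclassified observation's weight rises from 0.25 to about 0.5
```

After one round, a single noisy point already carries half the total weight, which is why identifying and removing such samples before or during boosting can help.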