Some Contributions to Modern Mixture Modeling and Model-Based Clustering
Cluster analysis is a technique for identifying groups of similar objects. Model-based clustering, which is built on finite mixture models, is one of the most popular approaches because of its flexibility and interpretability in modeling heterogeneous data. In this setting, a one-to-one correspondence between mixture components and groups is assumed, and the clustering process can be viewed as model estimation by means of an optimization algorithm. The age of big data poses new challenges. Because of a potentially high number of parameters, finite mixture models are often at risk of being overparameterized, and overparameterization in model-based clustering often results in underestimation of the mixture order. Because the field is growing fast, developing simulation studies to validate mixture models has become another crucial topic. This thesis contributes to modern mixture modeling and model-based clustering, focusing mainly on approaches for addressing overparameterization in this context. In addition, algorithms for simulating various types of clusters are developed, which can be used to evaluate and improve clustering techniques. In each chapter, the expectation-maximization (EM) algorithm for the proposed mixture is developed, expressions for the model parameter estimates are provided, and a corresponding parsimonious procedure is proposed. The utility of the methodologies is tested on both synthetic data and well-known classification datasets. The thesis is organized as follows. In the first chapter, a variable selection procedure is developed and applied to matrix mixture modeling. The second chapter develops a novel mixture modeling approach, called conditional mixture modeling, together with its corresponding parsimonious procedure. The third chapter provides an extension for simulating heterogeneous data in order to study the performance of clustering algorithms systematically.
Finally, the fourth chapter describes the functionality of the R package cmbClust, developed for clustering multivariate data using the methodology proposed in the second chapter.