Model-Based Clustering of Sequential and Directional Data
The goal of cluster analysis is to separate objects into distinct groups so that observations in the same group are more similar to each other than to observations in other groups. A variety of clustering algorithms have been proposed for this task, among which model-based clustering stands out for its flexibility and usefulness. Model-based clustering typically employs finite mixture models to cluster heterogeneous observations. A finite mixture model is a weighted sum of several probability distributions; each distribution is regarded as one component, and the number of components is called the mixture order. The weight attached to each component is termed the mixing proportion and represents the prior probability that an observation originates from the associated cluster. Mixing proportions are subject to two constraints: each must lie between zero and one, and together they must sum to one. Clustering is challenging because of the increasing complexity of data structures. This thesis addresses problems in the clustering of categorical sequences and directional observations. Clustering algorithms developed for categorical data remain very limited, even though this type of data arises in many areas. Efficient models are therefore needed to capture the nature of categorical data. Directional data also arise in many areas, such as meteorology, astronomy, biology, and medical science. The commonly employed Gaussian mixture models cannot adequately describe the directional nature of such data, whereas the von Mises-Fisher distribution plays a vital role in this area. Real-life directional data often contain noise, outliers, and heavy tails, and current models are very sensitive to their presence. A model that can deal with this problem is also explored in this thesis.
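As a minimal sketch, the finite mixture model described above can be written as follows. The notation is assumed here for illustration and is not necessarily the notation used in the thesis: $K$ is the mixture order, $\pi_k$ the mixing proportions, and $f_k(\cdot \mid \theta_k)$ the $k$-th component density with parameters $\theta_k$.

```latex
% Illustrative notation (an assumption, not taken from the thesis):
% K = mixture order, \pi_k = mixing proportions, f_k = component densities.
\[
  f(x \mid \Theta) \;=\; \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k),
  \qquad
  0 \le \pi_k \le 1,
  \qquad
  \sum_{k=1}^{K} \pi_k = 1 .
\]
```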
For each model, the Expectation-Maximization (EM) algorithm is employed to estimate the parameters of the associated mixture model. The performance of each proposed model is tested on various types of synthetic data and compared with that of existing models. The proposed models are then applied to corresponding real-life data. The results indicate the superiority of the proposed models on both synthetic and real data sets. The thesis is organized as follows. The first chapter gives a brief introduction to the background of cluster analysis. The second chapter introduces a new model that incorporates the temporal character of categorical sequences. In the third chapter, semi-supervised clustering is developed to explore potential factors that can affect the classification of observations. Finally, in the fourth chapter, a new model is proposed to handle noisy directional data.