A predictive model for highway accidents and two papers on clustering averaging
Predictive models and clustering algorithms are two of the most important statistical methodologies for solving quantitative problems. This dissertation proposes several innovative prediction and clustering techniques and demonstrates their successful application to several real-world problems.
Chapter 1 discusses how the choice of highway safety performance function (SPF), as a predictive model of crash rate, affects the estimated importance of various highway intersection characteristics. A highway data inventory of 36 safety-relevant parameters along state routes in Alabama is used to study the importance of the road characteristics and their interactions. Four SPFs are considered: Poisson regression, negative binomial regression, a regularized generalized linear model, and boosted regression trees (BRT). Overall, BRT outperforms the other models in predictive accuracy because it can account for non-linearities and multi-way interactions. Additionally, the boosted tree model identifies several important variables, such as pedestrian crossing control type and distance to the next public intersection, that are ignored by the other SPFs. Although models of linear form offer straightforward interpretations of the relationship between crash rate and the road characteristics, BRT better identifies critical variables and achieves superior prediction accuracy.
Chapter 2 presents an improvement of Bayesian model-based clustering using similarity-based model aggregation and a clustering estimation approach, non-negative matrix factorization (NMF). In Bayesian model-based clustering, the MCMC algorithm provides sufficient output for statistical inference on the model-specific parameters. However, traditional posterior inference techniques, such as maximum a posteriori (MAP) estimation, are difficult to apply to the partitioning vector due to the exchangeability of the cluster labels. Therefore, this chapter proposes a methodology for estimating the final partitioning vector based on the close relationship between NMF and the loss-function approaches in the literature. Our method not only provides a clustering solution of better accuracy but also enables a soft, or probabilistic, interpretation of the cluster assignments.
Chapter 3 illustrates how clustering averaging can be used to refine model-based clustering with finite mixture models. Because the Expectation-Maximization (EM) algorithm for estimating finite mixture models is notably sensitive to initialization and to the specification of the correct number of clusters K, clustering model averaging can be employed to provide an aggregated partition better than any individual solution. However, various specifications are available for each step of the clustering aggregation and estimation process. This chapter proposes an aggregated multi-component clustering algorithm (AMCCA) that optimizes the options for each step of the clustering aggregation. Additionally, our algorithm adds an extra step of multi-component clustering using an initial partition from NMF, which yields better clustering performance than existing approaches, including Gaussian mixture models.
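The label-exchangeability issue raised for Chapter 2 can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `coassociation` and the three toy draws are hypothetical, chosen only for illustration. The sketch shows why a pairwise similarity (co-clustering) matrix is a convenient object to aggregate: it is unchanged when cluster labels are swapped, whereas the raw label vectors are not. A matrix of this kind is the sort of input that a factorization step such as NMF could then be applied to; that step is not shown here.

```python
def coassociation(partitions):
    """Return the n x n matrix whose (i, j) entry is the fraction of
    partitions in which items i and j share a cluster label."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Three hypothetical MCMC draws of a partition vector over five items.
# Draws 1 and 2 encode the same partition with the two labels swapped;
# draw 3 moves item 2 into the other cluster.
draws = [
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
]

S = coassociation(draws)

# Swapping the labels in every draw changes each label vector
# but leaves the similarity matrix untouched.
relabelled = [[1 - z for z in d] for d in draws]
S_relabelled = coassociation(relabelled)

for row in S:
    print(["%.2f" % v for v in row])
print("invariant to label switching:", S == S_relabelled)
```

Entries strictly between 0 and 1, such as the one for items 0 and 2, reflect disagreement across draws, which is the information a soft or probabilistic assignment would draw on.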