In the 21st century, finite mixture models have become increasingly important in the statistical analysis of data. The number of articles on mixture model applications in the statistical and general scientific literature is growing steadily, covering cluster analysis, unsupervised pattern recognition, speech recognition, medical imaging and other areas.
The expectation-maximization (EM) algorithm is a well-known and convenient method for parameter estimation in mixture models. However, EM is an iterative algorithm that requires starting values, and different starting values can significantly affect the resulting solution. Beyond the initialization strategy, extensions of EM that modify the iteration scheme of the classic algorithm also affect the resulting solution. Furthermore, the majority of finite mixture models cannot capture clusters with non-elliptical shapes or asymmetric tail dependence. Owing to the greater flexibility of vine copulas, the vine copula mixture model (VCMM) algorithm proposed by Sahin and Czado [2021] is better suited to modelling non-Gaussian multivariate data with clusters.
This thesis studies the performance of the classic Gaussian mixture model (GMM) algorithm and the vine copula mixture model (VCMM) algorithm under different EM algorithms and initialization strategies, for clustering data with different characteristics as well as real data sets. According to our results, when clustering Gaussian data with the GMM algorithm, both the choice of EM algorithm and the choice of initialization strategy have a significant effect on the classification rate. For the best fit, we recommend the expectation/conditional maximization either (ECME) algorithm together with the Nelder-Mead optimization method and initialization by k-means clustering. However, this is not the case for the VCMM algorithm. When clustering data with various characteristics and real data sets with the VCMM algorithm, the choice of EM algorithm does not significantly affect the classification rate, but it does affect computation time. We found that the heuristic-based optimization method (Nelder-Mead) takes more time than the gradient-based optimization method (BFGS) in many situations. Moreover, we found that some initialization strategies for the VCMM algorithm outperform others, depending on the characteristics of the data. The results are summarized as a flow chart in Figure 3.3.19, and the recommended initialization strategies for data clustering are given in Figures 3.3.20 and 3.3.21. Lastly, we show how the VCMM algorithm improves the clustering fit over GMM on two data sets, and we report the performance of the models selected from Figure 3.3.20 for clustering these two real data sets.
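The dependence of EM on its starting values can be illustrated with a minimal sketch. It is not the implementation used in this thesis: it runs classic EM (not ECME, and not the VCMM algorithm) on a univariate two-component Gaussian mixture, and the names `normal_pdf`, `em_gmm_1d`, `data`, `fit_good` and `fit_degenerate` are hypothetical. The simulated data and both sets of starting means are assumptions chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, start_means, n_iter=50, min_var=1e-6):
    """Classic EM for a univariate Gaussian mixture (illustrative sketch).

    Returns (weights, means, variances, trace), where trace[i] is the
    log-likelihood of the parameters at the start of iteration i.
    """
    k = len(start_means)
    n = len(data)
    means = list(start_means)
    weights = [1.0 / k] * k
    # Assumption: every component starts with the overall sample variance.
    grand_mean = sum(data) / n
    variances = [sum((x - grand_mean) ** 2 for x in data) / n] * k
    trace = []
    for _ in range(n_iter):
        # E-step: posterior component probabilities (responsibilities).
        resp = []
        loglik = 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([p / total for p in joint])
        trace.append(loglik)
        # M-step: weighted updates of weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # guard against collapse
    return weights, means, variances, trace


# Simulated data: two well-separated Gaussian clusters (illustrative only).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(150)]
        + [random.gauss(5.0, 1.0) for _ in range(150)])

# Start 1: distinct means. Start 2: identical means, a degenerate start
# from which the two components can never separate.
fit_good = em_gmm_1d(data, [-1.0, 6.0])
fit_degenerate = em_gmm_1d(data, [2.5, 2.5])

print("final log-likelihood, distinct start: ", fit_good[3][-1])
print("final log-likelihood, identical start:", fit_degenerate[3][-1])
```

With identical starting means, every observation receives the same responsibilities for both components, so the two components stay identical and the fit reduces to a single Gaussian. This extreme case only shows that the starting values determine which stationary point EM reaches. It says nothing about the relative merits of the initialization strategies compared in this thesis.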