The goal of this bachelor thesis is to design and evaluate a simulation study of model selection with the singular Bayesian Information Criterion (sBIC) for Gaussian mixture models. The focus is on finding an optimal value of the sBIC parameter φ. We classify the data sets by their complexity and investigate how strongly that complexity affects the optimal value. All calculations are done in R, in particular with the packages mclust and MixSim. First, we simulate two-dimensional data sets with varying numbers of mixture components. We then test whether the conclusions drawn from these simulations also hold in higher dimensions and for models with significantly more components. In addition, we investigate the influence of different models for the singular components. The results are as follows. No single value of φ appears to produce the best results in all cases. With a higher number of components and more complex data sets, smaller values of φ yield better results. However, the sBIC with φ = 4, 5, or 6 performs better than or comparably to the BIC in virtually all situations. The only exceptions seem to be data sets whose components barely overlap, where the BIC sometimes achieves marginally better results. The different models do influence the model selection result, in that the number of selected components has a larger variance, but they do not change the best possible value of φ. With a very large number of components, the sBIC cannot select the correct number with a high degree of certainty, but it still achieves good results on average, which the BIC cannot do. We can therefore give some guidance on the choice of φ, but not a definitive answer.