Gaussian processes generalize the normal distribution from finite to infinite dimensions. Whereas a finite-dimensional probability distribution describes a finite collection of random variables, a stochastic process governs the properties of functions. When Gaussian processes are used for regression in machine learning, large datasets pose a challenge; one solution is to subsample the data. A particular way to subsample a large dataset is "landmarking", which selects a small set of data points that best represents the full dataset. In this thesis, two existing algorithms serve as references for finding landmarks, which are then applied to classification. We developed two landmarking algorithms that incorporate a minibatch approach and iteratively reuse the old landmarks to extract new ones. We also developed a stochastic sampling approach to compute the optimal kernel bandwidth for each dataset. Finally, we used the landmarks to classify all the points in the test data with k-nearest-neighbor (KNN) and Gaussian process classifiers. We evaluated this approach on five different datasets. As a result, we substantially reduced the computation time for finding landmarks and increased the classification accuracy based on the found landmarks. In addition, we found that the Gaussian process classifier performs as well as or better than KNN in terms of classification accuracy.