This thesis aims to process recordings of human conversation and to characterize them with respect to their social context. The social characterization is part of a larger system that predicts the interruptibility of smartphone users by recording and classifying their conversations. The characteristics used are speaker emotions on the one hand and speech affects on the other. All predictions are based solely on the acoustic signal, without processing the spoken text.

The first part of the thesis studies different emotion classification methods on the benchmark datasets Berlin-EMO, FAU Aibo and SEMAINE. Although all of these datasets have been processed and classified in previous research, this approach additionally introduces major disturbances such as noise and different smartphone channel effects. Furthermore, the experiments are restricted to Mel-frequency cepstral coefficients (MFCCs) as input data in order to reduce the data transfer from the smartphone. The goal of imposing these limitations is to test the known classification methods under real-life conditions. The algorithms applied are Gaussian Mixture Models (GMMs), i-vectors, Anchor Models and Random Forests.

The second part of the thesis focuses on predicting lower-level features, namely the speech affect dimensions valence, arousal, power and expectation. These affects could be used to predict interruptibility directly or as an intermediate step towards predicting the emotion. For applying and testing the regression of these affects, the SEMAINE dataset is used as a benchmark. Again, the data is disturbed by noise and channel effects and reduced to MFCC features. The algorithms applied are Support Vector Regressors, Random Forest Regressors and long short-term memory recurrent neural networks (LSTM-RNNs).

For emotion recognition, the GMM approach showed the best performance with an unweighted average recall (UAR) of 29.75\% averaged over all datasets and audio types. The overall low level of the results is caused by the limitations of the real-life setting. For the affect regression, the best performance was reached by the Random Forest Regressor with $R^2$ values of $-1.60$ for valence, $-0.65$ for arousal, $-2.84$ for expectation and $-1.36$ for power. Although these values are very low, transforming the task into a ternary classification ($-1$, $0$, $1$) reaches average recall values of 52.84\% for valence, 56.04\% for arousal, 53.02\% for expectation and 54.01\% for power, averaged over all audio types.
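For reference, the two evaluation measures quoted above are commonly defined as follows (standard definitions, with $K$ classes, per-class true positives $TP_k$ and false negatives $FN_k$, and $N$ predictions $\hat{y}_i$ of targets $y_i$ with mean $\bar{y}$):
\begin{align}
  \mathrm{UAR} &= \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}, \\
  R^2 &= 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}.
\end{align}
In particular, $R^2$ is negative whenever the regressor performs worse than the constant mean predictor $\bar{y}$, which puts the negative regression results reported above into perspective.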