Automated recognition of paralinguistic signals in spoken dialogue systems: ways of improvement
Žurnal Sibirskogo federalʹnogo universiteta. Matematika i fizika, Tome 8 (2015) no. 2, pp. 208-216.

See the article record from the Math-Net.Ru source

The ability of artificial systems to recognize paralinguistic signals, such as emotions, depression, or openness, is useful in various applications. However, the performance of such recognizers is far from perfect. In this study we consider several directions that can significantly improve the performance of such systems. Firstly, we propose building speaker- or gender-specific emotion models, so that the emotion recognition (ER) procedure is coupled with a gender or speaker identifier. The speaker- or gender-specific information is then either included directly in the feature vector or used to create a separate emotion recognition model for each gender or speaker. Secondly, since feature selection is an important part of any classification problem, we propose a feature selection technique based on a genetic algorithm or on an information-gain criterion. Both methods yield higher performance than baseline methods without any feature selection. Finally, we suggest analysing not only audio signals but also combined audio-visual cues. The early fusion method (feature-level fusion) is used in our investigations to combine the different modalities into a multimodal approach. The results obtained show that the multimodal approach outperforms the single modalities on the corpora considered. The suggested methods have been evaluated on a number of emotional databases in three languages (English, German and Japanese), in both acted and non-acted settings. The results of the numerical experiments are also reported in the study.
Keywords: recognition of paralinguistic signals, machine learning algorithms, speaker-adaptive emotion recognition, multimodal approach.
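Two of the ideas summarized in the abstract, information-gain feature ranking and early (feature-level) fusion, can be sketched in a few lines. The code below is an illustrative sketch only, not the authors' implementation: the feature values, labels, and function names are hypothetical, and real systems would operate on continuous acoustic descriptors rather than the toy discrete features used here.

```python
# Illustrative sketch (NOT the paper's code): information-gain ranking of
# discrete features and early (feature-level) fusion of audio/video vectors.
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(F) = H(Y) - H(Y | F) for one discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def early_fusion(audio_vec, video_vec):
    """Feature-level fusion: concatenate per-utterance feature vectors."""
    return audio_vec + video_vec

# Toy data: feature f1 separates the two emotion classes, f2 does not,
# so f1 receives the higher information-gain score.
labels = ["anger", "anger", "neutral", "neutral"]
f1 = [1, 1, 0, 0]
f2 = [1, 0, 1, 0]
assert information_gain(f1, labels) > information_gain(f2, labels)

# The fused vector simply stacks both modalities before classification.
assert early_fusion([0.2, 0.5], [0.9]) == [0.2, 0.5, 0.9]
```

Selecting the top-ranked features by this score, or evolving a feature subset with a genetic algorithm, then feeding the fused vector to a classifier, follows the general scheme the abstract describes.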
@article{JSFU_2015_8_2_a10,
     author = {Maxim Sidorov and Alexander Schmitt and Eugene S. Semenkin},
     title = {Automated recognition of paralinguistic signals in spoken dialogue systems: ways of improvement},
     journal = {\v{Z}urnal Sibirskogo federalʹnogo universiteta. Matematika i fizika},
     pages = {208--216},
     publisher = {mathdoc},
     volume = {8},
     number = {2},
     year = {2015},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/JSFU_2015_8_2_a10/}
}
