Speech-based emotion recognition and speaker identification: static vs. dynamic mode of speech representation
Žurnal Sibirskogo federalʹnogo universiteta. Matematika i fizika, Vol. 9 (2016), no. 4, pp. 518-523.

In this paper we present the performance of different machine learning algorithms on the problems of speech-based Emotion Recognition (ER) and Speaker Identification (SI) in the static and dynamic modes of speech signal representation. We have used a multi-corpus, multi-language approach in this study: three databases for the SI problem and four databases for the ER task, covering three different languages (German, English and Japanese), were used to evaluate the models. More than 45 machine learning algorithms were applied to these tasks in both modes, and the results are presented and discussed here.
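To make the static/dynamic distinction concrete: in the dynamic mode an utterance is represented as a time series of frame-level acoustic vectors, while in the static mode those frames are collapsed into a single fixed-length vector of utterance-level statistics. The following is a minimal Python sketch of the two representations, assuming librosa for feature extraction and scikit-learn for classification; the MFCC feature set, the wav_paths/labels variables, and the toolkit choice are illustrative assumptions, not the actual setup used in the paper.

# Minimal sketch of static vs. dynamic speech representation.
# librosa and scikit-learn are illustrative stand-ins, not the
# toolkits used in the paper; wav_paths and labels are hypothetical.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def dynamic_features(path):
    """Dynamic mode: a time series of frame-level acoustic vectors."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
    return mfcc.T                                        # one vector per frame

def static_features(path):
    """Static mode: one fixed-length vector of utterance-level statistics."""
    frames = dynamic_features(path)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Static mode: one feature vector and one label per utterance, so any
# conventional classifier applies directly.
# X = np.array([static_features(p) for p in wav_paths])
# clf = RandomForestClassifier().fit(X, labels)
#
# Dynamic mode: a sequence model (e.g. an HMM), or a per-frame classifier
# with utterance-level voting, would consume dynamic_features() directly.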
Keywords: emotion recognition from speech, speaker identification from speech, machine learning algorithms, speaker adaptive emotion recognition from speech.
@article{JSFU_2016_9_4_a15,
     author = {Maxim Sidorov and Wolfgang Minker and Eugene S. Semenkin},
     title = {Speech-based emotion recognition and speaker identification: static vs. dynamic mode of speech representation},
     journal = {\v{Z}urnal Sibirskogo federalʹnogo universiteta. Matematika i fizika},
     pages = {518--523},
     publisher = {mathdoc},
     volume = {9},
     number = {4},
     year = {2016},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/JSFU_2016_9_4_a15/}
}
Maxim Sidorov; Wolfgang Minker; Eugene S. Semenkin. Speech-based emotion recognition and speaker identification: static vs. dynamic mode of speech representation. Žurnal Sibirskogo federalʹnogo universiteta. Matematika i fizika, Tome 9 (2016) no. 4, pp. 518-523. http://geodesic.mathdoc.fr/item/JSFU_2016_9_4_a15/
