Neural network model for multimodal recognition of human aggression
Vestnik KRAUNC. Fiziko-matematičeskie nauki, Volume 33 (2020) no. 4, pp. 132-149. This article was harvested from the Math-Net.Ru source.


The growing user base of socio-cyberphysical systems, smart environments, and IoT (Internet of Things) systems makes the detection of destructive user actions, such as various acts of aggression, an increasingly pressing problem. Such destructive actions can manifest in different modalities: locomotion, the facial expressions associated with it, non-verbal speech behavior, and verbal speech behavior. This paper considers a neural network model for multimodal recognition of human aggression based on constructing an intermediate feature space that is invariant to the particular modality being processed. The proposed model ensures high-fidelity aggression recognition even when data for a given modality are scarce or absent. Experimental evaluation showed 81.8% recognition accuracy.
Keywords: aggression recognition, behavior analysis, neural networks, multimodal data processing.
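The abstract's central idea — per-modality encoders that project heterogeneous inputs into a shared, modality-invariant feature space scored by a single classifier — can be sketched in PyTorch. This is a minimal illustrative sketch, not the author's exact architecture: all module names, feature dimensions, and the mean-pooling fusion are assumptions.

```python
# Hypothetical sketch of a shared-embedding multimodal classifier.
# Dimensions and fusion strategy are illustrative assumptions, not
# taken from the paper.
import torch
import torch.nn as nn


class SharedSpaceAggressionModel(nn.Module):
    def __init__(self, dims=None, embed_dim=64, n_classes=2):
        super().__init__()
        # Assumed per-modality input feature sizes.
        dims = dims or {"video": 512, "audio": 128, "text": 300}
        # One small encoder per modality, all mapping into the same
        # embed_dim-dimensional (modality-invariant) space.
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, embed_dim), nn.ReLU())
            for m, d in dims.items()
        })
        # A single classifier head operates on the shared space.
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, inputs):
        # `inputs` maps modality name -> feature tensor; modalities that
        # are missing are simply absent, so recognition degrades
        # gracefully when data for some modality are lacking.
        z = [self.encoders[m](x) for m, x in inputs.items()]
        fused = torch.stack(z).mean(dim=0)  # average in the shared space
        return self.classifier(fused)


model = SharedSpaceAggressionModel()
video = torch.randn(4, 512)
audio = torch.randn(4, 128)
out_full = model({"video": video, "audio": audio})   # both modalities
out_video_only = model({"video": video})             # audio missing
```

Because fusion happens by pooling in the shared space rather than by concatenating raw modality features, the classifier's input shape does not depend on which modalities are available at inference time.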
@article{VKAM_2020_33_4_a11,
     author = {M. Yu. Uzdyaev},
     title = {Neural network model for multimodal recognition of human aggression},
     journal = {Vestnik KRAUNC. Fiziko-matemati\v{c}eskie nauki},
     pages = {132--149},
     year = {2020},
     volume = {33},
     number = {4},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VKAM_2020_33_4_a11/}
}
TY  - JOUR
AU  - M. Yu. Uzdyaev
TI  - Neural network model for multimodal recognition of human aggression
JO  - Vestnik KRAUNC. Fiziko-matematičeskie nauki
PY  - 2020
SP  - 132
EP  - 149
VL  - 33
IS  - 4
UR  - http://geodesic.mathdoc.fr/item/VKAM_2020_33_4_a11/
LA  - ru
ID  - VKAM_2020_33_4_a11
ER  - 
%0 Journal Article
%A M. Yu. Uzdyaev
%T Neural network model for multimodal recognition of human aggression
%J Vestnik KRAUNC. Fiziko-matematičeskie nauki
%D 2020
%P 132-149
%V 33
%N 4
%U http://geodesic.mathdoc.fr/item/VKAM_2020_33_4_a11/
%G ru
%F VKAM_2020_33_4_a11
M. Yu. Uzdyaev. Neural network model for multimodal recognition of human aggression. Vestnik KRAUNC. Fiziko-matematičeskie nauki, Volume 33 (2020) no. 4, pp. 132-149. http://geodesic.mathdoc.fr/item/VKAM_2020_33_4_a11/

[1] Berkowitz L., Aggression: Its causes, consequences, and control, McGraw-Hill Book Company, 1993, 158 pp.

[2] Bandura A., Aggression: A social learning analysis, Prentice-Hall, 1973

[3] Enikolopov S. N., “Ponyatie agressii v sovremennoy psikhologii” [The concept of aggression in modern psychology], Prikladnaya psikhologiya, 2001, no. 1, 60–72 (in Russian)

[4] Buss A. H., The psychology of aggression, Wiley, 1961

[5] El Ayadi M., Kamel M. S., Karray F., “Survey on speech emotion recognition: Features, classification schemes, and databases”, Pattern Recognition, 44:3 (2011), 572–587 | DOI

[6] Trigeorgis G. et al., “Adieu features end-to-end speech emotion recognition using a deep convolutional recurrent network”, 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2016, 5200–5204 | DOI

[7] De Souza F. D. M. et al., “Violence detection in video using spatio-temporal features”, Graphics, Patterns and Images (SIBGRAPI), 23rd SIBGRAPI Conference, IEEE, 2010, 224–230

[8] Lefter I., Rothkrantz L. J. M., Burghouts G. J., “A comparative study on automatic audio–visual fusion for aggression detection using meta-information”, Pattern Recognition Letters, 2010, 1953–1963

[9] Lefter I. et al., “Addressing multimodality in overt aggression detection”, International Conference on Text, Speech and Dialogue, Springer, Berlin, Heidelberg, 2010, 25–32

[10] Zajdel W. et al., “CASSANDRA: audio-video sensor fusion for aggression detection”, 2007 IEEE conference on advanced video and signal based surveillance, IEEE, 2007, 200–205 | DOI

[11] Kooij J. F. P. et al., “Multi-modal human aggression detection”, Computer Vision and Image Understanding, 144 (2016), 106–120 | DOI

[12] Qiu Q. et al., “Multimodal information fusion for automated recognition of complex agitation behaviors of dementia patients”, 10th International Conference on Information Fusion, IEEE, 2007, 1–8

[13] Giannakopoulos T. et al., “Audio-visual fusion for detecting violent scenes in videos”, Hellenic conference on artificial intelligence, Springer, Berlin, Heidelberg, 2010, 91–100

[14] Lienhart R., Maydt J., “An extended set of Haar-like features for rapid object detection”, Proceedings. International conference on image processing, v. 1, IEEE, 2002

[15] Yang Z., Multi-modal aggression detection in trains, 2009

[16] Lefter I., Burghouts G. J., Rothkrantz L. J. M., “Learning the fusion of audio and video aggression assessment by meta-information from human annotations”, 2012 15th International Conference on Information Fusion, IEEE, 2012, 1527–1533

[17] Lefter I., Multimodal Surveillance: Behavior analysis for recognizing stress and aggression, 2014

[18] Lefter I., et al., “NAA: A multimodal database of negative affect and aggression”, 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2017, 21–27 | DOI

[19] Lefter I., Rothkrantz L. J. M., “Multimodal cross-context recognition of negative interactions”, 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), IEEE, 2017, 56–61

[20] Patwardhan A., Knapp G., Aggressive actions and anger detection from multiple modalities using Kinect, 2016, arXiv

[21] Levonevskii D. et al., “Methods for Determination of Psychophysiological Condition of User Within Smart Environment Based on Complex Analysis of Heterogeneous Data”, Proceedings of 14th International Conference on Electromechanics and Robotics “Zavalishin's Readings”, Springer, Singapore, 2020, 511–523 | DOI

[22] Uzdiaev M. et al., “Metody detektirovaniya agressivnykh pol'zovateley informatsionnogo prostranstva na osnove generativno-sostyazatel'nykh neyronnykh setey” [Methods for detecting aggressive users of the information space based on generative adversarial neural networks], Informatsionno-izmeritel'nye i upravlyayushchie sistemy, 17:5 (2019), 60–68 (in Russian)

[23] Uzdiaev M., “Methods of Multimodal Data Fusion and Forming Latent Representation in the Human Aggression Recognition Task”, 2020 IEEE 10th International Conference on Intelligent Systems (IS), IEEE, 2020, 399–403 | DOI

[24] Zhang K., et al., “Joint face detection and alignment using multitask cascaded convolutional networks”, IEEE Signal Processing Letters, 23:10 (2016), 1499–1503 | DOI

[25] Zhang X., et al., “ShuffleNet: An extremely efficient convolutional neural network for mobile devices”, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 6848–6856

[26] Mollahosseini A., Hasani B., Mahoor M. H., “AffectNet: A database for facial expression, valence, and arousal computing in the wild”, IEEE Transactions on Affective Computing, 10:1 (2017), 18–31 | DOI

[27] Ioffe S., Szegedy C., “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, 2015, arXiv: 1502.03167

[28] Nair V., Hinton G. E., “Rectified linear units improve restricted Boltzmann machines”, Proceedings of the 27th international conference on machine learning (ICML-10), 2010, 807–814

[29] Hara K., Kataoka H., Satoh Y., “Learning spatio-temporal features with 3D residual networks for action recognition”, Proceedings of the IEEE International Conference on Computer Vision, 2017, 3154–3160

[30] Hara K., Kataoka H., Satoh Y., “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?”, Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, 6546–6555

[31] He K., et al., “Deep residual learning for image recognition”, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, 770–778

[32] Kay W., et al., “The kinetics human action video dataset”, 2017, arXiv: 1705.06950

[33] Simonyan K., Zisserman A., “Very deep convolutional networks for large-scale image recognition”, 2014, arXiv: 1409.1556

[34] Hochreiter S., Schmidhuber J., “Long short-term memory”, Neural computation, 9:8 (1997), 1735–1780 | DOI

[35] Schuster M., Paliwal K. K., “Bidirectional recurrent neural networks”, IEEE transactions on Signal Processing, 45:11 (1997), 2673–2681 | DOI

[36] Srivastava N., et al., “Dropout: a simple way to prevent neural networks from overfitting”, The journal of machine learning research, 15:1 (2014), 1929–1958 | MR

[37] Busso C., et al., “IEMOCAP: Interactive emotional dyadic motion capture database”, Language resources and evaluation, 42:4 (2008), 335 | DOI

[38] PyTorch, https://pytorch.org/

[39] Chicco D., “Siamese neural networks: An overview”, Artificial Neural Networks, 2020, 73–94

[40] Goodfellow I., et al., “Generative adversarial nets”, Advances in neural information processing systems, 2014, 2672–2680