Modern problems of automatic speech recognition
News of the Kabardin-Balkar scientific center of RAS, no. 6 (2020), pp. 20-33.

See the record of this article at its source, Math-Net.Ru

This paper provides a concise review of the most widely used methods in speech recognition. Transcription principles developed at the Linguistic Data Consortium are discussed, and the difficulties of measuring human performance on the speech recognition task are described. Typical human errors are analyzed. Transcribers are shown to be highly consistent both in careful transcription of prepared English speech and in rapid transcription of conversational telephone speech, and the word disagreement rate is shown to grow as the speech becomes more complex. The paper presents a comparative analysis of the errors produced by a recognition system and those made by humans, examining their similarities and differences. Current problems of automatic speech recognition are listed, and the prospects for their solution and directions for future research are assessed.
Keywords: deep learning, artificial intelligence, artificial neuron networks, speech recognition, human parity.
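The abstract's central metric, the rate of word-level disagreement between a reference transcript and a hypothesis (machine output or a second human transcriber), is conventionally computed as word error rate. As an illustration (not part of the original article), a minimal sketch of the standard edit-distance computation over words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    via the classic Levenshtein dynamic program over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("there" for "their") in a 4-word reference -> WER 0.25
print(word_error_rate("over their dead bodies", "over there dead bodies"))
```

The same function applied to two human transcripts of the same audio yields the word disagreement rate the abstract refers to; "human parity" claims compare a system's WER against that inter-transcriber figure.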
@article{IZKAB_2020_6_a2,
     author = {I. A. Gurtueva},
     title = {Modern problems of automatic speech recognition},
     journal = {News of the Kabardin-Balkar scientific center of RAS},
     pages = {20--33},
     publisher = {mathdoc},
     number = {6},
     year = {2020},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/IZKAB_2020_6_a2/}
}
TY  - JOUR
AU  - I. A. Gurtueva
TI  - Modern problems of automatic speech recognition
JO  - News of the Kabardin-Balkar scientific center of RAS
PY  - 2020
SP  - 20
EP  - 33
IS  - 6
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/IZKAB_2020_6_a2/
LA  - ru
ID  - IZKAB_2020_6_a2
ER  - 
%0 Journal Article
%A I. A. Gurtueva
%T Modern problems of automatic speech recognition
%J News of the Kabardin-Balkar scientific center of RAS
%D 2020
%P 20-33
%N 6
%I mathdoc
%U http://geodesic.mathdoc.fr/item/IZKAB_2020_6_a2/
%G ru
%F IZKAB_2020_6_a2
I. A. Gurtueva. Modern problems of automatic speech recognition. News of the Kabardin-Balkar scientific center of RAS, no. 6 (2020), pp. 20-33. http://geodesic.mathdoc.fr/item/IZKAB_2020_6_a2/

[1] M. Campbell, A. J. Hoane, F.-h. Hsu, “Deep Blue”, Artificial Intelligence, 134 (2002), 57–83 | DOI | Zbl

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search”, Nature, 529 (2016), 484–489 | DOI

[3] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., Deep Speech 2: End-to-end speech recognition in English and Mandarin, 2015, arXiv: 1512.02595

[4] T. T. Kristjansson, J. R. Hershey, P. A. Olsen, S. J. Rennie, R. A. Gopinath, “Super-human multi-talker speech recognition: the IBM 2006 Speech Separation Challenge system”, Proc. Interspeech, 12 (2006), 155

[5] C. Weng, D. Yu, M. L. Seltzer, J. Droppo, “Single-channel mixed speech recognition using deep neural networks”, Proc. IEEE ICASSP, 2014, 5632–5636

[6] D. S. Pallett, “A look at NIST's benchmark ASR tests: past, present and future”, IEEE Automatic Speech Recognition and Understanding Workshop, 2003, 483–488 | DOI

[7] P. Price, W. M. Fisher, J. Bernstein, D. S. Pallett, “The DARPA 1000-word resource management database for continuous speech recognition”, Proc. IEEE ICASSP, 1988, 651–654 | MR

[8] D. B. Paul, J. M. Baker, “The design for the wall street journal-based csr corpus”, Proceedings of the workshop on Speech and Natural Language, 1992, 357–362 | DOI

[9] D. Graff, Z. Wu, R. MacIntyre, M. Liberman, “The 1996 broadcast news speech and language-model corpus”, Proceedings of the DARPA Workshop on Spoken Language Technology, 1997, 11–14 | MR

[10] A. Ljolje, “The AT&amp;T 2001 LVCSR system”, NIST LVCSR Workshop, 2001

[11] D. Philipov, Interactive Voice Text Editing Using New Speech Technologies from Yandex, 2014 https://habr.com/ru/company/yandex/blog/243813/

[12] S. F. Chen, B. Kingsbury, L. Mangu, D. Povey, G. Saon, H. Soltau, G. Zweig, “Advances in speech transcription at IBM under the DARPA EARS program”, IEEE Trans. Audio, Speech, and Language Processing, 14 (2006), 1596–1608 | DOI

[13] F. Seide, G. Li, D. Yu, “Conversational speech transcription using context-dependent deep neural networks”, Proc. Interspeech, 2011, 437–440

[14] S. Matsoukas, J. L. Gauvain, G. Adda, T. Colthurst, C. L. Kao, O. Kimball, L. Lamel, F. Lefevre, J. Z. Ma, J. Makhoul et al., “Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system”, IEEE Transactions on Audio, Speech, and Language Processing, 14 (2006), 1541–1556 | DOI

[15] A. Stolcke, B. Chen, H. Franco, V. R. R. Gadde, M. Graciarena, M. Y. Hwang, K. Kirchhoff, A. Mandal, N. Morgan, X. Lei et al., “Recent innovations in speech-to-text transcription at SRI-ICSI-UW”, IEEE Transactions on Audio, Speech, and Language Processing, 14 (2006), 1729–1744 | DOI

[16] J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, F. Lefevre, “Conversational telephone speech recognition”, Proc. IEEE ICASSP, vol. 1, 2003, I-212

[17] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, P. C. Woodland, “Development of the 2003 CU-HTK conversational telephone speech transcription system”, Proc. IEEE ICASSP, vol. 1, 2004, I-249

[18] D. B. Fry, “Theoretical aspects of mechanical speech recognition”, J. British Inst. Radio Engr., 1959, 211–229

[19] T. K. Vintsyuk, “Speech discrimination by dynamic programming”, Kibernetika, 4 (1968), 81–88 | DOI | MR

[20] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimal decoding algorithm”, IEEE Trans. Information Theory, IT-13 (1967), 260–269 | DOI | Zbl

[21] D. R. Reddy, An approach to computer speech recognition by direct analysis of the speech wave, Tech. Report No. C549, Stanford Univ., Computer Science Dept., 1966. | MR

[22] V. M. Velichko, N. G. Zagoruyko, “Automatic recognition of 200 words”, Int. J. Man-Machine Studies, 1970, no. 2 | Zbl

[23] H. Sakoe, S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition”, IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-26:1 (1978), 43–49 | MR | Zbl

[24] L. R. Rabiner et al., “Speaker independent recognition of isolated words using clustering techniques”, IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-27 (1979), 336–349 | MR | Zbl

[25] D. Klatt, “Review of the ARPA speech understanding project”, J.A.S.A., 62:6 (1977), 1324-1366.

[26] B. Lowerre, “The HARPY speech understanding system”, Trends in Speech Recognition, ed. W. Lea, Speech Science Pub., 1990, 576–586 | DOI | MR

[27] L. R. Rabiner, B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey, 1993.

[28] S. Katagiri, “Speech pattern recognition using neural networks”, Pattern Recognition in Speech and Language Processing, eds. W. Chou, B.-H. Juang, CRC Press, 2003, 115–147

[29] C. S. Myers, L. R. Rabiner, “A level building dynamic time warping algorithm for connected word recognition”, IEEE Trans. Acoustics, Speech, Signal Proc., ASSP, 29 (1981), 284-297 | DOI | Zbl

[30] C. H. Lee, L. R. Rabiner, “A frame synchronous network search algorithm for connected word recognition”, IEEE Trans. Acoustics, Speech, Signal Proc., 37:11 (1989), 1649–1658 | DOI

[31] J. S. Bridle, M. D. Brown, “Connected word recognition using whole word templates”, Proc. Inst. Acoust. Autumn Conf., 1979, 25–28

[32] B.-H. Juang, S. Furui, “Automatic speech recognition and understanding: A first step toward natural human-machine communication”, Proc. IEEE, 88:8 (2000), 1142–1165 | DOI | MR

[33] W. Chou, “Minimum classification error (MCE) approach in pattern recognition”, Pattern Recognition in Speech and Language Processing, eds. W. Chou, B.-H. Juang, CRC Press, 2003, 1–49

[34] C. J. Leggetter, P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models”, Computer Speech and Language, 1995, no. 9, 171-185 | DOI

[35] A. P. Varga, R. K. Moore, “Hidden Markov model decomposition of speech and noise”, Proc. ICASSP, 1990, 845–848

[36] M. J. F. Gales, S. J. Young, Parallel model combination for speech recognition in noise, Technical Report, CUED/F-INFENG/TR135, 1993

[37] K. Shinoda, C. H. Lee, “A structural Bayes approach to speaker adaptation”, IEEE Trans. Speech and Audio Proc., 9:3 (2001), 276-287 | DOI

[38] A. Stolcke, J. Droppo, “Comparing Human and Machine Errors in Conversational Speech Transcription”, Proc. Interspeech, 2017, 137–141

[39] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, P. Hall, “English Conversational Telephone Speech Recognition by Humans and Machines”, INTERSPEECH, 2017 | Zbl

[40] R. P. Lippmann, “Speech recognition by machines and humans”, Speech Communication, 22:1 (1997), 1–15 | DOI

[41] M. L. Glenn, S. M. Strassel, H. Lee, K. Maeda, R. Zakhary, X. Li, “Transcription Methods for Consistency, Volume and Efficiency”, Proceedings of the International Conference on Language Resources and Evaluation (LREC), Malta, 2010

[42] A. Hannun, Speech Recognition Is Not Solved, 2017. https://awni.github.io/speech-recognition

[43] C. Han, J. O'Sullivan, Y. Luo, J. Herrero, A. D. Mehta, N. Mesgarani, “Speaker-independent auditory attention decoding without access to clean speech sources”, Sci Adv., 5:5 (2019), eaav6134. (PMID: 31106271; PMCID: PMC652)