Spoken Persian Digit Recognition Using Deep Learning Networks

Article type: Computer article

Authors

1 Faculty of Electrical and Computer Engineering, Semnan University

2 Faculty of Electrical and Computer Engineering, Semnan University

3 Faculty of Electrical and Computer Engineering, Semnan University

Abstract

Classification of isolated digits is a fundamental challenge for many speech classification systems. While much work has been done on spoken languages, only limited research on Persian spoken digit data has been reported in the literature, and all of it has been restricted to the digits 0 to 9. To address this, a comprehensive database covering a wider range of numbers was collected with the participation of 145 people: 70 men and 75 women. The database covers the numerical range 0 to 599. After preprocessing, the audio data are converted to Mel spectrograms, and a convolutional neural network, as well as a hybrid model combining a Transformer with a long short-term memory (LSTM) network, is used for feature extraction and classification. Experimental results on the collected database show a validation accuracy of 98.03%. Various analyses were also carried out on the experimentation and testing of the models.


Article title [English]

Spoken Persian digits recognition using deep learning

Authors [English]

  • Sahar Zarbafi 1
  • Kourosh Kiani 2
  • Razieh Rastgoo 3
1 M.Sc. Student, Faculty of Electrical and Computer Engineering, Semnan University, Semnan, Iran
2 Associate Professor, Faculty of Electrical and Computer Engineering, Semnan University, Semnan, Iran
3 Assistant Professor, Faculty of Electrical and Computer Engineering, Semnan University, Semnan, Iran
Abstract [English]

Classification of isolated digits is a fundamental challenge for many speech classification systems. Previous work on spoken Persian digits has been limited to the numbers 0 to 9. In this paper, we propose two deep learning-based models for spoken digit recognition in the range 0 to 599. The first is a Convolutional Neural Network (CNN) that takes the Mel spectrogram of the audio data as input. The second builds on recent advances in deep sequential models: a Transformer followed by a Long Short-Term Memory (LSTM) network and a classifier. We also collected a dataset of audio recordings from 145 speakers, covering the numerical range 0 to 599. Experimental results on the collected dataset show a validation accuracy of 98.03%.
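
The abstract does not give the parameters of the Mel spectrogram step, so the following is only a minimal sketch of how audio can be turned into a log-Mel spectrogram before being passed to a classifier. It uses the Python standard library alone to stay self-contained; in practice a library such as librosa or torchaudio would normally be used. The sampling rate, frame size, hop length, number of Mel bands, and all function names are assumptions chosen for illustration, not values taken from the paper.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to the Mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Build n_mels triangular filters over the n_fft // 2 + 1 DFT bins."""
    n_bins = n_fft // 2 + 1
    # n_mels + 2 edge points, equally spaced on the Mel scale up to Nyquist.
    top = hz_to_mel(sample_rate / 2.0)
    hz_points = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        left, center, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Squared magnitude of a naive DFT of a Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return a list of frames, each holding n_mels log-Mel energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([math.log(e + 1e-10) for e in energies])
    return frames


# Synthetic 1 kHz tone at an assumed 8 kHz sampling rate (not the paper's data).
SAMPLE_RATE = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(1024)]
spec = mel_spectrogram(tone, SAMPLE_RATE)
print(len(spec), "frames x", len(spec[0]), "Mel bands")
```

The resulting time-by-frequency array is the kind of two-dimensional input a CNN can treat as an image, or a sequence model can read frame by frame. With labels covering 0 to 599, the final classifier in either model would have 600 output classes.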

Keywords [English]

  • Spoken digits
  • Persian digits
  • Deep learning
  • Convolutional Neural Network (CNN)
  • Mel spectrogram
  • Transformer