Automatic Speaker Recognition based on Gabor Features and Convolutional Neural Networks

Document Type: Power Article

Authors

1 Department of Computer Engineering, Engineering Faculty, Lorestan University, Khorramabad, Iran

2 Department of Electrical Engineering, Faculty of Engineering, Yasouj University, Yasouj, Iran

3 Department of Electrical Engineering, Engineering Faculty, Lorestan University, Khorramabad, Iran

Abstract

The human voice carries information about a speaker's ethnicity, gender, emotional state, age, and other characteristics, and speaker recognition identifies people from their voice. Although researchers have worked in this area for years and have proposed methods to improve recognition accuracy, challenges remain. This paper proposes a new speaker recognition method based on a Gabor filter bank and convolutional neural networks. First, the spectrogram of the speech signal is computed, and a Gabor filter bank is designed to extract effective features from the speech signal. Next, the spectrogram is passed through the Gabor filter bank to extract the speech features. Finally, speaker recognition is performed by a convolutional neural network. Two datasets, Aurora2 and TIMIT, are used to evaluate the proposed method. The results show that the accuracy of the proposed method is competitive with state-of-the-art methods.
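
The feature-extraction stage described in the abstract (spectrogram, then a bank of 2D Gabor filters) might be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the frame length, hop size, kernel size, envelope width, and modulation frequencies are all assumed values, and the function names are hypothetical. The CNN classifier that would consume the resulting feature maps is omitted.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram from a Hann-windowed short-time FFT.
    frame_len and hop are assumed values, not taken from the paper."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Transpose so rows are frequency bins and columns are time frames.
    return np.log(magnitude + 1e-8).T

def gabor_kernel(omega_f, omega_t, size=15, sigma=3.0):
    """Real 2D Gabor kernel: Gaussian envelope times a cosine carrier.
    omega_f and omega_t are spectral and temporal modulation frequencies
    in radians per bin and per frame (assumed parameterization)."""
    half = size // 2
    axis = np.arange(-half, half + 1)
    f, t = np.meshgrid(axis, axis, indexing="ij")
    envelope = np.exp(-(f ** 2 + t ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(omega_f * f + omega_t * t)
    kernel = envelope * carrier
    # Subtract the mean so flat spectrogram regions give zero response.
    return kernel - kernel.mean()

def filter_same(image, kernel):
    """Zero-padded 2D convolution via FFT, cropped to the input size."""
    H, W = image.shape
    kh, kw = kernel.shape
    shape = (H + kh - 1, W + kw - 1)
    full = np.fft.irfft2(np.fft.rfft2(image, s=shape) *
                         np.fft.rfft2(kernel, s=shape), s=shape)
    return full[kh // 2:kh // 2 + H, kw // 2:kw // 2 + W]

def gabor_features(spec, modulations):
    """Stack one filtered spectrogram per (omega_f, omega_t) pair.
    The result has shape (n_filters, freq_bins, time_frames) and could
    serve as a multi-channel input to a CNN."""
    return np.stack([filter_same(spec, gabor_kernel(wf, wt))
                     for wf, wt in modulations])

if __name__ == "__main__":
    # Synthetic tone standing in for a speech signal.
    sig = np.sin(2 * np.pi * 0.05 * np.arange(4000))
    spec = log_spectrogram(sig)
    feats = gabor_features(spec, [(0.5, 0.0), (0.0, 0.5), (0.5, 0.5)])
    print(spec.shape, feats.shape)
```

Each filter in the bank responds to one combination of spectral and temporal modulation, so the stacked outputs give the classifier several views of the same spectrogram. A real system would choose the modulation frequencies to match the structure of speech rather than the arbitrary values used here.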

