A New and Efficient Feature Extraction Method for Robust Speech Recognition Based on Fractional Fourier Transform and Differential Evolution Optimizer

Document Type : Power Article

Authors

1 Shahrood University of Technology

2 Faculty of Electrical and Computer Engineering, Shahrood University of Technology

Abstract

Noise-robust feature extraction is one of the main challenges in speech recognition. This paper proposes a new feature extraction algorithm, called Fractional and Adaptive Power Normalized Cepstral Coefficients, as a noise-robust method for speech recognition. The method is based on the short-time fractional Fourier transform. Because the choice of the fractional transform order is important for the proper analysis of multi-component signals such as speech, the method uses the Differential Evolution meta-heuristic algorithm to select the optimum order α adaptively, according to the class of noise in the environment. The robustness and recognition accuracy of the resulting automatic speech recognition (ASR) system are evaluated on the TIDIGITS and NOISEX-92 databases. Simulation results show that the proposed feature extraction method is more robust and achieves higher recognition accuracy than the other methods tested, in both noisy and noise-free environments. The ASR system uses a Support Vector Machine (SVM) classifier with a nonlinear kernel, and all simulations are performed in MATLAB.
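The adaptive selection of α described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (the paper's simulations are in MATLAB) and rests on several assumptions: the objective `toy_error` is a stand-in for whatever criterion the paper minimises (for example, recognition error on validation data), the noise class names and their placeholder optima in `PLACEHOLDER_OPTIMA` are invented purely so the search has something to find, and the search interval [0, 2] for α is assumed. The sketch shows only the general shape of a Differential Evolution search (DE/rand/1 mutation with greedy selection) over a single scalar parameter, run once per noise class.

```python
import random


def differential_evolution_1d(objective, lower, upper, pop_size=20,
                              F=0.6, generations=100, seed=0):
    """Minimise a scalar function on [lower, upper] with DE/rand/1 mutation.

    With a single parameter, binomial crossover always takes the mutant
    value, so the trial point is simply the clipped mutant.
    """
    rng = random.Random(seed)
    pop = [rng.uniform(lower, upper) for _ in range(pop_size)]
    cost = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, none of them member i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            trial = pop[a] + F * (pop[b] - pop[c])
            trial = min(max(trial, lower), upper)  # keep inside the bounds
            trial_cost = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


# ASSUMPTION: placeholder values, not results from the paper. They only
# give the toy objective a known minimum for each hypothetical noise class.
PLACEHOLDER_OPTIMA = {"babble": 0.85, "factory": 0.92, "white": 0.97}


def toy_error(alpha, best_alpha):
    """Stand-in objective: squared distance from a placeholder optimum."""
    return (alpha - best_alpha) ** 2


def select_alpha_per_noise_class(optima=PLACEHOLDER_OPTIMA):
    """Run one DE search per noise class and return a class -> alpha table."""
    table = {}
    for noise, target in optima.items():
        alpha, _ = differential_evolution_1d(
            lambda a, t=target: toy_error(a, t), lower=0.0, upper=2.0)
        table[noise] = alpha
    return table


if __name__ == "__main__":
    for noise, alpha in select_alpha_per_noise_class().items():
        print(f"{noise}: alpha ~ {alpha:.3f}")
```

In a real system the stand-in objective would be replaced by a routine that extracts features with the given α and scores them, and the resulting table would be consulted at recognition time once the noise class has been identified.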
