Data Augmentation and Effective Feature Selection in Generative Adversarial Networks for Speech Emotion Recognition

Document Type: Power Article

Authors

Faculty of Electrical Engineering, Shahrood University of Technology, Shahrood, Iran

Abstract

To date, there has been no consensus on whether feature selection methods improve the efficiency of speech emotion recognition (SER) systems. This article examines feature selection in combination with GAN-based data augmentation for an SER system. The experiments were performed on four databases: EMO-DB, eNTERFACE05, SAVEE, and IEMOCAP. Simulations were carried out in Python, and the data in all four databases were analysed for four emotions: sadness, anger, happiness, and neutral. We demonstrate that artificial data generated by a GAN can not only augment the training data but can also be used for feature selection to improve classification performance. A GAN was used to augment the data, and two feature selection methods, the Fisher criterion and linear discriminant analysis (LDA), were applied in two steps; an SVM was used to classify the emotions. Using feedback from the classifier, we tuned the SER system to an optimal number of samples and feature vector dimension. PCA is more effective on correlated data, LDA works better on low-dimensional data, and the Fisher criterion reduces dimensionality more effectively than PCA. The results show that combining the LDA and Fisher methods with the GAN filters the features into a lower-dimensional space while preserving the emotional information needed for classification. Compared with recent research, the proposed method achieved 86.32% accuracy on the EMO-DB database.
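To make the processing chain concrete, the sketch below (in Python, which the abstract names as the simulation environment) strings together the three stages described above: augmenting the training set, ranking features with the Fisher criterion, and projecting the retained features with LDA before an SVM classifier. It is only a minimal illustration under stated assumptions: the acoustic features are random stand-ins rather than real openSMILE vectors, the generate_samples helper is a hypothetical placeholder for the GAN generator, and all dimensions and hyperparameters are illustrative rather than the authors' configuration.

# Minimal sketch of the two-step feature selection pipeline described in the
# abstract (Fisher criterion, then LDA) with an SVM classifier.
# The GAN generator is replaced by the hypothetical generate_samples stand-in;
# feature dimensions and hyperparameters are illustrative assumptions only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fisher_scores(X, y):
    """Fisher criterion per feature: between-class scatter of the class means
    divided by the summed within-class variances."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

def generate_samples(X, y, n_new, rng):
    """Placeholder for the GAN generator: perturbs real samples with small
    Gaussian noise to mimic augmented data."""
    idx = rng.integers(0, len(X), size=n_new)
    noise = rng.normal(scale=0.05 * X.std(axis=0), size=(n_new, X.shape[1]))
    return X[idx] + noise, y[idx]

# Toy stand-in for the extracted acoustic feature vectors (four emotions).
n_samples, n_features, n_classes = 400, 384, 4
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, n_classes, size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 1: augment the training set with synthetic ("GAN") samples.
X_aug, y_aug = generate_samples(X_tr, y_tr, n_new=200, rng=rng)
X_tr = np.vstack([X_tr, X_aug])
y_tr = np.concatenate([y_tr, y_aug])

# Stage 2: keep the top-k features ranked by the Fisher criterion.
k = 60
top = np.argsort(fisher_scores(X_tr, y_tr))[::-1][:k]

# Stage 3: LDA projection to (n_classes - 1) dimensions, then an SVM.
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
Z_tr = lda.fit_transform(X_tr[:, top], y_tr)
Z_te = lda.transform(X_te[:, top])

clf = SVC(kernel="rbf").fit(Z_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(Z_te)))

In the full system, the stand-in generator would be replaced by samples drawn from a trained GAN, and the number of retained features k, together with the amount of augmented data, would be tuned using the classifier feedback mentioned in the abstract.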

Keywords

