An Improved Deep Text Clustering via Local Manifold of an Autoencoder Embedding

Document Type: Computer Article

Authors

1 Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran

2 Department of Computer Engineering, University of Tabriz, Tabriz, Iran

3 Department of Electrical and Computer Engineering, Kharazmi University, Tehran, Iran

Abstract

Text clustering extracts specific information from textual data and can group text by topic or sentiment; it has drawn much interest in recent years. Among clustering techniques, deep clustering methods are especially important because of their high accuracy. These methods have two main components: dimensionality reduction and clustering. Many earlier efforts have employed autoencoders for dimensionality reduction; however, they do not reduce dimensions according to manifold structure, so similar samples are not necessarily placed close to one another in the low-dimensional space. In this paper, we develop a Deep Text Clustering method based on a local Manifold in the Autoencoder layer (DCTMA) that employs multiple similarity matrices to obtain manifold information; the final similarity matrix is the average of these matrices. The resulting matrix is added to the bottleneck representation layer of the autoencoder. DCTMA's main goal is to generate similar representations for samples belonging to the same cluster; once dimensionality reduction has been achieved with high accuracy, clusters are detected using end-to-end deep clustering. Experimental results demonstrate that the proposed method performs surprisingly well in comparison with current state-of-the-art methods on text datasets.
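
The averaging step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measures that are combined, so the cosine and Gaussian-kernel similarities below, the `sigma` parameter, the function names, and the toy feature matrix are all assumptions chosen for illustration; this is not the authors' implementation, and how the averaged matrix enters the bottleneck layer is not shown.

```python
import numpy as np


def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X (assumed measure)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return Xn @ Xn.T


def gaussian_similarity(X, sigma=1.0):
    """Pairwise Gaussian-kernel similarity between the rows of X (assumed measure)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared Euclidean distances
    d2 = np.maximum(d2, 0.0)  # clip small negative values caused by rounding
    return np.exp(-d2 / (2.0 * sigma ** 2))


def average_similarity(X, measures):
    """Element-wise mean of several n-by-n similarity matrices."""
    return np.mean([measure(X) for measure in measures], axis=0)


# Toy document vectors (hypothetical): rows 0 and 1 are close, row 2 is distinct.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])

S = average_similarity(X, [cosine_similarity, gaussian_similarity])
print(np.round(S, 3))
```

In this toy example the averaged matrix is symmetric, and the entry for the two similar documents is larger than the entry for the dissimilar pair, which is the kind of local neighborhood information a manifold-aware embedding would be expected to preserve.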

