Article Type: Computer Article
Authors
Computer Engineering, Faculty of Electrical and Computer Engineering, Semnan University, Semnan, Iran
Abstract
Keywords
Subjects
Article Title [English]
Authors [English]
With the growth of information, extracting knowledge from textual collections has become essential. Topic modeling is an unsupervised machine learning technique that uncovers the hidden themes in document collections. In this paper, inspired by BERTopic, we present an unsupervised topic modeling method for Persian texts. The proposed approach uses the LaBSE language embedding model to convert texts into embedding vectors, reduces their dimensionality with UMAP, and then groups similar texts into clusters with the K-Means algorithm. A cluster-token matrix is then formed and a topic representation technique is applied to extract the topics of each cluster. We compared LaBSE with other language embedding models, including XLM-R, ParsBERT, Paraphrase-multilingual-MiniLM-L12-v2, Shiraz, and HooshvareLab (RoBERTa), and we also compared the K-Means and HDBSCAN clustering algorithms. For evaluation, the Asre Iran dataset was used, and both the NPMI coherence metric and human evaluation confirmed the proposed method's performance. With HDBSCAN, HooshvareLab (RoBERTa) yielded the best coherence, while ParsBERT performed best in human evaluation; with K-Means, Paraphrase-multilingual-MiniLM-L12-v2 achieved the best coherence and LaBSE the best human evaluation. K-Means was also shown to outperform HDBSCAN. Furthermore, using the Asre Iran and Tasnim datasets separately, the proposed method was compared with non-negative matrix factorization, latent Dirichlet allocation, and latent semantic analysis, and the results demonstrate its superior performance.
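As a concrete illustration of the pipeline described in the abstract, the sketch below strings together sentence embedding, dimensionality reduction, clustering, and a cluster-token topic representation. It is a minimal sketch, not the authors' implementation: it assumes the public "sentence-transformers/LaBSE" checkpoint, the umap-learn and scikit-learn libraries, a simple c-TF-IDF-style token weighting, and illustrative names and hyperparameters (extract_topics, n_clusters, n_components, top_n) that are not taken from the paper.

    # Minimal sketch of the described pipeline; hyperparameters are illustrative only.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer
    import umap

    def extract_topics(docs, n_clusters=20, n_components=5, top_n=10):
        # 1) Encode Persian documents into multilingual sentence embeddings (LaBSE).
        encoder = SentenceTransformer("sentence-transformers/LaBSE")
        embeddings = encoder.encode(docs, show_progress_bar=False)

        # 2) Reduce embedding dimensionality with UMAP before clustering.
        reduced = umap.UMAP(n_components=n_components, metric="cosine",
                            random_state=42).fit_transform(embeddings)

        # 3) Group similar documents with K-Means.
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=42).fit_predict(reduced)

        # 4) Build a cluster-token matrix: concatenate each cluster's documents
        #    and count token frequencies per cluster.
        cluster_docs = [" ".join(d for d, l in zip(docs, labels) if l == c)
                        for c in range(n_clusters)]
        vectorizer = CountVectorizer()
        counts = vectorizer.fit_transform(cluster_docs).toarray()
        vocab = np.array(vectorizer.get_feature_names_out())

        # 5) c-TF-IDF-style topic representation: term frequency within a cluster,
        #    weighted by an IDF-like factor computed over clusters.
        tf = counts / counts.sum(axis=1, keepdims=True).clip(min=1)
        idf = np.log(1 + counts.shape[0] / (1 + (counts > 0).sum(axis=0)))
        scores = tf * idf

        # The top-n weighted tokens of each cluster serve as its topic words.
        return {c: vocab[np.argsort(scores[c])[::-1][:top_n]].tolist()
                for c in range(n_clusters)}

The resulting topic words per cluster can then be scored for coherence, for example with gensim's CoherenceModel using coherence="c_npmi" over the tokenized corpus, which is one common way to compute the NPMI metric mentioned above.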
Keywords [English]