From Data to Insight: Topic Modelling and Automatic Topic Labelling Strategies
DOI: https://doi.org/10.54153/sjpas.2025.v7i4.1061
Keywords: Deep Learning, S-BERT, Dimensionality Reduction, Topic Coherence, Topic Diversity
Abstract
Scientific, biomedical, and social media text collections require efficient machine learning techniques to make the data interpretable for decision-making. Topic models aid text mining across sources such as blogs, Twitter data, scientific journals, and biomedical papers. Even when topic modeling surfaces the important concepts, finding appropriate labels for them remains difficult, and automating topic evaluation and labelling reduces analysts' cognitive effort. Extractive methods choose labels based on probability measures, while other techniques rely on word frequency to produce labels consisting of words, phrases, or images. This study improves topic modeling on a collection of Neural Information Processing Systems (NIPS) conference papers published between 1987 and 2017 and pursues two goals: producing more coherent topics and labelling topics automatically. The first goal was achieved in five phases: a text pre-processing phase; a reduction phase using a new method called SR-LW (Sentence Reduction based on Length and Weight), which removes short sentences, computes a weight for each remaining sentence, and discards approximately 25% of the lowest-weight sentences; a sentence embedding phase using S-BERT (Sentence-Bidirectional Encoder Representations from Transformers); a dimensionality reduction phase applying Uniform Manifold Approximation and Projection (UMAP) to the sentence embeddings; and, lastly, a clustering phase in which Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) groups comparable documents. The experimental findings show that the proposed SR-LW phase produces more cohesive topics, achieving a topic coherence of 0.593 and a topic diversity of 0.96. Although topic modelling extracts the most salient sentences describing the latent topics in a text collection, it does not by itself identify an appropriate label for each topic.
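The SR-LW reduction step (drop short sentences, then discard roughly the lowest-weight 25% of the rest) can be sketched as below. The abstract does not specify the exact weighting scheme, so the mean corpus term frequency used here is an assumption, not the paper's definition:

```python
from collections import Counter


def sr_lw(sentences, min_len=5, drop_frac=0.25):
    """Sketch of the SR-LW idea: remove sentences shorter than min_len
    tokens, weight the remainder, and drop the lowest-weight fraction.
    The weight (mean corpus term frequency) is an assumed stand-in for
    the paper's unspecified scoring."""
    # Step 1: remove sentences of shorter length.
    kept = [s for s in sentences if len(s.split()) >= min_len]
    # Step 2: weight each remaining sentence by average term frequency.
    tf = Counter(w.lower() for s in kept for w in s.split())

    def weight(s):
        toks = s.split()
        return sum(tf[w.lower()] for w in toks) / len(toks)

    # Step 3: discard approximately the lowest-weight drop_frac of sentences.
    ranked = sorted(kept, key=weight, reverse=True)
    n_keep = max(1, int(round(len(ranked) * (1 - drop_frac))))
    return ranked[:n_keep]
```

The surviving sentences would then feed the S-BERT embedding, UMAP reduction, and HDBSCAN clustering phases.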
The second goal was achieved by proposing a new method that generates labelling keywords by accessing the authors' profiles on Google Scholar and extracting their research interests, which are then used to label the topics automatically.
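Once the interests are extracted from the Google Scholar profiles, one simple way to assign them as labels is to pick, for each topic, the interest with the largest word overlap with the topic's keywords. This matching rule is a hypothetical illustration; the paper's exact assignment procedure may differ:

```python
def label_topic(topic_keywords, author_interests):
    """Assign the author interest (e.g. scraped from a Google Scholar
    profile) that shares the most words with a topic's keywords.
    The overlap score is an assumed matching rule, not the paper's."""
    def tokens(text):
        return set(text.lower().split())

    kw = tokens(" ".join(topic_keywords))
    best, best_score = None, -1
    for interest in author_interests:
        score = len(tokens(interest) & kw)  # shared-word count
        if score > best_score:
            best, best_score = interest, score
    return best
```

For example, a topic with keywords like "neural", "network", "deep", "learning" would be labelled with the interest "deep learning" rather than "computer vision".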
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright Notice
Authors retain copyright and grant the SJPAS journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in Samarra Journal of Pure and Applied Science.
The Samarra Journal of Pure and Applied Science permits and encourages authors to archive pre-print and post-print versions of items submitted to the journal on personal websites or institutional repositories, at the author's choice, while providing bibliographic details that credit their submission and publication in this journal. This includes archiving a submitted version, an accepted version, or a published version without any risk.