From Data to Insight: Topic Modelling and Automatic Topic Labelling Strategies

Authors

  • Rana F. Najeeb, University of Mustansiriyah
  • Ban N. Dhannoon, Al-Nahrain University, Baghdad, Iraq
  • Farah Qais Alkhalidi

DOI:

https://doi.org/10.54153/sjpas.2025.v7i4.1061

Keywords:

Deep Learning, S-BERT, Dimensionality Reduction, Topic Coherence, Topic Diversity

Abstract

To enhance the interpretability of data for decision-making, scientific, biomedical, and social media text collections require efficient machine learning techniques. Topic models aid text mining across sources such as blogs, Twitter data, scientific journals, and biomedical papers. Yet even when topic modeling surfaces the important concepts, finding appropriate labels remains difficult, and automating topic evaluation and labelling reduces analysts' cognitive effort. Some techniques rely on word frequency to produce labels from words, phrases, or images, while extractive methods choose labels based on probability measures. This study improves topic modelling on a collection of Neural Information Processing Systems (NIPS) conference papers published between 1987 and 2017, pursuing two goals: producing more coherent topics and labelling those topics automatically. The first goal was achieved in five phases: a text pre-processing phase; a reduction phase using a new method called SR-LW (Sentence Reduction based on Length and Weight), which removes short sentences, computes a weight for each remaining sentence, and discards approximately 25% of the lowest-weight sentences; a sentence embedding phase using S-BERT (Sentence-Bidirectional Encoder Representations from Transformers); a dimensionality-reduction phase applying Uniform Manifold Approximation and Projection (UMAP) to the sentence embeddings; and, lastly, a clustering phase in which Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) groups comparable documents. The experimental findings demonstrate that the proposed SR-LW phase produces more cohesive topics, reaching a topic coherence of 0.593 and a topic diversity of 0.96. Although topic modelling extracts the most salient sentences describing the latent topics in a text collection, it does not by itself identify an appropriate label.
The second goal was achieved with a new method that generates keywords by accessing the authors’ Google Scholar profiles and extracting their research interests, which are then used to label the topics automatically.
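The SR-LW reduction step described above (drop short sentences, weight the rest, discard roughly the lowest-weighted 25%) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the weighting scheme, so average corpus term frequency is assumed here, and the `min_words` threshold and the `sr_lw_reduce` name are likewise hypothetical.

```python
from collections import Counter

def sr_lw_reduce(sentences, min_words=5, drop_fraction=0.25):
    """Sketch of SR-LW-style reduction: length filter, then weight filter."""
    # Phase 1: remove sentences shorter than the length threshold.
    kept = [s for s in sentences if len(s.split()) >= min_words]
    if not kept:
        return []
    # Phase 2: weight each remaining sentence. The paper's weighting is not
    # given in the abstract; average term frequency over the kept sentences
    # is assumed purely for illustration.
    tf = Counter(w.lower() for s in kept for w in s.split())
    def weight(s):
        words = s.split()
        return sum(tf[w.lower()] for w in words) / len(words)
    # Phase 3: discard roughly the lowest-weighted `drop_fraction` of sentences.
    ranked = sorted(kept, key=weight, reverse=True)
    cutoff = len(ranked) - int(len(ranked) * drop_fraction)
    survivors = set(ranked[:cutoff])
    # Preserve the original sentence order among the survivors.
    return [s for s in kept if s in survivors]
```

In the full pipeline, the surviving sentences would then be embedded with S-BERT, projected with UMAP, and clustered with HDBSCAN, as the abstract describes.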

Published

2025-12-30

How to Cite

From Data to Insight: Topic Modelling and Automatic Topic Labelling Strategies. (2025). Samarra Journal of Pure and Applied Science, 7(4), 207-223. https://doi.org/10.54153/sjpas.2025.v7i4.1061
