From Data to Insight: Topic Modelling and Automatic Topic Labelling Strategies
DOI: https://doi.org/10.54153/sjpas.2025.v7i4.1061
Keywords: Deep Learning, S-BERT, Dimensionality Reduction, Topic Coherence, Topic Diversity
Abstract
Scientific, biomedical, and social media text collections require efficient machine learning techniques to make the data interpretable for decision-making. Topic models aid text mining across sources such as blogs, Twitter data, scientific journals, and biomedical papers. Even when topic modeling surfaces the important concepts, finding appropriate labels for them remains difficult, and automating topic evaluation and labelling reduces analysts' cognitive effort. Extractive methods choose labels based on probability measures, while other techniques rely on word frequency to produce labels consisting of words, phrases, or images. This study improves topic modeling on a collection of Neural Information Processing Systems (NIPS) conference papers published between 1987 and 2017 and pursues two goals: producing more coherent topics and labelling topics automatically. The first goal was achieved in five phases: a text pre-processing phase; a reduction phase using a new method called SR-LW (Sentence Reduction based on Length and Weight), which removes short sentences, computes a weight for each remaining sentence, and discards approximately 25% of the lowest-weight sentences; a sentence embedding phase using S-BERT (Sentence-Bidirectional Encoder Representations from Transformers); a dimensionality reduction phase applying Uniform Manifold Approximation and Projection (UMAP) to the sentence embeddings; and, lastly, a clustering phase in which Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) groups comparable documents. The experimental findings show that the proposed SR-LW phase produces more cohesive topics, achieving a topic coherence of 0.593 and a topic diversity of 0.96. Although topic modelling extracts the most salient sentences describing the latent topics in a text collection, it does not by itself identify an appropriate label for each topic.
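The SR-LW reduction step (drop short sentences, then discard roughly the lowest-weight 25% of the rest) can be sketched as below. The abstract does not specify the exact weighting scheme, so the mean corpus term frequency used here is an assumption, not the paper's definition:

```python
from collections import Counter


def sr_lw(sentences, min_len=5, drop_frac=0.25):
    """Sketch of the SR-LW idea: remove sentences shorter than min_len
    tokens, weight the remainder, and drop the lowest-weight fraction.
    The weight (mean corpus term frequency) is an assumed stand-in for
    the paper's unspecified scoring."""
    # Step 1: remove sentences of shorter length.
    kept = [s for s in sentences if len(s.split()) >= min_len]
    # Step 2: weight each remaining sentence by average term frequency.
    tf = Counter(w.lower() for s in kept for w in s.split())

    def weight(s):
        toks = s.split()
        return sum(tf[w.lower()] for w in toks) / len(toks)

    # Step 3: discard approximately the lowest-weight drop_frac of sentences.
    ranked = sorted(kept, key=weight, reverse=True)
    n_keep = max(1, int(round(len(ranked) * (1 - drop_frac))))
    return ranked[:n_keep]
```

The surviving sentences would then feed the S-BERT embedding, UMAP reduction, and HDBSCAN clustering phases.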
The second goal was achieved by proposing a new method that generates labelling keywords by accessing the authors' profiles on Google Scholar and extracting their research interests, which are then used to label the topics automatically.
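Once the interests are extracted from the Google Scholar profiles, one simple way to assign them as labels is to pick, for each topic, the interest with the largest word overlap with the topic's keywords. This matching rule is a hypothetical illustration; the paper's exact assignment procedure may differ:

```python
def label_topic(topic_keywords, author_interests):
    """Assign the author interest (e.g. scraped from a Google Scholar
    profile) that shares the most words with a topic's keywords.
    The overlap score is an assumed matching rule, not the paper's."""
    def tokens(text):
        return set(text.lower().split())

    kw = tokens(" ".join(topic_keywords))
    best, best_score = None, -1
    for interest in author_interests:
        score = len(tokens(interest) & kw)  # shared-word count
        if score > best_score:
            best, best_score = interest, score
    return best
```

For example, a topic with keywords like "neural", "network", "deep", "learning" would be labelled with the interest "deep learning" rather than "computer vision".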
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright Notice
Authors retain copyright and grant the SJPAS journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in Samarra Journal of Pure and Applied Science.
The Samarra Journal of Pure and Applied Science permits and encourages authors to archive pre-print and post-print versions of items submitted to the journal on personal websites or institutional repositories, at the author's choice, while providing bibliographic details that credit their submission and publication in this journal. This includes archiving a submitted version, an accepted version, or a published version without any risk.