An Embedding-Based Topic Model for Document Classification

Sattar Seifollahi, Massimo Piccardi, Alireza Jolfaei

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation (LDA), which relies on a bag-of-words assumption for the documents. However, bag-of-words models disregard the relationships between words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach exploits the distributional properties of word embeddings instead of relying on the bag-of-words assumption. Experimental results on several data sets from an Australian compensation organization show that the proposed algorithm is highly competitive in a document classification task.
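The two-stage idea in the abstract can be sketched in a few lines. This is only an illustrative toy, not the authors' method: it uses random 2-D vectors as stand-ins for pretrained embeddings, single words instead of sampled n-grams, and a simple soft k-means in place of whatever soft-clustering the paper employs. Stage 1 soft-clusters embedded words into topics to obtain topic-word distributions; stage 2 derives a document-topic distribution from the topic weights of the document's words.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumption): in the paper the embeddings come from a
# pretrained model and the clustered units are n-grams; here we embed a
# 6-word vocabulary as random 2-D vectors purely for illustration.
vocab = ["claim", "injury", "payment", "invoice", "doctor", "hospital"]
emb = rng.normal(size=(len(vocab), 2))

K = 2  # number of topics
centroids = emb[rng.choice(len(vocab), K, replace=False)].copy()

# Stage 1: soft k-means over the embedded words.
for _ in range(20):
    d2 = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    resp = np.exp(-d2)                       # soft topic responsibilities
    resp /= resp.sum(axis=1, keepdims=True)  # each word sums to 1 over topics
    centroids = (resp.T @ emb) / resp.sum(axis=0)[:, None]

# Topic-word distributions: normalize responsibilities per topic (K x V).
topic_word = resp.T / resp.T.sum(axis=1, keepdims=True)

# Stage 2: a document's topic distribution from its words' topic weights
# (the paper samples topics; here we simply aggregate and normalize).
doc = ["claim", "payment", "invoice"]
idx = [vocab.index(w) for w in doc]
doc_topic = resp[idx].sum(axis=0)
doc_topic /= doc_topic.sum()
print(doc_topic)  # a K-dimensional distribution over topics
```

The resulting `doc_topic` vector is the kind of low-dimensional document representation that can then be fed to a classifier, which is how the paper evaluates the model.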

Original language: English
Article number: 52
Number of pages: 13
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 20
Issue number: 3
Publication status: Published - May 2021
Externally published: Yes

Keywords

  • clustering
  • document classification
  • topic modelling
  • word embedding
