Topic discovery in massive text corpora based on Min-Hashing

Topics have proved to be a valuable source of information for exploring, discovering, searching and representing the contents of text corpora. They have also been useful for different natural language processing tasks such as text classification, text summarization and machine translation. Most existing topic discovery approaches require the number of topics to be provided beforehand. However, an appropriate number of topics for a given corpus depends on its characteristics and is often difficult to estimate. In addition, in order to handle massive amounts of text documents, the vocabulary must be reduced considerably and large computer clusters and/or GPUs are typically required. This paper describes Sampled Min-Hashing (SMH), a scalable approach to topic discovery which does not require the number of topics to be specified in advance and can handle massive text corpora and large vocabularies using modest computer resources. The basic idea behind SMH is to generate multiple random partitions of the corpus vocabulary to find sets of highly co-occurring words, which are then clustered to produce the final topics. An extensive qualitative and quantitative evaluation on the 20 Newsgroups, Reuters, Spanish Wikipedia and English Wikipedia corpora shows that SMH is able to consistently discover meaningful and coherent topics at scale. Remarkably, the time required by SMH grows linearly with the size of the corpus and the number of words in the vocabulary; a nonparallel implementation of SMH was able to discover topics from the whole English version of Wikipedia (approximately 5M documents) with a vocabulary of 1M words in less than 7 h. Our findings provide further evidence of the relevance and generality of beyond-pairwise co-occurrences for pattern discovery on large-scale discrete data, which opens the door for other applications and several interesting research directions.
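
The partitioning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cooccurring_word_sets`, the parameters `tuple_size` and `num_tables`, and the use of explicit random permutations of document ids as min-hash functions are all assumptions made for illustration, and the final clustering of the word sets into topics is omitted. Each word is represented by the set of documents in which it occurs; words whose min-hash tuples are identical land in the same bucket, so every hash table induces one random partition of the vocabulary.

```python
import random
from collections import defaultdict


def cooccurring_word_sets(docs, tuple_size=2, num_tables=50, seed=0):
    """Illustrative sketch (hypothetical names and defaults).

    docs: list of token lists.
    Returns a set of frozensets, each holding words that shared a bucket
    in at least one hash table.
    """
    rng = random.Random(seed)

    # Inverted file: word -> ids of the documents that contain it.
    occurrences = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            occurrences[token].add(doc_id)

    n_docs = len(docs)
    found = set()
    for _ in range(num_tables):
        # Each random permutation of document ids acts as one min-hash
        # function; a table uses `tuple_size` of them.
        perms = []
        for _ in range(tuple_size):
            ranks = list(range(n_docs))
            rng.shuffle(ranks)
            perms.append(ranks)

        # Words with identical min-hash tuples fall into the same bucket,
        # so the buckets of one table partition the vocabulary.
        buckets = defaultdict(set)
        for word, doc_ids in occurrences.items():
            key = tuple(min(ranks[d] for d in doc_ids) for ranks in perms)
            buckets[key].add(word)

        # Keep only buckets that group two or more words.
        for words in buckets.values():
            if len(words) > 1:
                found.add(frozenset(words))
    return found
```

For a toy corpus such as `[["cat", "dog"], ["cat", "dog"], ["sun", "moon"], ["sun", "moon"]]`, the sketch groups `cat` with `dog` and `sun` with `moon`: words with identical document sets always collide, while words with disjoint document sets never do. Explicit permutations are used here only for readability; they would not be practical for a corpus of millions of documents.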

Fuentes-Pineda, G., & Meza-Ruiz, I. V. (2019). Topic discovery in massive text corpora based on min-hashing. Expert Systems with Applications, 136, 62-72.