Topic Models Ensembles for AD-HOC Information Retrieval

RL5, Publisher: Information


Pablo Ormeño, Marcelo Mendoza, Carlos Valle


Ad hoc information retrieval (ad hoc IR) is a challenging task that consists of ranking text documents for bag-of-words (BOW) queries. Classic approaches based on query and document text vectors use term-weighting functions to rank the documents. One limitation of these methods is their inability to handle polysemous concepts. In addition, these methods introduce spurious orthogonalities between semantically related words. To address these limitations, model-based IR approaches based on topics have been explored. Specifically, topic models based on Latent Dirichlet Allocation (LDA) allow building representations of text documents in the latent space of topics, which models polysemy better and avoids orthogonal representations of related terms. We extend LDA-based IR strategies using different ensemble strategies. Model selection follows the ensemble learning paradigm, for which we test two successful approaches widely used in supervised learning. We study Boosting and Bagging techniques for topic models, using each model as a weak IR expert. Then, we merge the ranking lists obtained from each model using a simple but effective top-k list fusion approach. We show that our proposal improves precision and recall, outperforming classic IR models and strong baselines based on topic models.
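The final step of the abstract, merging the ranked lists produced by several topic models, can be illustrated with a small sketch. The abstract does not specify the fusion rule, so the Borda-style scoring below is an assumption chosen for illustration, not the authors' method. The function name `fuse_top_k` and the document identifiers are hypothetical. In practice, each input list would come from one LDA model in the ensemble.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def fuse_top_k(rankings: Sequence[Sequence[Hashable]], k: int) -> List[Hashable]:
    """Merge several ranked lists into one top-k list.

    Assumption: Borda-style scoring, where a document at 0-based
    position p in a list's top k earns k - p points from that list.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Only the top k entries of each weak expert contribute.
        for position, doc_id in enumerate(list(ranking)[:k]):
            scores[doc_id] += k - position
    # Highest total score first; ties broken by identifier for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], str(d)))
    return ordered[:k]


# Hypothetical ranked lists from three ensemble members for one query.
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = fuse_top_k(rankings, k=3)
print(fused)
```

With these inputs, `d2` collects the most points because two of the three lists rank it first, so it leads the fused list. Other fusion rules, such as summing retrieval scores instead of rank positions, would fit the same interface.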
