Unsupervised Query Reformulation through Latent Concept Induction in Large-Scale Heterogeneous Information Retrieval Environments
Abstract
In large-scale heterogeneous information retrieval (IR) environments, user queries are often semantically ambiguous or structurally sparse, which limits retrieval effectiveness. This paper proposes an unsupervised query reformulation framework based on latent concept induction (LCI), which learns implicit semantic structures from retrieved document sets. Unlike supervised approaches, the proposed model uncovers latent concepts autonomously through document co-occurrence and context propagation. Experiments on the TREC and ClueWeb datasets show significant improvements in mean average precision (MAP) and normalized discounted cumulative gain (nDCG) over both baseline and supervised models. Because the LCI framework requires no annotated query reformulation data, it scales across domains and languages.
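As a rough illustration of the co-occurrence idea summarized above, the following minimal sketch expands a query with terms that co-occur with its terms in an initially retrieved document set. It is not the LCI model itself: the whitespace tokenizer, the overlap-based scoring rule, and the function names `induce_expansion_terms` and `reformulate` are assumptions made for illustration only, and context propagation is omitted entirely.

```python
from collections import Counter


def induce_expansion_terms(query, retrieved_docs, top_k=3):
    """Rank non-query terms by co-occurrence with query terms (illustrative only)."""
    query_terms = set(query.lower().split())
    scores = Counter()
    for doc in retrieved_docs:
        tokens = set(doc.lower().split())
        # Number of distinct query terms that appear in this document.
        overlap = len(tokens & query_terms)
        if overlap == 0:
            continue
        # Credit every other term in the document with that overlap.
        for term in tokens - query_terms:
            scores[term] += overlap
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def reformulate(query, retrieved_docs, top_k=3):
    """Append the induced expansion terms to the original query terms."""
    return query.lower().split() + induce_expansion_terms(query, retrieved_docs, top_k)


if __name__ == "__main__":
    # Hypothetical toy documents standing in for a first-pass retrieval result.
    docs = [
        "jaguar car engine speed",
        "jaguar car dealer price",
        "jaguar cat habitat jungle",
    ]
    print(reformulate("jaguar car", docs, top_k=2))
```

In this toy setting, terms from documents that match both query terms outrank terms from the document that matches only "jaguar", which is the disambiguating effect that unsupervised reformulation aims for.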