
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin. 2025

Tags: Content Enrichment, Datasets, Efficiency, Evaluation, Model Architecture, Question Answering, RAG, Training Techniques, Visual Contextualization

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on “false negatives”, where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
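To make the cascading idea concrete, below is a minimal sketch of how such a relabeling pass could look. It assumes an OpenAI-style chat client; the model names, prompt wording, and the rule of escalating only the cheap judge's "relevant" verdicts to the stronger judge are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of cascading LLM relabeling of mined hard negatives.
# Assumptions (not from the paper): prompt text, escalation rule, and the use
# of gpt-4o-mini as the cheap judge and gpt-4o as the strong judge.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Query: {query}\n\nPassage: {passage}\n\n"
    "Does the passage answer or directly support the query? Answer Yes or No."
)


def judge(model: str, query: str, passage: str) -> bool:
    """Ask one LLM whether a mined hard negative is actually relevant."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")


def relabel_hard_negatives(query: str, hard_negatives: list[str]) -> dict:
    """Cascade: the cheap judge screens every negative; only passages it flags
    as relevant are escalated to the stronger, more expensive judge."""
    positives, negatives = [], []
    for passage in hard_negatives:
        if judge("gpt-4o-mini", query, passage) and judge("gpt-4o", query, passage):
            positives.append(passage)   # likely false negative -> relabel as positive
        else:
            negatives.append(passage)   # keep as a hard negative for training
    return {"relabeled_positives": positives, "hard_negatives": negatives}
```

The cascade keeps cost low because the expensive judge only sees the small fraction of negatives the cheap judge already considers relevant, while the final relabeling decision rests with the model that agrees best with human annotators.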

Similar Work