Haystack Engineering: Context Engineering Meets the Long-Context Challenge in Large Language Models
Mufei Li ⋅ Dongqi Fu ⋅ Limei Wang ⋅ Si Zhang ⋅ Hanqing Zeng ⋅ Kaan Sancak ⋅ Ruizhong Qiu ⋅ Haoyu Wang ⋅ Xiaoxin He ⋅ Xavier Bresson ⋅ Yinglong Xia ⋅ Chonglin Sun ⋅ Pan Li
Abstract
Existing "needle-in-a-haystack" (NIAH) benchmarks for long-context LLM evaluation often overlook context engineering, pairing the target document with random distractors rather than the biased outputs of real retrieval systems. We present HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network that evaluates LLMs against ranked distractors from sparse, dense, hybrid, and graph-based retrievers. Experiments on 10 LLMs show significant performance degradation as context size increases. We find that distractor composition is crucial: semantically similar documents are more challenging than lexically similar ones. Graph-based reranking mitigates harmful distractors, improving LLM performance by up to 44%.