Haystack Engineering: Context Engineering Meets the Long-Context Challenge in Large Language Models
Mufei Li · Dongqi Fu · Limei Wang · Si Zhang · Hanqing Zeng · Kaan Sancak · Ruizhong Qiu · Haoyu Wang · Xiaoxin He · Xavier Bresson · Yinglong Xia · Chonglin Sun · Pan Li
Abstract
Existing "needle-in-a-haystack" (NIAH) benchmarks for long-context LLM evaluation often overlook "context engineering", using random distractors rather than biased outputs of retrieval systems. We present HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network, which evaluates LLMs against ranked distractors from sparse, dense, hybrid, and graph-based retrievers. Experiments on 10 LLMs show significant performance degradation as context size increases. We find that distractor composition is crucial: semantically similar documents are more challenging than lexically similar ones. Graph-based reranking mitigates harmful distractors, improving the LLM performance by up to 44%.