Network Dynamics Reasoning: A Novel Benchmark for Evaluating Multi-Step Inference in Large Language Models
Abstract
We introduce a novel benchmark for evaluating large language models' ability to reason about network dynamics and multi-step system evolution. The benchmark asks models to predict the final state of threshold-based adoption processes unfolding on social networks, a task that demands a precise numerical answer reached through extended temporal reasoning. We evaluate five state-of-the-art models spanning different architectures and API providers, revealing substantial performance gaps and pronounced differences in reasoning capability across model families. Our key finding is that Google's Gemini models substantially outperform Meta's Llama and Google's Gemma models: Gemini 1.5 Pro achieves 55\% accuracy versus 10\% for Llama 3.3 70B, despite the latter's larger parameter count. By using contamination-resistant synthetic scenarios that require precise numerical reasoning over multi-step temporal dynamics, the benchmark addresses critical gaps in current LLM evaluation, probing capabilities essential for AI systems operating in complex real-world environments.
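To make the task concrete, the sketch below simulates one plausible instance of the kind of process the abstract describes. The abstract does not fix the exact update rule, so this code assumes a standard fractional linear-threshold model (a node adopts once the fraction of its adopted neighbors meets its threshold); the function and variable names (simulate_adoption, adjacency, thresholds, seeds) are illustrative, not the paper's actual harness.

```python
# Illustrative sketch only: assumes a synchronous fractional linear-threshold
# model, one common formalization of "threshold-based adoption processes".
def simulate_adoption(adjacency, thresholds, seeds, max_steps=100):
    """Run synchronous threshold dynamics; return the final adopter set."""
    adopted = set(seeds)
    for _ in range(max_steps):
        newly_adopted = set()
        for node, neighbors in adjacency.items():
            if node in adopted or not neighbors:
                continue
            # Fraction of this node's neighbors that have already adopted.
            frac = sum(n in adopted for n in neighbors) / len(neighbors)
            if frac >= thresholds[node]:
                newly_adopted.add(node)
        if not newly_adopted:  # fixed point reached: dynamics have converged
            break
        adopted |= newly_adopted
    return adopted

# Hypothetical 5-node instance with two seed adopters; the benchmark would
# ask a model to predict the size of the final adopter set.
adjacency = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
thresholds = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5}
print(len(simulate_adoption(adjacency, thresholds, seeds={0, 1})))  # -> 5
```

In this toy instance the cascade takes two steps to saturate the network, so a correct answer requires tracking intermediate states rather than pattern-matching on the initial configuration, which is precisely the multi-step temporal reasoning the benchmark targets.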