NeurIPS 2026 Evaluations & Datasets Track Call for Papers
The NeurIPS Evaluations & Datasets (E&D) Track invites submissions that advance the science and practice of evaluation in AI/ML, including the development and use of datasets and other evaluation resources.
Formerly the Datasets & Benchmarks Track, the E&D Track reflects a shift and broadening in scope: evaluation becomes an object of scientific study in its own right. Scientific debates increasingly hinge on evaluation - what is measured, under what assumptions, and how results are interpreted. We define evaluation as the full set of processes, tools, datasets, benchmarks, and practices used to test, stress-test, audit, compare, and interpret AI/ML systems across their lifecycle.
Datasets remain central to the track, both as components of evaluations and as resources used across the AI/ML lifecycle (training, fine-tuning, testing, auditing). However, dataset submissions should also clarify how the data is meant to be used in evaluative practice, rather than presenting the dataset as an endpoint in itself. Submissions are expected to clearly articulate the evaluative role their contribution plays: what claims it supports, under what assumptions, and what limitations apply. Authors are strongly encouraged to review the accompanying blog post [LINK to ED blog post] for a more detailed explanation of the revised scope.
The Call for Papers for the NeurIPS 2026 Evaluations & Datasets Track follows the Call for Papers of the NeurIPS 2026 Main Track. Accepted papers will be published in the NeurIPS proceedings and presented at the conference alongside main track papers. We aim for a review process as stringent as that of the main conference track, while allowing for the track-specific guidelines described below. For details on formatting, code of conduct, ethics review, and other submission-related topics, please refer to the NeurIPS 2026 Main Track Call for Papers.
Scope
The E&D Track welcomes contributions that advance the science of AI evaluation. These include, but are not limited to, submissions that:
- Analyze strengths, limitations, or failure modes of existing benchmarks or evaluation practices
- Study benchmark saturation or overfitting and their impact on scientific conclusions
- Compare evaluation designs and demonstrate how different assumptions lead to different results or conclusions
- Provide rigorous reproduction, auditing, and stress-testing of prior evaluations
- Develop documentation methodologies (e.g., Data Cards, Model Cards, evaluation cards) that improve how evaluative claims are made, interpreted, and compared
- Refine existing evaluation setups
- Propose new evaluation protocols, practices, or methodologies
- Conduct human- or interaction-centered evaluations (e.g., user studies, red-teaming)
- Introduce datasets and clearly explain their scope, assumptions, limitations, and how they are intended to support or shape evaluative claims across the AI/ML lifecycle
- Contribute tools, analyses, or frameworks that improve how evaluative claims are constructed or interpreted
- Present negative results, critical analyses, and use-case-inspired evaluations
Moreover, data-centric and benchmarking submissions historically welcomed by the track remain fully in scope. These include, but are not limited to:
- New datasets and dataset collections
- Data generators and reinforcement learning environments
- Data-centric AI methods and tools
- Advanced data collection and curation practices
- Responsible dataset development frameworks
- Audits of existing datasets
- Benchmarks on new or existing datasets, benchmarking tools, and methodologies
- Systematic analyses of systems on novel datasets
- In-depth analyses of machine learning challenges and competitions (by organisers and/or participants) that yield important new insights
- Competition papers from prior NeurIPS competitions
Submissions need not introduce a new model or outperform prior work, and they may introduce a variety of artifacts (e.g., datasets, protocols, tools, documentation practices). What matters is how the contribution supports or advances meaningful evaluation: authors should articulate how their work changes, strengthens, or enables the evaluation of AI/ML systems.
For authors wondering how the E&D Track relates to the Main Track themes (e.g., General, Theory, Use-case Inspired, Concept & Feasibility, or Negative Results), we have provided a dedicated Track Boundaries guide (see Key Links section below), including a decision tree to help identify the most appropriate track based on the paper’s primary contribution. We strongly encourage authors to consult this guide before submission.
Key Dates
The dates are identical to those of the Main Track:
- Abstract submission deadline: May 4, 2026 (AoE). All authors must have an OpenReview profile when submitting.
- Full paper submission deadline (including all supplementary materials): May 6, 2026 (AoE)
- Author notification: September 24, 2026 (AoE)
Please note: The submission portal opens on April 5, 2026 (at the same time as the main track).
Key Links
- Submission link: OpenReview (separate from the Main Track portal)
- NeurIPS 2026 Main Track Call for Papers:
- Blog post on the E&D Track scope update: [LINK TO BLOG POST]
E&D Track-Specific Guidelines
The Evaluations and Datasets Track follows the NeurIPS Main Track Call for Papers in terms of requirements and timelines, with the following track-specific additions.
Review Policy
Unlike in previous years, the default review mode for the E&D Track is now double-blind, reflecting the track's broader scope beyond dataset submissions. Authors whose submission centers on datasets may still choose single-blind review, given the practical challenges of anonymizing data releases.
Note: NeurIPS does not tolerate any collusion whereby authors secretly cooperate with reviewers, ACs, or SACs to obtain favorable reviews.
Dataset and Code Submission
Datasets and code must be properly hosted, accessible, and clearly documented at the time of submission; these are essential requirements.
Dataset Hosting. New datasets should be hosted on one of the dedicated ML hosting sites (Dataverse, Kaggle, Hugging Face, or OpenML), or on a bespoke hosting site if the dataset requires it. See the guidelines for data hosting for details.
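For illustration only, the sketch below shows one possible way to place a dataset on one of the listed platforms (Hugging Face) using the open-source huggingface_hub client; the repository identifier and local folder path are placeholders, and any of the supported hosting options (or a bespoke host, where needed) is equally acceptable.

```python
# Illustrative sketch: uploading a dataset to one of the listed hosting platforms
# (here, the Hugging Face Hub) with the `huggingface_hub` client
# (pip install huggingface_hub). Repository id and folder path are placeholders.
from huggingface_hub import HfApi

api = HfApi()  # authenticate beforehand, e.g. via `huggingface-cli login`

# Create a dataset repository (no-op if it already exists).
api.create_repo(repo_id="your-org/your-dataset", repo_type="dataset", exist_ok=True)

# Upload the local dataset folder, including its documentation (e.g., a datasheet).
api.upload_folder(
    folder_path="path/to/dataset",
    repo_id="your-org/your-dataset",
    repo_type="dataset",
)
```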
Code. As the track now encompasses a broad range of contributions — including executable artifacts, empirical audits, negative results, methodological analyses, and formal treatments of metrics and evaluation design — our code policy is contribution-dependent. We strongly encourage all authors to release code whenever feasible to promote transparency and reproducibility. However, code release is required at submission when the primary contribution is a reusable executable artifact, such as a benchmark suite, evaluation environment, data generator, or software tool, whose functionality must be inspected in order to evaluate the scientific claims. For submissions whose contributions are analytical, empirical, conceptual, or methodological and do not introduce such artifacts, code release is not mandatory, provided that the paper includes sufficient detail to enable meaningful review of the claims.
Code should be made accessible in an executable format via a hosting platform (e.g., GitHub, Bitbucket). See the main track handbook for authors for details.
Accessibility. Datasets and code should be available and accessible to all reviewers, ACs, and SACs at the time of submission, without requiring a personal request to the PI. Code should be documented and executable. Non-compliance is grounds for desk rejection of the paper.
Metadata. Authors of datasets must use the Croissant machine-readable format to document their datasets and include the Croissant file with their paper submission in OpenReview. This year, we request that authors provide both core and Responsible AI (RAI) fields of the Croissant format.
- If your data is hosted on one of the preferred platforms (Kaggle, OpenML, Hugging Face, Dataverse), a Croissant metadata file is automatically generated for you, including core Croissant fields. You will need to add the Croissant RAI fields. We will provide tools to facilitate this process.
- If you host your data elsewhere, you must generate the Croissant metadata file yourself.
- You can verify the validity of your Croissant file via this online tool; a minimal programmatic check is also sketched below.
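In addition to the online validator, authors who prefer a local check can load their Croissant file with the open-source mlcroissant Python library from the Croissant project. The sketch below is illustrative only; the file name and record-set identifier are placeholders that must be replaced with those of your own dataset.

```python
# Illustrative sanity check of a Croissant metadata file, assuming the open-source
# `mlcroissant` library (pip install mlcroissant). File and record-set names are placeholders.
import mlcroissant as mlc

# Load the Croissant JSON-LD file you plan to attach to your OpenReview submission.
dataset = mlc.Dataset(jsonld="croissant.json")

# Inspect declared core metadata fields such as the dataset name and description.
print(dataset.metadata.name)
print(dataset.metadata.description)

# Resolve a few records from a record set to confirm that the file references and
# field definitions in the metadata are consistent with the hosted data.
for i, record in enumerate(dataset.records(record_set="default")):
    print(record)
    if i >= 2:
        break
```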
Camera-ready. All accepted papers must have their code and datasets documented and publicly available by the camera-ready deadline.
FAQs
Please see the Evaluations and Datasets Track FAQs.
Contact
For questions or additional information, contact the E&D Track chairs at evaluationsdatasets@neurips.cc.