NeurIPS 2026 Evaluations & Datasets Track Call for Papers
The NeurIPS Evaluations & Datasets (E&D) Track invites submissions that advance the science and practice of evaluation in AI/ML, including the development and use of datasets and other evaluation resources.
Formerly the Datasets & Benchmarks Track, the E&D Track reflects a shift and broadening in scope: evaluation becomes an object of scientific study in its own right. Scientific debates increasingly hinge on evaluation - what is measured, under what assumptions, and how results are interpreted. We define evaluation as the full set of processes, tools, datasets, benchmarks, and practices used to test, stress-test, audit, compare, and interpret AI/ML systems across their lifecycle.
Datasets remain central to the track, both as components of evaluations and as resources used across the AI/ML lifecycle (training, fine-tuning, testing, auditing). However, dataset submissions should also clarify how the data is meant to be used in evaluative practice, rather than presenting the dataset as an endpoint in itself. Submissions are expected to clearly articulate the evaluative role their contribution plays: what claims it supports, under what assumptions, and what limitations apply. Authors are strongly encouraged to review the accompanying blog post [LINK to ED blog post] for a more detailed explanation of the revised scope.
The Call for Papers for the NeurIPS 2026 Evaluations & Datasets Track follows the Call for Papers of the NeurIPS 2026 Main Track. Accepted papers will be published in the NeurIPS proceedings and presented at the conference alongside main track papers. We aim for a review process as stringent as that of the main conference track, while allowing for the track-specific guidelines described below. For details on formatting, code of conduct, ethics review, and other submission-related topics, please refer to the NeurIPS 2026 Main Track Call for Papers.
Scope
The E&D Track welcomes contributions that advance the science of AI evaluation. These include, but are not limited to, submissions that:
- Analyze strengths, limitations, or failure modes of existing benchmarks or evaluation practices
- Study benchmark saturation or overfitting and their impact on scientific conclusions
- Compare evaluation designs and demonstrate how different assumptions lead to different results or conclusions
- Provide rigorous reproduction, auditing, and stress-testing of prior evaluations
- Develop documentation methodologies (e.g., Data Cards, Model Cards, evaluation cards) that improve how evaluative claims are made, interpreted, and compared
- Refine existing evaluation setups
- Propose new evaluation protocols, practices, or methodologies
- Conduct human- or interaction-centered evaluations (e.g., user studies, red-teaming)
- Introduce datasets and clearly explain their scope, assumptions, limitations, and how they are intended to support or shape evaluative claims across the AI/ML lifecycle
- Contribute tools, analyses, or frameworks that improve how evaluative claims are constructed or interpreted
- Present negative results, critical analyses, and use-case-inspired evaluations
Moreover, data-centric and benchmarking submissions historically welcomed by the track remain fully in scope. These include, but are not limited to:
- New datasets and dataset collections
- Data generators and reinforcement learning environments
- Data-centric AI methods and tools
- Advanced data collection and curation practices
- Responsible dataset development frameworks
- Audits of existing datasets
- Benchmarks on new or existing datasets, benchmarking tools, and methodologies
- Systematic analyses of systems on novel datasets
- In-depth analyses of machine learning challenges and competitions (by organisers and/or participants) that yield important new insights
- Competition papers from prior NeurIPS competitions
Submissions need not introduce a new model or outperform prior work, and they may introduce a variety of artifacts (e.g., datasets, protocols, tools, documentation practices). What matters is how the contribution supports or advances meaningful evaluation: authors should articulate how their work changes, strengthens, or enables the evaluation of AI/ML systems.
For authors wondering how the E&D Track relates to the Main Track themes (e.g., General, Theory, Use-case Inspired, Concept & Feasibility, or Negative Results), we have provided a dedicated Track Boundaries guide (see Key Links section below), including a decision tree to help identify the most appropriate track based on the paper’s primary contribution. We strongly encourage authors to consult this guide before submission.
Key Dates
The dates are identical to those of the Main Track:
- Abstract submission deadline: May 4, 2026 (AoE). All authors must have an OpenReview profile when submitting.
- Full paper submission deadline (including all supplementary materials): May 6, 2026 (AoE)
- Author notification: September 24, 2026 (AoE)
Please note: The submission portal opens on April 5, 2026 (at the same time as the main track).
Key Links
- Submission link: OpenReview (separate from the Main Track portal)
- NeurIPS 2026 Main Track Call for Papers:
- Blog post on the E&D Track scope update: [LINK TO BLOG POST]
E&D Track-Specific Guidelines
The Evaluations and Datasets Track follows the NeurIPS Main Track Call for Papers in terms of requirements and timelines, with the following track-specific additions.
Review Policy
Unlike in previous years, the default review mode for the E&D Track is now double-blind, reflecting the track's broader scope beyond dataset submissions. Authors whose submission centers on datasets may still choose single-blind review, given the practical challenges of anonymizing data releases.
Note: NeurIPS does not tolerate any collusion whereby authors secretly cooperate with reviewers, ACs, or SACs to obtain favorable reviews.
Dataset and Code Submission
Datasets and code must be properly hosted, accessible, and clearly documented at the time of submission; these are essential requirements.
Dataset Hosting. New datasets should be hosted on one of the dedicated ML hosting sites (Dataverse, Kaggle, Hugging Face, or OpenML), or on a bespoke hosting site if the dataset requires it. See the guidelines for data hosting for details.
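For illustration only, the sketch below shows one possible way to place a dataset on one of the listed platforms (Hugging Face) using the open-source huggingface_hub client; the repository identifier and local folder path are placeholders, and any of the supported hosting options (or a bespoke host, where needed) is equally acceptable.

```python
# Illustrative sketch: uploading a dataset to one of the listed hosting platforms
# (here, the Hugging Face Hub) with the `huggingface_hub` client
# (pip install huggingface_hub). Repository id and folder path are placeholders.
from huggingface_hub import HfApi

api = HfApi()  # authenticate beforehand, e.g. via `huggingface-cli login`

# Create a dataset repository (no-op if it already exists).
api.create_repo(repo_id="your-org/your-dataset", repo_type="dataset", exist_ok=True)

# Upload the local dataset folder, including its documentation (e.g., a datasheet).
api.upload_folder(
    folder_path="path/to/dataset",
    repo_id="your-org/your-dataset",
    repo_type="dataset",
)
```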
Code. As the track now encompasses a broad range of contributions — including executable artifacts, empirical audits, negative results, methodological analyses, and formal treatments of metrics and evaluation design — our code policy is contribution-dependent. We strongly encourage all authors to release code whenever feasible to promote transparency and reproducibility. However, code release is required at submission when the primary contribution is a reusable executable artifact, such as a benchmark suite, evaluation environment, data generator, or software tool, whose functionality must be inspected in order to evaluate the scientific claims. For submissions whose contributions are analytical, empirical, conceptual, or methodological and do not introduce such artifacts, code release is not mandatory, provided that the paper includes sufficient detail to enable meaningful review of the claims.
Code should be made accessible in an executable format via a hosting platform (e.g., GitHub, Bitbucket). See the main track handbook for authors for details.
Accessibility. Datasets and code should be available and accessible to all reviewers, ACs, and SACs at the time of submission, without requiring a personal request to the PI. Code should be documented and executable. Non-compliance is grounds for desk rejection of the paper.
Metadata. Authors of datasets must use the Croissant machine-readable format to document their datasets and include the Croissant file with their paper submission in OpenReview. This year, we request that authors provide both core and Responsible AI (RAI) fields of the Croissant format.
- If your data is hosted on one of the preferred platforms (Kaggle, OpenML, Hugging Face, Dataverse), a Croissant metadata file is automatically generated for you, including core Croissant fields. You will need to add the Croissant RAI fields. We will provide tools to facilitate this process.
- If you host your data elsewhere, you must generate the Croissant metadata file yourself.
- You can verify the validity of your Croissant file via this online tool; a minimal programmatic check is also sketched below.
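In addition to the online validator, authors who prefer a local check can load their Croissant file with the open-source mlcroissant Python library from the Croissant project. The sketch below is illustrative only; the file name and record-set identifier are placeholders that must be replaced with those of your own dataset.

```python
# Illustrative sanity check of a Croissant metadata file, assuming the open-source
# `mlcroissant` library (pip install mlcroissant). File and record-set names are placeholders.
import mlcroissant as mlc

# Load the Croissant JSON-LD file you plan to attach to your OpenReview submission.
dataset = mlc.Dataset(jsonld="croissant.json")

# Inspect declared core metadata fields such as the dataset name and description.
print(dataset.metadata.name)
print(dataset.metadata.description)

# Resolve a few records from a record set to confirm that the file references and
# field definitions in the metadata are consistent with the hosted data.
for i, record in enumerate(dataset.records(record_set="default")):
    print(record)
    if i >= 2:
        break
```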
Camera-ready. All accepted papers must have their code and datasets documented and publicly available by the camera-ready deadline.
FAQs
Please see the Evaluations and Datasets Track FAQs.
Contact
For questions or additional information, contact the E&D Track chairs at evaluationsdatasets@neurips.cc.