NeurIPS 2026 Evaluations & Datasets Hosting Guidelines
Please note: These guidelines will be updated on a rolling basis until April 5, 2026, when the OpenReview portal opens for submission. Authors should revisit this page regularly to avoid misunderstandings or missing important updates.
This webpage serves as an instruction guide for authors making a submission to the Evaluations and Datasets Track. Authors of datasets will be required to make their dataset available along with Croissant machine-readable metadata in order to streamline the review process and meet industry standards for high-quality documentation, reproducibility, and accessibility. We recommend using one of the preferred hosting platforms (Harvard Dataverse, Kaggle, Hugging Face, and OpenML), which make compliance simple for authors and reviewers.
The dataset hosting process as part of submitting to the Evaluations and Datasets Track involves:
- Choosing among four preferred platforms to host your dataset: Harvard Dataverse, Kaggle, Hugging Face, or OpenML (or setting up a bespoke data hosting solution)
- Using platform tooling to download the automatically generated Croissant file (or creating it manually). This file will contain only the core fields; the RAI metadata must be filled in by you (see step 3).
- Completing the Croissant file with Responsible AI (RAI) metadata. We will provide additional tooling to facilitate this.
- Validating your completed Croissant file using the online Croissant validator tool (see below)
- Including a URL to your dataset and uploading the generated Croissant file in OpenReview
- If your submission is accepted: making your dataset public by the camera-ready deadline
Choose a Preferred Hosting Platform
Harvard Dataverse, Kaggle, Hugging Face, and OpenML are the preferred hosting platforms for datasets. These platforms automatically generate a Croissant file and allow us to perform programmatic metadata verification and dataset assessment, which will streamline and standardize the review process. When you make your dataset accessible via one of these platforms, making a submission is as simple as providing a URL to the dataset and uploading the generated Croissant file.
The table below outlines key platform features to help authors choose where to host their dataset. Authors may make their dataset accessible via more than one platform at any time.
| Platform | Automatically generated Croissant file | Client libraries (Croissant download, data download, data loader) | Hosting restrictions | Private preview URL access | Credentialized (gated) access | Paper linking | DOIs |
|---|---|---|---|---|---|---|---|
| Harvard Dataverse | ✅ | ✅✅✅ | 1 TB per dataset (2.5 GB per file); any file types | ✅ | ✅ | ✅ | ✅ |
| Kaggle | ✅ | ✅✅✅ | 200 GB per dataset; any file types | ✅ | ❌ | ❌ | ✅ |
| Hugging Face | ✅ | ✅✅✅ | 300 GB per public dataset; any file types | ❌ | ✅ | ✅ | ✅ |
| OpenML | ✅ | ✅✅✅ | 200 GB per dataset; any file types | ❌ | ❌ | ✅ | ❌ |
Authors are responsible for reviewing and complying with the Terms of Service of the platform(s) they choose to use.
Alternative hosting
Self-hosting and Other Data Storage Platforms
If you choose NOT to release your dataset via one of these preferred platforms, you may self-host the data or use another platform. However, you will still be required to make your dataset accessible to reviewers via a URL (e.g., to a GitHub repo, public cloud bucket, or Zenodo) and to manually generate and upload a Croissant file representation of your dataset as part of the OpenReview submission process.
Generating a Croissant File via Python
This Python tutorial demonstrates how to generate a Croissant file. Find more documentation on the Croissant GitHub repository. You can also try the Croissant editor or run it locally (GitHub).
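For orientation, a minimal hand-built Croissant file can be sketched with nothing but the standard library. The example below is illustrative only: the dataset name, URLs, and file entry are placeholders, and the official mlcroissant library or the Croissant editor will produce a richer, fully conformant file.

```python
import json

# A minimal Croissant JSON-LD skeleton. All names and URLs below are
# placeholders; replace them with your dataset's actual metadata.
croissant = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "rai": "http://mlcommons.org/croissant/RAI/",
    },
    "@type": "Dataset",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    "name": "my-example-dataset",                       # placeholder
    "description": "A short description of the dataset.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/my-example-dataset",    # placeholder
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "data.csv",
            "contentUrl": "https://example.org/data.csv",  # placeholder
            "encodingFormat": "text/csv",
        }
    ],
}

# Write the metadata to disk so it can be validated and uploaded.
with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)
```

Run the resulting file through the Croissant validator (see below) before relying on it; the validator checks conformance rules that a hand-written skeleton like this can easily miss.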
Responsible AI metadata
It is important that datasets are created and used responsibly. Authors must provide a minimal set of Responsible AI (RAI) metadata, as described below. This information must be clearly identifiable in the submission, and reviewers will check for its presence.
The RAI metadata must be included in the Croissant file, which serves as the canonical, long-term record accompanying the dataset. Wherever possible, authors should provide the full RAI information directly in the corresponding Croissant fields.
If certain information cannot be fully expressed in Croissant (e.g., due to formatting constraints such as LaTeX equations), authors may instead use the Croissant fields to clearly point to the relevant sections of the paper (main text or appendix).
Authors may also include the RAI information in other parts of the paper (e.g., main paper or appendix) for clarity or elaboration. In all cases, we ask that the authors at least touch on these topics at a high level in the main part of the paper, even if only to guide the reader to where more detailed information can be found (e.g. appendix or Croissant).
All submissions that include a dataset as a contribution must provide answers to these RAI fields, at least in the Croissant file. Failure to do so may constitute grounds for rejection.
Minimal RAI data overview
The table below contains a high-level overview of the minimal RAI data. We will provide a more detailed description with examples soon.
| RAI field | Description |
|---|---|
| Data limitations | Known constraints on the dataset's applicability: distributional gaps, underrepresented populations, data quality issues, or domain restrictions. Also list any uses for which this dataset is explicitly not recommended. |
| Data biases | Any known or suspected biases in the data, including selection bias, label bias, or demographic skew. Describe which population groups or scenarios may be over- or under-represented, and how this may affect model behaviour. |
| Personal or sensitive information | Personal or sensitive information such as gender, socio-economic status, geography, language, age, culture, experience or seniority, health or medical data, or political or religious beliefs. |
| Data use cases | Construct validity: state what real-world concept the data is intended to measure or represent, and provide evidence that the data reliably captures that construct. List the use cases for which validity has been established (e.g., safety evaluation, fairness auditing, fine-tuning), and those for which it has not. |
| Social impact | The potential positive and negative societal effects of using this dataset, including risks of misuse, fairness implications for specific communities, and any mitigations put in place. |
| Synthetic data | A boolean indicating the presence of synthetic data. If present, authors are encouraged to describe the synthetic data generation process using the data collection and annotation fields. |
| Source datasets (`prov:wasDerivedFrom`) | The URI(s) of the dataset(s) from which the present dataset is derived; there can be multiple URIs from different data sources. For synthetic data, point to the synthetic data seeds used. |
| Provenance activities (`prov:wasGeneratedBy`) | Preprocessing: cleaning or filtering steps applied to the data prior to use, to make the data pipeline reproducible. Data collection: collection period, geographic scope, and any instruments or protocols used; in the case of synthetic data, document the seeds or prompts used or the synthetic data generator used. Data annotation: labeling schema, instructions provided to annotators, quality control measures, and inter-annotator agreement scores where available. Describe the human teams, synthetic agents, and platforms used to collect, annotate, or preprocess the data. |
Including RAI metadata in the Croissant file
The hosting platforms above currently only populate the "core" Croissant fields, so you must add the RAI fields yourself. There are several options to do this:
Croissant RAI tool
We will provide an online tool where you can easily augment your existing Croissant file with the minimal RAI metadata. We will share this soon and update these guidelines accordingly.
Manually add the RAI fields
Adapt the Croissant file by adding the fields specified in the table above (e.g., rai:dataLimitations). We will provide a document with clear examples on how to do this. Afterwards, use the Croissant validator to check whether it was done correctly.
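As an illustration of the manual approach, the sketch below augments a Croissant metadata object with the minimal RAI fields using only the standard library. The inline metadata dict stands in for a platform-generated file, and both the field names (taken from the table above) and the values are placeholders pending the forthcoming examples document; always run the result through the Croissant validator afterwards.

```python
import json

# Stand-in for a platform-generated Croissant file with only core fields.
metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "my-example-dataset",
}

# Declare the RAI vocabulary prefix in the JSON-LD context, then add the
# minimal RAI fields. All values below are placeholders to be replaced
# with your own answers.
metadata["@context"]["rai"] = "http://mlcommons.org/croissant/RAI/"
metadata.update({
    "rai:dataLimitations": "English-language text only; not suitable for clinical use.",
    "rai:dataBiases": "Sources skew toward North American news outlets.",
    "rai:personalSensitiveInformation": "None identified.",
    "rai:dataUseCases": "Benchmarking summarization; validity not established for fairness auditing.",
    "rai:socialImpact": "Low risk; see the paper appendix for discussion.",
})

with open("croissant_with_rai.json", "w") as f:
    json.dump(metadata, f, indent=2)
```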
How to Publish on Preferred Hosting Platforms
This section provides specific guidance and documentation on how to make your dataset and its Croissant metadata file accessible via the preferred hosting platforms: Harvard Dataverse, Kaggle, Hugging Face, and OpenML.
| Platform | How to upload | How to download (files, Croissant) | How to get help (e.g., to request additional storage quota) |
|---|---|---|---|
| Harvard Dataverse | Upload a dataset via UI (after login) or CLI. See requirements and platform restrictions. | Download files; download Croissant | Contact support@dataverse.harvard.edu |
| Kaggle | Upload a dataset via UI (after login) or Python client. See requirements and platform restrictions. | Download files; download Croissant | Contact kaggle-datasets@kaggle.com |
| Hugging Face | Upload a dataset via UI (after login) or Python client. See requirements and platform restrictions. | Download files; download Croissant | Contact datasets@huggingface.co |
| OpenML | Upload a dataset via UI (after login) or Python client (recommended). See requirements and platform restrictions. | Download files; download Croissant | Contact openmlhq@openml.org |
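As one concrete example of programmatic Croissant download, Hugging Face exposes auto-generated Croissant metadata through its datasets API. The endpoint path below reflects the API layout at the time of writing; check the platform documentation if it has changed.

```python
# Build the Croissant metadata URL for a Hugging Face dataset repository.
# The /api/datasets/<repo_id>/croissant endpoint serves the JSON-LD file.
def croissant_url(repo_id: str) -> str:
    """Return the Croissant metadata URL for a Hugging Face dataset repo."""
    return f"https://huggingface.co/api/datasets/{repo_id}/croissant"

print(croissant_url("mnist"))

# The file itself can then be fetched with any HTTP client, e.g.:
#   import urllib.request
#   jsonld = urllib.request.urlopen(croissant_url("mnist")).read()
```

The other platforms offer Croissant downloads through their own UIs and client libraries, as linked in the table above.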
What to Include in Your Submission
When you submit to OpenReview, you will be required to provide:
- A URL to your dataset accessible to reviewers
- A Croissant metadata file
- A confirmation that you verified the validity of your Croissant file.
Please use this online tool to verify your Croissant file. This tool will check whether it is valid, sufficiently complete, and whether the data can be automatically accessed and loaded (the latter may not be possible for all datasets). The same information will be made available to reviewers.
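The online validator is the authoritative check. As a quick local sanity check before submission, you can verify that the expected top-level and RAI keys are present. The key lists in the sketch below are assumptions based on these guidelines, not the validator's full rule set.

```python
# Quick local completeness check before running the online Croissant
# validator. CORE_KEYS and RAI_KEYS are assumptions drawn from these
# guidelines, not the validator's authoritative rules.
CORE_KEYS = {"@context", "@type", "name", "description", "license", "url"}
RAI_KEYS = {
    "rai:dataLimitations",
    "rai:dataBiases",
    "rai:personalSensitiveInformation",
    "rai:dataUseCases",
    "rai:socialImpact",
}

def missing_keys(metadata: dict) -> set:
    """Return the expected keys absent from the Croissant metadata."""
    return (CORE_KEYS | RAI_KEYS) - metadata.keys()

# Example: a metadata dict with only a few core fields filled in.
example = {"@context": {}, "@type": "Dataset", "name": "demo"}
print(sorted(missing_keys(example)))
```

A passing local check does not replace the online validator, which also verifies that the data can be accessed and loaded.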
After submissions are closed, if your Croissant file is invalid or your data is not accessible, your submission may be desk-rejected. Otherwise, reviewers will proceed to review your paper, dataset, and metadata.
If your dataset is accepted, you will be required to make it public by the camera-ready deadline. Failure to do so may result in removal from the conference and proceedings.
FAQ
Please check the Evaluations and Datasets Track FAQs