NeurIPS 2025 Data Hosting Guidelines

Instruction Guide: Datasets & Benchmarks Track Dataset Hosting Platforms

This webpage serves as an instruction guide for authors making a submission to the Datasets & Benchmarks Track CFP. At a high level, authors of datasets will be required to make their dataset available along with Croissant machine-readable metadata in order to streamline the review process and meet industry standards for high quality documentation, reproducibility, and accessibility. As a part of these new requirements, we are recommending use of preferred hosting platforms–Dataverse, Kaggle, Hugging Face, and OpenML–which will make compliance simple for authors and reviewers. For more detailed information regarding the rationale for submission requirements new in 2025, please read our blog post.

The dataset hosting process as part of submitting to the Datasets & Benchmarks Track involves:

Choosing among 4 options to host your dataset: Harvard Dataverse, Kaggle, Hugging Face, and OpenML
Using platform tooling to download the automatically generated Croissant file
Including a URL to your dataset and uploading the generated Croissant file in Open Review
If your submission is accepted: making your dataset public by the camera ready deadline

Choose a Preferred Hosting Platforms

Harvard Dataverse, Kaggle, Hugging Face, and OpenML platforms are the preferred hosting platforms for datasets. These platforms automatically generate a Croissant file and allow for programmatic metadata verification and dataset assessment which will streamline and standardize the review process. When you make your dataset accessible via one of these platforms, making a submission will be as simple as providing a URL to the dataset and uploading a generated Croissant file.

The table below outlines key platform features to help authors choose where to host their dataset. Authors may make their dataset accessible via more than one platform at any time.

	Automatically generated Croissant file	Client libraries: Croissant download, data download, data loader	Hosting restrictions	Private preview URL access	Credentialized (gated) access	Paper Linking	DOIs
Harvard Dataverse	✅	✅✅✅	1TB per dataset (2.5GB per file) Any file types	✅	✅	✅	✅
Kaggle	✅	✅✅✅	200GB per dataset Any file types	✅	❌	❌	✅
Hugging Face	✅	✅✅✅	300GB per dataset public Any file types	❌	✅	✅	✅
OpenML	✅	✅✅✅	200GB per dataset Any file types	❌	❌	✅	❌

Authors are responsible for reviewing and complying with the Terms of Service of the platform(s) they choose to use.

Alternatives

Self-hosting your Dataset and Other Data Storage Platforms

If you choose NOT to release your dataset via one of these preferred platforms, you can self-host the data or use other platforms, but you will still be required to make your dataset accessible to reviewers via URL (e.g., to a GitHub repo, public cloud bucket, Zenodo, etc.) and manually generate and upload a Croissant file representation of your dataset as part of the Open Review submission process.

Generating a Croissant File

This Python tutorial demonstrates how to generate a Croissant file. Find more documentation on the Croissant GitHub repository.

You can also try the Croissant editor or run it locally (GitHub).

How to Publish on Preferred Hosting Platforms

This section provides specific guidance and documentation on how to make your dataset and its Croissant metadata file accessible via the preferred hosting platforms: Harvard Dataverse, Kaggle, Hugging Face, and OpenML.

	How to upload	How to download (files, Croissant)	How to get help, e.g., to request additional storage quota
Harvard Dataverse	Upload a Dataset via UI (after login) or CLI) Requirements An email-verified Harvard Dataverse user account Publicly shared (or “Link Sharing” turned on in the “Edit” menu to generate a Preview URL for a private dataset) at time of submission to D&B track Platform restrictions Documentation 1TB per dataset (2.5GB per file)	Download files Click “Access Dataset” and choose an option Download Croissant Click “Metadata”, Click “Export Metadata”, and select “Croissant” Via Python client	Contact support@dataverse.harvard.edu
Kaggle	Upload a Dataset via UI (after login) or Python client Requirements A phone-verified Kaggle user account Publicly shared (or “Link Sharing” turned on in “Settings” to generate a Preview URL for a private dataset) at time of submission to D&B track (Optional) An approved Organization profile to host the data under if preferred, e.g., your research lab Platform restrictions Documentation 200GB per dataset	Download files Click “Download” and choose “Download dataset as zip” Via kagglehub Python client Download Croissant Click “Download” and choose “Export metadata as Croissant” Click “Code”, select “Load via mlcroissant”, and copy Python code	Contact kaggle-datasets@kaggle.com
Hugging Face	Upload a Dataset via UI (after login) or Python client Requirements A Hugging Face user account Publicly shared at time of submission to D&B track Must be a format listed here in order to generate a Croissant file (Optional) An Organization profile to host the data under if preferred, e.g., your research lab Platform restrictions Documentation 300GB per dataset	Download files Click “API” and copy curl code Download Croissant Click “Croissant” and choose “Download Croissant metadata” Via Python client	Contact datasets@huggingface.co
OpenML	Upload Dataset via UI (after login) or Python client (recommended) Requirements An email-verified OpenML user account Publicly shared at time of submission to D&B track Platform restrictions Documentation 200GB per dataset	Download files Click “Download” Via Python client Download Croissant In the web UI, click “Croissant”.	Contact openmlhq@openml.org

What to Include in Your Submission

When you submit to Open Review, you will be required to provide:

A URL to your dataset accessible to reviewers
A Croissant metadata file
A confirmation that you verified the validity of your Croissant file.

Please use this online tool to verify your Croissant file. This tool will check whether it is valid, sufficiently complete, and whether the data can be automatically accessed and loaded (the latter may not be possible for all datasets). The same information will be made available to reviewers.

You may can also run the same validation inside your development environment by installing the following (experimental) MCP endpoint. This tool is not officially developed or maintained by NeurIPS D&B. Please use at your own discretion.

After submissions are closed, if your Croissant file is invalid or if your data is not accessible, your submission may be desk-rejected. Otherwise, reviewers will commence review of your submission paper, dataset, and metadata.

If your dataset is accepted, you will be required to make it public by the camera ready deadline. Failure to do so may result in removal from the conference and proceedings.

FAQ

Please see the Datasets and Benchmark track FAQs.