Instruction Guide: Datasets & Benchmarks Track Dataset Hosting Platforms
This page is an instruction guide for authors submitting to the Datasets & Benchmarks Track CFP. At a high level, dataset authors will be required to make their dataset available along with machine-readable Croissant metadata, in order to streamline the review process and meet industry standards for high-quality documentation, reproducibility, and accessibility. As part of these new requirements, we recommend hosting on one of the preferred platforms (Harvard Dataverse, Kaggle, Hugging Face, or OpenML), which makes compliance simple for authors and reviewers. For more detail on the rationale for the submission requirements new in 2025, please read our blog post.
The dataset hosting process as part of submitting to the Datasets & Benchmarks Track involves:
- Choosing among 4 options to host your dataset: Harvard Dataverse, Kaggle, Hugging Face, and OpenML
- Using platform tooling to download the automatically generated Croissant file
- Including a URL to your dataset and uploading the generated Croissant file in OpenReview
- If your submission is accepted: making your dataset public by the camera ready deadline
Choose a Preferred Hosting Platform
Harvard Dataverse, Kaggle, Hugging Face, and OpenML are the preferred hosting platforms for datasets. These platforms automatically generate a Croissant file and allow for programmatic metadata verification and dataset assessment, which will streamline and standardize the review process. When you make your dataset accessible via one of these platforms, making a submission is as simple as providing a URL to the dataset and uploading the generated Croissant file.
The table below outlines key platform features to help authors choose where to host their dataset. Authors may make their dataset accessible via more than one platform at any time.
| Platform | Automatically generated Croissant file | Client libraries: Croissant download, data download, data loader | Hosting restrictions | Private preview URL access | Credentialized (gated) access | Paper linking | DOIs |
|---|---|---|---|---|---|---|---|
| Harvard Dataverse | ✅ | ✅✅✅ | 1TB per dataset (2.5GB per file); any file types | ✅ | ✅ | ✅ | ✅ |
| Kaggle | ✅ | ✅✅✅ | 200GB per dataset; any file types | ✅ | ❌ | ❌ | ✅ |
| Hugging Face | ✅ | ✅✅✅ | 300GB per dataset (public); any file types | ❌ | ✅ | ✅ | ✅ |
| OpenML | ✅ | ✅✅✅ | 200GB per dataset; any file types | ❌ | ❌ | ✅ | ❌ |
Authors are responsible for reviewing and complying with the Terms of Service of the platform(s) they choose to use.
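As a rough aid for reading the table above, the sketch below encodes its limits and access features and filters platforms against a dataset's needs. The `Platform` class and `candidate_platforms` function are hypothetical names introduced only for illustration, and treating 1TB as 1000GB is an assumption. Limits can change, so confirm the current values with each platform before relying on the result.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Platform:
    name: str
    max_dataset_gb: float          # per-dataset limit from the table
    max_file_gb: Optional[float]   # per-file limit, if the table lists one
    private_preview_url: bool
    gated_access: bool


# Values transcribed from the table above (1TB taken as 1000GB, an assumption).
# The Hugging Face figure is the limit the table gives for public datasets.
PLATFORMS = [
    Platform("Harvard Dataverse", 1000.0, 2.5, True, True),
    Platform("Kaggle", 200.0, None, True, False),
    Platform("Hugging Face", 300.0, None, False, True),
    Platform("OpenML", 200.0, None, False, False),
]


def candidate_platforms(
    total_gb: float,
    largest_file_gb: float,
    need_private_preview: bool = False,
    need_gated_access: bool = False,
) -> List[str]:
    """Return the platforms whose listed limits and features fit the dataset."""
    matches = []
    for p in PLATFORMS:
        if total_gb > p.max_dataset_gb:
            continue
        if p.max_file_gb is not None and largest_file_gb > p.max_file_gb:
            continue
        if need_private_preview and not p.private_preview_url:
            continue
        if need_gated_access and not p.gated_access:
            continue
        matches.append(p.name)
    return matches


if __name__ == "__main__":
    # A 250GB dataset whose largest file is 1GB.
    print(candidate_platforms(total_gb=250, largest_file_gb=1.0))
```

A platform that does not appear in the result may still be usable after contacting its support team about quota, so treat the output as a starting point only.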
Alternative: Self-hosting your Dataset and Other Data Storage Platforms
If you choose NOT to release your dataset via one of these preferred platforms, you can self-host the data or use other platforms. You will still be required to (1) make your dataset accessible to reviewers via URL (e.g., a GitHub repo, public cloud bucket, or Zenodo) and (2) manually generate a Croissant file representing your dataset and upload it as part of the OpenReview submission process. This Python tutorial demonstrates how to generate a Croissant file. Find more documentation on the Croissant GitHub repository.
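To give a sense of what a manually generated Croissant file contains, here is a minimal standard-library sketch for a dataset consisting of a single CSV file. It is not the code from the tutorial mentioned above. The function names `build_croissant` and `write_croissant` are hypothetical, the `@context` is deliberately abbreviated (an assumption made for readability), and the property names follow our reading of the Croissant 1.0 format. Generate or validate the real file with the official Croissant tooling before submitting.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict

# Abbreviated JSON-LD context (assumption: trimmed for readability; a real
# file should carry the full context from the Croissant specification).
CROISSANT_CONTEXT = {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
    "conformsTo": "dct:conformsTo",
    "column": "cr:column",
    "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
    "extract": "cr:extract",
    "field": "cr:field",
    "fileObject": "cr:fileObject",
    "recordSet": "cr:recordSet",
    "source": "cr:source",
}


def sha256_of(path: str) -> str:
    """Hash the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_croissant(
    name: str,
    description: str,
    dataset_url: str,
    license_url: str,
    csv_path: str,
    csv_url: str,
    columns: Dict[str, str],
) -> dict:
    """Describe one CSV file; `columns` maps column name -> data type."""
    file_id = Path(csv_path).name
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"records/{column}",
            "name": column,
            "dataType": data_type,
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column, data_type in columns.items()
    ]
    return {
        "@context": CROISSANT_CONTEXT,
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        "url": dataset_url,
        "license": license_url,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "name": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
                "sha256": sha256_of(csv_path),
            }
        ],
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "records",
                "name": "records",
                "field": fields,
            }
        ],
    }


def write_croissant(metadata: dict, out_path: str) -> None:
    """Write the metadata as pretty-printed JSON."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
```

For example, `build_croissant(..., columns={"id": "sc:Integer", "label": "sc:Text"})` followed by `write_croissant(metadata, "croissant.json")` would produce a file to upload. The URLs you pass in must be the ones reviewers can actually reach.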
How to Publish on Preferred Hosting Platforms
This section provides specific guidance and documentation on how to make your dataset and its Croissant metadata file accessible via the preferred hosting platforms: Harvard Dataverse, Kaggle, Hugging Face, and OpenML.
| Platform | How to upload | How to download (files, Croissant) | How to get help, e.g., to request additional storage quota |
|---|---|---|---|
| Harvard Dataverse | Upload a dataset via UI (after login) or CLI; Requirements; Platform restrictions | Download files; Download Croissant | Contact support@dataverse.harvard.edu |
| Kaggle | Upload a dataset via UI (after login) or Python client; Requirements; Platform restrictions | Download files; Download Croissant | Contact kaggle-datasets@kaggle.com |
| Hugging Face | Upload a dataset via UI (after login) or Python client; Requirements; Platform restrictions | Download files; Download Croissant | Contact datasets@huggingface.co |
| OpenML | Upload a dataset via UI (after login) or Python client (recommended); Requirements; Platform restrictions | Download files; Download Croissant | Contact openmlhq@openml.org |
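If you prefer to fetch the generated Croissant file programmatically rather than through a platform's UI, a sketch along the following lines may help. The URL patterns are assumptions based on our understanding of each platform's public endpoints at the time of writing, so verify them against the platform documentation linked in the table. OpenML is omitted because we do not assume a URL pattern for it here. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def croissant_url(platform: str, identifier: str) -> str:
    """Build a Croissant download URL (assumed patterns; verify before use).

    identifier:
      - "huggingface": repository id, e.g. "owner/dataset-name"
      - "kaggle":      "owner/dataset-slug"
      - "dataverse":   persistent id, e.g. "doi:10.7910/DVN/EXAMPLE"
    """
    if platform == "huggingface":
        return f"https://huggingface.co/api/datasets/{identifier}/croissant"
    if platform == "kaggle":
        return f"https://www.kaggle.com/datasets/{identifier}/croissant/download"
    if platform == "dataverse":
        query = urllib.parse.urlencode(
            {"exporter": "croissant", "persistentId": identifier}
        )
        return f"https://dataverse.harvard.edu/api/datasets/export?{query}"
    raise ValueError(f"no URL pattern assumed for platform: {platform!r}")


def download_croissant(url: str, out_path: str, timeout: float = 30.0) -> dict:
    """Fetch the Croissant JSON-LD and save it locally for upload."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        metadata = json.load(response)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return metadata


if __name__ == "__main__":
    # "owner/dataset-name" is a placeholder, not a real repository.
    print(croissant_url("huggingface", "owner/dataset-name"))
```

Private or gated datasets typically need an access token or a private preview URL, which this sketch does not handle.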
What to Include in Your Submission
When you submit to OpenReview, you will be required to provide:
- A URL to your dataset accessible to reviewers
- A Croissant metadata file
We will provide a tool to verify your dataset based on your Croissant URL. The tool will check whether your Croissant file is valid and sufficiently complete, and whether the data can be accessed and loaded. The same information will be made available to reviewers.
After submissions close, your submission may be desk-rejected if your Croissant file is invalid or your data is not accessible. Otherwise, reviewers will begin reviewing your submitted paper, dataset, and metadata.
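Before relying on the verification tool, you may want to run a quick local sanity check on your Croissant file. The sketch below is not the official tool and does not reproduce its checks. The function name `precheck_croissant` and the list of keys it looks for are assumptions about what a reasonably complete file contains, and it does not test whether the data can actually be downloaded or loaded.

```python
import json
from typing import List

# Hypothetical minimum set of top-level properties to look for.
EXPECTED_KEYS = (
    "@context",
    "@type",
    "name",
    "description",
    "url",
    "license",
    "distribution",
)


def precheck_croissant(metadata: dict) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for key in EXPECTED_KEYS:
        if not metadata.get(key):
            problems.append(f"missing or empty property: {key}")

    distribution = metadata.get("distribution") or []
    for index, item in enumerate(distribution):
        if not isinstance(item, dict):
            problems.append(f"distribution[{index}] is not an object")
            continue
        if item.get("@type") == "cr:FileObject":
            if not item.get("contentUrl"):
                problems.append(f"distribution[{index}] has no contentUrl")
            if not (item.get("sha256") or item.get("md5")):
                problems.append(f"distribution[{index}] has no checksum")

    if not metadata.get("recordSet"):
        problems.append("no recordSet: the data may not be loadable as records")
    return problems


def precheck_file(path: str) -> List[str]:
    """Load a Croissant JSON-LD file from disk and run the checks above."""
    try:
        with open(path, encoding="utf-8") as f:
            metadata = json.load(f)
    except (OSError, json.JSONDecodeError) as error:
        return [f"could not read Croissant file: {error}"]
    if not isinstance(metadata, dict):
        return ["top-level JSON value is not an object"]
    return precheck_croissant(metadata)
```

Calling `precheck_file("croissant.json")` and fixing anything it reports is a cheap first pass. A clean result here does not guarantee that the official verification will succeed.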
If your dataset is accepted, you will be required to make it public by the camera ready deadline. Failure to do so may result in removal from the conference and proceedings.
FAQ
I don’t want to make my dataset publicly accessible at the time of submission. What are my options?
Harvard Dataverse and Kaggle both offer private preview URLs. This means your dataset is only accessible to those with whom you share its special URL, e.g., reviewers.
Note that you will be required to make your dataset public by the camera ready deadline. Failure to do so may result in removal from the conference and proceedings.
Can I make changes to my dataset after I’ve made my submission to OpenReview?
You can make changes until the submission deadline. After the submission deadline, we will perform automated verification checks of your dataset to assist in streamlining and standardizing reviews.
If your dataset changes in a way that invalidates the original reviews at any time between the submission deadline and the camera ready deadline or publication of proceedings, we reserve the right to remove your submission from the conference or proceedings.
I’m experiencing problems with the platform I’m using to release my dataset. What should I do?
We have worked with the maintainers of the dataset hosting platforms to identify the contact information authors should use to request support, for example when running into issues or needing workarounds for storage quotas. Find this contact information in the section “How to Publish on Preferred Hosting Platforms”.
I need to require credentialized (AKA gated) access to my dataset
This will be possible on the condition that credentialization is necessary for the public good (e.g., because of ethically sensitive medical data), and that an established credentialization procedure is in place that 1) is open to a large section of the public; 2) provides rapid response and access to the data; and 3) is guaranteed to be maintained for many years. A good example is PhysioNet Credentialing, where users must first learn how to handle data involving human subjects, yet access is open to anyone who has learned and agrees to the rules.
This should be seen as an exceptional measure, and NOT as a way to limit access to data for other reasons (e.g., to shield data behind a Data Transfer Agreement). Misuse would be grounds for desk rejection. During submission, you can indicate that your dataset involves open credentialized access, in which case the necessity, openness, and efficiency of the credentialization process itself will also be checked.