Deprecated Datasets

The following datasets are known to have been deprecated by their creators:

  • Full version of LAION-5B 
  • Versions of ImageNet 21k from prior to Dec 2019
  • 80 Million Tiny Images
  • MS-Celeb-1M
  • Duke MTMC
  • Brainwash
  • MegaFace
  • HRT-Transgender

Those involved in NeurIPS 2022 are asked to check for the use of deprecated datasets and make recommendations for removal/replacement. While the use of deprecated datasets is not currently a basis for rejection, it may become one in the future.
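One way to make the check above less manual is to scan a paper's text for the listed dataset names. The following is a minimal sketch, not an official NeurIPS tool: the function names and the matching rule (case-insensitive, ignoring punctuation and spaces) are assumptions made for illustration.

```python
import re

# Dataset names follow the list above. Only certain versions of LAION-5B and
# ImageNet 21k are deprecated, so a match is a prompt for manual review,
# not a verdict.
DEPRECATED_DATASETS = [
    "LAION-5B",
    "ImageNet 21k",
    "80 Million Tiny Images",
    "MS-Celeb-1M",  # spelled with the digit 1
    "Duke MTMC",
    "Brainwash",
    "MegaFace",
    "HRT-Transgender",
]


def _normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def find_deprecated_mentions(text: str) -> list[str]:
    """Return the listed dataset names that appear in `text` (hypothetical helper)."""
    haystack = _normalize(text)
    return [name for name in DEPRECATED_DATASETS if _normalize(name) in haystack]
```

Because matching is by substring, ordinary words such as "brainwash" can trigger false positives, and a dataset referred to by an alias will be missed, so each hit still needs to be read in context.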

Context

Dataset deprecation, the removal of a dataset from circulation, is a critical part of the data stewardship process. Datasets can be deprecated for many reasons, such as technical or legal issues, but also for more mundane reasons, such as the release of a new version. In all deprecation cases, to ensure datasets’ integrity and legitimacy, the NeurIPS community is developing mechanisms to ensure that information about deprecations is easy to find, well documented, and clearly communicated. This will also help researchers and practitioners avoid technical, legal, and ethical issues when using datasets that have been (or should be) deprecated.

Reasons for Dataset Deprecation

  • Technical considerations: e.g. contamination and duplication (the same data appearing in both the training and test splits of a given dataset), or obsolete data.
  • Legal considerations: e.g. potential violations of laws governing privacy, discrimination, data protection, intellectual property licenses, fair decision-making processes, consumer protection, and use of an individual’s image or likeness.
  • Ethical considerations: datasets can produce different forms of harm, e.g. representational harms (such as reinforcing the negative or stereotyped depiction of particular groups by virtue of identity). As numerous audits of datasets have shown, dataset harms tend to disproportionately affect marginalized groups along the intersecting axes of race, ethnicity, gender, class, etc.
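The contamination case listed under technical considerations, where the same data appears in both the training and test splits, can be illustrated with a short sketch. It assumes text examples held in memory as strings and detects only exact matches after case and whitespace normalization; the function names are hypothetical, and near-duplicates or image data would need other techniques.

```python
import hashlib


def fingerprint(example: str) -> str:
    """Hash a case- and whitespace-normalized example so trivial variants collide."""
    canonical = " ".join(example.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_contamination(train: list[str], test: list[str]) -> list[int]:
    """Return indices of test examples whose fingerprint also occurs in train."""
    train_hashes = {fingerprint(x) for x in train}
    return [i for i, x in enumerate(test) if fingerprint(x) in train_hashes]
```

Storing only hashes of the training split keeps memory use modest for large datasets, at the cost of missing paraphrased or lightly edited duplicates.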

Dataset deprecation is an important part of the dataset life cycle, and good practices for deprecation make for better research and implementation. 

Further Reading

  1. Harvey, A., & LaPlace, J. (2019). Megapixels: Origins, ethics, and privacy implications of publicly available face recognition image datasets. Megapixels, 1(2), 6.
  2. Luccioni, A., Corry, F., Sridharan, H., Ananny, M., Schultz, J., & Crawford, K. (2022). Caring for Datasets: A Framework for Deprecating Datasets and Responsible Data Stewardship. FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency Proceedings.
  3. Prabhu, V. U., & Birhane, A. (2020). Large image datasets: A pyrrhic win for computer vision?. arXiv preprint arXiv:2006.16923.