

How Copyright Shapes Your Datasets and What To Do About It

Amanda Levendowski


Grappling with copyright law is unavoidable for ML researchers. Copyright protects works like text, photographs, and videos, all of which are used as ML training data, often without the consent of the copyright owner. Relying on public domain works (like works published pre-1926), Creative Commons-licensed data (like Wikipedia), or ubiquitous data (like the Enron emails) seems like an easy way to avoid dealing with copyright. Unfortunately, relying only on those works predictably introduces bias into ML algorithms. This Workshop will not provide any legal advice, but it will equip researchers with the tools to understand copyright law and its relationship to ML bias, explain how the fair use doctrine may allow some copyrighted works to be used as training data without consent, and point to resources for obtaining legal advice related to copyright and ML research. Attendees will be able to participate in a Q&A after the presentation.

These are some of the resources mentioned in the discussion:

  • Friendly Neighborhood Tech Clinics (no single website, but offices are scattered throughout the US and possibly other countries)
  • Paper: Resisting Face Surveillance with Copyright Law
  • Paper: How Copyright Law Can Fix Artificial Intelligence's Implicit Bias Problem
  • Paper: Fair Learning by Mark Lemley and Bryan Casey
