One concern with the rise of large language models is their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and have failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a ~256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers, and we discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.
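As a rough illustration of how a corpus like this can be consumed, the sketch below streams one subcorpus with the Hugging Face datasets library rather than downloading the full ~256GB. The hub path pile-of-law/pile-of-law, the config name courtlistener_opinions, and the "text" record field are assumptions about how the release is hosted, not details confirmed by this abstract; consult the paper's data release for the exact identifiers.

    # A minimal sketch of streaming one Pile of Law subcorpus with the
    # Hugging Face `datasets` library. The hub path, config name, and
    # record field below are assumptions, not taken from the abstract.
    from datasets import load_dataset

    # Stream instead of downloading: the full corpus is ~256GB and growing.
    opinions = load_dataset(
        "pile-of-law/pile-of-law",    # assumed Hub dataset path
        "courtlistener_opinions",     # assumed config: U.S. court opinions
        split="train",
        streaming=True,
    )

    # Peek at one record; the "text" field name is likewise an assumption.
    first = next(iter(opinions))
    print(first["text"][:500])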
Author Information
Peter Henderson (Stanford University)
Mark Krass (Stanford University)
Lucia Zheng (Stanford University)
Neel Guha (Computer Science Department, Stanford University)
Christopher D Manning (Stanford University)
Dan Jurafsky (Stanford University)
Daniel Ho (Stanford Law)
More from the Same Authors
- 2021: Beyond Ads: Sequential Decision-Making Algorithms in Public Policy »
  Peter Henderson · Brandon Anderson · Daniel Ho
- 2021: Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning »
  Kaylee Burns · Christopher D Manning · Li Fei-Fei
- 2022 Panel: Panel 4C-5: Pile of Law:… & Multi-LexSum: Real-world Summaries… »
  Zejiang Shen · Peter Henderson
- 2022 Poster: Deep Bidirectional Language-Knowledge Graph Pretraining »
  Michihiro Yasunaga · Antoine Bosselut · Hongyu Ren · Xikun Zhang · Christopher D Manning · Percy Liang · Jure Leskovec
- 2022 Poster: Picking on the Same Person: Does Algorithmic Monoculture lead to Outcome Homogenization? »
  Rishi Bommasani · Kathleen A. Creel · Ananya Kumar · Dan Jurafsky · Percy Liang
- 2020 Workshop: ML Retrospectives, Surveys & Meta-Analyses (ML-RSA) »
  Chhavi Yadav · Prabhu Pradhan · Jesse Dodge · Mayoore Jaiswal · Peter Henderson · Abhishek Gupta · Ryan Lowe · Jessica Forde · Joelle Pineau
- 2020 Poster: Language Through a Prism: A Spectral Approach for Multiscale Language Representations »
  Alex Tamkin · Dan Jurafsky · Noah Goodman
- 2018 Poster: Embedding Logical Queries on Knowledge Graphs »
  Will Hamilton · Payal Bajaj · Marinka Zitnik · Dan Jurafsky · Jure Leskovec
- 2014 Poster: Learning Distributed Representations for Structured Output Prediction »
  Vivek Srikumar · Christopher D Manning
- 2014 Spotlight: Learning Distributed Representations for Structured Output Prediction »
  Vivek Srikumar · Christopher D Manning
- 2014 Poster: Simple MAP Inference via Low-Rank Relaxations »
  Roy Frostig · Sida Wang · Percy Liang · Christopher D Manning
- 2013 Poster: Reasoning With Neural Tensor Networks for Knowledge Base Completion »
  Richard Socher · Danqi Chen · Christopher D Manning · Andrew Y Ng
- 2013 Poster: Zero-Shot Learning Through Cross-Modal Transfer »
  Richard Socher · Milind Ganjoo · Christopher D Manning · Andrew Y Ng
- 2012 Poster: Recursive Deep Learning on 3D Point Clouds »
  Richard Socher · Bharath Bhat · Brody Huval · Christopher D Manning · Andrew Y Ng
- 2011 Poster: Unfolding Recursive Autoencoders for Paraphrase Detection »
  Richard Socher · Eric H Huang · Jeffrey Pennington · Andrew Y Ng · Christopher D Manning