Poster in Workshop: AI for Accelerated Materials Design (AI4Mat-2023)
Accurate Prediction of Experimental Band Gaps from Large Language Model-Based Data Extraction
Samuel Yang · Shutong Li · Subhashini Venugopalan · Vahe Tshitoyan · Muratahan Aykol · Amil Merchant · Ekin Dogus Cubuk · Gowoon Cheon
Keywords: [ Data Mining ] [ experimental band gap ] [ Large language models ]
Machine learning is transforming materials discovery by providing rapid predictions of material properties, which enables large-scale screening for target materials. However, such models require training data. While automated data extraction from the scientific literature has potential, current auto-generated datasets often lack sufficient accuracy, as well as the critical structural and processing details that influence the properties. Using band gap as an example, we demonstrate that large language model (LLM) prompt-based extraction yields an order of magnitude lower error rate. Combined with additional prompts that select a subset of experimentally measured properties from pure, single-crystalline bulk materials, this results in an automatically extracted dataset that is larger and more diverse than the largest existing human-curated database of experimental band gaps. Finally, we show that a model trained on our extracted dataset achieves a 15% reduction in the mean absolute error of predicted band gaps compared to a model trained on the existing human-curated database.
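To make the final comparison concrete, the following minimal sketch shows how a mean absolute error (MAE) and its relative reduction between two models could be computed. The function names and the toy band gap values are hypothetical and chosen only for illustration; they are not taken from the paper, and the toy reduction they produce (50%) is unrelated to the 15% reported above.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_mae, new_mae):
    """Fractional decrease of new_mae relative to baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - new_mae) / baseline_mae


# Toy band gaps in eV (illustrative values, NOT data from the paper).
measured = [1.0, 2.0, 3.0, 4.0]
baseline_pred = [1.5, 2.5, 2.5, 4.5]       # hypothetical baseline model
extracted_pred = [1.25, 2.25, 2.75, 4.25]  # hypothetical model trained on extracted data

baseline_mae = mean_absolute_error(measured, baseline_pred)
extracted_mae = mean_absolute_error(measured, extracted_pred)
reduction = relative_reduction(baseline_mae, extracted_mae)

print(f"baseline MAE:  {baseline_mae:.2f} eV")
print(f"extracted MAE: {extracted_mae:.2f} eV")
print(f"relative reduction: {reduction:.0%}")
```

Under this reading, a 15% reduction would mean that the MAE of the model trained on the extracted dataset is 0.85 times the MAE of the model trained on the human-curated database, evaluated on the same test set.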