LLaMat: Large Language Models for Materials Science Information Extraction
Abstract
Large language models have emerged as important tools for information extraction and as scientific assistants in materials science and discovery. However, their performance is limited by a lack of domain expertise. In this work, we propose the LLaMat models, namely LLaMat-2-7B and LLaMat-3-8B, obtained by continued pretraining of Meta's LLaMA-2-7B and LLaMA-3-8B models, respectively, on a 30B-token corpus of materials science text to improve their domain expertise. We also develop LLaMat-Chat models, instruction fine-tuned variants of the LLaMat models trained on a dataset of one million instruction-output pairs, which enable conversational interaction and information extraction in the materials science domain. We show that LLaMat achieves state-of-the-art performance on several information extraction tasks from materials science text, with LLaMat-3-8B emerging as the best model. Finally, we demonstrate the structured information extraction capabilities of the chat models, comparing their performance on four datasets spanning named entity recognition, relation extraction, and the parsing of composition tables from materials science research papers.
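To make the training recipe concrete, the following is a minimal sketch of the continued-pretraining step using the Hugging Face Transformers library; the base checkpoint is the one named in the abstract, but the corpus file name (matsci_corpus.txt), sequence length, and all hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of continued pretraining on a materials science corpus.
# Hyperparameters and the corpus path are placeholders, not the paper's setup.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-hf"  # base checkpoint; LLaMat-2-7B starts here
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical materials science text corpus, one document per line.
corpus = load_dataset("text", data_files={"train": "matsci_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llamat-2-7b",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # Causal LM objective: mlm=False yields standard next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The instruction fine-tuning stage that produces the LLaMat-Chat variants would follow the same pattern, with the raw text corpus replaced by the instruction-output pairs formatted into a chat template.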