Automated, LLM enabled extraction of synthesis details for reticular materials from scientific literature
Abstract
Automated knowledge extraction from scientific literature can potentially acceleratematerials discovery. We have investigated an approach for extracting synthesisprotocols for reticular materials from scientific literature using large languagemodels (LLMs). To that end, we introduce a Knowledge Extraction Pipeline (KEP)that automatizes LLM-assisted paragraph classification and information extraction.By applying prompt engineering with in-context learning (ICL) to a set of open-source LLMs, we demonstrate that LLMs can retrieve chemical information fromPDF documents, without the need for fine-tuning or training and at a reduced riskof hallucination. By comparing the performance of five open-source families ofLLMs in both paragraph classification and information extraction tasks, we observeexcellent model performance even if only few example paragraphs are included inthe ICL prompts. The results show the potential of the KEP approach for reducinghuman annotations and data curation efforts in automated scientific knowledgeextraction.