MatPropXtractor: Generate to Extract

Aswathy Ajith, Marcus Schwarting, Zhi Hong, Kyle Chard, Ian Foster

01 Mar 2023 (modified: 01 Jun 2023). Submitted to Tiny Papers @ ICLR 2023. Readers: Everyone

Keywords: scientific discovery, information extraction, large language models

TL;DR: This paper introduces a pipeline for extracting material-property pairs from scientific papers (PDFs), making reported results easier for researchers to find and reuse.

Abstract: The field of materials science has amassed a wealth of information about materials in text publications; however, such information often remains confined within the publication. A lack of standardized structure and consistent naming prevents the information from being used effectively for research and discovery. We introduce MatPropXtractor, an extraction system that uses pre-trained large language models (LLMs) in a generative setting to extract materials and their properties as reported in the materials science literature. MatPropXtractor is a three-step pipeline consisting of 1) a document selection tool to identify related articles, 2) a paragraph classifier to identify passages containing important materials properties, and 3) a property extractor that exploits in-context learning in GPT-3. MatPropXtractor extracted 154 material-property pairs from five materials science papers. An expert analyzed the extracted pairs; the pipeline obtained an average precision of 72.73% on paragraph classification and 56.7% on material-property identification.
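
The three-step structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the keyword-based document filter, the `material | property` output format, and the prompt layout are assumptions made for illustration. The paragraph classifier (`is_relevant`) and the language model call (`complete`) are passed in as plain callables so that no particular model or API is implied.

```python
from typing import Callable, Dict, List, Tuple

# NOTE: all names and formats here are hypothetical, not taken from the paper.
Pair = Tuple[str, str]  # (material, property)


def select_documents(docs: Dict[str, List[str]], keywords: List[str]) -> Dict[str, List[str]]:
    """Step 1 (stand-in): keep documents with a paragraph mentioning any keyword."""
    lowered = [k.lower() for k in keywords]
    return {
        doc_id: paragraphs
        for doc_id, paragraphs in docs.items()
        if any(k in p.lower() for p in paragraphs for k in lowered)
    }


def build_prompt(examples: List[Tuple[str, List[Pair]]], paragraph: str) -> str:
    """Step 3 input: worked examples followed by the paragraph to label."""
    parts = []
    for text, pairs in examples:
        answer = "\n".join(f"{m} | {p}" for m, p in pairs)
        parts.append(f"Paragraph: {text}\nPairs:\n{answer}")
    # The final block is left open so the model continues with new pairs.
    parts.append(f"Paragraph: {paragraph}\nPairs:")
    return "\n\n".join(parts)


def parse_pairs(completion: str) -> List[Pair]:
    """Read one 'material | property' pair per line; skip malformed lines."""
    pairs = []
    for line in completion.splitlines():
        if line.count("|") != 1:
            continue
        material, prop = (s.strip() for s in line.split("|"))
        if material and prop:
            pairs.append((material, prop))
    return pairs


def run_pipeline(
    docs: Dict[str, List[str]],
    keywords: List[str],
    is_relevant: Callable[[str], bool],   # Step 2: paragraph classifier (assumed)
    complete: Callable[[str], str],       # Step 3: text-completion model (assumed)
    examples: List[Tuple[str, List[Pair]]],
) -> List[Pair]:
    """Chain the three steps and collect extracted pairs."""
    extracted: List[Pair] = []
    for paragraphs in select_documents(docs, keywords).values():
        for paragraph in paragraphs:
            if is_relevant(paragraph):
                prompt = build_prompt(examples, paragraph)
                extracted.extend(parse_pairs(complete(prompt)))
    return extracted
```

In-context learning here means only that the labeled examples are placed in the prompt ahead of the query paragraph; the model's weights are not updated. Passing the classifier and the model as callables keeps the sketch testable with simple stand-ins.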
