Driving Materials Innovation with Natural Language Processing


Most materials data is currently scattered across the text, tables, and figures of millions of scientific publications. In this talk, I will present our team's work at Lawrence Berkeley National Laboratory on using natural language processing (NLP) and machine learning to extract and discover materials knowledge from the abstracts of several million journal articles. With this approach we are exploring new avenues for materials discovery and design, such as identifying functional materials like thermoelectrics using only unsupervised word embeddings. To date, we have applied named entity recognition to extract more than 100 million mentions of materials, structures, properties, applications, synthesis methods, and characterization techniques from our database of over 3 million materials science abstracts. Building on these extractions, we are developing machine learning tools that autonomously build databases of materials-property data from unstructured text. Finally, I will give a sneak peek at the public-facing website and API we have developed to make this data freely available to the materials research community.