Natural Language Processing for Materials Discovery


Most materials data is currently scattered across the text, tables, and figures of millions of scientific publications. We present recently developed natural language processing and machine learning techniques that extract materials knowledge through textual analysis of the abstracts of several million journal articles. We describe our use of Word2Vec to map the words in our corpus to vector representations, which then serve as inputs to named entity recognition (NER) classifiers that extract materials, structures, properties, applications, synthesis methods, and characterization techniques from the abstracts in our database. With this information, we have created new tools for materials literature review, including searching within chemical systems, filtering articles by experiment or theory, summarizing the known attributes of a material, and finding materials similar to a target. Furthermore, we report how these techniques can be used not only to summarize existing knowledge automatically, but also to enable new ways of discovering novel materials, such as thermoelectrics or ion conductors, by revealing previously undiscovered relationships between materials and their properties.
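As a rough illustration of how word vectors can support "finding similar materials to a target", the sketch below ranks candidate terms by cosine similarity to a query term. It is a minimal example under stated assumptions, not the pipeline described above: the three-dimensional vectors in `EMBEDDINGS` are invented for illustration (real Word2Vec vectors are learned from the corpus and typically have hundreds of dimensions), and the names `EMBEDDINGS`, `cosine_similarity`, and `rank_by_similarity` are hypothetical.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# In practice these vectors would come from a Word2Vec model trained on abstracts.
EMBEDDINGS = {
    "thermoelectric": [0.9, 0.1, 0.0],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.3, 0.0],
    "LiCoO2": [0.1, 0.9, 0.2],
    "NaCl": [0.0, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates, embeddings=EMBEDDINGS):
    """Sort candidate terms by decreasing cosine similarity to the query term."""
    query_vec = embeddings[query]
    scored = [(name, cosine_similarity(query_vec, embeddings[name]))
              for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity(
    "thermoelectric", ["Bi2Te3", "PbTe", "LiCoO2", "NaCl"]
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up vectors, the two tellurides rank above the other compounds simply because their vectors were chosen to point in nearly the same direction as the query. With trained embeddings, the same ranking step would reflect how the terms co-occur in the literature.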