Guest Author

Natural Language Processing in Materials Science

  • Challenges scientific community faces today
  • Natural Language Processing (NLP) in materials science
  • The word2vec revolution
  • Automated property-value extraction

Which country has the world’s highest divorce rate? You may not know but you could do what I did and google it. Indeed, Google has the answer: The Maldives. If you are interested further, you could read the source, find precise figures and understand the causes.

However, what if you wanted to find out the piezoelectric material with the highest Curie temperature? Googling the phrase doesn’t help. The elves who run the engine only see the terms ‘piezoelectric’ and ‘curie temperature’; all you get are a bunch of pages on high Curie temperature ceramics. You could spend hours going through the pages one by one but you may never get an answer.

Let Siri explain the problem to you: ‘I don’t understand your question’.

The problem above highlights a significant bottleneck in one of the most exciting ideas in today’s world: automated high-throughput materials design. Artificial intelligence (AI) systems that can learn from existing materials and predict compositions with desired properties are now a serious scientific pursuit. The age of AI in materials science has dawned.

Read more about how AI is changing materials science in my previous articles, The Age of AI in Materials Science: parts one and two.


Machine learning requires a machine-readable database of materials and their properties from which inferences and interpolations can be extracted. This is where the new AI ideas might die before they even begin, due to utter data starvation. Most of the data from the prodigious amount of scientific research on materials are locked away in scientific publications such as books, journals, conference publications, engineering handbooks, encyclopaedias and white papers.

Challenges scientific community faces today

Materials science, like many branches of applied science, has been a closed community where information only flowed from one specialist to another, and at least a Master’s degree has been required for entry. With rapid scientific advancement, even smaller subfields have become specialised to the point that a paper on polyurethane is as incomprehensible to a ceramicist as nuclear engineering is to a biologist.

The scientific community is actively expressing dissatisfaction with how scientific research gets disseminated. Too much is locked away in paywalled journals that are difficult and costly to access, respondents said. Some also criticised the publication process itself for being too slow, bogging down the pace of research.

An expert today finds it difficult to stay updated on every development in their field. Reading and comprehending all available literature is next to impossible even in niche specialisations, and there is a genuine fear that an important discovery or insight may lie hidden in some obscure journal. Scientific literature databases such as Web of Knowledge, Elsevier, Google Scholar and Crossref have indexed publication lists matched with relevant keywords.

However, the ‘advanced search’ options of these tools are rather minimal and cannot look inside an individual text. Furthermore, scientific language and natural language have some key differences which mean that developments in natural language processing do not necessarily translate to scientific literature.

Technical language is obscure and abstract, uses jargon and notations (such as chemical symbols) that are specific to the subject matter, and key insights are often mathematical.

Example of text recognition by Amazon Comprehend Medical: the tool uses machine learning to extract relevant medical information (such as medical condition, medication, dosage, strength, and frequency) from unstructured text (e.g. doctors’ notes, clinical trial reports, and patient health records).

Importantly, the most meaningful part of many texts is not the words but the images: micrographs, topography images, graphs, measurements, models, diffractograms, spectrograms and so on. Reading and evaluating a specific type of image can require extensive training in the subject.

Despite knowing what they are looking at, a layperson cannot hope to understand a chest X-ray the same way a trained radiologist does. When even experts struggle with Kikuchi bands and Pourbaix diagrams, how can we hope for a computer to outperform human comprehension?

NLP in materials science

Natural language processing (NLP), as a subfield of computer science, has traditionally relied on extracting semantic information from texts based on the conditional frequencies of individual word tokens. Starting with a large body of text containing millions of words, each word occurrence is counted both on its own and in conjunction with other words. When the corpus is large enough, this approach can lead to meaningful insights.
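To make the counting idea concrete, here is a minimal sketch in Python. It assumes that "in conjunction" means "in the same sentence", which is only one of several possible choices. The sentences and the function names (`cooccurrence_counts`, `conditional_frequency`) are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_counts(sentences):
    """Count single tokens and unordered token pairs that share a sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sorted(set(tokenize(sentence)))  # each word counted once per sentence
        word_counts.update(tokens)
        for i, first in enumerate(tokens):
            for second in tokens[i + 1:]:
                pair_counts[(first, second)] += 1
    return word_counts, pair_counts


def conditional_frequency(word, given, word_counts, pair_counts):
    """Estimate how often `word` appears in sentences that contain `given`."""
    if word_counts[given] == 0:
        return 0.0
    pair = tuple(sorted((word, given)))
    return pair_counts[pair] / word_counts[given]


# Made-up sentences, for illustration only
sentences = [
    "Gold spheres were synthesised in solution.",
    "Gold spheres showed a plasmon peak.",
    "Gold rods were grown from seeds.",
    "Silver spheres were also prepared.",
]
word_counts, pair_counts = cooccurrence_counts(sentences)
p_spheres_given_gold = conditional_frequency("spheres", "gold", word_counts, pair_counts)
p_rods_given_gold = conditional_frequency("rods", "gold", word_counts, pair_counts)
print(p_spheres_given_gold, p_rods_given_gold)
```

With only four sentences the numbers mean little. The point is that the same bookkeeping, applied to millions of sentences, is what lets frequency-based methods surface associations between terms.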

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

The lexical dispersion plot of figure 1 is one such analysis [1]. A number of texts on ‘quantum dots’ are joined end to end to form one long string of words. The position of an individual token within the string such as the word ‘gold’ is marked by a red dot. Thus, the middle line in the figure shows the position (and the occurrence) of the word gold in the selected literature on quantum dots.

Similarly, the positions of the words ‘rod’ and ‘spheres’ are marked in the plot. Because the selected texts all deal with quantum dots, the figure is a visual representation of the common morphologies of gold nanoparticles. A simple visual inspection confirms that gold quantum dots are more common as spheres than as rods. Today, it is much harder to reach the same conclusion by any means other than intellectual human labour.

Fig 1: Lexical dispersion on the topic of ‘quantum dots’ from literature. The result visually confirms that gold nanoparticles are more commonly spherical in shape rather than rod-like.
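The bookkeeping behind such a plot is simple enough to sketch. The snippet below joins a few made-up sentences end to end, records the token position of each target word and draws a crude text version of the plot. The helper names `token_offsets` and `render_dispersion` are hypothetical. NLTK, the library described in [1], offers a graphical equivalent through the `dispersion_plot` method of its `Text` objects.

```python
import re


def token_offsets(texts, targets):
    """Join texts end to end and return the token positions of each target word."""
    tokens = []
    for text in texts:
        tokens.extend(re.findall(r"[a-z]+", text.lower()))
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets, len(tokens)


def render_dispersion(offsets, total, width=60):
    """Draw one text row per word, with '|' marking where the word occurs."""
    lines = []
    for word, positions in offsets.items():
        row = ["."] * width
        for position in positions:
            row[position * width // total] = "|"
        lines.append(f"{word:>8} {''.join(row)}")
    return "\n".join(lines)


# Made-up sentences standing in for a corpus of papers
texts = [
    "Gold spheres were prepared by citrate reduction.",
    "The gold rods required a seed solution.",
    "Spheres of gold dominate the samples; spheres are easier to grow.",
]
offsets, total = token_offsets(texts, ["gold", "rods", "spheres"])
print(render_dispersion(offsets, total))
```

Each row is one word and each mark is one occurrence, so rows with denser marks correspond to words that are more common in the selected literature.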

The field of NLP underwent a revolution in 2013 when Tomas Mikolov and colleagues at Google used neural networks to vectorise words [2]. In simple terms, the goal is to assign a unique vector to each word in such a way that similar words lie close together and dissimilar words lie far apart.

This is usually done in a high-dimensional space of 300 to 1000 dimensions. Several vector space modelling methods, such as term frequency-inverse document frequency (TF-IDF) [3], had been used before, but the neural network approach by the Google team was not only much more computationally efficient but also captured relations that were nothing short of extraordinary.
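For comparison with the neural approach, the older TF-IDF (term frequency-inverse document frequency) weighting can be sketched in a few lines. This is one common variant, tf × log(N/df), chosen here as an assumption; libraries differ in smoothing and normalisation. The three "documents" are made up.

```python
import math
import re
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents in which each term appears
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up one-line documents, for illustration only
documents = [
    "silicon is a semiconductor",
    "silicon is used in glass",
    "gold is a noble metal",
]
weights = tf_idf(documents)
print(weights[0])
```

A word that appears in every document, such as ‘is’, gets a weight of zero, while a word confined to one document gets the highest weight. Each document becomes a sparse vector with one dimension per vocabulary term, which is part of why such models scale poorly compared with dense word vectors.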


The word2vec revolution

For example, a vector space model was trained on a large corpus of Google News text. When the vector for ‘man’ is subtracted from the vector for ‘king’ and the vector for ‘woman’ is added, the resulting vector lies closest to the vector for ‘queen’.

king – man + woman = queen

This is shown visually on the left-hand side of figure 2. Several such relationships have been demonstrated for neural network-trained word vectors, showing that this technique captures the semantic relationships between words much better than earlier techniques and bears an eerie resemblance to human cognition. This technique has since undergone extensive development and now powers Google Translate and several bioinformatics packages, amongst others.

Fig 2: Left: The relationship between word vectors of gendered roles [4]. Right: words with the highest cosine similarity to ‘silicon’ and ‘use’ extracted from the Wikipedia page for silicon.

The right-hand side of the figure plots the words most similar to ‘silicon’ and ‘use’, measured by the cosine similarity of their vectors (the dot product of the vectors after each is normalised to unit length). This was generated using a substantially smaller corpus of only 200 words from the Wikipedia page for silicon. Even then, well-known application areas of silicon such as ‘phones’, ‘clinical’, ‘semiconductor’ and ‘glass’ are the first choices.
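Both the analogy and the similarity ranking come down to the same operation: comparing vectors by the cosine of the angle between them. The sketch below uses hand-made three-dimensional vectors, not trained embeddings, so the numbers are assumptions chosen to make the geometry visible. Real word2vec vectors have hundreds of dimensions and are learned from text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    candidates = {word: vec for word, vec in vectors.items() if word not in exclude}
    return max(candidates, key=lambda word: cosine_similarity(query, candidates[word]))


# Hand-made toy vectors with axes (royalty, maleness, femaleness); not trained
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "child": [0.1, 0.5, 0.5],
}

# king - man + woman, computed component by component
query = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# The input words are excluded, as is usual when evaluating analogies
result = most_similar(query, vectors, exclude=("king", "man", "woman"))
print(result)
```

In practice the arithmetic rarely lands exactly on a word vector, which is why the answer is taken to be the nearest remaining word rather than an exact match.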


A notable application of neural networks (NNs) in materials science is the extraction of synthesis parameters of metal oxides by Olivetti and her team at MIT [5]. They parsed synthesis-specific information from over 12,000 articles using tools such as ChemDataExtractor and spaCy [6]. These tools can recognise chemical notation such as molecular formulas in the text and connect it to the common names of the compounds. They are generally reliable, although they perform better for some topics than others.
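To give a flavour of what recognising a formula involves, here is a deliberately naive sketch that spots simple formulas in a sentence and breaks them into element counts. It is not how ChemDataExtractor or spaCy work internally. The element list is a small hand-picked subset, the function names are hypothetical, and brackets, hydrates and fractional stoichiometries are ignored.

```python
import re

# A small subset of element symbols, enough for the example below
ELEMENTS = {"H", "C", "N", "O", "Si", "Ti", "Ba", "Fe", "Zn", "Pb", "Zr", "Li", "Co"}

# One element symbol followed by an optional integer count
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(candidate):
    """Return {element: count} if the string is a simple formula, else None."""
    composition = {}
    position = 0
    for match in TOKEN.finditer(candidate):
        if match.start() != position:
            return None  # characters that are not part of any symbol
        symbol, digits = match.groups()
        if symbol not in ELEMENTS:
            return None
        composition[symbol] = composition.get(symbol, 0) + int(digits or 1)
        position = match.end()
    if position != len(candidate) or not composition:
        return None
    return composition


def find_formulas(text):
    """Return {formula: composition} for words that contain at least two elements."""
    found = {}
    for word in re.findall(r"[A-Za-z0-9]+", text):
        composition = parse_formula(word)
        if composition is not None and len(composition) >= 2:
            found[word] = composition
    return found


sentence = "BaTiO3 powders were calcined at 900 C, while ZnO and LiCoO2 films were annealed in air."
formulas = find_formulas(sentence)
print(formulas)
```

Even this toy shows why the task is hard: the lone ‘C’ in ‘900 C’ parses as carbon and is only rejected by the two-element rule, which would in turn miss genuine single-element materials.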

The result of their study is summarised in figure 3, which shows the median value obtained for the calcination temperature of bulk and nano-metallic compounds. They have subsequently extended this technique by using a neural network architecture called a variational autoencoder, which can intelligently interpolate synthesis parameters from existing data [7].

Fig 3: The median calcination temperatures of bulk and nanometallic compounds extracted from literature [11].

These results are promising and a great step forward. However, they are essentially very smart guesses. The endpoint for scientific language processing is to extract exact information from a text, not approximations.


Automated property-value extraction

Another notable work in this field is from the creators of ChemDataExtractor, who auto-populated a database of the Curie and magnetic Néel temperatures of chemical compounds from a corpus of 68,078 journal articles using a newly adapted version of ChemDataExtractor [8].

The semi-supervised algorithm was shown to have a precision of 82 % in correctly identifying the relevant numerical values. More complex extractions have been conducted for organic compounds and pharmaceuticals. Others have used simple labelled examples to extract relationships between concepts from text, such as those among {annealing, grain size, strength} [9].
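A toy version of property-value extraction, and of how precision is scored, might look like the sketch below. It uses a single hand-written regular expression, which is an assumption made for illustration and far simpler than the semi-supervised approach of [8]. The sentences, the pattern and the function names are all made up.

```python
import re

# Matches phrases such as "Curie temperature of BaTiO3 is about 393 K"
PATTERN = re.compile(
    r"Curie temperature of (?P<compound>[A-Za-z0-9]+) (?:is|was) "
    r"(?:about |approximately )?(?P<value>\d+(?:\.\d+)?) ?(?P<unit>K|°C)"
)


def extract_curie_temperatures(text):
    """Return (compound, temperature in kelvin) pairs matched by PATTERN."""
    records = []
    for match in PATTERN.finditer(text):
        value = float(match.group("value"))
        if match.group("unit") == "°C":
            value += 273.15  # convert degrees Celsius to kelvin
        records.append((match.group("compound"), value))
    return records


def precision(extracted, correct):
    """Fraction of extracted records that are actually correct."""
    if not extracted:
        return 0.0
    true_positives = sum(1 for record in extracted if record in correct)
    return true_positives / len(extracted)


# Made-up sentences; the last one states a shift, not a Curie temperature
text = (
    "The Curie temperature of BaTiO3 is about 393 K. "
    "The Curie temperature of Fe is 1043 K. "
    "The Curie temperature of Ni was 50 K lower after doping."
)
records = extract_curie_temperatures(text)
# Hand-labelled answers that a human reader would accept for this text
correct = {("BaTiO3", 393.0), ("Fe", 1043.0)}
score = precision(records, correct)
print(records, score)
```

The third sentence shows how a pattern can be fooled: ‘50 K lower’ is a change in temperature, yet it is extracted as a value. Errors of this kind are what a precision figure measures.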

An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.

All of these methods, however, rely on a large body of text to be effective. No tool currently exists that can cognitively read a single text and correlate what it knows with what it sees. Furthermore, there needs to be a lot more progress in the automated reading of figures from these texts.

A team at the University at Buffalo have demonstrated the digitised reading of binary phase diagrams and its application to metallic glass formation [10]. In general, however, automatically interpreting the figures embedded within the body of a published text remains a complex and challenging problem.

In summary, much of the future development of materials science relies on effective machine comprehension of existing scientific literature. That also makes it one of the most important problems being studied today. Thankfully, this is a universal problem faced by all fields of science, so it’s likely that we will see rapid, accelerating progress in a relatively short time.

"I work in the cutting edge area of applying machine learning and artificial intelligence to one of the earliest endeavours of human civilization - in understanding, exploiting and developing new materials."


[1] Bird, S.; Klein, E.; Loper, E., Natural Language Processing with Python. O’Reilly, 2009.
[2] Mikolov, T.; Chen, K.; Corrado, G.; Dean, J., Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013.
[3] Spärck Jones, K., A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 1972. 28(1).
[4] TensorFlow, “Vector representations of words”. Available: TensorFlow.
[5] Kim, E.; Huang, K.; Tomala, A.; Matthews, S.; Strubell, E.; Saunders, A.; McCallum, A.; Olivetti, E., Machine-learned and codified synthesis parameters of oxide materials. Scientific Data, 2017. 4: 170127.
[6] Swain, M.C.; Cole, J.M., ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. Journal of Chemical Information and Modeling, 2016. 56: p. 1894-1904.
[7] Kim, E.; Huang, K.; Jegelka, S.; Olivetti, E., Virtual screening of inorganic materials synthesis parameters with deep learning. npj Computational Materials, 2017. 3(53).
[8] Court, C.J.; Cole, J.M., Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction. Scientific Data, 2018. 5.
[9] Onishi, T.; Kadohira, T.; Watanabe, I., Relation extraction with weakly supervised learning based on process-structure-property-performance reciprocity. Science and Technology of Advanced Materials, 2018.
[10] Dasgupta, A.; Broderick, S.R.; Mack, C.; Kota, B.U.; Subramanian, R.; Setlur, S.; Govindaraju, V.; Rajan, K., Probabilistic Assessment of Glass Forming Ability Rules for Metallic Glasses Aided by Automated Analysis of Phase Diagrams. Scientific Reports, 2019. 9.
[11] Kim, E.; Huang, K.; Saunders, A.; McCallum, A.; Ceder, G.; Olivetti, E., Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning. Chemistry of Materials, 2017. 29(21): p. 9436-9444.

*This article is the work of the guest author shown above. The guest author is solely responsible for the accuracy and the legality of their content. The content of the article and the views expressed therein are solely those of this author and do not reflect the views of Matmatch or of any present or past employers, academic institutions, professional societies, or organizations the author is currently or was previously affiliated with.
