Talk to anyone today and they will tell you that Artificial Intelligence is the next big thing – the hot potato that everyone wants a piece of, but no one can chew.
A good majority of them will also tell you that much of what goes around as AI is really just hype – a glorification of good old machine learning and mathematics dressed up in PowerPoint. And for the most part, they would be right.
However, one area where the application of AI tools such as Deep Learning has been nothing short of revolutionary is Natural Language Processing.
An easy example is the chatbots that man websites. They are run by relatively complicated deep learning architectures called Long Short-Term Memory (LSTM) neural networks. These algorithms can ‘comprehend’ what we tell them and piece together coherent, legible sentences in response. Sure, this bot is no Socrates, but it doesn’t spew out a random jumble of words. There is the undeniable suggestion of some low-level intelligence.
The modern era of deep learning in language processing kick-started with the publication in 2013 of Tomas Mikolov’s word2vec paper. The triumph of Mikolov and his colleagues was in developing a computationally feasible method to generate word embeddings, or word vectors, using neural networks.
Consider the words man, woman, king and queen. If you were asked to group these words, you have a number of common-sense choices. I tend to see [man, woman] and [king, queen]. You could be seeing [man, king] and [woman, queen].
I also know that the words ‘man’ and ‘king’ are related in exactly the same way as ‘woman’ and ‘queen’.
man:king = woman:queen
Even if I had never heard these words before, I could learn these relationships by observing the sentences that I come across: ‘This man is a king’, ‘The queen was a godly woman’, ‘She reigned as the queen of the silver screen’, ‘His kingdom will come’. These sentences suggest, through the proximity of words alone, that a king is most likely a man and that a queen is most likely a woman.
Word embeddings do the same thing, but for millions of words from thousands of documents. The key here is that the words are learned from context. What enables this mathematical analogy game is the power of modern computation and the magic of deep learning.
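The analogy game above can be played with plain vector arithmetic: subtract the ‘man’ vector from ‘king’, add ‘woman’, and the nearest word to the result should be ‘queen’. Here is a minimal sketch with hand-picked two-dimensional toy vectors – every number below is invented for illustration, whereas real embeddings are learned and have hundreds of dimensions:

```python
import math

# Toy 2-D "embeddings" chosen by hand so the analogy holds exactly;
# real word vectors are learned from text, not assigned like this.
vectors = {
    "man":   (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king":  (3.0, 0.0),
    "queen": (3.0, 1.0),
    "apple": (0.0, 3.0),  # a distractor word, to make the search non-trivial
}

def cosine(a, b):
    """Cosine similarity: the dot product of the two unit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# man : king = woman : ?   ->   king - man + woman
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))

# The word whose vector lies closest to the target answers the analogy.
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vectors[w], target))
print(best)  # → queen
```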
Deep learning word embeddings
Let’s say that we want to find the embeddings for all the words in Harry Potter.
We first create a sort of mathematical vault-library-chamber. A monstrous multidimensional behemoth that is big enough to hold all the words we need. This is the vector space.
The goal is to go through Harry Potter word by word and put each word into a vault in the chamber. Similar words like Dress and Cloak go in the same vault. Quidditch and Snitch are in adjacent vaults. Car and Centaur are as far away as Banana and Voldemort.
The word embedding of a word is the address of the vault in which it shall be found. Mathematically, this makes it a vector in the vector space.
You can see why no human would ever want this job. There are far too many words and far too much moving around involved.
A neural net however does this exceptionally well. It does this by, well, magic.
A deep neural net is a sort of massive machine with millions of gears and levers. In the beginning it’s all chaos and nothing fits with anything even though there is shuffling all around. Then slowly some of the gears start locking. The levers fall into place – and order emerges from chaos. The machine starts moving. Frankenstein is alive!
The language here is deliberately vague. I want to take you to the applications of word embeddings, rather than how they are derived. Having said that, at a fundamental level we don’t quite know how neural networks do what they do. Thus, in our experiments, we have to play around with the number of layers, the activation functions, the number of neurons in each layer and so on before we get to our task. But that is a topic for another day.
In a paper published in 2019, a team of researchers at the Lawrence Berkeley Lab generated the word embeddings of all the abstracts in around 3.3 million papers published across 1000 journals. This list is obviously huge and covers almost every topic published in material science over the last couple of decades.
When it comes to scientific text, chemical formulas and symbols are also ‘words’. Therefore, there is a word vector for LiCoO2 – a common battery cathode material. You can then ask questions like: what are the closest word vectors to LiCoO2?
We know that LiCoO2 is a vector in the vector space. So all we need to do is find the vectors that are close by.
The answer comes out as LiMn2O4, LiNi0.5Mn1.5O4, LiNi0.8Co0.2O2, LiNi0.8Co0.15Al0.05O2 and LiNiO2—all of which are also lithium-ion cathode materials.
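A nearest-neighbour query of this kind can be sketched in a few lines. The vectors below are hand-made three-dimensional stand-ins, not the learned embeddings from the paper – the cathode formulas are placed near one another and an unrelated compound far away – but the ranking logic, cosine similarity between the query word and every other word, is the same:

```python
import math

# Hand-made 3-D stand-ins for learned embeddings (all numbers invented
# for illustration). In the actual study the vectors are learned from
# millions of abstracts, and formulas like LiCoO2 are just tokens.
embeddings = {
    "LiCoO2":  (0.90, 0.10, 0.05),
    "LiMn2O4": (0.80, 0.20, 0.10),
    "LiNiO2":  (0.85, 0.15, 0.05),
    "H2O":     (0.00, 0.10, 0.90),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def most_similar(query, k=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = {w: cosine(v, embeddings[query])
              for w, v in embeddings.items() if w != query}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(most_similar("LiCoO2"))  # the cathode materials rank first, H2O last
```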
See what we did here?
We were really trying to explore other materials that were similar to our favorite cathode. Instead of reading a thousand papers, making notes and coming up with a list of lithium compounds, the word embeddings solved the task in a few seconds.
This is the power of word embedding. By converting semantic enquiries to mathematical vector operations, this approach allows us to query and comprehend large text databases better and more efficiently.
As a further example, the researchers studied how often a chemical compound was found near the vector for ‘thermoelectrics’. (These are materials that convert heat into electrical energy, or vice versa.)
You can do this through a straightforward vector operation called the dot product. Similar vectors have a normalised dot product (the cosine similarity) approaching one; dissimilar vectors have one near zero.
By performing the same operation on chemical compounds in the database and the word ‘thermoelectric’, the authors found all chemicals that were likely to be thermoelectric.
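As a sketch of this screening step, assume we have a vector for the word ‘thermoelectric’ and one for each candidate compound. The numbers below are toy values invented for illustration – Bi2Te3 is a textbook thermoelectric material, NaCl is not – and the threshold is arbitrary:

```python
import math

# Toy 2-D vectors, hand-picked for illustration: the known thermoelectric
# is placed near the word 'thermoelectric', the unrelated salt far away.
vectors = {
    "thermoelectric": (1.0, 0.0),
    "Bi2Te3":         (0.9, 0.2),
    "NaCl":           (0.1, 1.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Score every compound against the word 'thermoelectric' and keep those
# above an (arbitrary) similarity threshold.
query = vectors["thermoelectric"]
scores = {w: cosine(v, query) for w, v in vectors.items()
          if w != "thermoelectric"}
likely = [w for w, s in scores.items() if s > 0.5]
print(likely)  # → ['Bi2Te3']
```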
The authors go on to show that similar relationships can be demonstrated for several material properties such as crystal structure and ferroelectricity. Further, they show that using this technique several of the current thermoelectrics could have been predicted years ago from existing literature.
The analysis is a beautiful, elegant, yet deceptively simple expression of the question ‘Of all the materials studied by man, which ones are likely to be thermoelectric?’
Material databases are the need of the hour
You would assume that we already have this list – surely someone has been making note of all the work that we have been doing? Compiling materials handbooks and electronic databases?
The answer is a surprising no. The vast amount of knowledge that we have amassed over the years is locked away in texts such as books, journals and papers. There are so many of these that it is impossible for us to scan through them manually.
This is precisely why word embeddings and the techniques demonstrated in this paper are nothing short of revolutionary.
They promise to change the way we interact with text and to rapidly expand our databases of materials.
What are some of the materials that have been studied for piezoelectricity? Is there a superconductor that we have missed in the literature? Is there a new drug that can cure Alzheimer’s?
Ask the word embeddings. They would know.
*This article is the work of the guest author shown above. The guest author is solely responsible for the accuracy and the legality of their content. The content of the article and the views expressed therein are solely those of this author and do not reflect the views of Matmatch or of any present or past employers, academic institutions, professional societies, or organizations the author is currently or was previously affiliated with.