Building a medical search engine — Step 1: medical word embeddings
Summary
When treating a patient, physicians have to choose the most suitable treatment among 3,600 different molecules. This requires going through long product descriptions to identify, for instance, the right drug for a given diagnosis that is compatible with the patient’s profile.
Much of this information is publicly available on the French Public Drug Database or the European Medicines Agency, but finding the relevant piece of information is a time-consuming task.
One of the major challenges in finding the correct piece of information given a natural-language query is accurately identifying the specific named entity, or set of named entities, in the query. This entails detecting dosages, drugs and diseases, and linking each of them to a unique identifier. This is exactly the job of a Named Entity Recognition and Linking (NERL) model! So this is what we will dive into here.
But first, let’s go over the peculiar challenges of building a Machine Learning model based on French medical texts!
Why use Machine Learning?
The main reason to implement Machine Learning models over rule-based systems is their ability to generalize to situations that you have not encountered or envisioned. In our case, queries can come in varied forms and diseases may be mentioned in a variety of manners.
For example, given the query ulcère veineux (venous ulcer), the model should recognize Varices ulcérées (ulcerated varicose veins), and it should propose céphalée when asked for mal de tête (headache). Spelling mistakes and diverse phrasings for a given entity make the space of possibilities too large to rely solely on manually defined rules or lists.
The trained model must meet the criteria listed below:
- Trainable with a low amount of text data
- Fast inference: <100 ms
- Able to detect several types of entities
- Able to link a detected named entity to a specific entity among several thousand entities
- Able to detect previously unseen entities
To meet these criteria, we selected a fairly standard Machine Learning pipeline (with a few twists), the first part of which translates words into vectors so that they can be processed by the Named Entity Recognizer.
Training medical word embeddings. Not in English
One major challenge in designing the NERL model was the choice of word representation. The rest of this post will focus on the choice we ultimately settled on, given the following aspects and constraints:
- Relatively low amount of freely available medical text data (in French). Training recent language models (commonly called transformer-based models) requires a very large amount of text (upwards of 10 GB). Fine-tuning, i.e. adapting a model pre-trained on a general domain (such as Wikipedia, news reports and so on) to a different domain, would nonetheless be required, because the biomedical domain has a highly specific vocabulary. Our attempts at fine-tuning a pre-trained transformer-based model have not been fruitful, possibly due to too little text data, but this remains a direction for further improvement.
- Crucial morphological information. In the medical domain, the approximate meaning of a word can frequently be inferred from its subwords. This is especially useful for rare words (with few occurrences), because word embeddings otherwise require many mentions of a word to be trained reliably. For example, the words ‘hypothyroïdie’ and ‘thyroïdite’ share the subword ‘thyroïd’, which indicates that they are diseases related to the same body part, the thyroid.
Many more examples among diseases (‘encéphalopathie’ and ‘céphalée’) and drug classes (‘déméclocycline’ and ‘chlortétracycline’) show that subwords carry very significant information, which can be used to learn useful representations for rare words from their morphological similarity to other, more frequent words.
These two important aspects of the medical domain led us to choose the FastText model [Bojanowski et al., 2016].
The FastText model
This model is an evolution of the Word2Vec model introduced in [Mikolov et al., 2013], which learns embeddings by iterating over each sentence in a corpus and learning to predict the context of each word. Words that are semantically similar and appear in similar contexts end up in a close region of the embedding space, which helps the NERL model detect the different types of entities.
With the Word2Vec model, each word is associated with a vector. FastText builds on this method by introducing a vector for each encountered n-gram. An n-gram is a sequence of n characters in a word, e.g. the word coeur (heart) is decomposed into the 3-grams ‘<co’, ‘coe’, ‘oeu’, ‘eur’, ‘ur>’. The ‘<’ and ‘>’ characters are added to identify n-grams that start and end a word.
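To make this concrete, here is a small Python sketch of that decomposition. The char_ngrams helper is purely illustrative (it is not part of any library), and it only extracts 3-grams, whereas FastText uses several n-gram lengths (3 to 6 by default):

```python
# Illustrative sketch of FastText-style character n-gram extraction,
# with '<' and '>' boundary markers, applied to the examples above.
def char_ngrams(word: str, n: int = 3) -> list[str]:
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("coeur"))
# ['<co', 'coe', 'oeu', 'eur', 'ur>']

# Shared subwords are what pull morphologically related terms together:
shared = set(char_ngrams("hypothyroïdie")) & set(char_ngrams("thyroïdite"))
print(shared)
# {'thy', 'hyr', 'yro', 'roï', 'oïd', 'ïdi'} (set order may vary)
```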
This way, representations of words sharing several n-grams will be close in the embedding space, and we can thus infer that they are probably semantically similar. Furthermore, this makes the model robust to spelling mistakes.
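As an illustration, here is a hedged sketch using gensim's Word2Vec and FastText implementations. The toy corpus and hyperparameters are only placeholders (the real models are trained on our full French medical corpus); the point is that FastText still produces a vector for a misspelled, never-seen word, while Word2Vec cannot:

```python
# Toy comparison of Word2Vec and FastText on an out-of-vocabulary word.
from gensim.models import FastText, Word2Vec

corpus = [
    ["insuffisance", "renale", "aigue"],
    ["insuffisance", "cardiaque", "chronique"],
    ["hypothyroidie", "traitement", "levothyroxine"],
] * 50  # tiny illustrative corpus, not our actual training data

w2v = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, sg=1, epochs=10)
ft = FastText(sentences=corpus, vector_size=32, window=3, min_count=1,
              min_n=3, max_n=6, sg=1, epochs=10)

query = "insufisance"  # misspelled, never seen during training

print(query in w2v.wv.key_to_index)       # False: Word2Vec has no vector for it
vec = ft.wv[query]                        # FastText builds one from character n-grams
print(vec.shape)                          # (32,)
print(ft.wv.most_similar(query, topn=1))  # on a real corpus, 'insuffisance' would rank first
```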
Finally, in the interest of brevity, we only briefly mention several other important steps of the pipeline, such as:
- Tokenization, which splits the input character sequence into a sequence of tokens
- Lemmatization, which decreases noise and the number of different words
- Phrasing, which enables learning more accurate representations for sets of words that often occur together, such as insuffisance rénale (kidney failure), following [Mikolov et al., 2013]
- Ontology sequence generation, which complements the text corpus with sequences of similar entities, generated with random walks on medical ontologies such as ICD-10 (International Classification of Diseases). The representations of entities missing from the text corpus can then be learnt from these sequences, while the knowledge of these ontologies is integrated into the word vectors. This is inspired by [Zhang et al., 2019] and sketched below
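Here is a minimal sketch of that last step, assuming a tiny hand-written disease hierarchy; the node names below are illustrative and do not come from the actual ICD-10 data:

```python
# Ontology sequence generation: random walks over a toy hierarchy produce
# pseudo-sentences that are appended to the text corpus before training.
import random

# parent -> children edges of a toy (hypothetical) hierarchy
children = {
    "maladie_renale": ["insuffisance_renal", "nephropathie_diabetique"],
    "insuffisance_renal": ["insuffisance_renal_aigu", "insuffisance_renal_chronique"],
}

# undirected neighbour map, so walks can move both up and down the hierarchy
neighbours: dict[str, list[str]] = {}
for parent, kids in children.items():
    for kid in kids:
        neighbours.setdefault(parent, []).append(kid)
        neighbours.setdefault(kid, []).append(parent)

def random_walk(start: str, length: int = 5) -> list[str]:
    """One pseudo-sentence: a random walk of `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        walk.append(random.choice(neighbours[walk[-1]]))
    return walk

random.seed(0)
ontology_sentences = [random_walk(node) for node in neighbours for _ in range(3)]
print(ontology_sentences[0])
# e.g. ['maladie_renale', 'insuffisance_renal', 'insuffisance_renal_aigu', ...]
# These sequences are concatenated with the tokenized text corpus before FastText training.
```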
The encoding pipeline in action
Here is a short example of how the query “contre-indication à l’insuffisance rénale” (contraindication in kidney failure) is processed in this first step, from a sequence of characters into a sequence of vectors:
- Tokenized: contre-indication, a, l’, insuffisance, renale
- Lemmatized: contre-indication, a, le, insuffisance, renal
- Phrased: contre-indication, a, le, insuffisance_renal
- Encoding: [v_0, v_1, v_2, v_3] where v_i is the word embedding of token i
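Putting it all together, here is a minimal, self-contained sketch of this encoding step on the example query. The lemma table, the phrase list and the toy training corpus are illustrative stand-ins: the real pipeline relies on a proper French lemmatizer, on gensim's Phrases model for phrasing, and on a FastText model trained on our full medical corpus:

```python
# End-to-end sketch of the encoding pipeline: tokenize -> lemmatize -> phrase -> encode.
import re
import unicodedata

from gensim.models import FastText

def normalize(text: str) -> str:
    """Lowercase and strip accents (é -> e), as in the lemmatized example above."""
    text = text.lower().replace("’", "'")
    text = unicodedata.normalize("NFD", text)
    return "".join(c for c in text if not unicodedata.combining(c))

def tokenize(text: str) -> list[str]:
    """Keep hyphenated words and split off elided articles such as l'."""
    return re.findall(r"[a-z]+'|[a-z]+(?:-[a-z]+)*", normalize(text))

# Toy lemma table standing in for a real French lemmatizer.
LEMMAS = {"l'": "le", "renale": "renal"}

def lemmatize(tokens: list[str]) -> list[str]:
    return [LEMMAS.get(t, t) for t in tokens]

# Multi-word terms to merge; learned automatically with gensim's Phrases in the real pipeline.
PHRASES = {("insuffisance", "renal")}

def phrase(tokens: list[str]) -> list[str]:
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Tiny toy corpus; the real FastText model is trained on the full medical corpus.
corpus = [
    ["contre-indication", "le", "insuffisance_renal"],
    ["insuffisance_renal", "aigu", "dialyse"],
    ["cephalee", "paracetamol", "contre-indication"],
] * 20
ft = FastText(sentences=corpus, vector_size=32, window=3, min_count=1, sg=1, epochs=10)

query = "contre-indication à l’insuffisance rénale"
tokens = phrase(lemmatize(tokenize(query)))
print(tokens)  # ['contre-indication', 'a', 'le', 'insuffisance_renal']

vectors = [ft.wv[t] for t in tokens]   # v_0 .. v_3, one vector per token (OOV handled via n-grams)
print(len(vectors), vectors[0].shape)  # 4 (32,)
```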
In this part, we saw that FastText only requires a moderate amount of unlabeled text data and computing power, which makes it a very good candidate, while being well adapted to the specificities of the medical domain. It is the heart of the encoding pipeline, as the resulting word vectors are the cornerstone of the ML pipeline, although the additional steps listed above are needed to make the model most effective. These vectors will be used by the NERL model to detect entities, recognize their type and determine which entity they refer to. All of this will be described in a post coming soon!
If you have any feedback or questions regarding this first step of the pipeline, our corpus or anything else, don’t hesitate to contact me :)
References
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
- Garcia-Albornoz, M., Nielsen, J. (2015). Finding directionality and gene-disease predictions in disease associations. BMC Syst Biol 9, 35
- Zhang, Y., Chen, Q., Yang, Z., Lin, H. & Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci Data 6, 52