
How to craft a reliable confidence score with random forests for automatically structured healthcare data?



Introduction

A few months ago, we introduced our hybrid automatic prescription structuring system. This system was built to extract a FHIR object model that represents a sentence from a prescription to be automatically imported into a patient’s electronic health record.

For this, we used a combination of a rule-based system and a Named Entity Recognition (NER) model. The latter is a supervised model trained to detect the different pieces of information, such as dosage, frequency, medication names, and other dosage-related information.

While the performance we reached (which has since been improved) was satisfactory, a piece of the puzzle was missing. One omnipresent problem with artificial intelligence is that it will never be 100% accurate. Be it due to noise, out-of-distribution (OOD) examples (examples that are very different from those seen during training), or model bias, no model will be correct every time. One way to deal with this is the philosophy we have implemented in all our models: be as transparent as possible regarding the model output. This helps the health practitioner identify potential issues and correct them. However, when the model output consists of several dozen attributes, doing so is very time-consuming. Thus, we opted for another method: estimating how reliable the model output is.

This task is difficult, first because the concept of reliability is not straightforward to quantify. Second, our model is composed of many moving parts that depend on each other, some deterministic and some stochastic, so we cannot simply combine several model scores to estimate this reliability.

In this article, we first describe our method and the reasoning behind it. Following that, we present how we built this model and what we learned along the way. Finally, we present results, examples, and potential areas for improvement.

How to assess uncertainty to help detect model failure?

The inspiration for the model comes from the article titled Detecting and Mitigating Test-time Failure Risks via Model-agnostic Uncertainty Learning, by Lahoti, Gummadi and Weikum [1].

One very valuable aspect of their method is its model-agnostic nature, which will allow us to evaluate the system’s overall reliability.

Additionally, the method quantifies uncertainty via a risk score encompassing three distinct types:

  • Model uncertainty, which comes from having a model that cannot completely capture the data distribution (for example, using a linear model for a non-linear problem)
  • Aleatoric uncertainty, which occurs due to noise (in our case, mostly spelling mistakes, faulty optical character recognition, or voice recognition errors)
  • Epistemic uncertainty, which refers to the model's lack of knowledge; it mostly arises when a specific example is an outlier with respect to the training dataset, making the model unlikely to be accurate.

Each of these risks can be mitigated, respectively, by switching to another model class or model, by adding pre-processing steps to reduce noise, and by collecting new data samples.

While those actions can help mitigate risks, they cannot eliminate all of them. Thus we need a way to estimate the remaining uncertainties. The authors propose to train another supervised model to predict whether the underlying model will give a correct or incorrect response to a given input. They introduce the idea of a meta-learner to estimate these uncertainties. This meta-learner is an ensemble of models, whose outputs are averaged to give a single prediction.

  • Model uncertainty is estimated directly as the probability of an error predicted by the ensemble, i.e. the final averaged model score
  • Aleatoric uncertainty is estimated by averaging the uncertainty (i.e. the entropy) of each individual model's prediction
  • The total uncertainty is estimated as the entropy of the averaged model output, i.e. by measuring how much the models agree with one another: a high degree of disagreement means high uncertainty. The epistemic uncertainty can then be computed by subtracting the aleatoric uncertainty from the total uncertainty

The sum of these uncertainties is then used to give a risk score to the example.
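To make this decomposition concrete, here is a minimal sketch of how the three terms can be computed from the predictions of an ensemble, assuming each member outputs a probability that the structuring is incorrect. The helper names and example values are illustrative, not our production code.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli distribution with parameter p."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def risk_components(member_probs):
    """Decompose the ensemble's predictions into the three uncertainty terms.

    member_probs: per-member probabilities that the structuring is incorrect.
    """
    expectation_incorrect = member_probs.mean()        # averaged model score
    total_uncertainty = binary_entropy(expectation_incorrect)
    aleatoric = binary_entropy(member_probs).mean()    # mean per-member entropy
    epistemic = total_uncertainty - aleatoric          # disagreement between members
    return expectation_incorrect, aleatoric, epistemic

# Five hypothetical ensemble members that disagree moderately
probs = np.array([0.2, 0.35, 0.5, 0.15, 0.4])
e, a, ep = risk_components(probs)
risk_score = e + a + ep   # plain sum, as described above
```

In this sketch the risk score is the plain sum of the three terms; the weighted combination we actually use is discussed in the evaluation section below.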

In practice

Dataset

As mentioned above, to estimate the uncertainty, we need an annotated dataset consisting of sentences labeled as correct or incorrect, indicating whether the current system is able to correctly extract the corresponding structured posology. For this, we combined two datasets:

  • End-to-end data: sentences that have been annotated with the complete posology structure. Thus we simply need to compare the output of the model with the annotated structure. This has the advantage of being easily updated whenever our system changes.
  • Binary-labeled data: short sentences annotated with binary correct/incorrect labels. This allows the model to more easily learn which features are predictive of (in)correctness.

This results in a dataset of approximately 600 sentences, a fifth of which are used for validation.

Model and features

Now the main questions are: which type of model should we use, and with which features? We quickly ruled out large language models, or even smaller recurrent models, because running multiple instances in parallel would demand a large amount of computational power. Thus, as suggested by Lahoti et al., we use an ensemble of Gradient Boosting classifiers [2], each of which combines regression trees to output a score for the correct/incorrect label. To represent each example, we used a mixture of sentence and posology features.
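As an illustration, a meta-learner of this kind could be assembled as follows. The bootstrap resampling and per-member seeds are our assumptions about how the ensemble members are diversified, not a description of the exact production setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_meta_learner(X_train, y_train, n_members=10):
    """Train an ensemble of gradient-boosted classifiers on the binary
    correct/incorrect labels, diversifying members with different seeds
    and bootstrap resamples (an assumption for this sketch).

    X_train, y_train: NumPy arrays of features and 0/1 labels.
    """
    rng = np.random.default_rng(0)
    members = []
    for seed in range(n_members):
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        clf = GradientBoostingClassifier(random_state=seed)
        clf.fit(X_train[idx], y_train[idx])
        members.append(clf)
    return members

def member_probabilities(members, X):
    """Per-member probability of the 'incorrect' class (assumed to be
    encoded as label 1); returns an array of shape (n_examples, n_members)."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in members])
```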

Sentence features

Firstly, the length of the input in terms of word count (query_length) provides a basic measure of textual complexity. Next, we extract a representation based on the 400 most frequent words seen during training. This allows the model to identify words associated with examples that our system struggles with (which is also useful for improving it), and to associate certain words with other posology features in correct examples. We also compute the proportion of words in the sentence that are out of vocabulary (ratio_ouf_of_vocab).
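A minimal sketch of these sentence features, using a scikit-learn CountVectorizer to build the frequent-word vocabulary; the placeholder training sentences and the exact tokenization are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder training sentences; in practice this is the full training corpus.
train_sentences = ["1 comprimé le matin", "2 gélules par jour pendant 5 jours"]

# Vocabulary restricted to the most frequent words (400 in the article).
vectorizer = CountVectorizer(max_features=400)
vectorizer.fit(train_sentences)
vocab = set(vectorizer.get_feature_names_out())

def sentence_features(sentence):
    """query_length, ratio_ouf_of_vocab and a bag-of-words representation."""
    words = sentence.lower().split()
    query_length = len(words)
    out_of_vocab = sum(w not in vocab for w in words)
    ratio_ouf_of_vocab = out_of_vocab / max(query_length, 1)
    bag_of_words = vectorizer.transform([sentence]).toarray()[0]
    return np.concatenate(([query_length, ratio_ouf_of_vocab], bag_of_words))
```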

Some examples of words with high feature importance are each, nostril, ear, increase, minimum, and first. The first three often appear in sentences where the dose is not correctly structured, such as 1 in each nostril; increase and minimum also occur in sentences where a change or range of dosage is incorrectly structured; finally, first is present in sentences where the posology changes over time.

Posology features

These features are extracted either by other models, such as the classifiers we have developed to filter out irrelevant text, or by the posology structuring system itself. This is a bit different from the method proposed in the article, where the authors rely solely on text features. We argue that this approach helps estimate the model uncertainty, i.e. identify which inputs our system systematically struggles with.

First, we use the scores of our binary classifiers, which measure medical and posology relevance (is_medical_score and is_posology_score). We also compute the proportion of words used by the system to structure the posology (posology_string_ratio). This feature, along with the length of the string, has the highest impact on the model's risk estimate, as it indicates alignment with established posology rules. We can also study the importance of words that appear outside the structured posology, which allows us to identify information the model tends to miss, such as dose units with spelling errors. For example, the OCR might detect cm instead of cp (the abbreviation for tablet, comprimé).

We also look at the proportion of words outside the 400 most frequent words, which helps identify specialized or rare terms and spelling errors.

Flags indicating whether a dose unit or dose value has been detected are also valuable, as they show whether the model was able to identify crucial information (N.B. some correctly structured posologies have no dose unit, e.g. 1 in the morning).
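Put together, the posology-side features might be assembled like this; the structuring output's field names below are purely illustrative placeholders, not the real schema.

```python
def posology_features(sentence, structuring_result, is_medical_score, is_posology_score):
    """Assemble the posology-side features described above.

    `structuring_result` stands in for the output of the structuring system;
    its field names here are hypothetical.
    """
    words = sentence.split()
    used = structuring_result.get("used_words", [])  # words consumed by the structuring rules
    return {
        "is_medical_score": is_medical_score,
        "is_posology_score": is_posology_score,
        "posology_string_ratio": len(used) / max(len(words), 1),
        "has_dose_unit": int(structuring_result.get("dose_unit") is not None),
        "has_dose_value": int(structuring_result.get("dose_value") is not None),
    }

# Example with a hypothetical structuring output
features = posology_features(
    "1 cp le matin",
    {"used_words": ["1", "cp", "le", "matin"], "dose_unit": "cp", "dose_value": 1},
    is_medical_score=0.98,
    is_posology_score=0.95,
)
```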

Feature exploration

In this figure, we explore the distribution of features for correctly and incorrectly structured sentences. Each sub-graph on the diagonal shows the distribution of one feature for both classes (note that the curves are smoothed, which unfortunately causes some distributions to extend beyond their real bounds), while the sub-graphs off the diagonal show the relationship between each pair of features.
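A figure of this kind can be produced with a seaborn pairplot; the synthetic data below is only a stand-in to show the plotting call, not our actual feature table.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the real feature table: one row per sentence,
# with a few of the features described above and the binary "correct" label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "query_length": rng.integers(2, 40, size=200),
    "ratio_ouf_of_vocab": rng.uniform(0, 1, size=200),
    "posology_string_ratio": rng.uniform(0, 1, size=200),
    "correct": rng.integers(0, 2, size=200),
})

# KDE curves on the diagonal explain why some distributions appear to
# spill outside their real bounds, as noted above.
sns.pairplot(df, hue="correct", diag_kind="kde")
```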

We first notice a threshold of 30 words for query_length beyond which no structuring is correct (although such long sentences are rare). This is often the case for complicated alternating posologies or intricate instructions, which the system does not handle well. We can also identify thresholds beyond which correct examples become less frequent than incorrect ones, such as ratio_ouf_of_vocab > 0.75 and posology_string_ratio < 0.5.

On the other hand, when the whole sentence has been used, there is a very high chance that the posology is correct, as we can see in the posology_string_ratio bivariate graphs. However, it is hard to extract a clear relationship between correctness and any single pair of variables.

How does it perform?

To evaluate our model, we train 100 models with different random seeds, which allows us to measure the variability of the model and of the target metrics. The dataset is split with 1/10th of the examples used for validation.

As in [1], we use the area under the ROC curve [3]. The ROC curve plots the true positive rate against the false positive rate. The area under the curve is thus maximal when the true positive rate reaches 1 at a false positive rate of 0, and it equals 0.5 when the classifier is no better than random chance. The figure below shows the ROC curves of the three terms of the risk score and of the risk score itself. We can see that the expectation_incorrect score, i.e. the score of the model ensemble, usually has the highest value, except at the lowest false positive rates.

The figure below shows the ROC AUC for each uncertainty term and for the risk score. The score expectation, computed by averaging the score of the incorrect class over the models, has a significantly higher AUC than the aleatoric and epistemic uncertainties. This led us to increase the coefficient of the expectation_incorrect term to 2 when averaging the three scores, keeping the other coefficients at 1.

With this configuration, we reach a mean AUC of 0.80. This could be made slightly higher by further increasing the coefficient of the expectation term, but doing so would significantly reduce the contribution of the other two terms, which have some interesting properties.
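As a sketch, the weighted combination and its evaluation could look as follows; the arrays are placeholders, with 1 marking an incorrectly structured sentence.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def risk_score(expectation_incorrect, aleatoric, epistemic, w=2.0):
    """Weighted average of the three terms; the expectation term gets
    weight 2 and the other two keep weight 1, as described above."""
    return (w * expectation_incorrect + aleatoric + epistemic) / (w + 2.0)

# Illustrative evaluation on held-out data (placeholder values)
y_true = np.array([0, 1, 0, 1, 1])  # 1 = incorrectly structured
scores = risk_score(np.array([0.1, 0.8, 0.3, 0.6, 0.9]),
                    np.array([0.2, 0.5, 0.3, 0.4, 0.4]),
                    np.array([0.0, 0.2, 0.1, 0.1, 0.3]))
print(roc_auc_score(y_true, scores))
```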

By studying the incorrect examples, we notice that the aleatoric uncertainty is high when there are spelling mistakes, which confirms the intuition behind this score. A common type of spelling mistake that often leads to an error occurs with dose units, which are usually abbreviated, for example 1 p le matin (1 tablet in the morning, where the abbreviation for tablet, cp, is missing a character).

Finally, this gives us the following distribution of reliability scores.

We notice that most negative examples fall below the 0.45 break point, and every example with a reliability below this threshold has a greater than 50% chance of being incorrect. However, some negative examples remain with very high reliability. These examples are hard to differentiate from correct examples with the feature set we have devised.

For example when:

  • there are two posologies in the same input, and one attribute (as_needed) is not attached to the correct one. 1 le soir durant la prise d'antiinflammatoire qsp 14 j à renouveler selon besoin DOLENIO: 1 sachet /jour qsp 3 mois (1 in the evening while taking anti-inflammatory for 14 days, repeat as needed DOLENIO: 1 sachet/day for 3 months)
  • information that does not fit into the FHIR format e.g. rappel à 4 mois (booster at 4 months)
  • spelling mistakes or OCR errors in the dose unit, which can lead to the unit being omitted in the result
Thankfully, in most of these examples, the result is missing information rather than erroneous information, making these errors easier to identify.

Above the 0.8 threshold, which is the threshold we selected for a green mark, there is a 0.917 probability that an example is correct. However, the dataset was built to contain more varied and more frequent incorrect examples than a real-world distribution. Using our system's evaluation dataset of 300 fully annotated sentences, we estimate the true probability that an example is incorrect to be 12%. Applying Bayes' theorem, and given that 65.6% of examples have a reliability higher than 0.8, this results in a 1.5% chance that an example with reliability above 80% is incorrect. This way, we can notify the user that an output with a green label can confidently be considered correct.
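The arithmetic behind this estimate can be sketched as follows, under the assumption that the rate at which incorrect examples pass the 0.8 threshold is taken from the curated dataset (1 − 0.917).

```python
# Sketch of the Bayes computation described above.
p_incorrect = 0.12                      # estimated real-world rate of incorrect structurings
p_reliable = 0.656                      # fraction of examples with reliability > 0.8
p_reliable_given_incorrect = 1 - 0.917  # assumption: taken from the curated dataset

p_incorrect_given_reliable = (p_reliable_given_incorrect * p_incorrect) / p_reliable
print(f"{p_incorrect_given_reliable:.3f}")  # ~0.015, i.e. about 1.5%
```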

Conclusion and perspectives

One improvement that could further boost the model would be to characterize the error types. Indeed, different errors have different degrees of severity, and one might want to target a given type. For example, extracting a wrong dose value is a very detrimental error, as opposed to not detecting that the drug should be taken as needed. By training the model to characterize the error in this way, it might be easier to learn which features are linked to errors in which case.

Furthermore, it is worth noting that the aleatoric and epistemic terms perform poorly compared to the ensemble model score, which once again illustrates the power of model ensembles.

By training a supervised ensemble of models to predict whether a structured posology is correct, we were able to provide a reliability score with very high accuracy. This both boosts confidence in the structuring model and allows the user to prioritize verification of instructions that are likely to be incorrect.

This also allowed us to target some weaknesses of the underlying model by identifying what features had the highest importance in the final reliability score.

Building this model involved continuous effort in deciding what data to use, selecting features, choosing the model type, and establishing evaluation criteria, and we hope that some of these insights will prove useful to you as well.

References

  1. P. Lahoti, K. P. Gummadi, G. Weikum. Detecting and Mitigating Test-time Failure Risks via Model-agnostic Uncertainty Learning. 21st IEEE International Conference on Data Mining (ICDM), 2021.
  2. scikit-learn documentation: GradientBoostingClassifier.
  3. scikit-learn documentation: Multiclass Receiver Operating Characteristic (ROC).
  4. Wikipedia: Bayes' theorem.
  5. Wikipedia: Uncertainty quantification.

Goulven de Pontbriand
Head of Growth & Marketing
