
· 10 min read
Vimal Menon

No matter how accurate your trained model is, it won't be useful to the business if it performs poorly or degrades in production. In this article, we share a few techniques to mitigate performance degradation.

Why NER models underperform in production
Key reasons for NER model performance degradation

Poor data quality

Good-quality data is a known precondition for a good-quality model. However, many projects lack a comprehensive data quality analysis. This is partly because there is no standardized approach to data quality analysis; each dataset needs a different approach to determine its quality.

For a NER project, let us look at a few easily fixable issues that improve the quality of the data and, in turn, the model performance.

Too much and/or irrelevant data

Almost all NER projects are very data-centric. The idea behind data-centric AI is to train models on good data that is sized appropriately, as opposed to big data.

As a starting point, a small set of relevant data with a good distribution covering the business use case should be curated. This should then be labeled consistently. An off-the-shelf model library (e.g., Flair, AllenNLP, or spaCy) can be used as an experiment to train a model on the dataset. Additional data can then be introduced by studying the training and the model fit.

Feeding too much data can introduce too much noise (due to irrelevant data) as well as labeling inconsistencies in the data. This is one of the key issues we noticed when investigating the root cause of degraded model performance in production.

It is therefore important to curate a bare-minimum amount of data that satisfies the business needs. In the beginning, training on too much data may be counterproductive.

Biased towards certain entities

It is very common in a NER project for the dataset to be biased towards certain entities more than others. E.g., we can observe in the CoNLL-2003 dataset that the entities PER and ORG have 11416 and 10153 occurrences in the training data, whereas LOC has comparatively fewer at 8643 and MISC has only 4640 occurrences.

We will be using Acharya Community Edition to analyze the CoNLL-2003 dataset (see the Acharya Git repo). Acharya can import IOB-style records and gives data quality insights about the dataset via its dashboard.

Entity distribution in Acharya
Screenshot of Acharya dashboard with Conll2003 dataset uploaded showing Entity Distribution.

Here we shouldn't be surprised if the trained model demonstrates better performance for PER and ORG as compared to MISC.
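Outside of a dashboard, the same kind of entity distribution can be approximated with a few lines of Python. The sketch below is only an illustration: the function name and the record layout (a list of `(token, tag)` pairs per record) are assumptions, not Acharya's API, and it counts entity mentions, which may differ from how the dashboard counts occurrences.

```python
from collections import Counter


def entity_distribution(records):
    """Count entity mentions per entity type in IOB-tagged records.

    Each record is a list of (token, tag) pairs (an assumed layout).
    A mention starts at a B- tag, or at an I- tag whose type differs
    from the previous token's type, so both IOB1 and IOB2 files work.
    """
    counts = Counter()
    for record in records:
        prev_type = None
        for _token, tag in record:
            if tag == "O":
                prev_type = None
                continue
            prefix, _, entity_type = tag.partition("-")
            if prefix == "B" or entity_type != prev_type:
                counts[entity_type] += 1
            prev_type = entity_type
    return counts


# Tiny illustrative sample, not real CoNLL-2003 data.
sample = [
    [("Sheffield", "B-ORG"), ("Wednesday", "I-ORG"), ("beat", "O"), ("Leeds", "B-ORG")],
    [("Japan", "I-LOC"), ("won", "O")],
]
distribution = entity_distribution(sample)
total = sum(distribution.values())
for entity_type, count in distribution.most_common():
    print(f"{entity_type}: {count} ({count / total:.0%})")
```

A large gap between the most and least frequent entity types is a hint about which kind of data to source next.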

Outdated data

When your model has been in production for a while, the underlying data used to train may become outdated. You must plan to introduce new training data periodically. Production data can deviate based on wide-ranging factors - from major world events like a pandemic or a war to a new popular treatment/medicine (in case of clinical NER) or even a new law/regulation; many of these could easily impact the performance of the model. In these cases, the training data will need to be updated to make the model perform on the current trends.

Annotation Errors

Annotation errors are another key reason for a degraded NER performance. Even if you have a well-performing model running in production, it is important to check for annotation errors and fix them in the training and test data. These annotation errors can reveal why the model is underperforming for certain real-world data even though it has scored well in the test/evaluation dataset. The most common annotation errors include:

Wrong Classification

Wrong classifications heavily impact the neural weights of the final model. E.g., in the CoNLL-2003 dataset, we see that the article "The" has been wrongly classified as LOC and, in another instance, the word "American" has been wrongly classified as MISC.

We also spot that "Wednesday" has been classified as ORG, but that is a genuine case where the news is reporting about the team "Sheffield Wednesday".

Identifying wrong classifications with acharya
Two screenshots showing wrong classifications identified using the Missed classifications list of the CoNLL-2003 dataset.

Missed Classifications

Missed classifications refer to words that have been classified once in the dataset and missed at other locations of the dataset. Missed classifications also impact the neural weights of the model. It is important to identify mistakes where the annotator might have either missed classifying the word or might have wrongly classified that word.

In the CoNLL-2003 dataset, we can see many such missed classifications. To highlight a few: "NYMEX" has been classified in 2 records but missed in 2 records. Similarly, "Pakistan" has been missed in 1 record and "NATO" in 1 record, which is in the test dataset (impacting the score of a correct model).
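A rough version of this check can be sketched as follows. This is an assumed, simplified approach and not how Acharya implements it: it works on single tokens only, treats any non-`O` tag as "classified", and uses a hypothetical `(token, tag)` record layout.

```python
from collections import defaultdict


def missed_classifications(records):
    """Find words tagged as an entity in some records but left as O elsewhere.

    records: list of records, each a list of (token, tag) pairs (assumed layout).
    Returns {word: {"classified": [record ids], "missed": [record ids]}}.
    """
    classified = defaultdict(set)  # word -> record ids where it carries an entity tag
    unlabeled = defaultdict(set)   # word -> record ids where it is tagged O
    for record_id, record in enumerate(records):
        for token, tag in record:
            target = unlabeled if tag == "O" else classified
            target[token].add(record_id)

    report = {}
    for word, hit_ids in classified.items():
        missed_ids = unlabeled.get(word)
        if missed_ids:
            report[word] = {
                "classified": sorted(hit_ids),
                "missed": sorted(missed_ids),
            }
    return report


# Illustrative records only; the tags here are made up for the example.
records = [
    [("NYMEX", "B-ORG"), ("crude", "O")],
    [("NYMEX", "O"), ("rose", "O")],
]
print(missed_classifications(records))
```

Every hit still needs a human review, since a word can legitimately be an entity in one context and plain text in another.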

Entity distribution in Acharya
"Missed classifications" highlights some of the many annotations missed at certain places of the dataset.

Annotator errors

We already discussed the missed classification and wrong classification errors that annotators can make. When multiple annotators work on the same dataset, it is also very important to establish a common understanding between the annotators and have annotation guidelines laid out before they start working. This would result in much more consistent labeling.

Another source of annotator errors is accidental mistakes like mouse select/marking errors. E.g., in the CoNLL-2003 dataset, we notice that two annotations of the PER entity begin with a "."

Identifying annotation anomalies with acharya
Anomalies in annotations of the Conll2003 dataset.

Here we see that the "Guiseppe Citterio" and "Robbie McEwen" annotations begin with a ".", which is an annotator error. When we open record number 640 in Acharya, we see that the data of the record also needs some editing.

Fixing annotation errors using workbench
Editing record number 640 in acharya workbench to fix data text and annotation issues.

The text is inconsistent for entries 8 and 9, where there is a space between the number and the ".", and that is probably why the annotator error occurred. The data text needs to be edited to fix the issue. Here the data text is edited in the Acharya workbench to fix the "space" issue, and the annotation errors are fixed.
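Selection slips like these are easy to flag mechanically. The following is a minimal sketch under assumed inputs: annotations are given as `(record_id, text, label)` tuples, and the two rules (leading punctuation, surrounding whitespace) are illustrative heuristics rather than Acharya's actual anomaly logic.

```python
import string


def find_anomalies(annotations):
    """Flag annotation spans that look like mouse-selection mistakes.

    annotations: iterable of (record_id, text, label) tuples (assumed layout).
    Returns a list of (record_id, text, label, reason) tuples.
    """
    suspicious = []
    for record_id, text, label in annotations:
        if text != text.strip():
            suspicious.append((record_id, text, label, "surrounding whitespace"))
        elif text and text[0] in string.punctuation:
            suspicious.append((record_id, text, label, "leading punctuation"))
    return suspicious


# Illustrative spans modeled on the error described above.
annotations = [
    (640, ". Robbie McEwen", "PER"),
    (641, "Robbie McEwen", "PER"),
]
for row in find_anomalies(annotations):
    print(row)
```

Trailing punctuation is deliberately not flagged here, because names such as "Inc." or "U.S." end with a period legitimately.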

Multi classification

Annotating the same word, in the same context, with different entities in different places impacts the neural weights in the final model and the model's ability to confidently classify the word when it is seen in production data. With words like apple or CAD, it is difficult to avoid multi-classification since the right entity depends on the usage context. This is a solved problem with today's advanced language models. Multi-classification of the same word in the same context, however, does tend to impact model performance. E.g., in the CoNLL-2003 dataset, we see that "China" has been classified as PER in one occurrence and as LOC in 149 other occurrences.

Identify multiple classification of annotations in acharya
Multi-classification of China where it has been classified as LOC in 149 instances but as a PER in 1 instance.

It could be argued that "China" can indeed be the name of a person and the model should be able to handle it. To confirm the context in which the word "China" occurs, we check the actual data.

Workbench in acharya showing annotations
Workbench screen of record number 940 showing China labeled as a PER.

In the first line itself, we see that the article is about soccer news where "Japan" has been correctly marked as "LOC" and the context of the word "China" should be "LOC" instead of a "PER", thus this is a clear annotator error. In the same record, other occurrences of "China" have been correctly labeled as "LOC".
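A check for this kind of multi-classification can be sketched in a few lines. The function name, the `(text, label)` input layout, and the 5% threshold below are all assumptions chosen for illustration; the idea is simply to list words with more than one label and highlight the labels that are rare for that word.

```python
from collections import Counter, defaultdict


def multi_classified(annotations, max_minority_share=0.05):
    """List words annotated with more than one entity label.

    annotations: iterable of (text, label) pairs (assumed layout).
    Labels whose share for a word is at most max_minority_share are
    reported as "rare_labels", since they are the likeliest mistakes.
    """
    labels_by_text = defaultdict(Counter)
    for text, label in annotations:
        labels_by_text[text][label] += 1

    flagged = {}
    for text, counts in labels_by_text.items():
        if len(counts) < 2:
            continue
        total = sum(counts.values())
        rare = sorted(
            label for label, n in counts.items() if n / total <= max_minority_share
        )
        flagged[text] = {"counts": dict(counts), "rare_labels": rare}
    return flagged


# Counts for "China" mirror the example above; "Japan" is filler.
annotations = [("China", "LOC")] * 149 + [("China", "PER")] + [("Japan", "LOC")] * 3
print(multi_classified(annotations))
```

The flagged records still have to be opened and read, as done above for record 940, because a rare label is a hint and not proof of an error.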

Incomplete language model

Identifying and choosing the appropriate language model is another important criterion. The latest state-of-the-art language model available might not be suitable for your business needs. Often, depending on the domain, there could be multiple language models available. Experiments should be run to compare the models on an even slate to determine the apt language model for the use case. E.g., BioBERT and SciBERT are similar language models and either may be a great starting point; however, your specific data might perform better with one than the other.

Often, for a more domain-specific use case, a fine-tuned language model would perform much better than a generic language model. Certain large models are difficult to fine-tune; in such a case, experiments should be performed with older models like GloVe, word2vec, or ELMo fine-tuned with your domain-specific data.

Absence of a Data-centric MLOps pipeline

CI/CD stands for Continuous Integration/Continuous Deployment. The terms refer to a software engineering best practice followed by many organizations today. In broad strokes, CI refers to the practice of keeping test code in sync with new code and testing it immediately in an automated manner to identify bugs earlier in the cycle. This often requires the engineering teams to write test cases that exercise their code in conformance with technical and functional requirements (including security, performance, load, etc.). The practice also involves the use of other quality-of-life software/tools to help run the tests as soon as code is committed. CD or Continuous Deployment refers to the practice of deploying the code onto production once all the tests pass to an acceptable level. This enables organizations to get features into the hands of customers earlier.

Likewise, ML models are hardly ship-once and forget. As the data evolves, so should the models. The model should be monitored for real-world performance and newer relevant data should be curated and added to the training dataset. Based on the use case, data received in production can also be de-identified and added to the dataset to complete the learning loop.

This new data may need to be annotated and should be compared with the production model's output to gauge the accuracy of the production model with the new data, helping determine the performance of the model in production.

Simple data-centric mlops pipeline
A simple data-centric MLOps pipeline schematic for a NER project.

This is depicted in the simplified schematic above. The central part of the flow is curated data from all the enterprise data sources getting annotated and this annotated data getting used for training and testing of models. The notion of an experiment refers to a combination of algorithms, language models, and their respective hyperparameters. The key insight here is that all the experiments, including the model currently in production, are being evaluated/scored with updated data from production. Such a process helps teams identify the best-suited model for production.

Continuous training and continuous testing should be established as a pipeline.
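The scoring step of such a pipeline can be sketched as below. This is a minimal illustration, not the pipeline in the schematic: the experiment names, the span layout `(record_id, start, end, label)`, and the choice of exact-match F1 as the ranking metric are all assumptions.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def rank_experiments(gold, predictions_by_experiment):
    """Score every experiment on the same updated data, best F1 first."""
    scores = {
        name: prf1(gold, predicted)
        for name, predicted in predictions_by_experiment.items()
    }
    return sorted(scores.items(), key=lambda item: item[1][2], reverse=True)


# Newly annotated production records: (record_id, start, end, label).
gold = {(1, 0, 5, "LOC"), (1, 10, 15, "PER")}
predictions = {
    "production": {(1, 0, 5, "LOC")},
    "candidate": {(1, 0, 5, "LOC"), (1, 10, 15, "PER")},
}
for name, (precision, recall, f1) in rank_experiments(gold, predictions):
    print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the production model is scored alongside the candidates on the same fresh data, a drop in its score and a better replacement show up in the same report.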

Summing up

In our conversations with many teams, we have found that the points mentioned above have brought about significant improvement in model performance. Bringing in software engineering best practices benefits data science teams too. A well-setup MLOps pipeline not only frees the team from integration hassles with engineering and increases flexibility, but it also brings the ability to track what went wrong when there is a drop in performance with a point-in-time snapshot of data.

As more and more Data science teams adopt innovative practices, there is a need to share what works best in the community to take the practice forward. We look forward to hearing from you about your best practices.

· 3 min read
Vimal Menon

Continuing the tour of data-centric features in Acharya that we began in Part 1, in Part 2 we explore more features that can increase your efficiency as an annotator and potentially improve your ML models. Let's say you have trained your algorithm on your dataset three times. Please refer to the Acharya documentation for details on how to start a training of your configured algorithm on your project dataset.

Let's say you have configured 3 models for training on your dataset. Please refer to the Acharya documentation for details on how to configure your models for training in Acharya.

Training details

In the screenshot above, the left side pane shows the three trainings. Trn3 is selected, which expands the details on the right side pane. You can see the algorithm, the time it took, its status and its Score.

Training algorithm details

As in the screenshot above, you can view the details of the training run by expanding the highlighted arrow (red square box). In the Algo Details tab, you can see the Precision, Recall and the F1 score. Do note the Git commit ID. For an MLOps tool, we believe that Git support is a key requirement and like engineering code, even ML code should be version controlled in a mature MLOps implementation.

Switching to the Reports tab, you can drill down further to the entity-level performance of Trn3.

Training Entity scores

You can also see the Per Record scores by expanding the 'Per Record scores' section.

Training Per record scores

This helps you identify how well the trained model performs on the individual records marked for evaluation. In this view, you should also peruse columns like Precision, Recall, False Negative, True Positive, etc. to find annotation errors, and then fix those annotations in the evaluation records. This can have an impact on the score. We have seen incorrectly annotated data being skipped by the model. When we fixed an annotation that this view highlighted to us, the model was able to classify the validation data appropriately, increasing the score of the model.

Likewise, you can go back to see all the Trainings, choose two runs to compare their details, and even see the best-performing training run to identify the potential production candidate. This comparison will show you a diff of the algorithm code as well as the data (please note: in Acharya Community Edition, this comparison has to be executed using the command-line interface).
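For readers who want to reproduce a per-record view outside the tool, here is a hedged sketch. The input layout (dictionaries mapping a record id to a set of `(start, end, label)` spans) and the column names are assumptions for illustration, not Acharya's report format.

```python
def per_record_scores(gold_by_record, pred_by_record):
    """Compute TP/FP/FN, precision and recall for each evaluation record.

    Both arguments map record_id -> set of (start, end, label) spans
    (an assumed layout). Rows with the most disagreements come first,
    because those records are the best candidates for annotation review.
    """
    rows = []
    for record_id in sorted(set(gold_by_record) | set(pred_by_record)):
        gold = set(gold_by_record.get(record_id, ()))
        pred = set(pred_by_record.get(record_id, ()))
        tp = len(gold & pred)
        fp = len(pred - gold)
        fn = len(gold - pred)
        rows.append({
            "record_id": record_id,
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    rows.sort(key=lambda row: row["fp"] + row["fn"], reverse=True)
    return rows


# Illustrative spans: record 2 has one label disagreement.
gold = {1: {(0, 5, "LOC")}, 2: {(0, 5, "LOC"), (9, 14, "PER")}}
pred = {1: {(0, 5, "LOC")}, 2: {(0, 5, "LOC"), (9, 14, "LOC")}}
for row in per_record_scores(gold, pred):
    print(row)
```

A disagreement does not say who is wrong. As noted above, sometimes the model is right and the evaluation annotation is the mistake.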

Reviewing all the Training Runs in Acharya

The ease with which training experiments can be executed is a key feature of Acharya. With the insights into how the trained model performed on individual evaluation records, combined with the other data-centric features discussed in Part 1, tuning the model by tweaking the data becomes much simpler with Acharya. This helps improve the efficiency of your NER model development process.

· 4 min read
Vimal Menon

A key approach within data-centric AI is to use the available data to gain a better understanding about the data that is shaping the model and then let the model guide you towards anomalies in the data, which you then use to reconfigure your datasets or even include or remove certain data to see the impact that it has on the model behavior.

So you need a mechanism that allows you to not only understand the data better, but also be able to quickly see the impact that the changes have on the data.

As soon as you upload the data, the Acharya Dashboard provides you immediate feedback in the form of the following reports. These are simply based on your data without running your models yet.

Use entity distribution to determine your entity bias

Entity distribution shows how many annotations belong to each entity. If the number of annotations of a particular entity is lower or higher than that of other entities, you can tell that the dataset is biased against or towards that entity. This view therefore gives an insight into what kind of new data should be sourced into the project to balance the entity distribution.

Entity distribution in Acharya

Classifications

This table lists the words classified against each entity. This view helps in identifying words from your text corpus that are classified as entities. If a word is classified as more than one entity, it is important to verify such annotations and confirm their validity. This table also shows the count of occurrences of each annotated word in the dataset. In the screenshot below, the word NEW YORK is classified as LOC in 102 records and as ORG in 41 records. Clicking on NEW YORK expands the list and helps you navigate to the specific record, where you can review the text to confirm whether the classification is valid.

Entity Classifications view in Acharya

Missed Classifications

It often happens that, while annotating, you either miss classifying a word or classify it wrongly. It may also happen that you marked a word once but missed it at other locations simply because of time constraints. We built Missed classifications to identify such misses. This very helpful view displays the words that have been classified once in the dataset and missed at other locations of the dataset.

Missed Entity Classifications view in Acharya

Sorting on Missed Records shows annotations that might have been missed by the annotator, and sorting on Classification Counts shows annotations that might have been wrongly annotated by the annotator.

Anomalies

Anomalies highlight those annotations that Acharya suspects might be a mistake. As seen in the screenshot below, the annotator has included unnecessary symbols in the word. Such mistakes often creep in when classifying multiple records or when using a classification service.

Like with the other views, Acharya makes it easy to jump to the record in question and correct the annotation in the underlying data.

Anomalies view in Acharya

Unclassified words

This is another feature we felt was important as you review your data. This simple table helps in identifying unclassified words in the dataset. Instead of making you browse the dataset, the table lets you sort the words by their length and number of occurrences.

Unclassified Words view in Acharya
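A comparable table can be built by hand from IOB-tagged records. The sketch below assumes a `(token, tag)` record layout and one possible sort order (longest words first, then most frequent); the function name and the `min_length` parameter are hypothetical.

```python
from collections import Counter


def unclassified_words(records, min_length=1):
    """Count tokens tagged O, sorted by length and then by frequency.

    records: list of records, each a list of (token, tag) pairs (assumed layout).
    Returns a list of (word, occurrences), longest and most frequent first.
    """
    counts = Counter(
        token
        for record in records
        for token, tag in record
        if tag == "O" and len(token) >= min_length
    )
    return sorted(
        counts.items(),
        key=lambda item: (-len(item[0]), -item[1], item[0]),
    )


# Illustrative records only.
records = [
    [("NATO", "O"), ("said", "O"), ("on", "O")],
    [("said", "O"), ("Japan", "B-LOC")],
]
for word, occurrences in unclassified_words(records):
    print(word, occurrences)
```

Raising `min_length` filters out short function words, which makes capitalized or unusually long unclassified tokens easier to spot.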

We will continue our exploration of the data-centric features and reports in Acharya in Part 2. In the meantime, feel free to let us know what your favorite features are and share what your experience has been using Acharya. The link to Acharya is in the bio.

· One min read
Vimal Menon

I am excited to announce the alpha version of Acharya, a data-centric MLOps tool for your named entity recognition projects. Download Acharya from the home page.

Please reach out to me if you feel Acharya will be helpful in your NLP/NER projects.

A big shoutout to Nithin Stephen and Saurabh Korgaonkar for their immense hard work and dedication.

· 2 min read
Vimal Menon

As a life-long C and now a Go programmer, I found CSS to be a black box. Once I became a founder/developer of my product, that had to change and I had to learn CSS. I generally dedicate my Sundays to learning and practicing it.

I was elated with Neeraj Chopra's gold medal in the javelin throw, a first for India in athletics. To honor that, I was trying to build a pure CSS animation of a javelin throw. And then a brainwave hit me: what if I added this around the athlete's name as another style for tagging entities in a NER project?

The result is:

Javelin launch animation

This was achieved using only CSS animations. There is no JavaScript at all. The major challenge was to plot a Bezier curve mimicking a javelin throw. Once I had the Bezier curve, I had to translate it into pixels and percentages. The math for the curve was done on paper, and the translations in pixels were hardcoded into each @keyframes percentage (see https://developer.mozilla.org/en-US/docs/Web/CSS/@keyframes). The javelin was added to the ::before of the tag and rotated through various angles as the animation progressed, again using keyframes. The celebration animation is a static background SVG which grows and shrinks.
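The paper math can also be scripted. The sketch below is not the code behind the animation above: it assumes a quadratic Bezier curve with made-up control points, combines translation and rotation in a single `transform` for brevity, and uses a hypothetical animation name. It samples the curve at evenly spaced steps and prints the matching @keyframes rule, with the rotation taken from the curve's tangent.

```python
import math


def bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    u = 1 - t
    x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
    y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
    return x, y


def bezier_angle(p0, p1, p2, t):
    """Tangent direction in degrees at t, from the curve's derivative."""
    dx = 2 * (1 - t) * (p1[0] - p0[0]) + 2 * t * (p2[0] - p1[0])
    dy = 2 * (1 - t) * (p1[1] - p0[1]) + 2 * t * (p2[1] - p1[1])
    return math.degrees(math.atan2(dy, dx))


def keyframes(name, p0, p1, p2, steps=10):
    """Render a CSS @keyframes rule that follows the sampled curve."""
    lines = [f"@keyframes {name} {{"]
    for i in range(steps + 1):
        t = i / steps
        x, y = bezier_point(p0, p1, p2, t)
        angle = bezier_angle(p0, p1, p2, t)
        lines.append(
            f"  {100 * i / steps:g}% {{ transform: "
            f"translate({x:.0f}px, {y:.0f}px) rotate({angle:.0f}deg); }}"
        )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical control points in pixels; screen y grows downward,
# so the upward arc of the throw uses a negative y.
print(keyframes("javelin", (0, 0), (100, -120), (200, 0)))
```

More steps give a smoother path at the cost of a longer rule; the browser interpolates linearly between adjacent keyframes unless a timing function says otherwise.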

Although I started it as a fun task, it eventually became a bit difficult to crack. I am happy with the results and in the process learnt a lot about CSS animations 😀

Here is a closer look.

Closer look at Javelin launch animation

· 2 min read
Vimal Menon

I have spent the past two decades developing software professionally, solving some of the most intricate engineering problems across a variety of business domains. When I started re-discovering ML a few years ago to address some use cases around text and language processing (my prior experience with ML was during my engineering days nearly 20 years ago), I discovered that the tooling around ML, and more specifically around integrating NLP workflows into an agile engineering team's workflow, needed a significant rejig. Even though the tooling has been getting better, I still feel we haven't yet reached the level of ease with which an engineering team can introduce rigor within their DevOps cycle. The typical developer in an agile team faces quite an uphill task if they need to do NLP, more so if they need to integrate NLP notions into their dev/test/deploy loop. With the sheer number of frameworks, libraries and tools, the plumbing and interfacing required to get them to a usable state is often slow and is re-invented every single time by every team.
Our intent behind Astutic AI was to further the state of the art in developer-focused tooling around ML, starting with addressing some of these challenges in NLP/NER. Over the course of the next few blogs, I will spend some time discussing these in more depth. Thank you for reading, and feel free to drop me a note.