
Building a Named Entity Recognition Model for the Legal Domain


In our previous blog post we defined NER in the legal domain and presented our approach towards generating ground truth data. In what follows, we go over the state-of-the-art in the NER domain and elaborate on the experiments we ran and the lessons we learned.

Existing Approaches to NER

In 2018, Google AI released a pre-trained language model (LM) with various configurations, called Bidirectional Encoder Representations from Transformers (BERT), building on top of transformer encoder and self-supervised learning ideas (Devlin et al. (2019)).

The BERT model has been trained on two unsupervised tasks: (1) masked language modeling (MLM) and (2) next sentence prediction. The datasets used for training are BookCorpus and English Wikipedia.

Motivated by the idea of the BERT models, several LMs with a similar architecture have been introduced, such as RoBERTa (Liu et al. (2019)) or ALBERT, a lighter version of the model (Lan et al. (2020)).

The knowledge of language structure that large LMs acquire during pre-training on gigantic datasets can be transferred to downstream NLP tasks, such as NER or question answering, using relatively small amounts of labeled data (transfer learning).

In particular, we can build an NER model by feeding the token-level embeddings of an LM into a classifier (feedforward network + softmax function). The process of training this classifier is called fine-tuning.
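As a concrete (toy) illustration of this classification head, the snippet below applies a single linear layer followed by a softmax to per-token embeddings. The dimensions, label set, and random values are invented for readability; BERT-base embeddings are actually 768-dimensional:

```python
import math
import random

random.seed(0)

labels = ["O", "B-PARTY", "I-PARTY", "B-DATE", "I-DATE"]
hidden_size = 8        # BERT-base actually uses 768; small here for readability
seq_len = 4            # number of tokens in the (toy) sentence

# Token-level embeddings as produced by the LM's final layer (random here).
embeddings = [[random.gauss(0, 1) for _ in range(hidden_size)]
              for _ in range(seq_len)]

# The classification head: a single dense layer, learned during fine-tuning.
W = [[random.gauss(0, 0.02) for _ in range(len(labels))]
     for _ in range(hidden_size)]
b = [0.0] * len(labels)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

predictions = []
for emb in embeddings:
    # logits[j] = emb . W[:, j] + b[j], one score per candidate label
    logits = [sum(e * w for e, w in zip(emb, col)) + b[j]
              for j, col in enumerate(zip(*W))]
    probs = softmax(logits)
    predictions.append(labels[probs.index(max(probs))])   # one label per token
```

During fine-tuning, the head's weights (and typically the LM's weights as well) are updated by backpropagating a cross-entropy loss against the gold labels.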

Initially, the BERT architecture performed very well on general-purpose datasets, such as English Wikipedia. However, it reportedly under-performs on domain-specific datasets such as biomedical or legal text (see Table 1 in Beltagy et al. (2019)).

NLP researchers addressed this issue using two different approaches:

  • Continuing to train BERT with some domain-specific text data.
  • Training a domain-specific LM from scratch on a domain-specific corpus.

Some examples include SciBERT, trained on biomedical and computer science literature corpora (Beltagy et al. (2019)), and Legal-BERT, trained on English legal texts such as legislation and SEC filings (Chalkidis et al. (2020)).

With either of these methods, performance on the NER task improved: for example, Chalkidis et al. (2020) show that fine-tuning Legal-BERT improves the test F1-score by 1% to 3% compared to the BERT-base model.

RelationalAI’s NER Experiments

After having generated a ground truth dataset for the NER task, the next step is to experiment with different models.

We examine two types of models: LMs based on the BERT architecture, with 110 million parameters, and a BiLSTM-CRF model with 24 million parameters (Sharma (2019)).

The goal is to identify the 11 entity types mentioned in our previous post. The small human-annotated dataset (25 loan agreement documents) serves as our test set, while the large set of documents produced by our approach (1,000 loan agreement documents) serves as our training set.

Fine-Tuning Language Models

We consider two out-of-the-box models from HuggingFace: BERT-base, as a general-purpose LM, and Legal-BERT-base, as a domain-specific LM.

We also built two other LMs by further pre-training (FP) the aforementioned models on the masked language modeling task with more than 20,000 publicly available SEC loan agreement documents. We refer to these new models as MLM-FP-BERT-base and MLM-FP-Legal-BERT-base, respectively.
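For readers unfamiliar with the MLM objective used in this further pre-training: a random subset of tokens (15% in BERT's recipe) is masked, and the model is trained to reconstruct them. The sketch below shows just the masking step in plain Python; BERT's full recipe additionally replaces some selected tokens with random tokens or leaves them unchanged, which we omit here, and the sentence is an invented example:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=42):
    """Replace a random subset of tokens with [MASK]; return the masked
    sequence and the (position, original token) pairs the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append((i, tok))
        else:
            masked.append(tok)
    return masked, targets

sentence = ("the borrower shall repay the principal amount of the loan "
            "on the maturity date").split()
masked, targets = mask_tokens(sentence)
```

Training then minimizes the model's cross-entropy loss when predicting each original token in `targets` from the surrounding context.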

To assess the role of training data size, each model is fine-tuned using 100% and 20% of the Near-Gold Standard (NGS) annotated examples.

The table below summarizes the overall span-level F1-score of each model on the gold-standard test data using the seqeval framework. In this framework, an entity is counted as correctly predicted only if it matches the gold-standard entity in both exact span boundaries and entity type.
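To make this metric concrete, here is a simplified re-implementation of exact-match span scoring in plain Python. seqeval itself supports more tagging schemes and edge cases (e.g., an I- tag that opens an entity), and the BIO tags below are invented for illustration:

```python
def extract_spans(tags):
    """Collect (type, start, end) spans from a well-formed BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # an "I-" tag simply continues the currently open span
    return set(spans)

def span_f1(gold_tags, pred_tags):
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)                         # exact boundary + type match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = ["B-PARTY", "I-PARTY", "O", "B-DATE", "O"]
pred = ["B-PARTY", "I-PARTY", "O", "B-AMOUNT", "O"]
print(span_f1(gold, pred))  # 0.5: one of the two gold spans matched exactly
```

Note that a prediction with the right boundaries but the wrong type (or vice versa) earns no credit, which makes span-level F1 stricter than token-level accuracy.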

The results in the table indicate that the models further pre-trained on the MLM task (abbreviated as MLM-FP) slightly outperform their out-of-the-box counterparts when fine-tuned on relatively small training data. However, the gap between the MLM-FP and out-of-the-box models narrows significantly with larger training data.

Fine-tuning all four LMs on larger training data improves the test scores, despite some mislabeled annotations in the NGS dataset. Simply put, fine-tuning LMs on supervised downstream tasks with a large but not perfect corpus can still deliver very promising results.

| Training Data        | BERT-base | Legal-BERT-base | MLM-FP-BERT-base | MLM-FP-Legal-BERT-base |
|----------------------|-----------|-----------------|------------------|------------------------|
| 20% of NGS examples  | 92.9%     | 93.0%           | 93.1%            | 93.3%                  |
| All NGS examples     | 93.9%     | 94.0%           | 94.0%            | 94.0%                  |

Span-level test F1-scores for the NER fine-tuning of four different LMs.

Training a BiLSTM-CRF Model

In addition to the Transformer models, we further experimented with the BiLSTM-CRF model introduced in Sharma (2019). The model consists of three main layers: a word-level embedding layer, a BiLSTM layer with 150 units, and a CRF layer for token classification.
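The CRF layer is what distinguishes this model from a plain BiLSTM tagger: instead of classifying each token independently, it scores entire tag sequences using learned tag-transition scores and decodes the best sequence with the Viterbi algorithm. Below is a toy sketch of that decoding step; all scores are hand-picked for illustration, whereas a real model learns them during training:

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: per-token score for each tag (from the BiLSTM).
    transitions[a][b]: score for tag a being followed by tag b (from the CRF).
    """
    score = {t: emissions[0][t] for t in tags}
    backpointers = []
    for emission in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = (score[best_prev] + transitions[best_prev][cur]
                              + emission[cur])
            ptr[cur] = best_prev
        backpointers.append(ptr)
        score = new_score
    best = max(tags, key=score.get)
    path = [best]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-PARTY", "I-PARTY"]
# Transition scores strongly penalize the illegal move O -> I-PARTY.
transitions = {
    "O":       {"O": 0.0, "B-PARTY": 0.0, "I-PARTY": -10.0},
    "B-PARTY": {"O": 0.0, "B-PARTY": 0.0, "I-PARTY": 1.0},
    "I-PARTY": {"O": 0.0, "B-PARTY": 0.0, "I-PARTY": 1.0},
}
# Per-token emission scores for a three-token sentence.
emissions = [
    {"O": 2.0, "B-PARTY": 0.5, "I-PARTY": 0.0},
    {"O": 0.0, "B-PARTY": 2.0, "I-PARTY": 1.0},
    {"O": 0.5, "B-PARTY": 0.0, "I-PARTY": 1.0},
]
print(viterbi(emissions, transitions, tags))  # ['O', 'B-PARTY', 'I-PARTY']
```

The transition scores let the decoder avoid inconsistent outputs (such as an I- tag with no preceding B- tag) that an independent per-token classifier could produce; libraries such as pytorch-crf provide this decoding as part of the CRF layer.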

We considered different approaches for initializing the word-level representation vectors. Our experiments show that initializing the embeddings with 300-dimensional Global Vectors for Word Representation (GloVe for short) outperforms the other methods, achieving a 94% overall F1-score on the test data.

Furthermore, we found a negligible gap between fine-tuning a domain-specific LM, such as Legal-BERT-base, and training BiLSTM-CRF from scratch when a large training dataset is available. See the image below for a detailed comparison.


Comparing the span-level test F1-scores for fine-tuned Legal-BERT-base and BiLSTM-CRF with 300-dimensional GloVe embeddings.

Guidelines for Practitioners

Taking a data-centric AI approach, we proposed a scalable algorithm for generating a large number of high-quality named entity annotations with minimal supervision. High-quality labels are necessary for training models for NER.

We found that NER models based on Transformer and BiLSTM-CRF architectures are capable of generalizing from large, near-gold-standard training data.

In fact, the F1-score of these models can go as high as 94%. Further pre-training out-of-the-box models on the MLM task improves NER performance when a large number of domain-specific annotations is not available. The benefit of further pre-training diminishes when a large corpus of annotated examples exists.

This shows that practitioners with access to a large annotated corpus don’t need to worry about further pre-training their models for this use case.

Finally, in the presence of large training data, less complex models such as BiLSTM-CRF can achieve performance comparable to that of more complex transformer-based architectures.

This tells us that, given high-quality annotations, practitioners can get very similar results by deploying smaller and cheaper NER models on legal documents whose entities are similar to those studied in our work.

We are excited that our work was accepted as a contribution in the Industry and Government Program of the IEEE Big Data 2022 Conference, and will be presented on December 19, 2022.

You can read our paper here.
