Building a Named Entity Recognition Model for the Legal Domain

In our previous blog post, we defined NER in the legal domain and presented our approach to generating ground truth data. In what follows, we review the state of the art in NER and describe the experiments we ran and the lessons we learned.

Existing Approaches to NER

In 2018, Google AI released Bidirectional Encoder Representations from Transformers (BERT), a pre-trained language model (LM) available in several configurations that builds on the Transformer encoder and on self-supervised learning (Devlin et al. (2019)).

The BERT model was pre-trained on two unsupervised tasks: (1) masked language modeling (MLM) and (2) next sentence prediction. The datasets used for pre-training are BooksCorpus and English Wikipedia.
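
To make the MLM objective concrete, here is a minimal sketch of the masking step as described in Devlin et al. (2019): about 15% of tokens are selected for prediction, and of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. The function name, the whitespace tokens, and the toy vocabulary are illustrative assumptions; real BERT pre-training operates on WordPiece token ids.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the BERT masking recipe.

    labels[i] holds the original token where the model must make a
    prediction, and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Token not selected: keep it and do not score it.
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            masked.append(token)               # 10%: keep the original token
    return masked, labels
```

The model is then trained to recover the original token at every position where the label is not None.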

Motivated by BERT, several LMs with a similar architecture have been introduced, such as RoBERTa (Liu et al. (2019)) and ALBERT, a lighter version of the model (Lan et al. (2020)).

The knowledge of language structure that large LMs acquire during pre-training on very large datasets can be transferred to downstream NLP tasks, such as NER or question answering, using relatively little labeled data (transfer learning).

In particular, we can build an NER model by feeding the token-level embeddings of an LM into a classifier (feedforward network + softmax function). The process of training this classifier is called fine-tuning.
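
The classifier head can be sketched in a few lines. The example below assumes a single linear layer followed by a softmax, applied independently to each token embedding; the function names, the toy two-dimensional embeddings, and the label names are hypothetical, and only the forward pass is shown (no training loop).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(embeddings, weights, biases, labels):
    """Apply one linear layer plus softmax to every token embedding.

    embeddings: list of token vectors, each of length d
    weights:    one weight vector of length d per label
    biases:     one bias per label
    Returns (most probable label, its probability) for each token.
    """
    predictions = []
    for vector in embeddings:
        logits = [
            sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)
        ]
        probs = softmax(logits)
        best = max(range(len(labels)), key=probs.__getitem__)
        predictions.append((labels[best], probs[best]))
    return predictions
```

In practice the embeddings would come from the LM (768 dimensions for BERT-base) and the weights would be learned from the annotated examples.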

Initially, the BERT architecture performed very well on general-purpose datasets, such as English Wikipedia. However, it reportedly underperforms on domain-specific datasets such as biomedical or legal text (see Table 1 in Beltagy et al. (2019)).

NLP researchers addressed this issue using two different approaches:

  • Continuing to train BERT with some domain-specific text data.
  • Training a domain-specific LM from scratch on a domain-specific corpus.

Some examples include: SciBERT, trained on biomedical and computer science literature corpora (Beltagy et al. (2019)), and Legal-BERT, trained on English legal texts such as legislation and SEC filings (Chalkidis et al. (2020)).

With either of these methods, NER performance improves: for example, Chalkidis et al. (2020) show that fine-tuning Legal-BERT improves the test F1-score by between 1% and 3% compared to the BERT-base model.

RelationalAI’s NER Experiments

After having generated a ground truth dataset for the NER task, the next step is to experiment with different models.

We examine two types of models: different LMs with the BERT architecture (110 million parameters) and the BiLSTM-CRF model (24 million parameters; Sharma (2019)).

The goal is to identify the 11 entity types we mentioned in our previous post. The small dataset annotated by humans (25 loan agreement documents) is used as our test set, while the large set of documents annotated by our approach (1,000 loan agreement documents) is used as our training set.

Fine-Tuning Language Models

We consider two out-of-the-box models from HuggingFace: BERT-base, a general-purpose LM, and Legal-BERT-base, a domain-specific LM.

We also built two other LMs by further pre-training (FP) the aforementioned models on the masked-language modeling task with more than 20,000 publicly available SEC loan agreement documents. We refer to these new models as MLM-FP-BERT-base and MLM-FP-Legal-BERT-base, respectively.

To assess the role of training data size, each model is fine-tuned using 100% and 20% of the Near-Gold Standard (NGS) annotated examples.

The table below summarizes the overall span-level F1-score of each model on the gold-standard test data using the seqeval framework. In this framework, each entity is considered to be correctly predicted if it matches the gold-standard entity in both exact boundary and type.
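
The exact-match criterion can be illustrated with a small re-implementation (not seqeval itself). The sketch assumes BIO-style tags and a micro-averaged F1 over all spans. Both are assumptions, as are the function names, and stray I- tags without a preceding B- tag are simply ignored here.

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((kind, start, i - 1))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold_sequences, predicted_sequences):
    """Micro-averaged F1: a span counts only if type and boundaries match."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold_sequences, predicted_sequences):
        gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(gold & pred)
        n_gold += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that a prediction covering only part of an entity earns no credit: it counts as one false positive and one false negative.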

The results in the table indicate that the models further pre-trained on the MLM task (abbreviated as MLM-FP) slightly outperform their out-of-the-box counterparts when fine-tuned on relatively small training data. However, the gap between MLM-FP and out-of-the-box models narrows considerably in the presence of larger training data.

Fine-tuning all four LMs on larger training data improves the test scores, despite some mislabeled annotations in the NGS dataset. Simply put, fine-tuning LMs on supervised downstream tasks with a large but not perfect corpus can still deliver very promising results.

[Table: training data (20% of NGS examples; all NGS examples) by LM; values not shown]

Span-level test F1-scores for the NER fine-tuning of four different LMs.

Training a BiLSTM-CRF Model

In addition to the Transformer models, we further experimented with the BiLSTM-CRF model introduced in Sharma (2019). The model consists of three main layers: a word-level embedding layer, a BiLSTM layer with 150 units, and a CRF layer for token classification.
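
At prediction time, a CRF layer picks the tag sequence with the highest total score, combining per-token scores (here, from the BiLSTM) with learned tag-to-tag transition scores. The following is a minimal sketch of that Viterbi decoding step, not the implementation from Sharma (2019); the function name and the score layout are assumptions, and training the CRF is not shown.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence (as indices) and its score.

    emissions:   emissions[t][j] is the score of tag j at position t
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for ending at tag j in this position.
            best_prev = max(
                range(n_tags), key=lambda i: scores[i] + transitions[i][j]
            )
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + step[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(range(n_tags), key=scores.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]
```

A strongly negative transition score, for example from O to an I- tag, lets the CRF rule out invalid tag sequences that a per-token softmax could still produce.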

We considered different approaches for initializing the word-level representation vectors. The experiments show that initializing the embeddings with 300-dimensional Global Vectors for Word Representation (GloVe for short) outperforms the other methods, achieving a 94% overall F1-score on the test data.
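
As an illustration of GloVe initialization, the sketch below parses vectors in the plain-text GloVe format (one word per line, followed by its space-separated values) and builds an initial embedding matrix for a vocabulary. The lowercase fallback and the small random initialization for out-of-vocabulary words are assumptions, as are the function names.

```python
import random


def load_glove(lines, dim):
    """Parse GloVe text lines of the form 'word v1 v2 ... v_dim'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, glove, dim, seed=0):
    """Return one initial vector per vocabulary word, in vocabulary order."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        vector = glove.get(word) or glove.get(word.lower())
        if vector is None:
            # Assumption: unseen words start from small random values.
            vector = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix.append(list(vector))  # copy so rows can be updated independently
    return matrix
```

With a real GloVe file, `lines` would be the open file handle and `dim` would be 300; the resulting matrix initializes the word-level embedding layer, which can then be updated during training.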

Furthermore, we found that the gap between fine-tuning a domain-specific LM, such as Legal-BERT-base, and training BiLSTM-CRF from scratch is negligible when a large training dataset is available. See the image below for a detailed comparison.

Comparing the span-level test F1-scores for fine-tuned Legal-BERT-base and BiLSTM-CRF with 300-dimensional GloVe embeddings.

Guidelines for Practitioners

Taking a data-centric AI approach, we proposed a scalable algorithm for generating a large number of high-quality named entity annotations with minimal supervision. High-quality labels are necessary for training models for NER.

We found that NER models based on Transformer and BiLSTM-CRF architectures are capable of generalizing from large, near-gold-standard training data.

In fact, the F1-score of these models can go as high as 94%. Further pre-training of out-of-the-box models (on the MLM task) improves NER performance when only a small number of domain-specific annotations is available. The benefit of further pre-training diminishes when a huge corpus of annotated examples exists.

This shows that practitioners with access to a large annotated corpus don’t need to worry about further pre-training their models for this use case.

Finally, in the presence of large training data, less complex models such as BiLSTM-CRF can achieve performance comparable to that of more complex Transformer-based architectures.

This tells us that, given high-quality annotations, practitioners can get very similar results by deploying smaller and cheaper NER models on legal documents where the entities are similar to those studied in our work.

We are excited that our work was accepted as a contribution in the Industry and Government Program of the IEEE Big Data 2022 Conference, and will be presented on December 19, 2022.

You can read our paper here.