Identifying people, places, and things in text, called named entity disambiguation (NED), is a fundamental AI problem. Machine-based NED systems often struggle to correctly distinguish entities that appear infrequently in data, even though the majority of entities people care about when searching for information or using personal assistants are rare. We show that conventional BERT-based approaches to NED succeed on common entities but perform 50 F1 points worse on rare ones. Bootleg addresses this "rare entity" problem by drawing inspiration from how humans reason about unfamiliar entities: they use cues about specific entity facts, relations, or types. Bootleg learns how to reason over factual, relation, and type information automatically, without hard-coded rules or features, making it easy to maintain and extend to new languages.

We show that Bootleg matches or exceeds the state of the art on three popular NED benchmarks and gives a 40 F1 point gain over rare entities compared to a prior state-of-the-art model. Bootleg's learned knowledge also transfers to other tasks, providing an 8% performance lift in highly optimized production search and assistant tasks at a major technology company and setting a new SotA on the TACRED relation extraction task.

Our paper Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation is available on arXiv, and our code is available on GitHub.

Named Entity Disambiguation and the Long Tail of Entities

Named entity disambiguation (NED) is the process of mapping “strings” to “things” in a knowledge base. You have likely already used a system that requires NED multiple times today. Every time you ask a question to your personal assistant or issue a search query on your favorite browser, these systems use NED to understand what people, places, and things (entities) are being talked about.

Named entity disambiguation example. The ambiguous “Lincoln” refers to the car, not the person or location.

Take the example shown above. You ask your personal assistant, “What is the average gas mileage of a Lincoln?” The assistant needs NED to know that “Lincoln” refers to Lincoln Motors (the car company), not the former president or the city in Nebraska. This ambiguity of mentions in text is what makes NED so challenging: resolving it requires picking up on subtle cues.

The spectrum of entities. Popular (head) entities occur frequently in data while rare (tail) entities are infrequent.

NED gets more interesting when we examine the full spectrum of entities shown above, specifically the rarer tail and unseen entities. These are entities that occur infrequently or not at all in data. Performance over the tail is critical because the majority of entities are rare. In Wikidata, only 13% of entities even have Wikipedia pages as a source of textual information.

Bootleg compared to a BERT-based baseline model (Févry et al., 2020), showing average F1 versus the number of times an entity occurred in the training data.

Prior approaches to NED use BERT-based systems to memorize textual patterns associated with an entity (e.g., Abraham Lincoln is associated with “president”). As shown above, the SotA BERT-based baseline from Févry et al. does a great job at memorizing patterns over popular entities (it achieves 86 F1 points over all entities). For the rare entities, it does much worse (58 F1 points lower on the tail). One possible way to improve tail performance is to simply train over more data, but this would likely require training over data 1,500x the size of Wikipedia for the model to achieve 60 F1 points over all entities!

Tail Disambiguation through NED Reasoning Patterns

The question we are left with is how to disambiguate these rare entities. Our insight is that humans disambiguate entities, including rare entities, by using signals from text as well as from entity relations and types. For example, the sentence “What is the gas mileage of a Lincoln?” requires reasoning that cars have a gas mileage, while people and locations do not. The same pattern can be used to reason that the mention of “Bluebird” in “What is the average gas mileage of a Bluebird?” refers to the car, a Nissan Bluebird, not the animal. Our goal in Bootleg is to train a model to reason over entity types and relations and better identify these tail entities.

Through empirical analysis, we found four reasoning patterns for NED, shown and defined in the figure below.

Four reasoning patterns of NED. Each pattern uses some combination of entity, type, and relation information.

These patterns rely on signals from entities, types, and relations. Luckily, the types and relations of tail entities are usually not rare themselves: a rare car is still a car. This means we should be able to learn type and relation patterns from our data that can apply to tail entities.
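
To make this concrete, here is a minimal sketch that counts how often entities and their types appear in a toy set of training mentions. The mention list, the type labels, and the counts are all made up for illustration; they are not drawn from Bootleg's training data.

```python
from collections import Counter

# Toy training mentions as (entity, type) pairs. All values are illustrative.
training_mentions = [
    ("Ford Mustang", "car"),
    ("Ford Mustang", "car"),
    ("Ford Mustang", "car"),
    ("Toyota Corolla", "car"),
    ("Toyota Corolla", "car"),
    ("Nissan Bluebird", "car"),
    ("Abraham Lincoln", "person"),
    ("Abraham Lincoln", "person"),
    ("Abraham Lincoln", "person"),
    ("Abraham Lincoln", "person"),
]

# How often each entity and each type is seen during training.
entity_counts = Counter(entity for entity, _type in training_mentions)
type_counts = Counter(type_name for _entity, type_name in training_mentions)

# A rare entity can still belong to a frequently seen type.
print(entity_counts["Nissan Bluebird"], type_counts["car"])
```

In this toy data the Nissan Bluebird appears once, while its type appears six times, so a pattern learned for the type has far more supporting examples than any pattern tied to the entity alone.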

Bootleg: A Model for Tail NED

Bootleg takes a sentence as input, determines the possible entity candidates for each mention in the sentence, and outputs the most likely candidate for each. What lets Bootleg better identify rare entities is how it represents entities internally.
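
As a rough illustration of the candidate step, the sketch below looks up a mention in an alias table and returns candidates ordered by how often each was seen. The table `ALIAS_TO_CANDIDATES`, the function `get_candidates`, the `top_k` parameter, and the counts are hypothetical names and values chosen for this example; they are not Bootleg's actual API or data.

```python
# Hypothetical alias table: lowercased mention -> list of (entity, count).
# Counts are made up and only control the ordering of candidates.
ALIAS_TO_CANDIDATES = {
    "lincoln": [
        ("Abraham Lincoln", 900),
        ("Lincoln, Nebraska", 400),
        ("Lincoln Motors", 120),
    ],
    "bluebird": [
        ("Bluebird (bird)", 300),
        ("Nissan Bluebird", 15),
    ],
}


def get_candidates(mention, top_k=30):
    """Return up to top_k candidate entities for a mention, most frequent first."""
    candidates = ALIAS_TO_CANDIDATES.get(mention.lower(), [])
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [name for name, _count in ranked[:top_k]]


print(get_candidates("Lincoln"))
```

Note that a frequency-only ordering puts the president first for “Lincoln”, which is exactly why a disambiguation model is still needed to pick the car company when the sentence is about gas mileage.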

The creation of an entity candidate representation. Each candidate is a combination of learned entity, type, and relation embeddings.

Similar to how words are often represented by continuous word embeddings (BERT, ELMo), Bootleg represents entity candidates as a combination of a unique entity embedding, a type embedding, and a relation embedding, as shown above. This means, for example, that each car entity will get the same car type embedding (likewise for relations). This car embedding will encode patterns learned over all cars in the training data. A rare car can then use this global “car type” knowledge for disambiguation, as it will have the car embedding as part of its representation.
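
The sketch below illustrates the sharing idea with plain Python lists. It makes several assumptions that are not from the post: the three embeddings are combined by simple concatenation, each entity has exactly one type and one relation, the vectors are random placeholders rather than learned values, and every name (`candidate_representation`, `DIM`, the lookup tables) is hypothetical.

```python
import random

random.seed(0)
DIM = 4  # Illustrative embedding size.


def random_vector(dim=DIM):
    """Placeholder for a learned embedding vector."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


# One unique embedding per entity.
entity_embeddings = {
    name: random_vector()
    for name in ["Ford Mustang", "Nissan Bluebird", "Abraham Lincoln"]
}
# One embedding per type and per relation, shared by all entities that have it.
type_embeddings = {"car": random_vector(), "person": random_vector()}
relation_embeddings = {"manufacturer": random_vector(), "position held": random_vector()}

# Assumed structural metadata: a single type and relation per entity.
entity_type = {"Ford Mustang": "car", "Nissan Bluebird": "car", "Abraham Lincoln": "person"}
entity_relation = {
    "Ford Mustang": "manufacturer",
    "Nissan Bluebird": "manufacturer",
    "Abraham Lincoln": "position held",
}


def candidate_representation(entity):
    """Concatenate the entity, type, and relation embeddings for one candidate."""
    return (
        entity_embeddings[entity]
        + type_embeddings[entity_type[entity]]
        + relation_embeddings[entity_relation[entity]]
    )


common_car = candidate_representation("Ford Mustang")
rare_car = candidate_representation("Nissan Bluebird")

# The entity slices differ, but the type slice is identical for both cars.
print(common_car[DIM:2 * DIM] == rare_car[DIM:2 * DIM])
```

Because the type slice is the same list of values for every car, anything the model learns about that slice from frequent cars is automatically available to a rare one.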

To output the correct entities, Bootleg uses these representations in a stacked Transformer module to allow the model to naturally learn the useful patterns for disambiguation without hard-coded rules (see our paper for details). Bootleg then scores the output candidate representations and returns the most likely candidates.
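
To show what the final scoring step might look like, here is a deliberately simplified sketch. It replaces the stacked Transformer with a single dot product between a context vector and each candidate vector, followed by a softmax; this is a stand-in for illustration, not the architecture from our paper. The two-dimensional vectors are hand-set toy values, and `rank_candidates` is a hypothetical helper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(context_vector, candidate_vectors):
    """Score each candidate against the context and sort by probability."""
    names = list(candidate_vectors)
    probs = softmax([dot(context_vector, candidate_vectors[n]) for n in names])
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)


# Toy vectors: dimension 0 loosely means "vehicle", dimension 1 "person".
context = [1.0, 0.0]  # Stand-in for a sentence about gas mileage.
candidates = {
    "Lincoln Motors": [0.9, 0.1],
    "Abraham Lincoln": [0.1, 0.9],
    "Lincoln, Nebraska": [0.2, 0.2],
}

ranking = rank_candidates(context, candidates)
print(ranking[0][0])
```

With these toy values the car company scores highest, because its vector aligns with the vehicle-like context. In the real model, the context and candidate representations interact through attention rather than a fixed dot product.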

There are other exciting techniques we present in our paper regarding regularization and weak labeling to improve tail performance.

Bootleg Improves Tail Performance and Allows for Knowledge Transfer

Our simple insight of training a model to reason over types and relations provides state-of-the-art performance on three standard NED benchmarks – matching or exceeding SotA by up to 5.6 F1 points – and outperforms a BERT-based NED baseline by 5.4 F1 points over all entities and 40 F1 points over tail entities (see F1 versus entity occurrence plot above).

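| Benchmark | System | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| KORE50 | Hu et al., 2019 | 80.0 | 79.8 | 79.9 |
| KORE50 | Bootleg | 86.0 | 85.4 | 85.7 |
| RSS500 | Phan et al., 2019 | 82.3 | 82.3 | 82.3 |
| RSS500 | Bootleg | 82.5 | 82.5 | 82.5 |
| AIDA CoNLL YAGO | Févry et al., 2020 | - | 96.7 | - |
| AIDA CoNLL YAGO | Bootleg | 96.9 | 96.7 | 96.8 |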
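
For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values in the table. It assumes the reported F1 is the plain harmonic mean of the two reported numbers, which may differ slightly from how each benchmark's official scorer aggregates results; `f1_score` is a helper written for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (benchmark, system) -> (precision, recall), copied from the table above.
reported = {
    ("KORE50", "Hu et al., 2019"): (80.0, 79.8),
    ("KORE50", "Bootleg"): (86.0, 85.4),
    ("RSS500", "Phan et al., 2019"): (82.3, 82.3),
    ("RSS500", "Bootleg"): (82.5, 82.5),
    ("AIDA CoNLL YAGO", "Bootleg"): (96.9, 96.7),
}

for (benchmark, system), (precision, recall) in reported.items():
    print(benchmark, system, round(f1_score(precision, recall), 1))
```

Rounded to one decimal place, the recomputed values agree with the F1 column for every row that reports both precision and recall.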

We’ll now show how the entity knowledge encoded in Bootleg’s entity representations can transfer to non-NED tasks. We extract our entity representations and use them in both a production task at a major technology company and a relation extraction task. We find that using Bootleg embeddings in the production task provides an 8% lift in performance and even improves quality for Spanish, French, and German. We repeat this experiment by adding Bootleg representations to a SotA model for the TACRED relation extraction task (see tutorial). We find that this Bootleg-enhanced model sets a new SotA by 1 F1 point.

| Model | TACRED F1 |
| --- | --- |
| Bootleg-Enhanced | 80.3 |
| KnowBERT | 79.3 |
| SpanBERT | 78.0 |

These results suggest that Bootleg entity representations can transfer entity knowledge to other language tasks!

Recap

To recap, we described the problem of the tail in NED and showed that existing NED systems fall short at disambiguating these rare yet important entities. We then introduced four reasoning patterns for NED and described how we trained Bootleg to learn these patterns through the use of embeddings and Transformer modules. Finally, we showed that Bootleg is a SotA NED system that disambiguates rare entities better than prior methods. Further, Bootleg learns representations that can transfer entity knowledge to non-NED tasks.

We are actively developing Bootleg and would love to hear your thoughts. See our website, source code, and paper.