What is Entity Recognition?
Entity recognition, also known as Named Entity Recognition (NER), is a key technique in natural language processing (NLP) that involves identifying and categorizing entities in text into predefined categories such as names of people, organizations, locations, dates, monetary amounts, percentages, and more.
For example, in the sentence:
"Apple Inc. announced its latest iPhone at a conference in California on September 12, 2023." A NER system might produce:
- Apple Inc. → ORGANIZATION
- iPhone → PRODUCT
- California → LOCATION
- September 12, 2023 → DATE
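If you want to try this yourself before we get to LLMs, a traditional NER library like spaCy reproduces it in a few lines. This is a minimal sketch, assuming the en_core_web_sm model is installed; note that spaCy's label names differ slightly from those above:

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. announced its latest iPhone at a conference "
          "in California on September 12, 2023.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# spaCy uses labels like ORG, GPE, and DATE rather than ORGANIZATION
# and LOCATION, and may or may not tag "iPhone" as PRODUCT.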
Why is this useful?
Entity recognition helps find and classify critical information in unstructured text:
- Data or Information Extraction: NER can parse large volumes of unstructured text, such as news articles, contracts, and reports, to extract specific entities like names, dates, monetary values, or legal clauses. For instance, it might identify stakeholders, contract durations, or payment terms in a collection of legal documents, streamlining tasks like contract review and risk assessment.
- Knowledge Graph Construction and Updates: By identifying entities and the relationships between them, NER can automatically populate and maintain knowledge graphs. For example, it can analyze web articles to create a graph connecting companies, their founders, and headquarters locations. This application supports industries like media and research, where accurate, up-to-date relational data is crucial.
- Automated Data Entry: NER simplifies data entry processes by identifying and categorizing information from structured or semi-structured sources like receipts and invoices. For instance, it can extract customer names, addresses, purchased items, and prices from scanned receipts, reducing manual effort and error rates in accounting and retail operations.
NER's ability to process unstructured data into structured, actionable insights makes it an essential tool for improving operational efficiency and enabling advanced analytics across diverse domains.
Challenges of Entity Recognition
Entity recognition faces several challenges, primarily due to the complexity, variability, and ambiguity of natural language.
A word can refer to multiple entities depending on the context, and determining the type of entity often requires understanding the surrounding words (Apple the fruit vs. Apple the company).
Entities can be referred to in multiple ways. The classic example is "New York City" vs. "NYC". Capitalization, punctuation, or spacing inconsistencies can also make recognition difficult (e.g., iPhone vs. i-Phone).
In fields like medicine, law, or finance, NER systems need to recognize domain-specific terms. This is often solved with fine-tuning, but then models trained on one domain (e.g., news articles) may not perform well in another (e.g., legal documents).
Informal text (like "tmrw meet @ cafe") with abbreviations, slang, and lack of punctuation can confuse NER systems. Typos and misspellings add another layer of complexity.
Entities in different languages have their own conventions; Arabic names, for example, often include prefixes like "Al-" or "Bin". A single sentence can also mix several languages.
New terms, brands, or names frequently emerge, making it hard for systems to stay updated. "ChatGPT" might not have been recognized before 2022. Some entities may also be relevant only for a specific period, like "FIFA World Cup" or "Super Bowl".
Rule-based, statistical, and machine learning methods
Before LLMs, entity recognition was approached through a combination of rule-based systems, statistical methods, and traditional machine learning models.
Rule-based systems
Rules based on regular expressions or patterns were created to identify entities, such as matching dates with patterns like \d{2}/\d{2}/\d{4} or names with common prefixes like "Dr.", "Mr.", or "Ms.". For well-known entities, predefined lists (gazetteers) of known names, organizations, and locations were used for matching.
Part-of-speech tags (e.g. "NNP" for proper nouns) or syntactic structures (e.g. "NP" for noun phrases) were also used as signals. These systems were hard to generalize and had little contextual understanding.
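As a concrete illustration, here is a minimal sketch of these rules in Python; the patterns and the location list are illustrative, not from any production system:

import re

text = "Dr. Smith met Ms. Jones in Paris on 12/09/2023."

# Pattern rules: dates like 12/09/2023, and an honorific followed by a name
date_pattern = re.compile(r"\d{2}/\d{2}/\d{4}")
name_pattern = re.compile(r"(?:Dr|Mr|Ms)\.\s[A-Z][a-z]+")

# Gazetteer rule: a predefined list of known locations
locations = {"Paris", "London", "New York"}

for m in date_pattern.finditer(text):
    print(m.group(), "-> DATE")       # 12/09/2023 -> DATE
for m in name_pattern.finditer(text):
    print(m.group(), "-> PERSON")     # Dr. Smith -> PERSON, Ms. Jones -> PERSON
for word in re.findall(r"[A-Za-z]+", text):
    if word in locations:
        print(word, "-> LOCATION")    # Paris -> LOCATION

The brittleness is easy to see: "12 September 2023" slips past the date pattern, and any location outside the gazetteer is invisible.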
Statistical methods
Statistical models were introduced to overcome the rigidity of rule-based systems by learning patterns from annotated data.
- Hidden Markov Models (HMMs) modeled sequences of words as probabilistic transitions between states (e.g., "Person", "Organization").
- Maximum Entropy Models captured the probability of an entity given its context by considering features like surrounding words and capitalization.
- Conditional Random Fields (CRFs) improved on HMMs by considering the entire sentence context for predicting the label of each word.
These were more flexible than rule-based systems, capable of learning from labeled examples without manual rule creation; but they still required significant feature engineering to perform well and were dependent on annotated data for training, which could be scarce or domain-specific.
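To make that feature engineering concrete, here is a sketch of the kind of per-token features a MaxEnt or CRF tagger would consume. The feature names are illustrative; libraries like sklearn-crfsuite accept exactly this dict-per-token input and learn a label (e.g., B-ORG, I-ORG, O) for each position:

def token_features(tokens, i):
    # Hand-crafted features for the token at position i
    token = tokens[i]
    return {
        "word.lower": token.lower(),
        "word.istitle": token.istitle(),
        "word.isupper": token.isupper(),
        "word.isdigit": token.isdigit(),
        "suffix3": token[-3:],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = "Apple Inc. announced its latest iPhone".split()
print(token_features(tokens, 0))
# {'word.lower': 'apple', 'word.istitle': True, 'word.isupper': False, ...}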
Machine learning methods
Machine learning, the 'bigger data' cousin of statistical methods, advanced entity recognition even further.
First came word embeddings (e.g., Word2Vec), which captured semantic relationships and reduced reliance on manual feature engineering compared to statistical methods. They handled high-dimensional, contextual, and orthographic features (e.g. capitalization, punctuation) automatically, enabling better generalization across domains and datasets.
Unlike statistical models (e.g., HMMs, CRFs), ML models (e.g., SVMs, Decision Trees) captured complex, non-linear patterns and adapted to variations in text structure, improving accuracy and robustness. With better feature representation, flexibility, and handling of unseen data, ML methods addressed the rigidity, limited feature scope, and domain specificity of statistical approaches.
While ML methods still required significant effort in preparing labeled datasets and designing features, they paved the way for even more advanced deep learning techniques that further reduced dependency on manual intervention.
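A toy sketch of this pipeline, assuming gensim and scikit-learn are installed (the corpus and labels are made up purely for illustration): train word embeddings, then use the vectors as features for a per-token classifier.

from gensim.models import Word2Vec
from sklearn.svm import SVC

corpus = [
    ["apple", "announced", "a", "new", "phone"],
    ["google", "opened", "an", "office", "in", "london"],
    ["she", "visited", "paris", "and", "london"],
]
w2v = Word2Vec(sentences=corpus, vector_size=16, min_count=1, seed=42)

# Token-level training data: word vector -> entity (1) or not (0)
train_words = ["apple", "google", "london", "paris", "announced", "visited", "phone"]
labels = [1, 1, 1, 1, 0, 0, 0]
X = [w2v.wv[w] for w in train_words]

clf = SVC().fit(X, labels)
print(clf.predict([w2v.wv["office"]]))  # a 0/1 prediction for a held-out word

Note one limitation this sketch shares with real Word2Vec systems: "apple" gets a single static vector, so the fruit and the company are conflated, which is exactly the gap contextual embeddings close.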
Entity Recognition with LLMs
Large language models (LLMs) present new ways to address the challenges of entity recognition.
LLMs generate contextual embeddings, allowing them to adapt the representation of a word based on its context.
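You can observe this directly. The following sketch, assuming PyTorch and the Hugging Face transformers library with the bert-base-uncased checkpoint, compares the embedding of "apple" in two different contexts:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence, word):
    # Return the contextual vector for the first occurrence of `word`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

fruit = word_embedding("i ate an apple for lunch", "apple")
company = word_embedding("apple announced a new phone", "apple")
# The same surface form gets different vectors in different contexts
print(torch.cosine_similarity(fruit, company, dim=0).item())

Unlike the static Word2Vec vector above, the two "apple" vectors here differ, which is what lets downstream layers assign different entity types.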
LLMs are pre-trained on massive corpora, enabling them to recognize entities and their contexts without domain-specific training. Fine-tuning or prompting requires significantly less labeled data, making NER adaptable and scalable across domains. They use attention mechanisms to identify and weigh the importance of surrounding context, making entity classification more accurate.
Multilingual models (e.g., mBERT, XLM-R) can process multiple languages in a single model. They also handle code-switching (mixed-language sentences) effectively.
LLMs inherently recognize variations due to the diversity of training data. They can link entities across synonyms and aliases without explicit feature engineering.
They can handle noisy and informal language better, as their pre-training includes exposure to diverse datasets with real-world variability.
With fine-tuning, they excel at distinguishing nuanced entities (e.g., "California" as a STATE vs. "California Institute of Technology" as an ORGANIZATION).
Enough talk, show me the code
Let's look at a simple example of how to use Python to call an LLM to perform entity recognition.
import json
from typing import List

from pydantic import BaseModel, ValidationError

# Define Pydantic models for the expected JSON output
class Entity(BaseModel):
    entity: str
    type: str
    start: int
    end: int

class NEROutput(BaseModel):
    entities: List[Entity]

# Comprehensive prompt for NER
prompt = """
Perform named entity recognition on the following text. Identify entities like:
- ORGANIZATION
- PERSON
- LOCATION
- DATE
- PRODUCT
- EVENT

Provide the output as JSON with the following format:
{
    "entities": [
        {"entity": "<entity text>", "type": "<entity type>", "start": <start index>, "end": <end index>}
    ]
}

Ensure that:
- The entity text matches exactly from the input.
- The start and end indices are character offsets in the input text.

Text: "Apple Inc. announced its latest iPhone at a conference in California on September 12, 2023."
"""

# Simulating a call to a language model API
# (replace this with an actual API call, e.g., OpenAI's API)
def mock_language_model_api(prompt):
    # Mocking a response from an LLM; start/end are character offsets
    # into the input sentence
    response = {
        "entities": [
            {"entity": "Apple Inc.", "type": "ORGANIZATION", "start": 0, "end": 10},
            {"entity": "iPhone", "type": "PRODUCT", "start": 32, "end": 38},
            {"entity": "California", "type": "LOCATION", "start": 58, "end": 68},
            {"entity": "September 12, 2023", "type": "DATE", "start": 72, "end": 90}
        ]
    }
    return json.dumps(response)

# Get the model output (in a real scenario, replace this with the API response)
response = mock_language_model_api(prompt)

# Validate the JSON response using Pydantic
try:
    # On Pydantic v2, use NEROutput.model_validate_json(response) instead
    validated_output = NEROutput.parse_raw(response)
    print("Validated Output:", validated_output)
except ValidationError as e:
    print("Validation Error:", e.json())
Validating outputs with Pydantic
- Entity: ensures each entity in the output JSON has the correct structure, including entity, type, start, and end.
- NEROutput: wraps a list of Entity objects to validate the entire JSON structure.
Code outputs
If the response is valid:
Validated Output: entities=[Entity(entity='Apple Inc.', type='ORGANIZATION', start=0, end=10), Entity(entity='iPhone', type='PRODUCT', start=32, end=38), Entity(entity='California', type='LOCATION', start=58, end=68), Entity(entity='September 12, 2023', type='DATE', start=72, end=90)]
If the response is invalid (e.g., missing fields or incorrect types):
Validation Error: [
    {
        "loc": ["entities", 0, "type"],
        "msg": "field required",
        "type": "value_error.missing"
    }
]
Entity recognition in production
Replace mock_language_model_api with an actual API call, such as:
# Requires openai>=1.0; set OPENAI_API_KEY in your environment
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)
response_json = response.choices[0].message.content
Then pass response_json to NEROutput.parse_raw() for validation.
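Because a model can return malformed or schema-violating JSON, production code typically wraps the call in a validate-and-retry loop. Here is a minimal sketch of that pattern, reusing the NEROutput model defined earlier; the call_llm parameter and the retry limit are illustrative, not part of any particular SDK:

from pydantic import ValidationError

def ner_with_retry(call_llm, prompt, max_attempts=3):
    # call_llm: any function that takes a prompt string and returns
    # the model's raw text response
    current_prompt = prompt
    for attempt in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            return NEROutput.parse_raw(raw)
        except ValidationError as e:
            # Feed the validation error back so the model can self-correct
            current_prompt = (
                prompt
                + "\n\nYour previous output failed validation:\n"
                + e.json()
                + "\nReturn only valid JSON matching the schema."
            )
    raise RuntimeError(f"No valid NER output after {max_attempts} attempts")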