Data extraction using generative AI [2025 guide]

By Jason Llama

Data extraction is the process of retrieving data from a source and converting it into a structured format. It is a crucial early step in any data pipeline.

In the last 24 months, the performance of data extraction solutions has improved significantly thanks to new generative artificial intelligence (AI) models and large language models (LLMs).

Let's look at why this happened.

What is data extraction?

Data extraction is the process of retrieving data from various sources—whether it's structured, semi-structured, or unstructured—and converting it into a format that can be easily analyzed, processed, or stored. It often serves as the first step in a data pipeline, enabling businesses to gather information from different systems and make informed decisions.

Structured data sources include databases, APIs, and spreadsheets. Semi-structured data sources include JSON or XML files and NoSQL databases. Unstructured data sources include documents, PDFs, emails, images, and web pages.

How is data extracted?

In the early days, data extraction was very manual. Imagine someone sitting down with a stack of documents, copying information by hand into a digital format. This approach worked, but it was incredibly slow and expensive, especially as the amount of data grew.

As the need for speed and efficiency increased, people turned to automation. They developed scripts to pull data from databases using SQL queries and built tools to scrape information from websites. These automated methods quickly became popular because they could handle large volumes of data much faster than manual labor. However, automation wasn't a magic bullet. These systems often struggled when faced with unstructured or semi-structured data, like messy text from web pages or scanned images. They required constant tweaking to keep up with changes in data formats and could miss important context, leading to errors or incomplete data extraction.

This challenge sparked the evolution toward more intelligent solutions. Natural language processing (NLP) techniques attempted to understand and interpret human language, making it possible to extract meaning and context from unstructured text, if only in a limited way.

Early NLP systems were rule-based and struggled with the rich complexity of human language: ambiguity, idioms, and irony were not easily handled. The statistical models that followed depended heavily on large amounts of annotated data and could not generalize well beyond it.

The latest breakthrough, over the last 24 months, has been the rise of generative AI. Powered by the Transformer architecture, these models learn from planet-scale amounts of data (essentially the whole internet). They can not only extract data but also understand the context in which it appears, which is especially useful for unstructured data like text, images, and videos.

In reality, any data extraction system will combine both rule-based and generative AI approaches, with a layer of human-in-the-loop overseeing the process. All of these techniques come together to help us manage the challenges of data extraction.

What are the challenges of data extraction?

First, you need to control for inaccuracies, missing values, and duplications: data integrity is crucial for trustworthy decision-making. Combining data from multiple sources with different formats and structures can also be complex.

Unstructured inputs, such as text documents, images, or multimedia content, require advanced techniques and tools capable of interpreting and organizing such data effectively.

Extracted data may vary in format, structure, or accuracy because different sources follow different standards. Web pages frequently update their layouts or rely on JavaScript-heavy content, and the PDFs your customers and suppliers send may follow no consistent format at all.

You'll also face technical challenges not related to AI, including:

  • Anti-Scraping Mechanisms – Websites often implement measures like CAPTCHA, IP blocking, rate limiting, and JavaScript rendering to prevent automated data extraction.

  • Scalability – Extracting large volumes of data efficiently requires handling infrastructure, concurrency, and performance optimization.

  • Data Quality and Cleaning – Extracted data may contain duplicates, missing values, or incorrect information, requiring preprocessing before use.

  • API Limitations – When using APIs for extraction, limitations like rate limits, authentication, and restricted endpoints can be barriers.

  • Multilingual and Encoding Issues – Extracting and processing data in multiple languages and different character encodings can be challenging.

  • Storage and Processing – Handling large datasets will break the bank, so you need to think about hot and cold storage, parallel processing, tracking data freshness and caching data that you do not need to process again.

As data volumes grow exponentially, pipelines that worked at small scale may break under massive datasets. You also need to think about legal frameworks and ethical standards, particularly when handling sensitive or personal information. Some data sources are protected by terms of service, privacy laws (like GDPR), or copyright restrictions, limiting what can be extracted.

On top of all these challenges, timeliness matters: extraction has to finish soon enough for the results to still be useful. This means that building and maintaining data extraction processes may require skilled personnel, software tools, and hardware infrastructure, all of which can be costly.

ROI needs to be carefully calculated. You don't want to invest all the time and money just to discover that it's cheaper for you to outsource the data extraction to a firm with cheaper labor.

What are the use cases for data extraction in 2025?

Generative AI and Large Language Models (LLMs) have unlocked numerous data extraction use cases across various industries, significantly improving efficiency and accuracy.

  • In finance, they help extract critical information from invoices, such as vendor details, invoice numbers, dates, and amounts. They also assist in processing loan applications by extracting relevant features from user-written text, analyzing financial reports and contracts to derive insights like debt-to-asset ratios, and automating receipt data extraction for expense management.
  • In the insurance industry, they facilitate the extraction of policyholder details and policy numbers from insurance cards and streamline the claims process.
  • In retail, they analyze unstructured data from customer reviews and purchase histories, helping businesses identify trends in customer preferences and top-performing products.
  • In the legal sector, these models streamline data extraction from contracts and legal documents while quickly identifying relevant information in large volumes of legal text.
  • Within healthcare, generative AI aids in processing and analyzing medical forms and extracting data from scientific literature for research purposes.

How do you set up a data extraction workflow?

We'll walk through a data extraction workflow using a market research example, focusing on analyzing 10-K filings, company presentations, earnings transcripts, and news articles to create a company analysis deck.

Define the Scope

Before setting up extraction, define what insights you need:

  • ✅ Financial performance (Revenue, EBITDA, growth trends)
  • ✅ Strategic moves (M&A, partnerships, market expansion)
  • ✅ Industry trends and competitor positioning
  • ✅ Management sentiment (earnings call tone analysis)

Identify Data Sources & Access Methods

Source | Type | Access Method
------ | ---- | -------------
SEC Filings (10-K, 10-Q) | Structured (HTML, XML, PDF) | SEC EDGAR API, XBRL parsing, PDF extraction
Company Presentations | Semi-structured (PDF, PPT) | Download from IR websites, PDF parsing
Earnings Transcripts | Text (TXT, PDF) | Seeking Alpha, Nasdaq, AI transcription
News Articles | Unstructured (HTML) | Web scraping (BeautifulSoup, Scrapy), news APIs (Google News, Bing API)

SEC Filings (10-K, 10-Q)

To get financial filings, connect to the SEC EDGAR database. You'll need to:

  • Access the SEC EDGAR API with proper headers
  • Search for specific company filings using their CIK number
  • Download and parse the HTML/XML content
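
Here's a minimal sketch of those three steps in Python using the requests library against the SEC's public submissions endpoint. The User-Agent value is a placeholder (the SEC asks automated clients to identify themselves), and error handling is omitted:

import requests

# The SEC asks automated clients to identify themselves in the User-Agent.
HEADERS = {"User-Agent": "Your Name your.email@example.com"}

def get_recent_10k_urls(cik: str) -> list[str]:
    """Return document URLs for a company's recent 10-K filings."""
    # The submissions endpoint expects a zero-padded, 10-digit CIK.
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    data = requests.get(url, headers=HEADERS, timeout=30).json()

    recent = data["filings"]["recent"]
    urls = []
    for form, accession, doc in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"]
    ):
        if form == "10-K":
            folder = accession.replace("-", "")
            urls.append(
                f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{folder}/{doc}"
            )
    return urls

# Example usage: Apple's CIK is 0000320193
print(get_recent_10k_urls("0000320193"))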

Earnings Call Transcripts

For earnings calls, you can:

  • Access transcripts through financial websites or paid services
  • Use web scraping to collect the transcript text
  • Parse the content to separate speakers and dialogue
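
As an illustration, here's a hedged sketch using requests and BeautifulSoup. The URL and the speaker-detection heuristic are assumptions; real transcript pages differ by provider, and many restrict scraping in their terms of service:

import requests
from bs4 import BeautifulSoup

def fetch_transcript(url: str) -> list[dict]:
    """Scrape a transcript page and split it into speaker turns."""
    html = requests.get(url, headers={"User-Agent": "research-bot 1.0"}, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    turns, speaker = [], None
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        # Heuristic: short paragraphs ending in a colon are speaker labels.
        if text.endswith(":") and len(text) < 60:
            speaker = text.rstrip(":")
        elif speaker and text:
            turns.append({"speaker": speaker, "text": text})
    return turns

# Hypothetical URL; substitute a page you are permitted to scrape.
turns = fetch_transcript("https://example.com/acme-q4-2024-earnings-call")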

Company Presentations

For company presentations:

  • Download PDFs from company Investor Relations pages
  • Use PDF parsing tools to extract text and data
  • Organize the content by slides or sections
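
For example, here's a small sketch using the pypdf library (the filename is hypothetical). Note that extraction quality varies a lot with how the PDF was produced:

from pypdf import PdfReader

def extract_slides(pdf_path: str) -> dict[int, str]:
    """Extract text from a presentation PDF, keyed by page (slide) number."""
    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only slides.
    return {i + 1: page.extract_text() or "" for i, page in enumerate(reader.pages)}

slides = extract_slides("q4_2024_investor_deck.pdf")  # hypothetical file
print(slides[1][:200])  # first 200 characters of the title slide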

News Articles

For news coverage:

  • Use RSS feeds or news APIs to get recent articles
  • Filter for relevant company news
  • Extract headlines, dates, and article content
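
Here's a minimal sketch using the feedparser library. The Google News RSS query is an assumption you should verify against the feed you actually use:

import feedparser

# Hypothetical query; Google News exposes per-search RSS feeds like this.
FEED_URL = "https://news.google.com/rss/search?q=%22Acme+Corp%22"

def fetch_news(feed_url: str, keyword: str) -> list[dict]:
    """Pull recent articles from an RSS feed, keeping keyword matches."""
    feed = feedparser.parse(feed_url)
    return [
        {"title": e.title, "date": e.get("published", ""), "link": e.link}
        for e in feed.entries
        if keyword.lower() in e.title.lower()
    ]

for article in fetch_news(FEED_URL, "Acme"):
    print(article["date"], article["title"])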

Data Cleaning & Transformation

Several things you can do to clean and transform the data:

  • Remove boilerplate text (disclaimers, footnotes).
  • Extract key financial metrics using regex.
  • Sentiment analysis of earnings calls.
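
For the regex route, here are a couple of illustrative patterns, keyed to phrasing like "Revenue increased 12% YoY to $394.3M". Treat them as starting points, not production-ready extractors; real filings vary widely:

import re

PATTERNS = {
    "revenue": re.compile(r"revenue[^.$]*\$([\d.,]+)\s*([MB])", re.IGNORECASE),
    "ebitda_margin": re.compile(r"EBITDA margin[^.]*?([\d.]+)%", re.IGNORECASE),
}

def extract_metrics(text: str) -> dict[str, str]:
    """Pull the first match for each metric pattern out of a text blob."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            metrics[name] = "".join(match.groups())
    return metrics

print(extract_metrics("Revenue increased 12% YoY to $394.3M"))
# {'revenue': '394.3M'}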

Here's an example prompt to clean financial data using an LLM:

import json

def clean_financial_text(text):
    prompt = """
    You are a financial data extraction expert. Clean the following text by:
    1. Removing standard disclaimers and footnotes
    2. Extracting key financial metrics (Revenue, EBITDA, Growth Rate)
    3. Analyzing the sentiment of the language used

    Format the output as JSON with these keys:
    - cleaned_text: The text with boilerplate removed
    - metrics: Dictionary of financial metrics
    - sentiment: Overall sentiment score (-1 to 1)

    Text to clean:
    {text}
    """

    # `llm` stands in for your LLM client of choice; swap in the completion
    # call of whichever SDK you actually use.
    response = llm.generate(prompt.format(text=text))
    # The prompt requests JSON, so parse the model's reply before returning.
    return json.loads(response)

# Example usage
raw_text = """
Forward-Looking Statements
This presentation contains forward-looking statements...

Q4 2024 Results:
Revenue increased 12% YoY to $394.3M
EBITDA margin expanded to 28.5%
Strong momentum in enterprise segment
"""

cleaned_data = clean_financial_text(raw_text)
# Returns a dict with the cleaned text, metrics, and sentiment

This approach leverages LLMs to intelligently:

  • Identify and remove standard legal disclaimers
  • Extract numerical metrics in a consistent format
  • Assess the overall tone and sentiment of the text

Store & Organize Extracted Data

Data Type | Storage Option
--------- | --------------
Text Data (Filings, Transcripts, News) | PostgreSQL, MongoDB
Extracted Metrics (Revenue, EBITDA) | CSV, Pandas DataFrame
PDF/Raw Files | AWS S3, Google Drive

Automate the Workflow

  • Scheduling: Use cron jobs, Apache Airflow, trigger.dev, or Prefect, as in the sketch after this list.
  • Real-time alerts: Use Slack/Webhooks/email alerts for news updates.
  • Pipeline Execution: Sequential execution in a script or orchestration tool.
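
As one scheduling option, here's a minimal Apache Airflow DAG sketch; the task body and the daily schedule are assumptions to adapt:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_extraction():
    # Placeholder: call the SEC, transcript, presentation, and news steps here.
    pass

with DAG(
    dag_id="market_research_extraction",
    start_date=datetime(2025, 1, 1),
    schedule="0 6 * * *",  # daily at 06:00; use schedule_interval on older Airflow
    catchup=False,
):
    PythonOperator(task_id="extract", python_callable=run_extraction)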

Generate a PowerPoint Report

Use python-pptx to create slides with key findings.
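
A minimal sketch with python-pptx, reusing the example findings from above; the titles and output filename are placeholders:

from pptx import Presentation

prs = Presentation()

# Title slide: layout 0 of the default template has a title and subtitle.
slide = prs.slides.add_slide(prs.slide_layouts[0])
slide.shapes.title.text = "Acme Corp: Company Analysis"
slide.placeholders[1].text = "Q4 2024 Market Research"

# "Title and Content" slide (layout 1) for the key findings.
findings = [
    "Revenue increased 12% YoY to $394.3M",
    "EBITDA margin expanded to 28.5%",
]
slide = prs.slides.add_slide(prs.slide_layouts[1])
slide.shapes.title.text = "Key Findings"
body = slide.placeholders[1].text_frame
body.text = findings[0]
for finding in findings[1:]:
    body.add_paragraph().text = finding

prs.save("company_analysis.pptx")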

Final thoughts

  • Prediction: Future models will use self-supervised learning to autonomously adapt to new formats and structures, significantly reducing the need for human intervention in data extraction from PDFs, spreadsheets, and messy text.

  • Prediction: Enhanced multi-modal LLMs will seamlessly extract insights by combining text with images, diagrams, and videos—allowing automated extraction of financial data from scanned documents, or scientific insights from research papers with embedded charts.

  • Prediction: AI agents will become capable of autonomous, continuous data extraction—detecting new, relevant data sources, dynamically adjusting extraction methods, and integrating extracted data into structured databases or knowledge graphs without human input.
