How to extract data from PDFs with Python

By Jason Llama

Extracting data from PDFs is a common task in various applications, from data analysis to automated workflows. In this tutorial, we'll explore how to extract data from PDF files using Python. We'll cover several libraries and tools, including PyPDF2, pdfplumber, and Tesseract OCR, providing code snippets and explanations to guide you through the process.

Understanding PDF Structure

PDFs (Portable Document Format) are designed to present documents consistently across platforms. They can contain text, images, tables, and other elements, often in complex layouts. This complexity can make data extraction challenging, as PDFs are not inherently structured for easy data retrieval.

Setting Up Your Environment

Before we begin, ensure you have Python installed on your system. You can download it from the official Python website. We'll use several Python libraries for PDF data extraction:

  • PyPDF2: For basic text extraction.
  • pdfplumber: For more advanced text extraction and layout analysis.
  • pytesseract: For Optical Character Recognition (OCR) on scanned PDFs.
  • pdf2image: To convert PDF pages to images for OCR processing.

Install these libraries using pip:

pip install PyPDF2 pdfplumber pytesseract pdf2image

Additionally, for OCR functionality, you'll need to install Tesseract-OCR on your system. Instructions for installation can be found on the Tesseract GitHub page.

Extracting Text with PyPDF2

PyPDF2 is a pure-Python library that allows for basic PDF operations, including text extraction. Here's how to use it:

import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # extract_text() can return None on pages with no extractable text
            text += (page.extract_text() or "") + "\n"
    return text

pdf_path = 'sample.pdf'  # Replace with your PDF file path
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)

This function opens a PDF file, reads each page, and extracts the text. While PyPDF2 is useful for simple PDFs, it may struggle with complex layouts or PDFs containing images.

Better text extraction with pdfplumber

For more advanced text extraction, especially from PDFs with complex layouts, pdfplumber is a powerful tool. It offers several advantages over PyPDF2:

  • PyPDF2 extracts text sequentially without considering the spatial arrangement, often producing jumbled text if the document has multiple columns or tables; pdfplumber, on the other hand, preserves the positional context of text elements, making it more effective for multi-column layouts and structured data.

  • PyPDF2 sometimes struggles with extracting text from PDFs that use embedded fonts or special character encodings, leading to garbled or missing text; pdfplumber has better support for handling different font encodings and character maps.

  • PyPDF2 does not have built-in support for table extraction. If a table is embedded as text, it is extracted in an unstructured way; pdfplumber includes specific methods to detect and extract tables using heuristics and grid detection techniques.
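
To make the layout point concrete, here is a toy sketch, independent of pdfplumber, showing why grouping words by position matters when reassembling a two-column page. The word tuples and split coordinate are invented for illustration; in practice pdfplumber's extract_words() gives you real word boxes with coordinates:

```python
def reassemble_columns(words, split_x):
    """words: (x, y, text) tuples; returns each column's text in reading order."""
    def read(ws):
        # Sort top-to-bottom, then left-to-right within a line
        return " ".join(t for _, _, t in sorted(ws, key=lambda w: (w[1], w[0])))
    left = [w for w in words if w[0] < split_x]
    right = [w for w in words if w[0] >= split_x]
    return read(left), read(right)

# Word boxes from a two-column page: naive left-to-right reading would interleave them
words = [(10, 0, "Left"), (300, 0, "Right"), (10, 20, "column."), (300, 20, "column.")]
print(reassemble_columns(words, 200))  # → ('Left column.', 'Right column.')
```

A position-blind extractor would read this page as "Left Right column. column."; position-aware grouping recovers each column intact.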

Using pdfplumber is straightforward:

import pdfplumber

def extract_text_with_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += (page.extract_text() or "") + "\n"
    return text

pdf_path = 'complex_sample.pdf'  # Replace with your PDF file path
extracted_text = extract_text_with_pdfplumber(pdf_path)
print(extracted_text)

pdfplumber provides more accurate text extraction and handles complex layouts better than PyPDF2.

Extracting Text from Scanned PDFs Using OCR

Scanned PDFs are essentially images of text, requiring OCR to extract the textual content. We'll use Tesseract OCR in combination with pdf2image to handle this:

from pdf2image import convert_from_path
import pytesseract

def extract_text_from_scanned_pdf(pdf_path):
    text = ""
    images = convert_from_path(pdf_path)
    for image in images:
        text += pytesseract.image_to_string(image) + "\n"
    return text

pdf_path = 'scanned_sample.pdf'  # Replace with your scanned PDF file path
extracted_text = extract_text_from_scanned_pdf(pdf_path)
print(extracted_text)

This approach converts each page of the PDF into an image and then applies OCR to extract the text.
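
OCR accuracy often improves with simple image preprocessing before handing pages to Tesseract. Below is a minimal sketch using Pillow (which pdf2image already depends on): convert to grayscale, then binarize. The threshold value of 150 is an assumption you would tune per document:

```python
from PIL import Image

def preprocess_for_ocr(image, threshold=150):
    """Convert to grayscale, then binarize: dark pixels -> black, light -> white."""
    gray = image.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)

# Would be used as: pytesseract.image_to_string(preprocess_for_ocr(image))
```

Rendering pages at a higher resolution also helps, e.g. convert_from_path(pdf_path, dpi=300).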

Extracting Tables from PDFs

Extracting tables from PDFs is fraught with challenges, primarily because PDFs are designed for accurate rendering rather than structured data representation. Common issues include:

  • Merged Cells: Cells that span multiple rows or columns can disrupt the logical flow of data extraction.
  • Irregular Table Structures: Variations in table layouts, such as nested tables or inconsistent column widths, complicate extraction.
  • Lack of Table Boundaries: Tables without explicit borders or separators make it difficult to distinguish between table data and surrounding text.
  • Multi-line Cells: Cells containing multiple lines of text can be misinterpreted as multiple cells.

To deal with these issues and more, we turn to a commercial service that addresses many edge cases of PDF extraction: AWS Textract. Textract is an OCR service that can identify and extract table structures from both scanned and text-based PDFs. It recognizes table layouts, even in scanned documents, by identifying the bounding boxes of each cell and the relationships between them, and it can handle multi-line and merged cells. Note that the synchronous analyze_document call used below accepts single-page documents; for multi-page PDFs you would use the asynchronous start_document_analysis API with the file staged in S3.

import boto3
import json

def extract_tables_textract(pdf_path):
    textract = boto3.client('textract')
    with open(pdf_path, 'rb') as document:
        response = textract.analyze_document(
            Document={'Bytes': document.read()},
            FeatureTypes=['TABLES']
        )
    return response

pdf_path = "example.pdf"
response = extract_tables_textract(pdf_path)
print(json.dumps(response, indent=2))

Once Textract returns JSON data, you can process it into a Pandas DataFrame. Pandas is a powerful tool for data manipulation and analysis, and it can be used to clean and structure the data for further use.

import pandas as pd

def parse_textract_tables(response):
    """Rebuilds tables from Textract's block graph (TABLE -> CELL -> WORD)."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def children(block, block_type):
        # Follow CHILD relationships to blocks of the requested type
        return [blocks[i] for rel in block.get("Relationships", [])
                if rel["Type"] == "CHILD" for i in rel["Ids"]
                if blocks[i]["BlockType"] == block_type]

    tables = []
    for table in (b for b in response["Blocks"] if b["BlockType"] == "TABLE"):
        rows = {}
        for cell in children(table, "CELL"):
            text = " ".join(w["Text"] for w in children(cell, "WORD"))
            rows.setdefault(cell["RowIndex"], {})[cell["ColumnIndex"]] = text
        tables.append([[rows[r].get(c, "") for c in sorted(rows[r])] for r in sorted(rows)])
    # For demonstration, return a DataFrame built from the first table
    return pd.DataFrame(tables[0] if tables else [])

table_df = parse_textract_tables(response)
table_df.to_csv("extracted_table.csv", index=False)

When tables include merged cells that Textract or other tools might not handle perfectly, you can leverage an LLM like GPT-4 to “clean up” the extracted data. GPT-4 can infer missing cell values based on context.

import openai
from io import StringIO

openai.api_key = "your_openai_api_key"  # Replace with your API key

def reconstruct_merged_cells(table_text):
    prompt = f"""
    The following CSV data from a table has merged cell issues. Please reconstruct the table by inferring and filling in missing cell values:

    {table_text}

    Return the corrected table in CSV format.
    """
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    return response["choices"][0]["message"]["content"]

table_text = table_df.to_csv(index=False)
fixed_table_csv = reconstruct_merged_cells(table_text)

fixed_table_df = pd.read_csv(StringIO(fixed_table_csv))
print(fixed_table_df)
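
When a merged cell simply repeats a label down a column, you don't always need an LLM: a deterministic pandas forward-fill handles the common case. This sketch assumes your extractor represents merged cells as empty strings:

```python
import pandas as pd

def fill_merged_cells(df, columns=None):
    """Forward-fill empty strings left behind by vertically merged cells."""
    cols = columns if columns is not None else df.columns
    out = df.copy()
    out[cols] = out[cols].replace("", pd.NA).ffill()
    return out

df = pd.DataFrame({"region": ["North", "", "", "South"], "sales": [1, 2, 3, 4]})
print(fill_merged_cells(df, ["region"]))
```

Being deterministic, this is cheaper and more predictable than an LLM call, but it only covers the repeat-down pattern; anything subtler still benefits from the GPT-4 approach above.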

LLMs don't solve every problem, and pdfplumber can serve as an excellent alternative or validation tool, particularly for text-based PDFs. It provides granular control over the extraction process.

import pdfplumber

def extract_table_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        table = first_page.extract_table()  # May be None if no table is detected
    return pd.DataFrame(table or [])

pdfplumber_df = extract_table_pdfplumber(pdf_path)
print("pdfplumber Extracted Table:\n", pdfplumber_df)

Once you have a table DataFrame, cleaning up multi-line cells ensures that all cell data appears as a single, coherent line.

def clean_multiline_cells(df):
    return df.applymap(lambda x: " ".join(str(x).splitlines()) if isinstance(x, str) else x)

fixed_table_df = clean_multiline_cells(fixed_table_df)
print(fixed_table_df)

Integrate the above steps into a robust extraction pipeline:

  1. Extract raw table data using AWS Textract.
  2. Process the Textract JSON output into a Pandas DataFrame.
  3. Use GPT-4 to reconstruct and fix issues like merged cells.
  4. Clean multi-line cell issues.
  5. Optionally compare or combine with pdfplumber output for verification.
  6. Save or export the final cleaned table.

from io import StringIO

response = extract_tables_textract(pdf_path)
table_df = parse_textract_tables(response)
table_text = table_df.to_csv(index=False)
fixed_table_csv = reconstruct_merged_cells(table_text)

fixed_table_df = pd.read_csv(StringIO(fixed_table_csv))
fixed_table_df = clean_multiline_cells(fixed_table_df)
pdfplumber_df = extract_table_pdfplumber(pdf_path)
print("pdfplumber Extracted Table:\n", pdfplumber_df)

fixed_table_df.to_csv("cleaned_table.csv", index=False)
print("Final Cleaned Table:\n", fixed_table_df)

Extracting and Inferring PDF Metadata

PDF metadata is a set of information embedded in a PDF file that describes its properties. It is typically stored in the document information dictionary and can include details such as:

  • Title: The name of the document.
  • Author: The creator of the document.
  • Subject: A short description of the document’s content.
  • Keywords: Tags or topics related to the document.
  • Creation Date: When the document was originally created.
  • Modification Date: When the document was last edited.
  • Producer: The software used to create the PDF.
  • Format Version: The PDF version.

Why is this useful? Using the metadata, you can categorize and search large volumes of PDFs. You can also verify the source and history of a document, and provide evidence of document integrity and authorship. Metadata can serve as features for AI models that analyze documents.
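
As a small sketch of the categorization idea, you can build an inverted index from metadata to file paths. The /Keywords field name and comma-separated format are assumptions about how your PDFs were tagged:

```python
def index_by_keyword(metadata_by_file):
    """Maps each keyword to the list of files tagged with it."""
    index = {}
    for path, meta in metadata_by_file.items():
        for kw in (meta.get("/Keywords") or "").split(","):
            kw = kw.strip().lower()
            if kw:
                index.setdefault(kw, []).append(path)
    return index

docs = {"report.pdf": {"/Keywords": "finance, 2023"}, "memo.pdf": {"/Keywords": "finance"}}
print(index_by_keyword(docs))  # → {'finance': ['report.pdf', 'memo.pdf'], '2023': ['report.pdf']}
```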

Now, let’s explore how to extract and infer metadata using Python. First, we'll use the pypdf library, the maintained successor to PyPDF2 (install it with pip install pypdf), to extract metadata from a PDF file:

from pypdf import PdfReader

def extract_metadata(pdf_path):
    """Extracts metadata from a PDF file."""
    reader = PdfReader(pdf_path)
    metadata = reader.metadata
    return metadata if metadata else "No metadata found"

pdf_path = "sample.pdf"  # Replace with your file path
metadata = extract_metadata(pdf_path)
if isinstance(metadata, str):
    print(metadata)  # "No metadata found"
else:
    print("Extracted Metadata:")
    for key, value in metadata.items():
        print(f"{key}: {value}")

This should output something like:

Extracted Metadata:
/Title: Annual Report 2023
/Author: John Doe
/Subject: Company financial overview
/Keywords: finance, report, 2023
/CreationDate: D:20240101090000
/ModDate: D:20240315084500
/Producer: Adobe Acrobat
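
The D:YYYYMMDDHHmmSS strings in /CreationDate and /ModDate follow the PDF date format. Here is a minimal stdlib parser, assuming the full 14-digit form shown above and ignoring any optional timezone suffix:

```python
from datetime import datetime

def parse_pdf_date(raw):
    """Parse a PDF date string like 'D:20240101090000' into a datetime."""
    digits = raw[2:] if raw.startswith("D:") else raw
    return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")

print(parse_pdf_date("D:20240101090000"))  # → 2024-01-01 09:00:00
```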

When metadata is missing, we can infer details from the content using heuristics. For example, we can infer the title by extracting the first few lines of text:

from pypdf import PdfReader

def infer_title_from_content(pdf_path):
    """Guesses the title by extracting the first few lines of text."""
    reader = PdfReader(pdf_path)
    first_page = reader.pages[0]
    text = first_page.extract_text()

    if not text:
        return "No text found on the first page."

    lines = text.split("\n")
    for line in lines:
        if line.strip():  # Return the first non-empty line
            return line.strip()
    return "Title not found"

pdf_path = "sample.pdf"
title = infer_title_from_content(pdf_path)
print(f"Inferred Title: {title}")
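
The same heuristic approach extends to other fields. For instance, a rough regex pass over the first page's text can surface candidate creation dates; the patterns below are illustrative, not exhaustive:

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches "January 2, 2024" style dates and ISO dates like "2024-03-15"
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b|\b\d{{4}}-\d{{2}}-\d{{2}}\b")

def find_candidate_dates(text):
    """Return date-like strings found in the text, in order of appearance."""
    return DATE_RE.findall(text)

print(find_candidate_dates("Published January 2, 2024. Revised 2024-03-15."))
```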

If heuristics fail, we can use a Large Language Model (LLM) like OpenAI’s GPT-4 to infer metadata by analyzing document content. Install the OpenAI client:

pip install openai

Now, we can use the OpenAI API to infer metadata. We feed the first 2,000 characters of the document to the LLM and ask it to extract the metadata. Truncating keeps us well under the token limit, and it relies on the assumption that details like the title and creation date appear in the first few lines of text:

import openai

openai.api_key = "your_openai_api_key"  # Replace with your API key

def extract_metadata_with_gpt4(text):
    """Uses GPT-4 to infer metadata from document content."""
    snippet = text[:2000]  # Limit input before building the prompt to control token usage
    prompt = f"""
    Extract the following metadata from this document:
    - Title
    - Author
    - Subject
    - Creation Date (if mentioned)

    Document Text:
    {snippet}
    """

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150
    )

    return response["choices"][0]["message"]["content"].strip()

pdf_text = extract_text_from_pdf("sample.pdf")  # Reuse the extraction function defined earlier
metadata = extract_metadata_with_gpt4(pdf_text)
print("GPT-4 Inferred Metadata:\n", metadata)

This should output something like:

Title: Annual Financial Review 2023
Author: Jane Smith
Subject: Overview of financial performance in 2023
Creation Date: January 2, 2024

For the best accuracy, we can combine both approaches:

  1. Try extracting embedded metadata from the PDF (pypdf).
  2. If metadata is missing, use heuristics (first-line detection).
  3. If heuristics fail, use GPT-4 to analyze the text content.

def extract_or_infer_metadata(pdf_path):
    """Combines direct extraction, heuristics, and GPT-4 inference."""
    metadata = extract_metadata(pdf_path)

    if metadata == "No metadata found":
        title = infer_title_from_content(pdf_path)
        pdf_text = extract_text_from_pdf(pdf_path)
        metadata = extract_metadata_with_gpt4(pdf_text)

        return {
            "Title": title if title != "Title not found" else "Unknown",
            "GPT-4 Metadata": metadata
        }
    return metadata

pdf_path = "sample.pdf"
final_metadata = extract_or_infer_metadata(pdf_path)
print(final_metadata)

Conclusion

PDF data extraction is a powerful capability that unlocks valuable information trapped in document formats. In this tutorial, we've explored various approaches to extract different types of data from PDFs:

  • Basic text extraction with PyPDF2 for simple documents
  • Advanced text extraction with pdfplumber for handling complex layouts
  • OCR processing with Tesseract for scanned documents
  • Table extraction techniques using AWS Textract, pdfplumber, and even LLM assistance
  • Metadata extraction and inference using PDF headers, heuristics, and AI

Each method has its strengths and limitations. The right approach depends on your specific use case, the structure of your PDFs, and whether you're dealing with native digital PDFs or scanned documents. For production environments, consider combining multiple approaches to create a robust pipeline that can handle various PDF types and edge cases.

With these techniques, you can transform unstructured PDF data into structured formats ready for analysis, visualization, or integration with other systems. This unlocks new possibilities for automating document processing workflows and extracting insights from previously inaccessible information.

By combining these tools and techniques, you can build sophisticated document processing pipelines tailored to your specific needs, whether for data analysis, information retrieval, or document automation workflows.