AI Pipelines

Web Scraping with AWS Lambda

By Jason Llama

Web scraping is a powerful way to extract data from websites. There are many tools and services that can do this, but AWS Lambda and S3 are a great way to automate the process with fine-grained control over the data you scrape and your costs.

Let's get started!

What is AWS Lambda?

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. You pay only for the compute time you consume - there is no charge when your code is not running.

It's useful for web scraping because each Lambda function can handle a single task, you can chain functions together to build a pipeline, and you can schedule them to run at specific intervals.
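As a rough sketch of that model (the event shape here is an assumption; this guide later passes {"url": "..."} from EventBridge), every function is just a handler that receives an event and does one unit of work:

# Minimal Lambda handler skeleton: one small task per function.
# The event payload is supplied by whatever triggers the function,
# e.g. an EventBridge rule passing {"url": "https://example.com"}.
def lambda_handler(event, context):
    url = event.get("url", "https://example.com")  # example default
    # ... fetch, parse, or store one unit of work here ...
    return {"status": "ok", "url": url}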

What is AWS S3?

AWS S3 is an object storage service that lets you store and retrieve any amount of data from anywhere on the internet. It's well suited to unstructured data like raw web pages.
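As a quick sketch of how the scraper will use it (assuming boto3 with valid AWS credentials; the bucket name is the example used throughout this guide):

import boto3

s3 = boto3.client("s3")

# Store a scraped page as an object...
s3.put_object(
    Bucket="web-scraper-ai",
    Key="example.com.html",
    Body="<html>...</html>",
    ContentType="text/html",
)

# ...and read it back later, e.g. in a downstream pipeline step.
obj = s3.get_object(Bucket="web-scraper-ai", Key="example.com.html")
html = obj["Body"].read().decode("utf-8")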

Set Up an S3 Bucket

  1. AWS S3 → Click "Create bucket" → Name the bucket (e.g., web-scraper-ai) and choose the desired region.
  2. Configure permissions so the Lambda function can write to the bucket. Open the bucket, go to the Permissions tab, and set the Bucket Policy to the following. Replace "your-aws-account-id" with your AWS account ID (a scripted alternative is sketched after the policy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::web-scraper-ai/*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "your-aws-account-id"
                }
            }
        }
    ]
}
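If you'd rather script this step than click through the console, here is a minimal sketch using boto3 (the bucket name, region, and account ID placeholder are the example values from this guide):

import json
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # example region

# Create the bucket (outside us-east-1 you must also pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"}).
s3.create_bucket(Bucket="web-scraper-ai")

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::web-scraper-ai/*",
        "Condition": {"StringEquals": {"aws:SourceAccount": "your-aws-account-id"}}
    }]
}

# Attach the policy shown above.
s3.put_bucket_policy(Bucket="web-scraper-ai", Policy=json.dumps(bucket_policy))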

Create the Lambda Function

  1. Go to the Lambda console: Navigate to the AWS Lambda service.
  2. Click "Create function":
    • Choose "Author from scratch."
    • Name the function (e.g., web-scraper).
    • Runtime: Select "Python 3.x" (latest available version).
    • Execution role: Choose "Create a new role with basic Lambda permissions." We will add S3 permissions to it in a later step.
  3. Add the scraping code: Replace the default code with the following example. Modify it as needed for your use case:
import requests
import boto3
from datetime import datetime

# Initialize S3 client
s3 = boto3.client('s3')

# Default list of URLs to scrape (used when no URL is passed in the event)
URLS = [
    "https://example.com",
    "https://another-example.com",
]

BUCKET_NAME = "your-s3-bucket-name"

def lambda_handler(event, context):
    # If an EventBridge rule passes {"url": "..."} as input, scrape just
    # that URL; otherwise fall back to the default list above.
    urls = [event["url"]] if event and "url" in event else URLS

    for url in urls:
        try:
            # Fetch the HTML content
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise an error for bad status codes

            # Build an S3 key from the URL and a UTC timestamp
            timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
            filename = f"{url.replace('https://', '').replace('/', '_')}-{timestamp}.html"

            # Upload the raw HTML to S3
            s3.put_object(
                Bucket=BUCKET_NAME,
                Key=filename,
                Body=response.text,
                ContentType="text/html"
            )
            print(f"Successfully scraped and uploaded: {url}")

        except Exception as e:
            print(f"Error scraping {url}: {str(e)}")

Tip: Replace "your-s3-bucket-name" with the name of your S3 bucket. Note that the requests library is not included in the Lambda Python runtime, so bundle it in your deployment package or add it as a Lambda layer.

  4. Configure Environment Variables (Optional):
    • Add the S3 bucket name or URL list as environment variables for easier management and to avoid hardcoding them in the code (see the sketch after this list).
  5. Save and deploy the function.
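For instance, a minimal sketch of reading those values in the handler instead of hardcoding them (the variable names BUCKET_NAME and URLS are assumptions; set them under the function's Configuration → Environment variables):

import os

# Read configuration from Lambda environment variables,
# keeping the hardcoded values as fallbacks.
BUCKET_NAME = os.environ.get("BUCKET_NAME", "your-s3-bucket-name")
URLS = os.environ.get("URLS", "https://example.com").split(",")  # comma-separated list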

Assign Necessary Permissions

Ensure your Lambda function has the necessary permissions to write to the S3 bucket:

  1. Go to the IAM console.
  2. Find the role attached to your Lambda function.
  3. Edit the role to include the following policy:

Replace "your-s3-bucket-name" with your actual bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::your-s3-bucket-name/*"
        }
    ]
}
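If you prefer to attach the policy from code rather than the IAM console, here is a minimal sketch with boto3 (the role and policy names are assumptions; use the execution role actually attached to your function):

import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::your-s3-bucket-name/*"
    }]
}

# Attach the policy shown above as an inline policy on the Lambda execution role.
iam.put_role_policy(
    RoleName="web-scraper-role",        # assumption: your function's execution role
    PolicyName="web-scraper-s3-write",  # assumption: any descriptive name works
    PolicyDocument=json.dumps(policy)
)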

Create EventBridge Rules

  1. Go to the Amazon EventBridge console and open Rules.
  2. Click "Create rule". (A scripted alternative is sketched at the end of this section.)
  3. For each URL, repeat the following steps:
    1. Define the Rule:
      • Name the rule (e.g., ScrapeExampleCom).
      • Event Source: Choose "Schedule".
      • Schedule Expression:
        • For monthly scraping: cron(0 0 1 * ? *) (midnight UTC on the 1st of each month).
        • For other frequencies: Use a rate or cron expression.
  4. Add Input Parameters:
    • Under Define Target, select AWS Service → Lambda function.
    • Choose the web-scraper Lambda function.
    • Expand Additional Settings → Input:
      • Select Constant (JSON text).

      • Add the URL in JSON format, for example:

        {
            "url": "https://example.com"
        }
        
  5. Save the rule.

Repeat this for each URL with a unique EventBridge rule.
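If you have many URLs, clicking through the console for each rule gets tedious. Here is a minimal sketch of the same setup with boto3 (the rule name, function ARN, and URL are examples; the function also needs a resource-based permission allowing events.amazonaws.com to invoke it, added here with add_permission):

import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:web-scraper"  # example ARN
url = "https://example.com"
rule_name = "ScrapeExampleCom"

# Monthly schedule: midnight UTC on the 1st of each month.
rule_arn = events.put_rule(
    Name=rule_name,
    ScheduleExpression="cron(0 0 1 * ? *)"
)["RuleArn"]

# Allow EventBridge to invoke the function (one statement per rule).
lambda_client.add_permission(
    FunctionName="web-scraper",
    StatementId=f"{rule_name}-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)

# Target the Lambda function and pass the URL as constant JSON input.
events.put_targets(
    Rule=rule_name,
    Targets=[{"Id": "1", "Arn": FUNCTION_ARN, "Input": json.dumps({"url": url})}]
)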

Test the Setup

  1. Manually test the Lambda function:
    • Go to the Lambda console and click "Test."
    • Create a test event (e.g., an empty event {}, or {"url": "https://example.com"} to scrape a single page) and invoke the function.
    • Check the S3 bucket for the uploaded HTML files (a scripted check is sketched after this list).
  2. Verify the EventBridge rule:
    • Confirm that the rule is listed under the "Triggers" tab in the Lambda function.
    • Wait for the scheduled time to ensure it runs automatically.
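A quick way to run both checks from your own machine, as a sketch (assuming your local AWS credentials can invoke the function and read the bucket; the names are the examples from this guide):

import json
import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

# Invoke the function with a test event, the same shape EventBridge sends.
response = lambda_client.invoke(
    FunctionName="web-scraper",
    Payload=json.dumps({"url": "https://example.com"}),
)
print("Status code:", response["StatusCode"])

# List what the function wrote to the bucket.
objects = s3.list_objects_v2(Bucket="your-s3-bucket-name")
for obj in objects.get("Contents", []):
    print(obj["Key"], obj["Size"])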

Optimise your scraping with Datograde

When building web scrapers, it's crucial to ensure your data quality remains high over time. Datograde helps you track and evaluate your web scraping results with:

  1. Visual Debugging: Compare scraped content side-by-side with the original webpage to quickly spot any mismatches or formatting issues.

  2. Automated Testing: Track your scraper's performance with just a few lines of code:

from datograde import Datograde, Displays

datograde = Datograde(api_key="YOUR_API_KEY")

# After scraping
datograde.attempt(
    project="Web Scraping Monitor",
    project_name=url,
    files=[
        (scraped_content, Displays.HTML),
        (response.text, Displays.HTML)  # Original page
    ]
)

  3. Quality Monitoring: Set up continuous evaluation of your scraped data to catch issues like:

    • Missing content due to website changes
    • Malformed data structures
    • Rate limiting or blocking detection
    • HTML parsing errors
  4. Performance Tracking: Monitor success rates, response times, and data completeness across all your scraped websites.

Get started with Datograde to ensure your web scraping pipeline maintains high quality data collection.
