Analyzing Blog Traffic with CloudFront Logs and Athena

A detailed guide on setting up CloudFront logging with Athena to analyze blog traffic patterns and user behavior using CDK

In this guide, we'll explore how to set up and analyze CloudFront logs using Amazon Athena to gain insights into blog traffic patterns. We'll use CDK to create the necessary infrastructure and write queries to extract meaningful data about how users interact with our blog.

Overview

The solution consists of several components:

  • CloudFront Distribution with logging enabled
  • S3 bucket for log storage
  • Athena table definition matching the CloudFront log format
  • Predefined queries for common analytics needs

The request and logging flow: the client hits the CloudFront Distribution, which serves content from the S3 Site Bucket and delivers access logs to an S3 Logging Bucket. A Glue Database describes those logs so that Athena queries can read them, with query output written to a dedicated Athena Results Bucket.

Infrastructure Setup

First, we'll configure CloudFront to write logs to an S3 bucket. In our website-construct.ts:

this.distribution = new Distribution(this, "Distribution", {
  // ... other configuration ...
  logBucket: props.logBucket,
  logFilePrefix: "cloudfront-logs/",
  logIncludesCookies: true,
});

Glue Table and Partitioning

When working with CloudFront logs, proper table configuration in AWS Glue is crucial for efficient querying. Let's look at how we set up our Glue table to handle the log data:

Table Structure

The Glue table is configured to match CloudFront's log format exactly:

new CfnTable(this, "CloudFrontLogsTable", {
  catalogId: Stack.of(this).account,
  databaseName: database.ref,
  tableInput: {
    name: "cloudfront_logs",
    description: `CloudFront logs for ${props.environment} ${props.domainName}`,
    tableType: "EXTERNAL_TABLE",
    parameters: {
      "skip.header.line.count": "2",
    },
    storageDescriptor: {
      location: `s3://${props.logBucket.bucketName}/cloudfront-logs/`,
      inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
      outputFormat:
        "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      serdeInfo: {
        serializationLibrary:
          "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        parameters: {
          "field.delim": "\t",
          "serialization.format": "\t",
          "timestamp.formats": "yyyy-MM-dd HH:mm:ss",
          "ignore.malformed.fields": "true",
        },
      },
      columns: [
        { name: "date", type: "date" },
        { name: "time", type: "string" },
        { name: "location", type: "string" },
        { name: "bytes", type: "bigint" },
        { name: "request_ip", type: "string" },
        { name: "method", type: "string" },
        { name: "host", type: "string" },
        { name: "uri", type: "string" },
        { name: "status", type: "int" },
        { name: "referrer", type: "string" },
        { name: "user_agent", type: "string" },
        { name: "query_string", type: "string" },
        { name: "cookie", type: "string" },
        { name: "result_type", type: "string" },
        { name: "request_id", type: "string" },
        { name: "host_header", type: "string" },
        { name: "request_protocol", type: "string" },
        { name: "request_bytes", type: "bigint" },
        { name: "time_taken", type: "float" },
        { name: "xforwarded_for", type: "string" },
        { name: "ssl_protocol", type: "string" },
        { name: "ssl_cipher", type: "string" },
        { name: "response_result_type", type: "string" },
        { name: "http_version", type: "string" },
        { name: "fle_status", type: "string" },
        { name: "fle_encrypted_fields", type: "int" },
        { name: "c_port", type: "int" },
        { name: "time_to_first_byte", type: "float" },
        { name: "x_edge_detailed_result_type", type: "string" },
        { name: "sc_content_type", type: "string" },
        { name: "sc_content_len", type: "bigint" },
        { name: "sc_range_start", type: "bigint" },
        { name: "sc_range_end", type: "bigint" },
      ],
    },
  },
});

Flat vs Partitioned Storage

In this implementation, our CloudFront logs are stored in a flat structure directly in the S3 bucket. While it's possible to use partitioning with CloudFront logs (by year/month/day), we've chosen a simpler approach since:

  1. Our log volume is moderate
  2. Our queries typically look at recent data (last 7 days)
  3. We're using date-based filtering in our WHERE clauses

If you were dealing with much larger volumes of data or needed to frequently query historical data, you might consider using partitioning with a structure like:

s3://bucket/cloudfront-logs/year=2024/month=01/day=11/

This would require additional configuration in the Glue table:

parameters: {
  "projection.enabled": "true",
  "projection.year.type": "integer",
  "projection.year.range": "2024,2026",
  "projection.month.type": "integer",
  "projection.month.range": "1,12",
  "projection.month.digits": "2", // zero-pad to match month=01
  "projection.day.type": "integer",
  "projection.day.range": "1,31",
  "projection.day.digits": "2", // zero-pad to match day=11
},
partitionKeys: [
  { name: "year", type: "string" },
  { name: "month", type: "string" },
  { name: "day", type: "string" },
],

For our use case, the simpler flat structure provides good performance while keeping the configuration straightforward.
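To make the partitioned layout concrete, here is a small, hypothetical helper (not part of the stack) that derives the Hive-style prefix for a given date; the bucket name is a placeholder:

```typescript
// Hypothetical helper: build the Hive-style partition prefix that a
// date-partitioned CloudFront log layout would use.
function partitionPrefix(base: string, d: Date): string {
  const year = d.getUTCFullYear();
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  const day = String(d.getUTCDate()).padStart(2, "0");
  return `${base}/year=${year}/month=${month}/day=${day}/`;
}

console.log(partitionPrefix("s3://bucket/cloudfront-logs", new Date(Date.UTC(2024, 0, 11))));
// → s3://bucket/cloudfront-logs/year=2024/month=01/day=11/
```

The zero-padding in the helper mirrors the projection.*.digits settings: Athena only matches partitions whose S3 path segments have exactly that form.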

Athena Workgroup

Next, we create an Athena workgroup whose name includes the environment and domain name, which makes it easy to identify and manage the workgroup for each environment.

// Create Athena Workgroup with environment-specific name
const workgroup = new CfnWorkGroup(this, "CloudFrontLogsWorkgroup", {
  name: resourcePrefix,
  description: `Workgroup for analyzing CloudFront logs for ${props.environment} ${props.domainName}`,
  recursiveDeleteOption: true,
  workGroupConfiguration: {
    resultConfiguration: {
      outputLocation: `s3://${this.athenaResultsBucket.bucketName}/`,
    },
    publishCloudWatchMetricsEnabled: true,
    enforceWorkGroupConfiguration: true,
    engineVersion: {
      selectedEngineVersion: "Athena engine version 3",
    },
  },
});

Understanding CloudFront Log Format

CloudFront logs are tab-delimited text files with a fixed structure. Each file begins with a header line listing the fields:

#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end

Our Glue table maps these, in order, to the friendlier column names defined above (for example, x-edge-location becomes location and c-ip becomes request_ip).

The skip.header.line.count parameter is set to "2" because CloudFront log files contain two header lines:

  1. The version header
  2. The field names header
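Those two lines are exactly what a reader of these files has to discard. As a rough sketch (a hypothetical helper, not part of the stack), parsing a log file by hand would look like:

```typescript
// Hypothetical minimal reader for a CloudFront standard log file:
// drop the two leading comment lines (#Version and #Fields) and
// split the remaining tab-delimited rows into fields.
function parseCloudFrontLog(contents: string): string[][] {
  return contents
    .split("\n")
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .map((line) => line.split("\t"));
}

// Truncated, made-up sample with both header lines and one data row.
const sample = [
  "#Version: 1.0",
  "#Fields: date time x-edge-location sc-bytes",
  "2024-01-11\t12:34:56\tIAD89-C1\t1234",
].join("\n");

console.log(parseCloudFrontLog(sample));
// → [["2024-01-11", "12:34:56", "IAD89-C1", "1234"]]
```

The skip.header.line.count setting performs the same skip on the Athena side, and the tab delimiter matches the field.delim SerDe parameter.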

Handling Malformed Data

CloudFront logs can sometimes contain malformed entries. We handle this gracefully with:

parameters: {
  "ignore.malformed.fields": "true",
}

This ensures our queries don't fail if they encounter malformed log entries.

Query Optimization

To optimize query performance and cost, we've structured our queries to:

  1. Filter by date first to minimize data scanned
  2. Exclude irrelevant paths and file types
  3. Use appropriate WHERE clauses

For example, the "Top Pages" query shown in full in the Analyzing Blog Traffic section below applies all three of these: it limits the date range to the last 7 days, focuses only on blog post URLs, excludes asset requests and text versions of pages, and filters out single-hit entries that are likely bot traffic.

Cost Management

Since Athena charges based on the amount of data scanned, efficient table design and query patterns are essential. Some tips:

  1. Use date-based filtering in your WHERE clause
  2. Avoid SELECT * queries
  3. Only include necessary columns in your queries
  4. Consider using Athena workgroups with query result reuse enabled

By following these practices, we can efficiently analyze our blog traffic while keeping costs under control.

Analyzing Blog Traffic

With our infrastructure in place, we can now analyze blog traffic patterns. Here are some useful queries we've implemented:

Top Pages

This query shows the most visited blog posts, excluding asset requests:

SELECT
    uri,
    COUNT(*) as views,
    COUNT(DISTINCT request_ip) as unique_visitors,
    AVG(time_taken) as avg_time_taken
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
    AND uri LIKE '/blog/%'
    AND uri NOT LIKE '%.txt'
    AND uri NOT LIKE '%._next/%'
    AND uri NOT LIKE '%.png'
    AND uri NOT LIKE '%.jpg'
    AND uri NOT LIKE '%.ico'
    AND uri NOT LIKE '%.js'
    AND uri NOT LIKE '%.css'
    AND uri NOT LIKE '%.woff2'
    AND uri NOT LIKE '%.json'
    AND uri NOT LIKE '%.xml'
GROUP BY uri
HAVING COUNT(*) > 1
ORDER BY views DESC;

This query:

  1. Filters for blog post URLs
  2. Excludes asset requests (images, stylesheets, etc.)
  3. Counts total views and unique visitors
  4. Calculates average load time

Traffic Patterns by Hour

To understand when our blog receives the most traffic:

SELECT
    HOUR(PARSE_DATETIME(time, 'HH:mm:ss')) as hour_utc,
    COUNT(*) as requests,
    AVG(time_taken) as avg_response_time
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY HOUR(PARSE_DATETIME(time, 'HH:mm:ss'))
ORDER BY hour_utc;

Saving Common Queries

We store these queries in Athena for easy reuse:

const queries = [
  {
    name: `Top Pages - ${props.environment}`,
    description: `Shows most visited pages excluding assets for ${props.environment}`,
    query: `...`,
  },
  {
    name: `Traffic by Hour - ${props.environment}`,
    description: `Shows traffic patterns by hour for ${props.environment}`,
    query: `...`,
  },
  // Additional queries...
];
 
queries.forEach((queryDef, index) => {
  new CfnNamedQuery(this, `SavedQuery${index}`, {
    database: database.ref,
    description: queryDef.description,
    name: queryDef.name,
    queryString: queryDef.query,
    workGroup: workgroup.name,
  });
});

Conclusion

By combining CloudFront logs with Athena, we've created a powerful analytics system for understanding our blog's traffic patterns. This setup allows us to:

  • Track popular content
  • Understand user behavior
  • Monitor performance
  • Make data-driven decisions about content strategy