Analyzing Blog Traffic with CloudFront Logs and Athena

A detailed guide on setting up CloudFront logging with Athena to analyze blog traffic patterns and user behavior using CDK

In this guide, we'll explore how to set up and analyze CloudFront logs using Amazon Athena to gain insights into blog traffic patterns. We'll use CDK to create the necessary infrastructure and write queries to extract meaningful data about how users interact with our blog.

Overview

The solution consists of several components:

  • CloudFront Distribution with logging enabled
  • S3 bucket for log storage
  • Athena table definition matching the CloudFront log format
  • Predefined queries for common analytics needs

The request and logging flow: the client hits the CloudFront Distribution, which serves content from the S3 Site Bucket and delivers access logs to an S3 Logging Bucket. A Glue Database describes those logs so that Athena queries can read them, with query output written to a dedicated Athena Results Bucket.

Infrastructure Setup

First, we'll configure CloudFront to write logs to an S3 bucket. In our website-construct.ts:

this.distribution = new Distribution(this, "Distribution", {
  // ... other configuration ...
  logBucket: props.logBucket,
  logFilePrefix: "cloudfront-logs/",
  logIncludesCookies: true,
});

Glue Table and Partitioning

When working with CloudFront logs, proper table configuration in AWS Glue is crucial for efficient querying. Let's look at how we set up our Glue table to handle the log data:

Table Structure

The Glue table is configured to match CloudFront's log format exactly:

new CfnTable(this, "CloudFrontLogsTable", {
  catalogId: Stack.of(this).account,
  databaseName: database.ref,
  tableInput: {
    name: "cloudfront_logs",
    description: `CloudFront logs for ${props.environment} ${props.domainName}`,
    tableType: "EXTERNAL_TABLE",
    parameters: {
      "skip.header.line.count": "2",
    },
    storageDescriptor: {
      location: `s3://${props.logBucket.bucketName}/cloudfront-logs/`,
      inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
      outputFormat:
        "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      serdeInfo: {
        serializationLibrary:
          "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        parameters: {
          "field.delim": "\t",
          "serialization.format": "\t",
          "timestamp.formats": "yyyy-MM-dd HH:mm:ss",
          "ignore.malformed.fields": "true",
        },
      },
      columns: [
        { name: "date", type: "date" },
        { name: "time", type: "string" },
        { name: "location", type: "string" },
        { name: "bytes", type: "bigint" },
        { name: "request_ip", type: "string" },
        { name: "method", type: "string" },
        { name: "host", type: "string" },
        { name: "uri", type: "string" },
        { name: "status", type: "int" },
        { name: "referrer", type: "string" },
        { name: "user_agent", type: "string" },
        { name: "query_string", type: "string" },
        { name: "cookie", type: "string" },
        { name: "result_type", type: "string" },
        { name: "request_id", type: "string" },
        { name: "host_header", type: "string" },
        { name: "request_protocol", type: "string" },
        { name: "request_bytes", type: "bigint" },
        { name: "time_taken", type: "float" },
        { name: "xforwarded_for", type: "string" },
        { name: "ssl_protocol", type: "string" },
        { name: "ssl_cipher", type: "string" },
        { name: "response_result_type", type: "string" },
        { name: "http_version", type: "string" },
        { name: "fle_status", type: "string" },
        { name: "fle_encrypted_fields", type: "int" },
        { name: "c_port", type: "int" },
        { name: "time_to_first_byte", type: "float" },
        { name: "x_edge_detailed_result_type", type: "string" },
        { name: "sc_content_type", type: "string" },
        { name: "sc_content_len", type: "bigint" },
        { name: "sc_range_start", type: "bigint" },
        { name: "sc_range_end", type: "bigint" },
      ],
    },
  },
});

Flat vs Partitioned Storage

In this implementation, our CloudFront logs are stored in a flat structure directly in the S3 bucket. While it's possible to use partitioning with CloudFront logs (by year/month/day), we've chosen a simpler approach since:

  1. Our log volume is moderate
  2. Our queries typically look at recent data (last 7 days)
  3. We're using date-based filtering in our WHERE clauses

If you were dealing with much larger volumes of data or needed to frequently query historical data, you might consider using partitioning with a structure like:

s3://bucket/cloudfront-logs/year=2024/month=01/day=11/

This would require additional configuration in the Glue table:

parameters: {
  "projection.enabled": "true",
  "projection.year.type": "integer",
  "projection.year.range": "2024,2026",
  "projection.month.type": "integer",
  "projection.month.range": "1,12",
  "projection.month.digits": "2", // zero-pad to match month=01
  "projection.day.type": "integer",
  "projection.day.range": "1,31",
  "projection.day.digits": "2", // zero-pad to match day=11
},
partitionKeys: [
  { name: "year", type: "string" },
  { name: "month", type: "string" },
  { name: "day", type: "string" },
],

For our use case, the simpler flat structure provides good performance while keeping the configuration straightforward.
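To make the partitioned layout concrete, here is a small, hypothetical helper (not part of the stack) that derives the Hive-style prefix for a given date; the bucket name is a placeholder:

```typescript
// Hypothetical helper: build the Hive-style partition prefix that a
// date-partitioned CloudFront log layout would use.
function partitionPrefix(base: string, d: Date): string {
  const year = d.getUTCFullYear();
  const month = String(d.getUTCMonth() + 1).padStart(2, "0");
  const day = String(d.getUTCDate()).padStart(2, "0");
  return `${base}/year=${year}/month=${month}/day=${day}/`;
}

console.log(partitionPrefix("s3://bucket/cloudfront-logs", new Date(Date.UTC(2024, 0, 11))));
// → s3://bucket/cloudfront-logs/year=2024/month=01/day=11/
```

The zero-padding in the helper mirrors the projection.*.digits settings: Athena only matches partitions whose S3 path segments have exactly that form.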

Athena Workgroup

Next, we create an Athena workgroup whose name includes the environment and domain name, which makes it easy to identify and manage the workgroup for each environment.

// Create Athena Workgroup with environment-specific name
const workgroup = new CfnWorkGroup(this, "CloudFrontLogsWorkgroup", {
  name: resourcePrefix,
  description: `Workgroup for analyzing CloudFront logs for ${props.environment} ${props.domainName}`,
  recursiveDeleteOption: true,
  workGroupConfiguration: {
    resultConfiguration: {
      outputLocation: `s3://${this.athenaResultsBucket.bucketName}/`,
    },
    publishCloudWatchMetricsEnabled: true,
    enforceWorkGroupConfiguration: true,
    engineVersion: {
      selectedEngineVersion: "Athena engine version 3",
    },
  },
});

Understanding CloudFront Log Format

CloudFront logs are tab-delimited text files with a fixed structure. Each file begins with a header line listing the fields:

#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end

Our Glue table maps these, in order, to the friendlier column names defined above (for example, x-edge-location becomes location and c-ip becomes request_ip).

The skip.header.line.count parameter is set to "2" because CloudFront log files contain two header lines:

  1. The version header
  2. The field names header
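Those two lines are exactly what a reader of these files has to discard. As a rough sketch (a hypothetical helper, not part of the stack), parsing a log file by hand would look like:

```typescript
// Hypothetical minimal reader for a CloudFront standard log file:
// drop the two leading comment lines (#Version and #Fields) and
// split the remaining tab-delimited rows into fields.
function parseCloudFrontLog(contents: string): string[][] {
  return contents
    .split("\n")
    .filter((line) => line.length > 0 && !line.startsWith("#"))
    .map((line) => line.split("\t"));
}

// Truncated, made-up sample with both header lines and one data row.
const sample = [
  "#Version: 1.0",
  "#Fields: date time x-edge-location sc-bytes",
  "2024-01-11\t12:34:56\tIAD89-C1\t1234",
].join("\n");

console.log(parseCloudFrontLog(sample));
// → [["2024-01-11", "12:34:56", "IAD89-C1", "1234"]]
```

The skip.header.line.count setting performs the same skip on the Athena side, and the tab delimiter matches the field.delim SerDe parameter.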

Handling Malformed Data

CloudFront logs can sometimes contain malformed entries. We handle this gracefully with:

parameters: {
  "ignore.malformed.fields": "true",
}

This ensures our queries don't fail if they encounter malformed log entries.

Query Optimization

To optimize query performance and cost, we've structured our queries to:

  1. Filter by date first to minimize data scanned
  2. Exclude irrelevant paths and file types
  3. Use appropriate WHERE clauses

For example, the "Top Pages" query shown in full in the Analyzing Blog Traffic section below applies all three of these: it limits the date range to the last 7 days, focuses only on blog post URLs, excludes asset requests and text versions of pages, and filters out single-hit entries that are likely bot traffic.

Cost Management

Since Athena charges based on the amount of data scanned, efficient table design and query patterns are essential. Some tips:

  1. Use date-based filtering in your WHERE clause
  2. Avoid SELECT * queries
  3. Only include necessary columns in your queries
  4. Consider using Athena workgroups with query result reuse enabled

By following these practices, we can efficiently analyze our blog traffic while keeping costs under control.

Analyzing Blog Traffic

With our infrastructure in place, we can now analyze blog traffic patterns. Here are some useful queries we've implemented:

Top Pages

This query shows the most visited blog posts, excluding asset requests:

SELECT
    uri,
    COUNT(*) as views,
    COUNT(DISTINCT request_ip) as unique_visitors,
    AVG(time_taken) as avg_time_taken
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
    AND uri LIKE '/blog/%'
    AND uri NOT LIKE '%.txt'
    AND uri NOT LIKE '%._next/%'
    AND uri NOT LIKE '%.png'
    AND uri NOT LIKE '%.jpg'
    AND uri NOT LIKE '%.ico'
    AND uri NOT LIKE '%.js'
    AND uri NOT LIKE '%.css'
    AND uri NOT LIKE '%.woff2'
    AND uri NOT LIKE '%.json'
    AND uri NOT LIKE '%.xml'
GROUP BY uri
HAVING COUNT(*) > 1
ORDER BY views DESC;

This query:

  1. Filters for blog post URLs
  2. Excludes asset requests (images, stylesheets, etc.)
  3. Counts total views and unique visitors
  4. Calculates average load time

Traffic Patterns by Hour

To understand when our blog receives the most traffic:

SELECT
    HOUR(PARSE_DATETIME(time, 'HH:mm:ss')) as hour_utc,
    COUNT(*) as requests,
    AVG(time_taken) as avg_response_time
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY HOUR(PARSE_DATETIME(time, 'HH:mm:ss'))
ORDER BY hour_utc;

Saving Common Queries

We store these queries in Athena for easy reuse:

const queries = [
  {
    name: `Top Pages - ${props.environment}`,
    description: `Shows most visited pages excluding assets for ${props.environment}`,
    query: `...`,
  },
  {
    name: `Traffic by Hour - ${props.environment}`,
    description: `Shows traffic patterns by hour for ${props.environment}`,
    query: `...`,
  },
  // Additional queries...
];
 
queries.forEach((queryDef, index) => {
  new CfnNamedQuery(this, `SavedQuery${index}`, {
    database: database.ref,
    description: queryDef.description,
    name: queryDef.name,
    queryString: queryDef.query,
    workGroup: workgroup.name,
  });
});

Conclusion

By combining CloudFront logs with Athena, we've created a powerful analytics system for understanding our blog's traffic patterns. This setup allows us to:

  • Track popular content
  • Understand user behavior
  • Monitor performance
  • Make data-driven decisions about content strategy