Analyzing Blog Traffic with CloudFront Logs and Athena
A detailed guide on setting up CloudFront logging with Athena to analyze blog traffic patterns and user behavior using CDK

In this guide, we'll explore how to set up and analyze CloudFront logs using Amazon Athena to gain insights into blog traffic patterns. We'll use CDK to create the necessary infrastructure and write queries to extract meaningful data about how users interact with our blog.
Overview
The solution consists of several components:
- CloudFront Distribution with logging enabled
- S3 bucket for log storage
- Athena table definition matching the CloudFront log format
- Predefined queries for common analytics needs
Infrastructure Setup
First, we'll configure CloudFront to write access logs to an S3 bucket. In our website-construct.ts:
this.distribution = new Distribution(this, "Distribution", {
// ... other configuration ...
logBucket: props.logBucket,
logFilePrefix: "cloudfront-logs/",
logIncludesCookies: true,
});
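A detail that's easy to miss: CloudFront's standard logging delivers files via S3 ACLs, so the log bucket must have ACLs enabled (newly created buckets disable them by default). A minimal sketch of how the bucket might be defined, assuming a 90-day retention period:

import { Duration, RemovalPolicy } from "aws-cdk-lib";
import { Bucket, ObjectOwnership } from "aws-cdk-lib/aws-s3";

// Hypothetical log bucket definition. BUCKET_OWNER_PREFERRED keeps ACLs
// enabled so CloudFront's log delivery account can write to the bucket.
const logBucket = new Bucket(this, "LogBucket", {
  objectOwnership: ObjectOwnership.BUCKET_OWNER_PREFERRED,
  // Assumption: logs older than 90 days are no longer needed.
  lifecycleRules: [{ expiration: Duration.days(90) }],
  removalPolicy: RemovalPolicy.RETAIN,
});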
Glue Table and Partitioning
When working with CloudFront logs, proper table configuration in AWS Glue is crucial for efficient querying. Let's look at how we set up our Glue table to handle the log data:
Table Structure
The Glue table is configured to match CloudFront's log format exactly:
new CfnTable(this, "CloudFrontLogsTable", {
catalogId: Stack.of(this).account,
databaseName: database.ref,
tableInput: {
name: "cloudfront_logs",
description: `CloudFront logs for ${props.environment} ${props.domainName}`,
tableType: "EXTERNAL_TABLE",
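// CloudFront log files begin with two header lines (#Version and #Fields),
// which we skip so they aren't parsed as data rows.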
parameters: {
"skip.header.line.count": "2",
},
storageDescriptor: {
location: `s3://${props.logBucket.bucketName}/cloudfront-logs/`,
inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
outputFormat:
"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
serdeInfo: {
serializationLibrary:
"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
parameters: {
"field.delim": "\t",
"serialization.format": "\t",
"timestamp.formats": "yyyy-MM-dd HH:mm:ss",
"ignore.malformed.fields": "true",
},
},
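// Column order must match the order CloudFront writes its fields,
// since LazySimpleSerDe maps fields to columns by position.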
columns: [
{ name: "date", type: "date" },
{ name: "time", type: "string" },
{ name: "location", type: "string" },
{ name: "bytes", type: "bigint" },
{ name: "request_ip", type: "string" },
{ name: "method", type: "string" },
{ name: "host", type: "string" },
{ name: "uri", type: "string" },
{ name: "status", type: "int" },
{ name: "referrer", type: "string" },
{ name: "user_agent", type: "string" },
{ name: "query_string", type: "string" },
{ name: "cookie", type: "string" },
{ name: "result_type", type: "string" },
{ name: "request_id", type: "string" },
{ name: "host_header", type: "string" },
{ name: "request_protocol", type: "string" },
{ name: "request_bytes", type: "bigint" },
{ name: "time_taken", type: "float" },
{ name: "xforwarded_for", type: "string" },
{ name: "ssl_protocol", type: "string" },
{ name: "ssl_cipher", type: "string" },
{ name: "response_result_type", type: "string" },
{ name: "http_version", type: "string" },
{ name: "fle_status", type: "string" },
{ name: "fle_encrypted_fields", type: "int" },
{ name: "c_port", type: "int" },
{ name: "time_to_first_byte", type: "float" },
{ name: "x_edge_detailed_result_type", type: "string" },
{ name: "sc_content_type", type: "string" },
{ name: "sc_content_len", type: "bigint" },
{ name: "sc_range_start", type: "bigint" },
{ name: "sc_range_end", type: "bigint" },
],
},
},
});
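The table's databaseName points at a Glue database via database.ref, which isn't shown above. A minimal sketch of how it might be defined (the naming scheme is an assumption):

import { Stack } from "aws-cdk-lib";
import { CfnDatabase } from "aws-cdk-lib/aws-glue";

// Hypothetical database definition; database.ref resolves to its name.
const database = new CfnDatabase(this, "CloudFrontLogsDatabase", {
  catalogId: Stack.of(this).account,
  databaseInput: {
    name: `${props.environment}_cloudfront_logs`,
  },
});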
Flat vs Partitioned Storage
In this implementation, our CloudFront logs are stored in a flat structure directly in the S3 bucket. While it's possible to use partitioning with CloudFront logs (by year/month/day), we've chosen a simpler approach since:
- Our log volume is moderate
- Our queries typically look at recent data (last 7 days)
- We're using date-based filtering in our WHERE clauses
If you were dealing with much larger volumes of data or needed to frequently query historical data, you might consider using partitioning with a structure like:
s3://bucket/cloudfront-logs/year=2024/month=01/day=11/
Since CloudFront's standard logging delivers all files under a single prefix, you would also need a process (for example, an S3-event-triggered function) that moves log files into this layout, plus additional configuration in the Glue table:
parameters: {
"projection.enabled": "true",
"projection.year.type": "integer",
"projection.year.range": "2024,2026",
"projection.month.type": "integer",
"projection.month.range": "1,12",
"projection.month.digits": "2", // zero-pad so values match month=01 prefixes
"projection.day.type": "integer",
"projection.day.range": "1,31",
"projection.day.digits": "2", // zero-pad so values match day=11 prefixes
},
partitionKeys: [
{ name: "year", type: "string" },
{ name: "month", type: "string" },
{ name: "day", type: "string" },
],
For our use case, the simpler flat structure provides good performance while keeping the configuration straightforward.
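To illustrate the payoff, here's a hypothetical query against the partitioned layout, written as a query string in the same style as the saved queries shown later. Filtering on the partition keys lets Athena read only the matching prefixes instead of every log file:

// Hypothetical example: only objects under year=2025/month=01/day=11 are scanned.
const partitionedQuery = `
  SELECT uri, COUNT(*) AS views
  FROM cloudfront_logs
  WHERE year = '2025' AND month = '01' AND day = '11'
  GROUP BY uri
  ORDER BY views DESC
`;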
Athena Workgroup
Next, we create an Athena workgroup whose name includes the environment and domain name. This makes it easy to identify and manage the workgroup for each environment.
// Create Athena Workgroup with environment-specific name
const workgroup = new CfnWorkGroup(this, "CloudFrontLogsWorkgroup", {
name: resourcePrefix,
description: `Workgroup for analyzing CloudFront logs for ${props.environment} ${props.domainName}`,
recursiveDeleteOption: true,
workGroupConfiguration: {
resultConfiguration: {
outputLocation: `s3://${this.athenaResultsBucket.bucketName}/`,
},
publishCloudWatchMetricsEnabled: true,
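// Workgroup settings override any client-side settings, so every query
// in this workgroup writes its results to our bucket.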
enforceWorkGroupConfiguration: true,
engineVersion: {
selectedEngineVersion: "Athena engine version 3",
},
},
});
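The workgroup writes results to this.athenaResultsBucket, which isn't shown above. A minimal sketch, assuming results are disposable after 30 days:

import { Duration, RemovalPolicy } from "aws-cdk-lib";
import { Bucket, BucketEncryption } from "aws-cdk-lib/aws-s3";

// Hypothetical results bucket; query results are transient, so expire them
// to keep storage costs down.
this.athenaResultsBucket = new Bucket(this, "AthenaResultsBucket", {
  encryption: BucketEncryption.S3_MANAGED,
  lifecycleRules: [{ expiration: Duration.days(30) }],
  removalPolicy: RemovalPolicy.DESTROY,
  autoDeleteObjects: true,
});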
Understanding CloudFront Log Format
CloudFront logs are tab-delimited text files with a fixed structure. Each file begins with two header lines; the field names use CloudFront's own naming, which our Glue table maps to the friendlier column names shown earlier:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
The skip.header.line.count parameter is set to "2" because CloudFront log files contain two header lines:
- The version header
- The field names header
Handling Malformed Data
CloudFront logs can sometimes contain malformed entries. We handle this gracefully with:
parameters: {
"ignore.malformed.fields": "true",
}
This ensures our queries don't fail if they encounter malformed log entries.
Query Optimization
To optimize query performance and cost, we've structured our queries to:
- Filter by date first to minimize data scanned
- Exclude irrelevant paths and file types
- Use appropriate WHERE clauses
For example, the Popular Blog Posts query shown in full in the Analyzing Blog Traffic section below:
- Limits the date range to the last 7 days
- Focuses only on blog post URLs
- Excludes asset requests and text versions of pages
- Filters out single-hit entries that might be bot traffic
Cost Management
Since Athena charges based on the amount of data scanned, efficient table design and query patterns are essential. Some tips:
- Use date-based filtering in your WHERE clause
- Avoid SELECT * queries
- Only include necessary columns in your queries
- Consider requesting query result reuse (available with Athena engine version 3) so repeated identical queries can return cached results, as sketched below
By following these practices, we can efficiently analyze our blog traffic while keeping costs under control.
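Note that result reuse is requested per query execution rather than configured on the workgroup. A minimal sketch using the AWS SDK for JavaScript v3, with the cache window as an assumption:

import {
  AthenaClient,
  StartQueryExecutionCommand,
} from "@aws-sdk/client-athena";

const athena = new AthenaClient({});

// Repeated executions of the same query within the cache window return the
// cached result instead of rescanning the logs in S3.
async function runWithResultReuse(queryString: string, workGroup: string) {
  return athena.send(
    new StartQueryExecutionCommand({
      QueryString: queryString,
      WorkGroup: workGroup,
      ResultReuseConfiguration: {
        ResultReuseByAgeConfiguration: { Enabled: true, MaxAgeInMinutes: 60 },
      },
    })
  );
}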
Analyzing Blog Traffic
With our infrastructure in place, we can now analyze blog traffic patterns. Here are some useful queries we've implemented:
Popular Blog Posts
This query shows the most visited blog posts, excluding asset requests:
SELECT
uri,
COUNT(*) as views,
COUNT(DISTINCT request_ip) as unique_visitors,
AVG(time_taken) as avg_time_taken
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
AND uri LIKE '/blog/%'
AND uri NOT LIKE '%.txt'
AND uri NOT LIKE '%/_next/%'
AND uri NOT LIKE '%.png'
AND uri NOT LIKE '%.jpg'
AND uri NOT LIKE '%.ico'
AND uri NOT LIKE '%.js'
AND uri NOT LIKE '%.css'
AND uri NOT LIKE '%.woff2'
AND uri NOT LIKE '%.json'
AND uri NOT LIKE '%.xml'
GROUP BY uri
HAVING COUNT(*) > 1
ORDER BY views DESC;
This query:
- Filters for blog post URLs
- Excludes asset requests (images, stylesheets, etc.)
- Counts total views and unique visitors
- Calculates average load time
Traffic Patterns by Hour
To understand when our blog receives the most traffic:
SELECT
HOUR(PARSE_DATETIME(time, 'HH:mm:ss')) as hour_utc,
COUNT(*) as requests,
AVG(time_taken) as avg_response_time
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY HOUR(PARSE_DATETIME(time, 'HH:mm:ss'))
ORDER BY hour_utc;
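The hours above are in UTC, since that's how CloudFront logs timestamps. To bucket traffic by local hour instead, one approach is to combine the date and time columns into a full timestamp and shift it. A sketch assuming the Europe/Berlin timezone, written as a query string in the same style as the saved queries below:

// Hypothetical variant that reports traffic by local hour instead of UTC.
const trafficByLocalHour = `
  SELECT
    HOUR(from_iso8601_timestamp(CONCAT(CAST("date" AS VARCHAR), 'T', time, 'Z'))
      AT TIME ZONE 'Europe/Berlin') AS hour_local,
    COUNT(*) AS requests
  FROM cloudfront_logs
  WHERE "date" >= CURRENT_DATE - INTERVAL '7' DAY
  GROUP BY 1
  ORDER BY hour_local
`;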
Saving Common Queries
We register these as Athena named queries so they're easy to reuse from the console:
const queries = [
{
name: `Top Pages - ${props.environment}`,
description: `Shows most visited pages excluding assets for ${props.environment}`,
query: `...`,
},
{
name: `Traffic by Hour - ${props.environment}`,
description: `Shows traffic patterns by hour for ${props.environment}`,
query: `...`,
},
// Additional queries...
];
queries.forEach((queryDef, index) => {
const namedQuery = new CfnNamedQuery(this, `SavedQuery${index}`, {
database: database.ref,
description: queryDef.description,
name: queryDef.name,
queryString: queryDef.query,
workGroup: workgroup.name,
});
// workgroup.name is a plain string, so CloudFormation sees no implicit
// dependency; add one so the workgroup exists before its named queries.
namedQuery.addDependency(workgroup);
});
Conclusion
By combining CloudFront logs with Athena, we've created a powerful analytics system for understanding our blog's traffic patterns. This setup allows us to:
- Track popular content
- Understand user behavior
- Monitor performance
- Make data-driven decisions about content strategy