Updating Athena Queries for Enhanced Blog Traffic Analysis
Expanding on our CloudFront log analysis with new Athena queries to differentiate bots vs humans, analyze external referrers, and get detailed bot traffic insights.

In our previous post, "Analyzing Blog Traffic with CloudFront Logs and Athena", we set up a robust system for capturing and querying CloudFront logs using AWS Glue and Athena. This provided valuable insights into page views and traffic patterns.
Today, we're enhancing this setup by adding several new Athena queries to gain deeper insights into our traffic sources and bot activity. These queries are added to our LoggingConstruct
in the CDK infrastructure code (apps/infra/lib/constructs/logging-construct.ts
).
New Athena Queries
We've added three new queries to our collection:
- Bot vs Human Analysis: Differentiates traffic based on user agent strings.
- Referrer Analysis (External Only): Identifies top external traffic sources.
- Detailed Bot Traffic: Breaks down traffic by specific bot user agents.
Let's look at each query in detail.
1. Bot vs Human Analysis
Understanding how much of your traffic comes from automated bots versus actual human visitors is crucial. This query categorizes requests based on a list of known bot user agent strings. We've also included metrics like total bytes transferred and distinct pages hit for each category.
SELECT
CASE
WHEN LOWER(user_agent) LIKE '%ahrefsbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%anthropic-ai%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%applebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%awariorssbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%awariosmartbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%bytespider%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%ccbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%chatgpt-user%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%claudebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%claude-web%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%cohere-ai%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%dataforseobot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%diffbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%facebookbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%friendlycrawler%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%google-extended%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%gptbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%imagesiftbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%img2dataset%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%linkedinbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%maggie-bot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%omgili%' THEN 'Bot' -- Includes omgilibot
WHEN LOWER(user_agent) LIKE '%perplexitybot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%piplbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%scoop.it%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%seekr%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%semrushbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%youbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%yandex%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%bingbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%googlebot%' THEN 'Bot' -- Excludes Google-Extended
WHEN LOWER(user_agent) LIKE '%slurp%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%duckduckbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%baiduspider%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%sogou%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%exabot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%facebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%ia_archiver%' THEN 'Bot'
-- Add more specific bot checks if needed
ELSE 'Human'
END as source_type,
COUNT(*) as requests,
COUNT(DISTINCT request_ip) as unique_visitors,
SUM(bytes) as total_bytes,
COUNT(DISTINCT uri) as distinct_pages_hit
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY
CASE
WHEN LOWER(user_agent) LIKE '%ahrefsbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%anthropic-ai%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%applebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%awariorssbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%awariosmartbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%bytespider%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%ccbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%chatgpt-user%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%claudebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%claude-web%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%cohere-ai%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%dataforseobot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%diffbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%facebookbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%friendlycrawler%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%google-extended%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%gptbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%imagesiftbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%img2dataset%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%linkedinbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%maggie-bot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%omgili%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%perplexitybot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%piplbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%scoop.it%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%seekr%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%semrushbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%youbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%yandex%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%bingbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%googlebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%slurp%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%duckduckbot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%baiduspider%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%sogou%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%exabot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%facebot%' THEN 'Bot'
WHEN LOWER(user_agent) LIKE '%ia_archiver%' THEN 'Bot'
ELSE 'Human'
END;
2. Referrer Analysis (External Only)
Knowing where your traffic originates is key to understanding your audience and marketing effectiveness. This query identifies the top referring domains, specifically excluding internal referrers (traffic coming from your own domain).
SELECT
referrer,
COUNT(*) as hits
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
AND referrer IS NOT NULL
AND referrer <> '-'
AND referrer NOT LIKE '%${props.domainName}%' -- Exclude internal referrers
GROUP BY referrer
ORDER BY hits DESC
LIMIT 50;
Note: ${props.domainName}
is dynamically replaced with your actual domain name by the CDK during deployment.
3. Detailed Bot Traffic
While the first query gives a high-level split, this one dives deeper into which specific bots are accessing your site. It groups traffic by the full user agent string for identified bots.
SELECT
user_agent,
COUNT(*) as requests,
COUNT(DISTINCT request_ip) as unique_ips,
SUM(bytes) as total_bytes
FROM cloudfront_logs
WHERE date >= CURRENT_DATE - INTERVAL '7' DAY
AND (
LOWER(user_agent) LIKE '%bot%' OR
LOWER(user_agent) LIKE '%spider%' OR
LOWER(user_agent) LIKE '%crawler%' OR
LOWER(user_agent) LIKE '%slurp%' OR
LOWER(user_agent) LIKE '%ahrefsbot%' OR
LOWER(user_agent) LIKE '%googlebot%' OR
LOWER(user_agent) LIKE '%bingbot%' OR
LOWER(user_agent) LIKE '%semrushbot%' OR
LOWER(user_agent) LIKE '%anthropic-ai%' OR
LOWER(user_agent) LIKE '%applebot%' OR
LOWER(user_agent) LIKE '%bytespider%' OR
LOWER(user_agent) LIKE '%ccbot%' OR
LOWER(user_agent) LIKE '%chatgpt-user%' OR
LOWER(user_agent) LIKE '%claudebot%' OR
LOWER(user_agent) LIKE '%claude-web%' OR
LOWER(user_agent) LIKE '%cohere-ai%' OR
LOWER(user_agent) LIKE '%diffbot%' OR
LOWER(user_agent) LIKE '%facebookbot%' OR
LOWER(user_agent) LIKE '%google-extended%' OR
LOWER(user_agent) LIKE '%gptbot%' OR
LOWER(user_agent) LIKE '%linkedinbot%' OR
LOWER(user_agent) LIKE '%omgili%' OR
LOWER(user_agent) LIKE '%perplexitybot%' OR
LOWER(user_agent) LIKE '%yandex%' OR
LOWER(user_agent) LIKE '%duckduckbot%' OR
LOWER(user_agent) LIKE '%baiduspider%' OR
LOWER(user_agent) LIKE '%exabot%' OR
LOWER(user_agent) LIKE '%facebot%' OR
LOWER(user_agent) LIKE '%ia_archiver%'
)
GROUP BY user_agent
ORDER BY requests DESC
LIMIT 100;
Implementation
These queries are added to the queries
array within the LoggingConstruct
class in apps/infra/lib/constructs/logging-construct.ts
. The CDK automatically creates CfnNamedQuery
resources for each definition, making them readily available in the Athena console under the specified workgroup.
// Example snippet from logging-construct.ts
const queries = [
// ... other queries ...
{
name: `Bot vs Human Analysis - ${props.environment}`,
description: `Analyzes traffic source (Bot vs Human)...`,
query: `... SQL ...`,
},
{
name: `Referrer Analysis - ${props.environment}`,
description: `Shows top external referring domains...`,
query: `... SQL ...`,
},
{
name: `Detailed Bot Traffic - ${props.environment}`,
description: `Shows request counts for specific known bot user agents...`,
query: `... SQL ...`,
},
];
queries.forEach((queryDef, index) => {
new CfnNamedQuery(this, `SavedQuery${index}`, {
database: database.ref,
description: queryDef.description,
name: queryDef.name,
queryString: queryDef.query,
workGroup: workgroup.name,
});
});
Conclusion
By adding these targeted queries, we can now get a much clearer picture of our blog's traffic composition. We can distinguish bot traffic, understand external referral sources, and identify specific crawler activity. This enhanced visibility allows for better decision-making regarding content strategy, performance optimization, and security monitoring. Remember to periodically review and update the bot user agent lists as new crawlers emerge.