I recently revisited cdk-websocket-server, a demo I published in 2023 for running a WebSocket server on ECS Fargate behind CloudFront. The original worked, but it carried unnecessary complexity — EC2 instances backing a Fargate cluster, a custom Lambda to patch CloudFront headers, and a single-stage Docker build running as root. Here's what changed and why.
Dropping the EC2 Auto Scaling Group
The original created an EC2 Auto Scaling Group alongside the Fargate cluster:
// Old approach
const autoScalingGroup = new AutoScalingGroup(this, 'AutoScalingGroup', {
vpc: props.vpc,
instanceType: new InstanceType('m6i.large'),
machineImage: EcsOptimizedImage.amazonLinux2(),
desiredCapacity: 1,
});
const capacityProvider = new AsgCapacityProvider(this, 'capacityProvider', {
autoScalingGroup: autoScalingGroup,
});
this.cluster.addAsgCapacityProvider(capacityProvider);
This was wrong. Fargate is serverless containers — the entire point is that you don't manage the underlying compute. Having an EC2 ASG capacity provider on a Fargate cluster doesn't add anything; it's a leftover from when the project might have used EC2-backed ECS.
The new version just creates the cluster:
this.cluster = new Cluster(this, 'Cluster', {
vpc: props.vpc,
clusterName: 'websocket-service',
containerInsightsV2: ContainerInsights.ENHANCED,
});
No EC2 instances, no ASG, no capacity provider. Fargate handles compute provisioning. We also enabled Container Insights in enhanced mode — better metrics with zero additional infrastructure.
Task-Level Auto Scaling
The original had no auto scaling at the task level. If traffic increased, you were stuck with whatever desiredCount you set at deploy time.
The new version scales Fargate tasks based on active connections:
const scalableTarget = websocketService.autoScaleTaskCount({
minCapacity: 1,
maxCapacity: 5,
});
scalableTarget.scaleOnRequestCount('RequestScaling', {
requestsPerTarget: 5,
targetGroup: webSocketTargetGroup,
});
An ALB counts each WebSocket upgrade handshake as a request, so the RequestCountPerTarget metric it emits approximates new connections per target. When that rate exceeds the threshold, ECS automatically provisions new Fargate tasks. This is the right level to scale at — task instances, not EC2 instances.
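To build intuition for what target tracking does with that metric, here is a rough sketch of the sizing math — illustrative only; the real Application Auto Scaling algorithm also applies cooldowns, alarm evaluation periods, and separate scale-in behavior, and this function name is not from the repo:

```typescript
// Approximate target-tracking math: scale the task count so that the
// per-target metric lands back at the configured target value, clamped
// to the min/max capacity from autoScaleTaskCount.
function desiredTaskCount(
  currentTasks: number,
  metricPerTarget: number, // e.g. requests (new connections) per target
  targetPerTask: number,   // requestsPerTarget: 5 in the config above
  min: number,
  max: number,
): number {
  const raw = Math.ceil((currentTasks * metricPerTarget) / targetPerTask);
  return Math.min(max, Math.max(min, raw));
}

// 2 tasks each seeing ~8 requests against a target of 5 → scale to 4 tasks
console.log(desiredTaskCount(2, 8, 5, 1, 5)); // → 4
```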
Native Custom Headers (No More Custom Resource)
The biggest cleanup was the CloudFront-to-ALB security mechanism. The original used a Custom Resource Lambda to patch the CloudFront distribution config after creation, because CDK didn't support custom origin headers natively:
// Old approach - Custom Resource to patch headers
new CustomResource(this, 'customHeaderCustomResource', {
serviceToken: customHeaderCustomResourceProvider.serviceToken,
properties: {
DistributionId: this.distribution.distributionId,
Origins: [
{
OriginId: 'defaultOrigin',
CustomHeaders: [
{
HeaderName: props.customHeader,
HeaderValue: props.randomString,
},
],
},
],
},
});
This involved a Lambda function that called GetDistributionConfig, merged in the custom headers, and called UpdateDistribution. It worked, but it was fragile — you had to handle the ETag for optimistic concurrency, match origin IDs correctly, and deal with the Lambda runtime and IAM permissions.
CDK now supports customHeaders directly on LoadBalancerV2Origin:
const defaultOrigin = new LoadBalancerV2Origin(props.applicationLoadBalancer, {
httpPort: 80,
protocolPolicy: OriginProtocolPolicy.HTTP_ONLY,
originId: 'defaultOrigin',
customHeaders: {
[props.customHeader]: props.randomString,
},
});
Three lines replace an entire Custom Resource stack. The headers are set during synthesis, deployed as part of the CloudFormation template, and managed through the normal update lifecycle. No Lambda, no runtime API calls, no race conditions.
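The header only helps if something verifies it. Verification typically happens in an ALB listener rule, but a defense-in-depth check inside the server itself is cheap. A minimal sketch — the function and header names here are hypothetical, not from the original repo:

```typescript
import { timingSafeEqual } from 'node:crypto';

// Reject upgrade requests that lack the secret header CloudFront injects.
// headerName/expectedValue would mirror props.customHeader and
// props.randomString from the CDK code above.
function hasValidOriginHeader(
  headers: Record<string, string | string[] | undefined>,
  headerName: string,
  expectedValue: string,
): boolean {
  // Node lowercases incoming header names
  const value = headers[headerName.toLowerCase()];
  if (typeof value !== 'string') return false;
  const a = Buffer.from(value);
  const b = Buffer.from(expectedValue);
  // timingSafeEqual throws on length mismatch, so guard first
  return a.length === b.length && timingSafeEqual(a, b);
}
```

The timing-safe comparison avoids leaking the secret's value through response-time differences.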
Multi-Stage Docker Build
The original Dockerfile was a single stage running as root:
# Old approach
FROM --platform=linux/arm64 node:20
WORKDIR /usr/src/app
COPY . .
RUN yarn && yarn build
EXPOSE 8080
CMD [ "node", "dist/server.js" ]
The new version uses a multi-stage build:
FROM --platform=linux/arm64 node:20-alpine AS build
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM --platform=linux/arm64 node:20-alpine
ENV NODE_ENV=production
RUN apk add --no-cache curl
USER node
WORKDIR /usr/src/app
COPY --chown=node:node package*.json ./
COPY --from=build --chown=node:node /usr/src/app/node_modules ./node_modules
COPY --from=build --chown=node:node /usr/src/app/dist ./dist
EXPOSE 8080
CMD [ "node", "dist/server.js" ]
What this buys you:
- Smaller image — Alpine base, no dev dependencies or source files in the final image
- Non-root execution — runs as the built-in node user. If the container is compromised, the attacker doesn't have root
- Deterministic installs — npm ci instead of yarn, locked to package-lock.json
- Layer caching — package*.json is copied first, so dependency installation is cached unless dependencies change
Simplified Security Groups
The original manually created security groups and wired them together:
// Old approach
const albSecurityGroup = new SecurityGroup(this, 'ALBSecurityGroup', {
vpc: props.vpc,
description: 'Security Group for ALB',
allowAllOutbound: true,
});
webSocketServiceSecurityGroup.connections.allowFrom(
new Connections({
securityGroups: [albSecurityGroup],
}),
Port.tcp(8080),
'allow traffic on port 8080 from the ALB security group',
);
The new version uses CDK's higher-level connections API:
websocketService.connections.allowFrom(
props.applicationLoadBalancer,
Port.tcp(8080),
'Allow traffic from ALB on port 8080',
);
CDK's FargateService and ApplicationLoadBalancer both implement IConnectable. When you call allowFrom between them, CDK creates the security group rules automatically. No need to create explicit security group resources or wrap them in Connections objects.
Graceful Shutdown
The original server had no shutdown handling. When ECS stopped a task — whether scaling down, deploying, or replacing an unhealthy container — every active WebSocket connection dropped immediately with no warning.
The new server handles SIGTERM and SIGINT:
const shutdown = () => {
websocketServer.clients.forEach((client: WebSocket) => {
if (client.readyState === WebSocket.OPEN) {
client.close(1001, 'Server shutting down');
}
});
server.close(() => {
process.exit(0);
});
setTimeout(() => {
process.exit(1);
}, 10000);
};
process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
Each connected client receives a 1001 (Going Away) close code, giving client-side code the opportunity to reconnect to a different task. The HTTP server drains, and a 10-second timeout ensures the process exits even if something hangs.
For REST APIs, this is nice to have. For WebSocket servers, it's mandatory.
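The client-side counterpart is reconnecting when it sees the 1001 close code, ideally with capped exponential backoff so a fleet of clients doesn't stampede the replacement task. A sketch — the function name and constants are illustrative, not from the original repo:

```typescript
// Capped exponential backoff: 250ms, 500ms, 1s, ... up to 10s.
function computeBackoffMs(attempt: number, baseMs = 250, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Wiring sketch (browser WebSocket API):
//   socket.addEventListener('close', (event) => {
//     if (event.code === 1001) {
//       setTimeout(connect, computeBackoffMs(attempt++));
//     }
//   });
```

Adding random jitter to the delay would spread reconnects out further; it's omitted here to keep the function deterministic.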
Security Testing with cdk-nag
The project now includes cdk-nag as part of the test suite:
test('No unsuppressed Errors', () => {
const app = new App();
const stack = new WebSocketServer(app, 'test', {});
Aspects.of(stack).add(new AwsSolutionsChecks());
const errors = Annotations.fromStack(stack).findError(
'*',
Match.stringLikeRegexp('AwsSolutions-.*'),
);
expect(errors).toHaveLength(0);
});
This runs the AWS Solutions rule pack at synth time. Security findings either get fixed or explicitly suppressed with documented reasons. It catches things like unencrypted buckets, overly permissive IAM policies, and missing TLS enforcement — before you deploy.
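When a finding is a deliberate trade-off rather than a bug, cdk-nag's NagSuppressions helper records the reason in code. A sketch — the rule ID and reason here are placeholders, not suppressions from the actual repo:

```
NagSuppressions.addStackSuppressions(stack, [
  {
    id: 'AwsSolutions-ELB2', // placeholder rule ID (ALB access logging)
    reason: 'Access logs are out of scope for this demo',
  },
]);
```

Because suppressions live in the stack code, they show up in code review alongside the resources they apply to.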
ALB Hardening
A small addition: dropInvalidHeaderFields: true on the ALB:
this.applicationLoadBalancer = new ApplicationLoadBalancer(
this,
'ApplicationLoadBalancer',
{
vpc: this.vpc,
internetFacing: true,
dropInvalidHeaderFields: true,
},
);
This prevents HTTP desync attacks by dropping requests with malformed headers. It's one line and costs nothing.
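"Invalid header fields" here means header names outside the HTTP token character set defined in RFC 7230. A rough approximation of that check — this regex illustrates the rule, not the ALB's exact implementation:

```typescript
// RFC 7230 "token" characters: the only characters legal in a header name.
// Requests carrying header names outside this set are the ones
// dropInvalidHeaderFields removes.
const TOKEN = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;

function isValidHeaderName(name: string): boolean {
  return TOKEN.test(name);
}
```

Header smuggling tricks often rely on names with spaces or control characters that intermediaries parse inconsistently, which is why dropping them outright closes the desync window.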
VPC Simplification
The VPC was also cleaned up — no NAT gateways (Fargate tasks have public IPs), public subnets only:
this.vpc = new Vpc(this, 'VPC', {
natGateways: 0,
subnetConfiguration: [
{
cidrMask: 24,
name: 'ServerPublic',
subnetType: SubnetType.PUBLIC,
mapPublicIpOnLaunch: true,
},
],
maxAzs: 2,
});
For a demo that doesn't need private networking, this keeps costs at zero for the VPC itself.
Summary
Most of these changes are about removing things that shouldn't have been there — EC2 instances backing a Fargate cluster, a custom resource working around a CDK limitation that no longer exists, a single-stage Dockerfile running as root. The additions (auto scaling, graceful shutdown, cdk-nag, load testing) are the things you'd want in any production WebSocket deployment.
The updated post and repo are at: