
EKS GPU control panel

A reusable pattern for turning expensive EKS GPU nodegroups on and off from a custom web panel. Built on CloudFront, S3, API Gateway and Lambda, with simple authentication. Designed for labs and demonstrations where GPUs are consumed by the hour, not by the day.

Year
2026
Stack
AWS EKS CloudFront API Gateway Lambda S3 IAM
Diagram of the control panel that starts and stops GPU nodegroups on EKS from a web interface

Context

GPU nodes on AWS (g6, g5 and p4 families) cost between one and thirty dollars per hour. In labs and demonstrations, keeping these resources active around the clock translates into hundreds or thousands of dollars of monthly spend. Manual shutdown works until a weekend oversight turns into a significant cost overrun.

Conventional alternatives (CloudWatch schedules, Karpenter, Spot) are either too rigid or assume continuous workloads. The actual need was simpler: a web button the user switches on at the start of a session and off when finished.

Architecture

The panel is served as a static site from S3 behind CloudFront and talks to a protected endpoint on API Gateway. The Lambda function behind the endpoint has three responsibilities:

  1. Validate a simple token (signed cookie or header).
  2. Call EKS to scale the GPU nodegroup’s desired capacity.
  3. Return the current state so the panel can reflect transitions (starting, stopping, ready).

The frontend polls the status endpoint every few seconds, providing a near real-time user experience.
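A minimal sketch of such a Lambda handler, assuming boto3 and hypothetical cluster and nodegroup names (lab-cluster, gpu-nodegroup); the real code would read these from configuration:

```python
import json

GPU_DESIRED = 1  # assumption: one GPU node per active session

def scaling_config(action: str, max_size: int = 1) -> dict:
    """Translate a panel action into an EKS scalingConfig payload."""
    desired = GPU_DESIRED if action == "start" else 0
    return {"minSize": 0, "maxSize": max_size, "desiredSize": desired}

def handler(event, context):
    # boto3 ships with the Lambda runtime; importing it lazily keeps the
    # module importable for unit tests without the SDK installed.
    import boto3
    eks = boto3.client("eks")
    action = json.loads(event.get("body") or "{}").get("action", "status")
    if action in ("start", "stop"):
        eks.update_nodegroup_config(
            clusterName="lab-cluster",        # hypothetical name
            nodegroupName="gpu-nodegroup",    # hypothetical name
            scalingConfig=scaling_config(action),
        )
    # Always return the current state so the panel can render transitions.
    ng = eks.describe_nodegroup(
        clusterName="lab-cluster", nodegroupName="gpu-nodegroup"
    )["nodegroup"]
    return {
        "statusCode": 200,
        "body": json.dumps({
            "status": ng["status"],  # e.g. ACTIVE, UPDATING
            "desired": ng["scalingConfig"]["desiredSize"],
        }),
    }
```

The same handler serves both the action and the status poll, which keeps the API surface to a single endpoint.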

Stack rationale

Decisions that added value

State in the cluster, not in a database

Rather than maintaining an external table with each nodegroup’s state, the system queries EKS directly for the desired capacity. The source of truth lives in a single place, which eliminates drift.
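The panel states can be derived entirely from what EKS already reports. A sketch of that mapping, assuming the documented nodegroup status values (ACTIVE, UPDATING) and the desired size from the scaling configuration:

```python
def panel_state(nodegroup_status: str, desired_size: int) -> str:
    """Derive the panel's display state from EKS alone, no external table."""
    if nodegroup_status == "UPDATING":
        # A scale-up in progress reads as "starting", a scale-down as "stopping".
        return "starting" if desired_size > 0 else "stopping"
    if nodegroup_status == "ACTIVE" and desired_size > 0:
        return "ready"
    return "stopped"
```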

Automatic shutdown with timeout

An optional parameter starts the node “for two hours”: the Lambda creates a one-time EventBridge schedule that shuts the resource down if the user forgets to do so manually. A minor design decision that consistently translates into savings.
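One way to sketch this with EventBridge Scheduler’s one-time at() expressions, assuming a hypothetical schedule name and that the target is the same Lambda invoked with a stop action:

```python
from datetime import datetime, timedelta, timezone

def shutdown_expression(hours: int) -> str:
    """Build a one-time EventBridge Scheduler expression, N hours from now (UTC)."""
    at = datetime.now(timezone.utc) + timedelta(hours=hours)
    return f"at({at.strftime('%Y-%m-%dT%H:%M:%S')})"

def schedule_shutdown(hours: int, lambda_arn: str, role_arn: str) -> None:
    import boto3  # available in the Lambda runtime
    scheduler = boto3.client("scheduler")
    scheduler.create_schedule(
        Name="gpu-auto-stop",                       # hypothetical name
        ScheduleExpression=shutdown_expression(hours),
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            "Arn": lambda_arn,
            "RoleArn": role_arn,
            "Input": '{"action": "stop"}',  # re-enter the same handler
        },
        ActionAfterCompletion="DELETE",  # one-shot: remove after firing
    )
```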

Deliberately simple authentication

For a personal or small-team panel, OAuth is over-engineered. A shared secret signed into a cookie is sufficient. The design leaves room for a future migration to Cognito or IAM Identity Center without changes to the rest of the architecture.
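The signed-cookie check reduces to a few lines of stdlib HMAC. A minimal sketch, assuming the shared secret is loaded from configuration (the hardcoded value here is a placeholder):

```python
import hashlib
import hmac

SECRET = b"change-me"  # assumption: loaded from config/SSM, never hardcoded

def sign(user: str) -> str:
    """Issue a token: the user id plus an HMAC-SHA256 over it."""
    mac = hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()
    return f"{user}.{mac}"

def verify(token: str) -> bool:
    """Check the token's HMAC in constant time."""
    user, _, mac = token.rpartition(".")
    expected = hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()
    return bool(user) and hmac.compare_digest(expected, mac)
```

Swapping verify() for a Cognito or IAM Identity Center check later touches only this one function, not the rest of the architecture.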

Reuse

I currently apply this pattern to the FortiAIGate lab, but the code is written as a template: swapping the target nodegroup and the branding extends it to FortiEDR, malware-analysis sandboxes, or any GPU workload consumed in sessions.

Next steps

I am packaging the panel as a Terraform module that takes the cluster and nodegroup as inputs and automatically provisions S3, CloudFront, API Gateway, Lambda and IAM. The goal is that any engineer can add a “GPU power switch” to their EKS deployment with a single terraform apply.