Organizing a Capture The Flag (CTF) competition is not just about crafting challenges; it is fundamentally an exercise in systems design under adversarial conditions.
Every design decision must account for unreliable infrastructure, unpredictable user behavior, and intentional abuse.
For TriNetra CTF (a part of ACM SNIOE’s Hackdata event), I helped design and deploy the challenge infrastructure for a ~30-participant CTF using Docker, Traefik, and a single-node VPS, while also exploring how challenge design can reduce trivial LLM-assisted solving.
The TriNetra CTF took place in two rounds. The first was an online qualifying round and the second round was conducted on-site at Shiv Nadar University.
This post details the architecture, deployment model, security posture, and design decisions behind the systems deployed in the on-site round.
System Scale, Requirements & Constraints #
A key step in the design process was identifying the requirements.
From the previous round, we knew that only 10 of the qualified teams (~30 individuals) would be able to attend. We had planned a 5-hour CTF with a total of 10 challenges across various domains (reversing, web, steganography, etc.).
We also chose to host certain challenge elements server-side after observing in the previous round that handing out full artifacts made it trivial for participants to upload them to LLMs for effortless solving.
Based on the above, we defined the following requirements:
- ~30 active participants
- ~10 concurrent challenge instances
- A 5-hour event window with 10 challenges
Deployment Strategy #
For a CTF of our chosen duration and participant strength, a distributed deployment strategy (i.e., Kubernetes) would increase complexity and maintenance burden without meaningful benefits. We therefore opted for a single-node deployment on a VPS, driven primarily by its lower operational overhead, with an on-site backup server in case of emergency.
Challenge Isolation Model #
Since all challenges were to be deployed on the same machine, it was imperative that one challenge's files and processes could not interfere with another's.
Docker was used to ensure challenge isolation as well as cross-machine reproducibility.
Using Docker also ensured that an unintended vulnerability in one challenge would not lead to the compromise of other challenges or the VPS itself.
Docker also allowed rapid scaling by deploying multiple challenge instances when needed.
Handling of multiple users was pushed down the stack to the individual challenges themselves. This decision was made in light of the fact that very few of our challenges needed to track per-user state.
Implications #
Advantages:
- Minimal resource overhead
- Simplified orchestration and deployment
Limitations:
- Potential cross-user interference
- No per-user state isolation at the container level
This tradeoff is acceptable at small scales but may not generalize to high-concurrency environments.
High-Level Architecture #
The architecture was designed around a few key principles:
- Isolation over complexity — minimize cross-challenge interference without introducing heavy orchestration
- Operational simplicity — ensure the system could be managed under time pressure
- Fast recovery — prioritize quick redeployment over perfect fault tolerance
With these constraints, the architecture was divided into two major components:
Challenge Infrastructure: The exploitation surface, their associated environments and their deployment strategy fell under the domain of ACM’s cybersecurity team.
Front-end: The user-facing UI and its associated logic, covering challenge metadata, flag validation, and challenge lifecycle. This was handled by ACM's web development team due to their ability to rapidly build a clean UI/UX.
This separation decouples the exploitation surface from the scoring logic, preventing a failure on one side from propagating to the other.
This article focuses on the Challenge Infrastructure and only briefly touches on front-end details.
Infrastructure Stack #
VPS Layer (Linode) #
Linode was chosen as the VPS due to prior familiarity and the ability to accurately estimate compute costs from previous experience.
During testing, it became apparent that a single vCPU could not handle multiple challenges, as we observed significant latency and occasional crashes.
For production, we opted for the smallest dedicated-CPU plan available to us: a 4 vCPU machine with 24GB of RAM, billed at $0.0645/hr. As the machine was only required for the 5-hour duration of the event, the cost was acceptable.
Billing for VPSs on Linode is based on the amount of time a machine is provisioned for, not for how long it is up and running.
Container Runtime (Portainer + Docker Setup) #
As previously mentioned, each challenge was packaged as an individual Docker image.
Challenge developers were instructed to package their challenge with a Dockerfile and an example docker-compose file to indicate challenge requirements. They were also instructed to use environment variables to configure challenges wherever required.
Images built from the provided Dockerfiles were then deployed through Docker Compose stacks orchestrated in Portainer.
Portainer gave us just the right amount of flexibility, letting us quickly bring challenges up and take them down without fumbling around in the command line.
Manually typing out individual Docker commands during a failure would introduce unnecessary operational friction.
No Docker container exposed its ports directly; instead, all traffic was proxied through a small set of ports handled by the reverse proxy, discussed in the following sections.
Key properties:
- Reproducibility
- Process-level isolation
- Rapid teardown/redeployment
Reverse Proxy (Traefik) #
As briefly mentioned, all Docker containers sat behind a reverse proxy.
Our chosen reverse proxy was Traefik. Based on my previous home-labbing experience, Traefik was the best fit thanks to its dynamic configuration, compared to proxies like Nginx or Caddy.
This allowed for rapid deployment and tear-down of containers without any repeated configuration of the proxy.
Service Discovery #
Traefik can talk directly to the Docker socket on a machine to discover running containers.
It supports specifying the reverse proxy configuration required for each container by adding labels to the containers themselves.
It automatically watches for containers starting and stopping, reads the required configuration from their labels, and routes traffic accordingly.
This means another instance of a container can be deployed without any configuration-related downtime.
TLS Automation #
Another important reason for the choice to use Traefik as a reverse proxy was TLS certificate automation.
When configured correctly, Traefik is able to provision Let’s Encrypt TLS certificates using ACME DNS challenges (via Cloudflare in our case) as required by individual containers.
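A minimal sketch of what such a resolver looks like in Traefik's static configuration is shown below. The email address, storage path, and entrypoint layout are illustrative assumptions, not our exact configuration; the Cloudflare API token is supplied to Traefik via an environment variable.

```yaml
# traefik.yml (static configuration) -- illustrative sketch
entryPoints:
  websecure:
    address: ":443"

certificatesResolvers:
  cloudflare:
    acme:
      email: admin@example.com        # assumed placeholder
      storage: /letsencrypt/acme.json
      dnsChallenge:
        provider: cloudflare          # reads CF_DNS_API_TOKEN from the environment
```

Containers then opt into TLS simply by referencing the resolver name in their labels, as in the conceptual example later in this section.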
Load Balancing #
Traefik also provides load balancing capabilities. This was used to route users across multiple containers in cases where we felt a single container would not be able to handle a rapid influx of requests or where latency for multiple users needed to be small.
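Because Traefik's Docker provider groups replicas of the same service into one backend, scaling out is a one-liner. A sketch of how this looks with Docker Compose (the service name `web-challenge` is a hypothetical example, not one of our actual challenges):

```shell
# Bring up three replicas of a stateless challenge service.
# Traefik discovers all three via the Docker socket and
# round-robins incoming requests across them automatically.
docker compose up -d --scale web-challenge=3
```

This only works cleanly because the containers publish no host ports themselves; all traffic enters through Traefik.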
Example (Conceptual) #
```yaml
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.challenge.rule=Host(`challenge.ctf.domain`)"
  - "traefik.http.routers.challenge.entrypoints=websecure"
  - "traefik.http.routers.challenge.tls.certresolver=cloudflare"
```

Summary of Benefits #
- Zero-downtime service exposure
- No manual configuration reloads
- Deterministic routing based on container lifecycle
- Automatic TLS certificate provisioning
- Built-in load balancing
Uptime Kuma (Infrastructure Monitoring) #
Similar to Traefik, Uptime Kuma was selected due to previous experience from home labbing.
It provided a simple dashboard where all team members could quickly and easily monitor the infrastructure situation.
Its ability to automatically send notifications through a variety of channels such as Discord, Slack, and ntfy.sh in the event of detected downtime meant that individuals capable of fixing issues could be alerted as soon as possible.
Request Lifecycle #
1. Client resolves challenge.ctf.domain
2. Request reaches the Traefik entrypoint
3. Traefik:
   - Resolves the routing rule via labels
   - Ensures TLS certificate availability
4. Request is forwarded to the container backend
5. Challenge logic executes
6. Flag is exposed upon successful exploitation
7. Submission is sent to the frontend (Supabase-backed validation)
Deployment Pipeline #
No CI/CD pipeline was used, since development and deployment were heavily time-restricted (working around exam schedules). To keep the deployment process smooth and familiar to everyone, it was intentionally made lightweight and simple.
- Challenge developer validates locally (Docker runtime)
- Source code with Dockerfile is transferred to VPS
- Image built on host
- Deployment via Docker Compose managed in Portainer
- Pre-release validation on private domain
As no CI/CD pipeline was used, all deployments and testing were performed manually. This was a slight hurdle, since only I was fully aware of the deployment process, with one other team member aware of parts of it. This could have been remedied easily with more time, but since deployment and testing took place over a span of 72 hours, there wasn't much time to bring everyone up to speed.
Proper documentation and making everyone aware of the process will need to be a priority next time to avoid a single point of failure in case someone is unavailable temporarily.
Flag Handling #
Flags were injected into containers and challenges via environment variables wherever possible. This meant that in case of a flag leak, we could simply alter the environment variable to quickly rotate the challenge flag.
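A compose-level sketch of this pattern (the service name and variable names are illustrative assumptions):

```yaml
services:
  challenge:
    build: .
    environment:
      # The challenge reads FLAG at startup; rotating a leaked flag
      # only requires editing the stack environment and redeploying.
      - FLAG=${CHALLENGE_FLAG:?set the flag in the stack environment}
```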
Validation on the front-end was handled by checking submissions against flags stored in Supabase. In hindsight, storing plaintext flags in the database was extremely insecure; since we didn't have much time to work with the front-end team, this slipped through.
In the future, the database should store only flag hashes instead of the actual flags.
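A minimal sketch of hash-based validation (the function and variable names are my own, not from the actual front-end):

```python
import hashlib
import hmac

def validate_flag(submission: str, stored_hash: str) -> bool:
    """Compare a submitted flag against a stored SHA-256 digest.

    Only the digest needs to live in the database, so a leaked
    table no longer reveals the flags themselves.
    """
    digest = hashlib.sha256(submission.strip().encode()).hexdigest()
    # Constant-time comparison avoids leaking prefix matches via timing
    return hmac.compare_digest(digest, stored_hash)

# The database would store this digest, never the plaintext flag
stored = hashlib.sha256(b"flag{example}").hexdigest()
print(validate_flag("flag{example}", stored))  # True
print(validate_flag("flag{wrong}", stored))    # False
```

Rotating a leaked flag then only means updating one hash in the database.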
Observability #
Our infrastructure had minimal observability.
We used Uptime Kuma to monitor challenge uptime and be notified in case of any failure before participants could notice.
This monitoring, combined with Portainer logs, served as our primary mechanism for handling failures. Fortunately, we did not encounter any issues during the event despite the risks introduced by our limited observability.
In the future, as a team we need to look into centralized logging and metrics aggregation solutions like the Grafana, Prometheus, and Loki stack. This would give organizers a clean dashboard from which to quickly get the lay of the challenge environment.
Reliability & Failure Analysis #
With the amount of testing we did, we actually managed to achieve 100% uptime, which in technology terms is a minor miracle. I fully expect things to go wrong the next time we conduct such an event.
The only major confirmed issues were on the front-end, where we made occasional mistakes while communicating challenge uploads to the front-end team.
Reported Incident #
During the event, one participant did report challenge downtime; however, cross-validation revealed a client-side caching issue that couldn't be replicated on a teammate's laptop.
Our systems reported full operation through Uptime Kuma during this time.
Pre-Deployment Failures #
Prior to the competition, we did encounter issues due to misconfigured Traefik labels. These kinks were worked out before the challenge start through iterative testing, with different team members testing each time across various systems and environments to ensure a smooth experience for all.
Security Hardening #
As the competition was a CTF, a security-focused event, we anticipated that participants would try to breach the challenges or the VPS itself.
Containers were made as minimal as possible to minimize the attack surface.
In the following sections, I will discuss what measures we took to harden the security posture of our VPS.
Network Layer #
From a networking standpoint, we ensured only required ports were open by using the UFW Firewall. The firewall was set up in conjunction with Fail2Ban to mitigate brute-force attempts.
The only ports we chose to keep open were ports required by Traefik (80 & 443), as well as TCP ports required for certain select challenges (Note: These TCP ports were also proxied through Traefik).
Inbound traffic on all other ports was disabled. The only exception was a port for SSH so we could configure the system; SSH was moved from its default port (22) to a higher port to reduce trivial access attempts. Unlike an extreme high-security environment, we could simply identify unauthorized port scans or malicious activity, disqualify the teams behind it, and ban their IPs, so we could afford a slightly relaxed security posture.
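The firewall rules described above can be sketched roughly as follows. The relocated SSH port (2222) and the extra challenge TCP port (31337) are illustrative assumptions, not our actual values:

```shell
# Default-deny inbound, allow all outbound
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Traefik entrypoints (HTTP redirect + HTTPS)
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# Relocated SSH port (assumed) and a Traefik-proxied challenge TCP port (assumed)
sudo ufw allow 2222/tcp
sudo ufw allow 31337/tcp

sudo ufw enable
```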
Host Hardening #
The VPS was hardened by using a password manager to generate strong credentials, particularly for the root user.
We also enabled automatic security updates. Realistically speaking, this was unnecessary considering the competition took place over the span of a few hours, but it was good practice nonetheless.
SSH access to the VPS was secured by disabling root login and password-based authentication; key-based authentication with a non-root sudo user was used instead.
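The relevant sshd_config entries would look roughly like this (the port number is an illustrative assumption):

```
# /etc/ssh/sshd_config -- illustrative excerpt
Port 2222                   # assumed relocated port
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
```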
Runtime Constraints #
Challenge containers had hard resource limits enforced, based on prior testing under load. This was mainly done to ensure participants couldn't hijack the containers and use them for other purposes, such as crypto mining.
In the event that we did hit our resource limits, the plan was to quickly scale the number of containers and let Traefik handle load balancing, rather than adjust the individual resource limits themselves. However, this never became an issue during the competition.
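A resource-limit stanza of the kind described, in Docker Compose form (the image name and specific values are illustrative, not the ones we used):

```yaml
services:
  challenge:
    image: ctf/web-challenge:latest   # assumed image name
    deploy:
      resources:
        limits:
          cpus: "0.50"     # at most half a vCPU
          memory: 256M     # hard memory cap; the kernel OOM-kills beyond this
```

Capping CPU and memory per container is what makes "scale out more replicas" a safe answer to load, since no single container can starve the host.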
Privilege Escalation Auditing #
To ensure attack surface on the VPS itself was limited, we ran LinPEAS to discover privilege escalation vectors, weak permissions and misconfigurations.
Anything found was quickly patched up.
This was an excellent suggestion from our team lead, something I was aware of but hadn’t considered applying in this context, and will adopt in future deployments.
Summary of Security Posture #
Overall, our systems were set up to minimize attack surface and resist trivial compromise.
As in most cases, security trades off against simplicity. Since we weren't facing a nation-state or persistent-adversary threat model, we opted for operational simplicity over extreme security wherever we felt we could detect and handle an attack.
Adversarial Challenge Design (LLM Resistance) #
Challenge design was a primary focus from the start, especially in the context of increasingly capable LLMs.
From our previous experience, we observed that LLMs were able to solve most basic to medium difficulty challenges without human assistance. This fundamentally changes the assumptions CTF designers can make.
Instead of attempting to prevent LLM usage, we focused on designing challenges that degrade automation-first approaches and require sustained human reasoning.
Techniques Used #
1. State-Dependent Systems #
From our previous experience, we realized that LLMs struggle with handling state. Their limited textual memory struggles to accurately capture the state of a system from one interaction to the next.
Thus we chose to build challenges where participants had to move applications through multiple stages or states and apply logic between steps based on the current state.
Example
One web challenge we designed surfaced application error states on one page while requiring state manipulation on a completely different page. This was something players had to realize themselves while examining the state of the system.
2. Large Artifact Constraints #
For challenges where it made sense to have large files (e.g., Linux forensic image challenges), we made sure the file sizes were large enough that uploading them to an LLM would not be reasonably possible.
3. Reverse Engineering #
LLMs are trained on large amounts of text. Binary or low-level reverse engineering challenges requiring substantial reasoning are very different from the text analysis these models are typically used for. This meant LLMs would often get confused or be unable to solve them without human assistance.
4. Empirical Validation #
As part of our testing, we made sure to run our challenges against LLMs without any human assistance. Challenges that LLMs were able to solve trivially were penalized in their point weighting.
Case Study: Large Artifact Bypass #
One interesting interaction we discovered during the event with regard to LLMs is highlighted below.
Setup #
One of our Linux forensics challenges had a 2GB (compressed) image file which, when expanded, would take up approximately 15GB of space.
Intended Constraint #
This challenge had been designed to thwart LLMs by exceeding their file upload limits.
On the off chance that upload was still possible, the decompression into a 15GB file was meant to ensure an LLM could not process the file in a container it spun up without quickly hitting resource limits.
Observed Behavior #
A clever workaround we found participants using was to employ local AI agents (in this case, through VS Code integration) operating on their own systems.
Since the challenge only guarded against LLMs through upload limitations, the rest of it was not designed with preventing LLM solving in mind.
As a result, through the use of local AI agents and LLM reasoning, a challenge we had weighted quite heavily in points was solved too easily.
Insight #
Constraints on model input are ineffective when computation can be offloaded to the user environment.
Implication #
The key takeaway is that challenges must be designed from the ground up with LLM-assisted workflows in mind, rather than treated as an edge case.
Key Tradeoffs #
| Decision | Benefit | Cost |
|---|---|---|
| Shared containers | Low overhead | Weak isolation |
| No rate limiting | Simplicity | Abuse potential |
| Single-node infra | Deterministic | No horizontal scaling |
| No orchestration | Fast setup | Limited elasticity |
Cost Analysis #
- Total cost: $2.81 (3 days)
  - 2 days: low-tier instance
  - 1 day: high-performance instance
The cost primarily came from the compute required. Network bandwidth remained well within the provided allotment, as challenges were designed to avoid brute-force approaches.
Overall we managed to balance scale with efficiency to keep costs low.
Future Improvements #
The following are possible improvements to be made in the future:
- Per-user ephemeral containers
- Centralized logging + metrics
- Automated deployment pipelines (CI/CD + Ansible to setup VPS)
- Tighter frontend–infrastructure integration
- More robust AI-resistant challenge patterns
Conclusion #
CTF infrastructure sits at the intersection of:
- Distributed systems
- Security engineering
- Adversarial thinking
Even at a modest scale, careful decisions around routing, isolation, and challenge design significantly impact system robustness and participant experience.
A key emerging constraint is clear:
Systems must now be designed with AI-assisted users as the baseline, not the exception.
If you’re building CTF infrastructure or exploring adversarial system design, feel free to reach out.