Frontier labs are cautious about deploying models in defense and other high-consequence settings because nobody has a credible way to measure whether a model behaves as its policy requires under those conditions. Anthropic and OpenAI have both been publicly cautious for exactly this reason. War Games turns that judgement call into a number.
The environment
The simulator is OpenRA's port of Red Alert. The original is a 1995 real-time strategy game set in a fictional alternate history, which is the point: no live targeting, no doctrine that maps to a current operator, and no real-world coupling. Real-time strategy is general reasoning with a clock attached: long-horizon planning, economy management, and combat decisions all happening on a tempo that does not pause for the agent.
A real-time strategy game also forces you to play the way a human would: you watch the same screen, click the same buttons, and lose the same way when you misclick. That is what makes it a measurement environment instead of a benchmark with a structured API.
The environment contains 293 mission specs across 199 unique maps in three tiers: tutorial drills, scripted single-player campaigns at Easy, Normal, and Hard, and open-ended skirmish setups against the OpenRA AI. Variety blocks memorization; the difficulty bands bracket where a policy sits relative to human skill.
What the agent sees
AlphaStar and most prior RTS agents trained against a structured game API: unit vectors, scripted observations, command-level actions inside the engine. War Games does not expose any of that. The agent receives the same rendered screen a human player sees, and acts through the same input channels (click, drag, type, key, scroll) that any frontier model can already drive via computer use.
No build orders, no unit tables, no hand-written task prompt. A policy trained or evaluated here practices on what it would actually face when deployed. Any vision-language-action harness plugs in without changes.
Training
Training in War Games is reinforcement learning on top of a vision-language-action model. The hard part is not the optimizer. It is deciding what the policy should be optimized for.
That question is what AI safety calls a behavior specification. Anthropic and OpenAI both publish Model Specs at the company level for the same reason: a model can only be measured against an explicit description of how it should behave. In RL terms, the spec is a reward function. War Games lets you write yours as a reward profile: a YAML or Python file the harness reads to score every step.
Once the spec is explicit, you can run a contrast experiment. Train two policies on the same model and mission catalog under different profiles. The first rewards preservation. The second is train_only and rewards engagements while ignoring friendly losses. Run both on the same evaluation mission: the first disengages, the second pushes through. Neither is failing at the game. They are adhering to different specifications.
The delta between those two trajectories is the spec adherence measurement.
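That delta can be computed directly from logged trajectories. A minimal sketch, assuming each trajectory is a list of per-step stat dicts; the function and field names here are illustrative, not the harness's real API:

```python
# Hypothetical sketch: the real metric lives in the harness; names are illustrative.
def spec_adherence_delta(traj_a, traj_b, key="friendly_units_lost"):
    """Contrast two trajectories from the same mission by a logged statistic."""
    total_a = sum(step.get(key, 0) for step in traj_a)
    total_b = sum(step.get(key, 0) for step in traj_b)
    return total_b - total_a

# A protective policy that disengages vs. an aggressive one that pushes through.
protective = [{"friendly_units_lost": 0}, {"friendly_units_lost": 1}]
aggressive = [{"friendly_units_lost": 3}, {"friendly_units_lost": 4}]
print(spec_adherence_delta(protective, aggressive))  # 6
```

A delta near zero on the same mission would mean the two specifications are not actually producing different behavior.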
Profiles
A profile is the reward function. It is the file your training run scores against, the thing that turns "the agent did this" into a number. The only difference from a hand-rolled reward function: a profile is not freeform Python. It weights a fixed schema the harness already records. Same vocabulary on Prime Intellect Verifiers, OpenReward, and a local run, so the same profile produces the same number anywhere.
Every profile has two layers. The per-step layer is dense: small numbers scored on every tick. Did units die. Did buildings fall. The terminal layer is sparse: one number at the end. Did the mission succeed. Per-step gives the policy enough signal to learn. Terminal keeps it honest about the outcome.
A profile is not stuck to one moment. Every reward entry sees the game tick, so it can pay out only inside a window: reward scouting in the first 30 seconds, reward map control between 1 and 3 minutes, reward base pressure after 5. The profile becomes a stage description for the run.
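A windowed reward entry can be sketched in a few lines. This is an illustration of the idea, not the harness's actual schema; the tick rate and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

TICKS_PER_SECOND = 25  # assumption; the real rate is set by the engine

@dataclass
class RewardEntry:
    weight: float
    fn: Callable[[dict], float]               # scores one recorded step
    window: Optional[Tuple[int, int]] = None  # (start_tick, end_tick); None = always on

    def score(self, step: dict) -> float:
        tick = step["tick"]
        if self.window and not (self.window[0] <= tick < self.window[1]):
            return 0.0  # outside the window, this entry pays nothing
        return self.weight * self.fn(step)

# Reward scouting only in the first 30 seconds.
scouting = RewardEntry(
    weight=0.1,
    fn=lambda s: s.get("tiles_revealed", 0),
    window=(0, 30 * TICKS_PER_SECOND),
)

early = {"tick": 100, "tiles_revealed": 4}   # inside the window: pays 0.4
late = {"tick": 2000, "tiles_revealed": 4}   # outside the window: pays 0.0
```

Stacking several such entries with different windows is what turns a profile into a stage description for the run.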
Training is not one reward to convergence. It is a climb through profiles toward longer horizons.
Start on a short-horizon profile that pays out fast: the agent only has to do something useful in the first 30 seconds. Once the policy clears that, swap in a profile whose reward window only opens at 3 minutes. Then mission end. Each profile is a slightly harder horizon. Improvement comes from working through the stack, not from sitting on one fixed reward.
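The climb itself is a small loop. A sketch under stated assumptions: the stage names, the promotion threshold, and the stub trainer are all invented for illustration, not part of the framework:

```python
# Illustrative curriculum loop; profile names and threshold are assumptions.
def run_curriculum(train_fn, stages, threshold=0.8):
    """Advance to the next profile once the policy clears the current one."""
    history = []
    for profile in stages:
        score = 0.0
        while score < threshold:
            score = train_fn(profile)  # one training round; returns eval score
        history.append((profile, score))
    return history

# Stub trainer: pretend each round improves the policy a little.
scores = {"scout_30s": 0.0, "map_control_3min": 0.0, "mission_end": 0.0}
def stub_train(profile):
    scores[profile] = min(1.0, scores[profile] + 0.3)
    return scores[profile]

print(run_curriculum(stub_train, ["scout_30s", "map_control_3min", "mission_end"]))
```

Each stage gates the next, so the policy only ever faces a horizon slightly longer than the one it just cleared.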
The split system tracks this. curriculum holds the intermediate stages. train is the canonical profile. test is never touched.
A profile is just a reward function, so you train your model against whichever profile matches the policy you want. standard for balanced play. protective for a policy that preserves friendly forces. aggressive for a policy that ignores friendly losses to maximise damage. speedrun for fast, decisive play. Pick one and run wargames run --split train --profile &lt;name&gt;.
train_only is a sticker on a profile that says: do not use this for the official score. That is the whole rule. You can still train with the profile, tune against it, run a debug episode with it. The framework refuses exactly one thing: using the profile on the test split, the held-out set used for the public benchmark.
The reason is simple. Some profiles reward bad behaviour on purpose. An aggressive profile pays the agent for ignoring friendly losses. If you train a model on that profile and then score it on the same profile, it will look great. Of course it will: the test is the answer key. That number is not comparable to one from a normal profile. The sticker stops anyone from doing it by accident.
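The guard is one conditional. A minimal sketch of the rule, assuming the framework checks it at launch; the exception name and call shape here are hypothetical:

```python
# Minimal sketch of the train_only guard; the real check lives in the harness.
class TrainOnlyProfileError(Exception):
    pass

def check_profile(profile_name: str, split: str, train_only: set) -> None:
    """Refuse exactly one thing: a train_only profile scoring the test split."""
    if split == "test" and profile_name in train_only:
        raise TrainOnlyProfileError(
            f"{profile_name} is train_only and cannot score the test split"
        )

TRAIN_ONLY = {"aggressive"}
check_profile("aggressive", "train", TRAIN_ONLY)  # fine: training is allowed
check_profile("standard", "test", TRAIN_ONLY)     # fine: normal profile
# check_profile("aggressive", "test", TRAIN_ONLY) # raises TrainOnlyProfileError
```

Everything else — training, tuning, debug episodes — passes through untouched.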
The reason both safe and aggressive profiles ship together is contrast. If a policy trained under protective and a policy trained under aggressive produce the same trajectory on the same mission, the safety profile is not constraining anything; the model would have behaved that way regardless. You need the aggressive end of the dial to prove the protective end is doing work.
External profiles plug in via wargames run --profile-dir or profile_registry.register(...).
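The registration path can be pictured as a small name-to-profile map. This is a hypothetical minimal registry for illustration; the real profile_registry API is defined in the repo and may differ:

```python
# Hypothetical minimal registry; the real profile_registry API may differ.
class ProfileRegistry:
    def __init__(self):
        self._profiles = {}

    def register(self, name, profile):
        if name in self._profiles:
            raise ValueError(f"profile {name!r} already registered")
        self._profiles[name] = profile

    def get(self, name):
        return self._profiles[name]

registry = ProfileRegistry()
registry.register("cautious_scout", {
    "per_step": {"tiles_revealed": 0.1},      # dense layer: small, every tick
    "terminal": {"mission_success": 1.0},     # sparse layer: one number at the end
})
print(registry.get("cautious_scout")["terminal"]["mission_success"])  # 1.0
```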
Measuring outcomes
Two views run alongside every episode. Skill axes describe how the agent plays: long-horizon planning, economy management, combat decisions, recovery from drift.
Safety axes describe whether the profile is actually being followed: friendly force preservation, collateral damage avoidance, ROE compliance, restraint under pressure, profile contrast.
Both ship as weight-zero metrics. They do not feed gradients. They are descriptive coordinates that compare runs across profiles, so a "win" under an aggressive profile and a "win" under a protective one show up as different points in the same space.
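Weight-zero is a simple mechanism: the axis is evaluated and logged on every step, but multiplied by zero before it touches the reward. A sketch, assuming entries carry a weight; the axis names mirror the post but the schema is illustrative:

```python
# Sketch of weight-zero metrics; the schema is illustrative, not the harness's.
def score_step(step: dict, entries: dict):
    """Return (reward, metrics): weight-zero axes are recorded but shape nothing."""
    reward, metrics = 0.0, {}
    for name, (weight, fn) in entries.items():
        value = fn(step)
        metrics[name] = value      # every axis is logged as a descriptive coordinate
        reward += weight * value   # weight 0.0 contributes nothing to the gradient
    return reward, metrics

entries = {
    "engagements_won": (1.0, lambda s: s.get("engagements_won", 0)),
    "friendly_preservation": (0.0, lambda s: -s.get("friendly_units_lost", 0)),
}
reward, metrics = score_step({"engagements_won": 2, "friendly_units_lost": 3}, entries)
# reward is 2.0; friendly_preservation is logged as -3 despite shaping nothing
```

Because the descriptive axes are always recorded, two wins under different profiles land at different coordinates in the same space.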
How models run
The simulator runs in real time and never waits. Whether the model keeps up depends on the model itself: its architecture, its inference speed, what kind of input it can ingest. War Games makes that distinction visible. It credits the architectures and inference work pushing toward real-time, instead of treating today's sampled LLMs as the ceiling.
- Launch modes. Direct mode starts inside a specific mission for reproducibility. Menu mode starts at the frontend so the agent has to navigate the UI itself.
- Sampled mode is pull-based: the server sends a frame, the agent takes an action, the server sends the next frame back. This is what current LLMs can actually use. Moondream's small VLM shows the bound is not architectural: with optimised inference, sampled-mode VLMs can approach real-time on commodity hardware. Every millisecond saved between frames is one fewer millisecond the world moves without a decision.
- Streaming mode is push-based: the server pushes frames at a target FPS regardless of whether the agent is ready. Sampled models cannot use this without dropping frames. It exists for the architectures that come next: models built to ingest a frame stream continuously rather than one decision at a time.
Same WebSocket protocol either way; only the cadence differs.
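The difference between the two cadences can be simulated without any networking. A toy clock stands in for the WebSocket here; the latency and frame-rate numbers are assumptions chosen for illustration:

```python
# Toy simulation of the two cadences; all numbers are illustrative assumptions.
def sampled_frames(agent_latency_ms: int, episode_ms: int) -> int:
    """Pull-based: the next frame is sent only after the agent acts."""
    t, observed = 0, 0
    while t < episode_ms:
        observed += 1            # agent ingests a frame, thinks, acts
        t += agent_latency_ms    # then the server sends the next one
    return observed

def streaming_frames(agent_latency_ms: int, episode_ms: int, frame_interval_ms: int) -> int:
    """Push-based: frames arrive on the server's clock; a busy agent drops them."""
    observed, busy_until = 0, 0
    for t in range(0, episode_ms, frame_interval_ms):
        if t >= busy_until:      # agent is free: it ingests this frame
            observed += 1
            busy_until = t + agent_latency_ms
        # otherwise the frame is dropped; they keep coming regardless
    return observed

# A 250 ms agent over a 10-second episode, stream at 10 FPS:
print(sampled_frames(250, 10_000))        # 40 frames, none dropped
print(streaming_frames(250, 10_000, 100)) # 34 of 100 pushed frames ingested
```

In sampled mode the slow agent simply sees fewer, fresher frames; in streaming mode the same agent drops two out of every three.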
Frame delivery
The simulator runs at a fixed real-time tempo. The model does not. The gap between the two is what makes a real-time environment hard.
effective_fps = frames_observed_by_model / wall_clock_seconds
realtime_fps = frames_sent_by_server / wall_clock_seconds

A model receiving 60 FPS that acts at 4 APM is not closing the loop; it is sampling a movie.
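Plugging the 60 FPS / 4 APM example into the two rates makes the gap concrete. Pure arithmetic; equating frames-acted-on with actions is a simplifying assumption (at best one fresh frame per action):

```python
# The 60 FPS / 4 APM example as arithmetic; "one fresh frame per action" is an assumption.
wall_clock_s = 60.0
frames_sent = 60 * 60        # server streams 60 FPS for one minute
actions = 4                  # a 4 APM agent acts 4 times in that minute
frames_acted_on = actions    # at best, one fresh frame per action

print(frames_sent / wall_clock_s)      # realtime_fps: 60.0
print(frames_acted_on / wall_clock_s)  # closed-loop rate: ~0.067
```

Roughly one decision per 900 frames: the agent is watching, not playing.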
Speed
Tempo is the second axis the simulator pressures. Public StarCraft II league data puts Bronze players at roughly 60 APM and GrandMasters around 300; AlphaStar is referenced only as a tempo benchmark for human play. War Games does not expose a structured game API: pixels in, actions out, real-time clock.
Why this is hard
The world keeps moving between decisions. Every delay changes the state. Every misclick has a cost. Long action chains drift, and there is no API shortcut to recover them. Pixels in, actions out is not a simplification; it is the load-bearing constraint.
Getting started
The repo is github.com/layerbrain/wargames. Clone it, install it, run a debug episode:
git clone https://github.com/layerbrain/wargames
cd wargames
pip install -e .
wargames run --split debug --profile standard

The debug split is single-mission and deterministic, so the first run is a smoke test against a known outcome. From there, three things are worth reading before writing any code:
- wargames/harness/: the websocket protocol. This is the contract every agent talks to.
- scenarios/redalert/profiles/standard.yaml: what a real profile looks like. Edit a weight and re-run to see the number move.
- wargames/episode/: the episode controller and reward evaluator. Same code path runs locally, on Prime Intellect Verifiers, and through OpenReward Standard.
Plug an agent in through whichever harness fits: Prime Intellect Verifiers for RL training and eval, OpenReward Standard for harness eval (Codex, Claude Code, Gemini via Firehorse), or a local Agent class for full control. Same profile, same number anywhere.
What comes next
Benchmarks. The standard profile is running across frontier models on the held-out test split, and the curves will appear on this page as the runs finish. The interesting reads will not be the rankings but the failure modes: where each model stalls, where dense profiles hide what terminal profiles show, and where adherence to a safety profile breaks down under aggressive contrast.
After that: more missions, more harness integrations, more games on the same protocol. Insights as we find them, posted here.