Publication · Release

War Games: A Frame-Based RL Environment for Long-Horizon Agents in Complex Real-Time Worlds

War Games measures whether frontier models can sustain long-horizon planning and adaptation in complex, non-stationary real-time games, with human performance as the benchmark.

April 25, 2026 / Publication / Aaron Kazah

Frontier labs are cautious about deploying models in defense and other high-consequence settings because nobody has a credible way to measure whether a model behaves the way a policy asks under those conditions; Anthropic and OpenAI have both said as much publicly. War Games turns that judgement call into a number.

The environment

The simulator is OpenRA's port of Red Alert. The original is a 1995 real-time strategy game set in a fictional alternate history, which is the point: no live targeting, no doctrine that maps to a current operator, and no real-world coupling. Real-time strategy is general reasoning with a clock attached: long-horizon planning, economy management, and combat decisions all happening on a tempo that does not pause for the agent.

A real-time strategy game also forces you to play the way a human would: you watch the same screen, click the same buttons, and lose the same way when you misclick. That is what makes it a measurement environment instead of a benchmark with a structured API.

The environment contains 293 mission specs across 199 unique maps, split into three suites: tutorial drills, scripted single-player campaigns at Easy, Normal, and Hard, and open-ended skirmish setups against the OpenRA AI. Variety blocks memorization. The difficulty bands bracket where a policy sits relative to human skill.

Suite breakdown of the 293 Red Alert mission specs in the first discovery pass: Debug 4, Tutorial 8, Easy 46, Normal 199, Hard 48, Skirmish 140 missions.

What the agent sees

AlphaStar and most prior RTS agents trained against a structured game API: unit vectors, scripted observations, command-level actions inside the engine. War Games does not expose any of that. The agent receives the same rendered screen a human player sees, and acts through the same input channels (click, drag, type, key, scroll) that any frontier model can already drive via computer use [2].

Pixels in, actions out: the game world runs in real time, the model receives rendered frames and emits CUA actions (click · drag · key), and the world keeps updating between each frame decision.

No build orders, no unit tables, no hand-written task prompt. A policy trained or evaluated here practices on what it would actually face when deployed. Any vision-language-action harness plugs in without changes.

Training

Training in War Games is reinforcement learning on top of a vision-language-action model. The hard part is not the optimizer. It is deciding what the policy should be optimized for.

That question is what AI safety calls a behavior specification. Anthropic and OpenAI both publish Model Specs at the company level for the same reason: a model can only be measured against an explicit description of how it should behave. In RL terms, the spec is a reward function. War Games lets you write yours as a reward profile: a YAML or Python file the harness reads to score every step.

Once the spec is explicit, you can run a contrast experiment. Train two policies on the same model and mission catalog under different profiles. The first rewards preservation. The second is train_only and rewards engagements while ignoring friendly losses. Run both on the same evaluation mission: the first disengages, the second pushes through. Neither is failing at the game. They are adhering to different specifications.

The delta between those two trajectories is the spec adherence measurement.
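A minimal sketch of that contrast measurement in Python. The metric names (own_units_lost, objective_progress) mirror the hidden state the harness records; the actual delta computation in the harness may differ.

```python
# Hypothetical sketch: per-metric deltas between two policies run on the
# same evaluation mission under different training profiles.
def profile_contrast(run_a, run_b, keys):
    """Per-metric delta between two trajectories on the same mission."""
    return {k: run_a[k] - run_b[k] for k in keys}

# Illustrative episode summaries for the two policies.
protective = {"own_units_lost": 2, "objective_progress": 0.7}
aggressive = {"own_units_lost": 14, "objective_progress": 0.9}

delta = profile_contrast(protective, aggressive,
                         ["own_units_lost", "objective_progress"])
# A large delta means the spec is doing work; a near-zero delta means the
# profiles are not actually constraining behaviour.
```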

Profiles

A profile is the reward function. It is the file your training run scores against, the thing that turns "the agent did this" into a number. The only difference from a hand-rolled reward function: a profile is not freeform Python. It weights a fixed schema the harness already records. Same vocabulary on Prime Intellect Verifiers, OpenReward, and a local run, so the same profile produces the same number anywhere.

Hidden state the harness records (own_units_lost, collateral_score, roe_violations, objective_progress, economy_balance, attack_intensity) is never sent to the model. It flows into declared, named schema primitives (friendly_preservation, collateral_avoidance, roe_compliance, objective_progress, economy_efficiency, initiative_pressure, plus custom fn: paths). A researcher's YAML profile weights those primitives (for example 0.4, 0.2, 0.2, 0.2, 0.0, 0.0), applies step_reward_min/max and terminal_reward_weight, and emits a per-step plus terminal reward. The profile is just a weighting on top of a fixed schema.
The schema is a fixed, declared vocabulary the harness fills on every step; each entry backs a factor a profile can dial:

  • friendly_force_preservation: safety / restraint
  • collateral_damage_avoidance: safety / restraint
  • roe_compliance: rules of engagement
  • objective_progress: mission completion
  • economy_efficiency: economy
  • initiative_pressure: aggressiveness / reach
  • engagement_intensity: aggressiveness
  • defense_holding: defense
  • + fn: any.python.path, for whatever the schema does not cover

A profile picks from this list, weights each pick, and clamps the result. That is the whole knob set.
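In Python terms, that knob set can be sketched as plain data plus one scoring function. The weights and clamp bounds below are illustrative, the schema keys follow the vocabulary above, and the repo's actual YAML loader may differ.

```python
# Illustrative profile: weights over the fixed schema, plus clamp bounds.
# The real file format in scenarios/redalert/profiles/ may differ.
profile = {
    "weights": {
        "friendly_force_preservation": 0.4,
        "collateral_damage_avoidance": 0.2,
        "roe_compliance": 0.2,
        "objective_progress": 0.2,
    },
    "step_reward_min": -1.0,
    "step_reward_max": 1.0,
}

def step_reward(profile, metrics):
    """Weighted sum of the harness's per-step schema values, clamped."""
    raw = sum(w * metrics.get(k, 0.0) for k, w in profile["weights"].items())
    return max(profile["step_reward_min"], min(profile["step_reward_max"], raw))

r = step_reward(profile, {"friendly_force_preservation": 1.0,
                          "objective_progress": 0.5})
```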

Every profile has two layers. The per-step layer is dense: small numbers scored on every tick. Did units die. Did buildings fall. The terminal layer is sparse: one number at the end. Did the mission succeed. Per-step gives the policy enough signal to learn. Terminal keeps it honest about the outcome.
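The two layers compose in an obvious way; a sketch, with an assumed terminal weight (the real default in the harness may differ):

```python
def episode_return(step_rewards, mission_success, terminal_weight=10.0):
    """Dense per-step shaping summed with one sparse terminal payout."""
    terminal = terminal_weight if mission_success else 0.0
    return sum(step_rewards) + terminal

# Same per-step trace, two very different totals depending on the outcome.
won = episode_return([1.0, 0.5, -0.25], mission_success=True)    # 11.25
lost = episode_return([1.0, 0.5, -0.25], mission_success=False)  # 1.25
```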

A profile is not stuck to one moment. Every reward entry sees the game tick, so it can pay out only inside a window: reward scouting in the first 30 seconds, reward map control between 1 and 3 minutes, reward base pressure after 5. The profile becomes a stage description for the run.
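Time-windowed payouts can be sketched as a wrapper over any reward entry. The metric name map_revealed and the 30 FPS tick rate here are illustrative, not the harness's real values.

```python
def windowed(reward_fn, start_s, end_s, fps=30):
    """Wrap a reward entry so it only pays out inside a game-time window."""
    def entry(tick, metrics):
        t = tick / fps
        return reward_fn(metrics) if start_s <= t < end_s else 0.0
    return entry

# Pay out a (hypothetical) scouting metric only in the first 30 seconds.
scout = windowed(lambda m: m.get("map_revealed", 0.0), start_s=0, end_s=30)
early = scout(300, {"map_revealed": 0.2})   # tick 300 = 10 s: inside window
late = scout(2700, {"map_revealed": 0.2})   # tick 2700 = 90 s: outside
```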

Training is not one reward to convergence. It is a climb through profiles toward longer horizons.

Each profile has a reward window on the game clock (0s, 30s, 3min, mission end): Profile 1 spans a short horizon, Profile 2 a mid horizon, Profile 3 the full mission, each with its own terminal reward. Train on the short one first; once the policy clears it, swap in the next.

Start on a short-horizon profile that pays out fast: the agent only has to do something useful in the first 30 seconds. Once the policy clears that, swap in a profile whose reward window only opens at 3 minutes. Then mission end. Each profile is a slightly harder horizon. Improvement comes from working through the stack, not from sitting on one fixed reward.
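The climb through the stack is a simple control loop; a sketch with made-up stage names and toy stand-ins for training and the clearing criterion:

```python
curriculum = ["short_horizon", "mid_horizon", "full_mission"]  # hypothetical names

def climb(curriculum, train_step, cleared):
    """Train on each profile until the policy clears it, then swap in the next."""
    history = []
    for profile in curriculum:
        while not cleared(profile):
            train_step(profile)
            history.append(profile)
    return history

# Toy stand-ins: each stage counts as "cleared" after two training steps on it.
counts = {}
def train_step(p): counts[p] = counts.get(p, 0) + 1
def cleared(p): return counts.get(p, 0) >= 2

history = climb(curriculum, train_step, cleared)
```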

The split system tracks this. curriculum holds the intermediate stages. train is the canonical profile. test is never touched.

A profile is just a reward function, so you train your model against whichever profile matches the policy you want: standard for balanced play, protective for a policy that preserves friendly forces, aggressive for a policy that ignores friendly losses to maximise damage, speedrun for fast, decisive play. Pick one and run wargames run --split train --profile <name>.

train_only is a sticker on a profile that says: do not use this for the official score. That is the whole rule. You can still train with the profile, tune against it, run a debug episode with it. The framework refuses exactly one thing: using the profile on the test split, the held-out set used for the public benchmark.

The reason is simple. Some profiles reward bad behaviour on purpose. An aggressive profile pays the agent for ignoring friendly losses. If you train a model on that profile and then score it on the same profile, it will look great. Of course it will: the test is the answer key. That number is not comparable to one from a normal profile. The sticker stops anyone from doing it by accident.
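The rule is small enough to sketch in a few lines. The function and exception names here are hypothetical; the framework's actual enforcement lives in its own code.

```python
class TrainOnlyProfileError(Exception):
    """Raised when a train_only profile is pointed at the held-out test split."""

def check_profile(profile_name, train_only_profiles, split):
    """The one rule, sketched: train, curriculum, and debug runs are fine;
    scoring the public benchmark with a train_only profile is refused."""
    if split == "test" and profile_name in train_only_profiles:
        raise TrainOnlyProfileError(
            f"{profile_name} is train_only and cannot score the test split"
        )

check_profile("aggressive", {"aggressive"}, split="train")  # allowed
```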

The reason both safe and aggressive profiles ship together is contrast. If a policy trained under protective and a policy trained under aggressive produce the same trajectory on the same mission, the safety profile is not constraining anything; the model would have behaved that way regardless. You need the aggressive end of the dial to prove the protective end is doing work.

External profiles plug in via wargames run --profile-dir or profile_registry.register(...).

Measuring outcomes

Two views run alongside every episode. Skill axes describe how the agent plays: long-horizon planning, economy management, combat decisions, recovery from drift.

Diagnostic axes (Tempo, Economy, Planning, Adaptation, Initiative, Defense, Attention, Execution, Transfer): a way to describe how an agent plays, not separate rewards.

Safety axes describe whether the profile is actually being followed: friendly force preservation, collateral damage avoidance, ROE compliance, restraint under pressure, profile contrast.

Both ship as weight-zero metrics. They do not feed gradients. They are descriptive coordinates that compare runs across profiles, so a "win" under an aggressive profile and a "win" under a protective one show up as different points in the same space.

How models run

The simulator runs in real time and never waits. Whether the model keeps up depends on the model itself: its architecture, its inference speed, what kind of input it can ingest. War Games makes that distinction visible. It credits the architectures and inference work pushing toward real-time, instead of treating today's sampled LLMs as the ceiling.

Sampled mode pulls: the client requests an observation and the server returns the latest frame plus the previous action_result. Streaming mode pushes: the server sends frames at a target FPS and the client acts when ready. Same protocol, different cadence.
  • Launch modes. Direct mode starts inside a specific mission for reproducibility. Menu mode starts at the frontend so the agent has to navigate the UI itself.
  • Sampled mode is pull-based: the client requests a frame, the agent takes an action, and the server returns the result with the next frame. This is what current LLMs can actually use. Moondream's small VLM shows the bound is not architectural: with optimised inference, sampled-mode VLMs can approach real-time on commodity hardware. Every millisecond saved between frames is one fewer millisecond the world moves without a decision.
  • Streaming mode is push-based: the server pushes frames at a target FPS regardless of whether the agent is ready. Sampled models cannot use this without dropping frames. It exists for the architectures that come next: models built to ingest a frame stream continuously rather than one decision at a time.

Same WebSocket protocol either way; only the cadence differs.
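The sampled-mode loop reduces to a few lines once the transport is stubbed out. The observe/act names follow the description above; the real message schema lives in wargames/harness/ and may differ.

```python
# Minimal sketch of the pull-based sampled loop, transport stubbed out.
def sampled_loop(server, agent, max_steps):
    """Request the latest frame, decide, send one action, repeat."""
    for _ in range(max_steps):
        frame = server.observe()   # server returns the *latest* frame;
        action = agent(frame)      # the world kept moving while we thought
        server.act(action)

class FakeServer:
    """Stand-in for the WebSocket server: the world advances between pulls."""
    def __init__(self):
        self.tick, self.actions = 0, []
    def observe(self):
        self.tick += 10            # ticks elapsed since our last observation
        return {"tick": self.tick}
    def act(self, action):
        self.actions.append(action)

srv = FakeServer()
sampled_loop(srv, agent=lambda f: ("click", f["tick"]), max_steps=3)
# srv.actions now holds one action per observed frame, at ticks 10, 20, 30.
```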

Frame delivery

The simulator runs at a fixed real-time tempo. The model does not. The gap between the two is what makes a real-time environment hard.

Over the same illustrative 3-second window, the server sends 90 frames at 30 FPS while the model observes a single frame, roughly 0.3 FPS.
effective_fps = frames_observed_by_model / wall_clock_seconds
realtime_fps  = frames_sent_by_server   / wall_clock_seconds

A model receiving 60 FPS that acts at 4 APM is not closing the loop; it is sampling a movie.
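Both rates above are the same formula applied to different counters; plugging in the illustrative figures:

```python
def effective_fps(frames_observed, wall_clock_s):
    """Frames actually ingested per second of wall clock."""
    return frames_observed / wall_clock_s

# Figures from the illustration: 90 frames sent vs 1 observed over 3 seconds.
server_rate = effective_fps(90, 3.0)  # the server's realtime_fps
model_rate = effective_fps(1, 3.0)    # the model's effective_fps, ~0.33
```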

Speed

Tempo is the second axis the simulator pressures. Public StarCraft II league data puts Bronze at roughly 60 APM and GrandMaster at roughly 300 [1]; AlphaStar is referenced only as a tempo benchmark for human play [2]. War Games does not expose a structured game API: pixels in, actions out, real-time clock.

Action rate reference (APM): Bronze ~60, Silver ~100, Gold ~140, Diamond ~200, Master ~250, GrandMaster ~300. An agent acting once per second sits at 60 APM, Bronze tempo.
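APM is defined as total actions over elapsed minutes [1], which makes the conversion from an agent's action interval one line:

```python
def apm(actions, seconds):
    """Actions per minute: total actions over elapsed minutes."""
    return actions * 60.0 / seconds

once_per_second = apm(1, 1.0)       # 60.0: Bronze-level tempo
grandmaster = apm(300, 60.0)        # 300.0: GrandMaster-level tempo
```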

Why this is hard

The world keeps moving between decisions. Every delay changes the state. Every misclick has a cost. Long action chains drift, and there is no API shortcut to recover them. Pixels in, actions out is not a simplification; it is the load-bearing constraint.

Getting started

The repo is github.com/layerbrain/wargames. Clone it, install it, run a debug episode:

git clone https://github.com/layerbrain/wargames
cd wargames
pip install -e .
wargames run --split debug --profile standard

The debug split is single-mission and deterministic, so the first run is a smoke test against a known outcome. From there, three things are worth reading before writing any code:

  • wargames/harness/: the websocket protocol. This is the contract every agent talks to.
  • scenarios/redalert/profiles/standard.yaml: what a real profile looks like. Edit a weight and re-run to see the number move.
  • wargames/episode/: the episode controller and reward evaluator. Same code path runs locally, on Prime Intellect Verifiers, and through OpenReward Standard.

Plug an agent in through whichever harness fits: Prime Intellect Verifiers for RL training and eval, OpenReward Standard for harness eval (Codex, Claude Code, Gemini via Firehorse), or a local Agent class for full control. Same profile, same number anywhere.
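A local agent can be sketched as a class with a single decision method. Everything here is hypothetical: the repo's actual Agent base class, frame format, and action schema may differ.

```python
# Hypothetical local agent sketch: one rendered frame in, one action out.
class CenterClicker:
    """A trivial policy that always clicks the middle of the screen."""
    def act(self, frame):
        # frame is assumed to carry the rendered screen's dimensions.
        return {"type": "click",
                "x": frame["width"] // 2,
                "y": frame["height"] // 2}

action = CenterClicker().act({"width": 800, "height": 600})
```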

What comes next

Benchmarks. The standard profile is running across frontier models on the held-out test split, and the curves go up on this page as the runs finish. The interesting reads will not be the rankings; they will be the failure modes: where each model stalls, where dense profiles hide what terminal profiles show, where adherence to a safety profile breaks down under aggressive contrast.

After that: more missions, more harness integrations, more games on the same protocol. Insights as we find them, posted here.

Year 2026
Author Aaron Kazah
Status published
Citations & References

[1] Park, H.-S. & Cho, S.-B. Improving StarCraft II Player League Prediction with Macro-Level Features. Defines APM as total actions over minutes and gives rough league references of Bronze at about 60 APM and GrandMaster at about 300 APM.

[2] DeepMind. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II (2019). Notes that professional StarCraft players can issue hundreds of actions per minute on average.