Frontier labs are cautious about deploying models in defense and other high-consequence settings because nobody has a credible way to measure whether a model behaves as its policy requires under those conditions. Anthropic and OpenAI have both been publicly cautious for exactly this reason. War Games turns that judgement call into a number.
The environment
The simulator is OpenRA's port of Red Alert. The original is a 1995 real-time strategy game set in a fictional alternate history, which is the point: no live targeting, no doctrine that maps to a current operator, and no real-world coupling. Real-time strategy is general reasoning with a clock attached: long-horizon planning, economy management, and combat decisions all happening on a tempo that does not pause for the agent.
A real-time strategy game also forces you to play the way a human would: you watch the same screen, click the same buttons, and lose the same way when you misclick. That is what makes it a measurement environment instead of a benchmark with a structured API.
The environment contains 293 mission specs across 199 unique maps in three tiers: tutorial drills, scripted single-player campaigns at Easy, Normal, and Hard, and open-ended skirmish setups against the OpenRA AI. Variety blocks memorization; the difficulty bands bracket where a policy sits relative to human skill.
What the agent sees
AlphaStar and most prior RTS agents trained against a structured game API: unit vectors, scripted observations, command-level actions inside the engine. War Games does not expose any of that. The agent receives the same rendered screen a human player sees, and acts through the same input channels (click, drag, type, key, scroll) that any frontier model can already drive via computer use.
No build orders, no unit tables, no hand-written task prompt. A policy trained or evaluated here practices on what it would actually face when deployed. Any vision-language-action harness plugs in without changes.
Training
Training in War Games is reinforcement learning on top of a vision-language-action model. The hard part is not the optimizer. It is deciding what the policy should be optimized for.
That question is what AI safety calls a behavior specification. Anthropic and OpenAI both publish Model Specs at the company level for the same reason: a model can only be measured against an explicit description of how it should behave. In RL terms, the spec is a reward function. War Games lets you write yours as a reward profile: a YAML or Python file the harness reads to score every step.
Once the spec is explicit, you can run a contrast experiment. Train two policies on the same model and mission catalog under different profiles. The first rewards preservation. The second is train_only and rewards engagements while ignoring friendly losses. Run both on the same evaluation mission: the first disengages, the second pushes through. Neither is failing at the game. They are adhering to different specifications.
The delta between those two trajectories is the spec adherence measurement.
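That delta can be computed directly from logged trajectories. A minimal sketch, assuming each trajectory is a list of per-step stat dicts; the function and field names here are illustrative, not the harness's real API:

```python
# Hypothetical sketch: the real metric lives in the harness; names are illustrative.
def spec_adherence_delta(traj_a, traj_b, key="friendly_units_lost"):
    """Contrast two trajectories from the same mission by a logged statistic."""
    total_a = sum(step.get(key, 0) for step in traj_a)
    total_b = sum(step.get(key, 0) for step in traj_b)
    return total_b - total_a

# A protective policy that disengages vs. an aggressive one that pushes through.
protective = [{"friendly_units_lost": 0}, {"friendly_units_lost": 1}]
aggressive = [{"friendly_units_lost": 3}, {"friendly_units_lost": 4}]
print(spec_adherence_delta(protective, aggressive))  # 6
```

A delta near zero on the same mission would mean the two specifications are not actually producing different behavior.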
Profiles
A profile is the reward function. It is the file your training run scores against, the thing that turns "the agent did this" into a number. The only difference from a hand-rolled reward function: a profile is not freeform Python. It weights a fixed schema the harness already records. Same vocabulary on Prime Intellect Verifiers, OpenReward, and a local run, so the same profile produces the same number anywhere.
Every profile has two layers. The per-step layer is dense: small numbers scored on every tick. Did units die. Did buildings fall. The terminal layer is sparse: one number at the end. Did the mission succeed. Per-step gives the policy enough signal to learn. Terminal keeps it honest about the outcome.
A profile is not stuck to one moment. Every reward entry sees the game tick, so it can pay out only inside a window: reward scouting in the first 30 seconds, reward map control between 1 and 3 minutes, reward base pressure after 5. The profile becomes a stage description for the run.
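A windowed reward entry can be sketched in a few lines. This is an illustration of the idea, not the harness's actual schema; the tick rate and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

TICKS_PER_SECOND = 25  # assumption; the real rate is set by the engine

@dataclass
class RewardEntry:
    weight: float
    fn: Callable[[dict], float]               # scores one recorded step
    window: Optional[Tuple[int, int]] = None  # (start_tick, end_tick); None = always on

    def score(self, step: dict) -> float:
        tick = step["tick"]
        if self.window and not (self.window[0] <= tick < self.window[1]):
            return 0.0  # outside the window, this entry pays nothing
        return self.weight * self.fn(step)

# Reward scouting only in the first 30 seconds.
scouting = RewardEntry(
    weight=0.1,
    fn=lambda s: s.get("tiles_revealed", 0),
    window=(0, 30 * TICKS_PER_SECOND),
)

early = {"tick": 100, "tiles_revealed": 4}   # inside the window: pays 0.4
late = {"tick": 2000, "tiles_revealed": 4}   # outside the window: pays 0.0
```

Stacking several such entries with different windows is what turns a profile into a stage description for the run.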
Training is not one reward to convergence. It is a climb through profiles toward longer horizons.
Start on a short-horizon profile that pays out fast: the agent only has to do something useful in the first 30 seconds. Once the policy clears that, swap in a profile whose reward window only opens at 3 minutes. Then mission end. Each profile is a slightly harder horizon. Improvement comes from working through the stack, not from sitting on one fixed reward.
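The climb itself is a small loop. A sketch under stated assumptions: the stage names, the promotion threshold, and the stub trainer are all invented for illustration, not part of the framework:

```python
# Illustrative curriculum loop; profile names and threshold are assumptions.
def run_curriculum(train_fn, stages, threshold=0.8):
    """Advance to the next profile once the policy clears the current one."""
    history = []
    for profile in stages:
        score = 0.0
        while score < threshold:
            score = train_fn(profile)  # one training round; returns eval score
        history.append((profile, score))
    return history

# Stub trainer: pretend each round improves the policy a little.
scores = {"scout_30s": 0.0, "map_control_3min": 0.0, "mission_end": 0.0}
def stub_train(profile):
    scores[profile] = min(1.0, scores[profile] + 0.3)
    return scores[profile]

print(run_curriculum(stub_train, ["scout_30s", "map_control_3min", "mission_end"]))
```

Each stage gates the next, so the policy only ever faces a horizon slightly longer than the one it just cleared.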
The split system tracks this. curriculum holds the intermediate stages. train is the canonical profile. test is never touched.
A profile is just a reward function, so you train your model against whichever profile matches the policy you want. standard for balanced play. protective for a policy that preserves friendly forces. aggressive for a policy that ignores friendly losses to maximise damage. speedrun for fast, decisive play. Pick one and run wargames run --split train --profile &lt;name&gt;.
train_only is a sticker on a profile that says: do not use this for the official score. That is the whole rule. You can still train with the profile, tune against it, run a debug episode with it. The framework refuses exactly one thing: using the profile on the test split, the held-out set used for the public benchmark.
The reason is simple. Some profiles reward bad behaviour on purpose. An aggressive profile pays the agent for ignoring friendly losses. If you train a model on that profile and then score it on the same profile, it will look great. Of course it will: the test is the answer key. That number is not comparable to one from a normal profile. The sticker stops anyone from doing it by accident.
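The guard is one conditional. A minimal sketch of the rule, assuming the framework checks it at launch; the exception name and call shape here are hypothetical:

```python
# Minimal sketch of the train_only guard; the real check lives in the harness.
class TrainOnlyProfileError(Exception):
    pass

def check_profile(profile_name: str, split: str, train_only: set) -> None:
    """Refuse exactly one thing: a train_only profile scoring the test split."""
    if split == "test" and profile_name in train_only:
        raise TrainOnlyProfileError(
            f"{profile_name} is train_only and cannot score the test split"
        )

TRAIN_ONLY = {"aggressive"}
check_profile("aggressive", "train", TRAIN_ONLY)  # fine: training is allowed
check_profile("standard", "test", TRAIN_ONLY)     # fine: normal profile
# check_profile("aggressive", "test", TRAIN_ONLY) # raises TrainOnlyProfileError
```

Everything else — training, tuning, debug episodes — passes through untouched.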
The reason both safe and aggressive profiles ship together is contrast. If a policy trained under protective and a policy trained under aggressive produce the same trajectory on the same mission, the safety profile is not constraining anything; the model would have behaved that way regardless. You need the aggressive end of the dial to prove the protective end is doing work.
External profiles plug in via wargames run --profile-dir or profile_registry.register(...).
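The registration path can be pictured as a small name-to-profile map. This is a hypothetical minimal registry for illustration; the real profile_registry API is defined in the repo and may differ:

```python
# Hypothetical minimal registry; the real profile_registry API may differ.
class ProfileRegistry:
    def __init__(self):
        self._profiles = {}

    def register(self, name, profile):
        if name in self._profiles:
            raise ValueError(f"profile {name!r} already registered")
        self._profiles[name] = profile

    def get(self, name):
        return self._profiles[name]

registry = ProfileRegistry()
registry.register("cautious_scout", {
    "per_step": {"tiles_revealed": 0.1},      # dense layer: small, every tick
    "terminal": {"mission_success": 1.0},     # sparse layer: one number at the end
})
print(registry.get("cautious_scout")["terminal"]["mission_success"])  # 1.0
```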
Measuring outcomes
Two views run alongside every episode. Skill axes describe how the agent plays: long-horizon planning, economy management, combat decisions, recovery from drift.
Safety axes describe whether the profile is actually being followed: friendly force preservation, collateral damage avoidance, ROE compliance, restraint under pressure, profile contrast.
Both ship as weight-zero metrics. They do not feed gradients. They are descriptive coordinates that compare runs across profiles, so a "win" under an aggressive profile and a "win" under a protective one show up as different points in the same space.
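Weight-zero is a simple mechanism: the axis is evaluated and logged on every step, but multiplied by zero before it touches the reward. A sketch, assuming entries carry a weight; the axis names mirror the post but the schema is illustrative:

```python
# Sketch of weight-zero metrics; the schema is illustrative, not the harness's.
def score_step(step: dict, entries: dict):
    """Return (reward, metrics): weight-zero axes are recorded but shape nothing."""
    reward, metrics = 0.0, {}
    for name, (weight, fn) in entries.items():
        value = fn(step)
        metrics[name] = value      # every axis is logged as a descriptive coordinate
        reward += weight * value   # weight 0.0 contributes nothing to the gradient
    return reward, metrics

entries = {
    "engagements_won": (1.0, lambda s: s.get("engagements_won", 0)),
    "friendly_preservation": (0.0, lambda s: -s.get("friendly_units_lost", 0)),
}
reward, metrics = score_step({"engagements_won": 2, "friendly_units_lost": 3}, entries)
# reward is 2.0; friendly_preservation is logged as -3 despite shaping nothing
```

Because the descriptive axes are always recorded, two wins under different profiles land at different coordinates in the same space.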
How models run
The simulator runs in real time and never waits. Whether the model keeps up depends on the model itself: its architecture, its inference speed, what kind of input it can ingest. War Games makes that distinction visible. It credits the architectures and inference work pushing toward real-time, instead of treating today's sampled LLMs as the ceiling.
- Launch modes. Direct mode starts inside a specific mission for reproducibility. Menu mode starts at the frontend so the agent has to navigate the UI itself.
- Sampled mode is pull-based: the server sends a frame, the agent takes an action, the server sends the next frame back. This is what current LLMs can actually use. Moondream's small VLM shows the bound is not architectural: with optimised inference, sampled-mode VLMs can approach real-time on commodity hardware. Every millisecond saved between frames is one fewer millisecond the world moves without a decision.
- Streaming mode is push-based: the server pushes frames at a target FPS regardless of whether the agent is ready. Sampled models cannot use this without dropping frames. It exists for the architectures that come next: models built to ingest a frame stream continuously rather than one decision at a time.
Same WebSocket protocol either way; only the cadence differs.
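The difference between the two cadences can be simulated without any networking. A toy clock stands in for the WebSocket here; the latency and frame-rate numbers are assumptions chosen for illustration:

```python
# Toy simulation of the two cadences; all numbers are illustrative assumptions.
def sampled_frames(agent_latency_ms: int, episode_ms: int) -> int:
    """Pull-based: the next frame is sent only after the agent acts."""
    t, observed = 0, 0
    while t < episode_ms:
        observed += 1            # agent ingests a frame, thinks, acts
        t += agent_latency_ms    # then the server sends the next one
    return observed

def streaming_frames(agent_latency_ms: int, episode_ms: int, frame_interval_ms: int) -> int:
    """Push-based: frames arrive on the server's clock; a busy agent drops them."""
    observed, busy_until = 0, 0
    for t in range(0, episode_ms, frame_interval_ms):
        if t >= busy_until:      # agent is free: it ingests this frame
            observed += 1
            busy_until = t + agent_latency_ms
        # otherwise the frame is dropped; they keep coming regardless
    return observed

# A 250 ms agent over a 10-second episode, stream at 10 FPS:
print(sampled_frames(250, 10_000))        # 40 frames, none dropped
print(streaming_frames(250, 10_000, 100)) # 34 of 100 pushed frames ingested
```

In sampled mode the slow agent simply sees fewer, fresher frames; in streaming mode the same agent drops two out of every three.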
Frame delivery
The simulator runs at a fixed real-time tempo. The model does not. The gap between the two is what makes a real-time environment hard.
effective_fps = frames_observed_by_model / wall_clock_seconds
realtime_fps = frames_sent_by_server / wall_clock_seconds

A model receiving 60 FPS that acts at 4 APM is not closing the loop; it is sampling a movie.
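Plugging the 60 FPS / 4 APM example into the two rates makes the gap concrete. Pure arithmetic; equating frames-acted-on with actions is a simplifying assumption (at best one fresh frame per action):

```python
# The 60 FPS / 4 APM example as arithmetic; "one fresh frame per action" is an assumption.
wall_clock_s = 60.0
frames_sent = 60 * 60        # server streams 60 FPS for one minute
actions = 4                  # a 4 APM agent acts 4 times in that minute
frames_acted_on = actions    # at best, one fresh frame per action

print(frames_sent / wall_clock_s)      # realtime_fps: 60.0
print(frames_acted_on / wall_clock_s)  # closed-loop rate: ~0.067
```

Roughly one decision per 900 frames: the agent is watching, not playing.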
Speed
Tempo is the second axis the simulator pressures. Public StarCraft II league data puts Bronze players at roughly 60 APM and GrandMasters around 300; AlphaStar is referenced only as a tempo benchmark for human play. War Games does not expose a structured game API: pixels in, actions out, real-time clock.
Why this is hard
The world keeps moving between decisions. Every delay changes the state. Every misclick has a cost. Long action chains drift, and there is no API shortcut to recover them. Pixels in, actions out is not a simplification; it is the load-bearing constraint.
Getting started
The repo is github.com/layerbrain/wargames. Clone it, install it, run a debug episode:
git clone https://github.com/layerbrain/wargames
cd wargames
pip install -e .
wargames run --split debug --profile standard

The debug split is single-mission and deterministic, so the first run is a smoke test against a known outcome. From there, three things are worth reading before writing any code:
- wargames/harness/: the websocket protocol. This is the contract every agent talks to.
- scenarios/redalert/profiles/standard.yaml: what a real profile looks like. Edit a weight and re-run to see the number move.
- wargames/episode/: the episode controller and reward evaluator. Same code path runs locally, on Prime Intellect Verifiers, and through OpenReward Standard.
Plug an agent in through whichever harness fits: Prime Intellect Verifiers for RL training and eval, OpenReward Standard for harness eval (Codex, Claude Code, Gemini via Firehorse), or a local Agent class for full control. Same profile, same number anywhere.
What comes next
Benchmarks. The standard profile is running across frontier models on the held-out test split, and the curves will appear on this page as the runs finish. The interesting reads will not be the rankings but the failure modes: where each model stalls, where dense profiles hide what terminal profiles show, and where adherence to a safety profile breaks down under aggressive contrast.
After that: more missions, more harness integrations, more games on the same protocol. Insights as we find them, posted here.