I've been trying to name a thing I think matters. The best public name I have right now is World Operating System, in the same sense that World Model describes a learned internal simulator1. Technically, what I mean is a learned supervisory runtime that sits above a conventional kernel and execution substrate. It watches a live stream of observations, keeps an internal model of the machine it is running, and produces two outputs at once: a rendered pixel surface, which is what you see on screen, and a guarded action stream grounded in a real host execution boundary. The first concrete ABI I have in mind is Linux.
If this works, it is the runtime substrate for generative software. Software stops being something you mostly package in advance and starts becoming something you generate from state, intent, and real execution.
What keeps the idea honest is grounding it in a real host interface rather than a toy simulator with a made-up action space. Linux is the first concrete version I have in mind. That is also why today's coding agents, app generators, and computer-use tools are early, weaker precursors: they prove appetite, but they still sit above source files, screenshots, DOMs, and click targets instead of a runtime that can infer state, render the surface, and act through a grounded execution interface. The claim is not that a learned model should replace kernel enforcement, virtual memory, interrupts, or hardware isolation.
The deeper claim is that we have been treating the wrong thing as the unit of computing. We treat the app as the basic object, then package it, ship it, wire it to other software, and spend the rest of its life maintaining the boundaries we created. Some of those boundaries are real: permissions, isolation, scheduling, fault containment. But a large part of the application layer exists because humans needed a way to package, ship, and monetize work. When you ask a teammate to update a forecast, move a meeting, and send a follow-up, they experience one state of the world. Our machines bounce through boundaries we drew for shipping, not because the task requires them.
Software Is a Tower of Babel
The Seam Is the Cost
What bothers me about modern software is how much of it is spent translating between islands that should not be islands in the first place. Your calendar, CRM, spreadsheet, docs, browser, database, and design tool all know something about the same world, but each one holds that knowledge inside its own container. So we build connectors. Then we build systems to manage the connectors. Then we build companies to maintain the systems that manage the connectors.
Why the Boundary Moves
That does not mean every boundary vanishes. Protection, permissions, scheduling, isolation, and fault containment matter more when inferred state is centralized, because drift in the model's map increases blast radius unless isolation is preserved. Many application boundaries are historical packaging boundaries, but some encode indispensable control properties. A learned runtime does not remove the need for modularity. It changes where modularity lives.
Pixels Are the Universal Interface
What Pixels Give You
Everything a human can directly inspect on a computer ends up as a surface. A browser tab, a terminal, a spreadsheet, a game, a dashboard, a settings panel. However complicated the stack below it may be, the part a human can inspect is still a stream of frames.
If a human can operate a machine through a screen, a model can in principle learn from the same surface. It can watch state change over time, see the result of actions, and build an internal model from the coupling between what appeared on screen and what happened next.
That also means the visible layer can be specified by examples, not just text.
The observation bundle at time $t$ is $o_t$. In the most ambitious version, $o_t$ can be just the visible frame $x_t$. No privileged trace. No handcrafted intermediate state. Just pixels, time, and the consequences of acting. The input is not a static screenshot. It is a sequence. The model watches a spinner start, a file dialog open, a cursor pause, a process hang. The temporal relation between action and consequence is written directly onto the surface.
$o_t$ just means "what the system can observe right now." In the simplest version, that is the current screen and whatever changed around it over time.
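As a minimal sketch, the bundle can be modeled as a small record. The field names here are illustrative assumptions, not a fixed spec; in the most ambitious version only the frame is populated:

```python
from dataclasses import dataclass, field

@dataclass
class ObservationBundle:
    """Everything the system can observe at one moment (hypothetical shape)."""
    t: float                                             # timestamp
    frame: bytes                                         # raw pixel surface
    input_events: list = field(default_factory=list)     # keys, clicks, scrolls
    syscall_returns: list = field(default_factory=list)  # results of prior actions
    async_events: list = field(default_factory=list)     # signals, timers, callbacks

# Pixels-only version: everything except the frame stays empty.
o_t = ObservationBundle(t=0.0, frame=b"\x00\x00\x00\xff")
```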
Why Pixels Are Not Enough
But pixels are only the surface. A minimized process can hold memory, open files, and live network connections while showing you nothing. Background services may do critical work without ever drawing a window. If the runtime is going to coordinate a real machine coherently, it needs a state representation that includes what the screen leaves out.
Seeing the surface is not the same as knowing the machine state behind it. The runtime still needs to track hidden processes, resources, permissions, and pending work that may never appear on screen.
The point of the figure is simple. Pixels tell you what is visible. The model still needs a deeper state that keeps track of what is real.
The Graph Beneath the Screen
Why a Graph
I think that deeper state has to preserve explicit relational structure and variable cardinality. A graph-structured memory is a natural fit.
I write it as $G = (V, E)$. The nodes $V$ are things like processes, descriptors, buffers, sockets, windows, files, devices, and kernel resources. The edges $E$ are the relations that make those objects meaningful. Which process owns which file descriptor. Which socket belongs to which process. Which window is backed by which buffer. Which resource is shared, blocked, waiting, or protected.
$G = (V, E)$ just means the system keeps a map made of things and relationships. The things are the nodes. The relationships are the edges.
At minimum, the state has to track:
| Entity | Examples |
|---|---|
| Process identity | PIDs, lifecycle state, scheduling priority |
| Resource identity | File descriptors, sockets, buffers, windows, devices |
| Ownership edges | Which process owns which resource |
| Capability labels | What each context is authorized to touch |
| Observability status | Whether each entity is currently visible, backgrounded, or hidden |
| Pending async events | Timers, network callbacks, signals in flight |
| UI-projected subset | Which entities are currently rendered on screen |
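A minimal sketch of that state, assuming illustrative node kinds and edge labels rather than a fixed schema:

```python
class MachineGraph:
    """Toy graph state G = (V, E): typed nodes plus labeled relation edges."""

    def __init__(self):
        self.nodes = {}    # node_id -> {"kind": ..., plus attributes}
        self.edges = set() # (src_id, label, dst_id) triples

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, src, label, dst):
        self.edges.add((src, label, dst))

    def owned_by(self, pid):
        # Resources a process owns, via ownership edges.
        return {dst for (src, label, dst) in self.edges
                if src == pid and label == "owns"}

g = MachineGraph()
g.add_node("pid:42", "process", state="running", visible=True)
g.add_node("fd:3", "file_descriptor", path="/tmp/report.pdf")
g.add_edge("pid:42", "owns", "fd:3")
```

The same node IDs persist across frames, which is exactly the identity continuity the next paragraph argues a flat latent makes hard to preserve.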
A flat latent may encode structure implicitly, but a runtime at this layer needs identity, authority, ownership, and mediation to stay explicit. If you crush all of that into one undifferentiated latent vector, you make the thing you are trying to model harder to preserve and control. A graph also gives the system somewhere to keep identity: the same window persists across frames, the same process moves from foreground to background, the same socket stays open while nothing visible happens. Without that continuity, the model keeps rediscovering the world from scratch.
Imagine you look at your desk, close your eyes for a second, then open them again. Your brain does not rebuild the whole desk from zero. It already has a guess that the laptop, cup, and book are still there, then it uses vision to correct whatever changed. That is the role of structured state here. The system carries forward what it already thinks is true, then updates it with what it sees.
This is where JEPA and V-JEPA 2 matter.2,3 The key insight is not just that prediction should happen in representation space. It is that the representation has to match the structure of the world you want to model. For a supervisory runtime that has to track a real machine, that means preserving relations explicitly.
Why Partition It
The graph is partitioned into a privileged kernel subgraph and a set of isolated execution context subgraphs:

$$G_t = G_t^{\mathrm{kernel}} \cup \bigcup_i G_t^{\mathrm{ctx}_i}$$
An execution context is broader than a foreground app. It can include a user process, a daemon, a kernel thread, or even the idle context. Any action has to respect that structure:

$$\mathrm{admissible}(a, i) \iff \mathrm{touch}(a) \subseteq G_t^{\mathrm{ctx}_i} \cup G_t^{\mathrm{kernel}}$$
Syntactically valid traces are not enough. The state has to encode what each context is allowed to touch. Admissible is not the same as successful. A trace can be authority-compatible and still fail with a bad path, a bad descriptor, or resource exhaustion. Those are normal outcomes the loop has to model.
Think of the graph as a building. Each execution context gets its own room. The kernel is the hallway and the security desk. A context can do a lot inside its own room, but if it wants to pass something to another room, it has to go through the shared controlled space.
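The hallway rule can be sketched as an admissibility check. The partition layout and names here are assumptions:

```python
def is_admissible(partition, context, target_node):
    """True if `context` may touch `target_node` directly."""
    if target_node in partition[context]:
        return True  # inside its own room
    # Anything else must go through the shared controlled space.
    return target_node in partition["kernel"]

# Hypothetical partition: kernel hallway plus two isolated context rooms.
partition = {
    "kernel": {"dev:tty0"},
    "ctx:editor": {"fd:3", "win:1"},
    "ctx:daemon": {"sock:7"},
}
```

Admissible is only the first gate; as the text notes, an admissible trace can still fail at execution time.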
The visible UI is only a projection of the full graph:

$$x_t = \mathrm{render}(G_t^{\mathrm{UI}}), \qquad G_t^{\mathrm{UI}} \subset G_t$$
The visible screen is only one slice of the full machine state. The model needs the whole map, not just the part you can see.
Dynamic State
A static graph would describe an embedded microcontroller, not a supervisory runtime over a live host. Real machines create and destroy processes, allocate and release resources, open and close connections continuously. The graph has to support that:

$$G_{t+1} = \big(G_t \setminus \Delta_t^{-}\big) \cup \Delta_t^{+}$$

where $\Delta_t^{+}$ is the spawned subgraphs and nodes and $\Delta_t^{-}$ is the released ones.
When a process forks or a resource is allocated, the transition model spawns new isolated subgraphs or typed resource nodes. When a context terminates or a resource is released, the model removes the associated subgraph, acting as a latent garbage collector. Graph cardinality is not fixed. It changes with the machine.
The map of the machine is not a fixed-size grid. Processes appear, resources get allocated, connections open. The graph grows. When things terminate or get released, it shrinks. The model has to handle both.
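A toy sketch of the cardinality change, assuming a simple context-to-resources map: fork spawns a new isolated subgraph, termination garbage-collects it.

```python
def fork(contexts, parent, child):
    """Spawn a new isolated context; the child inherits copies of the
    parent's resource references (illustrative inheritance rule)."""
    contexts[child] = set(contexts[parent])
    return contexts

def terminate(contexts, ctx):
    """Latent garbage collection: drop the whole subgraph for a dead context."""
    contexts.pop(ctx, None)
    return contexts

contexts = {"ctx:shell": {"fd:0", "fd:1"}}
fork(contexts, "ctx:shell", "ctx:child")   # graph grows
terminate(contexts, "ctx:shell")           # graph shrinks
```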
Two Streams
The Macro Loop
Once you have a latent state, the rest of the framework falls into one causal loop. The model predicts where the system should be based on the previous state, action, and outcome. It corrects that prediction using the new observation bundle. From the corrected state it does two things at once: renders the current UI surface, and decides what execution context runs next and what grounded trace that context should emit.
The order matters. The model does not render a fantasy future and hope the machine catches up. It corrects to the present first. Then it renders and acts. Observation is the incoming evidence used to correct state. Rendering is the outgoing display generated from the corrected state.
The loop is guess, check, draw, pick, act. First the model guesses what state the machine should be in. Then it checks the screen and fixes the guess. Then it draws the current surface, chooses what runs next, and emits the trace that makes the next real thing happen.
The rendered surface and the grounded trace are not separate products bolted together after the fact. They are two consequences of one internal state. One stream is dense and visual. The other is sparse and operational.
The scheduler matters here too. A runtime like this is not just a model of one foreground app. It has to model contention: which context gets CPU time, which one is blocked, which one wakes up, which one should be preempted. Without that, you have a controller for one task at a time, not a real runtime layer above the machine.
Why Micro Steps Matter
Real software execution is not one abstract action per step. It is a short chain of dependent operations: open a file, get a descriptor back, read from it, map it into memory. Later operations depend on earlier returns. The executor has to unroll autoregressively over micro steps.
The executor policy emits a sequence conditioned on partial trace and partial returns inside the same quantum:

$$u_t^i \sim \pi_{\mathrm{exec}}\big(\cdot \mid s_t, c_t, u_t^{<i}, r_t^{<i}\big)$$
During latent execution, a learned interface model predicts what comes back at each micro step. During real execution, a host adapter maps each micro step into a concrete operation and returns the real result. Those outcomes roll up into the macro outcome, then memory updates.
Software does not act in one giant move. It acts in short chains. Open the file. Get the file descriptor back. Read from it. Do something with the bytes. The model needs to handle that step by step because the later steps depend on what came back from the earlier ones.
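The chain can be sketched as an autoregressive unroll where each micro step is built from the returns so far. The step tuples and the toy interface are assumptions standing in for either the learned outcome predictor or the real host adapter:

```python
def run_quantum(interface, plan):
    """Unroll dependent micro steps; each step may consume earlier returns."""
    returns = []
    for make_step in plan:
        step = make_step(returns)       # later steps depend on earlier results
        returns.append(interface(step))
    return returns

def toy_interface(step):
    """Stub interface: pretend the host answered each micro step."""
    if step[0] == "open":
        return 3                        # a pretend file descriptor
    if step[0] == "read":
        return b"hello"                 # pretend bytes from that descriptor
    return 0                            # e.g. close succeeded

plan = [
    lambda r: ("open", "report.pdf"),
    lambda r: ("read", r[0], 4096),     # uses the fd returned by open
    lambda r: ("close", r[0]),
]
```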
Grounded at the Kernel
Why Linux First
The idea only matters if the action stream lands somewhere real. In general that means a real host execution boundary. The first concrete grounding layer I have in mind is the Linux kernel ABI.
Linux exposes hundreds of syscalls4, but real workloads lean hard on a much smaller core: open, read, write, close, mmap, socket, accept, poll, fork, wait, signal, ioctl. That interface is still much smaller and more stable than the application layer above it.
The model does not need to learn "Photoshop" or "Postgres" or "React" as first-class entities. Those are human names for stable patterns above the kernel boundary. What the model needs to learn are the state transitions and syscall traces that let those patterns exist. The filesystem stays the filesystem. Network sockets stay network sockets. If the model can speak the same kernel interface the machine already understands, it can inherit the machine we already have.
In practice that bridge can be implemented a few ways. On a Linux host, the grounding can be direct. On another host, it can run through a guarded Linux guest or VM. Or it can pass through a platform-specific host adapter that maps predicted micro steps into native operations.
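On a Linux or POSIX host, the direct grounding path can be sketched with Python's thin wrappers over the real syscalls. The `("open", path)`-style step format is an assumption, not a spec:

```python
import os
import tempfile

def host_adapter(step):
    """Map an abstract micro step onto a real host operation."""
    op = step[0]
    if op == "open":
        return os.open(step[1], os.O_RDONLY)   # real open(2)
    if op == "read":
        return os.read(step[1], step[2])       # real read(2)
    if op == "close":
        return os.close(step[1])               # real close(2)
    raise NotImplementedError(op)

# Grounded round trip against a real file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"grounded")
    path = f.name
fd = host_adapter(("open", path))
data = host_adapter(("read", fd, 8))
host_adapter(("close", fd))
os.unlink(path)
```

On a non-Linux host, the same `host_adapter` boundary is where a guest VM or platform-specific translation would sit.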
Linux is the first grounding layer here, not the only planning layer. Planning at syscall granularity alone would be like writing prose one ASCII code at a time. The system likely needs a hierarchical action model:
| Level | Description |
|---|---|
| Level 0 | Host-grounded primitive operations (Linux syscalls first) |
| Level 1 | Reusable trace fragments, compiled macros, cached common patterns |
| Level 2 | Task- or application-structured controllers |
| Level 3 | Intent-conditioned planning |
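A Level-1 macro expanding into Level-0 primitives can be sketched like this; the macro name and trace shape are illustrative assumptions:

```python
# Hypothetical macro library: reusable trace fragments (Level 1) that
# expand into host-grounded primitive operations (Level 0).
MACROS = {
    "read_file": lambda path: [
        ("open", path),
        ("read",),    # descriptor filled in from the open return at run time
        ("close",),
    ],
}

def expand(name, *args):
    """Expand a Level-1 macro into its Level-0 primitive trace."""
    return MACROS[name](*args)

trace = expand("read_file", "/etc/hosts")
```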
One Architecture, Three Ways to Run It
Three Operating Modes
Once the causal loop is defined, three operating modes fall out naturally.
In real execution, observations and outcomes come from the actual system.
In filtered training, the model predicts forward, then corrects itself using what actually happened. The point is to learn a better state model by closing the gap between prior and posterior.
In latent execution, or dreaming, the system runs inside its own learned model, rehearsing, planning, and branching without an external machine in the loop:

$$\hat{s}_{t+1} = f_{\mathrm{prior}}(\hat{s}_t, a_t, \hat{r}_t, m_t), \qquad \hat{r}_t \sim g_{\mathrm{int}}(\hat{s}_t, a_t)$$
Dreaming is what happens when the model is confident enough to keep rolling forward inside its own imagined machine instead of waiting for the real one to answer back every step.
But Why?
The reason to build something like this is not elegance, and it is not just better automation. The main payoff is that software itself becomes generative. Once a runtime can infer state, render the surface, and act through a real execution interface, software stops being something you mostly package in advance and starts becoming something you generate from state and intent.
Generative Software
Instead of building a separate product for every workflow, you can instantiate software directly from state and intent. If the system can keep persistent state, render the surface directly, and act through a real execution interface, it can generate dynamic websites, internal tools, dashboards, and whole application flows on demand. The visual surface and the semantic structure no longer have to be the same artifact, though the system still needs a real semantic layer for accessibility, search, and machine-readable structure.
Software Can Be Specified by Example
Generative software does not have to start from text. A user can point at screenshots, mockups, videos, or existing products as examples of what they want, and the runtime can use those references to shape the surface it generates.
What those references give you is the visible layer, not the full software underneath. A user might want the navigation feel of one tool and the pipeline view of another in a single workflow. The runtime can generate that interface, but it still has to infer and ground the data, permissions, and actions that make it actually work.
That is why today's coding agents, app generators, and computer-use tools look like a precursor rather than the destination. They show demand for generated software and generated capability, but they still work through fixed interfaces, source files, and surfaces built for humans. The larger direction is a runtime that can generate the visible layer itself while staying grounded in real execution.
In the longer term, existing binaries can persist as headless backends while the runtime becomes the primary renderer and interaction layer. The visible software package gets replaced first, not the execution underneath.
This is not science fiction. Flutter Web already renders to canvas while generating a separate semantics tree for accessibility5. Google Docs moved its rendering stack to canvas for the same reason6. We already have working examples of the visual layer and the semantic layer being separated.
What This Changes About Software Markets
If this architecture works, the implication is economic. Most users would not feel like they are building software any more than they feel like they are programming when they generate an image. They would just ask for outcomes. In that world, a lot of SaaS starts looking less like a durable product category and more like a temporary packaging layer. The value does not disappear. It moves from fixed application packaging into the runtime, the data boundary, permissions, trust, and execution.
If a runtime can reproduce and recombine the visible software layer from examples, the packaged UI becomes much less defensible as the core product artifact.
Robots and Real Hardware
The same loop is not limited to laptops. If hardware can be observed and controlled through screens, panels, telemetry, or software commands, the model can treat it as another connected device. Humans already do this: drive in a game, fly in a simulator, operate remote equipment by looking at a surface and reacting in a loop. The core problem stays the same.
Universal RL Environments
Once the system can run against a real host, a guarded guest, or its own learned model, the same architecture can operate, imitate, rehearse, and simulate. This suggests a path to universal RL environments: one runtime that can inhabit software, hardware, and latent rollouts with the same state and action semantics.
Why Now
The last reason is more candid. The recent wave of coding agents and app generators shows that people want software to be generated, adapted, and operated for them instead of packaged in advance. But the deeper reason is that models are finally good enough to help build the data engine this runtime would need: environments, traces, corrections, and long-horizon supervision. We still do not have the final system. What we do have, for the first time, are models that look strong enough to help create the training world for a stronger one.
The Architecture
The State
The system carries two kinds of state at each step: $s_t = (G_t, m_t)$. $G_t$ is the explicit graph of what exists right now and how it is connected. $m_t$ is the running memory that carries recent context. What just happened, what the system expected, what is still unfolding. The state also encodes boundaries: who can touch what, what has to go through the kernel, and where cross-context interaction is legal.
$G_t$ is the map of what exists. $m_t$ is the recent memory the map does not fully hold yet. You need both because a running machine is more than one frozen snapshot.
Ownership Split
| Scope | Owns |
|---|---|
| World OS | State inference, policy, coarse scheduling, UI synthesis, action proposal |
| Guard / Adapter | Capability checks, syscall mediation, translation to execution substrate, audit logging |
| Host OS | Memory safety, CPU scheduling, virtual memory, interrupts, isolation primitives, privileged enforcement |
The learned runtime proposes. The guard mediates. The host operating system enforces.
Components
The architecture decomposes into seven modules:
- Observer — ingests the screen or framebuffer, input events, syscall returns, async events, and optional telemetry into the observation bundle $o_t$.
- State Estimator — runs the prior dynamics model and the posterior correction to maintain the inferred graph state $G_t$ and running memory $m_t$.
- Scheduler — selects the next execution context or task quantum from the kernel partition of the graph.
- Executor — emits autoregressive micro-step syscall traces within the selected quantum, optionally drawing on a learned macro library for common patterns.
- Guard / Validator — filters proposed actions through capability checks, policy constraints, and semantic safety rules before anything crosses into the real machine.
- Renderer — derives the visible pixel surface from the UI partition of the graph state, with a separate semantics layer where needed.
- Host Adapter — maps abstract actions to real guest or host operations, collects returns, and feeds outcomes back into the loop.
The Loop
The macro step looks like this. First the system predicts a prior state from the previous state, action, outcome, and memory:

$$\hat{s}_t = f_{\mathrm{prior}}(s_{t-1}, a_{t-1}, r_{t-1}, m_{t-1})$$
Then it corrects that prior using the current observation bundle:

$$s_t = f_{\mathrm{post}}(\hat{s}_t, o_t)$$
From the corrected state it renders the visible surface:

$$x_t = \mathrm{render}(G_t^{\mathrm{UI}})$$
That does not mean one giant model has to repaint every pixel from scratch at display rate. It means the visible surface is derived from the current inferred state. In practice that surface can be driven through deterministic renderers, retained scene graphs, Canvas or WebGL backends, or other fast compositors, with semantic structure where it helps. If a proposed trace fails at the host, the outcome feeds back into state correction before the next render. The surface never races ahead of what the loop has verified.
It also selects a coarse execution context from the kernel partition:

$$c_t = \pi_{\mathrm{sched}}(G_t^{\mathrm{kernel}})$$
This is macro-level context selection, not a replacement for the host kernel's microsecond CPU scheduler.
The composite action is the scheduled context together with its syscall trace:

$$a_t = (c_t, \tau_t)$$
Inside the selected quantum, the executor rolls out micro steps:

$$\tau_t = (u_t^1, \dots, u_t^{k_t}), \qquad u_t^i \sim \pi_{\mathrm{exec}}\big(\cdot \mid s_t, c_t, u_t^{<i}, r_t^{<i}\big)$$
In latent execution, a learned interface model predicts the micro-step outcomes:

$$\hat{r}_t^i = g_{\mathrm{int}}(s_t, u_t^i)$$
In real execution, those same micro steps are grounded through a host adapter:

$$r_t^i = \mathrm{host}(u_t^i)$$
This is the critical split. The learned policy proposes an abstract trace. A deterministic guard decides what can cross into the real machine.
The interface returns synchronous results and asynchronous events:

$$r_t = \big(r_t^{\mathrm{sync}}, e_t\big)$$
Then memory updates:

$$m_t = f_{\mathrm{mem}}(m_{t-1}, s_t, a_t, r_t)$$
Every cycle ends with the system folding the latest result back into memory so the next decision is made by a system that has actually learned from what just happened.
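Put together, one macro step of the loop can be sketched with every component as a stub. Each function here is a placeholder for a learned or engineered module named above, not an implementation:

```python
def macro_step(state, memory, prev, observe, f):
    """One macro step: predict, correct, render, schedule, execute, ground."""
    prior = f["predict"](state, prev["action"], prev["outcome"], memory)
    o_t = observe()                                # incoming evidence
    s_t = f["correct"](prior, o_t)                 # correct to the present first
    frame = f["render"](s_t)                       # then render...
    context = f["schedule"](s_t)                   # ...pick what runs next...
    trace = f["execute"](s_t, context)             # ...and propose a trace
    admitted = [u for u in trace if f["guard"](s_t, u)]
    outcome = f["ground"](admitted)                # host adapter answers
    memory = f["update"](memory, s_t, outcome)     # fold the result back in
    return s_t, memory, {"action": (context, admitted), "outcome": outcome}, frame

# Toy stand-ins: state is a number, observing returns the "true" state.
f = {
    "predict": lambda s, a, r, m: s + 1,           # prior: guess forward
    "correct": lambda prior, o: o,                 # posterior: trust the evidence
    "render": lambda s: f"frame:{s}",
    "schedule": lambda s: "ctx:fg",
    "execute": lambda s, c: [("write", s)],
    "guard": lambda s, u: True,
    "ground": lambda trace: len(trace),
    "update": lambda m, s, r: m + [r],
}
s, m, prev, frame = macro_step(0, [], {"action": None, "outcome": None},
                               lambda: 5, f)
```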
Training
The training path is surprisingly concrete. Record a real machine being used: screen or framebuffer, aligned syscall trace, results, and asynchronous events. Turn that into sequential supervision for the loop above.
The hard part is not naming the pieces. The hard part is aligning them: what was on screen, what action happened, what came back, and what changed next. That gets messy fast when important state lives off-screen, arrives asynchronously, or depends on remote systems. The dataset barely exists. Someone has to build the collector before they can train the model.
The training object is a sequential dataset of transitions:

$$\mathcal{D} = \big\{(o_t, a_t, r_t, o_{t+1})\big\}_{t=1}^{T}$$
Imagine someone double clicks `report.pdf`. One step shows the file manager with the cursor over the file. The action log says `open("report.pdf")`. The result says the OS returned a file descriptor and some bytes. The next step shows the PDF viewer open. That is all the dataset line means: one moment, what happened, then the next moment.
Cutting each step at the right boundary matters: record what was on screen, what action happened, what came back, what the screen looked like next. If you mix those up, the model cheats by seeing effects before causes.
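One such transition can be written as a record with exactly those four boundaries. The field names and placeholder strings are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    obs: str        # what was on screen (placeholder for a real frame)
    action: tuple   # what action happened
    outcome: tuple  # what came back
    next_obs: str   # what the screen looked like next

step = Transition(
    obs="file_manager: cursor over report.pdf",
    action=("open", "report.pdf"),
    outcome=("fd", 3),
    next_obs="pdf_viewer: report.pdf open",
)
```

Keeping `next_obs` strictly after `outcome` in capture time is the boundary discipline the paragraph above describes: the model must never see effects before causes.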
Passive capture is not sufficient by itself. Human traces are narrow, biased, and weak on failure recovery and edge cases. The model also needs instrumented sandboxes, synthetic task generation, branching rollouts, and online correction. The data engine is not just a recorder. It is a recorder, a sandbox, and a task generator.
At a high level, the training problem is to expose the runtime to as many software environments, workflows, outcomes, and failure modes as possible, with aligned traces that let it learn state, behavior, and correction over time.
What Is Learned, What Is Not
Not everything in this system is a neural network. The split matters because it determines where latency lives and where safety can be hard.
| Learned | Deterministic / Engineered |
|---|---|
| Latent state estimator | Syscall mediation and dispatch |
| Dynamics model | Host execution and kernel interface |
| Scheduler policy | Security policy enforcement |
| Executor policy | Audit logging |
| UI generation / surface derivation | Fast rendering backends (Canvas, WebGL, compositors) |
| Interface outcome predictor (for dreaming) | Macro cache / compiled trace library |
| | Replay and rollback tooling |
The learned components propose. The deterministic components enforce, execute, and audit. That boundary is what makes the system both flexible and controllable.
What This Is
This is a theoretical framework, not a claim that the final algorithms exist. Publicly I call the direction World OS. Technically the object here is a supervisory runtime, not a latent replacement for the whole host operating system. It matters because generative software would need a runtime layer like this. The symbols are not hiding solved tricks. They are naming the hard parts. Someone still has to build this and find out which algorithms actually make the loop hold together.
Hurdles
These are not reasons not to build it. They are the hills you hit the moment you try.
Build the Data Engine
The first hurdle is data. This corpus does not already exist. You cannot scrape it from the public internet. You have to record synchronized screen frames, syscall traces, returns, and asynchronous events across long, messy computing sessions. The collector is not a detail. It is part of the system.
The privacy surface is also unusually large. Full screen capture and long session recordings constitute some of the most invasive telemetry a system could collect. Privacy-preserving collection is itself a first-class systems problem: synthetic environments, sandboxes, local on-device collection, redaction pipelines, and strict separation of training data from raw personal data.
Safe Execution Is Not Optional
The second hurdle is safety. The moment a probabilistic system can touch a real machine, permissions and guardrails stop being optional. A learned policy cannot be the only line of defense.
Safety here has to be layered. The host OS or VM still provides the hard boundary. The guard mediates what the model is allowed to ask for. Higher-risk actions like deletion, execution, exfiltration, privilege changes, or irreversible mutation need tighter policy and, in some cases, explicit approval. And if the system cannot evaluate a proposed action confidently, it should refuse it rather than guess.
The hardest failures are not illegal low-level actions. They are legal actions pointed at the wrong thing. A system can delete the wrong file tree using entirely valid operations. That is why syscall mediation alone is not enough. Safety also has to reason about resource identity, provenance, sensitivity, and when a human should stay in the loop. Even then, the problem is not solved. It is just pushed into a shape that can actually be engineered.
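A layered mediation check along these lines might look as follows. The risk classes, capability sets, and confidence threshold are illustrative assumptions:

```python
# Hypothetical risk class: operations that delete, execute, change
# privileges, or mutate irreversibly get tighter policy.
HIGH_RISK = {"unlink", "execve", "chmod", "connect"}

def mediate(capabilities, context, op, confidence, approved=False):
    """Decide what a proposed operation may do before it crosses the boundary."""
    name = op[0]
    if name not in capabilities.get(context, set()):
        return "deny"            # outside the context's authority
    if confidence < 0.9:
        return "refuse"          # cannot evaluate confidently: refuse, don't guess
    if name in HIGH_RISK and not approved:
        return "needs_approval"  # legal but risky: keep a human in the loop
    return "allow"

caps = {"ctx:editor": {"open", "read", "write", "unlink"}}
```

Note that `unlink` on the wrong file tree is a legal operation; that is why the capability check alone is not the whole story and the approval path exists.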
Latency Has To Be Engineered Down
The third hurdle is latency. If every step has to wait on full neural inference, the system will feel slow and the whole idea falls apart.
So the runtime has to be tiered. Use the model where judgment is needed. Cache, compile, or handle the common paths deterministically where it is not. The open question is whether the loop can be made fast enough to feel real.
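The tiering can be sketched as a cache in front of the slow path: common situations are served deterministically, and novel traces are compiled in after their first inference. Names and shapes are illustrative:

```python
def make_tiered_executor(cache, slow_model):
    """Fast deterministic path for known situations; model only for novelty."""
    stats = {"model_calls": 0}

    def execute(situation):
        if situation in cache:
            return cache[situation]       # fast path: compiled trace
        stats["model_calls"] += 1
        trace = slow_model(situation)     # slow path: learned inference
        cache[situation] = trace          # compile it for next time
        return trace

    return execute, stats

execute, stats = make_tiered_executor({}, lambda s: [("open", s), ("read",)])
execute("report.pdf")
execute("report.pdf")   # second request is served from the cache
```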
State Inference Can Drift
The model's internal graph is a learned guess about the real machine. If a background process silently corrupts a file, or a socket drops packets without throwing an immediate error, the model's map can diverge from what the kernel actually knows. The correction loop helps. The posterior corrects the prior every step using new observations. But the correction is only as good as what makes it into the observation bundle. This is not unique to this architecture. Every system that maintains a view of state from partial observations has the same problem. The answer is not omniscience. It is designing the observation bundle to include enough signal that the correction loop stays tight.
Guardrails for Dynamic Workflows
If the model generates entirely new workflows on the fly, how do you write static security rules for it? Too strict and the system cannot do anything useful. Too loose and it might destroy something critical while trying to help.
This sounds like a paradox, but it is the same problem every system that executes dynamic behavior already faces. Postgres accepts arbitrary SQL from users and still enforces permissions, row-level security, and constraints. It does not need to predict every possible query. It enforces invariants on the operations themselves. The guard here works at the syscall boundary, which is a much narrower surface than "all possible software." The kernel already knows how to enforce permissions, memory protection, and process isolation. The guard's job is to constrain what the model is allowed to ask for, not to re-implement the protections the kernel already provides.
I also expect there are more hurdles hiding behind these. That is normal. You only find some of the real ones once you start climbing.
Now What?
The right place to start is the data engine. A runtime like this needs aligned traces, instrumented environments, and long-horizon recordings that do not yet exist at the right scale.
That is why the first thing we will put into the world is OS0: a primitive version of this built to attack the data bottleneck. Its job is to generate and instrument environments, collect aligned traces, and reveal what data and control signals the stronger runtime will actually need.
If that works, it gives us the training substrate for what comes next.
If this work resonates with you, reach out at team@layerbrain.com.