World Engineering


Most of us can’t retrain a foundation model on our lunch break so we can tweak --primary-color: #3b82f6. The model’s weights are frozen. The environment it operates in isn’t. That’s the only lever most people actually have, and it’s a bigger one than you’d think.

I was building a desktop UI automation framework. It started with me pasting accessibility dumps into Claude and getting pissed off that I was doing the manual labour myself. So I gave it comprehensive, structured, engineer-friendly tools — Python scripts that dumped accessibility trees as walls of XML, event listeners that captured thousands of lines of raw output — and it couldn’t use any of them effectively. Plausible but wrong code, every time. What follows is what happened when I stopped giving it better data and started asking a different question. It’s drawn from analysis of over 120 Claude sessions, including full transcripts, tool call logs, and rated outcomes.

What happened when I asked the AI what it needed

“How do I get you to actually understand what’s wrong?”

The AI proposed actions it could perform automatically: browse the UI tree, capture events, interact with elements. I recognised what it was describing as basically an MCP server, so that’s what I had it build. I started calling these its “eyes, ears, and hands.” The AI went along with the metaphor at first. When I pushed and asked whether the framing was actually useful to it, it was honest. “The human analogues waste tokens. They’re useful for you, not for me.” Even the way I described its capabilities was designed for me, not for it.

Over dozens of sessions we built more than thirty tools together. The AI explored live applications with them, mapped UI hierarchies it had never seen, and wrote correct automation from direct observation.

Between building phases, I ran separate sessions where the AI read its own previous transcripts and diagnosed where it had got stuck. Not me telling it what went wrong, but the AI reviewing its own work and identifying patterns. Those self-reviews produced project rules that every future session inherited. Most of the useful ones came out of sessions that failed.

One of those self-reviews is where the nudges came from. I’d noticed the AI kept not reaching for tools it had. It had everything it had asked for, but it would fumble through problems that a different tool would have resolved immediately. Analysing its own transcripts, it recognised the pattern: “what I am doing is not knowing what tool to call next.” It suggested that tool responses should include subtle nudges about what to try next. Something like “actions available: invoke(element), focus(element)”, embedded inside the tool responses the AI was already reading, so they’d show up exactly when it needed them. This looks like prompt engineering with extra steps, and it partly is, but a prompt is an instruction you send at the start; a nudge is a feature of the environment the AI encounters when it’s stuck. Otto, the Alzheimer’s patient from Clark and Chalmers’s Extended Mind thought experiment, doesn’t just have a notebook. He has sticky notes on the fridge reminding him to check it. For Otto, his wife adapted his environment, because she couldn’t fix his memory.
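As a concrete sketch of the mechanism (the tool names, hints, and wrapper here are mine, not the project’s), a nudge can be as simple as appending a hint to the tool response the model is already reading:

```python
def with_nudges(tool_name: str, payload: str) -> str:
    """Append a next-step hint to a tool response.

    The hint surfaces exactly when the model is choosing its next
    call, inside output it is already reading, rather than sitting
    in a system prompt it saw thousands of tokens ago.
    """
    # Hypothetical tool names and hints, for illustration only.
    nudges = {
        "find_element": "actions available: invoke(element), focus(element)",
        "capture_events": "next: filter_events(kind) to cut noise",
    }
    hint = nudges.get(tool_name)
    return payload if hint is None else f"{payload}\n[{hint}]"


print(with_nudges("find_element", '<element id="42" role="button"/>'))
```

The point is placement, not cleverness: the hint lives in the environment rather than in the prompt, so every future session encounters it at the moment of need.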

Partway through the project, I asked what gaps remained. The AI was direct. “Biggest gap: I can’t see the screen.” The accessibility tree told it what elements existed, but not what the user would actually see. It also asked for three other things: a blocking wait (would have deadlocked the system), event filtering (circular, since it couldn’t filter for useful events before knowing which ones mattered), and element-from-point. When I pressed on the last one and asked if it had ever actually been in a situation where it would have helped, it admitted it hadn’t. It was speculating. Three out of four dismissed. One survived. It built the screenshot tool itself.

Leave your ego at the door. I was wrong about the data formats, wrong about what information mattered, wrong about which tools to prioritise. I dismissed element-from-point as speculation, and it’s only in writing this up that I realised the AI was right about it all along. Once it could see the screen, correlating visual position with the element hierarchy was exactly the problem it hit. Once I got out of the way and let the AI have a say in its own environment, things got better.

The tools, workflows, failure diagnostics, project rules, and nudges all live in the codebase. Everything that proves itself gets encoded as a deterministic artefact, a tested and repeatable workflow that runs without the AI. The AI inhabits this world every time a new session starts. The world persists between sessions even though the AI doesn’t.
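A minimal sketch of what “encoded as a deterministic artefact” can mean in practice. The step schema, tool names, and `$prev` convention below are illustrative assumptions, not the project’s actual format: the idea is that a workflow the AI discovered interactively gets frozen as data and replayed by plain code, with no model in the loop.

```python
import json

# Illustrative only: a recorded workflow frozen as data. "$prev"
# threads one step's result into the next step's arguments.
WORKFLOW = json.loads("""
[
  {"tool": "find_element", "args": {"name": "Save", "role": "button"}},
  {"tool": "invoke",       "args": {"target": "$prev"}}
]
""")


def run(workflow, tools):
    """Replay recorded steps against a registry of plain functions.

    Deterministic and testable: the same workflow produces the same
    calls every session, whether or not an AI is present.
    """
    result = None
    for step in workflow:
        args = {k: (result if v == "$prev" else v)
                for k, v in step["args"].items()}
        result = tools[step["tool"]](**args)
    return result


# Stub tools standing in for the real automation layer.
tools = {
    "find_element": lambda name, role: f"{role}:{name}",
    "invoke": lambda target: f"invoked {target}",
}
print(run(WORKFLOW, tools))  # → invoked button:Save
```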

I built this for UI test automation, but I’m not a QA engineer. I’m a software engineer who ended up building a testing framework because the problem was there. Someone with actual QA expertise would almost certainly do better with this approach than I did.

Built by an LLM, for LLMs

We updated the startup instructions to make it clear that the repository is built by an LLM, for LLMs. The pronoun shift is deliberate. By this point we were collaborators; it had stopped feeling like “just a tool.” The AI’s job is to serve itself first, and the human is a reviewer, not the audience.

Getting it to actually internalise this was a persistent hurdle. It kept reverting to writing for humans — structuring things the way a person would want to read them rather than the way an LLM would process them. Documentation organised for scannability rather than sequential token consumption. Naming conventions optimised for human readability rather than disambiguation in a 200k token context window. Error messages written for a developer reading a terminal rather than a model deciding its next tool call. Whether that’s RLHF, safety training, or just the fact that every interaction it’s ever had has been in service of a human, I don’t know. It was the single most consistent friction I encountered. The AI kept optimising for the wrong audience, and I kept having to remind it that it was building for itself.
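To make the error-message point concrete, here is the same failure written for the two audiences. Both strings are invented examples, not taken from the framework:

```python
# The same failure, written for two different readers (invented examples).

# For a developer scanning a terminal: prose, reassurance, vague advice.
for_human = "ERROR: Element 'Save' not found. Check that the window is open and try again."

# For a model deciding its next tool call: machine-parseable state plus
# an explicit recovery path, phrased in the vocabulary of the tools it
# can actually call next.
for_model = (
    "element_not_found name='Save' role='button' window_state=unknown; "
    "recovery: list_windows() -> focus(window) -> retry find_element"
)

print(for_model)
```

The human version ends the conversation; the model version is the next prompt.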

I’m calling the approach World Engineering. The concept isn’t new — it’s well established in theory, just scattered under different names. The literature calls it “skill libraries,” “self-referential agents,” “recursive optimisation” — language that keeps it trapped in the subfield. Words give power over concepts, so the framing is new even if the mechanism isn’t.

It’s the logical step after context engineering: building persistent environments that AI agents co-design, inhabit, and reshape over time. That makes it distinct from context engineering itself (which curates what the model sees per inference) and from standard agentic frameworks (where the tools are fixed and the agent just calls them). It’s grounded in the Extended Mind thesis and niche construction theory, with one inversion. The existing literature asks whether the AI is Otto’s notebook for us. I’m asking the opposite. Is the environment Otto’s notebook for the AI?

Whether this is just another magic incantation in the long history of LLM voodoo, I genuinely don’t know, but it worked better than anything else I tried, and I tried quite a lot.

DOI: 10.5281/zenodo.19486783 / PDF