On working with agentic models, part II: More lessons from AI-assisted software development

Tags: Coding, GenAI
Published: January 5, 2026
Description: Practical lessons on workflows, tools, and judgment for working with agentic LLM models

Context

Part II of the series On working with agentic models, this post collects practical and philosophical lessons picked up through hands-on experimentation and learning from others.

Praxis

These are patterns that have worked for me in practice with agentic models.

"Don't compact, don't argue. Just start over."

Oxide's internal tips on LLM use contain a trove of practical and professional guidance. One of the most immediately impactful is this:

As the chat goes on, each message gets more expensive while Claude gets dumber. That's a bad trade...

Instead of compacting (automatically or manually) when the context window has grown large, instruct the model to update the task spec with its progress and remaining tasks, then start a new session based on that spec. This both reduces token usage and improves model performance.

💡
  • Don’t accumulate context until it auto-compacts – intervene early.
  • Don’t just run /compact intermittently. While that's good, starting fresh sessions is even better.
  • Do write progress summaries to disk, then use /clear or start new sessions frequently.
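
For concreteness, here's a minimal sketch of what that hand-off might look like - the file name and contents are illustrative, not a prescribed format:

```bash
# Before clearing, ask the model something like:
#   "Update TASK_SPEC.md with what's done and what remains, then stop."
# A hypothetical result:
cat > TASK_SPEC.md <<'EOF'
## Goal
Migrate the settings screen to the new design system.

## Done
- Replaced legacy color tokens with semantic equivalents.

## Remaining
- Update snapshot tests.
- Audit dark-mode rendering.
EOF
```

The next session then opens with "Read TASK_SPEC.md and continue," paying only for the distilled state rather than the full transcript.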

Have the model output programs, not just results

I tend to ask for programs in Bash because of its portability. Models seem to default to Python if you don't specify a language. With Bash, powerful programs like grep, awk, and sed can be combined without introducing new dependencies.

You can instruct the model to run the program and analyze the results - or do so yourself. Either way, adding persistence reduces token usage and dependence on the model compared to having it generate transient programs each session.
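
For instance, here's the kind of small, reusable program I mean - a sketch, with illustrative file names and log format:

```bash
#!/usr/bin/env bash
# top-errors.sh -- summarize the most frequent error lines in a log.
# Usage: ./top-errors.sh app.log [count]
set -euo pipefail

log_file="$1"
count="${2:-10}"

# Strip leading timestamps so identical errors group together,
# then rank by frequency.
grep -i 'error' "$log_file" |
  sed -E 's/^[0-9][0-9:T.-]* //' |
  sort | uniq -c | sort -rn |
  head -n "$count"
```

Saved next to the data it analyzes, it can be rerun on tomorrow's logs without spending any tokens regenerating it.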

Coding from anywhere V3: (also) use managed services

In On agentic coding from anywhere, I describe a process for using Claude Code CLI or Codex CLI via remote access to your own hardware. That said, off-the-shelf hosted solutions from Anthropic and OpenAI can go surprisingly far. What you lose in control, you gain in ease - which may be the right trade-off for your situation.

On existing codebases, with offloaded verification

Use Claude Code for Web or Codex Cloud on existing codebases when local compilation isn’t required – the inability to compile locally is the main downside compared to the CLI approach. While you can’t run builds or tests directly, you can offload verification to your CI/CD setup. Commit and push to a feature branch and have something like GitHub Actions or Xcode Cloud do the compilation and verification steps. When they report failures back to you, feed those into the context window. It’s a longer and more indirect feedback loop, but it works.
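
Assuming GitHub Actions and the GitHub CLI, the loop can look roughly like this (a sketch; the branch name is a placeholder):

```bash
# Push the agent's work to a feature branch so CI runs the builds and
# tests that can't run in the hosted sandbox.
git push -u origin feature/settings-redesign

# Then pull the results back as text to feed into the context window.
gh pr checks --watch        # wait for the PR's checks to finish
gh run view --log-failed    # select the failed run; dump its failing-step logs
```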

On existing codebases, with inline verification

Use a remote, CLI-based approach for heavier work, such as when a specific environment is required (e.g. macOS for iOS development) or when local builds and tests are important. This requires the most effort to set up but provides maximum control.
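
One rough shape this can take (host and session names are placeholders; see the earlier post for the full setup):

```bash
# SSH into the machine that has the toolchain (e.g. a Mac with Xcode),
# landing in a persistent tmux session so the agent survives dropped
# connections; -A attaches to the session if it already exists.
ssh -t builder.example.com 'tmux new-session -A -s agent'

# Inside that session, run the agent where builds and tests actually work:
claude   # or: codex
```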

For new (web) experiments

Use Claude Artifacts to build standalone web prototypes, and Claude Projects to give Artifacts persistent custom instructions. System prompts like Simon Willison’s “never use React” prompt make the output deployable as a static site without needing a build system.

While I haven't played with Codex Cloud for this yet, it may offer similar capabilities. ChatGPT Canvas is not quite analogous to Artifacts in that it won't build in-chat single-page apps – you'll need to deploy the generated code yourself, similar to the process described below for non-web apps.

For new non-web experiments

For non-web or more complex experiments (i.e. those with external dependencies), set up a single “tools” repo that models can push to, avoiding per-project setup and repeated OAuth grants.

This idea is borrowed from Simon Willison’s tools.simonwillison.net setup. He takes advantage of GitHub Pages to instantly publish all of his web experiments, making them immediately usable - even from a phone.
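
A one-time setup along those lines might look like this (the repo name is illustrative, and GitHub Pages still needs to be enabled once under the repo's Settings → Pages):

```bash
# Create a public "tools" repo and seed it with a first page. Once Pages
# serves the main branch root, every HTML file a model pushes is live at
# https://<user>.github.io/tools/<name>.html
gh repo create tools --public --clone
cd tools
echo '<h1>tools</h1>' > index.html
git add index.html
git commit -m "Initial page"
git push -u origin main
```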

Principles

The context you’re operating in matters because it determines things like the acceptable level of noise (e.g. offloaded versus inline verification), rigor, data handling, and so on.

Context determines both to whom you are accountable and how.

For instance, until a hobbyist releases something to the world with the intent for others to use, they are accountable primarily to themselves. By contrast, a professional is accountable to their team, their organization, and the organization’s users.

While I’m bullish on mobile agentic coding for hobby projects and learning, I’m bearish on it for professional work. In the professional context, I’m more inclined to use workflows with AI as a carefully reviewed assistant than an autonomous agent.

In hobbyist contexts: Be disciplined about learning

Prioritize learning rather than focusing solely on output speed, especially in less familiar domains. Mastery improves judgment and makes AI usage more effective (e.g. spotting those “you're absolutely right” moments), so invest in cultivating it. The best ways I've found to do that are:

  1. Actually review the code output. When in non-professional contexts, e.g. hobby projects, this stipulation may be relaxed. But in professional contexts, where you have a commitment to others, it becomes imperative, as discussed below.
  2. Review the chain-of-thought reasoning output.
  3. When you encounter unfamiliar concepts, either ask for an explanation or jot them down for follow-up research. For example, I recently encountered Swift's nonisolated keyword for the first time; it had been introduced since I last used the language. I found the naming counterintuitive until this SO comment cleared it up: read it as saying that because the encapsulated code is isolated against mutations, it can safely be called from non-isolated contexts - not that the encapsulated code is itself non-isolated.

In professional contexts: Don’t submit LLM-generated content for review that you haven’t reviewed yourself.

Borrowing the framing from Using LLMs at Oxide:

LLM-generated code should not be reviewed by others if the responsible engineer has not themselves reviewed it.

I’d extend this to all artifacts, including documents. I’ve scarcely seen anything erode credibility faster than reviewers suspecting that they have reviewed something more thoroughly than the author.

The guide continues,

Moreover, once in the loop of peer review, generation should more or less be removed: if code review comments are addressed by wholesale re-generation, iterative review becomes impossible.

This is insightful because it also applies to documents, not just code, and is a good practice outside of AI usage.

Make it easy for your reviewers to track whether and how you’ve responded to their feedback by either

  1. Keeping the set of changes to the in-review artifacts no larger than necessary, or
  2. Annotating where and how feedback has been incorporated when you do make sweeping changes.

Conclusion

Novel invention is the purview of science. Combining existing inventions into new applications is the purview of engineering. I am an engineer.

With the pace of change and the as-yet unstandardized nature of LLM/generative AI tools, it may be prudent to hold off on trying to keep up with the trends until the winning patterns emerge. By then, there may well be better tools and learning materials in addition to less volatility.

Be that as it may, there's also value in seeing those patterns emerge from the ground up and developing a first-hand sense of what works and when.

So while the specifics of this post may quickly become dated, the process of developing them - (re)learning how I learn even with growing responsibilities - has been energizing in its own right. The takeaways that I think will outlive the rest of the post are:

  • The only constant is change, so learning how to learn is timeless.
  • Regardless of how code or a document is produced, when you submit it, it becomes your responsibility. Trust is hard to earn yet easy to erode - don’t squander it by pushing inadequately reviewed generated code.