1. I built this because I like my agents to be local. Not in a container, not on a remote server, but running on my finely-tuned machine. This lets me run all agents on full-auto, in peace.
2. Yes, it's just a policy generator for sandbox-exec. IMO, that's the best part about the project - no dependencies, no fancy tech, no virtualization. But I did put in many hours identifying the minimum permissions each agent needs to keep working with auto-updates, keychain integration, image pasting, etc. There are notes on my investigations into what each agent needs at https://agent-safehouse.dev/docs/agent-investigations/ (AI-generated)
3. You don't even need the rest of the project; you can use just the Policy Builder to generate a single sandbox-exec policy to put into your dotfiles: https://agent-safehouse.dev/policy-builder.html
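For anyone who hasn't seen SBPL, the generated policies are just small Scheme-ish text files. A minimal hand-rolled sketch of the shape (illustrative only, not what the Policy Builder actually emits; real agents need extra allowances for mach lookups, sysctl reads, etc.):

(version 1)
(deny default)
; read anywhere, fork/exec, talk to the network...
(allow file-read*)
(allow process-fork)
(allow process-exec*)
(allow network-outbound)
; ...but write only inside the project
(allow file-write* (subpath "/Users/me/projects/myapp"))

$ sandbox-exec -f agent.sb claude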
Problem 1: the agent does something destructive by accident — rm -rf, hard git revert, writes to the wrong config. Filesystem sandboxing solves this well.
Problem 2: the agent does something destructive because it was prompt-injected via a file it read. Sandboxing doesn't help here — the agent already has your credentials in memory before it reads the malicious file.
The only real answer to problem 2 is either to never give the agent credentials that can do real damage, or to have a separate process audit tool calls before they execute. Neither is fully solved yet.
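Claude Code at least exposes a primitive for the second option: a PreToolUse hook (wired up under "hooks" in settings.json) runs before every tool call, and per its docs an exit code of 2 blocks the call and feeds stderr back to the model. A deliberately crude sketch; the pattern list is the hard, unsolved part:

#!/bin/sh
# PreToolUse hook: the pending tool call arrives as JSON on stdin.
# Exit 2 = block the call and show stderr to the model.
payload=$(cat)
case "$payload" in
  *"rm -rf"*|*"rm -fr"*|*"git reset --hard"*)
    echo "blocked by audit hook: destructive command pattern" >&2
    exit 2
    ;;
esac
exit 0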
Agent Safehouse is a clean solution to problem 1. That's genuinely useful and worth having even if problem 2 remains open.
I honestly think that sandboxing is currently THE major challenge that needs to be solved for the tech to fully realise its potential. Yes, the early adopters will YOLO it and run agents natively, but that won't fly longer term in regulated or more conservative corporate environments, let alone production systems where critical operations or data are in play.
The challenge is that we need a much more sophisticated version of sandboxing than anybody has built before. We can start with network, filesystem, and execute permissions, but we need way more than that. For example, if you really need an agent to use a browser to test your application in a live environment, capture screenshots, and debug them, you have to give it all kinds of permissions that go beyond what a traditional sandboxing model can constrain. If it has to interact with resources that cost money (say, creating cloud resources), then you need an agent-aware cloud cost/billing constraint.
Somehow all this needs to be pulled together into an actual cohesive approach that people can work with in a practical way.
This looks like a competent wrapper around sandbox-exec. I've seen a whole lot of similar wrappers emerging over the past few months.
What I really need is help figuring out which ones are trustworthy.
I think this needs to take the form of documentation combined with clearly explained and readable automated tests.
Most sandboxes - including sandbox-exec itself - are massively under-documented.
Before I am going to trust them, I need both detailed documentation and proof that they work as advertised.
I like that it's just a shell script.
I do wish that there was a simple way to sandbox programs with an overlay or copy-on-write semantics (or better yet bind mounts). I don't care if, in the process of doing some work, an LLM agent modifies .bashrc -- I only care if it modifies _my_ .bashrc
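You can fake the copy-on-write part for dotfiles today by giving the sandboxed process a throwaway HOME seeded with copies (agent.sb standing in for whatever profile you use):

$ mkdir -p /tmp/agent-home
$ cp ~/.bashrc ~/.gitconfig /tmp/agent-home/
$ HOME=/tmp/agent-home sandbox-exec -f agent.sb claude

It's a copy rather than an overlay, so deliberate changes don't flow back, but _your_ .bashrc stays untouched.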
Basically, give an agent its own unprivileged user account (interacting with it via sudo, SSH, and shared directories), then add sandbox-exec on top for finer-grained control of access to system resources.
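On macOS that's only a few commands (flags from memory, so double-check the sysadminctl man page; "-password -" should prompt instead of leaking the password into shell history):

$ sudo sysadminctl -addUser agent -fullName "Agent" -password -
$ sudo mkdir -p /Users/Shared/agent-work
$ sudo chown agent /Users/Shared/agent-work
$ sudo -u agent -i    # drop into the agent's shell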
$ container system start
$ container run -d --name myubuntu ubuntu:latest sleep infinity
$ container exec myubuntu bash -c "apt-get update -qq && apt-get install -y openssh-server"
$ container exec myubuntu bash -c "
apt-get install -y curl &&
curl -fsSL https://deb.nodesource.com/setup_lts.x |
bash - &&
apt-get install -y nodejs
"
$ container exec myubuntu npm install -g @anthropic-ai/claude-code
$ container exec myubuntu claude --version

One thing I'd add: sandboxing the execution environment only solves half the problem. The other half is the prompt itself: if the agent's instructions are ambiguous or poorly scoped, sandboxing just contains the damage from a confused agent rather than preventing it.
I built flompt (https://flompt.dev) to address the instruction side — a visual prompt builder that decomposes agent prompts into 12 semantic blocks (role, constraints, objective, output format, etc.) and compiles them to Claude-optimized XML. Tight instructions + sandboxed execution = actually safe agents.
Its manpage has been saying it's deprecated for a decade now, yet we're continuing to find great uses for it. And the 'App Sandbox' replacement doesn't work at all for use cases like this where end users define their own sandbox rules. Hope Apple sees this usage and stops any plans to actually deprecate sandbox-exec. I recall a bunch of macOS internal services also rely on it.
Around last summer (July–August 2025), I desperately needed a sandbox like this. I had multiple disasters with Claude Code and other early AI models. The worst was when Claude Code did a hard git revert to restore a single file, which wiped out ~1000 lines of development work across multiple files.
But now, as of March 2026, at least in my experience, agents have become more reliable. With proper guardrails in claude.md and built-in safety measures, I haven't had a major incident in about 3 months.
That said, layering multiple safeguards is always recommended—your software assets are your assets. I'd still recommend using something like this. But things are changing, bit by bit.
One thing we've been seeing with production AI agents is that the real risk isn't just filesystem access, but the chain of actions agents can take once they have tool access.
Even a simple log-reading capability can escalate if the agent starts triggering automated workflows or calling internal APIs.
We've been experimenting with incident-aware agents that detect abnormal behavior and automatically generate incident reports with suggested fixes.
Curious if you're thinking about integrating behavioral monitoring or anomaly detection on top of the sandbox layer.
Having real macOS Docker would solve the problem this project solves, and 1001 other problems.
For those using pi, I've built something similar[1] that works on macOS+Linux, using sandbox-exec/bubblewrap. The only benefit over OP is that there's some UX for temporarily/permanently bypassing blocks.
[1] https://github.com/hsaliak/sacre_bleu (very rough around the edges, but it works). In the past, there were apps that either behaved well or had malicious intent; with these LLM-backed apps, you are going to see apps that want to behave well but cannot guarantee it. We are going to see a lot of experimentation in this space until the UX settles!
Since AI agents work continuously, running macOS in a VM (via the Virtualization framework directly) seems like the most secure solution, and it requires a lot less verification than any sandboxing script. (Critical feature: no access to my keychain.)
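If you go that route, you don't have to drive the Virtualization framework yourself; wrappers like tart exist (the image name below is from memory, treat it as an assumption):

$ brew install cirruslabs/cli/tart
$ tart clone ghcr.io/cirruslabs/macos-sequoia-base:latest agent-vm
$ tart run agent-vm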
AI agents are not at all like container deploys which come and go with sub-second speed, and need to be small enough that you can run many at a time. (If you're running local inference, that's the primary resource hog.)
I'm not too worried about multiple agents in the same vm stepping on each other. I give them different work-trees or directory trees; if they step over 1% of the time, it's not a risk to the bare-metal system.
Not sure if I'm missing something...
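For reference, the per-agent worktree setup is one command apiece:

$ git worktree add ../agent-1 -b agent-1
$ git worktree add ../agent-2 -b agent-2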
I use clippy with rust and the only thing I had to add was:
(subpath "/Library/Developer/CommandLineTools")

However, the challenge is that sandbox profiles (rules) are always workload-specific. How do you define “least privilege” for a workload and then enforce it through the sandbox?
Which is why generic sandboxes won't be useful or even feasible. The value is in observing a workload and auto-generating a baseline policy for it.
Wrong or overly relaxed policies make the sandbox ineffective against the real threats it is expected to protect against.
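FWIW, you can run the observe-and-generate loop by hand on macOS today: execute the workload under a tight profile and harvest the denials from the unified log (predicate from memory, adjust as needed):

$ sandbox-exec -f strict.sb ./run-workload.sh
$ log show --last 5m --predicate 'sender == "Sandbox"' | grep 'deny('
# each "deny(1) file-read-data /some/path" line becomes an
# (allow file-read* (subpath "/some/path")) rule; repeat until quiet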
Then I realized the only thing I care about on my local machine is "don't touch my files", and Unix users solved that in 1970. So I just run agents as "agent" user.
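Worth checking first that your home directory isn't group/world-readable (defaults vary by OS):

$ chmod 700 "$HOME"
$ sudo -u agent ls "$HOME"    # should fail with permission denied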
I think running it on a separate machine is nicer though, because it's even simpler and safer than that. (My solution still requires careful setup and regular overhead when you get permission issues. "It's on another laptop, and my stuff isn't" has neither of those problems.)
p.s. thanks for making this; timely as I am playing whackamole with sandboxing right now.
been watching microsandbox but it's pretty early. landlock is the linux kernel primitive that could theoretically enable something like this, but nobody's built the nice policy layer on top yet.
curious if anyone has a good solution for the "agent running on a remote linux server" case. the threat model is a bit different anyway (no iMessage/keychain to protect) but filesystem and network containment still matter a lot.
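for the linux case, bubblewrap alone gets you decent filesystem containment without a policy language; a rough sketch:

$ bwrap --ro-bind / / \
        --tmpfs "$HOME" \
        --bind "$HOME/project" "$HOME/project" \
        --dev /dev --proc /proc --tmpfs /tmp \
        --unshare-net \
        bash

(--unshare-net also cuts the agent off from its own model API, so in practice you'd bind in a proxy socket or drop that flag and filter at the network layer instead.)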
That's a 200 OK the whole way down. "Prevent bad actions" and "detect wrong-but-permitted actions" are completely different problems.
I built yolobox to solve this using docker/apple containers: https://github.com/finbarr/yolobox
But given how fast agents are moving, I would be shocked if such tools were not already being built.
I'm involved with a project building something very similar, which we literally open sourced an alpha version of last week:
https://github.com/GreyhavenHQ/greywall
It's a bit different in that:
- We started with Linux
- It is a binary that wraps the agent runtime
- It runs alongside a proxy which captures all traffic to provide a visibility layer (a DIY approximation is sketched below)
- Rules can be changed dynamically at runtime
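You can approximate that visibility layer by hand with any intercepting proxy, assuming the agent honors the usual proxy env vars and trusts the proxy's CA cert:

$ mitmproxy --listen-port 8080                 # terminal 1: watch the traffic
$ HTTPS_PROXY=http://127.0.0.1:8080 claude     # terminal 2: run the agent through it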
I am so happy this problem is getting the attention it deserves!
How does this compare with Codex's and Claude's built-in sandboxing?
Why always the fixation on the hardware?
All the issues we get from AI today (hallucinations, goal shift, context decay, etc.) get amplified unbelievably fast once you begin scaling agents out, because failures cascade. The risk being: you go to bed, and when you wake up your entire infrastructure is gone, lol.
Most setups handle this awkwardly: fire a webhook, write to a log, hope the human is watching. The sandbox keeps the agent contained, but doesn't give it a clean "pause and ask" primitive. The agent either guesses (risky) or silently fails (frustrating).
Seems like there are two layers: the security boundary (sandbox-exec, containers, etc.) and the communication boundary (how does a contained agent reach the human?). This project nails the first. The second is still awkward for most setups.
sandbox-profiles is a solid primitive for local agents. The missing piece in production is the tool layer — even a sandboxed agent can still make dangerous API calls if the MCP tools it has access to aren't individually authed and scoped.
The real stack is: sandbox the runtime (what Agent Safehouse does) + scope the tools (what we do with JIT OAuth at the MCP layer). Neither alone is enough.
Nice work shipping this.
https://www.arcade.dev/blog/ai-agent-auth-challenges-develop...