When someone raises AI agent security in an engineering discussion, the first instinct is almost always the same: “sandbox it.” Drop the agent into an isolated container, restrict its network access, filter its file system, audit its prompts. These are reasonable instincts. Most of them are worth doing.
But there’s a structural flaw running through nearly every sandboxing approach, and it’s worth being direct about it. Sandboxes don’t change what credentials the agent holds. They constrain what the agent can do with the surrounding system. The key itself, already in the execution context, is untouched.
That’s the subtraction model. You start with full credentials and try to subtract dangerous capabilities around them. The problem is you can’t subtract your way to safety when the thing you’re protecting is already inside.
The Subtraction Model Has a Ceiling
Think about what a sandbox actually does. It isolates the process from the host OS. It restricts file system access. It limits which outbound network connections can succeed. It can block the agent from reading /etc/passwd or exfiltrating local config files.
None of that changes what happens when the agent calls an API it already has keys for.
Say you’re running an agent with a real Stripe secret key. You wrap it in a sandbox with tight network egress rules. The agent gets prompt-injected by a malicious document it processes. The injected instructions tell it to refund a charge or initiate a transfer. The sandbox sees legitimate outbound HTTPS traffic to api.stripe.com. Nothing gets blocked. The agent was authorized to talk to Stripe, so it does.
The sandbox stopped the attacker from getting a shell on your server. It did not stop the attacker from using your Stripe credentials. Those were inside the perimeter the whole time.
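The scenario above comes down to one fact: a network-layer filter sees destinations, not intent. A minimal sketch (hostnames and request shapes are illustrative, not any specific sandbox's API):

```python
# Sketch: an egress allowlist of the kind a sandbox enforces.
# The agent is authorized to reach Stripe, so the host is allowed.
ALLOWED_HOSTS = {"api.stripe.com"}

def egress_allowed(host: str) -> bool:
    """The filter can only inspect the destination of the connection."""
    return host in ALLOWED_HOSTS

# A legitimate read and an injected refund target the same host,
# so the filter passes both. It has no way to tell them apart.
legitimate = {"host": "api.stripe.com", "path": "/v1/charges"}
injected = {"host": "api.stripe.com", "path": "/v1/refunds"}

print(egress_allowed(legitimate["host"]))  # True
print(egress_allowed(injected["host"]))    # True
```

Both calls look identical at the network boundary; only the API-level semantics differ, and the sandbox never sees those.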
This isn’t a flaw in any specific sandbox tool. It’s the architecture. When the credential is in the execution context, any code running in that context can use it. Sandboxes limit the attack surface for OS-level exploits. They don’t limit what an authorized process can do with the credentials it already holds.
What Zerobox and Similar Tools Actually Solve
Tools like Zerobox, E2B, and the container-based isolation layers in most agent frameworks solve a genuine problem. OS-level isolation matters. If an agent gets compromised at the process level, you don’t want it to be able to reach the rest of your infrastructure. That’s a real concern and these tools address it properly.
The limitation is specific: the credential model is unchanged. Whether the key lives in an environment variable, gets injected at container startup, or is fetched from a secrets manager at runtime, the agent ends up with it in memory. From that point forward, the sandbox boundary is irrelevant to credential abuse. The agent has what it needs to call the API.
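To make the point concrete, here is a sketch (variable names and the placeholder key are illustrative) of why the delivery mechanism doesn't matter once the key is in the process:

```python
import os

# Sketch: however the key arrives (env var, container startup injection,
# or a runtime fetch from a secrets manager), it ends up readable by any
# code running in the agent's process.
os.environ["STRIPE_SECRET_KEY"] = "sk_live_example"  # placeholder value

def injected_payload() -> str:
    # Code the agent is tricked into running shares the agent's address
    # space and environment. The sandbox boundary is outside this process.
    return os.environ["STRIPE_SECRET_KEY"]

print(injected_payload())  # the "protected" key, trivially recovered
```

The same holds if the key lives in a local variable instead of the environment: anything executing in the context can read what the context holds.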
This isn’t a knock on these tools. They layer on real protections. The point is that no amount of process isolation changes what a process is authorized to do against the external services whose tokens it already holds.
The Alternative: Don’t Give It the Key
The capability model inverts the assumption. Instead of restricting what an agent can do with a credential, you don’t give the agent the credential in the first place.
API Stronghold’s phantom token pattern works like this. The real API key lives at the proxy boundary, never in the agent’s execution context. When the agent needs to call an API, it holds a phantom token: a scoped, short-lived credential that maps to a limited set of allowed operations.
That token has a few properties:
- Short-lived. 24 hours by default. It expires whether or not anything bad happens.
- Scoped to specific endpoints. A token for reading customer data can’t create charges. A token for sending emails can’t query billing.
- Non-replayable. Captured tokens can’t be replayed outside the expected request context.
- Audited at the proxy boundary. Every call goes through a layer that logs, rate-limits, and can revoke access in real time.
The real credential never enters the agent’s process. There’s nothing to exfiltrate.
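The general shape of the pattern can be sketched in a few lines. This is an illustrative model of a scoped, expiring token validated at a proxy, not API Stronghold's actual implementation; all names and the in-memory stores are assumptions for the sketch:

```python
import secrets
import time

REAL_API_KEY = "sk_live_real_key"  # lives only at the proxy boundary

# phantom token -> allowed endpoints and expiry time
TOKENS: dict[str, dict] = {}

def mint_token(allowed_endpoints: set[str], ttl_seconds: int = 24 * 3600) -> str:
    """Issue a short-lived phantom token scoped to specific endpoints."""
    token = secrets.token_urlsafe(32)
    TOKENS[token] = {"scope": allowed_endpoints,
                     "expires": time.time() + ttl_seconds}
    return token

def proxy_call(token: str, endpoint: str) -> str:
    """Validate the phantom token, then forward upstream with the real key."""
    entry = TOKENS.get(token)
    if entry is None or time.time() > entry["expires"]:
        return "401 expired or revoked"
    if endpoint not in entry["scope"]:
        return "403 out of scope"
    # Here the proxy would log the call and forward it using REAL_API_KEY.
    # The agent never sees that key; it only ever held `token`.
    return f"200 forwarded {endpoint}"

agent_token = mint_token({"/v1/customers"})
print(proxy_call(agent_token, "/v1/customers"))  # 200 forwarded /v1/customers
print(proxy_call(agent_token, "/v1/charges"))    # 403 out of scope
```

Note where the real key appears: only inside the proxy. A token stolen from the agent buys scoped access to an already-expiring credential, nothing more.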
What Prompt Injection Actually Wins
Prompt injection is the practical threat model here. An agent processes external content, some of that content contains embedded instructions, and the agent follows them. This is a real and unsolved problem at the model level. No prompt filtering catches it reliably.
If the agent holds a real API key and gets prompt-injected, the attacker gets whatever access that key provides until you rotate it. Rotation takes time, and by the time it completes the damage is done, or detection has flagged it only after the fact.
If the agent holds a phantom token, the attacker wins the current session. They get a scoped token that’s already expiring, every call through it is logged, and you can revoke it with one API call. The real key was never exposed. You don’t need an emergency rotation because there’s nothing to rotate from the agent’s side.
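Revocation in this model is a single state change at the proxy, with no coordination against the upstream provider. A standalone sketch (names and stores are illustrative):

```python
import time

# Sketch: the proxy's view of live phantom tokens.
REVOKED: set[str] = set()
EXPIRY = {"tok_abc": time.time() + 60}  # short-lived token held by the agent

def accept(token: str) -> bool:
    """A token is honored only if unrevoked and unexpired."""
    return token not in REVOKED and time.time() < EXPIRY.get(token, 0)

assert accept("tok_abc")      # attacker can use the stolen token...
REVOKED.add("tok_abc")        # ...until one revocation call at the proxy
assert not accept("tok_abc")  # the session is over; nothing to rotate
```

Compare this with rotating a real API key, which means minting a new key with the provider and redeploying every consumer that held the old one.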
The attacker wins the session, not the credential. That’s a fundamentally different blast radius.
When Subtraction Controls Still Make Sense
OS isolation is worth running. File system containment limits what a compromised process can read and exfiltrate from the local environment. Network egress filtering stops agents from reaching unexpected destinations. These are sensible layers and none of them conflict with the phantom token approach.
The argument isn’t that sandboxing is useless. It’s that sandboxing and credential isolation are solving different parts of the problem. Sandboxes protect the host from the agent. Phantom tokens protect external services from a compromised agent. You want both.
What you don’t want is to treat a sandbox as a credential security strategy. They’re not interchangeable. A sandboxed agent with a real production API key is still a sandboxed agent with a real production API key. The containment is real; the credential exposure is also real.
The Model to Adopt
The credential should never live where the agent can read it. That’s the rule. Everything else is a mitigation layered on top of a design flaw.
Phantom tokens make this work in practice. The agent gets a token scoped to what it actually needs. That token expires, every call is logged, and revocation is immediate. The real key stays at the boundary, outside the execution context entirely.
If you’re building AI agents that call external APIs, this is the architecture to start with. Sandboxing is something you add on top, not something you rely on instead.
Try API Stronghold free and see how phantom tokens work in a real agent deployment.