Why Agent Guardrails Don't Work (And What Does)
Everyone building AI agents eventually hits the same wall: how do you let an agent act on real systems without it doing something catastrophic?
Sid
Founder, Vyuh
The instinctive answer is guardrails. Filters. Safety layers. Runtime checks.
This approach feels right. It's also fundamentally flawed.
The Plastic Problem
Start with a mental model.
Existing software is unshaped plastic.
Every enterprise has valuable software: APIs, databases, internal tools, legacy systems. This software is powerful. It runs the business.
But from an AI agent's perspective, it's:
- Sharp-edged: designed for humans who know the implicit rules
- Inconsistent: different conventions, different authors, different eras
- Undocumented: tribal knowledge embedded in code
- Unsafe: no concept of "an AI might try to use this"
Now consider what AI agents are good at: planning sequences of actions, reasoning about goals, composing steps into workflows.
Agents are builders. They're good at assembling things.
But builders can't work with unshaped plastic. Give a builder raw, irregular material and they'll cut themselves. Or cut you.
What builders need are Lego blocks: uniform, labeled, safe to handle, designed to fit together.
The gap between existing software (plastic) and what agents need (Lego) is the core problem.
The Guardrail Instinct
The obvious solution: let agents interact with raw systems, but add guardrails.
This is how most agent safety systems work. It's intuitive. It's also a trap.
Why Guardrails Fail
The Enumeration Problem
For a guardrail to work, it must answer: "Is this action dangerous?"
This requires enumerating dangerous actions. But the space of dangerous actions is infinite. You can't list them all.
- Block SQL injection patterns? Attackers find new patterns.
- Block file system access? What about symlinks?
- Block "delete" commands? What about "truncate"? "drop"? "remove"?
Every guardrail is a game of whack-a-mole.
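To make the enumeration problem concrete, here is a toy blacklist filter in Python. It is a deliberately minimal sketch, not any real safety layer: two regex patterns their author thought of, and one destructive statement they didn't.

```python
import re

# A toy blacklist: block statements matching known-dangerous keywords.
# (Illustrative only; real filters are more elaborate but share the flaw.)
BLOCKED_PATTERNS = [
    re.compile(r"\bDELETE\b", re.IGNORECASE),
    re.compile(r"\bDROP\b", re.IGNORECASE),
]

def is_blocked(statement: str) -> bool:
    return any(p.search(statement) for p in BLOCKED_PATTERNS)

# The patterns catch what their author anticipated...
assert is_blocked("DELETE FROM users")
assert is_blocked("drop table users")

# ...and miss everything else. TRUNCATE is just as destructive,
# but it was never enumerated, so it sails through.
assert not is_blocked("TRUNCATE TABLE users")
```

Extending the list to cover `TRUNCATE` just moves the hole somewhere else; that is the whack-a-mole.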
The Context Problem
The same action can be safe or dangerous depending on context.
- `DELETE FROM users WHERE id = 5`: dangerous if unauthorized, routine for a valid admin
- `GET /api/internal/metrics`: fine for monitoring, catastrophic if exposed
- `send_email(to=user, body=data)`: normal unless the data contains secrets
Guardrails evaluate actions in isolation. Danger often lives in context.
The Adversarial Problem
Guardrails assume good-faith agents making occasional mistakes.
But agents are susceptible to prompt injection. A malicious input can make an agent want to do something harmful. Now your guardrail is facing an adversary, not a bumbling assistant.
Adversarial inputs are designed to bypass filters. This is a losing game.
The Confidence Problem
Guardrails must make a binary decision: block or allow.
When they're wrong:
- False positive (blocked safe action): annoying but recoverable
- False negative (allowed dangerous action): catastrophic
The asymmetry is brutal. One false negative can be career-ending. So teams over-filter (agents become useless) or under-filter (agents become dangerous).
The Blacklist Trap
Here's the core issue: guardrails are a blacklist approach.
Blacklists have a fundamental flaw: they fail open.
Anything you didn't think to block is allowed. Novel attacks succeed by default. The burden is on defenders to anticipate every possible misuse.
This is why security moved away from blacklists decades ago. Firewalls evolved from "block known bad ports" to "allow only known good ports." The same shift needs to happen for agents.
The Alternative: Constrain the Space
What if, instead of filtering what agents try to do, you constrained what they could do?
The closed world model: the agent can only invoke actions that have been explicitly defined ahead of time. Anything outside that set isn't filtered out at runtime; it simply doesn't exist.
This is a whitelist approach: instead of enumerating everything that's forbidden, you enumerate everything that's permitted.
The shift is subtle but profound:
| Guardrails | Closed World |
|---|---|
| "Is this action dangerous?" | "Does this action exist?" |
| Guess at runtime | Known at design time |
| Fail open (novel attacks succeed) | Fail closed (undefined = impossible) |
| Cat-and-mouse with adversaries | Adversaries constrained to defined actions |
You're not blocking dangerous actions. You're making them impossible to express.
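The difference can be sketched in a few lines. Below is a hypothetical action registry (the names `list_invoices` and `get_invoice` are invented for illustration): the only callable actions are those that were registered, and an undefined action fails by lookup, not by filtering.

```python
# A closed world: the only callable actions are those explicitly registered.
# Action names here are hypothetical; the registry shape is the point.
ACTIONS = {
    "list_invoices": lambda status: f"invoices with status {status}",
    "get_invoice": lambda invoice_id: f"invoice {invoice_id}",
}

def invoke(name: str, **kwargs):
    action = ACTIONS.get(name)
    if action is None:
        # Undefined actions aren't "blocked" -- they simply don't exist.
        raise LookupError(f"no such action: {name}")
    return action(**kwargs)

invoke("get_invoice", invoice_id=42)  # defined, so callable
# invoke("drop_all_tables") raises LookupError: it was never defined.
```

There is no pattern to bypass here; an attacker's only move is to misuse one of the two defined actions.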
From Plastic to Lego
This reframes the problem.
The job isn't "add guardrails to raw systems." The job is "turn plastic into Lego."
Before an agent ever sees a capability:
1. The capability is explicitly defined
2. Its inputs are typed and constrained
3. Its behavior is documented
4. It's been validated against quality rules
5. It's been approved for inclusion
By the time the agent discovers an action, the action is already safe. Not "probably safe." Not "safe unless prompt-injected." Safe by construction.
The agent operates in a world of Lego blocks. It never sees the plastic.
The Three Constraints
A closed world needs three properties:
1. Schema Constraint
Every action has typed inputs.
The agent can't pass arbitrary data. Inputs are validated before execution. SQL injection? Can't happen. There's no free-form query field.
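As a minimal sketch of a typed input, here is a hypothetical schema for a "refund" action using a standard-library dataclass (the field names and limits are invented). Validation runs at construction time, before anything executes.

```python
from dataclasses import dataclass

# Hypothetical typed input for a "refund" action. The agent fills these
# fields; anything that doesn't fit the schema is rejected before execution.
@dataclass(frozen=True)
class RefundInput:
    order_id: int
    amount_cents: int

    def __post_init__(self):
        if not isinstance(self.order_id, int) or self.order_id <= 0:
            raise ValueError("order_id must be a positive integer")
        if not isinstance(self.amount_cents, int) or not 0 < self.amount_cents <= 50_000:
            raise ValueError("amount_cents must be between 1 and 50000")

RefundInput(order_id=42, amount_cents=1999)  # valid, proceeds downstream
# RefundInput(order_id="42; DROP TABLE orders", amount_cents=1)
# raises ValueError: there is no free-form field for an injection to live in.
```

Libraries like Pydantic or JSON Schema do the same job with less ceremony; the point is that validation happens at the type boundary, not in a runtime filter.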
2. Discovery Constraint
Actions are only visible based on role.
The analyst agent doesn't see admin actions. They're not "blocked." They don't exist in the analyst's world.
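A sketch of role-scoped discovery, with invented roles and action names: each role sees only its slice of the registry, and an unknown role sees nothing rather than everything.

```python
# Role-based discovery: each role sees only its slice of the action space.
# Roles and action names are hypothetical.
ACTIONS_BY_ROLE = {
    "analyst": {"run_report", "list_dashboards"},
    "admin": {"run_report", "list_dashboards", "rotate_keys", "delete_user"},
}

def visible_actions(role: str) -> set[str]:
    # An unknown role sees nothing, not everything: fail closed.
    return ACTIONS_BY_ROLE.get(role, set())

assert "delete_user" not in visible_actions("analyst")  # not blocked: absent
assert "delete_user" in visible_actions("admin")
assert visible_actions("intern") == set()
```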
3. Execution Constraint
All actions go through a single enforcement point.
No backdoors. No direct access. Every action is validated, permissioned, and logged.
What This Guarantees (And Doesn't)
Guarantees:
- ✓ Bounded action space: only defined actions can be called
- ✓ Typed inputs: no arbitrary data injection
- ✓ Permissioned discovery: role determines what's visible
- ✓ Complete audit trail: all execution through a single point
Not Guaranteed:
- Action correctness: a defined action might have bugs
- Agent reasoning: we constrain actions, not thoughts
- Business logic safety: an action might be unwise but allowed
The honest claim: Agents operate in a bounded world of well-defined actions.
We're not claiming perfect safety. We're claiming the action space is finite, typed, and governed.
The Prompt Injection Case
Consider prompt injection under both models.
Guardrails approach: an injected instruction makes the agent attempt something harmful, and a runtime filter has to recognize the attempt. Miss once and the attack succeeds.
Closed world approach: an injected instruction can only steer the agent toward actions that already exist, with typed inputs, within its role's permissions. The worst case is misuse of a defined action, not an arbitrary one.
The attack surface shrinks from "anything the system can do" to "anything in the defined action set."
That's still a surface. But it's bounded. Auditable. Designed rather than discovered.
Practical Implications
If you're building agent systems:
Don't:
- ✗ Give agents raw API access and "add safety later"
- ✗ Rely on prompt engineering to prevent misuse
- ✗ Trust runtime filters to catch all dangerous actions
- ✗ Assume "it hasn't happened yet" means it won't
Do:
- ✓ Define the action space explicitly before agents use it
- ✓ Type all inputs and validate before execution
- ✓ Make discovery role-based (invisible = inaccessible)
- ✓ Route all execution through a single auditable point
- ✓ Assume adversarial inputs and design accordingly
The Bigger Picture
Agent safety isn't a feature you bolt on. It's an architecture you build from.
The teams that get this right won't be the ones with the cleverest guardrails. They'll be the ones who understood that the action space itself is the control surface.
Define the world carefully. Let agents operate freely within it.
That's not a limitation on agents. It's what makes agents deployable.