Why Agent Guardrails Don't Work (And What Does)
Everyone building AI agents eventually hits the same wall: how do you let an agent act on real systems without it doing something catastrophic?
Sid
Founder, Vyuh
The instinctive answer is guardrails. Filters. Safety layers. Runtime checks.
This approach feels right. It's also fundamentally flawed.
The Plastic Problem
Start with a mental model.
Existing software is unshaped plastic.
Every enterprise has valuable software: APIs, databases, internal tools, legacy systems. This software is powerful. It runs the business.
But from an AI agent's perspective, it's:
- Sharp-edged: designed for humans who know the implicit rules
- Inconsistent: different conventions, different authors, different eras
- Undocumented: tribal knowledge embedded in code
- Unsafe: no concept of "an AI might try to use this"
Now consider what AI agents are good at: planning sequences of actions, reasoning about goals, composing steps into workflows.
Agents are builders. They're good at assembling things.
But builders can't work with unshaped plastic. Give a builder raw, irregular material and they'll cut themselves. Or cut you.
What builders need are Lego blocks: uniform, labeled, safe to handle, designed to fit together.
The gap between existing software (plastic) and what agents need (Lego) is the core problem.
The Guardrail Instinct
The obvious solution: let agents interact with raw systems, but add guardrails.
This is how most agent safety systems work. It's intuitive. It's also a trap.
Why Guardrails Fail
The Enumeration Problem
For a guardrail to work, it must answer: "Is this action dangerous?"
This requires enumerating dangerous actions. But the space of dangerous actions is infinite. You can't list them all.
- Block SQL injection patterns? Attackers find new patterns.
- Block file system access? What about symlinks?
- Block "delete" commands? What about "truncate"? "drop"? "remove"?
Every guardrail is a game of whack-a-mole.
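To make the enumeration problem concrete, here is a toy blacklist filter in Python. It is a deliberately minimal sketch, not any real safety layer: two regex patterns their author thought of, and one destructive statement they didn't.

```python
import re

# A toy blacklist: block statements matching known-dangerous keywords.
# (Illustrative only; real filters are more elaborate but share the flaw.)
BLOCKED_PATTERNS = [
    re.compile(r"\bDELETE\b", re.IGNORECASE),
    re.compile(r"\bDROP\b", re.IGNORECASE),
]

def is_blocked(statement: str) -> bool:
    return any(p.search(statement) for p in BLOCKED_PATTERNS)

# The patterns catch what their author anticipated...
assert is_blocked("DELETE FROM users")
assert is_blocked("drop table users")

# ...and miss everything else. TRUNCATE is just as destructive,
# but it was never enumerated, so it sails through.
assert not is_blocked("TRUNCATE TABLE users")
```

Extending the list to cover `TRUNCATE` just moves the hole somewhere else; that is the whack-a-mole.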
The Context Problem
The same action can be safe or dangerous depending on context.
- `DELETE FROM users WHERE id = 5`: dangerous if unauthorized, routine for a valid admin
- `GET /api/internal/metrics`: fine for monitoring, catastrophic if exposed
- `send_email(to=user, body=data)`: normal unless the data contains secrets
Guardrails evaluate actions in isolation. Danger often lives in context.
The Adversarial Problem
Guardrails assume good-faith agents making occasional mistakes.
But agents are susceptible to prompt injection. A malicious input can make an agent want to do something harmful. Now your guardrail is facing an adversary, not a bumbling assistant.
Adversarial inputs are designed to bypass filters. This is a losing game.
The Confidence Problem
Guardrails must make a binary decision: block or allow.
When they're wrong:
- False positive (blocked safe action): annoying but recoverable
- False negative (allowed dangerous action): catastrophic
The asymmetry is brutal. One false negative can be career-ending. So teams over-filter (agents become useless) or under-filter (agents become dangerous).
The Blacklist Trap
Here's the core issue: guardrails are a blacklist approach.
Blacklists have a fundamental flaw: they fail open.
Anything you didn't think to block is allowed. Novel attacks succeed by default. The burden is on defenders to anticipate every possible misuse.
This is why security moved away from blacklists decades ago. Firewalls evolved from "block known bad ports" to "allow only known good ports." The same shift needs to happen for agents.
The Alternative: Constrain the Space
What if, instead of filtering what agents try to do, you constrained what they could do?
The closed world model: the agent can only invoke actions that have been explicitly defined ahead of time. Anything outside that set isn't filtered out at runtime; it simply doesn't exist.
This is a whitelist approach: instead of enumerating everything that's forbidden, you enumerate everything that's permitted.
The shift is subtle but profound:
| Guardrails | Closed World |
|---|---|
| "Is this action dangerous?" | "Does this action exist?" |
| Guess at runtime | Known at design time |
| Fail open (novel attacks succeed) | Fail closed (undefined = impossible) |
| Cat-and-mouse with adversaries | Adversaries constrained to defined actions |
You're not blocking dangerous actions. You're making them impossible to express.
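The difference can be sketched in a few lines. Below is a hypothetical action registry (the names `list_invoices` and `get_invoice` are invented for illustration): the only callable actions are those that were registered, and an undefined action fails by lookup, not by filtering.

```python
# A closed world: the only callable actions are those explicitly registered.
# Action names here are hypothetical; the registry shape is the point.
ACTIONS = {
    "list_invoices": lambda status: f"invoices with status {status}",
    "get_invoice": lambda invoice_id: f"invoice {invoice_id}",
}

def invoke(name: str, **kwargs):
    action = ACTIONS.get(name)
    if action is None:
        # Undefined actions aren't "blocked" -- they simply don't exist.
        raise LookupError(f"no such action: {name}")
    return action(**kwargs)

invoke("get_invoice", invoice_id=42)  # defined, so callable
# invoke("drop_all_tables") raises LookupError: it was never defined.
```

There is no pattern to bypass here; an attacker's only move is to misuse one of the two defined actions.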
From Plastic to Lego
This reframes the problem.
The job isn't "add guardrails to raw systems." The job is "turn plastic into Lego."
Before an agent ever sees a capability:
1. The capability is explicitly defined
2. Its inputs are typed and constrained
3. Its behavior is documented
4. It's been validated against quality rules
5. It's been approved for inclusion
By the time the agent discovers an action, the action is already safe. Not "probably safe." Not "safe unless prompt-injected." Safe by construction.
The agent operates in a world of Lego blocks. It never sees the plastic.
The Three Constraints
A closed world needs three properties:
1. Schema Constraint
Every action has typed inputs.
The agent can't pass arbitrary data. Inputs are validated before execution. SQL injection? Can't happen. There's no free-form query field.
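As a minimal sketch of a typed input, here is a hypothetical schema for a "refund" action using a standard-library dataclass (the field names and limits are invented). Validation runs at construction time, before anything executes.

```python
from dataclasses import dataclass

# Hypothetical typed input for a "refund" action. The agent fills these
# fields; anything that doesn't fit the schema is rejected before execution.
@dataclass(frozen=True)
class RefundInput:
    order_id: int
    amount_cents: int

    def __post_init__(self):
        if not isinstance(self.order_id, int) or self.order_id <= 0:
            raise ValueError("order_id must be a positive integer")
        if not isinstance(self.amount_cents, int) or not 0 < self.amount_cents <= 50_000:
            raise ValueError("amount_cents must be between 1 and 50000")

RefundInput(order_id=42, amount_cents=1999)  # valid, proceeds downstream
# RefundInput(order_id="42; DROP TABLE orders", amount_cents=1)
# raises ValueError: there is no free-form field for an injection to live in.
```

Libraries like Pydantic or JSON Schema do the same job with less ceremony; the point is that validation happens at the type boundary, not in a runtime filter.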
2. Discovery Constraint
Actions are only visible based on role.
The analyst agent doesn't see admin actions. They're not "blocked." They don't exist in the analyst's world.
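A sketch of role-scoped discovery, with invented roles and action names: each role sees only its slice of the registry, and an unknown role sees nothing rather than everything.

```python
# Role-based discovery: each role sees only its slice of the action space.
# Roles and action names are hypothetical.
ACTIONS_BY_ROLE = {
    "analyst": {"run_report", "list_dashboards"},
    "admin": {"run_report", "list_dashboards", "rotate_keys", "delete_user"},
}

def visible_actions(role: str) -> set[str]:
    # An unknown role sees nothing, not everything: fail closed.
    return ACTIONS_BY_ROLE.get(role, set())

assert "delete_user" not in visible_actions("analyst")  # not blocked: absent
assert "delete_user" in visible_actions("admin")
assert visible_actions("intern") == set()
```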
3. Execution Constraint
All actions go through a single enforcement point.
No backdoors. No direct access. Every action is validated, permissioned, and logged.
What This Guarantees (And Doesn't)
Guarantees:
- ✓ Bounded action space: only defined actions can be called
- ✓ Typed inputs: no arbitrary data injection
- ✓ Permissioned discovery: role determines what's visible
- ✓ Complete audit trail: all execution through a single point
Not Guaranteed:
- Action correctness: a defined action might have bugs
- Agent reasoning: we constrain actions, not thoughts
- Business logic safety: an action might be unwise but allowed
The honest claim: Agents operate in a bounded world of well-defined actions.
We're not claiming perfect safety. We're claiming the action space is finite, typed, and governed.
The Prompt Injection Case
Consider prompt injection under both models.
Guardrails approach: an injected instruction makes the agent attempt something harmful, and a runtime filter has to recognize the attempt. Miss once and the attack succeeds.
Closed world approach: an injected instruction can only steer the agent toward actions that already exist, with typed inputs, within its role's permissions. The worst case is misuse of a defined action, not an arbitrary one.
The attack surface shrinks from "anything the system can do" to "anything in the defined action set."
That's still a surface. But it's bounded. Auditable. Designed rather than discovered.
Practical Implications
If you're building agent systems:
Don't:
- ✗ Give agents raw API access and "add safety later"
- ✗ Rely on prompt engineering to prevent misuse
- ✗ Trust runtime filters to catch all dangerous actions
- ✗ Assume "it hasn't happened yet" means it won't
Do:
- ✓ Define the action space explicitly before agents use it
- ✓ Type all inputs and validate before execution
- ✓ Make discovery role-based (invisible = inaccessible)
- ✓ Route all execution through a single auditable point
- ✓ Assume adversarial inputs and design accordingly
The Bigger Picture
Agent safety isn't a feature you bolt on. It's an architecture you build from.
The teams that get this right won't be the ones with the cleverest guardrails. They'll be the ones who understood that the action space itself is the control surface.
Define the world carefully. Let agents operate freely within it.
That's not a limitation on agents. It's what makes agents deployable.