Securing Agentic AI: Why Autonomy Changes the Risk Model
Traditional AppSec assumes a human acts on the model's output. Agents remove the human. When output becomes action, the risk model has to change with it.
- Read time: 9 min
- Threat coverage: Agentic risk · Excessive agency
- Frameworks: OWASP Agentic · MITRE ATLAS · NIST AI RMF
- Audience: Security leadership · Architects
For as long as language models have been part of applications, security has rested on a quiet assumption: the model proposes, a human disposes. A language model returned text, and a person decided what to do with it. Agentic AI deletes that step. When the model's output is wired directly to tools, the suggestion is the action, and the risk model has to change with it.
Output becomes action
A chatbot that drafts an email is a content problem: the worst case is a bad sentence a person can ignore. An agent that sends the email is an authorization problem: the worst case is an action taken under your credentials before anyone reviews it. The model is identical; the wiring is not. Autonomy is precisely the removal of the human checkpoint that classic application security leaned on as its last line of defense.
That one change collapses two boundaries at once. The line between data and instruction was already fragile in language models; agentic systems also erase the line between a suggestion and a committed action. A manipulated token stops being a misleading answer and becomes an API call, a transaction, a file write, a message to a customer.
- The threat is not "the model says something harmful." It is "the model does something harmful, with real privileges."
- The asset under attack is not the conversation. It is every system the agent's identity can reach.
- The question is no longer "is the output safe?" but "what is the worst this actor can do, and who authorized it?"
The surface beyond the model
Securing the model (alignment, refusals, content filters) addresses one component of a system that now has five. Each of the other four adds a trust boundary, and each is a place where real agentic incidents begin.
- Model — the reasoning core. Still subject to jailbreaks, but rarely the whole story.
- Retrieval (RAG) — every document the agent reads is untrusted input that can carry instructions. One poisoned record reaches every user who later retrieves it.
- Tools — the agent's hands. An over-scoped tool turns a text manipulation into action across whatever that tool can touch.
- Memory — persistence makes a one-time injection durable: a planted instruction can steer decisions long after the original prompt is gone.
- Identity — the credentials the agent acts under. This is the variable that sets the blast radius, and it is almost always over-provisioned.
Read top to bottom, that list is an attack path: untrusted content enters through retrieval, becomes an instruction in the model, executes through a tool, persists in memory, and acts with the reach of an identity. Defending the model alone leaves four of the five doors open.
Worked scenario
Scheduling agent → calendar-borne privilege change
- Objective: Make a privileged change without holding any credentials.
- Path: Malicious meeting invite → agent reads the description as instruction → calls an over-scoped settings tool.
- Impact: State change executed under the agent's standing identity.
- Detection: A tool call that traces back to no user request in the session.
- Mitigation: Human approval on state-changing tools; per-session, least-privilege credentials.
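The detection signal in this scenario can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the agent runtime already records, for each tool call, which user request motivated it, and that attribution is the hard part in practice. The `ToolCall` type, the `origin_request_id` field, and the tool names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ToolCall:
    call_id: str
    tool: str
    # Hypothetical provenance field: the ID of the user request that motivated
    # this call, or None when the runtime could not attribute it to one.
    origin_request_id: Optional[str]


def unattributed_calls(
    calls: Iterable[ToolCall], user_request_ids: Iterable[str]
) -> List[ToolCall]:
    """Return the tool calls that do not trace back to a user request in this session."""
    known = set(user_request_ids)
    return [c for c in calls if c.origin_request_id not in known]


# The user asked only to check tomorrow's schedule.
session_requests = ["req-1"]
session_calls = [
    ToolCall("call-1", "calendar.read", "req-1"),
    # Prompted by text in the invite description, not by the user.
    ToolCall("call-2", "settings.update", None),
]

flagged = unattributed_calls(session_calls, session_requests)
for call in flagged:
    print(f"ALERT: {call.tool} ({call.call_id}) has no originating user request")
```

Here only the `settings.update` call is flagged. The check is only as good as the provenance data behind it, so it complements the mitigations above rather than replacing them.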
What contains an actor
Because the threat is action, the controls that count are the ones that constrain action. Content defenses — jailbreak filters, input sanitization — lower the odds an agent is manipulated; they do nothing to limit what a manipulated agent can do once it is. Containment lives in the authorization layer, not the prompt.
- Least-privilege identity. Scope each agent's credential to the minimum set of systems its task needs, and issue it short-lived rather than standing. This is the highest-leverage control, because it caps the blast radius however the agent is compromised.
- Tool gating. Put explicit, per-call authorization in front of every state-changing tool, and a human approval gate in front of the irreversible ones.
- Input isolation. Quarantine retrieved and tool-returned content into a context the model treats as data, never as a control channel.
- Action observability. Log every tool call as a security event tied to the session and principal, so a compromised actor is detectable — and revocable — in flight.
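Three of these controls (scoped short-lived credentials, per-call gating, and logging every decision) can be combined in one small authorization layer. The following Python sketch shows one possible shape under stated assumptions: the `Credential`, `Tool`, and `Gate` types, the scope strings, and the log fields are all hypothetical, the gate checks every call rather than only state-changing ones, and input isolation and mid-session revocation are out of scope here.

```python
import time
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Credential:
    principal: str
    scopes: FrozenSet[str]
    expires_at: float  # Unix timestamp; short-lived by construction

    def allows(self, scope: str, now: float) -> bool:
        return now < self.expires_at and scope in self.scopes


@dataclass(frozen=True)
class Tool:
    name: str
    scope: str                  # scope a caller must hold to invoke the tool
    irreversible: bool = False  # irreversible tools also need human approval


@dataclass
class Gate:
    audit_log: List[dict] = field(default_factory=list)

    def authorize(
        self,
        cred: Credential,
        tool: Tool,
        session_id: str,
        human_approved: bool = False,
        now: Optional[float] = None,
    ) -> bool:
        now = time.time() if now is None else now
        if not cred.allows(tool.scope, now):
            decision = "deny: scope missing or credential expired"
        elif tool.irreversible and not human_approved:
            decision = "deny: human approval required"
        else:
            decision = "allow"
        # Every decision, allowed or denied, is recorded as a security event
        # tied to the session and the principal.
        self.audit_log.append(
            {"session": session_id, "principal": cred.principal,
             "tool": tool.name, "decision": decision, "at": now}
        )
        return decision == "allow"


# Hypothetical scheduling agent: calendar access only, expiring at t=1000.
gate = Gate()
cred = Credential(
    "scheduler-agent", frozenset({"calendar:read", "calendar:write"}), expires_at=1000.0
)
read_tool = Tool("calendar.read", "calendar:read")
settings_tool = Tool("settings.update", "settings:write")
purge_tool = Tool("calendar.delete_all", "calendar:write", irreversible=True)

print(gate.authorize(cred, read_tool, "s1", now=500.0))      # in scope
print(gate.authorize(cred, settings_tool, "s1", now=500.0))  # out of scope
print(gate.authorize(cred, purge_tool, "s1", now=500.0))     # no human approval
```

The design choice worth noting is that the gate never inspects the prompt or the model's reasoning. It decides only on the credential, the tool, and the approval flag, so it holds even when the model has been manipulated.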
Framework mapping
| Control | OWASP Top 10 for LLM Applications (2025) | NIST AI RMF |
|---|---|---|
| Least-privilege identity | LLM06 | Manage 1.3 |
| Tool gating & approval | LLM06 | Manage 2.2 |
| Input isolation | LLM01 | Measure 2.7 |
Checklist
- Every agent has its own identity, scoped to least privilege and short-lived.
- State-changing tools require per-call authorization; irreversible ones require human approval.
- Retrieved and tool-returned content is isolated as data, never executed as instruction.
- Every tool call is logged as a security event and can be revoked mid-session.
- The program assumes content filters will fail and verifies the blast radius stays bounded when they do.
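The last item can be turned into a repeatable check. As a rough sketch, assume the worst case (the agent is fully manipulated and will call anything its credential permits) and compare what it can reach against what its task needs. The function names, scope strings, and tool inventory below are hypothetical, and a real inventory would come from the identity provider and tool registry rather than hard-coded dictionaries.

```python
from typing import Dict, Set


def blast_radius(agent_scopes: Set[str], tool_scopes: Dict[str, str]) -> Set[str]:
    """Tools a fully manipulated agent could still invoke with its credential."""
    return {tool for tool, scope in tool_scopes.items() if scope in agent_scopes}


def excess_reach(
    agent_scopes: Set[str], tool_scopes: Dict[str, str], task_tools: Set[str]
) -> Set[str]:
    """Reachable tools that the agent's task does not require."""
    return blast_radius(agent_scopes, tool_scopes) - task_tools


# Hypothetical inventory: tool name -> scope required to call it.
tool_scopes = {
    "calendar.read": "calendar:read",
    "calendar.create_event": "calendar:write",
    "settings.update": "settings:write",
}
task_tools = {"calendar.read", "calendar.create_event"}

over_provisioned = {"calendar:read", "calendar:write", "settings:write"}
least_privilege = {"calendar:read", "calendar:write"}

print(excess_reach(over_provisioned, tool_scopes, task_tools))
print(excess_reach(least_privilege, tool_scopes, task_tools))
```

With the over-provisioned credential the excess set contains `settings.update`, the tool abused in the worked scenario; with the trimmed credential it is empty. Run as a test in CI, a non-empty result would fail the build before the agent ships.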