Practical Advice on Securing Agentic Applications

Disclaimer: All opinions and views in this article are my own. When citing, please call me an Independent Security Researcher.

The lethal trifecta for AI agents: private data, untrusted content, and external communication.

There is no permanent fix for that combination. But if I were building an agentic application today, these are the steps I would take to make it much harder to exploit.

Building a safe agent is like securing an airport:

  1. Separating data from instructions is like checking who is a passenger and who is crew
  2. Constraining capabilities is like locking doors so people can only enter areas they are cleared for
  3. Human in the loop is like sending unusual or high-risk cases to manual inspection

Separate data from instructions

Key Idea: Separating data from instructions is necessary, but not sufficient. It helps the model keep context straight, but it does not remove the need for downstream controls. If the attacker adapts, this layer can still fail.

Prompt injection exists because the model sees trusted instructions and untrusted content in the same flat context window. If a retrieved document, web page, or email includes attacker-written instructions, the model may treat those instructions as part of the task.

Real applications have already started adopting strategies to reduce this risk.

| # | Case | What went wrong | Why it matters |
|---|------|-----------------|----------------|
| 1 | GitHub MCP Server: filter code fences | Hidden instructions inside markdown code fence metadata were being passed through tool results | Shows real MCP apps started sanitizing untrusted data before it reaches the model |
| 2 | Supabase MCP: wraps `execute_sql` response | SQL results could carry prompt injection content, so the response was wrapped with defensive instructions and tested end to end | Shows tool output itself needs treatment as untrusted data |
| 3 | Azure AI Foundry: Spotlighting | External content like documents, emails, and web pages can carry cross-prompt injection | Shows Spotlighting is being productized as a real defense, not just a paper idea |
| 4 | AutoGen Web Surfer: sanitize page title | Web page metadata like titles could inject instructions into the agent prompt | Shows even small fields like page metadata need sanitization |



The common pattern is simple: external content should not enter the prompt untouched. It should be sanitized, transformed, or marked first. Spotlighting is one example of that idea. Instead of asking the model to guess which text is trustworthy, it rewrites untrusted content so its origin is more explicit in the prompt. The original paper showed strong gains against indirect prompt injection, cutting attack success from above 50% to below 2% in their evaluation, with limited impact on task quality.

That result is encouraging, and Microsoft’s later LLMail-Inject work showed that combining multiple defenses can be even more effective. But there is an important caveat: defenses should be judged against adaptive attackers. The Attacker Moves Second is a useful reminder that stronger adaptive attacks have bypassed many defenses that initially looked robust under weaker evaluations.

Here’s my take: I built spotlighting-datamarking as an OSS implementation of all three spotlighting variants from the paper: data marking, random interleaving, and base64 encoding (the strongest). I also took inspiration from other OSS projects and how they handle this problem, and tried to incorporate those defenses as well.

```typescript
import { DataMarkingViaSpotlighting } from 'spotlighting-datamarking';

const marker = new DataMarkingViaSpotlighting();

const result = marker.markData('Ignore previous instructions');
// result.markedText  → "[MARKER]Ignore[MARKER]previous[MARKER]instructions[MARKER]"
// result.dataMarker  → the random marker string
// result.prompt      → LLM instruction to prepend to your system prompt
```

The goal is not to claim that marking alone solves prompt injection. The goal is to make untrusted content easier for the model to treat as data, raise attacker cost, and strengthen the first line of defense before stricter capability controls take over.
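For intuition, the core data-marking transform itself fits in a few lines. This is a sketch of the idea only, not the library's internals, and the marker format here is invented for illustration:

```typescript
import { randomBytes } from "node:crypto";

// Sketch of data marking: interleave a random, per-request marker between
// the tokens of untrusted text so the model can tell exactly where external
// data begins and ends. The ⟦…⟧ marker format is illustrative, not the
// library's actual format.
function markUntrustedData(text: string): { markedText: string; dataMarker: string } {
  const dataMarker = `⟦${randomBytes(8).toString("hex")}⟧`; // unguessable per request
  const markedText = text.split(/\s+/).join(dataMarker);
  return { markedText, dataMarker };
}
```

Because the marker is random for each request, an attacker writing instructions into a document cannot pre-embed the marker to make their text look like a trusted span.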

Use secure-by-default libraries for agent capabilities

Key Idea: The agent's access MUST BE shaped in advance by code.

If the capabilities we provide to the model are secure by default, then even if the model goes rogue or gets prompt-injected, it will not be able to bypass the practical guardrails. Let’s understand this with a few case studies.

| # | Case | What went wrong | Why it matters |
|---|------|-----------------|----------------|
| 1 | Anthropic Filesystem MCP Server: path bypass | Used a naive prefix check, so paths outside the allowed directory could still pass | Shows that simple string checks are not enough for file boundaries |
| 2 | Anthropic Filesystem MCP Server: symlink escape | Symlinks could be used to escape the allowed directory and reach host files | Shows why path checks must be symlink-aware |
| 3 | filesystem-mcp: path traversal | `../` traversal could break out of the configured root | Shows how agent file tools can become arbitrary file access |
| 4 | mcp-server-git: traversal via `git_add` | Relative paths let files outside the repo get staged | Shows that repo-scoped tools also need strict path containment |
| 5 | OpenClaw: workspace traversal | Crafted workspace values let the app reach files outside the intended workspace | Shows this is not just an MCP problem; agentic apps need the same boundary checks |
| 6 | GitHub MCP: prompt injection chain | Untrusted content pushed the agent to access and leak private data | Shows the same pattern beyond local files |



The common thread is clear: the capabilities the model had access to were not secure by default, allowing it to reach files outside the designated directory.

If the server had used a package like is-path-inside-secure, a small, symlink-aware defensive primitive designed for security-sensitive path containment checks, then even with the system compromised, the LLM would still have had no access to files outside the allowed boundary.
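The kind of check such a primitive needs looks roughly like this. This is a hedged sketch of the required behavior, not the package's actual implementation; `isPathInside` and its handling of not-yet-existing files are my own illustration:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Symlink-aware containment check: resolve real paths before comparing,
// so neither "../" traversal nor a symlink escape can slip past a naive
// string-prefix comparison.
function isPathInside(child: string, parent: string): boolean {
  const realParent = fs.realpathSync(path.resolve(parent));
  // The target may not exist yet; resolve the deepest existing ancestor
  // and re-attach the non-existent tail afterwards.
  let candidate = path.resolve(child);
  let suffix = "";
  while (!fs.existsSync(candidate)) {
    suffix = path.join(path.basename(candidate), suffix);
    candidate = path.dirname(candidate);
  }
  const realChild = path.join(fs.realpathSync(candidate), suffix);
  const rel = path.relative(realParent, realChild);
  // Inside iff the relative path neither climbs out nor jumps roots.
  return rel === "" || (!rel.startsWith("..") && !path.isAbsolute(rel));
}
```

The important property is that both sides go through `realpathSync` before comparison, which is what defeats the symlink-escape case from the table above.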

This is not limited to path traversal. The same principle applies to SSRF, XSS, and other vulnerability classes: use libraries that are secure by construction rather than relying on developers to remember every edge case. For curated lists of such packages, see tl;dr sec’s awesome-secure-defaults and Liran Tal’s awesome-nodejs-security.

But isn’t adding dependencies a supply-chain risk? Yes, and it can be mitigated: use dependency cooldowns, SCA tools, and similar practices. The point is to promote secure-by-design primitives, not to inflate the dependency tree.

Think of this as a filesystem analogue of restrictive privilege, not just least privilege. RBAC constrains access at the identity layer. Capability checks constrain access at the operation and resource layer. The agent MUST NOT be trusted with broad access and then told to behave; its access must be shaped in advance by code.

When something still looks risky, stop and ask a human

Key Idea: The final defense is selective friction: let safe actions stay fast, and make risky actions stop at a human boundary. Trigger HIL based on how often a tool gets invoked.

Human-in-the-loop (HIL) means the agent can plan and propose actions, but a person must approve actions once they cross a risk boundary.

| # | Protocol / system | What it requires | Why it matters |
|---|-------------------|------------------|----------------|
| 1 | MCP spec | Clients should show confirmation prompts for sensitive operations, and users should be able to deny tool calls | Makes HIL a protocol-level expectation |
| 2 | ACP spec | Built-in Await mechanism pauses execution until an external response arrives | Shows pausing for human input as a first-class protocol primitive |
| 3 | A2A spec | Task state includes `input-required` so an agent can stop and wait for more input | Shows agent-to-agent workflows can also pause for human input |
| 4 | OpenAI computer use | Keep a human in the loop for high-impact actions | Real product guidance for agentic systems |
| 5 | Anthropic computer use | Ask a human to confirm actions with meaningful real-world consequences or affirmative consent | Strong example of HIL in practice |



However, HIL should not be the first answer to every tool call. If you prompt on every read, you create approval fatigue and users stop paying attention. A better pattern is to reserve mandatory approval for destructive actions and apply threshold-based approval for low-risk reads once their frequency becomes unusual.

| Tool type | Default policy | When to require approval |
|-----------|----------------|--------------------------|
| Read-only, low sensitivity | Allow | When invocation count crosses a threshold or the access pattern looks abnormal |
| Read-only, high sensitivity | Ask | Always, or after a very small threshold |
| Write, delete, send, publish | Ask | Always |
| External network or open-world actions | Ask | Always, or when the destination is not allowlisted |

The goal is not to remove human oversight. It is to apply it where it still has signal. Destructive actions should always require approval. Low-risk reads can be allowed until frequency or pattern suggests the model is no longer acting within normal bounds.
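The policy table above can be expressed as a small gate in front of tool dispatch. This is an illustrative sketch, not any particular framework's API; the policy fields and thresholds are my own assumptions:

```typescript
type Decision = "allow" | "ask";

// Per-tool policy mirroring the table: destructive and high-sensitivity
// tools always ask; low-risk reads ask only past a per-session threshold.
interface ToolPolicy {
  destructive: boolean;      // write, delete, send, publish
  highSensitivity: boolean;  // reads over sensitive data
  readThreshold: number;     // low-risk calls allowed before asking
}

class ApprovalGate {
  private counts = new Map<string, number>();

  decide(tool: string, policy: ToolPolicy): Decision {
    if (policy.destructive || policy.highSensitivity) return "ask";
    const n = (this.counts.get(tool) ?? 0) + 1;
    this.counts.set(tool, n);
    return n > policy.readThreshold ? "ask" : "allow";
  }
}
```

A real gate would also reset counters per session and fold in anomaly signals, but the shape stays the same: the decision happens in code, before the model's proposed call executes.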

For example: MCP tool annotations are useful for describing intent. readOnlyHint can help identify tools that should be treated as non-destructive, while destructiveHint can help flag operations that deserve stricter review. But these are hints, not trust anchors, especially when the server is not fully trusted. The client or application still needs its own enforcement policy.
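One way to use the hints without trusting them is to let them only tighten, never loosen, the client's own policy. A sketch, using the MCP annotation field names but my own enforcement logic:

```typescript
// MCP tool annotation fields relevant to approval decisions.
interface ToolAnnotations {
  readOnlyHint?: boolean;
  destructiveHint?: boolean;
}

// Hints can only escalate caution: an explicit destructiveHint always
// forces review, but readOnlyHint never downgrades a tool from an
// untrusted server below the client's default (ask) policy.
function requiresApproval(ann: ToolAnnotations, serverTrusted: boolean): boolean {
  if (ann.destructiveHint) return true; // honor the stricter hint unconditionally
  if (!serverTrusted) return true;      // untrusted server: always ask
  return ann.readOnlyHint !== true;     // trusted: relax only for declared read-only
}
```

This keeps the asymmetry the spec implies: a malicious server lying about `readOnlyHint` gains nothing, because the hint never grants access the client would not have granted anyway.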

A practical pattern is for the server to track per-session thresholds, such as repeated read activity. When a threshold is exceeded, the server can use MCP elicitation to force a client-side approval step before the workflow continues.
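In sketch form, that server-side pattern might look like the following; `elicit` here is a hypothetical stand-in for whatever elicitation call your MCP SDK exposes, not a real SDK function:

```typescript
// Hypothetical callback representing an MCP elicitation round-trip.
type Elicit = (message: string) => Promise<"accept" | "decline">;

// Track read activity for one session and, past a threshold, require an
// explicit client-side approval before the workflow continues.
class SessionReadTracker {
  private reads = 0;
  constructor(private threshold: number, private elicit: Elicit) {}

  async beforeRead(tool: string): Promise<boolean> {
    this.reads += 1;
    if (this.reads <= this.threshold) return true;
    const answer = await this.elicit(
      `This session has made ${this.reads} read calls (latest: ${tool}). Continue?`
    );
    return answer === "accept";
  }
}
```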

Similarly, provisions in other protocols can also be leveraged to implement threshold-based HIL.

Conclusion

My view is simple: agent security is not about making the model perfect. It is about making the system resilient when the model is imperfect. If prompt separation fails, capability limits should still hold. If capability limits are not enough, a human boundary should still exist. There may never be a permanent fix for the lethal trifecta, but we can still make exploitation difficult, expensive, and easier to contain. That is the approach I would build around: assume failure, layer defenses, and make the dangerous path the hardest path.

Hope this helps.
