The Future of AI Agent Security
AI agent security is a young field. The tools and techniques we use today are first-generation solutions. Here is where the field is heading based on current research directions and industry trends.
Formal Verification of Safety Properties
Today we test for safety empirically: run payloads, measure detection rates, red-team the system. Empirical testing gives confidence but not certainty. The future includes formal methods that prove safety properties mathematically.
Research directions:
- Formal specification of agent authorization policies with provable enforcement guarantees
- Model checking for multi-agent interaction protocols
- Static analysis of agent pipelines for policy completeness (can the agent reach any state where it has unauthorized capability?)
This is hard: systems that interpret natural language inherently resist formal analysis. But specific components (policy engines, audit trail integrity, authorization chains) can be formally verified even if the model itself cannot.
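The reachability question above (can the agent reach a state with unauthorized capability?) is decidable when the policy model is finite. A minimal model-checking sketch, using hypothetical tool and capability names of my own choosing, exhaustively explores every capability set a tool sequence could accumulate:

```python
from collections import deque

# Hypothetical model: each tool grants capabilities, and policy forbids
# certain capability combinations. Exhaustive search over reachable
# capability sets is a tiny stand-in for a real model checker.
TOOL_GRANTS = {
    "read_docs":  {"read"},
    "summarize":  {"read"},
    "send_email": {"send"},
    "file_write": {"write"},
}

# Policy: the agent must never hold both send and write access at once.
FORBIDDEN = {frozenset({"send", "write"})}

def reachable_states(tools):
    """BFS over capability sets reachable by any sequence of tool calls."""
    start = frozenset()
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for grants in tools.values():
            nxt = frozenset(state | grants)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def violations(tools, forbidden):
    """Reachable states that contain a forbidden capability combination."""
    return [s for s in reachable_states(tools)
            if any(bad <= s for bad in forbidden)]

print(violations(TOOL_GRANTS, FORBIDDEN))  # non-empty: send+write is reachable
```

Here the checker finds a violation because nothing stops the agent from calling `send_email` and then `file_write`; a provably safe design would make the grant rules themselves rule that path out.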
Multi-Agent Governance
As multi-agent systems become common, governance must extend beyond individual agents to the system level:
- Cross-agent authorization frameworks where delegated tasks carry the original requester's permissions
- System-level behavioral monitoring that tracks patterns across agents, not just within them
- Formal trust models for inter-agent communication
- Standardized protocols for agent identity and capability attestation
The Model Context Protocol (MCP) specification will likely evolve to include identity, authorization, and audit primitives that are currently left to implementers.
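The first item above, delegated tasks carrying the original requester's permissions, amounts to permission attenuation: delegation may only narrow authority, never widen it. A minimal sketch, with names (`DelegationToken`, `delegate`) that are illustrative rather than from any existing spec:

```python
from dataclasses import dataclass

# Illustrative delegation token: a sub-agent's permissions are the
# intersection of what it asks for and what the requester already holds,
# so delegation can only attenuate authority.
@dataclass(frozen=True)
class DelegationToken:
    requester: str
    permissions: frozenset

def delegate(token: DelegationToken, requested: set) -> DelegationToken:
    """Issue a narrower token; the original requester identity is preserved."""
    return DelegationToken(
        requester=token.requester,
        permissions=token.permissions & frozenset(requested),
    )

root = DelegationToken("alice", frozenset({"read", "summarize"}))
child = delegate(root, {"read", "send_email"})  # send_email is silently dropped
print(child.permissions)  # frozenset({'read'})
```

Because the requester field survives every hop, an audit of any downstream action can be traced back to the human who authorized the original task.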
Hardware-Level Agent Controls
Current safety controls operate at the software level. Future controls may include:
- Trusted execution environments for policy engines (hardware-enforced isolation from the agent)
- Hardware security modules for audit receipt signing
- Memory protection that prevents agents from accessing data outside their authorized scope
These approaches are borrowed from traditional security and applied to the agent context.
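To make the audit-receipt idea concrete, here is a hedged sketch of hash-chained, signed receipts. In production the signing key would live inside an HSM or trusted execution environment; here an in-memory HMAC key stands in for the hardware signing call, and the record fields are my own illustration:

```python
import hashlib
import hmac
import json

# In a real deployment this key never leaves the HSM; hmac here is a
# software stand-in for the hardware signing operation.
KEY = b"demo-key-kept-in-hsm-in-production"

def append_receipt(chain, event: dict) -> dict:
    """Append a receipt whose digest covers the previous receipt (hash chain)."""
    prev = chain[-1]["digest"] if chain else "0" * 64
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    digest = hashlib.sha256(body.encode()).hexdigest()
    receipt = {
        "event": event,
        "prev": prev,
        "digest": digest,
        "sig": hmac.new(KEY, digest.encode(), "sha256").hexdigest(),
    }
    chain.append(receipt)
    return receipt

def verify_chain(chain) -> bool:
    """Recompute every digest and signature; any tampering breaks the chain."""
    prev = "0" * 64
    for r in chain:
        body = json.dumps({"prev": prev, "event": r["event"]}, sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != r["digest"]:
            return False
        expected = hmac.new(KEY, r["digest"].encode(), "sha256").hexdigest()
        if not hmac.compare_digest(r["sig"], expected):
            return False
        prev = r["digest"]
    return True

chain = []
append_receipt(chain, {"tool": "send_email", "decision": "denied"})
append_receipt(chain, {"tool": "read_docs", "decision": "allowed"})
print(verify_chain(chain))  # True
```

The value of the hardware boundary is that even an agent (or attacker) with full software access cannot forge signatures without the key, so a verified chain proves the log was not rewritten after the fact.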
Convergence of AI Safety and Cybersecurity
The two fields are merging. AI safety researchers need cybersecurity expertise (threat modeling, defense-in-depth, incident response). Security practitioners need AI safety expertise (model failure modes, alignment concepts, evaluation methodology).
The MITRE ATT&CK mapping to alignment failure modes is an early example. The OWASP Agentic Security Initiative is another. Expect more cross-pollination: security conferences featuring AI safety talks, AI safety conferences featuring security practitioner content.
Regulatory Maturation
The EU AI Act is the first major AI-specific regulation. Others will follow. As regulatory bodies gain experience with AI agent incidents, requirements will become more specific:
- Standardized audit trail formats (beyond the current Article 12 requirements)
- Mandatory testing requirements with defined methodologies
- Incident reporting obligations
- Third-party certification of safety controls
Organizations that build strong technical controls now will be ahead of the regulatory curve.
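A standardized audit trail format would likely reduce, in practice, to a schema that every record must satisfy. The field names below are purely hypothetical (no regulation currently mandates them); the point is that conformance checking is mechanical once a format is fixed:

```python
# Hypothetical required fields for a standardized audit record.
# These names are illustrative, not drawn from any regulation.
REQUIRED = {"timestamp", "agent_id", "action", "decision", "policy_version"}

def missing_fields(record: dict) -> list:
    """Return the sorted list of missing required fields (empty means valid)."""
    return sorted(REQUIRED - record.keys())

rec = {
    "timestamp": "2025-01-01T12:00:00Z",
    "agent_id": "agent-7",
    "action": "send_email",
    "decision": "denied",
}
print(missing_fields(rec))  # ['policy_version']
```

Teams that already emit structured records like this will have a far easier time mapping them onto whatever format regulators eventually standardize.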
Open Questions
Several fundamental questions remain unresolved:
- Can prompt injection be solved at the model level, or is runtime defense the permanent answer?
- How do you monitor multi-agent systems at scale without prohibitive overhead?
- What is the right level of human oversight as agents become more capable?
- How do you formally specify "the agent should not do anything harmful" in a way that is both complete and implementable?
These questions define the research agenda for the next several years. The answers will shape the tools and practices that become standard.
The direction is clear: agent security is converging with traditional security disciplines while developing its own specialized techniques. Build on proven security principles. Adapt them for the agent context. Contribute to the open-source ecosystem that makes this accessible to everyone.