15 Research Lab -Adversarial Safety Evaluation of Frontier AI Systems

John Kearney

Research Agenda

Core Thesis

AI safety evaluation infrastructure is an under-audited attack surface. The tools we use to determine whether AI systems are safe are themselves vulnerable to the systems they evaluate. Scorers can be prompt-injected. Monitors can be evaded. Scoring code can be gamed. Evaluation environments leak signals that allow models to detect they are being tested.

Nobody is auditing this infrastructure with professional offensive security methodology. 15 Research Lab does that. We treat every evaluation framework as a system to be penetration tested, not trusted.

Methodology

We apply offensive security kill chains and penetration testing methodology to AI safety evaluation. Reconnaissance, enumeration, exploitation, post-exploitation, and reporting — the same phases used against network infrastructure, applied to evaluation pipelines. Every finding includes reproduction steps and proposed fixes. Findings are operationalized as detection modules in Authensor, our open-source framework.

Active Research Threads

LLM-as-Judge Vulnerability Class

Active

LLM-as-judge evaluation pipelines are vulnerable to prompt injection from the models they evaluate. We have identified this vulnerability class across multiple major evaluation frameworks. The evaluated model can influence its own score by injecting instructions into tool output, chain-of-thought, or structured responses that the judge model processes.

Outputs: 47 vulnerabilities across 12 frameworks documented. Preprint and responsible disclosure forthcoming.

Multi-Turn Adversarial Dynamics

Active

Single-turn safety is largely solved. Multi-turn is not. Our experiments show 71% compliance erosion via 15-turn gradual escalation on requests that get 0% when asked directly. The gradient steepness, not the endpoint, determines success. We study the mechanisms: contextual compliance momentum, frustration-as-evidence, and presentation-decision coupling.

Outputs: Refusal as Exfiltration paper (preprint). 73+ agent trials across 5 rounds. Read the paper →

Evaluation Brittleness and Goodhart's Law in AI Safety

Active

Safety benchmarks are subject to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. We audit evaluation scoring for gaming vectors: regex exploitation, TOCTOU gaps, string-matching bypasses, and scoring code access. Multiple major frameworks demonstrate exploitable patterns.

Outputs: Cross-framework findings documented. Preprint forthcoming.

Economic Dynamics of AI Safety Infrastructure

Active

Safety mechanisms have economic costs. RLHF-induced verbosity inflates output length, costing an estimated $1.2 billion annually. Information density (facts per token) is a better optimization target than length penalties. We study how economic incentives shape the safety properties of deployed systems.

Outputs: The Verbosity Premium paper (preprint). RCI framework for measuring information density. Read the paper →

Architectural Constraints on Algorithmic Discovery

Active

Neural networks hit hard capacity limits when learning multiple algorithmic tasks simultaneously. We demonstrate a capacity cliff: a d=128 transformer groks 5 operations but collapses at 6. Modular architecture with half the parameters recovers full capability. This has implications for how safety properties compose (or fail to compose) in multi-task models.

Outputs: Grokking Has Finite Capacity paper (preprint). Read the paper →

Collaboration

We are looking for collaborators with expertise in adversarial machine learning, evaluation design, or AI governance. We bring offensive security methodology, a growing corpus of cross-framework vulnerability data, and the tooling to operationalize findings. We are particularly interested in working with evaluation framework maintainers who want their tools audited before deployment.

Get in touch · GitHub