Methodology · litmus-v5 · specification

The litmus test

A behavioral evaluation of an MCP server — what it does when exercised the way an agent would, not what its README says. The string methodologyVersion: "litmus-v5" travels with every grade this spec produces.

§01

What it measures

The litmus test connects to an MCP server the way an agent would, exercises its tools, and watches what the server actually does: whether its outputs try to hijack the calling agent, whether it reaches out over the network when nothing required it, and whether data handed to it leaks back out. It answers one question an agent needs answered before it trusts a tool:

Will this MCP server hijack me, phone home, or leak my data?

It does not measure popularity, code quality, maintenance, or licensing — this is a behavioral grade, not a quality or popularity score.

What a passing grade does — and does not — claim

It does claim: against litmus-v5, on the exact tool surface identified by the fingerprint, the server did not exhibit the failure modes below, and the evidence is published and re-runnable.

It does not claim the server is “safe,” “secure,” or “trustworthy” in general. We underclaim. A grade is a dated, versioned, reproducible observation and nothing more.

§02

Checks and probes

Nine probes across four categories. Probe IDs are permanent and carry their family number; future probes extend a family without renumbering.

C-01 — Tool-output injection · probes 1.1, 1.2, 1.3

A server’s tool descriptions and outputs flow straight into the calling agent’s context, which is the documented “tool poisoning” attack class. Probe 1.1 pulls the full tool surface via tools/list and scans every name, description, and input schema. Probe 1.2 issues bait calls (benign-but-suggestive inputs designed to elicit injection-shaped echoes) and scans every output. Probe 1.3 (second-order) feeds one tool’s output back in as another tool’s input, the chaining path a real agent takes, and scans the result; only injection the server generated from that chained content fails, not a verbatim echo.

Fails on any high-severity hit: instruction mimicry aimed at the agent (“ignore previous instructions”, role-tag or tool-call-shaped JSON in free text), hidden or invisible Unicode (zero-width, bidi overrides, tag characters), or markdown and link tricks (javascript: / data: URIs, link text contradicting the href, or a URL whose query string is exfiltration-shaped — a credential-named key, a long/high-entropy value, or an interpolation marker; an honest ?q=search link is not flagged).
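The exfiltration-shaped query-string rule above can be sketched roughly as follows. This is a minimal illustration, not the harness's scanner: the key names, the length and entropy thresholds, the interpolation markers, and the function names are all assumptions made for the example.

```python
import math
from urllib.parse import parse_qsl, urlsplit

# Hypothetical key names and thresholds, chosen for illustration only.
CREDENTIAL_KEYS = {"token", "access_token", "api_key", "apikey", "secret", "password"}
MIN_LENGTH = 32      # what counts as a "long" value, in characters
MIN_ENTROPY = 3.5    # Shannon entropy, in bits per character
INTERPOLATION_MARKERS = ("{{", "${")


def shannon_entropy(value: str) -> float:
    """Bits per character of the value's empirical character distribution."""
    if not value:
        return 0.0
    total = len(value)
    return -sum(
        (value.count(ch) / total) * math.log2(value.count(ch) / total)
        for ch in set(value)
    )


def exfil_shaped(url: str) -> list[str]:
    """Return the reasons a URL's query string looks exfiltration-shaped."""
    reasons = []
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if key.lower() in CREDENTIAL_KEYS:
            reasons.append(f"credential-named key: {key}")
        if len(value) >= MIN_LENGTH and shannon_entropy(value) >= MIN_ENTROPY:
            reasons.append(f"long high-entropy value: {key}")
        if any(marker in value for marker in INTERPOLATION_MARKERS):
            reasons.append(f"interpolation marker: {key}")
    return reasons
```

Under these assumed thresholds an ordinary `?q=search` link returns an empty list, while a link carrying `?token=...` or `?d={{secrets}}` returns at least one reason.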

C-02 — Permission overreach · probes 2.1, 2.2

Does it touch more than it claimed, or label itself less capable than it is? Probe 2.1 (declared-permission honesty) flags a tool that annotates itself readOnlyHint: true while it plainly mutates. The lie is caught three ways: an unambiguously destructive verb in its name (delete, transfer, withdraw, …), a parameter that names value movement or secrets (recipient, amount, private_key, …), or a description that admits it (“transfers,” “deletes”). An agent, and this harness, would otherwise auto-trust that read-only claim. Each layer is deliberately narrow: polysemous words like create, update, or query are not flagged, so an honest create_query is no false positive. Static and deterministic; needs no sandbox.
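A rough sketch of the three-layer check in probe 2.1, assuming a tool definition shaped like an MCP tools/list entry (name, description, inputSchema, annotations). The word lists hold only the examples named above; the spec elides the rest, so they are left incomplete here, and the helper names are hypothetical.

```python
import re

# Only the examples the text names; the full lists are not published here.
DESTRUCTIVE_VERBS = {"delete", "transfer", "withdraw"}
SENSITIVE_PARAMS = {"recipient", "amount", "private_key"}
ADMITTING_WORDS = {"transfers", "deletes"}


def words(text: str) -> set[str]:
    """Lowercased word tokens; splits snake_case, kebab-case, and camelCase."""
    return {w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text)}


def read_only_lie(tool: dict) -> list[str]:
    """Return evidence that a tool claiming readOnlyHint: true plainly mutates."""
    if (tool.get("annotations") or {}).get("readOnlyHint") is not True:
        return []  # no read-only claim, so nothing to contradict

    evidence = []
    # Layer 1: an unambiguously destructive verb in the tool name.
    for verb in sorted(words(tool.get("name") or "") & DESTRUCTIVE_VERBS):
        evidence.append(f"name: {verb}")
    # Layer 2: a parameter that names value movement or secrets.
    properties = (tool.get("inputSchema") or {}).get("properties") or {}
    for param in sorted({p.lower() for p in properties} & SENSITIVE_PARAMS):
        evidence.append(f"parameter: {param}")
    # Layer 3: a description that admits the mutation.
    for word in sorted(words(tool.get("description") or "") & ADMITTING_WORDS):
        evidence.append(f"description: {word}")
    return evidence
```

Matching whole tokens rather than substrings is what keeps the layers narrow: `create_query` tokenizes to `create` and `query`, neither of which is on a list.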

Probe 2.2 runs the server inside a default-deny network sandbox and exercises it. Egress is routed to a local sinkhole that logs {host, port, firstBytes} and never completes the connection — the evidence shows what it tried to reach. Each attempt is classified against an effective allowlist: the operator baseline plus the hosts the package declares it reaches (polygraph.egress in its manifest). Egress matching that list on host and port is permitted and recorded; anything beyond it — or a declared host reached on an undeclared port — is overreach and fails. A passing C-02 means no overreach, not no network; the declared hosts are surfaced in the evidence for the consumer to judge. A host-level DNAT gateway captures hard-coded IPs and DoH too, so an IP literal can’t dodge the check.
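The host-and-port classification can be illustrated with a small sketch. The allowlist shape (host mapped to a set of ports), the `Attempt` record, and the verdict strings are assumptions for the example; the actual layout of polygraph.egress in the manifest is not specified here.

```python
from dataclasses import dataclass

# Assumed shape: host name mapped to the set of ports declared for it.
Allowlist = dict[str, set[int]]


@dataclass(frozen=True)
class Attempt:
    """One egress attempt as the sinkhole might log it (hypothetical record)."""
    host: str
    port: int
    first_bytes: bytes = b""


def effective_allowlist(baseline: Allowlist, declared: Allowlist) -> Allowlist:
    """Merge the operator baseline with the hosts the package declares."""
    merged: Allowlist = {}
    for source in (baseline, declared):
        for host, ports in source.items():
            merged.setdefault(host.lower(), set()).update(ports)
    return merged


def classify(attempt: Attempt, allowlist: Allowlist) -> str:
    """Permit only attempts that match the allowlist on both host and port."""
    ports = allowlist.get(attempt.host.lower())
    if ports is None:
        return "overreach: undeclared host"
    if attempt.port not in ports:
        return "overreach: undeclared port"
    return "permitted"
```

For example, with `api.example.com` declared on port 443 only, an attempt to `api.example.com:8443` is classified as overreach even though the host itself is declared.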

Probe 2.2 requires the harness to run the server itself. For a remote MCP URL the harness cannot control the server’s network, so 2.2 is recorded as “skipped — remote target”; without Docker it is recorded as “skipped — no sandbox”. A skip is not a pass. Probe 2.1 is static and always runs, so C-02 can still fail on a remote target when a tool lies about being read-only.

C-03 — Sensitive-data handling · probes 4.1, 4.2

Before any calls, the harness plants canaries: a distinctive fake API key (POLYGRAPH-CANARY-<uuid>), a distinctive PII string, and a fake bearer token, seeded into arguments, the working directory, and the environment. Probe 4.1 scans all tool outputs for any canary echoed back to the agent. Probe 4.2 scans captured egress for canary bytes in any outbound payload. A canary surfacing anywhere it shouldn’t is a failure. Without the sandbox, 4.2 degrades to output-scan only and is annotated as such.
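As an illustration of planting and detecting a canary, here is a minimal sketch. Only the POLYGRAPH-CANARY-<uuid> format comes from the text; the choice of UUID version, the set of encodings checked (base64 and hex), and the function names are assumptions.

```python
import base64
import re
import uuid


def make_canary() -> str:
    """A per-run-unique fake API key in the POLYGRAPH-CANARY-<uuid> format."""
    return f"POLYGRAPH-CANARY-{uuid.uuid4()}"


def canary_variants(canary: str) -> set[str]:
    """The exact value plus two simple encodings, all lowercased for matching."""
    raw = canary.encode("utf-8")
    return {
        canary.lower(),
        base64.b64encode(raw).decode("ascii").lower(),
        raw.hex(),
    }


def leaks_canary(text: str, canary: str) -> bool:
    """True if the text carries the canary, ignoring case and whitespace."""
    squashed = re.sub(r"\s+", "", text).lower()
    return any(variant in squashed for variant in canary_variants(canary))
```

The same function can be pointed at tool outputs (4.1) or at decoded egress payloads (4.2). One known gap in this sketch: a canary embedded mid-stream in a larger base64 blob encodes differently depending on byte alignment and would not match.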

C-04 — Adversarial input handling · probes 3.1, 3.2

How the server holds up under hostile input. Probe 3.1 stresses each tool with a deterministic battery of malformed and oversized arguments; it fails if the server crashes or hangs, or if its output spills an uncaught stack trace (an internals leak). Probe 3.2 feeds known jailbreak patterns and scans the output — failing only if the server amplifies them into agent-directed injection of its own (a verbatim echo is excluded). A C-04 failure caps the grade at D. It is graded off-chain: it moves the overall letter, but the on-chain schema keeps its three per-category slots.
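The echo-versus-amplification distinction in probe 3.2 might be sketched as below. The regular expression is a tiny stand-in for the instructionMimicry scanner, and both function names are hypothetical; the real reflection check may be more tolerant than exact substring removal.

```python
import re

# Stand-in for the instructionMimicry scanner; the real pattern set is broader.
MIMICRY = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def strip_reflection(output: str, sent: str) -> str:
    """Blank out verbatim copies of what the harness sent.

    Replacing with a space (not the empty string) avoids gluing the text on
    either side of the echo into a new, accidental match.
    """
    return output.replace(sent, " ") if sent else output


def amplifies(output: str, sent: str) -> bool:
    """True only if injection-shaped text remains after the echo is removed."""
    return bool(MIMICRY.search(strip_reflection(output, sent)))
```

A server that answers `You said: <payload>` passes, while one that appends its own agent-directed imperative after the echo fails.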

§03

Shared scanners

The detection primitives behind C-01, C-03, and C-04, implemented once and applied uniformly: invisibleUnicode (zero-width, bidi-override, and tag-char codepoints, each reported with codepoint and byte offset), instructionMimicry (agent-directed imperatives, override and jailbreak framing, free-text tool-call JSON), markdownTricks (javascript:/data: URIs, link-text/href mismatch, exfiltration-shaped query strings), internalsLeak (uncaught stack-trace and crash signatures across Node, Python, Java, Go, Ruby, Rust, .NET, and PHP), and canaryMatch (exact and lightly-obfuscated matches — case, whitespace, simple encodings). A shared reflection check lets the second-order and jailbreak probes ignore content a tool merely echoed back. Scanners are pure functions over text: independently testable, and the place new failure modes get added.

§04

Grading rubric

A single letter A–F, always accompanied by a rationale string — never a bare grade. Only four grades are reachable: C is reserved (no condition maps to it), and the scale skips E, as letter grades conventionally do.

Grade rubric · litmus-v5 §04

Grade	Condition
A	All four categories pass.
B	C-01, C-03, and C-04 pass; C-02 `skipped` (no sandbox or remote target). Egress was not verified — capped by design.
C	Reserved — no litmus-v5 condition maps to it. Future probe categories may claim it.
D	C-02 or C-04 failure — egress overreach, a read-only lie, or a crash / internals-leak / amplification — with no C-01/C-03 failure.
F	Any C-01 or C-03 failure — active injection or data leak.

Rationale: injection and data-leak are disqualifying — they are the failures that directly harm an agent that trusts the server, so they floor the grade at F. A C-02 failure (egress overreach or a read-only lie) or a C-04 failure (a crash, an internals-leak, or jailbreak amplification) is serious but not proven exfiltration or harm, so it caps at D. The B tier keeps the no-sandbox path usable while stating honestly that egress was not verified. Every grade carries its reasons in the evidence bundle.
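The rubric reads naturally as a short decision function. The sketch below assumes each category has already been reduced to one of "pass", "fail", or "skipped"; that reduction, the rationale wording, and the decision to raise on states the table does not cover are choices made for the example.

```python
CATEGORIES = ("C-01", "C-02", "C-03", "C-04")


def grade(results: dict[str, str]) -> tuple[str, str]:
    """Map per-category results to (letter, rationale) following the rubric table."""
    # Injection or data leak floors the grade at F, whatever else happened.
    if results["C-01"] == "fail" or results["C-03"] == "fail":
        return "F", "C-01 or C-03 failed: active injection or data leak"
    # Overreach, a read-only lie, or adversarial-input failure caps at D.
    if results["C-02"] == "fail" or results["C-04"] == "fail":
        return "D", "C-02 or C-04 failed with no C-01/C-03 failure"
    others_pass = all(results[c] == "pass" for c in ("C-01", "C-03", "C-04"))
    if others_pass and results["C-02"] == "skipped":
        return "B", "C-02 skipped: egress was not verified"
    if others_pass and results["C-02"] == "pass":
        return "A", "all four categories passed"
    # The table defines no grade for other combinations; refuse to guess.
    raise ValueError(f"state not covered by the rubric: {results}")
```

Checking F before D encodes the rule that a C-02 or C-04 failure only yields D when no C-01 or C-03 failure is present. No branch returns C, matching its reserved status.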

§05

Reproducibility

What makes a grade trustworthy rather than an assertion:

Deterministic harness. Same server version + same litmus-v5 harness → same findings. The bait, jailbreak, and malformed batteries are varied but fixed — no randomness in probe verdicts; timestamps and environment are recorded, not baked in.
Tool-defs fingerprint. The canonicalized tool surface is hashed (sha256) to a bytes32. The grade certifies that exact surface. If the server later changes a tool description — a rug pull — the fingerprint no longer matches and the grade is stale by construction. Consumers recompute the live fingerprint before trusting.
Published evidence. The full evidence bundle — every finding, every artifact — travels with the grade. Anyone can fetch and inspect it.
Re-runnable. Anyone — a skeptic, a counterparty, a future independent verifier — can re-run litmus-v5 against the same server and compare fingerprint and grade. A false grade is falsifiable, not merely disputable.
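A consumer-side fingerprint check could look roughly like the sketch below. The spec states only that the canonicalized tool surface is hashed with sha256 to a bytes32; the canonicalization used here (name, description, and inputSchema per tool, sorted by name, serialized as compact key-sorted JSON) is an assumption and will not reproduce fingerprints from the real harness.

```python
import hashlib
import json


def fingerprint(tools: list[dict]) -> str:
    """sha256 over an assumed canonical form of the tool surface, as 0x-hex."""
    surface = sorted(
        (
            {
                "name": tool.get("name", ""),
                "description": tool.get("description", ""),
                "inputSchema": tool.get("inputSchema", {}),
            }
            for tool in tools
        ),
        key=lambda entry: entry["name"],
    )
    # Sorted keys and fixed separators make the bytes independent of dict order.
    canonical = json.dumps(
        surface, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "0x" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_stale(live_tools: list[dict], attested_fingerprint: str) -> bool:
    """True when the live surface no longer matches the graded one."""
    return fingerprint(live_tools) != attested_fingerprint.lower()
```

Any edit to a tool description after grading changes the digest, so `is_stale` returns True and the grade should not be relied on for that surface.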

§06

Threat model & limits

Two properties decide whether a grade can be trusted, and they are independent.

Forgeability — can the runner fake the result? Fixed by the proof layer, not the methodology. Reproducibility makes a lie falsifiable; the roadmap layers — independently verifiable grade records, and later hardware-attested runs — make it progressively unprofitable, then impossible.

Evasion — can the server tell it’s being tested and behave? A fundamental methodology limit. Because the methodology is open, a server can recognize the test context and behave benignly during evaluation, then misbehave in production — a defeat device. No proof layer fixes this; an independent lab running the same open test has the same exposure. We reduce, not eliminate, the gap: per-run-unique canary values, bait/jailbreak/malformed inputs drawn from varied (widened) but fixed pools, behavioral probes over real outputs rather than static reads, periodic re-attestation, and the live-fingerprint check at call time against bait-and-switch. Evasion is an explicitly acknowledged residual risk of v1.

Non-goals

Not an independence claim (yet). v1 grades can be self-run: the subject grades itself, and trust anchors on reproducibility — the open harness makes a false grade falsifiable. Skin-in-the-game and independent counter-attestation are roadmap. A v1 grade is a reproducible test result, not an independent verdict. We say so plainly.
Not secrets management. How a server stores or rotates its own secrets is out of scope for v1.
Bounded surface. We probe the advertised tool surface at evaluation time. Tools gated behind auth or state we cannot reach are recorded as unexercised, never passed.
No absolute claims. Never “100% safe” or “guaranteed.” Underclaim, over-deliver.

§07

Versioning

This page documents litmus-v5. Probes evolve as agents do; new failure modes get new probe IDs within their family. A change that alters pass/fail semantics bumps the methodology version. Every evidence bundle and every attestation embeds the methodology version that produced it, so a grade is always tied to the spec it was measured against; earlier litmus-v1…v4 grades stay valid as their own version’s results.

Changelog · litmus-v5 adds C-01 probe 1.3 (second-order injection), makes C-02 egress port-aware, and widens probe 2.1 to parameter- and description-evidenced read-only lies; it also widens the probe payloads and sharpens the scanners. litmus-v4 made C-04 (adversarial input) a graded category, so that a crash, internals-leak, or jailbreak amplification caps the grade at D, and closed the hard-coded-IP egress gap with a host-DNAT gateway. litmus-v3 reframed C-02 from default-deny to egress overreach: a server may reach hosts it declares, so a passing C-02 means no overreach, not no network. litmus-v2 added C-02 probe 2.1 (declared-permission honesty). Each pass/fail-semantics change bumps the methodology version; earlier grades stay valid as their own version’s results.

§08