Mutation testing, explained

Spec: ISO/IEC/IEEE 29119-4 · Spec: PHPUnit · Evidence: Test-backed

At a glance

Line coverage tells you a line ran during the test suite. It does not tell you any test would have failed if that line were wrong. Mutation testing closes that gap by deliberately breaking the code and checking whether the tests notice. This page explains what a mutation score means and how NextPDF uses it as a diagnostic, not a trophy.

Why this matters

Coverage is one of testing’s most trusted metrics, and one of its most misleading. A test that calls a method and asserts nothing executes every line in it: perfect coverage, zero detection. The standards literature is explicit that the ordering among coverage criteria gives no indication of their ability to expose faults. That ability is the property it calls test effectiveness (ISO/IEC/IEEE 29119-4, §C.2.4). A coverage percentage and a fault-finding guarantee are different claims.

For a PDF engine this is not academic. A signature byte-range check, a cross-reference offset, an encoding branch — tests can fully “cover” all of these without ever asserting the value that matters. A green suite over weak tests is worse than an honest gap, because it actively discourages anyone from looking.

The short version

Mutation testing makes thousands of small, deliberate edits (mutants) to the source, such as flipping a < to <=, swapping a + for a -, or altering a return value, and reruns the tests against each.
If a test fails on a mutant, the mutant is killed: some test actually asserted that behavior. If every test still passes, the mutant escaped: the behavior was executed but never checked.
The Mutation Score Indicator (MSI) is, roughly, killed mutants over total non-equivalent mutants. It measures whether your tests detect changes, not whether they run the code.
Some mutants are equivalent — they cannot change observable behavior, so no test can kill them. Counting these as failures is dishonest. NextPDF proves and ledgers them instead of dismissing them informally.
NextPDF uses MSI to find and strengthen weak tests. It is a diagnostic gate in continuous integration, not a marketing figure.
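
To make the arithmetic concrete, here is a minimal sketch of the rough definition above: killed mutants over non-equivalent mutants, expressed as a percentage. The function name and the sample counts are hypothetical, chosen only for illustration. The figure a mutation tool reports may be computed differently, for instance in how it treats timeouts or errors.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: the page's rough MSI definition as a percentage.
 * Proven-equivalent mutants are removed from the denominator.
 */
function mutationScore(int $killed, int $total, int $equivalent): float
{
    $scorable = $total - $equivalent;
    if ($killed < 0 || $equivalent < 0 || $scorable < $killed) {
        throw new InvalidArgumentException('inconsistent mutant counts');
    }
    if ($scorable === 0) {
        // Assumption of this sketch: with nothing left to detect, report 100.
        return 100.0;
    }

    return 100.0 * $killed / $scorable;
}

// Invented counts: 200 mutants, 153 killed, 20 proven equivalent.
$honest = mutationScore(153, 200, 20);   // equivalents excluded from the denominator
$deflated = mutationScore(153, 200, 0);  // equivalents wrongly counted as misses
```

With these invented counts, excluding the 20 proven equivalents gives 85.0, while counting them as misses gives 76.5. The gap between the two is the "unkillable noise" that a ledger keeps out of the score.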

How NextPDF approaches it

Mutation runs against the engine use Infection, a PHP mutation-testing framework. It is configured over the production source tree, with the arithmetic, boolean, conditional-boundary, equality, return-value, and removal mutator families enabled. These are the operators that expose “executed but unasserted” logic. The flow is mechanical:

Start green: the suite must pass before mutation begins.
Mutate: apply one small, deliberate change to the source.
Re-run: run the tests that cover the mutated line.
Killed: a test failed, so the behavior is genuinely asserted.
Escaped: all tests still pass, which marks a weak spot to strengthen.
Equivalent: no test can kill it because behavior is unchanged; it is proven and ledgered, not scored as a miss.

The mutation-testing loop NextPDF runs: take green tests, generate a mutant, rerun the covering tests, and classify the mutant as killed (a test caught it), escaped (a coverage-but-no-assertion gap to fix), or proven-equivalent (no test can kill it; ledgered, not counted against the score).
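
The loop can be imitated in a few lines of plain PHP. This is a toy sketch, not how Infection works internally: the code under test, the mutants, and the tests are all closures, and the `classifyMutant` function and `$ledger` array are hypothetical names invented for the illustration.

```php
<?php

declare(strict_types=1);

/**
 * Rerun the tests against one mutant and classify the outcome.
 *
 * @param callable(int): bool            $mutant mutated code under test
 * @param list<callable(callable): bool> $tests  each returns true when it passes
 */
function classifyMutant(callable $mutant, array $tests, bool $ledgeredEquivalent): string
{
    foreach ($tests as $test) {
        if (!$test($mutant)) {
            return 'killed'; // some test noticed the change
        }
    }

    // Every test passed: either a proven equivalent or a real gap.
    return $ledgeredEquivalent ? 'equivalent' : 'escaped';
}

// Code under test: is the offset acceptable (non-negative)?
$original = static fn (int $offset): bool => $offset >= 0;

// The suite checks a positive and a negative offset, but never 0.
$tests = [
    static fn (callable $isValid): bool => $isValid(5) === true,
    static fn (callable $isValid): bool => $isValid(-3) === false,
];

// Start green: the suite must pass on the unmodified code.
foreach ($tests as $test) {
    if (!$test($original)) {
        throw new RuntimeException('suite must pass before mutation begins');
    }
}

$mutants = [
    'negate'   => static fn (int $offset): bool => !($offset >= 0),
    'boundary' => static fn (int $offset): bool => $offset > 0,
    'rewrite'  => static fn (int $offset): bool => !($offset < 0),
];

// Hypothetical ledger: 'rewrite' is the same predicate for every int.
$ledger = ['rewrite' => true];

$results = [];
foreach ($mutants as $name => $mutant) {
    $results[$name] = classifyMutant($mutant, $tests, $ledger[$name] ?? false);
}
```

Here `negate` is killed, `boundary` escapes because no test exercises 0, and `rewrite` survives every test but is ledgered as equivalent, so it is not counted as a gap.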

Two design choices make the number trustworthy. First, the score is wired as a gate. Continuous integration enforces a minimum MSI (and a minimum covered-MSI) and runs a diff-scoped variant on changed lines, so a change that adds code but no real assertions is caught at review, not discovered later. Second, NextPDF does not silently discount inconvenient mutants. Mutants that are genuinely semantically equivalent (for example !== versus != when the type system guarantees both operands are int) are recorded in a mutation ledger with an explicit equivalence proof test, so the escaped count reflects real gaps, not bookkeeping. PHPStan Level 10 plus strict_types plus typed properties is what makes those equivalence proofs sound.
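
As an illustration of what an equivalence proof test might check, the sketch below compares the strict and loose operators on int operands. The function names are hypothetical and this is not NextPDF's ledger format. The real argument comes from the type system; a test like this only documents and guards it. The last line shows why each entry needs its own proof: for two numeric strings the operators can disagree, so sharing a type is not always sufficient on its own.

```php
<?php

declare(strict_types=1);

// Original and mutant differ only in the comparison operator.
function differsStrict(int $a, int $b): bool
{
    return $a !== $b;
}

function differsLoose(int $a, int $b): bool
{
    return $a != $b;
}

/**
 * Hypothetical proof helper: true when both variants agree on every pair.
 *
 * @param list<int> $samples
 */
function agreeOnAllPairs(array $samples): bool
{
    foreach ($samples as $a) {
        foreach ($samples as $b) {
            if (differsStrict($a, $b) !== differsLoose($a, $b)) {
                return false;
            }
        }
    }

    return true;
}

// Typed int parameters: the two operators cannot be told apart.
$intsAgree = agreeOnAllPairs([PHP_INT_MIN, -1, 0, 1, 2, PHP_INT_MAX]);

// Counter-example with strings: numeric strings are compared as numbers
// under !=, so '10' and '1e1' are loosely equal but strictly different.
$stringsDisagree = ('10' !== '1e1') && !('10' != '1e1');
```

Both `$intsAgree` and `$stringsDisagree` are true, which is the point: the same operator swap is equivalent in one typed context and a real behavior change in another.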

What the evidence says

Evidence: Test-backed. Mutation testing is configured in the engine over the production source directories with the behavior-revealing mutator families enabled. It is enforced as a continuous-integration gate with a minimum MSI and a diff-scoped variant. It is a build check, not an afterthought.

Evidence: Test-backed. The equivalent-mutant problem is handled honestly. Semantically equivalent mutants are classified and backed by dedicated equivalence-proof tests in a mutation ledger, with the soundness of each proof resting on PHPStan Level 10 plus strict typing. The escaped count therefore represents real undetected behavior, not unkillable noise inflated into a worse-looking score.

Evidence: Standard-backed. Mutation is a recognized technique, not a NextPDF invention. ISO/IEC/IEEE 29119-4, §B.2.4 describes applying generic mutations to elements of a specification to derive specific mutations for testing. The technique is needed at all because the same standard states that the subsumes ordering of coverage criteria does not order them by fault-exposing ability (ISO/IEC/IEEE 29119-4, §C.2.4).

Evidence: Standard-backed. Coverage itself is well-defined and limited. PHPUnit distinguishes line, branch, and path coverage. Line coverage only records that an executable line ran. Knowing the definition is what makes its insufficiency obvious.

Practical example

The point is not the command — it is what an escaped mutant tells you:

<?php

declare(strict_types=1);

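// Exception thrown by the guard; declared here so the example is self-contained.
final class InvalidByteRange extends InvalidArgumentException
{
}

final class ByteRange
{
    // Production guard: reject negative offsets; 0 is the smallest valid value.
    public function assertNonNegative(int $offset): void
    {
        if ($offset < 0) {
            throw new InvalidByteRange('offset must be >= 0');
        }
    }
}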

// A test that EXECUTES this line but does not assert the boundary:
//     $byteRange->assertNonNegative(5);   // no exception expected, none asserted
// gives 100% line coverage of assertNonNegative().
//
// Mutation flips `< 0` to `<= 0`. Behavior now differs ONLY at $offset === 0.
// If no test passes 0 and asserts what happens, every test still passes:
// the mutant ESCAPED. Coverage said "tested"; mutation said "the boundary
// is unasserted". The fix is a test that pins offset === 0, not a higher
// target.
//
//   composer mutation:diff   → mutate only changed lines, enforce min MSI
//   composer mutation:full   → full-tree mutation gate

That escaped mutant is the entire value proposition. It located a real, specific, missing assertion that a coverage report rated as fully tested.
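
The fix described in the example, a test that pins offset 0, can be sketched without a test framework. The guard and its mutant are written as closures so both can be exercised side by side; `accepts`, `$weakTest`, and `$boundaryTest` are hypothetical names for this illustration. In a real suite these would be ordinary PHPUnit test methods.

```php
<?php

declare(strict_types=1);

final class InvalidByteRange extends InvalidArgumentException
{
}

// The original guard.
$original = static function (int $offset): void {
    if ($offset < 0) {
        throw new InvalidByteRange('offset must be >= 0');
    }
};

// The conditional-boundary mutant: `<` became `<=`.
$mutant = static function (int $offset): void {
    if ($offset <= 0) {
        throw new InvalidByteRange('offset must be >= 0');
    }
};

/** Hypothetical helper: does the guard accept this offset without throwing? */
function accepts(callable $guard, int $offset): bool
{
    try {
        $guard($offset);

        return true;
    } catch (InvalidByteRange $e) {
        return false;
    }
}

// Weak test: executes the guard but only with a comfortable value.
$weakTest = static fn (callable $guard): bool => accepts($guard, 5);

// Boundary test: pins both sides of the edge, 0 accepted and -1 rejected.
$boundaryTest = static fn (callable $guard): bool
    => accepts($guard, 0) && !accepts($guard, -1);

$weakPassesOnMutant = $weakTest($mutant);          // mutant escapes the weak test
$boundaryPassesOnMutant = $boundaryTest($mutant);  // mutant is killed here
```

The weak test passes on both the original and the mutant, so it cannot tell them apart. The boundary test passes on the original and fails on the mutant, which is what killing the mutant means.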

Common misconception

The headline misconception is that mutation score is a grade to maximize. A very high MSI achieved by writing tests whose only purpose is to kill mutants is as hollow as high coverage achieved by calling methods without asserting. At that point the metric has been gamed and no longer measures detection. NextPDF uses MSI to find weak tests. The deliverable is a better assertion; bragging is explicitly not the purpose.

The second misconception is that every surviving mutant is a defect in the tests. Some mutants are genuinely equivalent and cannot be killed by any test, because they do not change observable behavior. Treating those as failures produces a dishonest, artificially low score and trains people to ignore the report. NextPDF’s answer is to prove equivalence explicitly and ledger it, not to quietly suppress it or pretend the number is worse than it is.

Limits and boundaries

Mutation testing measures whether tests detect injected changes. It does not prove the code is correct. It does not measure performance or conformance. It cannot kill a truly equivalent mutant. The current mutation score, the minimum-MSI threshold in force, the number of ledgered equivalents, and any coverage figure are living quality signals generated from continuous-integration artifacts and published with the build. They are deliberately absent here, because a number pasted into prose goes stale and becomes a small lie. The one stable fact this page states is PHPStan Level 10, and that is a configuration property that underpins the equivalence proofs, not a measurement.

The mutator selection, thresholds, and ledger policy are owned by the engine’s mutation configuration and may evolve. That configuration is the authority if it ever disagrees with this page. No claim is made here about any other library’s test effectiveness.

Related pages

The NextPDF testing pyramid — the five tiers whose tests mutation testing audits for real fault detection.
Strict types, everywhere — how PHPStan Level 10 and strict typing make the equivalent-mutant proofs sound.
Golden-file testing — another tier whose detecting power mutation testing helps validate.

Glossary

Mutant — a single, deliberate small change to the source, used to test whether the test suite would notice that change.
Killed mutant — a mutant that made at least one test fail; the behavior was genuinely asserted.
Escaped mutant — a mutant that left every test passing. The behavior was executed but never asserted — a weak spot to fix.
Equivalent mutant — a mutant that cannot change observable behavior, so no test can kill it. NextPDF proves and ledgers these.
MSI (Mutation Score Indicator) — roughly, killed mutants divided by total non-equivalent mutants; a measure of detection, not execution.
Line coverage — a metric that records only that an executable line ran during the suite; defined by PHPUnit, and insufficient on its own.