The NextPDF testing pyramid

Spec: ISO/IEC/IEEE 29119-4 | Spec: ISO/IEC 25010 | Evidence: Test-backed | PHPStan: Level 10

At a glance

NextPDF does not have one kind of test. It has five tiers, and each one answers a different question about the engine. The reason is that a PDF can pass a unit test and still be a structurally broken file on disk. This page names the five tiers and what each one is responsible for proving.

Why this matters

A PDF engine has an unusually wide failure surface. The same code path can be correct as a function and correct as a stream of bytes, and still produce a file that a conformant reader rejects. It can also produce a file that renders subtly wrong only at a page break. Test the engine at a single granularity and you gain confidence in exactly that granularity and nothing else.

The standards literature is clear about this. Coverage criteria do not all relate to one another, so a test strategy is advised to use more than one criterion, with at least one functional and at least one structural technique (ISO/IEC/IEEE 29119-4, Annex A). A single tier is not a smaller version of a good strategy. It is a different strategy, and an incomplete one.

The short version

NextPDF organizes its tests into five tiers, base to apex:

Unit — one class or function, in isolation. The broad base.
Integration — collaborating units across a module boundary.
Structural — the emitted PDF object graph, cross-reference table, and trailer are well-formed and conformant.
Visual — the rendered page matches an approved reference within a stated tolerance.
Golden — pinned end-to-end fixtures that catch unintended drift in the final output. The apex.

Each tier proves something the tier below it cannot. None of them is decorative. The pyramid shape is about quantity — many cheap unit tests, fewer expensive end-to-end ones — not about importance.
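To make the gap between a passing unit test and a broken file concrete, here is a minimal sketch of the kind of check a structural tier might run. The function name `findStructuralProblems` is hypothetical and is not taken from NextPDF. It assumes PHP 8 and a classic cross-reference table, so a file that uses a cross-reference stream would need a different final check.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of NextPDF. Inspects raw PDF bytes and
 * returns the structural problems it found (an empty list means these
 * few checks passed, not that the file is fully conformant).
 *
 * @return list<string>
 */
function findStructuralProblems(string $pdf): array
{
    $problems = [];

    // Every PDF file starts with a "%PDF-" version header.
    if (!str_starts_with($pdf, '%PDF-')) {
        $problems[] = 'missing %PDF- header';
    }

    // The file ends with "startxref", the byte offset of the
    // cross-reference section, and the "%%EOF" marker.
    if (preg_match('/startxref\s+(\d+)\s+%%EOF\s*\z/', $pdf, $m) !== 1) {
        $problems[] = 'missing startxref offset or %%EOF marker';

        return $problems;
    }

    // Assumption: a classic table, which begins with the keyword "xref".
    if (substr($pdf, (int) $m[1], 4) !== 'xref') {
        $problems[] = 'startxref does not point at an xref table';
    }

    return $problems;
}
```

A serializer that writes an offset one byte off can still return a perfectly valid string from every method a unit test calls. Only a check that reads the emitted bytes, like the one sketched here, would notice.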

How NextPDF approaches it

The tiers are physical, not aspirational. The repository’s PHPUnit configuration declares each as a named test suite mapped one-to-one to a directory. A tier is therefore a place you can point a runner at, not a label on a slide. The suites a senior engineer will recognize include Unit, Integration, Golden, Snapshot, Reproducibility, Conformance, Standards, and Performance, each with its own execution profile (isolation, time bucket, and whether it runs by default in continuous integration).

That separation is deliberate. The fast base tier (Unit) runs on every change with a one-second-per-test budget. The slower, environment-sensitive tiers (visual rendering, full conformance, performance) are opt-in or nightly. This keeps the common path fast and deterministic without giving up the deeper checks. Strict typing underpins the whole stack. The engine runs static analysis at PHPStan Level 10 with the error budget locked at zero, so a large class of defects never reaches a test at all.

Tier 1 of 5: Unit. Isolated behaviour of a single class or function; the broad base.
Tier 2 of 5: Integration. Collaborating units across a module boundary.
Tier 3 of 5: Structural. The emitted PDF object/xref structure is well-formed and conformant.
Tier 4 of 5: Visual. Rendered output matches an approved reference within tolerance.
Tier 5 of 5: Golden. End-to-end byte/lossless fixtures pinned as the contract; the apex.

NextPDF's five test tiers, base to apex. Unit is the broad, fast base; each tier above proves a property the tier below cannot, up to golden end-to-end fixtures at the apex. Width is a quantity hint only — it does not rank importance.
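The phrase "within tolerance" for the visual tier can be illustrated with a small comparison routine. This is a sketch under stated assumptions, not NextPDF's visual-diff implementation: the function name, the flat grayscale pixel lists, and both tolerance parameters are hypothetical choices made for the example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical comparison, not NextPDF's actual implementation.
 * Both images are flat lists of 8-bit grayscale values of equal length.
 *
 * @param list<int> $rendered
 * @param list<int> $reference
 */
function matchesWithinTolerance(
    array $rendered,
    array $reference,
    int $maxChannelDelta,      // per-pixel difference that still counts as "same"
    float $maxDifferingRatio,  // share of pixels allowed to exceed that delta
): bool {
    // Different dimensions (or nothing to compare) never match.
    if (count($rendered) !== count($reference) || $reference === []) {
        return false;
    }

    $differing = 0;
    foreach ($reference as $i => $expected) {
        if (abs($rendered[$i] - $expected) > $maxChannelDelta) {
            $differing++;
        }
    }

    return $differing / count($reference) <= $maxDifferingRatio;
}
```

Stating both thresholds explicitly is the point: a tolerance that is declared can be reviewed and tightened, while an implicit one hides how much drift the test actually accepts.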

What the evidence says

Evidence: Test-backed. The five suites exist as declared PHPUnit test suites in the engine’s configuration, each bound to its own directory and execution profile. The tier vocabulary on this page is the same vocabulary the test infrastructure uses.

Evidence: Standard-backed. The reason for more than one tier is anchored in ISO/IEC/IEEE 29119-4, Annex A: coverage criteria do not all relate to one another, and a strategy is advised to combine functional and structural techniques. Crucially, the standard also notes that the subsumes ordering between coverage criteria gives no indication of their ability to expose faults, that is, their test effectiveness (ISO/IEC/IEEE 29119-4, §C.2.4). “More coverage” is not the same claim as “better tests”.

Evidence: Standard-backed. The choice of which properties to prove maps to ISO/IEC 25010 product-quality characteristics: functional correctness (unit, integration), and the file-level properties that make a PDF actually usable downstream (structural, visual, golden). The quality model is explicit that different characteristics matter in different contexts of use.

Practical example

The tiers are addressable from the engine’s own scripts. A change to a single formatter is verified at the base. A change to the document facade is verified across tiers:

<?php

declare(strict_types=1);

// Tier 1 — Unit: one unit, isolated, fast.
// composer test:unit         → phpunit --testsuite Unit

// Tier 2 — Integration: collaborating units across a boundary.
// composer test:integration  → phpunit --testsuite Integration

// Tier 3 — Structural: the emitted PDF object graph is well-formed.
// vendor/bin/phpunit --testsuite Conformance

// Tier 4/5 — Visual + Golden: rendered/serialized output vs a pinned
// reference (golden is byte/structure-pinned, never auto-updated).
// vendor/bin/phpunit --testsuite Golden

// A change to the document facade touches every API, so the routing
// guidance escalates it from "unit only" to the full unit + integration
// surface — the tier you run is a function of blast radius, not habit.

The point of the example is the routing logic, not the commands. The tier you exercise is chosen by what the change can break. The infrastructure makes each tier a first-class, separately runnable target.
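The routing idea can be sketched in a few lines. The `BlastRadius` enum and both function names are hypothetical, introduced only for illustration, and the engine's real routing guidance may use different categories. The suite names and the `--testsuite` flag are the ones shown in the example above. The sketch assumes PHP 8.1 for enums.

```php
<?php

declare(strict_types=1);

// Hypothetical classification of a change; not taken from NextPDF.
enum BlastRadius
{
    case SingleUnit;   // for example, one formatter
    case PublicFacade; // for example, the document facade
}

/** @return list<string> PHPUnit suite names to run, base tier first */
function suitesFor(BlastRadius $radius): array
{
    return match ($radius) {
        BlastRadius::SingleUnit => ['Unit'],
        BlastRadius::PublicFacade => ['Unit', 'Integration'],
    };
}

/** @return list<string> one runner invocation per suite */
function commandsFor(BlastRadius $radius): array
{
    return array_map(
        static fn (string $suite): string => 'vendor/bin/phpunit --testsuite ' . $suite,
        suitesFor($radius),
    );
}
```

Using `match` without a default arm is a deliberate choice in this sketch: adding a new kind of change to the enum forces a decision about which tiers it must run, instead of silently falling back to unit tests only.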

Common misconception

The pyramid is often read as a ranking — unit tests at the bottom because they matter least, end-to-end at the top because they matter most (or the reverse). It is neither. The vertical axis is roughly cost and quantity: many fast, cheap unit tests forming a wide base; progressively fewer, slower, higher-fidelity tests above. A golden test is not “better” than a unit test. It catches a different failure, later, more expensively, and would be a poor substitute for the thousands of fast checks beneath it.

The second misconception is that a high coverage number means the pyramid is sound. It does not. Coverage measures execution, not detection. The standards explicitly decline to equate coverage ordering with fault-finding ability. That gap is exactly what mutation testing exists to expose.
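The difference between execution and detection can be shown with a self-contained sketch. Everything here is hypothetical and hand-written for illustration: `clampPage` stands in for any small unit, and `clampPageMutant` plays the role of the altered copy that a mutation-testing tool would generate automatically.

```php
<?php

declare(strict_types=1);

// Hypothetical unit under test: keeps a page number inside [1, $pageCount].
function clampPage(int $page, int $pageCount): int
{
    return max(1, min($page, $pageCount));
}

// A hand-written "mutant": the lower bound was changed from 1 to 0.
function clampPageMutant(int $page, int $pageCount): int
{
    return max(0, min($page, $pageCount));
}

/** Weak check: executes every line of the unit but only inspects the type. */
function weakCheck(callable $clamp): bool
{
    return is_int($clamp(0, 10)) && is_int($clamp(99, 10));
}

/** Strong check: the same calls, but it asserts the boundary values. */
function strongCheck(callable $clamp): bool
{
    return $clamp(0, 10) === 1 && $clamp(99, 10) === 10;
}
```

Both checks execute the same lines, so a coverage report scores them identically. Only the strong check fails when given the mutant, which is the distinction a coverage number cannot show.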

Limits and boundaries

This page describes the shape and intent of the strategy, not its current results. Test counts, coverage percentages, and mutation scores are deliberately absent here. They are living quality signals, generated from continuous-integration artifacts. The current figures are published with the build. Frozen into prose, they would quietly become outdated. The one number stated — PHPStan Level 10 — is a stable configuration fact, verifiable in the engine’s static-analysis configuration, not a measurement.

The tier names are stable architecture vocabulary. The precise set of suites and their execution profiles evolves with the engine and is owned by the test configuration, which is the authority if it ever disagrees with this explanation. This page asserts no specific pass rate and makes no comparison to any other library’s test strategy.

Related pages

Golden-file testing: how the apex tier pins reference output and stays honest.
Mutation testing, explained: why a coverage number does not prove the tests in these tiers actually work.
Strict types, everywhere: how PHPStan Level 10 removes a class of defects before any tier runs.

Glossary

Test tier — a level of the strategy that proves one kind of property (for example, unit behavior or structural validity). NextPDF uses five.
Structural test — a check that the emitted PDF’s object graph, cross-reference table, and trailer are well-formed and conformant, as opposed to merely checking a return value.
Visual test — a check that a rendered page matches an approved reference image within a declared tolerance.
Golden test — an end-to-end check against a pinned reference output that is never auto-updated; the contract for “the output did not change”.
Test effectiveness: the ability of a test set to expose faults, which ISO/IEC/IEEE 29119-4 distinguishes from coverage.
Acronym note: MSI (Mutation Score Indicator) is defined on the mutation testing page.
PHPStan Level 10 — the strictest static-analysis level; NextPDF runs it with the error budget locked at zero.