Streams and filters

Evidence: Standard-backed

At a glance

Most of a real PDF’s bytes are inside streams: page content, fonts, images, and the cross-reference stream itself. Almost none of those bytes are stored raw; they pass through one or more filters first. This page covers which filters you meet, what each is for, where they cause trouble, and why NextPDF pins its compression so the same input always produces the same bytes.

Why this matters

A stream and its filter are a contract: “these bytes were deflate-compressed, then base-85 encoded; decode in the reverse order (ASCII85 first, then Flate) to get the real data.” If the /Filter entry disagrees with what the bytes actually are, or the /Length is wrong, or two filters are listed in the wrong order, decoding fails and the object the stream carried is lost. A conforming reader is not required to guess; it does what the dictionary tells it.

There is a second, quieter cost. If a library’s compressor is nondeterministic — different zlib build, different level, different internal block boundaries — then two runs that should produce an identical PDF produce two different files. That breaks byte-level reproducibility. Broken reproducibility then breaks golden-file tests, signed-build verification, and any pipeline that diffs output. Filters determine both whether the PDF is correct and whether the PDF is the same.
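A minimal sketch of that second cost, using PHP's built-in zlib functions rather than any NextPDF class. The same input is compressed at two levels; both results decode to the same content, but the stored bytes are not the same. The variable names and sample input are illustrative only.

```php
<?php

declare(strict_types=1);

// Illustrative input: repetitive content-stream operators.
$content = str_repeat("BT /F1 12 Tf 72 712 Td (Hello) Tj ET\n", 50);

// Same input, two explicit compression levels.
$fast = gzcompress($content, 1);
$best = gzcompress($content, 9);

// Both decode back to the original content...
$sameContent = gzuncompress($fast) === $content
    && gzuncompress($best) === $content;

// ...but the stored bytes differ, so a hash or byte diff of the two
// files would not match.
$sameBytes = hash('sha256', $fast) === hash('sha256', $best);
```

Here $sameContent is true while $sameBytes is false, which is the "correct, but not identical" situation a golden-file comparison trips over.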

The short version

A stream object is a dictionary plus a block of bytes, wrapped in stream … endstream, with a /Length and usually a /Filter.
The /Filter entry names the decode filter — or an array of filters applied as a pipeline, in order.
The filters split into two families: compression (FlateDecode, LZWDecode, RunLengthDecode, DCTDecode, JPXDecode, JBIG2Decode) and ASCII transport (ASCIIHexDecode, ASCII85Decode), plus the special Crypt filter for encryption.
The one you will see most is FlateDecode — zlib/deflate. It is the default for content, fonts, and the cross-reference stream.
NextPDF pins its Flate output to a fixed level and format so the same input bytes always compress to the same output bytes.

How NextPDF approaches it

NextPDF emits stream objects through one buffer helper and compresses through one pinned compressor — on purpose.

BinaryBuffer::writeStream() (src/Support/BinaryBuffer.php) wraps stream content in its dictionary, always writing a /Length equal to the actual byte length and merging in any extra entries the caller supplies, such as /Filter. There is no path where the declared length can disagree with the bytes written, because the length is taken from the content string itself.
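The idea of deriving /Length from the content itself can be sketched as follows. This is not the source of BinaryBuffer::writeStream(); the function name sketchStreamObject and its parameters are hypothetical, chosen only to show why the declared length cannot drift from the bytes written.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not BinaryBuffer::writeStream() itself.
 * $extra maps dictionary keys to already-serialised PDF values.
 *
 * @param array<string, string> $extra
 */
function sketchStreamObject(int $objectNumber, string $data, array $extra = []): string
{
    // /Length is computed from the bytes being written. With the array
    // union operator the left-hand key wins, so a caller-supplied
    // /Length in $extra is ignored.
    $entries = ['/Length' => (string) strlen($data)] + $extra;

    $dict = '';
    foreach ($entries as $key => $value) {
        $dict .= " {$key} {$value}";
    }

    return "{$objectNumber} 0 obj\n<<{$dict} >>\nstream\n"
        . $data
        . "\nendstream\nendobj\n";
}

$object = sketchStreamObject(
    4,
    gzcompress("BT (Hi) Tj ET\n", 9),
    ['/Filter' => '/FlateDecode']
);
```

Because the length is never accepted as an argument, there is no way for a caller to pass a stale or miscounted value.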

Compression goes through PinnedZlibCompressor (src/Writer/PinnedZlibCompressor.php). This class exists for one reason: gzcompress without an explicit level defers to the zlib runtime default, which has historically varied across builds. The 2-byte zlib header even encodes the level indirectly (in its FLEVEL bits), so “the default” is not a stable output. The compressor pins the level to the maximum zlib level (9; the level is an encoder setting, not something RFC 1951 defines) and always emits zlib-wrapped deflate (an RFC 1950 header and Adler-32 trailer around RFC 1951 data), which is exactly what /Filter /FlateDecode expects. A hard failure from zlib becomes a typed exception rather than a silent fallback to uncompressed output, so a stream is never quietly emitted raw.
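A pinned compressor of this kind might look roughly like the sketch below. It is an assumption about the general shape, not the PinnedZlibCompressor source: the function name, the exception class, and the choice of level 9 as "the maximum" are all illustrative.

```php
<?php

declare(strict_types=1);

// Hypothetical exception type for this sketch.
final class SketchCompressionException extends RuntimeException
{
}

/**
 * Assumed shape of a pinned compressor; not NextPDF's actual code.
 */
function sketchPinnedCompress(string $data): string
{
    // Explicit level, explicit wrapper. In PHP, ZLIB_ENCODING_DEFLATE
    // selects the zlib format (RFC 1950); raw deflate would be
    // ZLIB_ENCODING_RAW.
    $out = @gzcompress($data, 9, ZLIB_ENCODING_DEFLATE);

    if ($out === false) {
        // Fail loudly instead of falling back to raw, unfiltered bytes.
        throw new SketchCompressionException('zlib compression failed');
    }

    return $out;
}

$input      = "BT /F1 12 Tf 72 712 Td (Hello) Tj ET\n";
$compressed = sketchPinnedCompress($input);
```

The design point is that nothing is left to a default: the level and the container format are both spelled out at the single call site every stream passes through.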

The cross-reference stream itself is a worked example of all of this: CrossReferenceStream (src/Core/CrossReferenceStream.php) builds a binary table, compresses it, and emits it as a stream object with /Type /XRef, a /W field-width array, and /Filter /FlateDecode. The index that lets a reader find every object is, itself, a filtered stream.
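As a rough illustration of "builds a binary table, compresses it, and emits it", the sketch below packs three made-up entries with field widths 1, 4, and 2 bytes. It uses only PHP built-ins, not CrossReferenceStream; the entries, the widths, and the abbreviated dictionary are assumptions for the example.

```php
<?php

declare(strict_types=1);

// Illustrative entries: [type, field 2, field 3].
// Type 0 = free object; type 1 = in use, where field 2 is the byte
// offset and field 3 the generation number.
$entries = [
    [0, 0, 65535],
    [1, 15, 0],
    [1, 230, 0],
];

// /W [1 4 2]: one byte for the type, four for field 2, two for field 3.
$table = '';
foreach ($entries as [$type, $field2, $field3]) {
    // C = 1 byte, N = 4 bytes big-endian, n = 2 bytes big-endian.
    $table .= pack('CNn', $type, $field2, $field3);
}

$compressed = gzcompress($table, 9);

// /Root and the other trailer entries a real xref stream carries are
// omitted from this sketch.
$dict = sprintf(
    '<< /Type /XRef /Size %d /W [1 4 2] /Filter /FlateDecode /Length %d >>',
    count($entries),
    strlen($compressed)
);
```

Each row is exactly 7 bytes before compression, which is what lets a reader index into the decoded table without any delimiters.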

Filter	Family	What it is for	Where it goes wrong
FlateDecode	Compression	zlib/deflate; the default for content, fonts, xref streams	A non-deterministic zlib build makes “identical” PDFs differ byte-for-byte
LZWDecode	Compression	Older Lempel–Ziv–Welch compression	Legacy; superseded by Flate, occasionally still seen in old files
DCTDecode	Compression	JPEG-encoded colour/grayscale images	Lossy — re-encoding an already-DCT image degrades it again
JPXDecode	Compression	JPEG 2000 wavelet image data	Not permitted by some archival profiles; reader support is uneven
JBIG2Decode	Compression	Bilevel (1-bit) image compression	Must not be used with inline images; lossy modes can alter scans
RunLengthDecode	Compression	Byte-oriented run-length	Only helps data with long single-byte runs; can grow other data
ASCIIHexDecode	Transport	Binary as hex digits	Doubles size; only for 7-bit-safe channels, never for size
ASCII85Decode	Transport	Binary as base-85 ASCII	~25% overhead; a transport convenience, not compression
Crypt	Security	Applies the document’s security handler	A cross-reference stream must not use a Crypt filter

The PDF standard filter set, by family, with the failure each one is associated with. NextPDF writes FlateDecode for content, fonts, and the cross-reference stream; the ASCII transport filters are for 7-bit channels, never for reducing size.

What the evidence says

The filter mechanism is defined by ISO 32000-2, §7.4. A stream dictionary names its filters through /Filter. When the entry lists more than one filter, those filters form a decode pipeline and are applied in sequence. A writer encodes a stream to compress it or to make it 7-bit-safe. A reader invokes the corresponding decode filters to recover the original data.
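The pipeline rule can be made concrete with a two-filter stream. In this sketch the writer compresses and then hex-encodes, so the /Filter array lists ASCIIHexDecode first and FlateDecode second. The decoder map is a deliberately minimal stand-in written for this example, not a NextPDF API.

```php
<?php

declare(strict_types=1);

$original = "BT /F1 12 Tf 72 712 Td (Hello) Tj ET\n";

// Writer: compress first, then make the result 7-bit-safe.
// '>' is the ASCIIHexDecode end-of-data marker.
$stored = bin2hex(gzcompress($original, 9)) . '>';

// The dictionary lists decode filters, so the array is the reverse of
// the order the writer encoded in:
//   /Filter [/ASCIIHexDecode /FlateDecode]
$filters = ['ASCIIHexDecode', 'FlateDecode'];

// Minimal decoders for this sketch. A real ASCIIHexDecode also skips
// whitespace and pads an odd final digit with 0.
$decoders = [
    'ASCIIHexDecode' => static fn (string $d): string => hex2bin(rtrim($d, '>')),
    'FlateDecode'    => static fn (string $d): string => gzuncompress($d),
];

// Reader: apply each filter in array order.
$data = $stored;
foreach ($filters as $name) {
    $data = $decoders[$name]($data);
}
```

Swapping the two array entries would hand hex text to the Flate decoder, which is the "wrong order" failure: the bytes are intact, but the declared pipeline no longer describes them.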

The standard’s filter table classifies each filter. FlateDecode decompresses zlib/deflate-encoded data, reproducing the original text or binary data. DCTDecode reproduces image samples that approximate the original via JPEG — the word “approximate” is the standard telling you it is lossy. LZWDecode, RunLengthDecode, JBIG2Decode, JPXDecode, and the Crypt filter are each defined there too, with JBIG2 explicitly barred from inline images.

The cross-reference stream applies the format’s own machinery to itself: it is a stream object (/Type /XRef, ISO 32000-2, §7.5.8) whose /W array states the byte width of each entry field in the decoded stream. The standard requires that it is not encrypted and does not use a Crypt filter. NextPDF’s CrossReferenceStream follows this exactly: FlateDecode, explicit /W, no encryption.
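On the reading side, /W is what turns the decoded bytes back into rows. The sketch below splits a decoded table into fixed-width, big-endian integer fields; the helper name sketchReadXrefRows is hypothetical, and the handling of zero-width fields (which take a default value) is left out for brevity.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: split decoded xref-stream bytes into rows using /W.
 * Assumes non-empty input and non-zero field widths.
 *
 * @param list<int> $w field widths in bytes, as given by /W
 * @return list<list<int>>
 */
function sketchReadXrefRows(string $decoded, array $w): array
{
    $rows    = [];
    $rowSize = array_sum($w);

    foreach (str_split($decoded, $rowSize) as $row) {
        $fields = [];
        $pos    = 0;
        foreach ($w as $width) {
            // Each field is an unsigned big-endian integer of $width bytes.
            $value = 0;
            for ($i = 0; $i < $width; $i++) {
                $value = ($value << 8) | ord($row[$pos + $i]);
            }
            $fields[] = $value;
            $pos += $width;
        }
        $rows[] = $fields;
    }

    return $rows;
}

// Two in-use entries packed with widths 1, 4, 2 (illustrative values).
$decoded = pack('CNn', 1, 230, 0) . pack('CNn', 1, 4096, 0);
$rows    = sketchReadXrefRows($decoded, [1, 4, 2]);
```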

Practical example

A page content stream, compressed with Flate. This is the overwhelmingly common shape: a dictionary with /Length and /Filter, then the compressed bytes between stream and endstream.

<?php

declare(strict_types=1);

use NextPDF\Writer\PinnedZlibCompressor;

// The marking operators a page content stream carries, uncompressed.
$content = "BT /F1 12 Tf 72 712 Td (Hello) Tj ET\n";

// NextPDF compresses through the pinned compressor: fixed level,
// fixed zlib-wrapped format. The same $content always yields the
// same $compressed bytes, on any supported PHP/zlib build.
$compressed = PinnedZlibCompressor::compress($content);

// Emitted as a stream object. /Length is the real byte length of
// $compressed; /Filter names the decode the reader must apply.
//   N 0 obj
//   << /Length <strlen($compressed)> /Filter /FlateDecode >>
//   stream
//   <$compressed bytes>
//   endstream
//   endobj

A reader does the inverse: read /Length bytes, run them through FlateDecode because /Filter says so, and get the original operators back. Pin the compressor and that round trip is not only correct but identical every time, which is what golden-file and signed-build checks rely on.
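The reader's half of the round trip can be sketched with PHP built-ins alone. This is heavily simplified and assumes a direct integer /Length and a single FlateDecode filter; a real parser tokenises the dictionary properly and handles indirect lengths, filter arrays, and decode parameters.

```php
<?php

declare(strict_types=1);

$content    = "BT /F1 12 Tf 72 712 Td (Hello) Tj ET\n";
$compressed = gzcompress($content, 9);

// A serialised stream object, as a writer would emit it.
$object = sprintf(
    "4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n",
    strlen($compressed)
) . $compressed . "\nendstream\nendobj\n";

// Reader side: take the declared length from the dictionary.
preg_match('~/Length (\d+)~', $object, $m);
$length = (int) $m[1];

// The data starts right after the "stream" keyword and its end-of-line.
// The first match is the keyword itself, since the dictionary before it
// does not contain that text.
$start = strpos($object, "stream\n") + strlen("stream\n");
$raw   = substr($object, $start, $length);

// Apply the filter the dictionary names.
$decoded = gzuncompress($raw);
```

Note that the reader trusts /Length to find the end of the data rather than scanning for endstream, which is why a wrong length is enough to break an otherwise valid stream.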

Common misconception

The trap is treating the ASCII filters as compression. ASCIIHexDecode and ASCII85Decode make a stream larger: roughly double and roughly 25% larger, respectively. They exist to move binary data through a channel that is only safe for 7-bit text, not to save space. Choosing ASCII85 to “shrink” a PDF does the opposite.

The second half of the same misconception is believing that FlateDecode makes images lossless “for free”. Flate itself is lossless, but it cannot restore detail an image has already lost. If the image was already DCT (JPEG) encoded, wrapping Flate around it keeps the JPEG artifacts, and transcoding it through a lossy filter again degrades it further, regardless of what Flate does around it. The filter pipeline preserves exactly what you feed it, including a re-compression artifact you fed it by accident.
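The size arithmetic behind those two figures is simple enough to write down. The two functions below are hypothetical helpers for this page, not part of any library; they compute encoded sizes only, and the ASCII85 figure ignores the optional z shortcut for all-zero groups.

```php
<?php

declare(strict_types=1);

// Encoded size of $n binary bytes, end-of-data marker included.
function asciiHexSize(int $n): int
{
    // Two hex digits per byte, plus the '>' marker.
    return 2 * $n + 1;
}

function ascii85Size(int $n): int
{
    // Five characters per full 4-byte group; a final partial group of
    // k bytes takes k + 1 characters; '~>' ends the data.
    $tail = $n % 4;

    return intdiv($n, 4) * 5 + ($tail > 0 ? $tail + 1 : 0) + 2;
}

$n   = 1000;
$hex = asciiHexSize($n); // 2001 characters
$a85 = ascii85Size($n);  // 1252 characters
```

For 1000 bytes of binary data, hex encoding needs 2001 characters and ASCII85 needs 1252, so both are strictly larger than the input.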

Limits and boundaries

This page covers how filters are declared and applied, not the bit-level algorithm inside each one. The determinism guarantee is specifically about NextPDF’s Flate output for the streams it writes. It holds across PHP minor versions and builds of the same zlib implementation. The deflate specification (RFC 1951), however, fixes only the compressed format and leaves the encoder free to choose different internal block boundaries, so byte-identical output across genuinely different zlib implementations (for example a stock zlib versus zlib-ng) is not promised. The build environment is pinned for that reason.

NextPDF chooses FlateDecode and the ASCII transport filters for the data it emits. It is not an image transcoder. It does not promise to re-pack an arbitrary inbound JPEG 2000 or JBIG2 stream, and lossy image trade-offs are a property of the source data, not something a writer can undo.

Mini-FAQ

Why is FlateDecode everywhere? It is lossless, general-purpose, well-supported, and a good fit for the text-and-operators content of most PDFs. It is the safe default for content streams, embedded fonts, and the cross-reference stream.

Can I turn compression off? You can omit /Filter and store raw bytes, and a reader will accept it. The file gets larger and nothing else improves; there is rarely a reason outside debugging.

Why pin the compression level at all? So the output is reproducible. An unpinned level (or zlib build) can change the compressed bytes without changing the decompressed content — correct, but not identical, which defeats byte-level verification.

What a PDF actually is — the object model the streams in this page live inside.
Fonts: the hard part — embedded font programs are filtered streams, with their own failure modes.
PDF 2.0: what changed — how the 2.0 baseline treats streams and the cross-reference stream NextPDF defaults to.

Glossary

Stream object — a dictionary plus a block of bytes between stream and endstream, carrying a /Length and usually a /Filter.
Filter — a named decoding transformation a reader applies to a stream’s bytes (for example FlateDecode).
Filter pipeline — an array of filters applied in sequence; the array order is the decode order.
FlateDecode — the zlib/deflate filter; the default compression for content, fonts, and cross-reference streams.
DCTDecode — the JPEG image filter; lossy, so re-encoding degrades the image again.
ASCII transport filter — ASCIIHexDecode / ASCII85Decode; makes data 7-bit-safe at the cost of size — not compression.
Deterministic compression — producing byte-identical compressed output for identical input, achieved by pinning the compressor’s level and format.