What a PDF actually is
ISO 32000-2 §7 Evidence: Standard-backed
At a glance
A PDF is not a page description that happens to be in a file. It is a small graph database with a printer attached. This page describes the four parts every PDF has (header, body, cross-reference table, trailer) and how NextPDF writes them so a reader can find every object without guessing.
Why this matters
Most PDF bugs are not rendering bugs. They are structure bugs: a byte offset that points one character past the object it should, a trailer that names the wrong root, a cross-reference entry that disagrees with where the object actually sits. None of these change how a page looks until a reader takes a different path through the file and falls off the end of it.
If you treat a PDF as opaque, those failures look random. If you know the object model, they look like exactly what they are: a number that does not match a position. Reading the format is the difference between “the PDF is corrupt” and “object 14’s offset is stale because the writer measured it before finalizing the stream length.”
The short version
A PDF has four parts, in file order:
- A header: one line naming the version (%PDF-2.0).
- A body: a sequence of numbered indirect objects (dictionaries, streams, arrays, numbers, strings, names).
- A cross-reference table (or, from PDF 1.5 onward, a cross-reference stream): a lookup from object number to byte offset, so any object can be reached without scanning the file.
- A trailer: a small dictionary naming the document’s root object and pointing at where the cross-reference section starts.
A reader does not read a PDF front to back. It reads the last line
first, finds startxref, jumps to the cross-reference section, and uses it
as an index into the body. The format is built to be read backward. That
one fact explains most of its design.
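As a rough illustration of that backward read, the sketch below locates the cross-reference offset from the tail of a file. It is written in Python for brevity and is not part of NextPDF; the function name `find_startxref` and the 1024-byte tail window are assumptions made for the example.

```python
def find_startxref(data, tail_size=1024):
    """Return the byte offset that follows the last `startxref` keyword."""
    # Only the end of the file is inspected; the body is never scanned.
    tail = data[-tail_size:]
    marker = tail.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword near the end of the file")
    # The token after the keyword is the decimal offset of the xref section.
    tokens = tail[marker + len(b"startxref"):].split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("startxref is not followed by a decimal offset")
    return int(tokens[0])
```

The returned integer is where a reader would seek next in order to load the cross-reference section.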
How NextPDF approaches it
NextPDF builds a PDF the way the format is read: object first, offset recorded after, table written last.
Every indirect object is allocated a number by a single registry
(src/Core/ObjectRegistry.php). The registry hands out sequential numbers
through allocate() and, after an object’s bytes are written to the
output buffer, records the byte offset through register(). Offsets are
never guessed ahead of time. They are observed from
BinaryBuffer::getOffset() at the moment the object header is emitted. This
is why a NextPDF cross-reference entry cannot drift from the object it
describes: the offset is whatever the buffer’s position actually was.
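The allocate-then-register pattern can be sketched in a few lines. This is a Python illustration only, not NextPDF's PHP source: the `Buffer` and `Registry` classes, the `write_object` helper, and their internals are hypothetical, and only the method names `allocate` and `register` echo the description above.

```python
class Buffer:
    """Append-only byte buffer that can report its current length."""

    def __init__(self):
        self._data = bytearray()

    def write(self, chunk):
        self._data += chunk

    def offset(self):
        return len(self._data)

    def getvalue(self):
        return bytes(self._data)


class Registry:
    """Hands out object numbers and stores the offsets reported back."""

    def __init__(self):
        self._next = 1  # object 0 is reserved for the free-list head
        self.offsets = {}

    def allocate(self):
        number = self._next
        self._next += 1
        return number

    def register(self, number, offset):
        self.offsets[number] = offset


def write_object(buffer, registry, number, body):
    start = buffer.offset()  # observed just before the header is emitted
    buffer.write(b"%d 0 obj\n" % number)
    buffer.write(body + b"\nendobj\n")
    registry.register(number, start)  # recorded only after the write
```

Because `start` is read from the buffer rather than derived from expected lengths, a later change to how a body is serialized cannot leave a stale offset behind.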
When the body is complete, a version-specific serialization strategy
(src/Writer/PdfSerializationStrategy.php) writes the cross-reference
section and trailer:
- Pdf20StreamStrategy emits a compressed cross-reference stream (/Type /XRef), the PDF 2.0 default.
- Pdf17TableStrategy and Pdf14TableStrategy emit a traditional cross-reference table of 20-byte entries plus a separate trailer dictionary, as required by the PDF/A profiles that mandate the older file structure.
The strategy is chosen by the output profile, not inferred. Whichever it
is, the final bytes are the same shape: the cross-reference section, then
startxref, then the byte offset, then %%EOF. That tail is what a reader
finds first.
- Step 1 of 4: ISO 32000-2 §7.5.5 — %%EOF and startxref at the file end
- Step 2 of 4: ISO 32000-2 §7.5.4 / §7.5.8 — the cross-reference section maps object number to offset
- Step 3 of 4: ISO 32000-2 §7.5.5 — the trailer names /Root, the document catalog
- Step 4 of 4: ISO 32000-2 §7.3.10 — each indirect object is reached at its recorded offset
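For the classic-table case, the tail described in this section can be sketched as follows. This is a Python illustration under stated assumptions, not NextPDF's strategy code: `classic_tail` is a hypothetical helper, every object is assumed to be in use at generation 0, and object numbers are assumed to run from 1 to N without gaps.

```python
def classic_tail(body_length, offsets, root):
    """Return the xref table, trailer, startxref and %%EOF as bytes.

    offsets maps object number -> byte offset and is assumed to cover
    1..N with no gaps, so a single subsection starting at 0 suffices.
    """
    size = len(offsets) + 1  # +1 for object 0, the free-list head
    parts = [b"xref\n", b"0 %d\n" % size, b"0000000000 65535 f \n"]
    for number in range(1, size):
        # 10-digit offset, 5-digit generation, entry type, then a 2-byte
        # end-of-line (space + LF) so each entry is exactly 20 bytes.
        parts.append(b"%010d %05d n \n" % (offsets[number], 0))
    parts.append(b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root))
    # The table starts right after the body, so its offset is the body length.
    parts.append(b"startxref\n%d\n%%%%EOF\n" % body_length)
    return b"".join(parts)
```

Appending the result to the body bytes yields a file whose last three lines are `startxref`, the offset, and `%%EOF`.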
What the evidence says
The four-part structure is not a NextPDF convention; it is the file
structure clause of ISO 32000-2 §7.5. The
standard defines a PDF as a header, a body of objects, a cross-reference
table, and a trailer, and states that a reader should parse from the end of
the file. The last line is %%EOF, and the two lines before it are the
startxref keyword and the byte offset to the cross-reference section.
An indirect object is defined as an object number and a generation
number, separated by whitespace, followed by the object’s value bracketed
between the keywords obj and endobj. The combination of object number
and generation number uniquely identifies the object; an indirect
reference to it is written as the object number, the generation number,
and the keyword R. NextPDF’s ObjectRegistry mirrors this exactly: a
sequential number, generation 0 for newly written objects, and a recorded
offset.
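The two syntactic forms can be recognized with a pair of regular expressions. The Python sketch below is illustrative only: the helper names are hypothetical, and a pattern match like this would also fire inside strings or stream data, so it is not a substitute for a real tokenizer.

```python
import re

# "N G obj" opens an indirect object; "N G R" refers to one.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def read_object_id(data, offset):
    """Return (object number, generation) for the header at `offset`."""
    match = OBJ_HEADER.match(data, offset)
    if match is None:
        raise ValueError("no indirect object header at offset %d" % offset)
    return int(match.group(1)), int(match.group(2))


def find_references(value):
    """Return every (object number, generation) pair referenced in `value`."""
    return [(int(n), int(g)) for n, g in REFERENCE.findall(value)]
```

The pair returned by `read_object_id` is the same identity that an indirect reference carries, which is what lets a reader connect `2 0 R` to the object that begins with `2 0 obj`.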
PDF 1.5 onward also allows objects to live inside an object stream,
where they are stored without the obj/endobj keywords and must have
generation zero. The cross-reference stream (/Type /XRef,
ISO 32000-2 §7.5.8), introduced alongside object streams in PDF 1.5, is the
mechanism that indexes both ordinary objects and these compressed ones.
NextPDF’s CrossReferenceStream builds it with a /W field-width array and
FlateDecode compression.
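To make the /W idea concrete, the sketch below packs cross-reference stream entries as fixed-width big-endian fields and deflates them. It is a Python illustration, not NextPDF's CrossReferenceStream: the function names are hypothetical, the widths `[1 4 2]` are an assumed choice, and no predictor is applied.

```python
import zlib

# Entry types in a cross-reference stream:
#   0 = free object        (next free object number, generation)
#   1 = in-use object      (byte offset, generation)
#   2 = compressed object  (object stream number, index inside that stream)
FIELD_WIDTHS = (1, 4, 2)  # assumed /W [1 4 2]; a writer would size these to fit


def pack_xref_rows(rows, widths=FIELD_WIDTHS):
    """Pack (type, field2, field3) rows as big-endian fields, then deflate."""
    raw = bytearray()
    for row in rows:
        for value, width in zip(row, widths):
            raw += value.to_bytes(width, "big")
    return zlib.compress(bytes(raw))


def unpack_xref_rows(packed, widths=FIELD_WIDTHS):
    """Inverse of pack_xref_rows: inflate, then slice fixed-width rows."""
    raw = zlib.decompress(packed)
    step = sum(widths)
    rows = []
    for start in range(0, len(raw), step):
        row, pos = [], start
        for width in widths:
            row.append(int.from_bytes(raw[pos:pos + width], "big"))
            pos += width
        rows.append(tuple(row))
    return rows
```

With these widths, an in-use object at byte 9 becomes the seven bytes `01 00 00 00 09 00 00` before compression, and a type 2 row such as `(2, 7, 0)` says the object is the first member of object stream 7.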
Practical example
This is the shape of a minimal PDF body and its trailer. The numbers in the cross-reference section are byte offsets. They must be exactly right, which is why NextPDF records them from the buffer rather than computing them.

    %PDF-2.0
    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj
    2 0 obj
    << /Type /Pages /Kids [3 0 R] /Count 1 >>
    endobj
    3 0 obj
    << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
    endobj
    xref
    0 4
    0000000000 65535 f
    0000000009 00000 n
    0000000058 00000 n
    0000000115 00000 n
    trailer
    << /Size 4 /Root 1 0 R >>
    startxref
    186
    %%EOF

The offsets assume every line ends in a single line-feed byte. A reader opens this from the bottom: %%EOF, then startxref 186, then it
seeks to byte 186 where xref begins, reads that object 1 lives at byte 9,
follows /Root 1 0 R to the catalog, and walks the page tree from there.
Object 0 is always the free-list head with generation 65535 — a quirk
inherited from the format’s earliest design, faithfully reproduced because
readers expect it.
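One way to see why the offsets must be exact is to check them mechanically. The Python sketch below walks a classic table from `startxref` and reports any in-use entry whose offset does not land on its own `N G obj` header. It is an illustration rather than a NextPDF feature: `stale_entries` is a hypothetical name, and the code assumes a classic cross-reference table with a single subsection.

```python
import re


def stale_entries(data):
    """Return object numbers whose table offset does not land on `N G obj`.

    Assumes a classic cross-reference table with a single subsection.
    """
    tail = data[-1024:]
    marker = tail.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword near the end of the file")
    xref_at = int(tail[marker + len(b"startxref"):].split()[0])

    lines = data[xref_at:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref keyword")
    first, count = (int(token) for token in lines[1].split())

    stale = []
    for index in range(count):
        offset, generation, kind = lines[2 + index].split()
        if kind != b"n":
            continue  # free entries such as object 0 have no body to check
        header = rb"%d\s+%d\s+obj\b" % (first + index, int(generation))
        if not re.match(header, data[int(offset):]):
            stale.append(first + index)
    return stale
```

An empty list means every recorded offset matches a position. A single off-by-one entry shows up as exactly one object number, which is the kind of precise diagnosis the object model makes possible.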
Common misconception
The trap is believing a PDF is read top to bottom like source code. It is
not. The body can be in any object order. Object numbers need not be
sequential in the file, and a reader never relies on them being so. The
only authoritative index is the cross-reference section, and the only way
to find it is the startxref offset at the end of the file. A PDF with a perfectly valid body
and a single wrong number in startxref is unreadable. A PDF with objects
written in a scrambled order but a correct cross-reference table is fine.
Physical order is meaningless; the recorded offset is everything.
Limits and boundaries
This page describes file structure, not page content. How marks get onto a page (content streams, graphics operators, text showing) is a separate topic. It also does not cover what happens when a file is changed after it is written. That is the job of incremental updates, where the writer appends a second cross-reference section and the trailer chains backward.
NextPDF is a writer. The behavior described here is how it serializes a document it built. It is not a general-purpose PDF parser or repair tool. It does not promise to read, reconstruct, or salvage an arbitrary third-party file with a damaged cross-reference table. The guarantee is narrow and deliberate. The files NextPDF writes have offsets that match, because they are measured, not predicted.
Mini-FAQ
Why generation numbers if new files always use 0? Generation numbers exist for object reuse across updates. A freshly written file has every object at generation 0. Non-zero generations appear only when a file has been incrementally updated and an object number is recycled.
Can two objects have the same number? In a single cross-reference section, no. Across incremental updates a file can physically contain several copies of the same object number. The most recent cross-reference entry wins. That is the subject of the next page.
Does object order in the file matter for output? No. NextPDF writes objects in a deterministic order for reproducible builds, but a reader resolves everything through the cross-reference section, so the physical order is not semantically meaningful.
Related docs
- Incremental updates and why they matter (what happens when a written PDF is changed: appended sections and a chained trailer)
- Streams and filters (how the body’s stream objects are compressed and encoded)
- PDF 2.0: what changed (how the file structure differs between 1.7 and the 2.0 baseline NextPDF targets)
Glossary
- Indirect object: a numbered object in the body, written as N G obj … endobj, where N is the object number and G the generation number.
- Indirect reference: a pointer to an indirect object, written N G R.
- Cross-reference table (xref): the index from object number to byte offset. In PDF 2.0 this is usually a cross-reference stream (/Type /XRef) instead of the classic 20-byte-per-entry text table.
- Trailer: the dictionary at the end of a cross-reference section that names /Root (the document catalog) and /Size, and is found via the startxref offset.
- Object stream: a stream object that itself contains other indirect objects (compressed together); members have no obj/endobj and generation zero.
- Document catalog: the object named by /Root; the entry point to the page tree and everything else in the document.