What a PDF actually is

Evidence: Standard-backed

A PDF is not a page description that happens to be in a file. It is a small graph database with a printer attached. This page describes the four parts every PDF has — header, body, cross-reference table, trailer — and how NextPDF writes them so a reader can find every object without guessing.

Most PDF bugs are not rendering bugs. They are structure bugs: a byte offset that points one character past the object it should, a trailer that names the wrong root, a cross-reference entry that disagrees with where the object actually sits. None of these change how a page looks until a reader takes a different path through the file and falls off the end of it.

If you treat a PDF as opaque, those failures look random. If you know the object model, they look like exactly what they are: a number that does not match a position. Reading the format is the difference between “the PDF is corrupt” and “object 14’s offset is stale because the writer measured it before finalizing the stream length.”

A PDF has four parts, in file order:

  1. A header — one line naming the version (%PDF-2.0).
  2. A body — a sequence of numbered indirect objects: dictionaries, streams, arrays, numbers, strings, names.
  3. A cross-reference table (or, since PDF 1.5, a cross-reference stream) — a lookup from object number to byte offset, so any object can be reached without scanning the file.
  4. A trailer — a small dictionary naming the document’s root object and pointing at where the cross-reference section starts.

A reader does not read a PDF front to back. It reads the last line first, finds startxref, jumps to the cross-reference section, and uses it as an index into the body. The format is built to be read backward. That one fact explains most of its design.
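
As a rough illustration of that backward read, the sketch below pulls the cross-reference offset out of a file's tail. The function name find_startxref, the 1024-byte tail window, and the sample bytes are assumptions made for this example; none of them come from NextPDF, which is a writer and not a reader.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the decimal offset that follows the last startxref keyword."""
    tail = data[-tail_size:]  # only the end of the file is inspected
    # Pattern: startxref, whitespace, decimal offset, whitespace, %%EOF.
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", tail))
    if not matches:
        raise ValueError("no startxref / %%EOF pair found in the file tail")
    return int(matches[-1].group(1))  # the last pair in the file wins

# Hypothetical tail of a file; the body is elided on purpose.
sample = b"%PDF-2.0\n(body omitted)\nstartxref\n4321\n%%EOF\n"
print(find_startxref(sample))  # 4321
```

Everything else a reader does starts from the number this returns, which is why a wrong value here makes an otherwise valid file hard to open.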

NextPDF builds a PDF the way the format is read: object first, offset recorded after, table written last.

Every indirect object is allocated a number by a single registry (src/Core/ObjectRegistry.php). The registry hands out sequential numbers through allocate() and, after an object’s bytes are written to the output buffer, records the byte offset through register(). Offsets are never guessed ahead of time. They are observed from BinaryBuffer::getOffset() at the moment the object header is emitted. This is why a NextPDF cross-reference entry cannot drift from the object it describes: the offset is whatever the buffer’s position actually was.
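
The allocate-then-register pattern can be sketched in a few lines of Python. This is not NextPDF's code: the class below is a hypothetical stand-in that borrows only the method names allocate() and register() from the description above, and io.BytesIO.tell() plays the role of BinaryBuffer::getOffset(). The helper write_object is likewise an assumption for the example.

```python
import io

class ObjectRegistry:
    """Hypothetical stand-in: hands out numbers, stores observed offsets."""

    def __init__(self):
        self._next = 1      # object 0 is reserved for the free-list head
        self.offsets = {}   # object number -> byte offset

    def allocate(self) -> int:
        number = self._next
        self._next += 1
        return number

    def register(self, number: int, offset: int) -> None:
        self.offsets[number] = offset

def write_object(buf, registry, number: int, body: bytes) -> None:
    offset = buf.tell()                    # observed position, not a prediction
    buf.write(b"%d 0 obj\n" % number)      # object header
    buf.write(body + b"\nendobj\n")
    registry.register(number, offset)      # recorded after the bytes exist

buf = io.BytesIO()
buf.write(b"%PDF-2.0\n")
registry = ObjectRegistry()
catalog = registry.allocate()
pages = registry.allocate()
write_object(buf, registry, catalog, b"<< /Type /Catalog /Pages %d 0 R >>" % pages)
write_object(buf, registry, pages, b"<< /Type /Pages /Kids [] /Count 0 >>")
print(registry.offsets)  # {1: 9, 2: 58}
```

Because the offset is read from the buffer, changing the length of any earlier object changes the recorded numbers automatically. There is no separate calculation that could fall out of step.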

When the body is complete, a version-specific serialization strategy (src/Writer/PdfSerializationStrategy.php) writes the cross-reference section and trailer:

  • Pdf20StreamStrategy emits a compressed cross-reference stream (/Type /XRef) — the PDF 2.0 default.
  • Pdf17TableStrategy and Pdf14TableStrategy emit a traditional cross-reference table of fixed 20-byte entries plus a separate trailer dictionary, as required by the PDF/A profiles that mandate the older file structure.

The strategy is chosen by the output profile, not inferred. Whichever it is, the final bytes are the same shape: the cross-reference section, then startxref, then the byte offset, then %%EOF. That tail is what a reader finds first.
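
For the classic table variant, that tail can be sketched as follows. The function name classic_tail and the sample offsets are assumptions for illustration. The sketch also assumes object numbers run contiguously from 1, every object has generation 0, and line endings are single line feeds. It is not a rendering of Pdf17TableStrategy.

```python
def classic_tail(offsets: dict, root: int, xref_offset: int) -> bytes:
    """Build xref table, trailer, startxref and %%EOF for in-use objects."""
    size = max(offsets) + 1                  # /Size: highest object number + 1
    out = bytearray(b"xref\n0 %d\n" % size)  # one subsection starting at object 0
    out += b"0000000000 65535 f \n"          # object 0: head of the free list
    for number in range(1, size):
        # 10-digit offset, 5-digit generation, keyword, space, LF = 20 bytes
        out += b"%010d %05d n \n" % (offsets[number], 0)
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)

# Illustrative values only: two objects and a table that starts at byte 110.
tail = classic_tail({1: 9, 2: 58}, root=1, xref_offset=110)
print(tail.decode("ascii"))
```

The trailing space before each line feed is what pads an entry to exactly 20 bytes when the end-of-line marker is a single byte.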

How a reader resolves an object in a NextPDF file, and the ISO 32000-2 clause that defines each step. It starts at the end of the file and works inward:

  1. %%EOF and startxref sit at the file end (ISO 32000-2 §7.5.5).
  2. The cross-reference section maps object number to offset (ISO 32000-2 §7.5.4 / §7.5.8).
  3. The trailer names /Root, the document catalog (ISO 32000-2 §7.5.5).
  4. Each indirect object is reached at its recorded offset (ISO 32000-2 §7.3.10).

The four-part structure is not a NextPDF convention; it is the file structure clause of ISO 32000-2, §7.5. The standard defines a PDF as a header, a body of objects, a cross-reference table, and a trailer, and states that a reader should parse from the end of the file. The last line is %%EOF, and the two lines before it are the startxref keyword and the byte offset to the cross-reference section.

Evidence: Standard-backed

An indirect object is defined as an object number and a generation number, separated by whitespace, followed by the object’s value bracketed between the keywords obj and endobj. The combination of object number and generation number uniquely identifies the object; an indirect reference to it is written as the object number, the generation number, and the keyword R. NextPDF’s ObjectRegistry mirrors this exactly: a sequential number, generation 0 for newly written objects, and a recorded offset.
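
The syntax above is regular enough to show with two patterns. The sketch below is deliberately naive: the names OBJ_HEADER, REFERENCE, and describe are invented for this example, and it does not skip string or stream contents, so it could report false references on real files.

```python
import re

# N G obj : object number, generation number, keyword
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# N G R   : indirect reference to object N, generation G
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def describe(raw: bytes):
    """Split one indirect object into identity, value bytes, and references."""
    header = OBJ_HEADER.match(raw)
    if header is None:
        raise ValueError("not an indirect object")
    end = raw.index(b"endobj", header.end())   # value sits between obj and endobj
    value = raw[header.end():end].strip()
    identity = (int(header.group(1)), int(header.group(2)))
    references = [(int(n), int(g)) for n, g in REFERENCE.findall(value)]
    return identity, value, references

raw = b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
identity, value, references = describe(raw)
print(identity)    # (2, 0)
print(references)  # [(3, 0)]
```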

PDF 1.5 onward also allows objects to live inside an object stream, where they are stored without the obj/endobj keywords and must have generation zero. The cross-reference stream (/Type /XRef, ISO 32000-2, §7.5.8) was introduced alongside object streams in PDF 1.5 and is NextPDF’s default for PDF 2.0 output. It indexes both ordinary objects and these compressed ones. NextPDF’s CrossReferenceStream builds it with a /W field-width array and FlateDecode compression.
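
To make the /W array concrete, here is a minimal sketch of how cross-reference stream entries are packed into fixed-width binary fields and then deflated. The widths [1 2 2], the entry values, and the name pack_entries are assumptions for this example; it shows the encoding defined by the standard, not the internals of CrossReferenceStream, and it stops short of emitting a complete stream object.

```python
import zlib

WIDTHS = (1, 2, 2)  # the /W array: bytes used by each of the three fields

def pack_entries(entries, widths=WIDTHS) -> bytes:
    """Encode (type, field 2, field 3) tuples as big-endian fixed-width fields."""
    out = bytearray()
    for fields in entries:
        for value, width in zip(fields, widths):
            out += value.to_bytes(width, "big")
    return bytes(out)

entries = [
    (0, 0, 65535),  # type 0: free entry (next free object number, generation)
    (1, 9, 0),      # type 1: ordinary object at byte offset 9, generation 0
    (1, 58, 0),     # type 1: ordinary object at byte offset 58, generation 0
    (2, 2, 0),      # type 2: compressed object inside object stream 2, index 0
]

raw = pack_entries(entries)
compressed = zlib.compress(raw)  # what a /Filter /FlateDecode stream would hold
print(len(raw))                  # 20
```

A two-byte offset field only reaches 65,535, so a writer has to choose widths large enough for the biggest offset in the file.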

This is the shape of a minimal PDF body and its trailer. The numbers in the cross-reference section are byte offsets. They must be exactly right, which is why NextPDF records them from the buffer rather than computing them.

%PDF-2.0
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj
xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000058 00000 n 
0000000115 00000 n 
trailer
<< /Size 4 /Root 1 0 R >>
startxref
186
%%EOF

A reader opens this from the bottom: %%EOF, then startxref 186, then it seeks to byte 186 where xref begins. From the trailer it takes /Root 1 0 R, looks up object 1 in the table, finds it at byte 9, and walks the page tree from the catalog. The offsets shown assume single line feeds as line endings and exactly the spacing above. Each table entry is exactly 20 bytes, so with a one-byte line ending a space precedes the line feed. Object 0 is always the free-list head with generation 65535, a quirk inherited from the format’s earliest design and faithfully reproduced because readers expect it.
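
The same walk can be reproduced in code. The sketch below first assembles the three-object file with line-feed endings, measuring each offset as it writes, then resolves the catalog in the four steps a reader takes. Both function names are hypothetical, and the parser handles only a classic table with a single subsection.

```python
import re

def build_sample() -> bytes:
    """Assemble the three-object file, measuring every offset while writing."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    out = bytearray(b"%PDF-2.0\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))                     # position before the header
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_at = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(bodies) + 1)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)

def resolve(data: bytes):
    """Return (xref offset, table, root number) by reading from the end."""
    # Step 1: the tail gives the offset of the cross-reference table.
    xref_at = int(re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data).group(1))
    # Step 2: the table maps object numbers to byte offsets.
    lines = data[xref_at:].split(b"\n")
    if lines[0] != b"xref":
        raise ValueError("startxref does not point at a cross-reference table")
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":                             # skip free entries
            table[first + i] = int(offset)
    # Step 3: the trailer names the document catalog.
    root = int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", data[xref_at:]).group(1))
    return xref_at, table, root

data = build_sample()
xref_at, table, root = resolve(data)
print(xref_at)  # 186
print(table)    # {1: 9, 2: 58, 3: 115}
# Step 4: the object is reached at its recorded offset.
print(data[table[root]:].split(b"\n", 1)[0])  # b'1 0 obj'
```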

The trap is believing a PDF is read top to bottom like source code. It is not. The body can be in any object order. Object numbers need not be sequential in the file, and a reader never relies on them being so. The only authoritative index is the cross-reference section, and the only way to find it is the startxref offset at the end of the file. A PDF with a perfectly valid body and a single wrong number in startxref is unreadable to any reader that follows the format and does not attempt repair. A PDF with objects written in a scrambled order but a correct cross-reference table is fine. Physical position is meaningless; the recorded position is everything.
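
A small experiment makes the point. The sketch below writes the same three objects in two physical orders and fetches each one through its recorded offset. The names BODIES, write_body, and fetch are invented for the example.

```python
BODIES = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
}

def write_body(order):
    """Write the objects in the given physical order; return bytes and offsets."""
    out = bytearray(b"%PDF-2.0\n")
    offsets = {}
    for number in order:
        offsets[number] = len(out)   # recorded position of this object
        out += b"%d 0 obj\n%s\nendobj\n" % (number, BODIES[number])
    return bytes(out), offsets

def fetch(data: bytes, offsets: dict, number: int) -> bytes:
    """Read one object by seeking to its recorded offset."""
    start = offsets[number]
    return data[start:data.index(b"endobj", start)]

in_order, in_offsets = write_body([1, 2, 3])
scrambled, scrambled_offsets = write_body([3, 1, 2])

same = all(
    fetch(in_order, in_offsets, n) == fetch(scrambled, scrambled_offsets, n)
    for n in BODIES
)
print(same)                             # True
print(in_offsets == scrambled_offsets)  # False
```

The two files differ byte for byte, yet every lookup returns the same object, because the lookup never depends on where an object happens to sit.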

This page describes file structure, not page content. How marks get onto a page — content streams, graphics operators, text showing — is a separate topic. It also does not cover what happens when a file is changed after it is written. That is the job of incremental updates, where the writer appends a second cross-reference section and the trailer chains backward.

NextPDF is a writer. The behavior described here is how it serializes a document it built. It is not a general-purpose PDF parser or repair tool. It does not promise to read, reconstruct, or salvage an arbitrary third-party file with a damaged cross-reference table. The guarantee is narrow and deliberate. The files NextPDF writes have offsets that match, because they are measured, not predicted.

Why generation numbers if new files always use 0?

Generation numbers exist for object reuse across updates. A freshly written file has every object at generation 0. Non-zero generations appear only when a file has been incrementally updated and an object number is recycled.

Can two objects have the same number?

In a single cross-reference section, no. Across incremental updates a file can physically contain several copies of the same object number. The most recent cross-reference entry wins. That is the subject of the next page.

Does object order in the file matter for output?

No. NextPDF writes objects in a deterministic order for reproducible builds, but a reader resolves everything through the cross-reference section, so the physical order is not semantically meaningful.
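
The "most recent entry wins" rule from the second answer can be sketched without parsing anything. The data layout below is invented for the example: each cross-reference section is keyed by its own byte offset and carries a prev link, modelled on the /Prev trailer entry that chains sections in an incrementally updated file. Incremental updates themselves are covered on the next page.

```python
def effective_table(sections: dict, newest: int) -> dict:
    """Walk from the newest section back through prev links; first entry seen wins."""
    table = {}
    at = newest
    while at is not None:
        section = sections[at]
        for number, offset in section["entries"].items():
            table.setdefault(number, offset)  # keep the newer entry if present
        at = section["prev"]
    return table

# Hypothetical file: the original section at byte 186, one update at byte 420
# that rewrites object 3 at a new offset.
sections = {
    186: {"entries": {1: 9, 2: 58, 3: 115}, "prev": None},
    420: {"entries": {3: 300}, "prev": 186},
}
print(effective_table(sections, 420))  # {3: 300, 1: 9, 2: 58}
```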

  • Indirect object — a numbered object in the body, written as N G obj … endobj, where N is the object number and G the generation number.
  • Indirect reference — a pointer to an indirect object, written N G R.
  • Cross-reference table (xref) — the index from object number to byte offset. Since PDF 1.5 this can be a cross-reference stream (/Type /XRef), the default in NextPDF’s PDF 2.0 output, instead of the classic 20-byte-per-entry text table.
  • Trailer — the dictionary at the end of a cross-reference section that names /Root (the document catalog) and /Size, and is found via the startxref offset.
  • Object stream — a stream object that itself contains other indirect objects (compressed together); members have no obj/endobj and generation zero.
  • Document catalog — the object named by /Root; the entry point to the page tree and everything else in the document.