Advanced PDF parser diagnostics

The Artisan import path reads a Chrome-generated Portable Document Format (PDF) file and brings one page into a NextPDF document. When a difficult input breaks that import, look below PageImporter::import() to the parser classes that read the file byte by byte.

This guide covers the low-level parser surface in the NextPDF\Parser namespace: PdfReader, PdfTokenizer, CrossRefParser, StreamDecoder, ResourceCollector, RevisionExtractor, and the value objects PdfObject and RevisionXRefTable. Every symbol shown here exists in nextpdf/artisan. The guide describes the parser as it is built, not an idealized interface.

Use this guide as both explanation and how-to. It shows how the pieces fit, then walks you through inspecting an incremental-update revision. For the import boundary above this layer, see the Artisan developer guide.

Use the parser surface only when the normal import path has already failed and you need to find the cause. Typical triggers include:

  • PageImporter::import() throws NextPDF\Artisan\Exception\PdfParseException, and you need to know whether the cross-reference table, a stream filter, or the page tree is at fault.
  • A Chrome upgrade changes the output format, such as when a traditional cross-reference table becomes a cross-reference stream, or vice versa, and your fixtures stop matching.
  • You receive a third-party PDF that Chrome did not produce, and you want to confirm whether the parser can read it at all.
  • You are analyzing an incrementally updated document and need per-revision byte ranges or object visibility.

If you are writing a normal renderer integration, you do not need this surface. The parser is an internal diagnostic tool, not a general-purpose PDF library. It does not support encrypted PDFs, linearized hint tables, or incremental updates with conflicting object redefinitions.

The parser is a small set of single-responsibility classes. PdfReader is the entry point. The other classes are collaborators it constructs or calls.

| Class | Responsibility | Key methods |
| --- | --- | --- |
| PdfReader | Read the file structure, resolve objects, and traverse the page tree. | parse(), getObject(), getTrailer(), getObjectNumbers(), getPage(), getPageContentStream(), getPageResources(), getPageMediaBox(), resolveRef(), collectPageResources(), getRevisionCount(), getRevisionXRef(), getRevisions() |
| PdfTokenizer | Analyze lexical syntax per ISO 32000-2:2020 §7.2: names, strings, numbers, dictionaries, arrays, and references. | readToken(), readValue(), readName(), readNumber(), readDictionary(), readArray(), readStreamData(), peek(), skipWhitespace(), getOffset(), setOffset() |
| CrossRefParser | Parse traditional cross-reference tables and cross-reference streams. | parseXRefTable(), parseXRefStream() |
| StreamDecoder | Decode stream bytes by /Filter. | decode() (static) |
| ResourceCollector | Traverse a Resources tree recursively and collect every reachable indirect object. | traverse(), getCollected() |
| RevisionExtractor | Slice an incrementally updated file into per-revision byte ranges. | extractRevision() (static), getRevisionBoundaries() (static) |
| PdfObject | Immutable parsed indirect object (dictionary plus optional stream). | get(), getRef(), getArray(), getType(), getSubtype(), hasStream(), getDictionary(), getRawStreamData(), getRawDictionaryBytes() |
| RevisionXRefTable | Immutable per-revision cross-reference snapshot. | getObjectNumbers(), getActiveObjectCount(), hasRootUpdate(), getSize() |

Construct \NextPDF\Parser\PdfReader with the raw PDF bytes, then call parse() before you call any other method. parse() checks the %PDF- header, finds startxref in the file tail, and walks the cross-reference chain by following /Prev links.
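
The header check and tail scan described above can be sketched in plain PHP. This is a minimal illustration, not the code inside PdfReader::parse(): the function name locateStartXref, the 1024-byte tail window, and the use of RuntimeException are assumptions made for the example. The messages reuse the fragments listed in the error table of this guide.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: check the %PDF- header and read the startxref offset
 * from the file tail. The name and the tail window size are assumptions.
 */
function locateStartXref(string $pdfData, int $tailWindow = 1024): int
{
    if (!str_starts_with($pdfData, '%PDF-')) {
        throw new RuntimeException('missing %PDF- header');
    }

    // Search only the tail: startxref sits just before the final %%EOF.
    $tail = substr($pdfData, max(0, strlen($pdfData) - $tailWindow));
    $pos = strrpos($tail, 'startxref');
    if ($pos === false) {
        throw new RuntimeException('Cannot find startxref marker');
    }

    // The decimal byte offset follows the keyword, separated by whitespace.
    if (preg_match('/\Gstartxref\s+(\d+)/', $tail, $m, 0, $pos) !== 1) {
        throw new RuntimeException('Invalid startxref offset');
    }

    $offset = (int) $m[1];
    if ($offset >= strlen($pdfData)) {
        throw new RuntimeException('Invalid startxref offset');
    }

    return $offset;
}
```

The returned offset is where the cross-reference chain starts. Following /Prev links from there is the part the real reader adds on top of this sketch.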

After parse(), the reader exposes three method groups:

  • Object access. getObject(int $objNum) returns a PdfObject, resolving Type 2 entries (objects stored inside an object stream) automatically. getObjectNumbers() returns a sorted list<int> of every non-free object number. resolveRef(mixed $value) follows one indirect reference. A direct value passes through unchanged.
  • Page access. getPage(int $pageIndex) resolves the catalog, walks /Pages, and returns the page at the zero-based index. getPageContentStream(), getPageResources(), and getPageMediaBox() extract the parts PageImporter needs. collectPageResources() returns array<int, PdfObject> for every object reachable from the page’s Resources and Contents.
  • Revision access. getRevisionCount() returns the number of incremental revisions. A single-revision file returns 1. getRevisionXRef(int $index) returns one RevisionXRefTable (index 0 is the most recent). getRevisions() returns the full list<RevisionXRefTable>.

PdfTokenizer reads the byte stream. You rarely construct it yourself because PdfReader and CrossRefParser own their instances. Inspect this layer when a parse fails on a malformed token. Two behaviors matter for diagnostics:

  • Security limits are constants, not configuration. The tokenizer caps literal-string nesting, dictionary and array nesting, keyword length, and array element count. When input exceeds a limit, it throws PdfParseException and names the limit in the message. A crafted input that trips one of these limits is a defense working as designed, not a parser bug.
  • readValue() routes parsing. It inspects the next byte and delegates to readName(), readLiteralString(), readHexString(), readArray(), readDictionary(), or a number/reference reader. An indirect reference N G R is returned as the array shape ['type' => 'ref', 'num' => N, 'gen' => G]. PdfObject::getRef() and PdfReader::resolveRef() recognize this shape.
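
To make the reference shape concrete, here is a small sketch that parses N G R text into that array and follows it once through a plain object map. Both helper names (parseIndirectRef, resolveOnce) and the object map are hypothetical. The tokenizer and PdfReader::resolveRef() have their own implementations, which this does not reproduce.

```php
<?php
declare(strict_types=1);

/**
 * Parse "N G R" at the start of $source into the reference shape.
 * Returns null when the text is not an indirect reference.
 * Hypothetical helper, not the tokenizer's own code.
 *
 * @return array{type: string, num: int, gen: int}|null
 */
function parseIndirectRef(string $source): ?array
{
    if (preg_match('/^\s*(\d+)\s+(\d+)\s+R(?![A-Za-z0-9])/', $source, $m) !== 1) {
        return null;
    }

    return ['type' => 'ref', 'num' => (int) $m[1], 'gen' => (int) $m[2]];
}

/**
 * Follow one reference through an object map; pass direct values through.
 *
 * @param array<int, mixed> $objects Hypothetical map: object number => value.
 */
function resolveOnce(mixed $value, array $objects): mixed
{
    $isRef = is_array($value)
        && ($value['type'] ?? null) === 'ref'
        && isset($value['num']);

    if (!$isRef) {
        return $value; // Direct value: returned unchanged.
    }

    return $objects[$value['num']] ?? null;
}
```

Resolving only one level, as here, matches the "follows one indirect reference" behavior described for resolveRef(). A reference that points at another reference would need a second call.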

CrossRefParser — cross-reference resolution

CrossRefParser parses both formats Chrome can emit:

  • parseXRefTable() reads a traditional xref table (PDF 1.x style): subsection headers, fixed-width 20-byte entries, and then a trailer dictionary.
  • parseXRefStream() reads a cross-reference stream (introduced in PDF 1.5 and specified in ISO 32000-2:2020 §7.5.8): an indirect object with /Type /XRef, a /W field-width array, and a binary stream of entries.
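
The two entry layouts can be sketched side by side. This is a hedged illustration of the formats, not CrossRefParser's code: the function names and the returned array keys are assumptions. The field defaults for a zero width follow the cross-reference stream rules in the standard (type defaults to 1, other fields to 0).

```php
<?php
declare(strict_types=1);

/**
 * Parse one fixed-width 20-byte entry of a traditional xref table.
 *
 * @return array{offset: int, gen: int, inUse: bool}
 */
function parseTableEntry(string $entry): array
{
    // 10-digit offset, space, 5-digit generation, space, n|f, 2-byte EOL.
    if (preg_match('/^(\d{10}) (\d{5}) ([nf])(?: \r| \n|\r\n)$/D', $entry, $m) !== 1) {
        throw new InvalidArgumentException('Malformed 20-byte xref entry');
    }

    return ['offset' => (int) $m[1], 'gen' => (int) $m[2], 'inUse' => $m[3] === 'n'];
}

/**
 * Split decoded xref-stream bytes into entries using the /W widths.
 *
 * @param array{0: int, 1: int, 2: int} $widths
 * @return list<array{type: int, field2: int, field3: int}>
 */
function decodeXRefStreamEntries(string $data, array $widths): array
{
    [$w1, $w2, $w3] = $widths;
    $entrySize = $w1 + $w2 + $w3;
    if ($entrySize <= 0 || strlen($data) % $entrySize !== 0) {
        throw new InvalidArgumentException('invalid /W array');
    }
    if ($data === '') {
        return [];
    }

    // Fields are big-endian unsigned integers of the given byte width.
    $readInt = static function (string $bytes): int {
        $value = 0;
        for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
            $value = ($value << 8) | ord($bytes[$i]);
        }
        return $value;
    };

    $entries = [];
    foreach (str_split($data, $entrySize) as $chunk) {
        $entries[] = [
            // A zero-width type field means "type 1" by default.
            'type' => $w1 === 0 ? 1 : $readInt(substr($chunk, 0, $w1)),
            'field2' => $readInt(substr($chunk, $w1, $w2)),
            'field3' => $readInt(substr($chunk, $w1 + $w2, $w3)),
        ];
    }

    return $entries;
}
```

For a type 2 entry, field2 is the number of the containing object stream and field3 is the index inside it, which is what lets getObject() resolve objects stored in object streams.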

Both return the same shape: array{xref: array<int, ...>, trailer: array<string, mixed>, prevOffset: int|null}. PdfReader::parse() decides which parser to call by peeking at the four bytes at the cross-reference offset: xref selects the table parser, and anything else is treated as a stream object. Both parsers enforce a one-million-entry ceiling per subsection, which rejects forged counts that would otherwise force the parser to loop or allocate far beyond what the file actually contains.
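
The dispatch rule and the entry ceiling are simple enough to show in a few lines. The sketch below assumes the names selectXRefParser, checkSubsectionCount, and MAX_XREF_ENTRIES. None of them come from the package, and the exception type is a stand-in for PdfParseException.

```php
<?php
declare(strict_types=1);

// Assumed constant name; the value mirrors the one-million-entry ceiling.
const MAX_XREF_ENTRIES = 1_000_000;

/**
 * Decide which cross-reference format starts at $xrefOffset.
 * Returns 'table' when the four bytes are "xref", otherwise 'stream'.
 */
function selectXRefParser(string $pdfData, int $xrefOffset): string
{
    return substr($pdfData, $xrefOffset, 4) === 'xref' ? 'table' : 'stream';
}

/**
 * Reject a subsection count before any loop or allocation depends on it.
 */
function checkSubsectionCount(int $count): int
{
    if ($count < 0 || $count > MAX_XREF_ENTRIES) {
        throw new RuntimeException(
            "Subsection count {$count} exceeds limit of " . MAX_XREF_ENTRIES
        );
    }

    return $count;
}
```

Validating the count before reading entries is the point of the guard: the check costs nothing on honest input and stops a forged header from driving the loop.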

StreamDecoder::decode(string $data, string|array $filter) is static and applies one filter or a chained list of filters. It supports exactly the filters Chrome’s printToPDF emits:

  • FlateDecode (zlib, with a raw-deflate fallback)
  • ASCIIHexDecode
  • ASCII85Decode

Any other filter name throws PdfParseException with a message containing Unsupported stream filter. The decoder caps decompressed output at 16 MiB to bound decompression-bomb risk: oversized output throws instead of allocating without limit. When PdfReader reads a stream and decoding throws, it falls back to the raw stream bytes, so one bad filter does not abort the whole parse.
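
A chained decoder with an output cap can be sketched with PHP's zlib functions. This is an approximation under stated assumptions, not StreamDecoder's source: the function names and the MAX_DECODED_BYTES constant are invented for the example, ASCII85Decode is left out for brevity, and RuntimeException stands in for PdfParseException.

```php
<?php
declare(strict_types=1);

// Assumed constant name; the value mirrors the 16 MiB cap described above.
const MAX_DECODED_BYTES = 16 * 1024 * 1024;

/** Apply one filter name or a chained list of filter names in order. */
function decodeStream(string $data, string|array $filter, int $maxBytes = MAX_DECODED_BYTES): string
{
    foreach ((array) $filter as $name) {
        $data = match ($name) {
            'FlateDecode' => inflateWithCap($data, $maxBytes),
            'ASCIIHexDecode' => decodeAsciiHex($data),
            default => throw new RuntimeException("Unsupported stream filter: {$name}"),
        };
    }

    return $data;
}

/** Inflate zlib data, falling back to raw deflate, with a size cap. */
function inflateWithCap(string $data, int $maxBytes): string
{
    // Both functions emit E_WARNING and return false on failure.
    $out = @gzuncompress($data, $maxBytes);
    if ($out === false) {
        $out = @gzinflate($data, $maxBytes);
    }
    if ($out === false) {
        throw new RuntimeException('FlateDecode decompression failed or output exceeds limit');
    }
    // The zlib length argument is enforced loosely, so check again.
    if (strlen($out) > $maxBytes) {
        throw new RuntimeException("FlateDecode output exceeds {$maxBytes} bytes limit");
    }

    return $out;
}

/** Decode hex digits up to the ">" end marker, ignoring whitespace. */
function decodeAsciiHex(string $data): string
{
    $end = strpos($data, '>');
    $hex = (string) preg_replace('/\s+/', '', $end === false ? $data : substr($data, 0, $end));
    if (preg_match('/^[0-9A-Fa-f]*$/D', $hex) !== 1) {
        throw new RuntimeException('ASCIIHexDecode input contains non-hex bytes');
    }
    if (strlen($hex) % 2 === 1) {
        $hex .= '0'; // An odd final digit is padded with zero.
    }

    return (string) hex2bin($hex);
}
```

Filters apply in array order, so ['ASCIIHexDecode', 'FlateDecode'] first turns the hex text back into bytes and then inflates them.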

ResourceCollector — deep resource traversal

ResourceCollector is constructed with the PdfReader and called through PdfReader::collectPageResources(). Its traverse() method walks a value recursively, follows every ['type' => 'ref'] reference through getObject(), and records each resolved object once in an array<int, PdfObject> keyed by object number. It caps recursion depth and silently skips references it cannot resolve, so one dangling reference yields a partial collection instead of a hard failure.
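
The traversal pattern (recurse, follow references, record once, cap depth, skip what cannot be resolved) can be shown over a plain array instead of a PdfReader. The function collectReachable, its parameters, and the depth limit of 50 are assumptions for this sketch. The real collector resolves through getObject() and stores PdfObject instances.

```php
<?php
declare(strict_types=1);

/**
 * Collect every object reachable from $value, keyed by object number.
 *
 * @param array<int, mixed> $objects   Hypothetical table: object number => parsed value.
 * @param array<int, mixed> $collected Accumulator, filled in place.
 */
function collectReachable(
    mixed $value,
    array $objects,
    array &$collected,
    int $depth = 0,
    int $maxDepth = 50
): void {
    if ($depth > $maxDepth || !is_array($value)) {
        return; // Depth cap reached, or a scalar with nothing to follow.
    }

    if (($value['type'] ?? null) === 'ref' && isset($value['num'])) {
        $num = (int) $value['num'];
        // Skip objects already recorded (cycles) and dangling references.
        if (array_key_exists($num, $collected) || !array_key_exists($num, $objects)) {
            return;
        }
        $collected[$num] = $objects[$num];
        collectReachable($objects[$num], $objects, $collected, $depth + 1, $maxDepth);
        return;
    }

    // Dictionary or array: visit every child value.
    foreach ($value as $child) {
        collectReachable($child, $objects, $collected, $depth + 1, $maxDepth);
    }
}
```

Recording an object before descending into it is what breaks cycles such as a font that points back to its parent dictionary.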

RevisionExtractor — incremental updates and revisions

A PDF that was signed, annotated, or otherwise edited after creation carries incremental updates. Each edit appends a new cross-reference section and trailer, ending in a %%EOF marker. RevisionExtractor works entirely from static methods over a parsed PdfReader:

  • extractRevision(string $pdfData, PdfReader $reader, int $revision) returns the file truncated at the requested revision’s %%EOF boundary. Revision 0 (most recent) returns the whole file; higher indices return progressively older snapshots.
  • getRevisionBoundaries(string $pdfData, PdfReader $reader) returns a list<array{revision, startByte, endByte, sizeBytes}> describing the byte range each revision contributed.

This isolation is deliberate. Extracting an older revision exposes only the objects visible up to that point, which blocks hybrid cross-reference attacks where a later revision redefines an earlier object.
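
As a rough picture of what per-revision byte ranges look like, the sketch below scans for %%EOF markers directly. It is deliberately naive and is not how RevisionExtractor works: the real extractor takes a parsed PdfReader, while a raw scan like this one can be fooled by a %%EOF that happens to sit inside stream data. The function name and the choice of an exclusive endByte are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Naive revision scan: one revision per %%EOF marker, newest first.
 * endByte is exclusive here (an assumption made for this sketch).
 *
 * @return list<array{revision: int, startByte: int, endByte: int, sizeBytes: int}>
 */
function scanRevisionBoundaries(string $pdfData): array
{
    $ends = [];
    $offset = 0;
    while (($pos = strpos($pdfData, '%%EOF', $offset)) !== false) {
        $end = $pos + 5;
        // Include the end-of-line that follows the marker, if present.
        if (substr($pdfData, $end, 2) === "\r\n") {
            $end += 2;
        } elseif (in_array($pdfData[$end] ?? '', ["\r", "\n"], true)) {
            $end += 1;
        }
        $ends[] = $end;
        $offset = $end;
    }

    // Build oldest-first ranges, each starting where the previous one ended.
    $ranges = [];
    $start = 0;
    foreach ($ends as $end) {
        $ranges[] = [$start, $end];
        $start = $end;
    }

    // Renumber so that index 0 is the most recent revision.
    $result = [];
    foreach (array_reverse($ranges) as $index => [$from, $to]) {
        $result[] = [
            'revision' => $index,
            'startByte' => $from,
            'endByte' => $to,
            'sizeBytes' => $to - $from,
        ];
    }

    return $result;
}
```

With these ranges, an older snapshot is the prefix of the file up to that revision's endByte, for example substr($pdfData, 0, $boundaries[1]['endByte']) for the revision before the latest one.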

This procedure inspects the revision history of a PDF that may have been edited after Chrome produced it. The example is shaped for production: it declares strict types, uses full type hints, validates its input, and reports failure through the most specific exception.

  1. Read the PDF bytes into memory, and reject empty input before constructing the reader.
  2. Construct \NextPDF\Parser\PdfReader and call parse().
  3. Read getRevisionCount(). A value of 1 means a single-revision file with no incremental updates.
  4. For each revision, read its RevisionXRefTable and inspect getActiveObjectCount(), hasRootUpdate(), and getSize().
  5. Compute per-revision byte ranges with RevisionExtractor::getRevisionBoundaries().
  6. Catch PdfParseException, the most specific exception the parser raises, at the call site and surface a diagnostic message.

examples/inspect-revisions.php
<?php
declare(strict_types=1);

namespace App\Pdf\Diagnostics;

use NextPDF\Artisan\Exception\PdfParseException;
use NextPDF\Parser\PdfReader;
use NextPDF\Parser\RevisionExtractor;
use NextPDF\Parser\RevisionXRefTable;

/**
 * Inspect the incremental-update history of a PDF file.
 *
 * @return list<array{revision: int, activeObjects: int, rootUpdate: bool, size: int, startByte: int, endByte: int, sizeBytes: int}>
 *
 * @throws PdfParseException If the file is not a readable PDF.
 */
function inspectRevisions(string $path): array
{
    $pdfData = \file_get_contents($path);
    if ($pdfData === false || $pdfData === '') {
        throw new PdfParseException("Cannot read PDF bytes from path: {$path}");
    }

    $reader = new PdfReader($pdfData);
    $reader->parse();

    $boundaries = RevisionExtractor::getRevisionBoundaries($pdfData, $reader);
    $report = [];

    foreach ($reader->getRevisions() as $table) {
        \assert($table instanceof RevisionXRefTable);
        $index = $table->index;
        $boundary = $boundaries[$index];
        $report[] = [
            'revision' => $index,
            'activeObjects' => $table->getActiveObjectCount(),
            'rootUpdate' => $table->hasRootUpdate(),
            'size' => $table->getSize(),
            'startByte' => $boundary['startByte'],
            'endByte' => $boundary['endByte'],
            'sizeBytes' => $boundary['sizeBytes'],
        ];
    }

    return $report;
}

The reader orders revisions from newest (index 0) to oldest. To extract one older snapshot as standalone bytes (for example, to diff what an edit changed), call the extractor directly:

examples/extract-revision.php
<?php
declare(strict_types=1);

namespace App\Pdf\Diagnostics;

use NextPDF\Artisan\Exception\PdfParseException;
use NextPDF\Parser\PdfReader;
use NextPDF\Parser\RevisionExtractor;

/**
 * Extract one revision of a PDF as standalone bytes.
 *
 * @throws PdfParseException If the file is unreadable or the revision index is out of range.
 */
function extractRevision(string $pdfData, int $revision): string
{
    if ($pdfData === '') {
        throw new PdfParseException('Empty PDF input');
    }

    $reader = new PdfReader($pdfData);
    $reader->parse();

    // Throws PdfParseException with an "out of range" message for an invalid index.
    return RevisionExtractor::extractRevision($pdfData, $reader, $revision);
}

Every parser failure surfaces as NextPDF\Artisan\Exception\PdfParseException. The message identifies the cause. Use the table below to map a message fragment to the stage that raised it.

| Message fragment | Stage | What it means |
| --- | --- | --- |
| missing %PDF- header | PdfReader::parse() | The bytes are not a PDF, or the input was truncated at the beginning. |
| Cannot find startxref marker / Invalid startxref offset | PdfReader::parse() | The file tail is corrupt, or the cross-reference pointer is out of bounds. |
| Expected 'xref' keyword / Invalid xref subsection header | CrossRefParser::parseXRefTable() | A traditional cross-reference table is malformed. |
| XRef stream ... /Type /XRef / invalid /W array | CrossRefParser::parseXRefStream() | A cross-reference stream is missing required dictionary entries. |
| exceeds limit of (xref or object-stream count) | CrossRefParser / PdfReader | A forged count tripped a denial-of-service guard. |
| Unsupported stream filter | StreamDecoder::decode() | The stream uses a filter outside the supported FlateDecode / ASCIIHexDecode / ASCII85Decode set. |
| FlateDecode decompression failed / output exceeds ... bytes limit | StreamDecoder | The compressed data is invalid or expands past the 16 MiB cap. |
| Maximum nesting depth ... exceeded / Keyword exceeds maximum length | PdfTokenizer | A crafted or pathological structure tripped a tokenizer limit. |
| Page index ... not found / out of range in subtree | PdfReader::getPage() | The requested page index does not exist in the page tree. |
| Revision index ... out of range | PdfReader / RevisionExtractor | The revision index is outside 0 to getRevisionCount() - 1. |
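
If you triage many failures, the table can be turned into a small lookup. The sketch below is a convenience built only from the fragments listed above. The constant and function names are assumptions, and it presumes the real messages contain these fragments verbatim, which you should confirm against your installed version. Fragments that the table shows with an elided middle are matched on their leading words only.

```php
<?php
declare(strict_types=1);

// Fragment => stage, copied from the table above (elided parts left out).
const STAGE_BY_FRAGMENT = [
    'missing %PDF- header' => 'PdfReader::parse()',
    'Cannot find startxref marker' => 'PdfReader::parse()',
    'Invalid startxref offset' => 'PdfReader::parse()',
    "Expected 'xref' keyword" => 'CrossRefParser::parseXRefTable()',
    'Invalid xref subsection header' => 'CrossRefParser::parseXRefTable()',
    'invalid /W array' => 'CrossRefParser::parseXRefStream()',
    'exceeds limit of' => 'CrossRefParser / PdfReader',
    'Unsupported stream filter' => 'StreamDecoder::decode()',
    'FlateDecode decompression failed' => 'StreamDecoder',
    'Maximum nesting depth' => 'PdfTokenizer',
    'Keyword exceeds maximum length' => 'PdfTokenizer',
    'Page index' => 'PdfReader::getPage()',
    'Revision index' => 'PdfReader / RevisionExtractor',
];

/** Map an exception message to the parser stage that likely raised it. */
function stageForMessage(string $message): string
{
    foreach (STAGE_BY_FRAGMENT as $fragment => $stage) {
        if (str_contains($message, $fragment)) {
            return $stage;
        }
    }

    return 'unknown';
}
```

Logging the stage next to the raw message gives a stable field to group on, while the message itself keeps the detail.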

When you catch the exception, log the message and the source path, then either rethrow or return a defined error. Do not discard it silently. An empty catch block hides the one piece of information the parser worked to produce.

examples/parse-with-diagnostics.php
<?php
declare(strict_types=1);

namespace App\Pdf\Diagnostics;

use NextPDF\Artisan\Exception\PdfParseException;
use NextPDF\Parser\PdfReader;
use Psr\Log\LoggerInterface;

/**
 * Parse a PDF, logging the precise parser-stage message on failure.
 *
 * @throws PdfParseException Rethrown after logging so the caller can decide policy.
 */
function parseWithDiagnostics(string $pdfData, LoggerInterface $logger): PdfReader
{
    if ($pdfData === '') {
        throw new PdfParseException('Empty PDF input');
    }

    $reader = new PdfReader($pdfData);

    try {
        $reader->parse();
    } catch (PdfParseException $exception) {
        $logger->error('PDF parse failed', [
            'reason' => $exception->getMessage(),
            'bytes' => \strlen($pdfData),
        ]);

        throw $exception;
    }

    return $reader;
}

  • Always call parse() first. Every accessor on PdfReader assumes the cross-reference chain is loaded. Until parse() has run, getObject() and getPage() have no cross-reference data to work from and cannot return a meaningful result.
  • Treat the parser as read-only and Chrome-shaped. It targets the subset of PDF syntax that Chrome’s printToPDF emits. Encrypted PDFs, linearized hint tables, and conflicting incremental updates are out of scope by design. Do not extend it into a general PDF repair tool.
  • Keep the security limits in place. The nesting, keyword-length, array-size, cross-reference-count, and decompression caps bound resource use on hostile input. A PdfParseException from a limit is the correct outcome for a crafted file. Raising a limit to accept such a file widens the attack surface.
  • Default to page 0. getPage() and PageImporter::import() default to the first page. Choose another index only when the workflow deliberately needs it.
  • Validate input before constructing the reader. Reject empty or unreadable bytes early, as the examples above do, so a clear application-level error appears before any parser exception.
  • Catch PdfParseException, never bare \Exception. It is the single, specific type the parser raises. Catching it keeps unrelated failures from being masked.

  • Artisan developer guide — the import boundary above the parser, including ChromeHtmlRenderer, PageImporter, and the architecture layers.
  • Artisan API reference — the published method tables for the package’s public surface.
  • Artisan troubleshooting — symptom-first guidance for renderer and import failures.
  • Chrome renderer setup — configuring the renderer that produces the PDFs this parser reads.
  • ISO 32000-2:2020 §7.5 (file structure, cross-reference, incremental updates) and §7.2 (lexical conventions) — the specification the tokenizer and cross-reference parser implement. Consult the published standard for the authoritative byte-level format.