Ast: semantic document tree and serialization
At a glance
The Ast module provides the engine’s semantic document abstract syntax tree
(AST). It models a document as a typed node hierarchy: Document, Section,
Heading, Paragraph, List, Table, Figure, Code, and FormField.
The model records bounding boxes and citation anchors, and it serializes to
versioned JavaScript Object Notation (JSON). The accessibility tagging layer
uses this tree to produce a structure tree.
Stability: experimental. This is an internal model surface. Its classes do not carry version-frozen public application programming interface (API) guarantees. The node set and node attributes may change. The serialization schema is versioned independently (AstDocument::CURRENT_SCHEMA_VERSION = '1.0.0'). The serializer detects and rejects an incompatible schema, so persisted AST JSON keeps a stable contract even when the in-memory API changes.
Install
composer require nextpdf/core:^3
Conceptual overview
Here, an AST represents a document’s logical structure. It is not a parser
syntax tree for one input format. AstDocument is the container. It holds the
root AstNode (which must be NodeType::Document), a schema version, a
hash of the source Portable Document Format (PDF) file, and a page count. It
rejects invalid construction, including an empty schema version, a page count
below one, or the wrong root type.
AstNode is the recursive node. NodeType enumerates the semantic kinds.
A node carries children, an optional BoundingBox, optional text content, and
attributes validated by NodeAttributeSchema. The node API supports immutable
derivation. withBboxAndText() returns a new node. deepClone() copies a
subtree. NodeId is the value-object identity. CitationAnchor ties a node
to a source location for traceability. AstNodeCollection is a
Countable/IteratorAggregate set with ofType() filtering.
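The split between in-place mutation and immutable derivation can be sketched with a stand-in class. SketchNode below is hypothetical and is not NextPDF\Ast\AstNode; its constructor, its withText() method, and its choice to share child objects in derived nodes are assumptions made only to illustrate the idea.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in, NOT NextPDF\Ast\AstNode. It only mirrors the
// mutate-versus-derive split described above.
final class SketchNode
{
    /** @var list<SketchNode> */
    private array $children = [];

    public function __construct(
        public readonly string $type,
        public readonly ?string $text = null,
    ) {}

    // Mutator: changes this instance in place.
    public function addChild(SketchNode $child): void
    {
        $this->children[] = $child;
    }

    // Derivation: returns a new node and leaves $this untouched.
    public function withText(string $text): self
    {
        $copy = new self($this->type, $text);
        $copy->children = $this->children; // sketch choice: child objects are shared
        return $copy;
    }

    // Deep copy: clones the whole subtree recursively.
    public function deepClone(): self
    {
        $copy = new self($this->type, $this->text);
        foreach ($this->children as $child) {
            $copy->children[] = $child->deepClone();
        }
        return $copy;
    }

    // Counts this node plus all descendants.
    public function totalNodeCount(): int
    {
        $count = 1;
        foreach ($this->children as $child) {
            $count += $child->totalNodeCount();
        }
        return $count;
    }
}

$root = new SketchNode('document');
$root->addChild(new SketchNode('heading'));     // mutates $root

$derived = $root->withText('hello');            // new instance; $root->text stays null
$clone = $root->deepClone();
$clone->addChild(new SketchNode('paragraph'));  // grows the clone only
```

In this sketch, adding a child to the clone leaves the original subtree unchanged, which is the property a deep copy is expected to give you.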
AstSerializer is the persistence boundary. serialize() writes an
AstDocument to JSON. deserialize() reads it back. canDeserialize() and
extractSchemaVersion() let you check compatibility before parsing, so a
schema mismatch is a detected condition instead of a corrupt load.
AstDocument::estimateTokenCount() helps size content for downstream
token-bounded processing.
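To make the idea of a pre-parse compatibility check concrete, here is a minimal sketch of how such a gate could work. It is not the implementation of canDeserialize() or extractSchemaVersion(). The top-level "schemaVersion" key, the same-major-version rule, and both function names are assumptions.

```php
<?php
declare(strict_types=1);

// Assumption: the version lives in a top-level "schemaVersion" key.
// The real serialized layout is not documented on this page.
function sketchExtractSchemaVersion(string $json): ?string
{
    try {
        $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        return null; // malformed JSON: no version to report
    }

    $version = is_array($data) ? ($data['schemaVersion'] ?? null) : null;

    return is_string($version) && $version !== '' ? $version : null;
}

// Assumption: "compatible" means "same major version".
function sketchCanDeserialize(string $json, string $expected): bool
{
    $found = sketchExtractSchemaVersion($json);
    if ($found === null) {
        return false;
    }

    // Compare only the major component, e.g. "1" in "1.0.0".
    return explode('.', $found)[0] === explode('.', $expected)[0];
}
```

The point of the design is that a missing or mismatched version yields a plain false or null, so the caller can log and reject the payload before any tree is built.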
API surface
| Class | Key members | Role |
|---|---|---|
| AstDocument | toJson(), nodeCount(), estimateTokenCount(), CURRENT_SCHEMA_VERSION | Root container; validates root type and schema |
| AstNode | addChild(), children(), childCount(), totalNodeCount(), withBboxAndText(), deepClone() | Recursive semantic node |
| NodeType (enum) | Document, Heading, Table, Figure, FormField, … | Semantic node kind |
| AstNodeCollection | add(), count(), isEmpty(), ofType(), toArray() | Iterable, type-filterable node set |
| AstSerializer | serialize(), deserialize(), canDeserialize(), extractSchemaVersion() | Versioned JSON persistence |
| BoundingBox | toArray(), equals() | Geometry value object (epsilon compare) |
| NodeId / CitationAnchor | toString(), equals(), toArray() | Node identity and source-traceability anchor |
| NodeAttributeSchema | attribute validation | Schema for node attributes |
Run composer docs:generate-api-php -- --module=Ast to generate the full
PHPDoc table.
Code sample — Quick start
Build a small tree, then serialize it.
<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Ast\AstNode;
use NextPDF\Ast\AstSerializer;
use NextPDF\Ast\NodeType;

// Build a Document root with a Heading child and a Paragraph child.
$root = new AstNode(NodeType::Document);
$heading = new AstNode(NodeType::Heading);
$root->addChild($heading);
$root->addChild(new AstNode(NodeType::Paragraph));

echo "Nodes: {$root->totalNodeCount()}\n";

// serialize() takes an AstDocument; constructing one around $root is elided here.
$json = (new AstSerializer())->serialize(/* an AstDocument wrapping $root */);
Code sample — Production
Round-trip persisted AST defensively. Check schema compatibility before you deserialize untrusted JSON.
<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Ast\AstDocument;
use NextPDF\Ast\AstSerializer;
use Psr\Log\LoggerInterface;

final readonly class AstStore
{
    public function __construct(
        private AstSerializer $serializer,
        private LoggerInterface $logger,
    ) {}

    // Returns null, after logging a warning, when the schema is incompatible.
    public function load(string $json): ?AstDocument
    {
        if (!$this->serializer->canDeserialize($json)) {
            $this->logger->warning('AST JSON schema incompatible; rejected.', [
                'found_schema' => $this->serializer->extractSchemaVersion($json),
                'expected' => AstDocument::CURRENT_SCHEMA_VERSION,
            ]);

            return null;
        }

        return $this->serializer->deserialize($json);
    }
}
Edge cases & gotchas
- AstDocument requires the root node to be NodeType::Document. A tree with any other root throws at construction.
- AstNode::withBboxAndText() and deepClone() return new instances, while addChild() mutates the node in place. Know which kind of method you are calling.
- Always gate deserialize() with canDeserialize() for externally sourced JSON. A schema-version mismatch is a detectable, expected condition.
- estimateTokenCount() is an estimate for sizing downstream processing, not an exact tokenizer count. Do not treat it as authoritative.
- BoundingBox::equals() is an epsilon compare (default 0.001). Exact float equality is not the contract.
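The epsilon-compare point is easier to see with numbers. The helper below is a hypothetical illustration, not BoundingBox::equals(). It assumes a box is a four-element list of numbers laid out as [x, y, width, height], and it reuses the 0.001 default quoted above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not BoundingBox::equals(). Boxes are assumed to be
 * [x, y, width, height] lists of numbers.
 */
function sketchBoxesEqual(array $a, array $b, float $epsilon = 0.001): bool
{
    if (count($a) !== 4 || count($b) !== 4) {
        return false;
    }

    $a = array_values($a);
    $b = array_values($b);

    for ($i = 0; $i < 4; $i++) {
        // Equal means "within epsilon", never exact float identity.
        if (abs($a[$i] - $b[$i]) > $epsilon) {
            return false;
        }
    }

    return true;
}

$computed = [0.1 + 0.2, 0.0, 10.0, 10.0];
$stored = [0.3, 0.0, 10.0, 10.0];

$exact = $computed === $stored;                    // false: 0.1 + 0.2 is not exactly 0.3
$tolerant = sketchBoxesEqual($computed, $stored);  // true: the difference is far below 0.001
```

Geometry that has passed through arithmetic or a JSON round trip can differ in the last bits, so a tolerance-based comparison is usually the safer test for "same box".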
Performance
Tree construction and traversal are O(n) in node count, and serialization is
linear in tree size. Serialization is bitwise reproducible: the same tree
serializes to the same JSON bytes, which keeps the schema stable as a
persistence contract. The default reference workload stays well inside the
1500 ms wall / 64 MB peak budget.
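One way to spot-check byte-for-byte reproducibility in your own tests is to serialize the same input several times and compare digests. This is a sketch under assumptions: sketchIsBitwiseStable() is a hypothetical helper, and the closure uses json_encode() on a plain array as a stand-in for however you invoke the real serializer.

```php
<?php
declare(strict_types=1);

// Hypothetical test helper: calls $serialize repeatedly and checks that
// every run produces byte-identical output.
function sketchIsBitwiseStable(callable $serialize, int $runs = 3): bool
{
    $first = hash('sha256', $serialize());

    for ($i = 1; $i < $runs; $i++) {
        if (!hash_equals($first, hash('sha256', $serialize()))) {
            return false;
        }
    }

    return true;
}

// Stand-in payload; a real check would serialize an AstDocument instead.
$tree = ['type' => 'document', 'children' => [['type' => 'heading']]];

$stable = sketchIsBitwiseStable(
    fn (): string => json_encode($tree, JSON_THROW_ON_ERROR),
);
```

Comparing digests instead of full strings keeps failure messages short when the serialized payload is large.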
Security notes
AstSerializer::deserialize() parses JSON that may be persisted or
transmitted. Validate compatibility with canDeserialize() first. Treat the
deserialized tree’s text content and attributes as untrusted strings when they
re-enter the application or are rendered. The module itself performs no
input/output (I/O) and embeds no external data. See the engine threat model in
/modules/core/security/.
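As an illustration of treating node text as untrusted, the sketch below escapes a string before it reaches an HTML page. It assumes HTML is the rendering target; sketchRenderNodeText() is a hypothetical helper, and the sample string stands in for text read from a deserialized node.

```php
<?php
declare(strict_types=1);

// Assumption: the text is headed for an HTML page. Other sinks (SQL, shell,
// logs) need their own context-specific encoding.
function sketchRenderNodeText(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$fromAst = '<b>a & b</b>'; // stands in for text read from a deserialized node
$safe = sketchRenderNodeText($fromAst);
```

Escaping at the point of output, and not at load time, keeps the stored tree unchanged and lets each output context apply its own encoding.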
Conformance
This module asserts no PDF-specification normative claim. The semantic AST is
an engine-internal abstraction. It does not implement a standardized document
model whose clauses must be cited. Where the AST feeds accessibility tagging,
the PDF/UA and tagged-PDF conformance of the output is documented and
validated on /modules/core/accessibility/ and /modules/core/conformance/,
not here.
See also
- Accessibility module — uses the AST to build the structure tree.
- Inspect module — inspects layout and structure.
- HTML module — provides a source of document structure.
- Conformance overview