Metadata: XMP packet build and streaming read

The Metadata module is the engine’s Extensible Metadata Platform (XMP) layer. It builds the XMP packet that a Portable Document Format (PDF) file carries as a metadata stream. It reads an existing packet without loading the whole document into memory. It emits the engine’s audit-trail XMP extension.

composer require nextpdf/core:^3

A PDF stores document-level metadata as an XMP packet in a metadata stream attached to the document catalog, as described by ISO 32000-2 §14.3. This module owns the production and consumption of that packet. Its surface is deliberately small and focused: three classes under NextPDF\Metadata\Xmp.

XmpMetadataBuilder produces the packet. It serializes a property set into a well-formed XMP document wrapped in the standard <?xpacket?> processing instructions. It uses the canonical packet globally unique identifier (GUID) and byte-order mark fixed by the XMP specification. The output is the byte string that the Writer embeds as the metadata stream, the in-PDF XMP representation described in §14.3.
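To make the packet shape concrete, here is a minimal sketch of an XMP packet wrapper written with plain PHP string handling. It is not the XmpMetadataBuilder implementation: the function name, the constants, and the choice of a single namespace with simple text properties are all assumptions made for illustration. The packet id and the UTF-8 byte-order mark are the values fixed by the XMP specification.

```php
<?php
declare(strict_types=1);

// Hypothetical names for illustration; not the XmpMetadataBuilder API.
const SKETCH_XPACKET_ID = 'W5M0MpCehiHzreSzNTczkc9d'; // packet id fixed by the XMP specification
const SKETCH_XPACKET_BOM = "\xEF\xBB\xBF";            // U+FEFF encoded as UTF-8

/**
 * Wrap simple text properties of one namespace in an XMP packet.
 * Property names are assumed to be valid XML names; only values are escaped.
 *
 * @param array<string, string> $properties Local name => text value.
 */
function sketchBuildXmpPacket(array $properties, string $prefix, string $namespaceUri): string
{
    $escape = static fn (string $s): string => htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    $body = '';
    foreach ($properties as $name => $value) {
        $body .= sprintf("   <%s:%s>%s</%s:%s>\n", $prefix, $name, $escape($value), $prefix, $name);
    }

    return '<?xpacket begin="' . SKETCH_XPACKET_BOM . '" id="' . SKETCH_XPACKET_ID . '"?>' . "\n"
        . '<x:xmpmeta xmlns:x="adobe:ns:meta/">' . "\n"
        . ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">' . "\n"
        . '  <rdf:Description rdf:about="" xmlns:' . $prefix . '="' . $escape($namespaceUri) . '">' . "\n"
        . $body
        . "  </rdf:Description>\n"
        . " </rdf:RDF>\n"
        . "</x:xmpmeta>\n"
        . '<?xpacket end="w"?>';
}

$packet = sketchBuildXmpPacket(
    ['Producer' => 'Example & Co'],
    'pdf',
    'http://ns.adobe.com/pdf/1.3/',
);
```

The sketch covers only flat text properties. Structured values such as language alternatives or ordered arrays need RDF containers, which a real builder has to handle.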

XmpStreamReader consumes a packet. It is built for hostile input. The source is streamed in 64 KB chunks to a bounded temporary file before parsing. The reader enforces an aggregate byte cap during that write. The libxml entity loader is set to null for the parse and restored afterward. A DOCTYPE triggers a hard rejection. iterateProperties() returns a generator that yields (namespaceUri, localName, textContent) tuples for each leaf element without building the whole tree in memory; only the current element and its text node are alive in the parser at any moment. An oversized packet raises PacketTooLargeException; malformed Extensible Markup Language (XML), a DOCTYPE, or non-UTF-8 input raises InvalidConfigException.
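The pipeline described above (bounded spooling, DOCTYPE rejection, leaf-by-leaf iteration) can be sketched with PHP's built-in XMLReader. This is an illustration under stated assumptions, not the XmpStreamReader source: the function name is hypothetical, XMLReader is a choice made for this sketch, 64 KB is taken as 65,536 bytes, standard SPL exceptions stand in for PacketTooLargeException and InvalidConfigException, and the entity-loader handling and UTF-8 check are omitted.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the reader's pipeline; not XmpStreamReader itself.
 *
 * @param resource $source Readable stream positioned at the start of the packet.
 * @return \Generator<int, array{0: string, 1: string, 2: string}>
 */
function sketchIterateLeaves($source, int $byteCap): \Generator
{
    $path = tempnam(sys_get_temp_dir(), 'xmp');
    $previous = libxml_use_internal_errors(true); // collect parse errors instead of warnings
    libxml_clear_errors();
    $reader = new XMLReader();
    try {
        // Step 1: spool to a temp file in 64 KB chunks, enforcing the aggregate cap.
        $tmp = fopen($path, 'wb');
        $written = 0;
        while (!feof($source) && ($chunk = fread($source, 65_536)) !== false) {
            $written += strlen($chunk);
            if ($written > $byteCap) {
                fclose($tmp);
                throw new LengthException("Packet exceeds {$byteCap} bytes.");
            }
            fwrite($tmp, $chunk);
        }
        fclose($tmp);

        // Step 2: pull-parse with network access disabled.
        if (!$reader->open($path, null, LIBXML_NONET)) {
            throw new RuntimeException('Cannot open spooled packet.');
        }
        $stack = []; // open elements: [namespaceUri, localName, text, hasChildElement]
        while ($reader->read()) {
            switch ($reader->nodeType) {
                case XMLReader::DOC_TYPE:
                    throw new RuntimeException('DOCTYPE is not accepted.');
                case XMLReader::ELEMENT:
                    if ($stack !== []) {
                        $stack[array_key_last($stack)][3] = true; // parent is not a leaf
                    }
                    if ($reader->isEmptyElement) {
                        yield [$reader->namespaceURI, $reader->localName, ''];
                    } else {
                        $stack[] = [$reader->namespaceURI, $reader->localName, '', false];
                    }
                    break;
                case XMLReader::TEXT:
                case XMLReader::CDATA:
                    if ($stack !== []) {
                        $stack[array_key_last($stack)][2] .= $reader->value;
                    }
                    break;
                case XMLReader::END_ELEMENT:
                    [$ns, $name, $text, $hasChild] = array_pop($stack);
                    if (!$hasChild) {
                        yield [$ns, $name, $text];
                    }
                    break;
            }
        }
        // read() returns false at end of input and on a parse error; tell them apart.
        $error = libxml_get_last_error();
        if ($error !== false && $error->level >= LIBXML_ERR_ERROR) {
            throw new RuntimeException('Malformed XML: ' . trim($error->message));
        }
    } finally {
        $reader->close();
        libxml_use_internal_errors($previous);
        if (is_file($path)) {
            unlink($path);
        }
    }
}

$stream = fopen('php://memory', 'w+b');
fwrite($stream, '<x:xmpmeta xmlns:x="adobe:ns:meta/" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">'
    . '<pdf:Producer>Demo</pdf:Producer></x:xmpmeta>');
rewind($stream);
$rows = iterator_to_array(sketchIterateLeaves($stream, 4096), false);
```

Because the function is a generator, none of the work, including the cap check, runs until the first iteration. The `finally` block also runs if the consumer abandons the loop early, so the temporary file is removed either way.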

XmpAuditFieldEmitter is the engine-specific extension. It renders an AuditReport into a custom XMP field under the nextpdfAudit namespace, so a document’s conformance audit travels with the file as standards-compliant XMP instead of as a sidecar. The emitter does not produce the AuditReport it renders. The caller activates enrichment by running a render under CssRenderingMode::Audit with a caller-supplied auditCollector configured through Config(auditCollector: ...). The collector is caller-driven: the caller feeds it, and the emitter renders whatever it has collected. The emitter is newer than the core XMP surface: it is @since 5.4.0, while the builder and reader are @since 2.0.0.

| Class | Key members | Role |
| --- | --- | --- |
| XmpMetadataBuilder | build(): string, XPACKET_GUID, XPACKET_BOM | Serializes a property set into an XMP packet (@since 2.0.0) |
| XmpStreamReader | iterateProperties(mixed $source, int $byteCap = DEFAULT_BYTE_CAP): \Generator, DEFAULT_BYTE_CAP | Bounded, streaming, DOCTYPE-rejecting XMP reader (@since 2.0.0) |
| PacketTooLargeException | extends NextPdfException | Raised when an XMP packet exceeds the byte cap (@since 2.0.0) |
| XmpAuditFieldEmitter | render(?AuditReport $report): string, NAMESPACE_URI | Renders the audit trail as a custom XMP field (@since 5.4.0) |

Run composer docs:generate-api-php -- --module=Metadata to generate the full PHPDoc table.

Stream properties out of an existing XMP packet under an explicit byte cap.

<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';
use NextPDF\Metadata\Xmp\XmpStreamReader;
$reader = new XmpStreamReader();
// Cap this read at 1 MiB (1_048_576 bytes).
foreach ($reader->iterateProperties(file_get_contents('/srv/in/xmp.xml'), byteCap: 1_048_576) as [$ns, $name, $value]) {
    printf("%s:%s = %s\n", $ns, $name, $value);
}

Read a packet defensively, and map the module’s typed failures to an application-level outcome instead of letting raw parser faults escape.

<?php
declare(strict_types=1);
require_once __DIR__ . '/../vendor/autoload.php';
use NextPDF\Exception\InvalidConfigException;
use NextPDF\Metadata\Xmp\PacketTooLargeException;
use NextPDF\Metadata\Xmp\XmpStreamReader;
use Psr\Log\LoggerInterface;
final readonly class XmpIngestService
{
    public function __construct(
        private XmpStreamReader $reader,
        private LoggerInterface $logger,
    ) {}
    /**
     * @param resource|string $source A stream resource or XMP byte string.
     *
     * @return array<string, string> Flattened "ns:name" => value map.
     */
    public function ingest(mixed $source): array
    {
        $properties = [];
        try {
            // Cap untrusted XMP at 4 MiB (4_194_304 bytes) regardless of the 1 GiB default.
            foreach ($this->reader->iterateProperties($source, byteCap: 4_194_304) as [$ns, $name, $value]) {
                $properties["{$ns}:{$name}"] = $value;
            }
        } catch (PacketTooLargeException $e) {
            // Oversized packet: log and return an empty result.
            $this->logger->warning('XMP packet exceeded ingest cap; rejected.', ['error' => $e->getMessage()]);
            return [];
        } catch (InvalidConfigException $e) {
            // Malformed XML, a DOCTYPE, or non-UTF-8 input.
            $this->logger->warning('XMP packet malformed or unsafe; rejected.', ['error' => $e->getMessage()]);
            return [];
        }
        return $properties;
    }
}
  • XmpStreamReader rejects any DOCTYPE outright. This is an XML External Entity (XXE) defense, not a validation nicety; a packet that needs a DOCTYPE is not accepted. Sanitize it upstream.
  • The byte cap defaults to 1 GiB (DEFAULT_BYTE_CAP). That default is a ceiling, not a recommendation. Pass a tight byteCap for untrusted input.
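  • iterateProperties() is a generator. Consume it once. PHP generators cannot be rewound after they have advanced, so a second foreach over the same generator does not replay the packet; it throws.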
  • The reader sets the libxml entity loader to null for the parse and restores it. Do not run it concurrently with other libxml-based parsing in the same request if that parsing depends on the entity loader.
  • XmpAuditFieldEmitter::render(null) is valid and yields an empty rendering; a null AuditReport means “no audit”, not an error.
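The single-consumption point in the list above is a property of PHP generators in general, and a short standalone sketch shows it. The generator below is a hypothetical stand-in that yields tuples in the same (namespaceUri, localName, textContent) shape. It does not call the module.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a property generator; not the module's reader. */
function sketchTuples(): \Generator
{
    yield ['http://ns.adobe.com/pdf/1.3/', 'Producer', 'Demo'];
    yield ['http://ns.adobe.com/pdf/1.3/', 'Keywords', 'a, b'];
}

$generator = sketchTuples();

// Materialize once when the tuples are needed more than one time.
$tuples = iterator_to_array($generator, false);

// A second pass over the same generator object does not replay; PHP throws.
$secondPassFailed = false;
try {
    foreach ($generator as $ignored) {
        // never reached
    }
} catch (\Exception $e) {
    $secondPassFailed = true;
}
```

Materializing into an array trades the streaming memory profile for replayability, so it suits small packets read under a tight byte cap.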

The builder is linear in the property count. The reader’s memory use is dominated by the longest single text run, not by document size, because only the current element is alive in the parser; large packets stream instead of loading into memory. The default reference workload sits within a 1500 ms wall / 64 MB peak budget. The reproducibility profile is structural: an XMP packet records modification timestamps. Two builds of the same logical metadata differ in those fields, while their structure is identical.
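One way to check structural reproducibility is to blank out the timestamp values before comparing two packets. The sketch below assumes the varying fields are xmp:ModifyDate and xmp:MetadataDate serialized as simple elements; the field list, the xmp prefix, and the helper name are assumptions for illustration and are not taken from the module.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: replace timestamp values with a fixed placeholder so
 * two packets can be compared structurally. The field list is an assumption.
 */
function sketchNormalizeTimestamps(string $packet): string
{
    $normalized = preg_replace(
        '~<(xmp:(?:ModifyDate|MetadataDate))>[^<]*</\1>~',
        '<$1>NORMALIZED</$1>',
        $packet,
    );
    if ($normalized === null) {
        throw new RuntimeException('Timestamp normalization failed.');
    }

    return $normalized;
}

$buildA = '<xmp:ModifyDate>2024-01-01T10:00:00Z</xmp:ModifyDate><pdf:Producer>P</pdf:Producer>';
$buildB = '<xmp:ModifyDate>2024-01-01T10:00:07Z</xmp:ModifyDate><pdf:Producer>P</pdf:Producer>';

$sameStructure = sketchNormalizeTimestamps($buildA) === sketchNormalizeTimestamps($buildB);
```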

XmpStreamReader parses untrusted XML and is hardened accordingly. Streamed chunking with an enforced byte cap bounds a memory-amplification denial of service. Rejecting DOCTYPE closes XXE. LIBXML_NONET blocks network entity resolution. Non-UTF-8 input is refused. Still set a deployment-appropriate byteCap for any externally sourced packet instead of relying on the gigabyte default. Treat XMP property values as untrusted strings when they re-enter the application. See the engine threat model in /modules/core/security/.
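As a small illustration of treating property values as untrusted strings, the sketch below validates UTF-8 and escapes a value before it reaches an HTML page. Both helpers are hypothetical and use only PHP core functions. They show one way an application might handle values after reading them, not what the reader does internally.

```php
<?php
declare(strict_types=1);

/** True when $bytes is well-formed UTF-8; the /u modifier makes PCRE validate the subject. */
function sketchIsUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

/** Escape an XMP property value before placing it in HTML output. */
function sketchHtml(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$title = '<b>"Quarterly"</b>';
$safe = sketchIsUtf8($title) ? sketchHtml($title) : '';
```

The right escaping depends on where the value is used: HTML output, SQL parameters, and shell arguments each need their own context-specific handling.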

The packet XmpMetadataBuilder produces is the in-PDF XMP metadata-stream representation defined in ISO 32000-2 §14.3. The XMP serialization form itself is governed by the XMP specification (ISO 16684-1), which is not in the verifiable citation corpus, so that requirement is referenced by number, not chunk-pinned. The behavior described on this page is implementation fact, produced by src/Metadata/Xmp/ and exercised by tests/Unit/Metadata/Xmp/. End-to-end metadata conformance for a profile (PDF/A, PDF/UA) is validated by the oracle and golden suites described in /modules/core/conformance/.