Typography: font registry, subsetting, CMap, encoding, BiDi
At a glance
The typography module turns a font file and a Unicode string into the bytes required by a Portable Document Format (PDF) content stream. It handles font parsing, the process-lifetime registry, glyph subsetting, ToUnicode CMap output, cmap-aware encoding strategies, and the Unicode bidirectional engine.
Install
```bash
composer require nextpdf/core:^3
```

Conceptual overview
FontRegistry stores fonts for the lifetime of a process and implements FontRegistryInterface. It parses a TrueType, OpenType, TrueType Collection (TTC), or Type 1 file (Printer Font Binary (PFB) and Adobe Font Metrics (AFM)) once and returns an immutable FontInfo. Use it for long-running workers: warm the font set at boot, then call lock(). After that, the registry rejects every mutation while lookups continue to serve traffic. It holds only pure PHP data: parsed metadata and the raw font bytes. A worker pool can share one instance. registerFromBinary() accepts raw font bytes, which the HyperText Markup Language (HTML) @font-face bridge uses for a font fetched from a remote source or a data URI.
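The lock-after-warmup behavior can be pictured with a minimal sketch. LockableRegistry and its methods are hypothetical names invented for this illustration; this is not the FontRegistry implementation, only the pattern it describes: mutations throw once the registry is locked, while lookups keep working.

```php
<?php
declare(strict_types=1);

// Hypothetical illustration of a lock-after-warmup registry.
// This is NOT the NextPDF FontRegistry; every name here is an assumption.
final class LockableRegistry
{
    /** @var array<string, string> alias => raw font bytes */
    private array $entries = [];

    private bool $locked = false;

    public function register(string $alias, string $bytes): void
    {
        if ($this->locked) {
            throw new LogicException('Registry is locked; register fonts at boot.');
        }
        $this->entries[$alias] = $bytes;
    }

    public function lock(): void
    {
        $this->locked = true;
    }

    public function has(string $alias): bool
    {
        // Lookups never check the lock, so they keep serving after lock().
        return isset($this->entries[$alias]);
    }
}

$registry = new LockableRegistry();
$registry->register('Body', 'raw-font-bytes');
$registry->lock();

$rejected = false;
try {
    $registry->register('Late', 'raw-font-bytes');
} catch (LogicException) {
    $rejected = true;
}
```

Because the entries are plain PHP values and never change after the lock, sharing one such instance across a worker pool needs no further coordination in this sketch.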
The engine embeds and subsets every font it uses. The embedded font program travels inside the PDF, so the document renders the same in any viewer and does not depend on installed system fonts (ISO 32000-2 §9). A subset carries only the glyphs the document references, which matters for Chinese, Japanese, and Korean (CJK) or Unicode-rich content (ISO 32000-2 §9). FontSubsetter parses the original table directory, extracts the cmap, resolves composite-glyph dependencies as a transitive closure, and rebuilds the head, hhea, maxp, cmap, loca, glyf, and hmtx tables. It preserves the original glyph identifier numbering and zero-fills unused slots, so a CIDToGIDMap of /Identity stays valid. It returns the original font unchanged when the subset would save less than ten percent, avoiding work that does not pay for itself. CffSubsetter performs the same operation for OpenType fonts that carry a Compact Font Format outline table.
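The transitive closure over composite glyphs can be sketched as a bounded breadth-first expansion. The function name, the shape of the component map, and the default pass cap are assumptions for illustration; the subsetter reads component references from the glyf table, which this sketch replaces with a plain array.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: expand a set of glyph IDs with every component
 * glyph they reference, directly or indirectly.
 *
 * @param list<int>             $requested  Glyph IDs the document uses.
 * @param array<int, list<int>> $components Composite glyph ID => component glyph IDs.
 * @param int                   $maxPasses  Upper bound on expansion passes (assumed value).
 *
 * @return list<int> Sorted closure of glyph IDs.
 */
function glyphClosure(array $requested, array $components, int $maxPasses = 100): array
{
    $seen = array_fill_keys($requested, true);
    $frontier = $requested;

    // Each pass follows one level of component references. The seen-set
    // already stops cycles; the pass cap is a second line of defense.
    for ($pass = 0; $pass < $maxPasses && $frontier !== []; $pass++) {
        $next = [];
        foreach ($frontier as $gid) {
            foreach ($components[$gid] ?? [] as $child) {
                if (!isset($seen[$child])) {
                    $seen[$child] = true;
                    $next[] = $child;
                }
            }
        }
        $frontier = $next;
    }

    $closure = array_keys($seen);
    sort($closure);

    return $closure;
}

// Glyph 5 is composite; 9 pulls in 12, and 12 points back at 5 (a cycle).
$closure = glyphClosure([5], [5 => [7, 9], 9 => [12], 12 => [5]]);
// $closure is [5, 7, 9, 12]
```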
Text emission has three translations: Unicode code point, character code in the content stream, and glyph identifier inside the font. The module keeps that path explicit. FontInfo::encodeText() is the facade; FontEncodingStrategyResolver dispatches per font. An embedded TrueType or OpenType font with a Unicode cmap routes to TrueTypeCmapStrategy, which emits a two-byte Identity-H hex stream. That is the shape required by a Type 0 font with an Identity-H CMap and a CIDFontType2 descendant (ISO 32000-2 §9.7.4; the matching retrieval-augmented generation (RAG) chunk digest was returned truncated by the license cap, recorded in _downgraded-claims-o3.md). Every other font — Base 14 standard fonts, Type 1 PFB and AFM — routes to Base14EncodingStrategy, which emits a single-byte WinAnsi literal string. That stream spans the full WinAnsiEncoding (Windows code page 1252) repertoire — accented Latin, the Euro sign, and common typographic punctuation. Code points outside it are dropped from the single-byte stream and take per-cluster font fallback when a covering font is registered (ISO 32000-2 Annex D.2). The resolver covers the full FontInfo value space; there is no nullable path. ToUnicodeCMapBuilder builds the /ToUnicode resource that lets a reader recover the original Unicode from an Identity-H font. It applies greedy bfrange coalescing and a 100-entry block cap.
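Greedy bfrange coalescing with a 100-entry block cap can be sketched as follows. The function names and the array shapes are hypothetical, and the sketch handles only Basic Multilingual Plane targets. It also makes a conservative assumption: a range never crosses a high-byte boundary on the source side and never carries out of the low byte on the destination side.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: merge a code => Unicode map into bfrange triples.
 *
 * @param array<int, int> $map 16-bit character code => BMP code point.
 *
 * @return list<array{int, int, int}> [firstCode, lastCode, firstUnicode]
 */
function coalesceRanges(array $map): array
{
    ksort($map);
    $ranges = [];

    foreach ($map as $code => $unicode) {
        $last = count($ranges) - 1;
        if ($last >= 0) {
            [$lo, $hi, $dst] = $ranges[$last];
            $contiguous = $code === $hi + 1 && $unicode === $dst + ($code - $lo);
            $sameHighByte = ($code >> 8) === ($lo >> 8);
            $noCarry = (($dst & 0xFF) + ($code - $lo)) <= 0xFF;
            if ($contiguous && $sameHighByte && $noCarry) {
                $ranges[$last][1] = $code; // extend the open range greedily
                continue;
            }
        }
        $ranges[] = [$code, $code, $unicode];
    }

    return $ranges;
}

/**
 * Emit the ranges in blocks of at most $blockCap entries.
 *
 * @param list<array{int, int, int}> $ranges
 */
function renderBfRanges(array $ranges, int $blockCap = 100): string
{
    $out = '';
    foreach (array_chunk($ranges, $blockCap) as $block) {
        $out .= count($block) . " beginbfrange\n";
        foreach ($block as [$lo, $hi, $dst]) {
            $out .= sprintf("<%04X> <%04X> <%04X>\n", $lo, $hi, $dst);
        }
        $out .= "endbfrange\n";
    }

    return $out;
}

$ranges = coalesceRanges([
    0x0024 => 0x0041,
    0x0025 => 0x0042,
    0x0026 => 0x0043,
    0x0030 => 0x4E2D,
]);
$cmapBody = renderBfRanges($ranges);
```

In this example the three consecutive Latin entries collapse into one range and the CJK entry stays a single-code range, so the block header reads `2 beginbfrange`.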
BidiEngine is the boundary service for the Unicode Bidirectional Algorithm, defined by Unicode Standard Annex #9 (UAX #9), Unicode 16. With isolate support off, it delegates to the legacy resolver so existing callers see the same behavior. With isolate support on, it runs the isolate-aware pipeline: the explicit-isolate stack with a maximum depth of 125, the weak-type passes, the neutral-type passes including paired-bracket resolution, and the implicit-level and line-reordering passes. CJK glyph coverage for a candidate font is a separate diagnostic: CjkFontValidator samples the required Unicode blocks per script and reports a coverage percentage.
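The line-reordering pass is the easiest stage to show in isolation. The sketch below implements UAX #9 rule L2 over already-resolved embedding levels: from the highest level down to the lowest odd level, reverse every contiguous run at that level or higher. The function name and the array-based input are assumptions for illustration; they are not the BidiEngine API.

```php
<?php
declare(strict_types=1);

/**
 * Sketch of UAX #9 rule L2 (hypothetical helper, not the BidiEngine API).
 *
 * @param list<string> $chars  One entry per character, in logical order.
 * @param list<int>    $levels Resolved embedding level per character.
 *
 * @return list<string> Characters in visual order.
 */
function reorderLine(array $chars, array $levels): array
{
    $odd = array_filter($levels, static fn (int $l): bool => $l % 2 === 1);
    if ($chars === [] || $odd === []) {
        return $chars; // nothing is right-to-left, so logical order is visual order
    }

    $count = count($chars);
    for ($level = max($levels); $level >= min($odd); $level--) {
        $i = 0;
        while ($i < $count) {
            if ($levels[$i] < $level) {
                $i++;
                continue;
            }
            $start = $i;
            while ($i < $count && $levels[$i] >= $level) {
                $i++;
            }
            // Reverse the run [$start, $i) in place. The levels array can stay
            // untouched: every position in the run is also part of the
            // enclosing lower-level run handled by a later pass.
            $run = array_reverse(array_slice($chars, $start, $i - $start));
            array_splice($chars, $start, $i - $start, $run);
        }
    }

    return $chars;
}

// Upper-case letters stand in for right-to-left characters (level 1);
// the digits inside them resolve to level 2.
$visual = implode('', reorderLine(
    ['a', ' ', 'A', 'B', ' ', '1', '2'],
    [0, 0, 1, 1, 1, 2, 2],
));
// $visual is "a 12 BA"
```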
API surface
| Type | Kind | Key members | Stability | Since |
|---|---|---|---|---|
| FontRegistry | final class | register(), registerType1(), registerFromBinary(), registerFromDirectory(), get(), has(), all(), warmup(), lock(), isLocked(), memoryUsage() | stable | 1.7.0 |
| FontInfo | final readonly class | $family, $type, $widths, $unicodeMap, $cmapForward, getKey(), encodeText() | stable | 1.0.0 |
| FontSubsetter | final class | subset(string, array<int>, int): string | stable | 1.0.0 |
| CffSubsetter | final class | OpenType/CFF outline subsetting | stable | 1.0.0 |
| FontEncodingStrategyResolver | final class | resolve(FontInfo): FontEncodingStrategy | stable | 2.7.0 |
| ToUnicodeCMapBuilder | final class | buildFromRun(), buildFromMap(), encodeUnicodeUtf16Be() | stable | 2.7.0 |
| BidiEngine | final class | UAX #9 isolate-aware resolution | stable | 3.1.0 |
| CjkFontValidator | final class | validateCoverage(), detectScript(), isCjkCodepoint() | stable | 1.0.0 |

FontInfo is immutable: its constructor signature and public properties are frozen. The encoding strategies are pure functions of (FontInfo, UTF-8 text): the same input returns the same EncodedGlyphRun on every call.
Code sample — Quick start
```php
<?php
declare(strict_types=1);

require_once __DIR__ . '/../vendor/autoload.php';

use NextPDF\Typography\Encoding\EncodingMode;
use NextPDF\Typography\FontRegistry;

$registry = new FontRegistry();
$cjkFont = $registry->register('/path/to/NotoSansTC-Regular.ttf', alias: 'NotoSansTC');

$encoded = $cjkFont->encodeText('PDF 2.0 引擎 — 使用 CMap 編碼');

// An embedded CJK TrueType face resolves to the two-byte Identity-H path.
assert($encoded->mode === EncodingMode::TwoByteCid);
```

register() parses the font once and returns immutable FontInfo. encodeText() routes through the resolver and returns an EncodedGlyphRun with the byte stream, the PDF string operand, per-glyph advance widths, and the glyph identifier (GID)-to-Unicode map that a /ToUnicode CMap consumes.
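To make the PDF string operand concrete, the sketch below shows how a list of glyph identifiers becomes a two-byte hex string under Identity-H with a CIDToGIDMap of /Identity, where the character code equals the glyph identifier. The function name is hypothetical; it is a standalone illustration, not the code behind EncodedGlyphRun.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: render glyph IDs as an Identity-H hex string operand.
 *
 * @param list<int> $glyphIds Glyph identifiers in emission order.
 */
function identityHHex(array $glyphIds): string
{
    $hex = '';
    foreach ($glyphIds as $gid) {
        if ($gid < 0 || $gid > 0xFFFF) {
            throw new InvalidArgumentException("Glyph ID {$gid} is outside the two-byte range.");
        }
        // Two bytes per glyph, big-endian, written as four hex digits.
        $hex .= sprintf('%04X', $gid);
    }

    return '<' . $hex . '>';
}

$operand = identityHHex([36, 1234, 20013]);
// $operand is "<002404D24E2D>"
```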
Code sample — Production
```php
<?php
declare(strict_types=1);

require_once __DIR__ . '/../../vendor/autoload.php';

use NextPDF\Exception\NextPdfException;
use NextPDF\Typography\FontRegistry;
use Psr\Log\LoggerInterface;

final readonly class FontBootstrap
{
    public function __construct(
        private FontRegistry $registry,
        private LoggerInterface $logger,
    ) {}

    /**
     * Warm a font set at worker boot, then lock the registry for the
     * lifetime of the process.
     *
     * @param list<string> $fontFiles Absolute paths to font files.
     */
    public function boot(array $fontFiles): void
    {
        try {
            $this->registry->warmup($fontFiles);
            $this->registry->lock();
        } catch (NextPdfException $e) {
            $this->logger->error('Font warmup failed', ['error' => $e->getMessage()]);

            throw $e;
        }

        $report = $this->registry->memoryUsage();
        $this->logger->info('Font cache primed', [
            'fonts' => $report->entryCount,
            'bytes' => $report->currentBytes,
        ]);
    }
}
```

warmup() followed by lock() is the worker boot sequence. After lock(), every mutation throws, and lookups continue to serve traffic. memoryUsage() returns a MemoryReport, so a worker can track the font cache against its budget.
Edge cases & gotchas
- When the registry is locked, it rejects register(), registerFromBinary(), addFontDirectory(), and warmup(). Warm up and lock at boot; never register during request handling.
- FontSubsetter::subset() returns the original bytes unchanged when the saving would be under ten percent or when an essential table is missing. A returned font that equals the input is the documented no-gain path, not a failure.
- The subsetter preserves original glyph identifier numbering and zero-fills unused glyphs. This keeps CIDToGIDMap /Identity valid; do not assume glyph identifiers are renumbered into a contiguous range.
- registerFromBinary() writes the bytes to a temporary file for parsing and deletes both the extension file and the tempnam() base file in a finally block. Untrusted font data is a parsing-attack surface; gate it before it reaches the parser (see Security notes).
- BidiEngine delegates verbatim to the legacy resolver when isolate support is off. Isolate formatting characters then pass through as boundary-neutral. Turn isolate support on through the conformance policy for full UAX #9 behavior.
- CjkFontValidator samples code points at a stride rather than testing every one, so its coverage figure is a statistically adequate estimate, not an exhaustive count.
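Stride sampling, as described in the last item, can be sketched in a few lines. The function name, the cmap shape (code point to glyph identifier), and the stride of 64 are assumptions chosen for the example; the validator's real blocks and stride are not stated on this page.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: estimate block coverage by testing every Nth code point.
 *
 * @param array<int, int> $cmap   Code point => glyph ID (assumed shape).
 * @param int             $first  First code point of the block.
 * @param int             $last   Last code point of the block (inclusive).
 * @param int             $stride Distance between sampled code points.
 *
 * @return float Coverage estimate in percent.
 */
function sampledCoverage(array $cmap, int $first, int $last, int $stride): float
{
    if ($stride < 1 || $last < $first) {
        throw new InvalidArgumentException('Stride must be positive and the range non-empty.');
    }

    $tested = 0;
    $covered = 0;
    for ($cp = $first; $cp <= $last; $cp += $stride) {
        $tested++;
        // Glyph 0 is .notdef, so it does not count as coverage.
        if (($cmap[$cp] ?? 0) !== 0) {
            $covered++;
        }
    }

    return 100.0 * $covered / $tested;
}

// Toy font that maps only the first half of CJK Unified Ideographs (U+4E00..U+9FFF).
$cmap = [];
for ($cp = 0x4E00; $cp < 0x7700; $cp++) {
    $cmap[$cp] = $cp - 0x4E00 + 1;
}

$percent = sampledCoverage($cmap, 0x4E00, 0x9FFF, 64);
// 328 code points are sampled, 164 of them are mapped, so $percent is 50.0
```

The estimate is only as good as the stride: a font whose gaps happen to fall between sampled code points would score higher than an exhaustive count.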
Performance
Font parsing dominates first use; the registry amortizes that cost to once per process. After warmup, get() and has() are O(1) map lookups. Subsetting cost scales with the glyph count the document uses, not with the font’s full glyph table. That is why subsetting improves both size and speed for CJK content: the subsetter handles fonts with 20,000-plus glyphs through binary search, pre-allocated buffers, and bulk string operations. Composite-glyph resolution is bounded; it caps at 100 closure iterations to defend against circular component references. The cmap Format 12 parser caps group and entry counts to limit memory use for hostile font input. The performance_budget of 1500 ms wall and 64 MB peak covers a typical font warmup plus document rendering.
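As an illustration of the binary-search lookup, the sketch below resolves a code point against cmap Format 12 sequential map groups. Each group holds a start character code, an end character code, and a start glyph identifier. The function name and the pre-parsed array input are assumptions; the sketch does not read font bytes.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: look up a glyph ID in sorted cmap Format 12 groups.
 *
 * @param list<array{int, int, int}> $groups    [startCharCode, endCharCode, startGlyphID],
 *                                              sorted by startCharCode, non-overlapping.
 * @param int                        $codePoint Unicode code point to resolve.
 *
 * @return int Glyph ID, or 0 (.notdef) when no group covers the code point.
 */
function lookupGlyph(array $groups, int $codePoint): int
{
    $lo = 0;
    $hi = count($groups) - 1;

    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        [$start, $end, $startGid] = $groups[$mid];

        if ($codePoint < $start) {
            $hi = $mid - 1;
        } elseif ($codePoint > $end) {
            $lo = $mid + 1;
        } else {
            // Glyph IDs run sequentially inside a group.
            return $startGid + ($codePoint - $start);
        }
    }

    return 0;
}

$groups = [
    [0x0041, 0x005A, 36],   // A..Z
    [0x4E00, 0x4E2D, 1000], // a slice of CJK Unified Ideographs
];

$latin = lookupGlyph($groups, 0x0042);   // 37
$cjk = lookupGlyph($groups, 0x4E2D);     // 1045
$missing = lookupGlyph($groups, 0x3000); // 0
```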
Security notes
Two surfaces carry security weight. The first is font input. register() and registerFromBinary() parse arbitrary bytes. registerFromBinary() materializes a temporary file. The boundary rejects stream wrappers and null bytes in paths. Untrusted font data must pass an external-resource policy that bounds file size and glyph count before it reaches the parser. The subsetter’s binary readers bounds-check every offset. The cmap parsers cap group, entry, and table counts (numGroups > 31000 and an entry cap of 200,000 in Format 12) so a crafted font cannot drive unbounded allocation. The second surface is text recovery: ToUnicodeCMapBuilder validates that every character code is inside the 16-bit codespace and every Unicode value is a valid scalar. It rejects surrogate halves, so a malformed map cannot produce a corrupt extraction resource. Treat any externally supplied font or text as untrusted.
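The text-recovery checks can be sketched as a single validating encoder. The function name and the output shape are hypothetical, not the ToUnicodeCMapBuilder API. The checks mirror the rules stated above: the character code must fit the 16-bit codespace, and the Unicode value must be a scalar, which excludes surrogate halves.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: validate one mapping and render it as
 * "<code> <UTF-16BE hex>".
 */
function toUnicodeEntry(int $charCode, int $codePoint): string
{
    if ($charCode < 0 || $charCode > 0xFFFF) {
        throw new InvalidArgumentException('Character code is outside the 16-bit codespace.');
    }
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Value is outside the Unicode range.');
    }
    if ($codePoint >= 0xD800 && $codePoint <= 0xDFFF) {
        throw new InvalidArgumentException('Surrogate halves are not Unicode scalar values.');
    }

    if ($codePoint < 0x10000) {
        $utf16 = sprintf('%04X', $codePoint);
    } else {
        // Supplementary planes need a surrogate pair in UTF-16BE.
        $offset = $codePoint - 0x10000;
        $utf16 = sprintf('%04X%04X', 0xD800 | ($offset >> 10), 0xDC00 | ($offset & 0x3FF));
    }

    return sprintf('<%04X> <%s>', $charCode, $utf16);
}

$bmp = toUnicodeEntry(0x0024, 0x4E2D);    // "<0024> <4E2D>"
$astral = toUnicodeEntry(0x0025, 0x1F600); // "<0025> <D83DDE00>"
```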
Conformance
| Claim | Standard | Clause | Evidence |
|---|---|---|---|
| Every font used by the document is embedded so the document renders without relying on system fonts. | ISO 32000-2 | §9 | |
| The embedded font is subset to the glyphs the document references. | ISO 32000-2 | §9 | |
| An embedded CJK TrueType face is emitted as a Type 0 font with an Identity-H CMap and a CIDFontType2 descendant. | ISO 32000-2 | §9.7.4 | RAG digest truncated by license cap; prefix 7a5258772f508e3b, see _downgraded-claims-o3.md |

The first two clauses are paraphrased and digest-pinned. The third clause’s full RAG digest was not returned (license cap truncation); ADR-013 and the cmap-encoder developer overview corroborate it, and it is recorded as downgraded. NextPDF does not reproduce normative text. PDF/A-4 and PDF/UA-2 conformance for CJK content is gated on the writer-side subsetting and /ToUnicode wiring tracked there.
Commercial context
A commercial OpenType feature pack and premium font-fallback chains build on the Core registry and encoding layer. The Core typography module embeds, subsets, and encodes every font without a license; the paid pack adds curated fallback resolution. The omitted conversion link is intentional: this page is documentation, not a sales path.
See also
- Font: TrueType, OpenType, and CID registry — font value types, embedding, and fallback.
- Text: shaping, breaking, BiDi — run handling and shaping that consumes encoded glyphs.
- Contracts / Typography — the FontRegistryInterface and text-preprocessor contracts.
- HTML rendering engine — the @font-face bridge that calls registerFromBinary().