Python SDK overview

The NextPDF Python software development kit (SDK) is for Python applications that need Portable Document Format (PDF) extraction with provenance. It returns structured blocks with citation anchors, including page index, confidence, optional bounding boxes, and semantic node identifiers when the source PDF exposes that structure.

Use the SDK when your pipeline needs to answer questions like “Which page did this text come from?”, “Which table supports this value?”, or “What changed between these two PDFs?” without treating PDF extraction as anonymous plain text.
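
As a rough illustration of the first question, the sketch below answers "which page did this text come from?" over a list of cited blocks. The Citation and CitedBlock classes are hypothetical stand-ins, not the SDK's Pydantic models, and field names such as page_index, bbox, and node_id are assumptions chosen to mirror the description above.

```python
"""Stand-in models for cited blocks; class and field names are assumptions."""
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation anchor: where a block came from."""

    page_index: int
    confidence: float
    bbox: Optional[tuple[float, float, float, float]] = None
    node_id: Optional[str] = None


@dataclass(frozen=True)
class CitedBlock:
    """Hypothetical extracted block with its citation anchor."""

    text: str
    citation: Citation


def pages_mentioning(blocks: list[CitedBlock], needle: str) -> list[int]:
    """Return the sorted, de-duplicated page indexes whose text contains needle."""
    lowered = needle.casefold()
    return sorted(
        {b.citation.page_index for b in blocks if lowered in b.text.casefold()}
    )


blocks = [
    CitedBlock("Revenue grew 12% year over year.", Citation(page_index=0, confidence=1.0)),
    CitedBlock("Table 3 lists revenue by region.", Citation(page_index=4, confidence=0.5)),
    CitedBlock("Appendix A: methodology.", Citation(page_index=9, confidence=1.0)),
]
print(pages_mentioning(blocks, "revenue"))  # → [0, 4]
```

Because each block carries its own anchor, the lookup needs no second pass over the PDF; plain-text extraction would have lost the page information.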

The SDK includes:

  • A synchronous NextPDF client for scripts, batch jobs, and notebooks.
  • An asynchronous AsyncNextPDF client for asyncio, FastAPI, and other asynchronous runtimes.
  • A nextpdf command-line interface (CLI) for one-shot extraction from a file path or standard input that writes to standard output or a file.
  • An optional Model Context Protocol (MCP) server so artificial-intelligence (AI) agents can call PDF extraction tools directly.
  • A remote backend for production use with NextPDF Connect.
  • A local backend for offline, library-only extraction through pypdf.

The remote backend sends PDF bytes to a NextPDF Connect server. This is the recommended production path because it centralizes extraction behavior, authentication, quotas, and operational controls.

The local backend runs inside the Python process and reads PDFs through pypdf. It is useful for offline development and tagged PDFs, but it cannot provide precise bounding boxes and uses heuristic paragraph-level extraction for untagged PDFs. The local backend is library-only: inject a LocalBackend into AsyncNextPDF to use it. The nextpdf CLI and the MCP server cannot use it. See Backend choice matrix for the full comparison.

The SDK does not perform optical character recognition (OCR). Scanned or image-only PDFs need an OCR step before NextPDF can extract embedded text. Complex layouts, overlapping text, and unusual PDF producers can also reduce extraction quality.

The nextpdf CLI is remote-only (it cannot use the local pypdf backend) and is not a streaming interface. Each command reads the whole PDF into memory (from a file path or standard input), sends it to a NextPDF Connect server, builds the complete result in memory, and serializes it in a single write. You can send that output to standard output or to a file with --output (or -o); either way, the result is fully buffered, not produced incrementally.

Both clients share one ast method namespace for abstract syntax tree (AST) operations and return the same Pydantic models. They differ only in their concurrency model.

| Your context | Use | Why |
| --- | --- | --- |
| Scripts and batch jobs | NextPDF (sync) | Linear control flow; no event loop to manage. |
| Jupyter notebooks | NextPDF (sync) | run_sync detects the running event loop and dispatches to a worker thread, so blocking calls work inside cells. |
| The nextpdf CLI | NextPDF (sync, internal) | The CLI builds a sync client for you. |
| asyncio services | AsyncNextPDF | Native await; no thread hand-off. |
| FastAPI, Starlette, Asynchronous Server Gateway Interface (ASGI) | AsyncNextPDF | Shares the request event loop and the same connection pool. |
| High-concurrency fan-out | AsyncNextPDF | Run many extractions concurrently with asyncio.gather over one pooled client. |

NextPDF wraps an internal AsyncNextPDF and runs each call through run_sync. Inside a running event loop, such as a notebook, run_sync dispatches the coroutine to a single-worker thread with its own loop, so you do not hit the nested-asyncio.run error. In an asyncio or ASGI service, call AsyncNextPDF directly instead of paying for that thread hand-off on every call.
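
The dispatch described above can be sketched with the standard library alone. This is an illustrative stand-in that shares the run_sync name, not the SDK's implementation; fetch_answer and called_inside_loop are hypothetical helpers that simulate a blocking call made while a loop is already running.

```python
"""Illustrative run_sync stand-in: worker-thread dispatch when a loop is running."""
import asyncio
from collections.abc import Coroutine
from concurrent.futures import ThreadPoolExecutor
from typing import Any, TypeVar

_T = TypeVar("_T")


def run_sync(coro: Coroutine[Any, Any, _T]) -> _T:
    """Run coro to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (for example in a notebook cell). Calling
    # asyncio.run here would raise, so run the coroutine on a single worker
    # thread that gets its own event loop, and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fetch_answer() -> int:
    """Hypothetical coroutine standing in for an SDK call."""
    await asyncio.sleep(0)
    return 42


async def called_inside_loop() -> int:
    """Simulate a notebook cell: a loop is running, yet the call is blocking."""
    return run_sync(fetch_answer())


plain_result = run_sync(fetch_answer())
nested_result = asyncio.run(called_inside_loop())
print(plain_result, nested_result)  # → 42 42
```

The second path blocks the calling thread while the worker runs, which is acceptable in a notebook cell but is the hand-off cost an asyncio service avoids by awaiting AsyncNextPDF directly.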

The async client owns an httpx.AsyncClient for connection pooling, so reuse one AsyncNextPDF instance and close it once. The sync NextPDF client does not expose a close() method. For long-lived async workloads, prefer AsyncNextPDF and manage its lifecycle explicitly (see Production operational model).

A backend implements the PdfBackend protocol. The remote backend (RemoteBackend) is selected automatically when you pass base_url and api_key. You must inject the local backend (LocalBackend) explicitly through the backend= parameter of AsyncNextPDF; it is not exported from the top-level nextpdf package and is not reachable from the CLI or the MCP server.

| Capability | Remote (RemoteBackend) | Local (LocalBackend) |
| --- | --- | --- |
| Selected by | base_url + api_key | AsyncNextPDF(backend=LocalBackend(...)) |
| Network | NextPDF Connect over Hypertext Transfer Protocol Secure (HTTPS) | None; runs in-process |
| Authentication, quotas, metering | Centralized on the server | None |
| Observability and operational controls | Server-side | None |
| Tagged PDF (StructTree) extraction | Yes | Yes |
| Untagged PDF extraction | Server engine | Heuristic paragraph split, confidence 0.5 |
| Bounding boxes | Yes (when the server provides them) | No (bbox is None) |
| Table extraction on untagged PDFs | Server engine | Returns no tables |
| Reachable from CLI / MCP server | Yes | No (library-only) |
| Recommended for | Production | Offline development, tagged-PDF tests |

Use the remote backend for production because it is the only path with centralized authentication, quota enforcement, metering, and observability. Use the local backend for offline development and tests against tagged PDFs, while accepting heuristic results, no bounding boxes, and no tables on untagged input.

"""Inject the local backend for offline, library-only extraction."""
from nextpdf import AsyncNextPDF
from nextpdf.backends.local import LocalBackend


async def extract_offline(pdf_bytes: bytes) -> None:
    """Extract cited text without a NextPDF Connect server."""
    async with AsyncNextPDF(backend=LocalBackend()) as client:
        blocks = await client.ast.extract_cited_text(pdf_bytes)
        for block in blocks:
            # Heuristic blocks on untagged PDFs report confidence 0.5.
            print(block.citation.confidence, block.text)
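
Because heuristic blocks report confidence 0.5, a caller may want to route them to review instead of treating them like blocks from a tagged PDF. The sketch below shows one way to split a result list. Block and Citation are hypothetical stand-ins for the SDK models, and using 0.5 as the cut-off is an assumption, not SDK behavior.

```python
"""Split extracted blocks by confidence; models and threshold are assumptions."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical stand-in exposing only the confidence score."""

    confidence: float


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for an extracted block."""

    text: str
    citation: Citation


def split_by_confidence(
    blocks: list[Block], threshold: float = 0.5
) -> tuple[list[Block], list[Block]]:
    """Return (trusted, review): blocks strictly above threshold, then the rest."""
    trusted = [b for b in blocks if b.citation.confidence > threshold]
    review = [b for b in blocks if b.citation.confidence <= threshold]
    return trusted, review


sample = [
    Block("Tagged heading", Citation(1.0)),
    Block("Heuristic paragraph", Citation(0.5)),
]
trusted, review = split_by_confidence(sample)
print(len(trusted), len(review))  # → 1 1
```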

In production, run the remote backend against NextPDF Connect. The patterns below cover client reuse, error handling, retries, quota handling, and timeouts. Every symbol used here exists in the SDK. The SDK does not retry for you, so the retry loop is your responsibility.

RemoteBackend keeps one persistent httpx.AsyncClient for connection pooling. Construct AsyncNextPDF once, share it across requests, and close it on shutdown. Do not create a client per request.

"""Reuse one pooled async client for the lifetime of the process."""
import asyncio
import os
from pathlib import Path

from nextpdf import AsyncNextPDF


async def main() -> None:
    """Run several extractions over a single pooled client."""
    base_url = os.environ["NEXTPDF_BASE_URL"]
    # Treat the API key as a secret; read it from the environment, never hard-code it.
    api_key = os.environ["NEXTPDF_API_KEY"]
    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:
        pdf_paths = (Path("a.pdf"), Path("b.pdf"), Path("c.pdf"))
        tasks = [
            client.ast.get_document_ast(path.read_bytes())
            for path in pdf_paths
        ]
        documents = await asyncio.gather(*tasks)
        for document in documents:
            print(document.page_count, document.estimated_tokens)


if __name__ == "__main__":
    asyncio.run(main())

The async context manager calls close() on exit, which closes the underlying transport. Without a context manager, call await client.close() yourself.

Handle errors with the exception hierarchy

The SDK raises a typed exception hierarchy. All errors derive from NextPDFError; Hypertext Transfer Protocol (HTTP)-level failures derive from NextPDFAPIError and carry a status_code. Catch the specific types you can act on, and fall back to the base type.

| Exception | Raised when | Key attributes |
| --- | --- | --- |
| NextPDFError | Base type for every SDK error | status_code |
| NextPDFAPIError | Any HTTP error from the server | status_code, error_code |
| NextPDFLicenseError | HTTP 402; the feature needs a higher server tier | status_code (402) |
| QuotaExceededError | HTTP 429; rate limit or quota exceeded | retry_after |
| AstNoStructTreeError | HTTP 422; untagged PDF with heuristic mode off | status_code (422) |
| AstBuildTimeoutError | HTTP 504; AST build timed out | status_code (504) |

"""Map SDK exceptions to caller-facing outcomes."""
from nextpdf import (
    AstBuildTimeoutError,
    AstNoStructTreeError,
    AsyncNextPDF,
    NextPDFAPIError,
    NextPDFError,
    NextPDFLicenseError,
    QuotaExceededError,
)


async def safe_extract(client: AsyncNextPDF, pdf_bytes: bytes) -> str:
    """Extract text, translating known failures into a stable status string."""
    try:
        blocks = await client.ast.extract_cited_text(pdf_bytes)
    except QuotaExceededError as exc:
        # exc.retry_after holds the server Retry-After value in seconds, or None.
        return f"rate-limited; retry after {exc.retry_after}s"
    except NextPDFLicenseError:
        return "feature requires a higher server tier"
    except AstNoStructTreeError:
        return "untagged PDF; enable heuristic mode or use a tagged PDF"
    except AstBuildTimeoutError:
        return "build timed out; reduce the page range"
    except NextPDFAPIError as exc:
        return f"server error (status {exc.status_code})"
    except NextPDFError:
        return "extraction failed"
    return "\n".join(block.text for block in blocks)

The SDK does not retry automatically. Wrap calls in your own loop that retries on transient HTTP failures and honors the server Retry-After value, which QuotaExceededError exposes as retry_after (an integer number of seconds, or None). Use exponential backoff for other transient statuses, and do not retry NextPDFLicenseError.

"""Retry transient failures with exponential backoff and Retry-After support."""
import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

from nextpdf import NextPDFAPIError, QuotaExceededError

_RETRYABLE_STATUS = frozenset({500, 502, 503, 504})
_T = TypeVar("_T")


async def with_retry(
    coro_factory: Callable[[], Awaitable[_T]],
    *,
    max_attempts: int = 4,
) -> _T:
    """Call coro_factory() with bounded retries on transient server errors.

    Args:
        coro_factory: A zero-argument callable returning a fresh awaitable.
        max_attempts: Maximum number of attempts before giving up.

    Returns:
        The awaited result of the first successful attempt.

    Raises:
        ValueError: When max_attempts is less than 1.
        NextPDFAPIError: When all attempts fail or the error is not retryable.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except QuotaExceededError as exc:
            if attempt == max_attempts:
                raise
            # Prefer the server's Retry-After value; fall back to the backoff delay.
            await asyncio.sleep(exc.retry_after if exc.retry_after is not None else delay)
            delay *= 2.0
        except NextPDFAPIError as exc:
            if attempt == max_attempts or exc.status_code not in _RETRYABLE_STATUS:
                raise
            await asyncio.sleep(delay)
            delay *= 2.0
    # The loop always returns or raises on the final attempt.
    raise RuntimeError("unreachable")
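
The helper above doubles a 1-second delay with no upper bound and no randomization. If many workers fail at the same moment, capping the delay and adding jitter helps them avoid retrying in lockstep. The sketch below is one possible variation, not part of the SDK; backoff_schedule, with_full_jitter, and the 30-second cap are all hypothetical.

```python
"""Capped exponential backoff with optional full jitter (hypothetical helpers)."""
import random


def backoff_schedule(
    max_attempts: int,
    *,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
) -> list[float]:
    """Return the delays slept between attempts: one fewer than max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return [min(cap, base * factor**i) for i in range(max_attempts - 1)]


def with_full_jitter(delay: float, rng: random.Random) -> float:
    """Pick a random delay in [0, delay] so concurrent workers spread out."""
    return rng.uniform(0.0, delay)


print(backoff_schedule(4))  # → [1.0, 2.0, 4.0]
```

With the defaults, the schedule matches the delays the helper above produces for max_attempts=4; the cap only takes effect from the sixth delay onward.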

Quota and rate-limit enforcement live on the server. On HTTP 429, the SDK raises QuotaExceededError and parses the Retry-After header into retry_after. The remote backend also surfaces X-RateLimit-* headers on render responses, so you can throttle proactively before you hit a hard limit.
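
For readers who handle raw responses elsewhere in their stack, the sketch below shows how a Retry-After header value can be reduced to integer seconds or None. It is not the SDK's parser; parse_retry_after is a hypothetical helper, and accepting the HTTP-date form in addition to the delay-seconds form is an assumption about what a server might send.

```python
"""Parse a Retry-After header value into whole seconds (hypothetical helper)."""
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(
    value: Optional[str], *, now: Optional[datetime] = None
) -> Optional[int]:
    """Return the delay in seconds, or None when the value is absent or invalid."""
    if value is None:
        return None
    text = value.strip()
    if text.isascii() and text.isdigit():
        # delay-seconds form, for example "120".
        return int(text)
    try:
        # HTTP-date form, for example "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0, int((when - current).total_seconds()))


print(parse_retry_after("120"))  # → 120
```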

Request timeouts use a fixed default of 60 seconds total with a 10-second connect timeout (httpx.Timeout(60.0, connect=10.0)). To bound long AST builds, narrow the work with page_range_start, page_range_end, or token_budget instead of relying on the timeout alone; an over-long build returns AstBuildTimeoutError (HTTP 504).
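
One way to narrow the work is to split a long document into fixed-size page windows and request each window separately. The sketch below only computes the windows. page_windows is a hypothetical helper, and the zero-based, inclusive indexing is an assumption; check how page_range_start and page_range_end are defined in the API reference before passing these values through.

```python
"""Compute page windows for bounded AST builds (hypothetical helper)."""


def page_windows(page_count: int, window: int) -> list[tuple[int, int]]:
    """Split pages 0..page_count-1 into consecutive (start, end) pairs.

    Both bounds are inclusive and zero-based, which is an assumption about
    how the range parameters are interpreted.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        (start, min(start + window, page_count) - 1)
        for start in range(0, page_count, window)
    ]


# Each pair would be passed as page_range_start and page_range_end on one call.
print(page_windows(10, 4))  # → [(0, 3), (4, 7), (8, 9)]
```

Smaller windows mean more requests but a lower chance that any single build exceeds the server-side limit.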

A batch worker reads PDFs, extracts cited text, and writes structured output. Reuse one pooled client, bound concurrency with a semaphore, and apply the retry helper above.

"""Batch-extract a directory of PDFs over one pooled async client."""
import asyncio
import os
from pathlib import Path

from nextpdf import AsyncNextPDF


async def run_batch(input_dir: Path, concurrency: int = 8) -> None:
    """Extract cited text for every PDF in input_dir, bounded by concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(client: AsyncNextPDF, path: Path) -> None:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            # To add retries, wrap this call in the with_retry helper shown earlier.
            blocks = await client.ast.extract_cited_text(path.read_bytes())
            out = path.with_suffix(".txt")
            out.write_text("\n".join(b.text for b in blocks), encoding="utf-8")

    async with AsyncNextPDF(
        base_url=os.environ["NEXTPDF_BASE_URL"],
        api_key=os.environ["NEXTPDF_API_KEY"],
    ) as client:
        await asyncio.gather(*(worker(client, p) for p in input_dir.glob("*.pdf")))
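
To see that the semaphore really bounds the fan-out, the pattern can be exercised without a server. In the sketch below, FakeExtractor is a hypothetical stand-in that records how many calls are in flight, and bounded_map is a hypothetical generic version of the worker pattern above; neither is part of the SDK.

```python
"""Check semaphore-bounded fan-out against a fake extractor (no server needed)."""
import asyncio
from collections.abc import Awaitable, Callable, Iterable


async def bounded_map(
    items: Iterable[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int,
) -> list[str]:
    """Run worker over items with at most limit calls in flight; keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


class FakeExtractor:
    """Hypothetical stand-in for a client; tracks peak concurrent calls."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0

    async def extract(self, name: str) -> str:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # Simulated network latency.
        self.in_flight -= 1
        return name.upper()


fake = FakeExtractor()
names = [f"doc{i}" for i in range(10)]
results = asyncio.run(bounded_map(names, fake.extract, limit=3))
print(fake.peak, results[:2])  # → 3 ['DOC0', 'DOC1']
```

All ten tasks are scheduled at once, but only three ever run the extraction body at the same time, which is the property that protects a shared quota.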

A FastAPI service shares one AsyncNextPDF across requests during the application lifespan, so every request reuses the connection pool. Read credentials from the environment and treat the API key as a secret.

"""FastAPI service that shares one pooled NextPDF client across requests."""
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, UploadFile

from nextpdf import AsyncNextPDF


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Create the pooled client on startup and close it on shutdown."""
    app.state.client = AsyncNextPDF(
        base_url=os.environ["NEXTPDF_BASE_URL"],
        api_key=os.environ["NEXTPDF_API_KEY"],
    )
    try:
        yield
    finally:
        await app.state.client.close()


app = FastAPI(lifespan=lifespan)


@app.post("/extract")
async def extract(file: UploadFile) -> dict[str, list[str]]:
    """Return cited text blocks for an uploaded PDF."""
    pdf_bytes = await file.read()
    blocks = await app.state.client.ast.extract_cited_text(pdf_bytes)
    return {"text": [block.text for block in blocks]}

For AI agents, run the MCP server. It exposes PDF tools (for example nextpdf_extract_text, nextpdf_extract_tables, nextpdf_get_ast, nextpdf_info, nextpdf_search, nextpdf_get_outline, nextpdf_diff, and nextpdf_health) over standard input and output. The server reads NEXTPDF_BASE_URL and NEXTPDF_API_KEY from the environment, so it is remote-backed; like the CLI, it cannot use the local backend. Install the optional extra and run the module.

pip install "nextpdf[mcp]"
python -m nextpdf.mcp

See Python MCP server for the agent integration walkthrough, Python CLI for terminal usage, and Python API reference for the full client, model, and exception surface.