
Python SDK developer guide

The NextPDF Python Software Development Kit (SDK) is a thin, typed client for a NextPDF Connect endpoint. Your application owns Portable Document Format (PDF) input validation, credential handling, and concurrency policy. The SDK owns request construction, transport, and response typing. Keep that boundary clear: read the PDF safely, choose a client, call the ast method you need, and handle the specific failure.

Use this guide when you build extraction services, asyncio batch jobs, artificial intelligence (AI) agent tools, or command-line workflows around the SDK. It assumes you have read the overview and quickstart, and that you have Python 3.10 or newer and a NextPDF Connect endpoint.

| Layer | Owned by | Responsibility | Do not put here |
| --- | --- | --- | --- |
| Input source | Application | Authorize the caller, validate the PDF source, and choose the extraction policy. | Endpoint Uniform Resource Locator (URL) or credential literals. |
| Client construction | Application | Read base_url and api_key from the environment or a secret manager. | Hard-coded secrets. |
| NextPDF / AsyncNextPDF | SDK | Build the request, call Connect, and return typed Pydantic models. | Domain logic or storage policy. |
| ast method namespace | SDK | Map a method call to a Connect endpoint and parse the response. | Retry or backoff policy beyond what you configure. |
| NextPDF Connect endpoint | Deployment | Run extraction and enforce authentication, quotas, and licensing. | Application authorization. |

The SDK never performs optical character recognition (OCR). If a PDF is scanned or image-only, run OCR before extraction. Treat that step as an application concern outside this boundary.

| Stage | Behavior | Developer action |
| --- | --- | --- |
| Client construction | base_url and api_key are validated; either empty value raises ValueError. | Read both from the environment; never inline them. |
| Backend creation | A remote backend opens a pooled connection to Connect. | Reuse one client across calls instead of constructing per request. |
| Method call | The ast method serializes the request, sends PDF bytes, and parses the response into a Pydantic model. | Pass already-validated bytes. |
| Error mapping | The SDK maps a non-success Hypertext Transfer Protocol (HTTP) status to a specific exception subclass. | Catch the most specific class first. |
| Shutdown | AsyncNextPDF.close() releases the connection pool; the async context manager calls it for you. | Use async with or call close() in a finally block. |

| Path | Purpose |
| --- | --- |
| app/pdf/clients.py | Build and cache a configured NextPDF or AsyncNextPDF. |
| app/pdf/extraction.py | Application wrapper around the ast method calls. |
| app/pdf/validation.py | PDF source validation, size limits, and content checks. |
| tests/pdf/ | Extraction, failure-mode, and async-batching tests. |

Keep PDF validation separate from extraction. Pass only authorized, size-checked bytes into the extraction layer, and still rely on the endpoint for defense in depth.

import os

from nextpdf import NextPDF


def build_client() -> NextPDF:
    """Construct a synchronous client from environment configuration.

    Raises:
        KeyError: When a required environment variable is missing.
        ValueError: When base_url or api_key is empty (raised by the client constructor).
    """
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]
    return NextPDF(base_url=base_url, api_key=api_key)
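
One way to keep configuration errors readable is to load both values into a small settings object before constructing the client. The sketch below is illustrative and not part of the SDK: `ConnectSettings` and `load_settings` are hypothetical names, and only the two environment variable names come from this guide. The key is excluded from `repr()` so it does not leak into logs or tracebacks.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectSettings:
    """Hypothetical holder for the two values the client constructor needs."""

    base_url: str
    api_key: str = field(repr=False)  # keep the secret out of repr() and logs


def load_settings(env: Mapping[str, str] | None = None) -> ConnectSettings:
    """Read NEXTPDF_BASE_URL and NEXTPDF_API_KEY, rejecting missing or blank values."""
    source = os.environ if env is None else env
    values: dict[str, str] = {}
    for name in ("NEXTPDF_BASE_URL", "NEXTPDF_API_KEY"):
        value = source.get(name, "").strip()
        if not value:
            # Name the variable so the operator knows exactly what to fix.
            raise RuntimeError(f"Environment variable {name} is missing or empty")
        values[name] = value
    return ConnectSettings(
        base_url=values["NEXTPDF_BASE_URL"],
        api_key=values["NEXTPDF_API_KEY"],
    )
```

A caller would then pass `settings.base_url` and `settings.api_key` to the client constructor. Accepting an optional mapping keeps the loader testable without touching the real process environment.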

Use the synchronous NextPDF client for scripts, batch jobs, and notebooks. Validate input before you call the SDK, and handle the specific failures the call can raise.

from pathlib import Path

from nextpdf import (
    NextPDF,
    CitedTextBlock,
    NextPDFAPIError,
    NextPDFError,
    QuotaExceededError,
)

MAX_PDF_BYTES = 100 * 1024 * 1024  # Reject documents above 100 MiB for the in-memory path.


def read_pdf(path: Path) -> bytes:
    """Read and validate a PDF from disk.

    Raises:
        ValueError: When the file is missing, empty, oversized, or not a PDF.
    """
    if not path.is_file():
        raise ValueError(f"Not a file: {path}")
    # Check the size on disk first so an oversized file is never loaded into memory.
    size = path.stat().st_size
    if size == 0:
        raise ValueError("PDF is empty")
    if size > MAX_PDF_BYTES:
        raise ValueError("PDF exceeds the configured size limit; use the CLI streaming path")
    data = path.read_bytes()
    if not data.startswith(b"%PDF-"):
        raise ValueError("File does not look like a PDF")
    return data


def extract_text(client: NextPDF, path: Path) -> list[CitedTextBlock]:
    """Extract cited text blocks, handling the most specific failures first."""
    pdf_bytes = read_pdf(path)
    try:
        return client.ast.extract_cited_text(pdf_bytes)
    except QuotaExceededError as error:
        raise RuntimeError(f"Quota exceeded; retry after {error.retry_after}s") from error
    except NextPDFAPIError as error:
        raise RuntimeError(f"API error {error.status_code}: {error}") from error
    except NextPDFError as error:
        raise RuntimeError(f"SDK error: {error}") from error

One result item has this shape:

# blocks is the list returned by extract_text() above.
block = blocks[0]
print(block.text)                  # the extracted text
print(block.citation.page_index)   # 0-based page index
print(block.citation.confidence)   # between 0.0 and 1.0
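
Downstream code often filters blocks by confidence and groups them by page. The following is a minimal sketch of that step. To stay runnable without the SDK it uses stand-in dataclasses (`Citation`, `Block`) that mirror only the three attributes shown above; the real `CitedTextBlock` model may carry more fields, and the 0.8 threshold is an arbitrary illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Stand-in for the citation attached to a block (assumed shape)."""

    page_index: int    # 0-based page index
    confidence: float  # between 0.0 and 1.0


@dataclass(frozen=True)
class Block:
    """Stand-in for CitedTextBlock; only the fields used in this sketch."""

    text: str
    citation: Citation


def group_by_page(blocks: list[Block], min_confidence: float = 0.8) -> dict[int, list[str]]:
    """Group block text by page, dropping blocks below the confidence threshold."""
    pages: dict[int, list[str]] = defaultdict(list)
    for block in blocks:
        if block.citation.confidence >= min_confidence:
            pages[block.citation.page_index].append(block.text)
    return dict(pages)


blocks = [
    Block("Title", Citation(page_index=0, confidence=0.99)),
    Block("Smudged footer", Citation(page_index=0, confidence=0.41)),
    Block("Body text", Citation(page_index=1, confidence=0.93)),
]
print(group_by_page(blocks))  # {0: ['Title'], 1: ['Body text']}
```

Keeping the threshold as a parameter lets each caller decide how much low-confidence text to tolerate.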

Use the asynchronous AsyncNextPDF client inside asyncio runtimes such as FastAPI. Construct one client as an async context manager and share it across concurrent calls; do not open a client per document. Limit concurrency with a semaphore so you respect the endpoint’s quota.

import asyncio
import os

from nextpdf import (
    AsyncNextPDF,
    ExtractCitedTablesResponse,
    NextPDFError,
    QuotaExceededError,
)


async def extract_tables_batch(
    pdfs: list[bytes],
    *,
    max_concurrency: int = 4,
) -> list[ExtractCitedTablesResponse | None]:
    """Extract tables from many PDFs concurrently with one shared client.

    Returns one response per input PDF, or None where extraction failed.

    Raises:
        RuntimeError: When the endpoint reports that the quota is exceeded.
    """
    base_url = os.environ["NEXTPDF_BASE_URL"]
    api_key = os.environ["NEXTPDF_API_KEY"]
    semaphore = asyncio.Semaphore(max_concurrency)
    async with AsyncNextPDF(base_url=base_url, api_key=api_key) as client:

        async def one(pdf_bytes: bytes) -> ExtractCitedTablesResponse | None:
            async with semaphore:
                try:
                    return await client.ast.extract_cited_tables(pdf_bytes)
                except QuotaExceededError as error:
                    # Surface the backpressure signal; do not silently drop it.
                    raise RuntimeError(f"Quota exceeded; retry after {error.retry_after}s") from error
                except NextPDFError:
                    # Defined result: None marks a document whose extraction failed.
                    return None

        return await asyncio.gather(*(one(pdf) for pdf in pdfs))

Never write an empty except. Act on the failure, convert it to a defined result, or re-raise it.
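
To make "convert it to a defined result" concrete, a batch can return an explicit outcome per input instead of a bare `None`, so callers can tell why an item failed. This is a sketch with hypothetical names (`Outcome`, `run_all`); `ExtractionFailed` stands in for `NextPDFError` so the example runs without the SDK.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


class ExtractionFailed(Exception):
    """Stand-in for NextPDFError in this sketch."""


@dataclass(frozen=True)
class Outcome(Generic[T]):
    """Either a value or the reason extraction failed."""

    value: T | None = None
    error: str | None = None

    @property
    def ok(self) -> bool:
        return self.error is None


async def run_all(
    work: Callable[[bytes], Awaitable[T]],
    items: list[bytes],
) -> list[Outcome[T]]:
    """Run `work` on every item and record each failure instead of dropping it."""

    async def one(item: bytes) -> Outcome[T]:
        try:
            return Outcome(value=await work(item))
        except ExtractionFailed as error:
            # Defined result: keep the reason so the caller can log or retry.
            return Outcome(error=str(error))

    return await asyncio.gather(*(one(item) for item in items))
```

Only the stand-in exception is caught, so anything unexpected still propagates rather than being hidden behind a failed outcome.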

| Extension point | Use it for | Constraint |
| --- | --- | --- |
| AsyncNextPDF(backend=...) | Inject a custom or local backend in tests. | The backend must satisfy the PdfBackend protocol. |
| api_version argument | Pin a Connect application programming interface (API) version. | Defaults to v1; change only when the endpoint supports the target version. |
| Environment configuration | Supply NEXTPDF_BASE_URL and NEXTPDF_API_KEY to the command-line interface (CLI) and Model Context Protocol (MCP) server. | Treat the key as a secret scoped to the workload. |
| MCP server (python -m nextpdf.mcp) | Expose extraction tools to MCP-capable agents. | Requires the nextpdf[mcp] extra and a controlled endpoint. |

  1. Install the SDK with pip install nextpdf, or use pip install nextpdf[mcp] for the agent server.
  2. Read NEXTPDF_BASE_URL and NEXTPDF_API_KEY from the environment so no secret enters source control.
  3. Validate every PDF source for existence, size, and the %PDF- magic bytes before calling the SDK.
  4. Build one client per process and reuse it; for asyncio, hold it open with async with.
  5. Call the narrowest ast method for the task: extract_cited_text() for prose, extract_cited_tables() for tables, get_document_ast() only when you need the full tree.
  6. Catch the most specific exception you can act on, then fall back to NextPDFError.
  7. For documents over 100 MiB, use the CLI streaming path instead of materializing every block in memory.
  8. Run mypy in strict mode and add a failure-mode test for each exception you handle.

| Failure | Exception | Recommended response |
| --- | --- | --- |
| Untagged PDF, heuristics off | AstNoStructTreeError (HTTP 422) | Turn on heuristic mode on the endpoint or supply a tagged PDF. |
| Server-side build timeout | AstBuildTimeoutError (HTTP 504) | Reduce the page range and retry. |
| License tier required | NextPDFLicenseError (HTTP 402) | Upgrade the server license or fall back to a permitted feature. |
| Rate limit or quota | QuotaExceededError (HTTP 429) | Wait for retry_after seconds, then retry with backoff. |
| Other HTTP error | NextPDFAPIError | Inspect status_code and error_code; log and surface a defined error. |
| Any SDK error | NextPDFError | Final fallback; never let it escape as an unhandled exception. |
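
The recommended response to a rate limit (wait for `retry_after`, then retry with backoff) can be sketched as a small helper. This is an illustration under stated assumptions, not SDK behavior: `QuotaExceeded` is a stand-in for `QuotaExceededError` that carries the same `retry_after` attribute used earlier in this guide, and `call_with_quota_retry`, the attempt cap, and the jitter range are hypothetical choices.

```python
import random
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class QuotaExceeded(Exception):
    """Stand-in for QuotaExceededError; exposes retry_after in seconds."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"quota exceeded; retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_quota_retry(
    call: Callable[[], T],
    *,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, waiting retry_after plus a growing backoff between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except QuotaExceeded as error:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the quota error
            backoff = 2 ** (attempt - 1)       # 1s, 2s, 4s, ...
            jitter = random.uniform(0.0, 0.5)  # spread out concurrent retries
            sleep(error.retry_after + backoff + jitter)
    raise AssertionError("unreachable: the loop always returns or raises")
```

Injecting `sleep` keeps the helper testable without real delays. In an asyncio service the same logic would use `await asyncio.sleep(...)` instead.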

The endpoint reports failures with HTTP status semantics aligned with Request for Comments (RFC) 9110 and machine-readable error bodies aligned with RFC 9457. Each exception preserves the originating status_code. Map those failures to your own error responses rather than leaking transport detail to callers.
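
A sketch of that mapping follows. It translates an upstream status code into an application-level problem object using the standard RFC 9457 members (`type`, `title`, `status`, `detail`). The `type` URIs, the public status codes chosen, and the function name `to_public_problem` are illustrative assumptions; only the upstream status codes come from the failure table in this guide.

```python
from typing import TypedDict


class Problem(TypedDict):
    """Subset of the RFC 9457 problem-details members used in this sketch."""

    type: str
    title: str
    status: int
    detail: str


# Upstream Connect status -> (public status, type URI, title).
# The public codes and example.test URIs are illustrative choices.
_PUBLIC_ERRORS: dict[int, tuple[int, str, str]] = {
    422: (422, "https://errors.example.test/untagged-pdf", "PDF has no structure tree"),
    504: (504, "https://errors.example.test/extraction-timeout", "Extraction timed out"),
    402: (501, "https://errors.example.test/feature-unavailable", "Feature not available"),
    429: (503, "https://errors.example.test/busy", "Extraction temporarily unavailable"),
}
_FALLBACK = (502, "https://errors.example.test/extraction-failed", "Extraction failed")


def to_public_problem(upstream_status: int) -> Problem:
    """Translate an upstream status into a problem object that is safe to return."""
    status, type_uri, title = _PUBLIC_ERRORS.get(upstream_status, _FALLBACK)
    return Problem(
        type=type_uri,
        title=title,
        status=status,
        # Generic detail on purpose: upstream codes and messages stay in the logs.
        detail="The document could not be processed. Contact support if this persists.",
    )
```

An exception handler would pass the `status_code` preserved on the SDK exception to this function and serialize the result as the response body.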

| Concern | Default | When to override |
| --- | --- | --- |
| API version | v1. | Change only when the endpoint supports a newer version. |
| Transport Layer Security (TLS) verification | Enabled; no insecure switch is exposed. | Never disable for production traffic. |
| Credentials | Read from the environment; never inlined. | Use a secret manager in production. |
| In-memory size limit | Reject PDFs over 100 MiB on the client path. | Lower for multi-tenant services; use the CLI for larger files. |
| Concurrency | Bounded by a semaphore in async batches. | Tune to the endpoint’s quota, not to the host’s core count. |
| Logging | Log filename, size, status, and duration. | Never log PDF bytes or the API key. |

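
The logging default can be sketched as a small wrapper that records metadata only. The logger name `app.pdf`, the function `log_extraction`, and the status strings are hypothetical; the point is that neither the PDF bytes nor the API key ever reaches a log call.

```python
import logging
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")
logger = logging.getLogger("app.pdf")


def log_extraction(filename: str, pdf_bytes: bytes, extract: Callable[[bytes], T]) -> T:
    """Run `extract` and log filename, size, status, and duration, never the payload."""
    started = time.perf_counter()
    status = "error"  # overwritten only if extract() returns normally
    try:
        result = extract(pdf_bytes)
        status = "ok"
        return result
    finally:
        duration_ms = (time.perf_counter() - started) * 1000
        logger.info(
            "pdf_extraction filename=%s size_bytes=%d status=%s duration_ms=%.1f",
            filename,
            len(pdf_bytes),
            status,
            duration_ms,
        )
```

Because the log call sits in `finally`, a failed extraction is still recorded with its duration before the exception propagates to the caller.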
  • Construction tests assert that an empty base_url or api_key raises ValueError.
  • Validation tests cover missing, empty, oversized, and non-PDF inputs.
  • Extraction tests assert the returned model types and a CitationAnchor on each block.
  • Failure-mode tests cover AstNoStructTreeError, AstBuildTimeoutError, NextPDFLicenseError, QuotaExceededError, and NextPDFAPIError.
  • Async tests assert the client runs as an async with context manager and that concurrency stays within the semaphore bound.
  • Lifecycle tests assert that close() releases the transport and is idempotent.
  • Inject a fake backend with AsyncNextPDF(backend=...) so tests run without a live endpoint.
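
One way to write the async concurrency test is to count in-flight calls inside a fake extraction coroutine and assert on the peak. The sketch below uses only `asyncio`; `peak_concurrency` and `fake_extract` are hypothetical helpers standing in for a client method backed by a fake backend.

```python
import asyncio


async def peak_concurrency(n_items: int, max_concurrency: int) -> int:
    """Run n_items fake extractions under a semaphore and return the peak in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_extract() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # yield so other tasks get a chance to start
            in_flight -= 1

    await asyncio.gather(*(fake_extract() for _ in range(n_items)))
    return peak


def test_concurrency_stays_within_bound() -> None:
    # 20 documents, at most 4 extractions running at any moment.
    assert asyncio.run(peak_concurrency(n_items=20, max_concurrency=4)) <= 4
```

The counter updates are safe without a lock because asyncio tasks only switch at `await` points.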