Python SDK
At a glance
Use the NextPDF Python Software Development Kit (SDK) when your Python application, asyncio service, AI agent, or terminal workflow needs PDF extraction with provenance. The SDK returns structured blocks with citation anchors: page index, confidence, optional bounding box, and a semantic node identifier. You can trace every extracted value back to its source location.
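
To make the shape of a citation anchor concrete, here is a minimal sketch that models the four fields named above with plain dataclasses. The class names, field names, zero-based page index, and 0 to 1 confidence range are all assumptions for illustration; the SDK's own Pydantic models are documented in the API reference and may differ. The sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CitationAnchor:
    """Hypothetical stand-in for a citation anchor; not the SDK's real model."""
    page_index: int        # page the text came from (zero-based is an assumption)
    confidence: float      # extraction confidence (0.0 to 1.0 is an assumption)
    node_id: str           # semantic node identifier
    bbox: Optional[Tuple[float, float, float, float]] = None  # optional bounding box


@dataclass(frozen=True)
class CitedBlock:
    """Hypothetical extracted block: the text plus where it came from."""
    text: str
    anchor: CitationAnchor


def describe(block: CitedBlock) -> str:
    """Render a block together with a readable pointer to its source location."""
    a = block.anchor
    where = f"page index {a.page_index}, node {a.node_id}"
    if a.bbox is not None:
        where += f", bbox {a.bbox}"
    return f"{block.text!r} ({where}, confidence {a.confidence:.2f})"


# Made-up sample data, only to exercise the sketch.
block = CitedBlock(
    text="Total due: 1,250.00",
    anchor=CitationAnchor(page_index=2, confidence=0.97, node_id="n-0042"),
)
print(describe(block))
```

Keeping the anchor next to the text is what lets a downstream consumer, such as a report or an agent answer, point back to the page and node a value came from.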
The package includes a synchronous NextPDF client for scripts and notebooks, an asynchronous AsyncNextPDF client for asyncio runtimes, a nextpdf command-line interface (CLI) for streaming extraction from large files, and an optional Model Context Protocol (MCP) server that lets AI agents call extraction tools directly. All four paths use the same Abstract Syntax Tree (AST) surface through a NextPDF Connect endpoint.
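You need Python 3.10 or newer and, for production extraction, a NextPDF Connect endpoint. Install the SDK with pip install nextpdf. For the agent server, install the MCP extra with pip install "nextpdf[mcp]"; the quotes stop shells such as zsh from interpreting the square brackets as a glob pattern.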
Section map
| Page | Use it for |
|---|---|
| Overview | What the SDK provides, which backend to choose, and where the limits are. |
| Quickstart | Install the SDK and extract cited text with page-level provenance. |
| API reference | Clients, AST method chains, Pydantic models, CLI commands, and exceptions. |
| Developer guide | Architecture boundaries, runtime lifecycle, async batching, and failure handling. |
| CLI | Run citation-aware extraction from the terminal and stream large documents. |
| MCP server | Expose extraction tools to AI agents that support MCP. |
Primary APIs
| Symbol | Role |
|---|---|
| NextPDF | Synchronous client for scripts, batch jobs, and notebooks. |
| AsyncNextPDF | Asynchronous client and async context manager for asyncio runtimes. |
| client.ast.get_document_ast() | Builds the full Semantic AST from PDF bytes. |
| client.ast.extract_cited_text() | Extracts text blocks with citation anchors. |
| client.ast.extract_cited_tables() | Extracts tables with cell-level citation anchors. |
| client.ast.search_ast_nodes() | Finds nodes by type, page, or text query. |
| client.ast.get_ast_diff() | Compares two PDF versions structurally. |
| nextpdf | Command-line interface for terminal and pipeline extraction. |
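
The asynchronous client is intended for asyncio runtimes, where a common pattern is to process many documents with a cap on how many requests are in flight at once. The sketch below shows that pattern using only the standard library. `fake_extract` is a hypothetical stand-in for an awaitable extraction call; it is not part of the SDK, and the real `AsyncNextPDF` method signatures are not shown on this page, so check the API reference before adapting it.

```python
import asyncio
from typing import List


async def fake_extract(pdf_bytes: bytes) -> int:
    """Hypothetical stand-in for an awaitable SDK call; returns the byte count."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return len(pdf_bytes)


async def extract_all(documents: List[bytes], limit: int = 4) -> List[int]:
    """Run fake_extract over every document with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(doc: bytes) -> int:
        # The semaphore blocks here once `limit` extractions are running.
        async with semaphore:
            return await fake_extract(doc)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(doc) for doc in documents))


# Made-up byte strings standing in for real PDF contents.
results = asyncio.run(extract_all([b"%PDF-a", b"%PDF-bb", b"%PDF-ccc"], limit=2))
print(results)
```

Capping concurrency with a semaphore keeps a large batch from opening an unbounded number of simultaneous requests against the endpoint while still overlapping the waiting time of each call.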
See also
- Python SDK overview: capabilities, backends, and limits.
- Python SDK quickstart: your first extraction.
- Python API reference: every public symbol.