Convert¶
Transform documents to LLM-friendly Markdown.
Convert PDF, Word, PowerPoint, and Excel documents to Markdown with LLM-optimised output. Each conversion produces two files: a pure content file with exact line numbers, and a separate TOC file with frontmatter and navigation.
Highlights¶
- Two-file output: pure content + separate TOC with line numbers
- Parallel batch processing with glob patterns
- Diff-stable image naming using content hashes
- Format-specific features (PDF bookmarks, speaker notes, formulas)
Functions¶
| Function | Description |
|---|---|
convert.pdf(pattern, output_dir) |
Convert PDF documents to Markdown |
convert.word(pattern, output_dir) |
Convert Word documents (.docx) to Markdown |
convert.powerpoint(pattern, output_dir, include_notes) |
Convert PowerPoint presentations (.pptx) to Markdown |
convert.excel(pattern, output_dir, include_formulas) |
Convert Excel spreadsheets (.xlsx) to Markdown |
convert.auto(pattern, output_dir) |
Auto-detect format and convert |
Key Parameters¶
| Parameter | Type | Description |
|---|---|---|
pattern |
str | Glob pattern for input files (e.g., docs/*.pdf, **/*.docx) |
output_dir |
str | Directory for output Markdown files and images |
include_notes |
bool | Include speaker notes in PowerPoint conversion |
include_formulas |
bool | Include cell formulas in Excel conversion |
Output Format¶
Each conversion produces two files:
Main Content File ({name}.md)¶
Pure content starting at line 1 - no frontmatter or TOC. This ensures line numbers in the TOC are exact.
TOC File ({name}.toc.md)¶
A separate file with frontmatter and table of contents:
---
source: path/to/document.pdf
converted: 2026-01-20T10:30:00Z
pages: 15
checksum: sha256:abc123...
---
# Table of Contents
**Document:** [document.md](document.md)
## How to Use This TOC
Each entry shows `(lines <start>-<end>)` for the main document.
To read a section efficiently:
1. Find the section you need below
2. Use the line range to read only that portion of [document.md](document.md)
3. Line numbers are exact - no offset needed
---
## Contents
- [Introduction](document.md#introduction) (lines 1-50)
- [Background](document.md#background) (lines 15-30)
- [Results](document.md#results) (lines 51-100)
Diff-Stable Images¶
Images are named using content hashes (img_abc123.png) for stable diffs across regenerations.
Examples¶
Converting PDFs¶
# Single file
convert.pdf(pattern="report.pdf", output_dir="output")
# Batch conversion with glob
convert.pdf(pattern="docs/**/*.pdf", output_dir="converted")
Converting Word Documents¶
# Single document
convert.word(pattern="spec.docx", output_dir="output")
# All Word docs in folder
convert.word(pattern="documents/*.docx", output_dir="md")
Converting PowerPoint¶
# Basic conversion
convert.powerpoint(pattern="deck.pptx", output_dir="output")
# Include speaker notes
convert.powerpoint(
pattern="presentations/*.pptx",
output_dir="output",
include_notes=True
)
Converting Excel¶
# Basic conversion (values only)
convert.excel(pattern="data.xlsx", output_dir="output")
# Include formulas
convert.excel(
pattern="spreadsheets/*.xlsx",
output_dir="output",
include_formulas=True
)
Auto-Detection¶
# Convert all supported formats
convert.auto(pattern="documents/*", output_dir="converted")
# Recursive with mixed formats
convert.auto(pattern="input/**/*", output_dir="output")
Supported Formats¶
| Format | Extension | Converter |
|---|---|---|
.pdf |
PyMuPDF (fitz) | |
| Word | .docx |
python-docx |
| PowerPoint | .pptx |
python-pptx |
| Excel | .xlsx |
openpyxl |
Features by Format¶
PDF¶
- Lazy page loading for memory efficiency
- Outline-based heading extraction (uses PDF bookmarks)
- Falls back to "Page N" headers if no outline
- Image extraction with soft-mask transparency support
Word¶
- Heading style detection (Heading 1-6)
- Table conversion to Markdown
- Inline image extraction
- Bold, italic, underline formatting
PowerPoint¶
- Slide titles as H2 headers
- Bullet point detection
- Table conversion
- Image extraction
- Optional speaker notes
Excel¶
- Sheet-based sections
- Streaming for large files
- Markdown table formatting
- Optional formula extraction
Batch Processing¶
All converters support glob patterns and process multiple files in parallel:
# Process all PDFs in parallel
convert.pdf(pattern="archive/**/*.pdf", output_dir="output")
# Converted 25 files, 0 failed
# Outputs:
# output/report1.md + output/report1.toc.md
# output/report2.md + output/report2.toc.md
# ...