Web Fetch
Extracts the main content from web pages, filtering out navigation, ads, and boilerplate.
Highlights
- Clean content extraction that filters out navigation and ads
- Multiple output formats (markdown, text, json)
- Batch processing with concurrent execution
- Output truncation via the max_length parameter
- URL validation with helpful error messages
- JSON-structured errors when using json output format
- Optional HTTP response metadata
- Non-HTML content (plain text, JSON, XML, CSV) returned directly without extraction
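The URL validation mentioned above can be pictured with a small sketch. The tool's actual rules and error wording are not documented here, so `validate_url` and its messages are hypothetical, built only on the standard library's `urllib.parse`.

```python
from typing import Optional
from urllib.parse import urlparse


def validate_url(url: str) -> Optional[str]:
    """Return None if the URL looks fetchable, else a readable error message.

    Hypothetical helper: the real tool's checks and messages may differ.
    """
    if not url or not url.strip():
        return "URL is empty"
    parsed = urlparse(url.strip())
    # Only web schemes are accepted in this sketch (an assumption).
    if parsed.scheme not in ("http", "https"):
        return f"Unsupported or missing scheme {parsed.scheme!r}; expected http or https"
    if not parsed.netloc:
        return "URL has no host"
    return None
```

Returning a message instead of raising keeps the check easy to reuse in a batch, where one bad URL should not abort the rest.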
Functions
| Function | Description |
|---|---|
| web.fetch(url, ...) | Fetch and extract content from a URL |
| web.fetch_batch(urls, ...) | Fetch multiple URLs concurrently |
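To illustrate what fetching multiple URLs concurrently can look like, here is a minimal sketch using a thread pool. It is not the implementation behind web.fetch_batch: `fetch_many`, `fake_fetch`, the `max_workers` default, and the error-string format are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_many(
    urls: list[str],
    fetch_one: Callable[[str], str],
    max_workers: int = 4,  # assumed default, not taken from the tool
) -> dict[str, str]:
    """Apply fetch_one to every URL concurrently; map each URL to its result."""

    def safe_fetch(url: str) -> str:
        # Catch per-URL failures so one bad page does not abort the batch.
        try:
            return fetch_one(url)
        except Exception as exc:
            return f"Error: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url: str) -> str:
    """Stand-in for a real single-URL fetch (hypothetical)."""
    if "bad" in url:
        raise ValueError("unreachable")
    return f"content of {url}"


results = fetch_many(
    ["https://example.com/page1", "https://example.com/bad"],
    fake_fetch,
)
```

Threads suit this workload because each fetch spends most of its time waiting on the network.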
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| url | str | URL to fetch content from |
| output_format | str | "markdown" (default), "text", "json" |
| include_links | bool | Include links in output |
| include_images | bool | Include image references |
| include_tables | bool | Include tables in output (default: True) |
| include_comments | bool | Include comments section |
| include_formatting | bool | Preserve headers/lists (default: True) |
| include_metadata | bool | Include HTTP metadata in JSON output |
| favor_precision | bool | Prefer accuracy over completeness |
| favor_recall | bool | Prefer completeness over accuracy |
| fast | bool | Skip fallback extraction for speed |
| target_language | str | Filter by ISO 639-1 language code |
| max_length | int | Truncate output to this length |
| use_cache | bool | Use cached pages (default: True) |
Note: favor_precision and favor_recall are mutually exclusive.
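A caller can enforce this constraint before making a request. The sketch below shows one way to do that; `resolve_extraction_mode` and the "balanced" label for the neither-flag case are hypothetical, and the tool's own behavior when both flags are set is not specified here.

```python
def resolve_extraction_mode(
    favor_precision: bool = False,
    favor_recall: bool = False,
) -> str:
    """Map the two flags to one mode name, rejecting the conflicting combination.

    Hypothetical helper: the mode names are illustrative only.
    """
    if favor_precision and favor_recall:
        raise ValueError(
            "favor_precision and favor_recall are mutually exclusive; set at most one"
        )
    if favor_precision:
        return "precision"
    if favor_recall:
        return "recall"
    return "balanced"  # assumed name for the default trade-off
```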
Examples
# Fetch single URL
web.fetch(url="https://example.com/article")
# Fetch with markdown output (the default format, set explicitly here)
web.fetch(url="https://docs.python.org/3/tutorial/", output_format="markdown")
# Fast mode without fallback
web.fetch(url="https://example.com/page", fast=True)
# JSON output with metadata
web.fetch(
    url="https://example.com/article",
    output_format="json",
    include_metadata=True
)
# Precision mode for cleaner extraction
web.fetch(url="https://example.com/page", favor_precision=True)
# Batch fetch multiple URLs
web.fetch_batch(urls=[
    "https://example.com/page1",
    "https://example.com/page2"
])
# Batch with several options applied to every URL
web.fetch_batch(
    urls=["https://example.com/page1", "https://example.com/page2"],
    include_links=True,
    favor_precision=True,
    fast=True
)
# Fetch plain text or JSON files (returned directly without extraction)
web.fetch(url="https://example.com/data.json")
web.fetch(url="https://example.com/robots.txt")
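The direct-return behavior for non-HTML content can be pictured as a check on the response's media type. How the tool actually detects non-HTML responses is not stated, so the sketch below assumes the Content-Type header is used; `PASSTHROUGH_TYPES` and `is_passthrough` are hypothetical names.

```python
# Media types returned as-is in this sketch (an assumed list based on the
# formats named above: plain text, JSON, XML, CSV).
PASSTHROUGH_TYPES = {
    "text/plain",
    "application/json",
    "application/xml",
    "text/xml",
    "text/csv",
}


def is_passthrough(content_type: str) -> bool:
    """Return True if a Content-Type header value names a non-HTML type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in PASSTHROUGH_TYPES
```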