Extract

Build structured data extraction workflows programmatically using the SDK.

Creating Extract Robots

Extract robots can be created using LLM-based extraction or non-LLM rules.

LLM Extraction (Beta)

Create robots using natural language.

from maxun import Extract, Config

extractor = Extract(Config(api_key="your-api-key"))

robot = await extractor.extract(
    url="https://example.com",
    prompt="Extract first 20 product names and prices",
    llm_provider="anthropic",
    llm_model="claude-3-5-sonnet-20241022",
    llm_api_key="your-anthropic-api-key",
)

result = await robot.run()

See AI Mode for provider details and LLM Extraction Prompts for writing effective prompts.

Non LLM Extraction

For non-LLM extraction, you define exact CSS selectors to capture data from web pages.

from maxun import Extract, Config

extractor = Extract(Config(api_key="your-api-key"))

robot = await (
    extractor
    .create("Product Extractor")
    .navigate("https://example.com/products")
    .capture_text({
        "productName": ".product-title",
        "price": ".price",
    })
)

result = await robot.run()

Key Features

1. Auto List Capture

When using captureList, you only need to provide the list item selector. Maxun automatically:

Detects all meaningful fields within each list item
Extracts clean, structured data from those fields

robot = await (
    extractor
    .create("Products")
    .navigate("https://example.com")
    .capture_list({
        "selector": ".product-card"
    })
)

2. Auto Pagination (Optional)

Pagination is completely optional. When you don't specify the pagination field, Maxun automatically detects and handles pagination for you.

.capture_list({
    "selector": ".product-card",
    "maxItems": 100,
})

3. Pagination with Selectors

For precise control, specify the pagination type and selector

.capture_list({
    "selector": ".product-card",
    "pagination": {
        "type": "clickNext",
        "selector": "button.next-page",
    },
    "maxItems": 100,
})

Pagination Types

Type	Description	Selector Required?	Example
`scrollDown`	Infinite scroll (downward)	❌ No	`{ type: 'scrollDown' }`
`scrollUp`	Infinite scroll (upward)	❌ No	`{ type: 'scrollUp' }`
`clickNext`	Click "Next" button/link	✅ Yes	`{ type: 'clickNext', selector: 'a.next' }`
`clickLoadMore`	Click "Load More" button	✅ Yes	`{ type: 'clickLoadMore', selector: 'button.load-more' }`

Methods

navigate(url)

.navigate("https://example.com")

Data Extraction

captureText(fields, name?)

Extract specific text fields using CSS selectors:

.capture_text(
    {
        "title": ".article-title",
        "author": ".author-name",
    },
    name="Article Info",
)

captureList(config, name?)

Extract data from lists with automatic field detection. See Key Features above for details on auto list capture and pagination.

# Simple auto-detection
.capture_list(
    {"selector": ".product-item"},
    name="Products",
)

# With pagination
.capture_list(
    {
        "selector": ".product-item",
        "pagination": {"type": "scrollDown"},
        "maxItems": 50,
    },
    name="Products",
)

captureScreenshot(name?, options?)

.capture_screenshot("Homepage", {"fullPage": True})

Interaction

click(selector)

.click('button.show-more')

type(selector, text, inputType?)

.type("input[name='search']", "web scraping", "text")

Input types: text, email, password, number, tel, url

scroll(direction, distance?)

.scroll('down', 500)
.scroll('up')

Waiting

waitFor(selector, timeout?)

.wait_for(".dynamic-content", 5000))

wait(milliseconds)

.wait(2000)

Configuration

setCookies(cookies)

.set_cookies([
    {
        "name": "session",
        "value": "abc123",
        "domain": ".example.com",
    }
])

Examples

List with Pagination

robot = await (
    extractor
    .create("News Articles")
    .navigate("https://news.example.com")
    .capture_list({
        "selector": "article.news-item",
        "pagination": {
            "type": "clickNext",
            "selector": "a.next-page",
        },
        "maxItems": 100,
    })
)

result = await robot.run()

Multi-Step Workflow

robot = await (
    extractor
    .create("Search Results")
    .navigate("https://example.com")
    .type("input[name='q']", "data extraction")
    .click("button[type='submit']")
    .wait_for(".results")
    .capture_list({"selector": ".result-item"})
)

Form Fill

robot = await (
    extractor
    .create("Login and Extract")
    .navigate("https://example.com/login")
    .type("input[name='email']", "user@example.com", "email")
    .type("input[name='password']", "password123", "password")
    .click("button[type='submit']")
    .wait_for(".dashboard")
    .capture_text({
        "username": ".user-name",
        "balance": ".account-balance",
    })
)

result = await robot.run()

Managing Robots

Get All Robots

robots = await extractor.get_robots()

Get Specific Robot

robot = await extractor.get_robot("robot-id")

Delete Robot

await extractor.delete_robot("robot-id")

Running Robots

# Run immediately
result = await robot.run()

# Run with options
result = await robot.run(
    wait_for_completion=True,
    timeout=60000,
)

See Robot Management for scheduling and webhooks.

Creating Extract Robots​

LLM Extraction (Beta)​

Non LLM Extraction​

Key Features​

1. Auto List Capture​

2. Auto Pagination (Optional)​

3. Pagination with Selectors​

Methods​

Navigation​

Data Extraction​

Interaction​

Waiting​

Configuration​

Examples​

List with Pagination​

Multi-Step Workflow​

Form Fill​

Managing Robots​

Get All Robots​

Get Specific Robot​

Delete Robot​

Running Robots​