
Crawl

Crawl automatically discovers and scrapes multiple pages from a website. Instead of requiring you to specify each URL manually, a crawl finds pages on its own and extracts content from each one.

How It Works

  1. Enter Starting URL: Provide the webpage where the crawl should begin.
  2. Configure Settings: Set up your crawl preferences (see the configuration sketch after these steps).
  3. Automatic Discovery: The robot finds pages using sitemaps and links.
  4. Content Extraction: Each discovered page is visited and scraped.
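
As a rough illustration, the steps above map to a configuration like the one sketched below. The field names (startUrl, maxPages, useSitemap) are illustrative assumptions, not the SDK's actual option names; check the Maxun documentation for the real ones.

```typescript
// Hypothetical crawl configuration. Field names here are
// illustrative assumptions, not the SDK's confirmed options.
interface CrawlConfig {
  startUrl: string;     // step 1: where the crawl begins
  maxPages?: number;    // cap on how many pages to discover
  useSitemap?: boolean; // discover pages via the sitemap as well as links
}

const config: CrawlConfig = {
  startUrl: "https://example.com/blog",
  maxPages: 50,
  useSitemap: true,
};
```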

What Gets Extracted

For each page it discovers, the crawl extracts the following (see the result sketch after the list):

  • Page metadata: Title, language, description, favicon, and all meta tags
  • HTML content: Full page HTML
  • Text content: Clean body text with word count
  • Links: All links found on the page
  • Status information: HTTP status code and scrape timestamp
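
A TypeScript sketch of what a single crawled page's result might look like, based only on the fields listed above; the actual property names in the SDK may differ.

```typescript
// Illustrative result shape for one crawled page. Property names
// are assumptions derived from the field list above.
interface CrawledPage {
  metadata: {
    title: string;
    language: string;
    description: string;
    favicon: string;
    metaTags: Record<string, string>; // all meta tags on the page
  };
  html: string;       // full page HTML
  text: string;       // clean body text
  wordCount: number;  // word count of the body text
  links: string[];    // all links found on the page
  statusCode: number; // HTTP status code
  scrapedAt: string;  // scrape timestamp, e.g. ISO 8601
}
```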

✅ When to Use Crawl Robots

  1. You need to scrape multiple pages from a website
  2. You want to discover pages automatically without listing URLs manually
  3. You're extracting similar content across many pages (blog posts, product pages, documentation)
  4. The website has a clear structure or sitemap

❌ When Not to Use Crawl Robots

  1. You only need data from a single page (use Extract or Scrape instead)
  2. You need complex interactions like logins or form submissions
  3. You need to extract structured data in a specific format (use Extract)
  4. You need to control the exact order pages are visited

For complex workflows with user interactions, use Extract instead.

Using with SDK

Crawl is available through the Maxun SDK for programmatic usage and integration into your applications.
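
Below is a minimal usage sketch. The package name, client class, and method signatures are assumptions for illustration; consult the SDK reference for the actual API.

```typescript
// Hypothetical SDK usage; the names below are assumptions, not the
// confirmed Maxun SDK API.
import { Maxun } from "maxun-sdk"; // hypothetical package name

async function main() {
  const client = new Maxun({ apiKey: process.env.MAXUN_API_KEY });

  // Start a crawl from a seed URL and wait for the results.
  const result = await client.crawl({
    startUrl: "https://example.com/docs",
    maxPages: 25,
  });

  // Each page carries metadata, content, links, and status info.
  for (const page of result.pages) {
    console.log(page.statusCode, page.metadata?.title);
  }
}

main().catch(console.error);
```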