Configuration

Configure how your crawl robot discovers and scrapes pages from a website.

See how to configure a crawl via the SDK here.

Basic Configuration

Robot Name

  • Custom name for your crawl robot

Starting URL

  • The webpage where the crawl begins
  • Must be a valid, accessible URL
  • All discovered pages will be relative to this starting point

Max Pages to Crawl

  • Maximum number of pages to crawl
  • Pages are discovered and crawled in order of relevance
  • Recommended: Start with 10-20 pages to test your configuration
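
A minimal configuration putting these basics together might look like the sketch below. The name and url field names are assumptions for illustration; limit matches the examples later on this page.

{
  name: 'My blog crawler',          // custom robot name (field name assumed)
  url: 'https://example.com/blog/', // starting URL (field name assumed)
  limit: 20                         // max pages to crawl; start small to test
}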

Advanced Options

Crawl Scope

Choose how broadly the robot should crawl from your starting URL:

Domain Mode

  • Crawls only pages on the exact same domain
  • Example: Starting at blog.example.com stays on blog.example.com
  • Best for: Focused crawls on a specific subdomain

Subdomain Mode

  • Crawls the domain and all its subdomains
  • Example: Starting at example.com includes blog.example.com, shop.example.com, etc.
  • Best for: Comprehensive site crawls across multiple sections

Path Mode

  • Crawls only pages under the same path as the starting URL
  • Example: Starting at example.com/blog/ stays within /blog/ path
  • Best for: Crawling specific sections like documentation or blog categories
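
The sketch below makes the three modes concrete by showing the kind of scope check each one implies. It is illustrative only, not the crawler's actual implementation.

// Illustrative scope check, not the crawler's actual code.
type ScopeMode = 'domain' | 'subdomain' | 'path';

function inScope(start: string, candidate: string, mode: ScopeMode): boolean {
  const s = new URL(start);
  const c = new URL(candidate);
  switch (mode) {
    case 'domain':
      // exact same host only, e.g. blog.example.com stays on blog.example.com
      return c.hostname === s.hostname;
    case 'subdomain':
      // same host or any subdomain of it, e.g. shop.example.com under example.com
      return c.hostname === s.hostname || c.hostname.endsWith('.' + s.hostname);
    case 'path':
      // same host and the candidate path starts with the starting path, e.g. /blog/
      return c.hostname === s.hostname && c.pathname.startsWith(s.pathname);
  }
}

// inScope('https://example.com/blog/', 'https://example.com/blog/2024/post', 'path') -> true
// inScope('https://example.com/', 'https://shop.example.com/', 'subdomain')          -> true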

Max Depth

Maximum Crawl Depth

  • Controls how many "levels" deep from the starting URL the crawler should go
  • Each click or navigation from one page to another counts as one level
  • Higher depth values discover more pages but increase crawl time
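
Depth counting can be pictured as a breadth-first traversal that stops enqueuing links past the limit. The loop below is an illustrative sketch (extractLinks is an assumed helper), not the robot's internal code.

// Illustrative depth-limited crawl loop (not the robot's internal code).
declare function extractLinks(url: string): Promise<string[]>; // assumed helper: fetches a page and returns its links

async function crawl(startUrl: string, maxDepth: number): Promise<void> {
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
  const seen = new Set<string>([startUrl]);

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    const links = await extractLinks(url);

    for (const link of links) {
      // each navigation from one page to another counts as one level
      if (depth + 1 <= maxDepth && !seen.has(link)) {
        seen.add(link);
        queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
}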

URL Filtering

Note: This feature is currently in development and not fully enforced.

Include Paths

  • Regex patterns for URLs to include in your crawl
  • Only URLs matching these patterns will be crawled
  • Leave empty to include all URLs within your scope
  • Example: /blog/[0-9]{4}/.* for dated blog posts

Exclude Paths

  • Regex patterns for URLs to exclude from your crawl
  • URLs matching these patterns will be skipped
  • Example: .*/admin/.* to skip admin pages
  • Useful for avoiding login pages, tag pages, or duplicate content

Pattern Tips:

  • Patterns use JavaScript regular expressions (regex)
  • The .* matches any characters (wildcard)
  • Use \. (written as \\. inside a quoted string) to match literal dots in URLs
  • Patterns are case-sensitive by default
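
Putting the pieces together, include and exclude patterns can be applied roughly like this. The snippet is illustrative only and reuses the patterns from the Filtered Blog Crawl example below.

// Illustrative include/exclude filtering with JavaScript regexes (not the crawler's actual code).
const includePaths = ['/blog/[0-9]{4}/.*'];
const excludePaths = ['.*/tag/.*', '.*/author/.*'];

function shouldCrawl(url: string): boolean {
  const path = new URL(url).pathname;
  const included =
    includePaths.length === 0 || includePaths.some((p) => new RegExp(p).test(path));
  const excluded = excludePaths.some((p) => new RegExp(p).test(path));
  return included && !excluded;
}

// shouldCrawl('https://example.com/blog/2024/launch') -> true
// shouldCrawl('https://example.com/blog/tag/news')    -> false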

Discovery Options

Use Sitemap

  • When enabled, fetches and parses the website's sitemap.xml
  • Automatically follows nested sitemaps
  • Recommended for sites with well-maintained sitemaps

Follow Links

  • When enabled, extracts all links from each visited page
  • Crawls pages that match your scope and filters
  • Recommended when sitemap is incomplete or unavailable

Best Practice: Enable both options for comprehensive discovery. The robot will combine URLs from both sources and remove duplicates.
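
Conceptually, combining both sources amounts to merging the two URL lists and removing duplicates, roughly as sketched below. Both helper functions are assumptions for illustration.

// Illustrative merge of sitemap URLs and extracted links, with duplicates removed.
declare function fetchSitemapUrls(site: string): Promise<string[]>; // assumed helper (Use Sitemap)
declare function extractLinks(pageUrl: string): Promise<string[]>;  // assumed helper (Follow Links)

async function discoverUrls(startUrl: string): Promise<string[]> {
  const fromSitemap = await fetchSitemapUrls(startUrl);
  const fromLinks = await extractLinks(startUrl);
  return [...new Set([...fromSitemap, ...fromLinks])]; // combine and dedupe
}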

Robots.txt Compliance

When enabled, the crawler respects robots.txt directives. This is recommended for ethical crawling of third-party websites and is enabled by default.
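
For reference, honoring a Disallow rule boils down to a check like the simplified sketch below. Real robots.txt handling also accounts for user agents, wildcards, and Allow rules.

// Simplified robots.txt check (illustrative; real parsers handle user agents, wildcards, Allow, etc.).
async function isAllowed(url: string): Promise<boolean> {
  const { origin, pathname } = new URL(url);
  const res = await fetch(origin + '/robots.txt');
  if (!res.ok) return true; // no robots.txt to respect

  const disallows = (await res.text())
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter((rule) => rule.length > 0);

  return !disallows.some((rule) => pathname.startsWith(rule));
}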

Example Configurations

Simple Blog Crawl

{
  mode: 'path',
  limit: 50,
  useSitemap: true,
  followLinks: true
}

Filtered Blog Crawl (Advanced)

{
  mode: 'path',
  limit: 50,
  includePaths: ['/blog/[0-9]{4}/.*'],
  excludePaths: ['.*/tag/.*', '.*/author/.*'],
  useSitemap: true,
  followLinks: true
}

Full Site Crawl

{
  mode: 'subdomain',
  limit: 100,
  excludePaths: ['.*/admin/.*', '.*/login.*'],
  useSitemap: true,
  followLinks: true
}

Documentation Crawl

{
  mode: 'path',
  limit: 200,
  useSitemap: true,
  followLinks: false
}