Configuration
Configure how your crawl robot discovers and scrapes pages from a website.
See how to configure a crawl via the SDK here.
Basic Configuration
Robot Name
- Custom name for your crawl robot
Starting URL
- The webpage where the crawl begins
- Must be a valid, accessible URL
- All discovered pages will be relative to this starting point
Max Pages to Crawl
- Maximum number of pages to crawl
- Pages are discovered and crawled in order of relevance
- Recommended: Start with 10-20 pages to test your configuration
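Putting these basic fields together, a minimal configuration might look like the sketch below. The name and url keys are assumptions for illustration; limit matches the example configurations later on this page.
const basicCrawl = {
  name: 'My Blog Robot',            // custom name for the crawl robot (assumed key)
  url: 'https://blog.example.com',  // starting URL; must be valid and accessible (assumed key)
  limit: 20                         // max pages to crawl; start small to test the configuration
};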
Advanced Options
Crawl Scope
Choose how broadly the robot should crawl from your starting URL:
Domain Mode
- Crawls only pages on the exact same domain
- Example: Starting at blog.example.com stays on blog.example.com
- Best for: Focused crawls on a specific subdomain
Subdomain Mode
- Crawls the domain and all its subdomains
- Example: Starting at example.com includes blog.example.com, shop.example.com, etc.
- Best for: Comprehensive site crawls across multiple sections
Path Mode
- Crawls only pages under the same path as the starting URL
- Example: Starting at example.com/blog/ stays within the /blog/ path
- Best for: Crawling specific sections like documentation or blog categories
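The same configuration controls all three behaviors; only the mode value changes. A rough sketch, assuming a url key for the starting URL (the mode values match the example configurations later on this page):
const domainCrawl    = { mode: 'domain',    url: 'https://blog.example.com' };  // stays on blog.example.com
const subdomainCrawl = { mode: 'subdomain', url: 'https://example.com' };       // also covers blog.example.com, shop.example.com, ...
const pathCrawl      = { mode: 'path',      url: 'https://example.com/blog/' }; // stays under /blog/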
Max Depth
Maximum Crawl Depth
- Controls how many "levels" deep from the starting URL the crawler should go
- Each click or navigation from one page to another counts as one level
- Higher depth values discover more pages but increase crawl time
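As a concrete illustration, with the starting URL at level 0, a depth limit of 2 stops the crawler at pages reachable within two clicks. The maxDepth key name below is an assumption for illustration; only the concept is described on this page.
const depthLimitedCrawl = {
  mode: 'domain',
  limit: 100,
  maxDepth: 2   // assumed key name
  // level 0: the starting URL
  // level 1: pages linked from the starting URL
  // level 2: pages linked from level-1 pages (deepest crawled here)
};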
URL Filtering
Note: This feature is currently in development and not fully enforced.
Include Paths
- Regex patterns for URLs to include in your crawl
- Only URLs matching these patterns will be crawled
- Leave empty to include all URLs within your scope
- Example: /blog/[0-9]{4}/.* for dated blog posts
Exclude Paths
- Regex patterns for URLs to exclude from your crawl
- URLs matching these patterns will be skipped
- Example: .*/admin/.* to skip admin pages
- Useful for avoiding login pages, tag pages, or duplicate content
Pattern Tips:
- Patterns use JavaScript regular expressions (regex)
- .* matches any characters (wildcard)
- Use \\. to match literal dots in URLs
- Patterns are case-sensitive by default
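Because the patterns are plain JavaScript regular expressions, you can sanity-check one in any JS/TS environment before adding it to includePaths or excludePaths. Note this only verifies the regex itself; how the crawler anchors patterns against full URLs is not specified here.
const pattern = new RegExp('/blog/[0-9]{4}/.*');
console.log(pattern.test('https://example.com/blog/2024/my-post')); // true
console.log(pattern.test('https://example.com/blog/about'));        // false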
Discovery Options
Use Sitemap
- When enabled, fetches and parses the website's sitemap.xml
- Automatically follows nested sitemaps
- Recommended for sites with well-maintained sitemaps
Follow Links
- When enabled, extracts all links from each visited page
- Crawls pages that match your scope and filters
- Recommended when sitemap is incomplete or unavailable
Best Practice: Enable both options for comprehensive discovery. The robot will combine URLs from both sources and remove duplicates.
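As a rough illustration of that combined discovery (not the crawler's actual implementation), merging the two URL sources and dropping duplicates could look like this:
const sitemapUrls = ['https://example.com/', 'https://example.com/blog/'];
const linkedUrls  = ['https://example.com/blog/', 'https://example.com/about'];
const toCrawl = [...new Set([...sitemapUrls, ...linkedUrls])];
// -> ['https://example.com/', 'https://example.com/blog/', 'https://example.com/about']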
Robots.txt Compliance
When enabled, the crawler respects robots.txt directives. Recommended for ethical crawling of third-party websites. Enabled by default.
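A minimal sketch of setting this alongside other options; the respectRobotsTxt key name is an assumption, since this page's examples do not show the actual option:
const politeCrawl = {
  mode: 'domain',
  limit: 50,
  respectRobotsTxt: true   // assumed key name; compliance is enabled by default
};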
Example Configurations
Simple Blog Crawl
{
mode: 'path',
limit: 50,
useSitemap: true,
followLinks: true
}
Filtered Blog Crawl (Advanced)
{
mode: 'path',
limit: 50,
includePaths: ['/blog/[0-9]{4}/.*'],
excludePaths: ['.*/tag/.*', '.*/author/.*'],
useSitemap: true,
followLinks: true
}
Full Site Crawl
{
mode: 'subdomain',
limit: 100,
excludePaths: ['.*/admin/.*', '.*/login.*'],
useSitemap: true,
followLinks: true
}
Documentation Crawl
{
mode: 'path',
limit: 200,
useSitemap: true,
followLinks: false
}