Crawl

Automatically discover and scrape multiple pages from websites using sitemaps and link following.

Creating Crawl Robots

import { Crawl } from 'maxun-sdk';

const crawler = new Crawl({
  apiKey: process.env.MAXUN_API_KEY
});

// Arguments: robot name, start URL, crawl options
const robot = await crawler.create(
  'Blog Crawler',
  'https://example.com/blog',
  {
    mode: 'domain',
    limit: 50,
    useSitemap: true,
    followLinks: true
  }
);

Configuration

Basic Options

mode (required)

Defines the crawl scope:

  • domain - Only pages on the exact same domain as the starting URL
  • subdomain - The starting domain and all of its subdomains
  • path - Only pages under the starting URL's path

limit (optional)

Maximum number of pages to crawl. Defaults to 10.

{
  limit: 100
}

Advanced Options

maxDepth (optional)

Maximum crawl depth from the starting URL. Each link hop from the starting page counts as one level of depth.

{
  maxDepth: 3
}

useSitemap (optional)

Fetch and parse the website's sitemap.xml. Defaults to true.

{
  useSitemap: true
}

followLinks (optional)

Extract and follow links from each visited page. Defaults to true.

{
  followLinks: true
}

includePaths (optional)

Regex patterns for URLs to include. Only matching URLs will be crawled.

{
  includePaths: ['/blog/[0-9]{4}/.*']
}

excludePaths (optional)

Regex patterns for URLs to exclude from crawling.

{
  excludePaths: ['.*/admin/.*', '.*/tag/.*']
}

respectRobots (optional)

Respect robots.txt directives. Defaults to true.

{
  respectRobots: true
}

Crawl Modes

Domain Mode

const robot = await crawler.create(
  'Domain Crawler',
  'https://blog.example.com',
  {
    mode: 'domain',
    limit: 50
  }
);

Crawls only blog.example.com. Won't crawl shop.example.com or example.com.

Subdomain Mode

const robot = await crawler.create(
  'Subdomain Crawler',
  'https://example.com',
  {
    mode: 'subdomain',
    limit: 100
  }
);

Crawls example.com, blog.example.com, shop.example.com, etc.

Path Mode

const robot = await crawler.create(
  'Path Crawler',
  'https://example.com/blog',
  {
    mode: 'path',
    limit: 50
  }
);

Crawls only pages under the /blog/ path.

Examples

Blog Crawl

const robot = await crawler.create(
  'Blog Posts',
  'https://example.com/blog',
  {
    mode: 'path',
    limit: 50,
    useSitemap: true,
    followLinks: true
  }
);

const result = await robot.run();
console.log('Pages crawled:', result.data.crawlData.length);

Documentation Crawl

const robot = await crawler.create(
  'Documentation',
  'https://docs.example.com',
  {
    mode: 'subdomain',
    limit: 200,
    useSitemap: true,
    followLinks: false,
    maxDepth: 5
  }
);

Filtered Crawl

const robot = await crawler.create(
  'Product Pages',
  'https://example.com',
  {
    mode: 'domain',
    limit: 100,
    includePaths: ['/products/.*'],
    excludePaths: ['.*/reviews/.*', '.*/comments/.*'],
    useSitemap: true,
    followLinks: true
  }
);

Full Site Crawl

const robot = await crawler.create(
  'Full Site',
  'https://example.com',
  {
    mode: 'subdomain',
    limit: 500,
    excludePaths: ['.*/admin/.*', '.*/login.*'],
    useSitemap: true,
    followLinks: true,
    respectRobots: true
  }
);

Accessing Crawl Results

const result = await robot.run();

if (result.data.crawlData) {
  const pages = result.data.crawlData;

  pages.forEach(page => {
    console.log('URL:', page.metadata?.url);
    console.log('Title:', page.metadata?.title);
    console.log('Word count:', page.wordCount);
    console.log('Status:', page.metadata?.statusCode);
  });
}

Each page contains the following fields (see the type sketch after this list):

  • metadata - URL, title, description, language, meta tags, favicon, status code
  • html - Full page HTML
  • text - Clean body text
  • wordCount - Number of words
  • links - All links found on the page
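
As a rough guide, each entry has a shape like the interface below. This is a sketch inferred from the field list above, not the SDK's published typings, so exact property names and optionality may differ.

// Sketch only, inferred from the field list above; the SDK's own type
// definitions are authoritative.
interface CrawledPage {
  metadata?: {
    url?: string;                        // page URL
    title?: string;                      // page title
    description?: string;                // meta description
    language?: string;                   // page language
    metaTags?: Record<string, string>;   // other meta tags
    favicon?: string;                    // favicon URL
    statusCode?: number;                 // HTTP status code
  };
  html: string;        // full page HTML
  text: string;        // clean body text
  wordCount: number;   // number of words in the body text
  links: string[];     // all links found on the page
}

For example, a shape like this lets you filter out error pages before processing: pages.filter(page => page.metadata?.statusCode === 200).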

Managing Crawl Robots

// Get all crawl robots
const robots = await crawler.getRobots();

// Get specific robot
const robot = await crawler.getRobot('robot-id');

// Delete robot
await crawler.deleteRobot('robot-id');

Running Crawl Robots

// Run immediately
const result = await robot.run();

// Run with timeout
const result = await robot.run({
  timeout: 60000
});
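
A run can fail or hit its timeout, so it is worth wrapping it. The sketch below assumes robot.run() rejects its promise on failure, which is typical for promise-based SDKs but is not stated explicitly on this page.

// Hedged sketch: assumes run() rejects on failure or timeout.
try {
  const result = await robot.run({ timeout: 60000 });
  const pages = result.data.crawlData ?? [];
  console.log(`Crawled ${pages.length} pages`);
} catch (err) {
  console.error('Crawl run failed:', err);
}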

For scheduling, webhooks, and other robot management features, see Robot Management.