/extract

Extracts structured data from HTML using AI or rule-based extractors. You supply the HTML and the endpoint returns clean, structured content, with no page fetching or scraping step involved.

POST /extract

When to Use

  • Converting HTML to Markdown for LLM processing
  • Extracting main article content from noisy pages
  • Running custom extraction logic on HTML
  • Processing HTML you've already fetched elsewhere

Request Parameters

Required Parameters

Parameter | Type | Description
input | string | The HTML content to extract from

Extraction Options

Parameter | Type | Description
preset | string | Built-in extractor preset (see Presets below)
extractor | string | Custom JavaScript extractor code
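As a rough sketch of how these parameters combine, the helper below assembles a request body. The name `buildExtractBody` is hypothetical, and the rule that a request carries exactly one of `preset` or `extractor` is an assumption: this page does not say how the API treats a request that sets both.

```javascript
// Sketch: assemble a JSON body for POST /extract.
// Assumption: a request carries `input` plus exactly one of `preset` or
// `extractor`; the behavior when both are sent is not documented here.
function buildExtractBody({ input, preset, extractor }) {
  if (typeof input !== "string" || input.length === 0) {
    throw new Error("input must be a non-empty HTML string");
  }
  const hasPreset = typeof preset === "string";
  const hasExtractor = typeof extractor === "string";
  if (hasPreset === hasExtractor) {
    throw new Error("provide exactly one of preset or extractor");
  }
  return hasPreset ? { input, preset } : { input, extractor };
}

const body = buildExtractBody({
  input: "<html><body><h1>Title</h1></body></html>",
  preset: "markdown",
});
console.log(JSON.stringify(body));
```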

Extraction Presets

markdown

Converts the full page to clean Markdown and extracts page metadata.

Output:

  • markdown - Full page as Markdown
  • html - Cleaned HTML
  • meta - Extracted metadata (title, author, description, date)

markdown_content

Extracts main content and converts to Markdown. Best for articles and blog posts.

Output:

  • markdown - Main content as Markdown
  • html - Main content HTML
  • meta - Metadata

content

Extracts the main readable content using the Readability algorithm.

Output:

  • title - Article title
  • content - Main content HTML
  • textContent - Plain text
  • length - Content length
  • excerpt - Short excerpt
  • byline - Author
  • siteName - Site name

Custom Extractor

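Write a JavaScript function that uses Cheerio and pass its source as a string in the extractor field. The function receives the HTML as input and the Cheerio library as cheerio, and its return value is what comes back in result: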

extractor.js
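function(input, cheerio) {
  // Parse the HTML string into a Cheerio document.
  const $ = cheerio.load(input);
  return {
    title: $('title').text(),
    heading: $('h1').first().text(),
    // Collect text and href for the first 10 links that have an href attribute.
    links: $('a[href]').map((i, el) => ({
      text: $(el).text(),
      href: $(el).attr('href')
    })).get().slice(0, 10)
  };
}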
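Because the extractor travels as a string, it can be convenient to write it as a normal function and serialize it when building the request. The sketch below uses `Function.prototype.toString()`, which returns the function's source text. The helper name `toExtractorPayload` is hypothetical, and it is an assumption that the service accepts the source exactly as `toString()` emits it (an anonymous `function (input, cheerio) { ... }` expression, matching the shape shown above).

```javascript
// Write the extractor as ordinary code so editors and linters can check it.
// It is only serialized here, never called locally, so Cheerio is not needed.
const listItems = function (input, cheerio) {
  const $ = cheerio.load(input);
  return $("li").map((i, el) => $(el).text()).get();
};

// Hypothetical helper: pair the HTML with the extractor's source text.
function toExtractorPayload(html, extractorFn) {
  // Function.prototype.toString() returns the function's source text.
  return { input: html, extractor: extractorFn.toString() };
}

const payload = toExtractorPayload("<ul><li>Item 1</li></ul>", listItems);
console.log(JSON.stringify(payload, null, 2));
```

`JSON.stringify` takes care of escaping quotes and newlines inside the function source, which avoids hand-editing a one-line string.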

Example Requests

Using Markdown Preset

request.json
{
  "input": "<html><head><title>My Article</title></head><body><article><h1>Hello World</h1><p>This is content.</p></article></body></html>",
  "preset": "markdown"
}

Using Content Preset

request.json
{
  "input": "<html><body><article><h1>Breaking News</h1><p>Important story content here...</p><p>More details...</p></article><footer>Copyright 2026</footer></body></html>",
  "preset": "content"
}

Custom Extractor

request.json
{
  "input": "<html><body><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>",
  "extractor": "function(input, cheerio) { let $ = cheerio.load(input); return $('li').map((i,el) => $(el).text()).get(); }"
}

cURL Example

Terminal
curl -X POST https://scraperex1.p.rapidapi.com/extract \
  -H "Content-Type: application/json" \
  -H "X-RapidAPI-Key: YOUR_API_KEY" \
  -H "X-RapidAPI-Host: scraperex1.p.rapidapi.com" \
  -d '{
    "input": "<html><body><h1>Title</h1><p>Content</p></body></html>",
    "preset": "markdown"
  }'
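The same call can be sketched in JavaScript with `fetch`, which is built into Node.js 18+ and browsers. The URL and headers are copied from the cURL example; the function names are hypothetical, and returning `result` from the parsed body assumes the response shape shown in the examples on this page. Error responses are not documented here, so the sketch only reports the HTTP status.

```javascript
const EXTRACT_URL = "https://scraperex1.p.rapidapi.com/extract";

// Build the fetch options separately so they can be inspected without a network call.
function buildExtractRequest(apiKey, body) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-RapidAPI-Key": apiKey,
      "X-RapidAPI-Host": "scraperex1.p.rapidapi.com",
    },
    body: JSON.stringify(body),
  };
}

async function extract(apiKey, body) {
  const response = await fetch(EXTRACT_URL, buildExtractRequest(apiKey, body));
  if (!response.ok) {
    // The error format is not documented on this page, so only the status is reported.
    throw new Error(`extract request failed with HTTP ${response.status}`);
  }
  // Assumes the documented top-level `result` field.
  return (await response.json()).result;
}

// Usage (performs a network call, so it is left commented out):
// extract("YOUR_API_KEY", { input: "<html>...</html>", preset: "markdown" })
//   .then((result) => console.log(result.markdown));
```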

Response (markdown preset)

response.json
{
  "result": {
    "markdown": "# Hello World\n\nThis is content.",
    "html": "<h1>Hello World</h1><p>This is content.</p>",
    "meta": {
      "title": "My Article",
      "author": null,
      "description": null,
      "date": null
    }
  }
}
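Metadata fields that cannot be found come back as null, as in the response above, so client code should skip them rather than print them. A minimal sketch, with `withMetaHeader` as a hypothetical helper that prepends only the non-null fields to the Markdown:

```javascript
// Hypothetical helper, not part of the API: put non-null metadata
// in a small "key: value" header above the Markdown body.
function withMetaHeader(result) {
  const lines = Object.entries(result.meta || {})
    .filter(([, value]) => value !== null && value !== undefined)
    .map(([key, value]) => `${key}: ${value}`);
  return lines.length > 0
    ? `${lines.join("\n")}\n\n${result.markdown}`
    : result.markdown;
}

const sample = {
  markdown: "# Hello World\n\nThis is content.",
  meta: { title: "My Article", author: null, description: null, date: null },
};
console.log(withMetaHeader(sample));
```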

Response (content preset)

response.json
{
  "result": {
    "title": "Breaking News",
    "content": "<h1>Breaking News</h1><p>Important story content here...</p><p>More details...</p>",
    "textContent": "Breaking News\nImportant story content here...\nMore details...",
    "length": 85,
    "excerpt": "Important story content here...",
    "byline": null,
    "siteName": null
  }
}

Response Fields

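Field | Type | Description
result | object | Extraction result (structure depends on preset/extractor)
result.markdown | string | Markdown output (markdown/markdown_content presets)
result.html | string | Cleaned HTML (markdown/markdown_content presets)
result.meta | object | Metadata (markdown/markdown_content presets)
result.title | string | Article title (content preset)
result.content | string | Main content HTML (content preset)
result.textContent | string | Plain text content (content preset)
result.length | number | Content length (content preset)
result.excerpt | string | Short excerpt (content preset)
result.byline | string | Author, may be null (content preset)
result.siteName | string | Site name, may be null (content preset)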
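Since the shape of result depends on the preset or extractor used, client code that handles several request types may need to tell the shapes apart. The sketch below is one hypothetical way to do that by checking for the fields listed above. It cannot distinguish markdown from markdown_content (they share a shape), and a custom extractor that happens to return a `markdown` or `textContent` field would be misclassified, so tracking which preset was requested is more reliable where that is possible.

```javascript
// Hypothetical helper: guess the kind of result from the fields it carries.
function resultKind(result) {
  if (result && typeof result === "object" && !Array.isArray(result)) {
    if (typeof result.markdown === "string") return "markdown";
    if (typeof result.textContent === "string") return "content";
  }
  // Anything else (arrays, other objects, primitives) is treated as custom output.
  return "custom";
}

console.log(resultKind({ markdown: "# Hello World", html: "<h1>Hello World</h1>", meta: {} }));
console.log(resultKind({ title: "Breaking News", textContent: "Breaking News" }));
console.log(resultKind(["Item 1", "Item 2", "Item 3"]));
```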

Use Cases

LLM Content Preparation

Convert a fetched page to Markdown before passing it to a language model. The markdown_content preset keeps only the main content:

request.json
{
  "input": "<your-html-content>",
  "preset": "markdown_content"
}
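Long pages may still exceed what a model accepts in one prompt. As an illustration of one possible follow-up step on the client side, the sketch below splits `result.markdown` into chunks at blank lines so paragraphs stay intact. `chunkMarkdown` is a hypothetical helper, and the 2000-character default is an arbitrary assumption, not a limit of this API or of any particular model.

```javascript
// Sketch: split Markdown into chunks under a character budget,
// breaking only at blank lines so paragraphs are never cut in half.
// A single paragraph longer than maxChars is kept whole as its own chunk.
function chunkMarkdown(markdown, maxChars = 2000) {
  const chunks = [];
  let current = "";
  for (const block of markdown.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${block}` : block;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = block;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkMarkdown("# Hello World\n\nThis is content.", 20));
```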

Article Extraction

Extract clean article content from news sites:

request.json
{
  "input": "<news-page-html>",
  "preset": "content"
}

Custom Data Extraction

Extract product data from e-commerce pages:

request.json
{
  "input": "<product-page-html>",
  "extractor": "function(input, cheerio) { let $ = cheerio.load(input); return { name: $('.product-title').text(), price: $('.price').text(), rating: $('.rating').attr('data-value') }; }"
}
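The extractor above returns the price and rating as raw strings, and `rating` is omitted from the JSON when the attribute is missing. A client will usually want numbers instead. The sketch below is a hypothetical post-processing step that assumes prices look like "$1,299.99" (comma as thousands separator, dot as decimal point); other locales would need different parsing.

```javascript
// Hypothetical post-processing for the product extractor's result.
// Assumes prices such as "$1,299.99"; unparseable values become null.
function normalizeProduct(raw) {
  const priceText = (raw.price || "").replace(/[^0-9.]/g, "");
  const price = priceText === "" ? null : Number.parseFloat(priceText);
  const rating = raw.rating === undefined ? null : Number.parseFloat(raw.rating);
  return {
    name: (raw.name || "").trim(),
    price: Number.isNaN(price) ? null : price,
    rating: Number.isNaN(rating) ? null : rating,
  };
}

console.log(normalizeProduct({ name: "  Widget ", price: "$1,299.99", rating: "4.5" }));
```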