Skip to content

Extract structured metadata from HTML using composable rule bundles.

User intent

  • build link previews
  • parse Open Graph/Twitter Cards/JSON-LD from pages
  • extract titles/descriptions/images/authors from HTML
  • create metadata extraction pipelines with custom rules and provider-specific parsers

Installation

npx skills add https://github.com/microlinkhq/skills --skill metascraper
# metascraper

`metascraper` extracts normalized metadata from the input HTML you provide, using ordered rule bundles.

Extraction quality depends on the accuracy of that input HTML.

## Quick Start

Install only what you need:

```bash
npm install metascraper \
  metascraper-author \
  metascraper-date \
  metascraper-description \
  metascraper-image \
  metascraper-logo \
  metascraper-publisher \
  metascraper-title \
  metascraper-url
```

Minimal extraction:

```js
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const metadata = await metascraper({
  url: 'https://example.com/article',
  html: '<html>...</html>'
})
```

## Recommended Workflow

1. Get accurate HTML for the target URL.
2. Compose rule bundles for the fields you need.
3. Run `metascraper({ url, html })`.
4. Narrow extraction with `pickPropNames` when speed matters.
5. Add custom rules only when defaults miss required data.

## Getting HTML

`metascraper` needs both `url` and `html`.

- Static pages: fetch raw HTML with your HTTP client.
- JS-heavy pages: use a rendered HTML source (for example, `html-get`).
- At scale: if browser infra is not desired, use Microlink API as managed alternative.

For accurate rendered markup, use the local [html-get skill](../html-get/SKILL.md) before running `metascraper`.

## Core API

Create an instance:

```js
const metascraper = require('metascraper')([/* rule bundles */])
```

Run extraction:

```js
const metadata = await metascraper({
  url,
  html,
  htmlDom,        // optional pre-parsed DOM
  rules,          // optional extra rules at runtime
  pickPropNames,  // optional Set of fields to compute
  omitPropNames,  // optional Set of fields to skip
  validateUrl     // default true
})
```

## Property Selection Patterns

Compute only selected fields:

```js
const metadata = await metascraper({
  url,
  html,
  pickPropNames: new Set(['title', 'description', 'image'])
})
```

Skip expensive or unwanted fields:

```js
const metadata = await metascraper({
  url,
  html,
  omitPropNames: new Set(['logo'])
})
```

When both are present, `pickPropNames` takes precedence.

## Rule Bundle Notes

- Rules are evaluated in order.
- The first rule that resolves a field wins.
- Put specific rules before generic fallbacks.
- Use package-level options for bundle behavior.

Example with bundle options:

```js
const metascraper = require('metascraper')([
  require('metascraper-logo')({
    filter: imageUrl => imageUrl.endsWith('.png')
  })
])
```

## Common Bundles

Core article fields:

- `metascraper-author`
- `metascraper-date`
- `metascraper-description`
- `metascraper-image`
- `metascraper-logo`
- `metascraper-publisher`
- `metascraper-title`
- `metascraper-url`

Media and enrichment:

- `metascraper-audio`
- `metascraper-video`
- `metascraper-lang`
- `metascraper-iframe`
- `metascraper-readability`

Provider-specific examples:

- `metascraper-youtube`
- `metascraper-spotify`
- `metascraper-soundcloud`
- `metascraper-x`
- `metascraper-instagram`
- `metascraper-tiktok`

## Troubleshooting

- Missing metadata: verify the HTML contains rendered tags (Open Graph, Twitter, JSON-LD, etc.).
- Wrong relative URLs: ensure `url` is absolute and matches the fetched page.
- Slow extraction: use `pickPropNames` to compute only needed fields.
- URL validation failures: pass `validateUrl: false` only when input is intentionally non-standard.

## Output Fields (Typical)

Common fields include:

- `title`
- `description`
- `author`
- `publisher`
- `date`
- `image`
- `logo`
- `url`
- `lang`
- `audio`
- `video`

The returned shape depends on loaded rule bundles.