Retrieve normalized, render-ready HTML from any URL using fetch or headless prerender.

User intent

  • get rendered HTML from JavaScript-heavy pages
  • normalize relative URLs to absolute for downstream parsing
  • prepare HTML for metadata extraction pipelines
  • choose between fast fetch and full browser rendering per URL

Installation

npx skills add https://github.com/microlinkhq/skills --skill html-get

# html-get

`html-get` returns reliable HTML for a URL, choosing `fetch` or `prerender` depending on page needs.

## Quick Start

Install:

```bash
npm install html-get browserless puppeteer
```

Minimal usage:

```js
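const createBrowserless = require('browserless')
const getHTML = require('html-get')

async function main () {
  // Spawn one browser process and open an isolated context for this request.
  const browser = createBrowserless()
  const context = browser.createContext()

  try {
    const result = await getHTML('https://example.com', {
      getBrowserless: () => context
    })
    console.log(result.html)
  } finally {
    // `await` covers the case where `createContext()` returns a promise.
    const browserless = await context
    await browserless.destroyContext()
    await browser.close()
  }
}

main().catch(console.error)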
```

## Recommended Workflow

1. Start with the default `prerender: 'auto'`.
2. Set `prerender: false` for static pages when speed is the priority.
3. Enable `rewriteUrls: true` when downstream parsing needs absolute links.
4. Enable `rewriteHtml: true` when source pages have broken meta tags.
5. Reuse one browser process and create/destroy contexts per request.
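
Steps 1 to 4 can be sketched as a small helper that maps what you know about a target onto `html-get` options. This is a minimal sketch, not part of the library: `buildOptions` and the `target` flags (`isStatic`, `needsAbsoluteLinks`, `hasBrokenMeta`) are hypothetical names chosen for illustration.

```js
// Hypothetical helper: translate per-URL knowledge into html-get options.
// The `target` flags are illustrative names, not html-get API.
function buildOptions (target, getBrowserless) {
  const opts = {
    // Steps 1 and 2: keep the 'auto' default unless the page is known to be static.
    prerender: target.isStatic ? false : 'auto',
    // Step 3: absolute links for downstream parsers.
    rewriteUrls: Boolean(target.needsAbsoluteLinks),
    // Step 4: repair meta tags on pages known to have broken ones.
    rewriteHtml: Boolean(target.hasBrokenMeta)
  }
  // `getBrowserless` is only needed when prerendering may happen.
  if (opts.prerender !== false) opts.getBrowserless = getBrowserless
  return opts
}
```

The result would then be passed straight through, for example `getHTML(url, buildOptions({ isStatic: true, needsAbsoluteLinks: true }))`.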

## CLI

One-off usage:

```bash
npx -y html-get https://example.com
```

Debug output with mode, timing, and headers:

```bash
npx -y html-get https://example.com --debug
```

## Core Options

- `getBrowserless` (function): required unless `prerender: false`.
- `prerender` (`'auto' | true | false`): mode selector.
- `rewriteUrls` (boolean): rewrite relative HTML/CSS URLs to absolute.
- `rewriteHtml` (boolean): normalize common meta-tag mistakes.
- `headers` (object): request headers for fetch/prerender.
- `gotOpts` (object): extra options for `got` in fetch mode.
- `puppeteerOpts` (object): options passed to browserless during prerendering.
- `serializeHtml` (function): custom output serializer from Cheerio instance.
- `encoding` (string): output encoding, default `utf-8`.
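
As an illustration of how several of these options combine, here is a sketch of an options object for a fetch-only request. All values are assumptions made for the example: the header values are placeholders, and the `timeout: { request }` shape assumes the bundled `got` version accepts a per-request timeout in milliseconds.

```js
// Illustrative options for a fetch-only request; every value is an example.
const fetchOnlyOptions = {
  // No browser involved, so `getBrowserless` can be omitted.
  prerender: false,
  rewriteUrls: true,
  // Sent with the request (placeholder values).
  headers: {
    'user-agent': 'my-crawler/1.0',
    'accept-language': 'en'
  },
  // Assumed `got` option shape: abort the request after 8 seconds.
  gotOpts: { timeout: { request: 8000 } },
  encoding: 'utf-8'
}
```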

## Output Shape

`getHTML(url, opts)` resolves to:

- `html`: serialized HTML (or custom serializer output fields).
- `url`: final URL.
- `statusCode`: HTTP status.
- `headers`: response headers.
- `redirects`: redirect chain.
- `stats`: `{ mode, timing }`.
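
A short sketch of consuming these fields, for logging or monitoring. `describeResult` is a hypothetical helper; it reads only the fields listed above and assumes `redirects` is an array whose length is the number of hops.

```js
// Hypothetical helper: one-line summary of an html-get result.
function describeResult (result) {
  const { url, statusCode, redirects = [], stats = {} } = result
  const hops = redirects.length
  const noun = hops === 1 ? 'redirect' : 'redirects'
  return `${statusCode} ${url} via ${stats.mode} (${hops} ${noun})`
}
```

Logging `stats.mode` next to the status code makes it easy to spot URLs that always end up in prerender mode and may deserve a closer look.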

## Common Patterns

Force fast fetch mode for known static targets:

```js
const result = await getHTML(url, {
  prerender: false,
  rewriteUrls: true
})
```

Prepare HTML for metadata extraction:

```js
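// `metascraper` is assumed to be an instance already configured with rule bundles.
const page = await getHTML(url, {
  getBrowserless,
  rewriteUrls: true,
  rewriteHtml: true
})

// Use the final URL reported by html-get rather than the requested one.
const metadata = await metascraper({ url: page.url, html: page.html })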
```

Custom serializer (return extra fields alongside the HTML):

```js
const result = await getHTML(url, {
  getBrowserless,
  serializeHtml: ($) => ({
    html: $.html(),
    title: $('title').first().text()
  })
})
```

## Reliability Notes

- If `getBrowserless` is missing and `prerender` is not `false`, `html-get` throws.
- PDF URLs are fetched and, when `mutool` is available, can be converted to HTML.
- Media URLs are normalized to HTML wrappers (`img`, `video`, `audio`) for consistent downstream parsing.
- For large batch jobs, control concurrency outside `html-get` and always clean up browser contexts.
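
The last note can be sketched as a small concurrency limiter plus a per-URL worker that always destroys its context. This is a minimal sketch under stated assumptions: `mapWithConcurrency` and `createWorker` are hypothetical helpers, `browser` is a browserless instance, and `getHTML` is `html-get`, both passed in by the caller.

```js
// Run `worker(item)` for every item with at most `limit` calls in flight.
// Results keep the order of `items`.
async function mapWithConcurrency (items, limit, worker) {
  const results = new Array(items.length)
  let next = 0

  // Each runner pulls the next unclaimed index until none are left.
  async function run () {
    while (next < items.length) {
      const index = next++
      results[index] = await worker(items[index], index)
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, run)
  await Promise.all(runners)
  return results
}

// Hypothetical worker factory: one browser context per URL, always cleaned up.
function createWorker ({ browser, getHTML }) {
  return async function worker (url) {
    const context = browser.createContext()
    try {
      return await getHTML(url, { getBrowserless: () => context })
    } finally {
      // Runs on success and on failure, so contexts never leak.
      const browserless = await context
      await browserless.destroyContext()
    }
  }
}
```

A batch run would then look like `mapWithConcurrency(urls, 4, createWorker({ browser, getHTML }))`. Note that the first rejected worker rejects the whole batch; wrap the worker in its own `try`/`catch` if per-URL failures should be collected instead.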