@origints/html
HTML parsing with CSS selector queries, attribute extraction, and Markdown conversion.
Installation
    npm install @origints/html @origints/core

Features
- Parse HTML with source position tracking
- CSS selector queries (via hast-util-select)
- Type-safe extractors for elements and attributes
- Convert HTML to Markdown
- Navigation API for tree traversal
Usage with Planner
Extract content
    import { Planner, loadFile, run } from '@origints/core'
    import { parseHtml } from '@origints/html'

    const plan = new Planner()
      .in(loadFile('page.html'))
      .mapIn(parseHtml())
      .emit((out, $) => out
        .add('title', $.select('h1').text())
        .add('href', $.select('a').attr('href'))
      )
      .compile()

    const result = await run(plan, { readFile, registry })
    // result.value: { title: 'Welcome', href: '/about' }

Extract collections with selectAll
    const plan = new Planner()
      .in(loadFile('page.html'))
      .mapIn(parseHtml())
      .emit((out, $) => out
        .add('items', $.select('ul').selectAll('li', node => node.text()))
      )
      .compile()

Structured data from repeated elements
    .emit((out, $) => out
      .add('links', $.selectAll('a', node => ({
        kind: 'object',
        properties: {
          href: node.attr('href'),
          text: node.text(),
        },
      })))
    )

Children extraction
    .emit((out, $) => out
      .add('sections', $.select('main').children(node => node.text()))
    )

Standalone usage
    import { parseHtmlImpl, HtmlNode } from '@origints/html'

    const node = parseHtmlImpl.execute(htmlString) as HtmlNode

    const title = node.select('h1')
    if (title.ok) console.log(title.value.text())

    const items = node.selectAll('li')
    for (const item of items) console.log(item.text())

Markdown conversion
    // HtmlNode is imported as well because the cast below refers to it
    import { parseHtmlImpl, toMarkdown, HtmlNode } from '@origints/html'

    const node = parseHtmlImpl.execute(htmlContent) as HtmlNode
    const markdown = toMarkdown(node)

| Export | Description |
|---|---|
| `parseHtml(options?)` | Transform AST for `Planner.mapIn()` |
| `parseHtmlImpl` | Sync transform implementation |
| `parseHtmlAsyncImpl` | Async transform implementation |
| `registerHtmlTransforms(registry)` | Register HTML transforms |
| `HtmlNode` | Navigable wrapper with CSS selector support |
| `toMarkdown(node)` | Convert HTML to Markdown |
| `toJson(node, options?)` | Convert to JSON |
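
The standalone usage example checks `title.ok` before reading `title.value`, which suggests that `select` returns a result object instead of throwing when nothing matches. The sketch below illustrates that pattern in isolation. It does not use the library: `SelectResult`, `FakeNode`, `select`, and `firstText` are hypothetical stand-ins written for this example, and the `error` field on the failure branch is an assumption, not a documented part of `@origints/html`.

```typescript
// Hypothetical stand-ins: none of these names are exports of @origints/html.
// They only mirror the `ok` / `value` shape used in the standalone example.
type SelectResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string } // assumed failure shape

interface FakeNode {
  text(): string
}

// Toy lookup over a selector -> text map, standing in for node.select().
function select(doc: Map<string, string>, selector: string): SelectResult<FakeNode> {
  const found = doc.get(selector)
  if (found === undefined) {
    return { ok: false, error: `no match for "${selector}"` }
  }
  return { ok: true, value: { text: () => found } }
}

// Read the text of a match, falling back when the selector finds nothing.
function firstText(doc: Map<string, string>, selector: string, fallback: string): string {
  const result = select(doc, selector)
  // Checking `ok` narrows the union, so `value` is only reachable on success.
  return result.ok ? result.value.text() : fallback
}

const doc = new Map([['h1', 'Welcome']])
console.log(firstText(doc, 'h1', '(untitled)'))
console.log(firstText(doc, 'h2', '(untitled)'))
```

Wrapping the check in a helper like `firstText` keeps the fallback in one place. Whether the real result type carries an error message or other details should be confirmed against the package's type definitions.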
License
MIT