Extraction System

All data extraction — regardless of format — uses a single generic type. JSON, XLSX, CSV, YAML, HTML, Markdown, and TOML all produce the same spec shape.

interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string // 'json', 'xlsx', 'csv', 'yaml', ...
  readonly steps: readonly S[] // format-specific navigation
  readonly extract: E // format-specific terminal extraction
}

Each format fills in its own step and extraction types:

// JSON
{ kind: 'extract', format: 'json', steps: ['user', 'name'], extract: 'string' }
// XLSX
{ kind: 'extract', format: 'xlsx', steps: [{ kind: 'sheet', name: 'Sales' }, { kind: 'cell', ref: 'B2' }], extract: 'number' }
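
As a rough illustration of how a format might fill in the two type parameters, the sketch below defines per-format aliases over the generic interface. The names `ScalarExtract`, `JsonExtractSpec`, `XlsxStep`, and `XlsxExtractSpec` are hypothetical and chosen for this example; the actual format packages may name and shape their types differently.

```typescript
// Shape copied from the ExtractSpec interface above.
interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string
  readonly steps: readonly S[]
  readonly extract: E
}

// Hypothetical aliases (assumptions for this sketch, not the library's types).
type ScalarExtract = 'string' | 'number'
type JsonExtractSpec = ExtractSpec<string, ScalarExtract>
type XlsxStep =
  | { readonly kind: 'sheet'; readonly name: string }
  | { readonly kind: 'cell'; readonly ref: string }
type XlsxExtractSpec = ExtractSpec<XlsxStep, ScalarExtract>

const jsonSpec: JsonExtractSpec = {
  kind: 'extract',
  format: 'json',
  steps: ['user', 'name'],
  extract: 'string',
}

const xlsxSpec: XlsxExtractSpec = {
  kind: 'extract',
  format: 'xlsx',
  steps: [{ kind: 'sheet', name: 'Sales' }, { kind: 'cell', ref: 'B2' }],
  extract: 'number',
}

// Both are assignable to the generic type, so generic code can hold them together.
const specs: readonly ExtractSpec[] = [jsonSpec, xlsxSpec]
const formats = specs.map(spec => spec.format)
```

Because the step and extraction types are only parameters, code that does not care about the format can work with plain `ExtractSpec` values, while format-specific code keeps precise types.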

Each format registers an executor that handles its navigation steps and terminal extraction. JSON is not a special case — it’s just another registered executor.

registerSpecExecutor('json', jsonExecutor)
registerSpecExecutor('xlsx', xlsxExecutor)
registerSpecExecutor('csv', csvExecutor)

executeSpec() dispatches on spec.format to the registered executor.
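
The dispatch described above can be pictured as a lookup in a map keyed by format name. The following is a minimal, self-contained sketch rather than the library's implementation: the `SpecExecutor` signature, the `executors` map, the `(spec, data)` argument order, and the toy `jsonExecutor` are all assumptions made for the example.

```typescript
// Shape copied from the ExtractSpec interface above.
interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string
  readonly steps: readonly S[]
  readonly extract: E
}

// Assumed executor signature: takes a spec and input data, returns a value.
type SpecExecutor = (spec: ExtractSpec, data: unknown) => unknown

// Hypothetical registry keyed by format name.
const executors = new Map<string, SpecExecutor>()

function registerSpecExecutor(format: string, executor: SpecExecutor): void {
  executors.set(format, executor)
}

function executeSpec(spec: ExtractSpec, data: unknown): unknown {
  const executor = executors.get(spec.format)
  if (!executor) {
    throw new Error(`No executor registered for format '${spec.format}'`)
  }
  return executor(spec, data)
}

// Toy JSON executor: walks property keys, then checks the terminal type.
const jsonExecutor: SpecExecutor = (spec, data) => {
  let current: unknown = data
  for (const step of spec.steps) {
    if (typeof current !== 'object' || current === null) {
      throw new Error(`Cannot navigate into '${String(step)}'`)
    }
    current = (current as Record<string, unknown>)[String(step)]
  }
  if (typeof current !== spec.extract) {
    throw new Error(`Expected ${String(spec.extract)}, got ${typeof current}`)
  }
  return current
}

registerSpecExecutor('json', jsonExecutor)

const nameSpec: ExtractSpec<string, string> = {
  kind: 'extract',
  format: 'json',
  steps: ['user', 'name'],
  extract: 'string',
}
const extractedName = executeSpec(nameSpec, { user: { name: 'Ada' } })
```

In this sketch an unregistered format fails loudly at dispatch time, which keeps a typo in `format` from silently producing `undefined`.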

Specs compose into larger structures:

Spec             Purpose
ExtractSpec      Terminal extraction from a data source
ArraySpec        Map over a collection
ObjectSpec       Construct an object from named properties
LiteralSpec      Constant value
MatchSpec        Conditional extraction based on runtime predicates
ConcatSpec       Concatenate multiple array results
PanicSpec        Signal an unrecoverable condition
TrySpec          Ordered fallback chain
MapSpec          Transform an extracted value
GuardSpec        Validate an extracted value
ForEachSpec      Iterate over values and execute body per item
VariableRefSpec  Reference a bound variable from forEach scope

The full Spec union:

type Spec =
  | ExtractSpec
  | ArraySpec
  | ObjectSpec
  | LiteralSpec
  | MatchSpec
  | PanicSpec
  | ConcatSpec
  | TrySpec
  | MapSpec
  | GuardSpec
  | ForEachSpec
  | VariableRefSpec

ConcatSpec concatenates the results of multiple ArraySpecs into a single flat array. It is useful for combining data from separate regions of a document.

import { concat } from '@origints/core'
concat(
  header.down().eachSlice('down', hasData, investmentRow('realized')),
  totalRealized.down().eachSlice('down', hasData, investmentRow('unrealized'))
)
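
To make the flattening behaviour concrete, here is a small self-contained sketch that models each array-producing extraction as a plain function and concatenates their results in argument order. `ArrayProducer`, `concatSketch`, the `Sheet` and `Row` shapes, and the fund names are all made up for illustration; this is not how `concat` from `@origints/core` is implemented.

```typescript
// An array-producing extraction, modeled as a plain function (an assumption).
type ArrayProducer<T> = (data: unknown) => readonly T[]

// Run every part against the same data and join the results into one flat array.
function concatSketch<T>(...parts: ArrayProducer<T>[]): ArrayProducer<T> {
  return data => {
    const out: T[] = []
    for (const part of parts) {
      out.push(...part(data))
    }
    return out
  }
}

interface Row {
  readonly name: string
  readonly status: 'realized' | 'unrealized'
}

// Hypothetical input: two separate regions of one document.
interface Sheet {
  readonly realized: readonly string[]
  readonly unrealized: readonly string[]
}

const realizedRows: ArrayProducer<Row> = data =>
  (data as Sheet).realized.map(name => ({ name, status: 'realized' as const }))

const unrealizedRows: ArrayProducer<Row> = data =>
  (data as Sheet).unrealized.map(name => ({ name, status: 'unrealized' as const }))

const allRows = concatSketch(realizedRows, unrealizedRows)({
  realized: ['Fund A', 'Fund B'],
  unrealized: ['Fund C'],
})
```

The result is a single array of three rows rather than an array of two arrays, with the rows from the first region ahead of those from the second.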

MatchSpec enables conditional extraction based on runtime predicates. Evaluate the data at a given path, match against cases, and extract differently depending on the result. A PanicSpec can serve as the default to signal unrecoverable conditions.
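
The source does not show the concrete shape of a MatchSpec, so the sketch below uses simplified, hypothetical shapes (`path`, `cases`, `when`, `use`, `otherwise`, and the `MiniSpec` union are assumptions) purely to illustrate the evaluate-match-extract flow with a PanicSpec as the default.

```typescript
// Hypothetical, simplified spec shapes; all field names are assumptions.
type LiteralSpec = { readonly kind: 'literal'; readonly value: unknown }
type PanicSpec = { readonly kind: 'panic'; readonly message: string }
type MatchCase = {
  readonly when: (value: unknown) => boolean
  readonly use: MiniSpec
}
type MatchSpec = {
  readonly kind: 'match'
  readonly path: readonly string[]
  readonly cases: readonly MatchCase[]
  readonly otherwise: MiniSpec
}
type MiniSpec = LiteralSpec | PanicSpec | MatchSpec

// Read the value at a property path; undefined if the path does not exist.
function getPath(data: unknown, path: readonly string[]): unknown {
  let current: unknown = data
  for (const key of path) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

function evaluate(spec: MiniSpec, data: unknown): unknown {
  switch (spec.kind) {
    case 'literal':
      return spec.value
    case 'panic':
      // Unrecoverable: abort the whole extraction.
      throw new Error(spec.message)
    case 'match': {
      const value = getPath(data, spec.path)
      // The first case whose predicate accepts the value wins.
      const hit = spec.cases.find(c => c.when(value))
      return evaluate(hit ? hit.use : spec.otherwise, data)
    }
  }
}

const statusLabel: MatchSpec = {
  kind: 'match',
  path: ['status'],
  cases: [
    { when: v => v === 'active', use: { kind: 'literal', value: 'Active account' } },
    { when: v => v === 'closed', use: { kind: 'literal', value: 'Closed account' } },
  ],
  otherwise: { kind: 'panic', message: 'Unknown account status' },
}

const label = evaluate(statusLabel, { status: 'active' })
```

With a panic as the default, input that matches no case stops the extraction with an error instead of yielding a partially filled result.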

In practice, you rarely construct spec objects directly. The builder API (the $ parameter in .emit()) provides a fluent interface:

.emit((out, $) => out
  .add('name', $.get('name').string()) // ExtractSpec
  .add('items', $.get('items').array(i => i.string())) // ArraySpec
  .addLiteral('version', '1.0.0') // LiteralSpec
)

Each format package provides its own builder that produces ExtractSpec instances with the right step and extraction types.

Two helpers construct ObjectSpec from properties:

  • extract($ => ({ ... })): provides a JSON SpecBuilder root for navigation. Use it when building specs from JSON data.
  • object({ ... }): takes a plain record of already-constructed specs. Use it when composing specs from different sources (e.g., inside forEach bodies).

import { extract, object, variableRef } from '@origints/core'
// extract(): when you need the JSON SpecBuilder
extract($ => ({
  name: $.get('name').string(),
  age: $.get('age').number(),
}))
// object(): when composing existing specs
object({
  company: variableRef('company', { extract: 'string' }),
  items: someXlsxArraySpec,
})

SpecBuilder.objects() provides a shorthand for the common array(item => object(...)) pattern:

// Before
$.get('users').array(item =>
  object({
    name: item.get('name').string(),
    age: item.get('age').number(),
  })
)
// After
$.get('users').objects(item => ({
  name: item.get('name').string(),
  age: item.get('age').number(),
}))

All consumer positions accept SpecLike, the union Spec | FluentSpec<Spec>, so a fluent-chained spec can be passed anywhere a Spec is expected:

import { fluent, literal } from '@origints/core'
// Fluent chaining replaces nested combinators
fluent($.get('price').string())
  .map(v => parseFloat(v as string), 'parseFloat')
  .guard(v => (v as number) > 0, 'Must be positive')
  .or(literal(0))
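
To show what a chain like this is expected to do at runtime, the sketch below models map, guard, and or on a tiny stand-in class. `FluentSketch` and its `execute` method are hypothetical, and the sketch assumes that `.or()` supplies a fallback whenever an earlier stage fails; the real FluentSpec builds spec objects rather than running functions directly.

```typescript
// Hypothetical stand-in for a fluent spec: wraps a function that extracts a value.
class FluentSketch<T> {
  private readonly run: (data: unknown) => T

  constructor(run: (data: unknown) => T) {
    this.run = run
  }

  // Transform the extracted value.
  map<U>(fn: (value: T) => U): FluentSketch<U> {
    return new FluentSketch<U>(data => fn(this.run(data)))
  }

  // Fail with `message` when the predicate rejects the value.
  guard(predicate: (value: T) => boolean, message: string): FluentSketch<T> {
    return new FluentSketch<T>(data => {
      const value = this.run(data)
      if (!predicate(value)) throw new Error(message)
      return value
    })
  }

  // Assumed semantics: use the fallback when any earlier stage throws.
  or<F>(fallback: F): FluentSketch<T | F> {
    return new FluentSketch<T | F>(data => {
      try {
        return this.run(data)
      } catch {
        return fallback
      }
    })
  }

  execute(data: unknown): T {
    return this.run(data)
  }
}

const price = new FluentSketch(data => (data as { price: string }).price)
  .map(v => parseFloat(v))
  .guard(v => v > 0, 'Must be positive')
  .or(0)

const validPrice = price.execute({ price: '19.99' }) // passes the guard
const negativePrice = price.execute({ price: '-5' }) // guard fails, fallback applies
const garbledPrice = price.execute({ price: 'n/a' }) // NaN fails the guard, fallback applies
```

Reading the chain top to bottom matches the order of evaluation, which is the main readability gain over nesting map, guard, and try combinators inside one another.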

Replace repetitive .add() chains with a single addAll() call:

out.addAll({
  name: $.get('name').string(),
  age: $.get('age').number(),
  role: $.get('role').string(),
})
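
One way to think about addAll() is as one add() per property of the record. The sketch below models that on a hypothetical `OutputSketch` class, with plain strings standing in for specs; it assumes properties are added in insertion order, which the source does not state, and it is not the library's output builder.

```typescript
// Hypothetical output builder; strings stand in for specs in this sketch.
class OutputSketch {
  private readonly entries: Array<[string, string]> = []

  add(key: string, spec: string): this {
    this.entries.push([key, spec])
    return this
  }

  // Modeled as one add() per property, in insertion order (an assumption).
  addAll(specs: Record<string, string>): this {
    for (const [key, spec] of Object.entries(specs)) {
      this.add(key, spec)
    }
    return this
  }

  keys(): string[] {
    return this.entries.map(([key]) => key)
  }
}

// The chained form and the single-call form register the same properties.
const chained = new OutputSketch()
  .add('name', 'string')
  .add('age', 'number')
  .add('role', 'string')

const bulk = new OutputSketch().addAll({
  name: 'string',
  age: 'number',
  role: 'string',
})
```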