
Extraction System

All data extraction — regardless of format — uses a single generic type. JSON, XLSX, CSV, YAML, HTML, Markdown, and TOML all produce the same spec shape.

interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string // 'json', 'xlsx', 'csv', 'yaml', ...
  readonly steps: readonly S[] // format-specific navigation
  readonly extract: E // format-specific terminal extraction
}

Each format fills in its own step and extraction types:

// JSON
{ kind: 'extract', format: 'json', steps: ['user', 'name'], extract: 'string' }
// XLSX
{ kind: 'extract', format: 'xlsx', steps: [{ kind: 'sheet', name: 'Sales' }, { kind: 'cell', ref: 'B2' }], extract: 'number' }
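
As a rough sketch of how a format might fill in the two type parameters, the aliases below specialize ExtractSpec for JSON and XLSX. The names JsonStep, JsonExtract, XlsxStep, XlsxExtract and the two spec aliases are hypothetical and chosen only for illustration; the actual format packages may define these types differently.

```typescript
// Generic spec shape, copied from the interface above.
interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string
  readonly steps: readonly S[]
  readonly extract: E
}

// Hypothetical JSON types: steps are property keys or array indices.
type JsonStep = string | number
type JsonExtract = 'string' | 'number' | 'boolean'
type JsonExtractSpec = ExtractSpec<JsonStep, JsonExtract>

// Hypothetical XLSX types: steps are tagged navigation objects.
type XlsxStep =
  | { readonly kind: 'sheet'; readonly name: string }
  | { readonly kind: 'cell'; readonly ref: string }
type XlsxExtract = 'string' | 'number'
type XlsxExtractSpec = ExtractSpec<XlsxStep, XlsxExtract>

const jsonSpec: JsonExtractSpec = {
  kind: 'extract',
  format: 'json',
  steps: ['user', 'name'],
  extract: 'string',
}

const xlsxSpec: XlsxExtractSpec = {
  kind: 'extract',
  format: 'xlsx',
  steps: [{ kind: 'sheet', name: 'Sales' }, { kind: 'cell', ref: 'B2' }],
  extract: 'number',
}

// Both specializations are assignable to the unparameterized shape.
const allSpecs: readonly ExtractSpec[] = [jsonSpec, xlsxSpec]
```

Because both aliases share the same outer shape, code that only cares about kind and format can treat them uniformly, while format-specific code keeps precise step types.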

Each format registers an executor that handles its navigation steps and terminal extraction. JSON is not a special case — it’s just another registered executor.

registerSpecExecutor('json', jsonExecutor)
registerSpecExecutor('xlsx', xlsxExecutor)
registerSpecExecutor('csv', csvExecutor)

executeSpec() dispatches on spec.format to the registered executor.
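
The dispatch described above could be sketched as a map from format name to executor. This is a minimal illustration, not the library's implementation: the SpecExecutor signature, the (source, spec) argument order, and the toy JSON executor are all assumptions.

```typescript
interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string
  readonly steps: readonly S[]
  readonly extract: E
}

// Assumed executor signature: raw source plus spec in, extracted value out.
type SpecExecutor = (source: unknown, spec: ExtractSpec) => unknown

const executors = new Map<string, SpecExecutor>()

function registerSpecExecutor(format: string, executor: SpecExecutor): void {
  executors.set(format, executor)
}

// Look up the executor by spec.format and delegate to it.
function executeSpec(source: unknown, spec: ExtractSpec): unknown {
  const executor = executors.get(spec.format)
  if (executor === undefined) {
    throw new Error(`No executor registered for format '${spec.format}'`)
  }
  return executor(source, spec)
}

// Toy JSON executor: walk property keys, then check the terminal type.
const jsonExecutor: SpecExecutor = (source, spec) => {
  let current: unknown = source
  for (const step of spec.steps) {
    if (typeof current !== 'object' || current === null) {
      throw new Error(`Cannot navigate into a non-object at step '${String(step)}'`)
    }
    current = (current as Record<string, unknown>)[String(step)]
  }
  const expected = String(spec.extract)
  if (typeof current !== expected) {
    throw new Error(`Expected ${expected}, got ${typeof current}`)
  }
  return current
}

registerSpecExecutor('json', jsonExecutor)

const extractedName = executeSpec(
  { user: { name: 'Ada' } },
  { kind: 'extract', format: 'json', steps: ['user', 'name'], extract: 'string' },
)
```

In this sketch JSON goes through the same lookup as any other format, which is the point of the registry: adding a format means registering one more executor, not touching executeSpec().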

Specs compose into larger structures:

| Spec | Purpose |
| --- | --- |
| ExtractSpec | Terminal extraction from a data source |
| ArraySpec | Map over a collection |
| ObjectSpec | Construct an object from named properties |
| LiteralSpec | Constant value |
| MatchSpec | Conditional extraction based on runtime predicates |
| ConcatSpec | Concatenate multiple array results |
| PanicSpec | Signal an unrecoverable condition |
| TrySpec | Ordered fallback chain |
| MapSpec | Transform an extracted value |
| GuardSpec | Validate an extracted value |

The full Spec union:

type Spec =
  | ExtractSpec
  | ArraySpec
  | ObjectSpec
  | LiteralSpec
  | MatchSpec
  | PanicSpec
  | ConcatSpec
  | TrySpec
  | MapSpec
  | GuardSpec
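
To make the composition concrete, here is a small recursive evaluator over a reduced version of the union. It is a sketch under stated assumptions: only the kind discriminant comes from ExtractSpec above, while the other kind strings, the field names (value, properties, alternatives), and the simplified extract leaf are hypothetical stand-ins for the real types.

```typescript
// Simplified stand-in for ExtractSpec: key-path navigation over plain objects.
interface ExtractLeaf {
  readonly kind: 'extract'
  readonly steps: readonly string[]
}
interface LiteralSpec {
  readonly kind: 'literal'
  readonly value: unknown
}
interface ObjectSpec {
  readonly kind: 'object'
  readonly properties: Readonly<Record<string, Spec>>
}
interface TrySpec {
  readonly kind: 'try'
  readonly alternatives: readonly Spec[]
}
type Spec = ExtractLeaf | LiteralSpec | ObjectSpec | TrySpec

function evaluate(source: unknown, spec: Spec): unknown {
  switch (spec.kind) {
    case 'extract': {
      let current: unknown = source
      for (const step of spec.steps) {
        if (typeof current !== 'object' || current === null) {
          throw new Error(`Cannot navigate into a non-object at '${step}'`)
        }
        current = (current as Record<string, unknown>)[step]
      }
      if (current === undefined) {
        throw new Error(`Nothing found at ${spec.steps.join('.')}`)
      }
      return current
    }
    case 'literal':
      return spec.value
    case 'object': {
      // Evaluate each named property against the same source.
      const result: Record<string, unknown> = {}
      for (const [key, child] of Object.entries(spec.properties)) {
        result[key] = evaluate(source, child)
      }
      return result
    }
    case 'try': {
      // Ordered fallback: return the first alternative that does not throw.
      let lastError: unknown = new Error('TrySpec has no alternatives')
      for (const alternative of spec.alternatives) {
        try {
          return evaluate(source, alternative)
        } catch (error) {
          lastError = error
        }
      }
      throw lastError
    }
  }
}

const profileSpec: Spec = {
  kind: 'object',
  properties: {
    version: { kind: 'literal', value: '1.0.0' },
    name: {
      kind: 'try',
      alternatives: [
        { kind: 'extract', steps: ['user', 'displayName'] },
        { kind: 'extract', steps: ['user', 'name'] },
      ],
    },
  },
}

const profile = evaluate({ user: { name: 'Ada' } }, profileSpec)
```

Here the first alternative fails because displayName is absent, so the try node falls through to user.name.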

ConcatSpec concatenates multiple ArraySpec results into a single flat array. This is useful for combining data from separate regions of a document.

import { concat } from '@origints/core'
concat(
  header.down().eachSlice('down', hasData, investmentRow('realized')),
  totalRealized.down().eachSlice('down', hasData, investmentRow('unrealized')),
)
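
The flattening behavior can be illustrated in isolation. The sketch below does not use the concat exported by @origints/core; concatSpecs, evaluateConcat, and the ArrayResultSpec shape (with an evaluate callback standing in for running an ArraySpec) are hypothetical names used only to show that the parts are joined in order into one flat array.

```typescript
// Hypothetical stand-in for an ArraySpec whose evaluation yields an array.
interface ArrayResultSpec<T> {
  readonly kind: 'array'
  readonly evaluate: () => readonly T[]
}

// Hypothetical ConcatSpec shape: an ordered list of array-producing parts.
interface ConcatSpec<T> {
  readonly kind: 'concat'
  readonly parts: readonly ArrayResultSpec<T>[]
}

function concatSpecs<T>(...parts: ArrayResultSpec<T>[]): ConcatSpec<T> {
  return { kind: 'concat', parts }
}

// Evaluate each part in order and append its items to one flat result.
function evaluateConcat<T>(spec: ConcatSpec<T>): T[] {
  const out: T[] = []
  for (const part of spec.parts) {
    out.push(...part.evaluate())
  }
  return out
}

const realizedRows: ArrayResultSpec<string> = {
  kind: 'array',
  evaluate: () => ['Fund A (realized)', 'Fund B (realized)'],
}
const unrealizedRows: ArrayResultSpec<string> = {
  kind: 'array',
  evaluate: () => ['Fund C (unrealized)'],
}

const combinedRows = evaluateConcat(concatSpecs(realizedRows, unrealizedRows))
```

The result is a single array of three rows rather than an array of two arrays, which mirrors the "single flat array" behavior described above.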

MatchSpec enables conditional extraction based on runtime predicates: it evaluates the data at a given path, matches it against a list of cases, and extracts differently depending on which case applies. A PanicSpec can serve as the default case to signal an unrecoverable condition.
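
A minimal sketch of that evaluation order follows. The shapes are assumptions made for illustration: the path, cases, when, then, and fallback field names are hypothetical, and predicates are modeled as plain functions, which may not match how the library represents them.

```typescript
interface LiteralSpec {
  readonly kind: 'literal'
  readonly value: unknown
}
interface PanicSpec {
  readonly kind: 'panic'
  readonly message: string
}
interface MatchCase {
  readonly when: (value: unknown) => boolean
  readonly then: Spec
}
interface MatchSpec {
  readonly kind: 'match'
  readonly path: readonly string[]
  readonly cases: readonly MatchCase[]
  readonly fallback: Spec
}
type Spec = LiteralSpec | PanicSpec | MatchSpec

// Read the value at a key path; yields undefined if the path breaks off.
function valueAt(source: unknown, path: readonly string[]): unknown {
  let current: unknown = source
  for (const key of path) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

function evaluate(source: unknown, spec: Spec): unknown {
  switch (spec.kind) {
    case 'literal':
      return spec.value
    case 'panic':
      throw new Error(spec.message)
    case 'match': {
      // The first case whose predicate accepts the value wins; otherwise fall back.
      const value = valueAt(source, spec.path)
      const hit = spec.cases.find(c => c.when(value))
      return evaluate(source, hit !== undefined ? hit.then : spec.fallback)
    }
  }
}

const byDocumentType: MatchSpec = {
  kind: 'match',
  path: ['type'],
  cases: [
    { when: v => v === 'invoice', then: { kind: 'literal', value: 'invoice layout' } },
    { when: v => v === 'receipt', then: { kind: 'literal', value: 'receipt layout' } },
  ],
  fallback: { kind: 'panic', message: 'Unknown document type' },
}

const layout = evaluate({ type: 'receipt' }, byDocumentType)
```

With a document of any other type, the fallback PanicSpec throws instead of silently producing a partial result.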

In practice, you rarely construct spec objects directly. The builder API (the $ parameter in .emit()) provides a fluent interface:

.emit((out, $) => out
  .add('name', $.get('name').string()) // ExtractSpec
  .add('items', $.get('items').array(i => i.string())) // ArraySpec
  .addLiteral('version', '1.0.0') // LiteralSpec
)

Each format package provides its own builder that produces ExtractSpec instances with the right step and extraction types.
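
One way such a builder might work is to accumulate navigation steps immutably and emit an ExtractSpec at the terminal call. The JsonBuilder class below is a hypothetical sketch, not the builder shipped by any format package; only the get() and string() method names are taken from the example above, and number() is added by analogy.

```typescript
interface ExtractSpec<S = unknown, E = unknown> {
  readonly kind: 'extract'
  readonly format: string
  readonly steps: readonly S[]
  readonly extract: E
}

// Hypothetical set of terminal extraction types for JSON.
type JsonExtract = 'string' | 'number'

class JsonBuilder {
  private readonly steps: readonly string[]

  constructor(steps: readonly string[] = []) {
    this.steps = steps
  }

  // Navigation: return a new builder with one more step; the receiver is unchanged.
  get(key: string): JsonBuilder {
    return new JsonBuilder([...this.steps, key])
  }

  // Terminal calls: freeze the accumulated steps into a spec.
  string(): ExtractSpec<string, JsonExtract> {
    return this.terminal('string')
  }

  number(): ExtractSpec<string, JsonExtract> {
    return this.terminal('number')
  }

  private terminal(extract: JsonExtract): ExtractSpec<string, JsonExtract> {
    return { kind: 'extract', format: 'json', steps: this.steps, extract }
  }
}

const $ = new JsonBuilder()
const nameSpec = $.get('user').get('name').string()
const ageSpec = $.get('user').get('age').number()
```

Because get() returns a fresh builder each time, the shared root $ can be reused for several paths without one chain leaking steps into another.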