/data / docs

Data model

The MoneyGuru Finance Data corpus lives in the insurance Postgres schema. Ten tables, all foreign-keyed, with per-vertical filtered views generated from the verticals table.

Tables REST + MCP endpoints Access & SLAs sample.json →

Live row counts

Live · refreshes with the site build

Table	Rows	Purpose
insurance.verticals	22	One row per insurance vertical (life, trauma, health, business-motor, etc.). All other tables filter by vertical_id.
insurance.insurers	111	Legal entities issuing policies. Scoped per vertical (the same brand can appear once per vertical).
insurance.products	142	A specific retail product issued by an insurer (e.g. "Vero Commercial Motor"). Wordings hang off products.
insurance.wordings	188	Versioned PDS documents. Each wording is one snapshot of the policy text at a point in time.
insurance.policy_clauses	6,865	Section-level chunks of a wording, each ~500 tokens, embedded for semantic search and tagged by type.
insurance.wording_facts	134	Per-wording structured-field extraction (excess amounts, sub-limits, sums insured) as typed JSON. Schema differs per vertical — see verticals.facts_jsonschema.
insurance.policy_diffs	51	Section-level diff between two consecutive wording versions of the same product. The engine behind Diff & Watch.
insurance.topics	—	Cross-vertical canonical question/topic catalogue (e.g. "Windscreen damage", "Pre-existing condition cover"). Wordings are indexed against topics for retrieval-augmented Q&A.
insurance.wording_topic_extracts	1	Per-wording per-topic markdown extract: "what does THIS wording say about THIS topic?" The retrieval payload AdviserGPT cites from.

Tables

All tables are in the insurance schema. Per-vertical views (e.g. insurance.trauma_policy_clauses) filter each base table by vertical_id.

insurance.verticals

One row per insurance vertical (life, trauma, health, business-motor, etc.). All other tables filter by vertical_id.

Rows: 22

Definition

CREATE TABLE insurance.verticals (
  id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  slug            text UNIQUE NOT NULL,        -- 'life', 'trauma', 'business-motor'
  name            text NOT NULL,               -- 'Life Insurance'
  primary_site    text,                        -- canonical Pangaea site for the vertical
  facts_jsonschema jsonb,                      -- per-vertical schema for wording_facts.facts
  created_at      timestamptz DEFAULT now()
);

Notes

• Every vertical has a generated set of filtered views: insurance.{base}_insurers, insurance.{base}_products, insurance.{base}_wordings, insurance.{base}_policy_clauses, insurance.{base}_wording_facts, insurance.{base}_policy_diffs.
• base = slug with hyphens replaced by underscores.

Example query

-- list all verticals with at least one wording
select v.slug, v.name, count(w.id) as wording_count
from insurance.verticals v
left join insurance.products p on p.vertical_id = v.id
left join insurance.wordings w on w.product_id = p.id
group by v.slug, v.name
order by wording_count desc;

insurance.insurers

Legal entities issuing policies. Scoped per vertical (the same brand can appear once per vertical).

Rows: 111

Definition

CREATE TABLE insurance.insurers (
  id                uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  vertical_id       uuid NOT NULL REFERENCES insurance.verticals(id),
  slug              text NOT NULL,             -- 'vero', 'aia'
  name              text NOT NULL,
  legal_entity      text,                      -- 'Vero Insurance New Zealand Limited'
  brand_family      text,                      -- 'Suncorp', 'IAG'
  primary_website   text,
  rbnz_strength_rating text,                   -- 'A+', etc., where applicable
  fsp_number        text,                      -- NZ FSP register number
  created_at        timestamptz DEFAULT now(),
  UNIQUE (vertical_id, slug)
);

Notes

• Use slug for stable references in URLs and API parameters; name for display.
• brand_family lets you group sister brands (e.g. Vero and AAI under Suncorp).

Example query

-- all insurers in the trauma vertical with at least one current wording
select i.slug, i.name
from insurance.trauma_insurers i
join insurance.trauma_products p on p.insurer_id = i.id
join insurance.trauma_wordings w on w.product_id = p.id
where w.superseded_at is null
group by i.slug, i.name
order by i.name;

insurance.products

A specific retail product issued by an insurer (e.g. "Vero Commercial Motor"). Wordings hang off products.

Rows: 142

Definition

CREATE TABLE insurance.products (
  id                  uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  insurer_id          uuid NOT NULL REFERENCES insurance.insurers(id),
  vertical_id         uuid NOT NULL REFERENCES insurance.verticals(id),
  slug                text NOT NULL,
  name                text NOT NULL,
  status              text NOT NULL DEFAULT 'active',  -- 'active' | 'withdrawn'
  positioning_summary text,                            -- editorial one-paragraph summary
  source_url          text,                            -- canonical insurer product page
  created_at          timestamptz DEFAULT now(),
  UNIQUE (insurer_id, slug)
);

Notes

• status='withdrawn' rows are kept so historical wordings remain queryable, but they're excluded from default API responses.
• A product can have many wordings over time; use the wordings table to find current vs historical.

Example query

-- products with no current wording (needs ingestion)
select i.name as insurer, p.name as product
from insurance.products p
join insurance.insurers i on i.id = p.insurer_id
where not exists (
  select 1 from insurance.wordings w
  where w.product_id = p.id and w.superseded_at is null
);

insurance.wordings

Versioned PDS documents. Each wording is one snapshot of the policy text at a point in time.

Rows: 188

Definition

CREATE TABLE insurance.wordings (
  id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  product_id      uuid NOT NULL REFERENCES insurance.products(id),
  wording_version text,                        -- '2024-08' or 'PDS-v3.2'
  source_url      text NOT NULL,               -- direct link to insurer PDS PDF
  pdf_hash        text,                        -- sha256 of the source PDF
  page_count      int,
  markdown        text,                        -- parsed text body
  ingested_at     timestamptz NOT NULL DEFAULT now(),
  superseded_at   timestamptz,                 -- set when a newer wording is ingested
  superseded_by   uuid REFERENCES insurance.wordings(id),
  insurer_correct boolean DEFAULT false        -- true if insurer-confirmed via insurer-of-record write
);

Notes

• Throw-on-stale SLA: when (now() - ingested_at) > 12 months and superseded_at is null, the read endpoint returns HTTP 410 Gone.
• pdf_hash lets clients cache aggressively and detect republished-but-unchanged wordings.

Example query

-- current wording per product for an insurer
select p.name as product, w.wording_version, w.ingested_at
from insurance.wordings w
join insurance.products p on p.id = w.product_id
join insurance.insurers i on i.id = p.insurer_id
where i.slug = 'vero'
  and w.superseded_at is null
order by p.name;

insurance.policy_clauses

Section-level chunks of a wording, each ~500 tokens, embedded for semantic search and tagged by type.

Rows: 6,865

Definition

CREATE TABLE insurance.policy_clauses (
  id              uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  wording_id      uuid NOT NULL REFERENCES insurance.wordings(id),
  product_id      uuid NOT NULL REFERENCES insurance.products(id),
  section_path    text,                        -- '§5.3 Excess > Young driver'
  source_page     int,                         -- page in the source PDF
  chunk_idx       int NOT NULL,
  clause_text     text NOT NULL,
  clause_types    text[],                      -- ['excess','sub_limit','definition',...]
  token_count     int,
  embedding       vector(1536),                -- text-embedding-3-small
  embedding_model text DEFAULT 'text-embedding-3-small',
  embedded_at     timestamptz DEFAULT now()
);

CREATE INDEX policy_clauses_embedding_idx
  ON insurance.policy_clauses
  USING hnsw (embedding vector_cosine_ops);

Notes

• clause_types is a text[] — a single chunk often qualifies as more than one type (e.g. an exclusion that defines a sub-limit).
• Valid clause_type tags: excess, sub_limit, exclusion, definition, waiting_period, sum_insured, stand_down, premium_loading, pre_existing, optional_benefit, claim_trigger, territorial_scope.
• HNSW index on the embedding column powers semantic search with ~10ms p95 over the full corpus.

Example query

-- semantic search: clauses about young-driver excess across all Vero products
with q as (
  select openai.embed('young driver excess') as v
)
select p.name as product, pc.section_path, left(pc.clause_text, 200) as excerpt
from insurance.policy_clauses pc
join insurance.products p on p.id = pc.product_id
join insurance.insurers i on i.id = p.insurer_id
cross join q
where i.slug = 'vero'
  and 'excess' = any(pc.clause_types)
order by pc.embedding <=> q.v
limit 10;

insurance.wording_facts

Per-wording structured-field extraction (excess amounts, sub-limits, sums insured) as typed JSON. Schema differs per vertical — see verticals.facts_jsonschema.

Rows: 134

Definition

CREATE TABLE insurance.wording_facts (
  wording_id      uuid PRIMARY KEY REFERENCES insurance.wordings(id),
  vertical_id     uuid NOT NULL REFERENCES insurance.verticals(id),
  facts           jsonb NOT NULL,              -- vertical-specific shape
  confidence_tier text NOT NULL,               -- 'high' | 'medium' | 'low'
  extracted_at    timestamptz DEFAULT now(),
  extractor_version text                       -- pipeline version that produced facts
);

Notes

• confidence_tier reflects extractor self-assessment + insurer-confirmation status. Default API responses include only high-tier facts unless ?include_low=true is passed.
• facts shape is vertical-specific. For business-motor it includes { standard_excess, young_driver_excess, glass_excess, sum_insured_max, territorial_scope } etc.

Example query

-- standard-excess distribution across all current business-motor wordings
select i.name, p.name as product, wf.facts->>'standard_excess' as excess
from insurance.business_motor_wording_facts wf
join insurance.wordings w on w.id = wf.wording_id
join insurance.products p on p.id = w.product_id
join insurance.insurers i on i.id = p.insurer_id
where w.superseded_at is null
  and wf.confidence_tier = 'high'
order by (wf.facts->>'standard_excess')::numeric;

insurance.policy_diffs

Section-level diff between two consecutive wording versions of the same product. The engine behind Diff & Watch.

Rows: 51

Definition

CREATE TABLE insurance.policy_diffs (
  id                uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  product_id        uuid NOT NULL REFERENCES insurance.products(id),
  from_wording_id   uuid NOT NULL REFERENCES insurance.wordings(id),
  to_wording_id     uuid NOT NULL REFERENCES insurance.wordings(id),
  diff_jsonb        jsonb NOT NULL,            -- section-level [{section, before, after, op}]
  summary_markdown  text,                      -- plain-English summary, LLM-generated, insurer-reviewable
  change_tags       text[],                    -- ['excess_increase','definition_change','new_exclusion']
  chars_added       int NOT NULL DEFAULT 0,
  chars_removed     int NOT NULL DEFAULT 0,
  generated_at      timestamptz DEFAULT now(),
  reviewed_at       timestamptz,
  reviewed_by       text                       -- 'insurer' | 'editorial' | null (auto-only)
);

Notes

• change_tags vocabulary: excess_increase, excess_decrease, sub_limit_change, new_exclusion, removed_exclusion, definition_change, waiting_period_change, scope_expansion, scope_contraction, formatting_only.
• formatting_only tag is set when the diff is whitespace + reflow only — useful filter for "wake me only on real change".
• Webhook subscribers receive this row plus the resolved insurer/product/version names within 60 seconds of generated_at.

Example query

-- biggest substantive diffs across the corpus in the last 90 days
select i.name, p.name as product,
       pd.change_tags, pd.chars_added, pd.chars_removed,
       left(pd.summary_markdown, 200) as summary_excerpt
from insurance.policy_diffs pd
join insurance.products p on p.id = pd.product_id
join insurance.insurers i on i.id = p.insurer_id
where pd.generated_at > now() - interval '90 days'
  and not ('formatting_only' = any(pd.change_tags))
order by (pd.chars_added + pd.chars_removed) desc
limit 25;

insurance.topics

Cross-vertical canonical question/topic catalogue (e.g. "Windscreen damage", "Pre-existing condition cover"). Wordings are indexed against topics for retrieval-augmented Q&A.

Rows: —

Definition

CREATE TABLE insurance.topics (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  vertical_id  uuid REFERENCES insurance.verticals(id),  -- null = cross-vertical
  slug         text UNIQUE NOT NULL,
  name         text NOT NULL,
  description  text,                              -- canonical question phrasing
  created_at   timestamptz DEFAULT now()
);

Notes

• AdviserGPT uses topics as the retrieval scaffold — a question is matched to one or more topics, then answered from wording_topic_extracts.

Example query

-- topics with the best coverage across the business-motor corpus
select t.slug, t.name, count(wte.wording_id) as wordings_with_extract
from insurance.topics t
left join insurance.wording_topic_extracts wte on wte.topic_id = t.id
join insurance.verticals v on v.id = t.vertical_id
where v.slug = 'business-motor'
group by t.slug, t.name
order by wordings_with_extract desc;

insurance.wording_topic_extracts

Per-wording per-topic markdown extract: "what does THIS wording say about THIS topic?" The retrieval payload AdviserGPT cites from.

Rows: 1

Definition

CREATE TABLE insurance.wording_topic_extracts (
  wording_id      uuid REFERENCES insurance.wordings(id),
  topic_id        uuid REFERENCES insurance.topics(id),
  markdown_extract text NOT NULL,
  source_page     int,
  confidence_tier text NOT NULL,                 -- 'high' | 'medium' | 'low'
  extracted_at    timestamptz DEFAULT now(),
  PRIMARY KEY (wording_id, topic_id)
);

Notes

• A null row (no extract for a wording × topic combination) means the wording does not address that topic.
• markdown_extract preserves the source citation inline as "(see §X.Y, p.Z)" — never strip these when displaying to end users.

Example query

-- how each insurer's current business-motor wording addresses windscreen damage
select i.name, p.name as product, w.wording_version,
       wte.source_page, wte.markdown_extract
from insurance.wording_topic_extracts wte
join insurance.topics t on t.id = wte.topic_id
join insurance.wordings w on w.id = wte.wording_id
join insurance.products p on p.id = w.product_id
join insurance.insurers i on i.id = p.insurer_id
where t.slug = 'windscreen-damage'
  and w.superseded_at is null
  and wte.confidence_tier = 'high'
order by i.name;

REST + MCP endpoints

Each table is exposed through a REST endpoint and an MCP tool. Subscribers can use either interchangeably — the MCP server speaks the same data model as the REST API.

Resource	REST	MCP tools
insurers	GET /api/v1/insurers GET /api/v1/insurers/{slug}	list_insurers, get_insurer
products	GET /api/v1/products?vertical={slug}&insurer={slug} GET /api/v1/products/{id}	list_products, get_product
wordings	GET /api/v1/wordings/{id} GET /api/v1/products/{id}/wordings	get_wording, list_product_wordings
policy_clauses	GET /api/v1/clauses?product_id={id}&section={path} POST /api/v1/clauses/search (semantic)	search_clauses (semantic), get_clauses_for_product
wording_facts	GET /api/v1/wordings/{id}/facts	get_wording_facts
policy_diffs	GET /api/v1/diffs?since={iso8601}&insurer={slug} POST /webhooks/diffs/subscribe	list_diffs, subscribe_diffs
wording_topic_extracts	GET /api/v1/topics/{slug}/extracts POST /api/v1/ask (AdviserGPT)	get_topic_extracts, ask_wording
insurer-of-record	POST /api/v1/insurer-write (OAuth, audit-logged) GET /api/v1/audit-log?insurer={slug}	submit_insurer_correction, list_audit_log

Access & SLAs

Authentication

Bearer token over HTTPS. Tokens are scoped per organisation and per vertical, with optional IP allow-list. Insurer-of-record write access requires OAuth + SSO from your compliance domain.

Rate limits

600 req/min sustained per token, 5000 burst. Semantic search (/clauses/search) and AdviserGPT (/ask) are metered separately at 60 req/min default.

Throw-on-stale

SLA

When a wording is more than 12 months past its last ingest and has no superseding version, reads return 410 Gone with the last-known ingested_at in the response. Stale data is never silently served.

Freshness

New wordings detected within 24 hours of insurer publication for monitored sources. Insurer-authored corrections are live within 5 minutes of approval.

Webhooks

Subscribe to policy_diffs events scoped by insurer, vertical or product. Signed payloads (HMAC-SHA256), at-least-once delivery, exponential-backoff retry up to 24 hours.

Data residency

Primary database in Sydney (ap-southeast-2). PDF artefacts in Auckland. No personal data is stored — the corpus contains only published PDS content and structured derivatives.

Want an API key?

Tell us what you'd build and we'll provision a sandbox token within a couple of working days.

partners@moneyguru.co.nz