PropRaven Marketplace v1 — Methodology

How each canonical entity and intelligence signal is produced. Customers building decisioning systems should read this before relying on any individual field.

Stable IDs (PRPV_*)

PropRaven IDs are deterministic 16-char hex hashes of canonicalized natural keys.

prpv_id = 'PRPV_' + entity + '_' + UPPER(SUBSTR(SHA2(canonical_key, 256), 1, 16))

Where canonical_key joins the natural-key fields with CHR(31) (ASCII unit separator), and each field is UPPER(TRIM(...))-ed (NULLs become CHR(0)).

Properties:

Idempotent: same input → same ID, every time
Whitespace + case tolerant: ' 37 ' and '37' produce the same ID
64 bits of entropy → P(collision in 256M rows) ≈ 1e-3 in any single entity (acceptable for v1)

What this means for you:

Your queries can JOIN on PRPV_* IDs with confidence
Source-system APN format changes do NOT change the PRPV ID (unless the field actually changes)
Within a major version, IDs are forever-stable — your warehouse can store them

Owner entity resolution

Every parcel has at most one canonical prpv_owner_id via the EDGE_OWNER_PARCEL table.

Pipeline:

Source cluster ID — PROPZILLA.CURATED.PARCEL_ENRICHED.OWNER_ENTITY_ID carries an internal cluster ID (er- prefix for entities, er-i- for individuals) covering 99.7% of parcels.
Aggregate to one row per cluster: most-frequent name, classified entity_type, parcel_count rollup.
Name-classifier override — owners whose name contains LLC, CORP, TRUST, COUNTY, CITY OF, etc. get their entity_type reclassified accordingly. This corrects PE's tendency to mark obvious-corporate names as INDIVIDUAL.
Canonical merge (v1.1) — corporate entities (LLC/CORP/LP/GOVT/NONPROFIT) with the same normalized name + entity_type collapse into a single prpv_owner_id. The one with the most parcels wins. INDIVIDUAL and TRUST entities are NOT merged (too prone to first-name-only false positives like "John" matching across families).
Placeholder drop — owners whose name is empty, "Not Available", "Current Owner", "Multiple Owners", "Unknown", "[Limited Access]" are mapped to NULL canonical and their edges dropped. ~22K placeholder clusters removed.

Confidence scoring:

1.00 — Tier A: known-institutional list match (e.g., "Invitation Homes", "Blackstone")
0.95 — Tier B: name-classified entity type (regex-matched corporate suffix)
0.85 — Tier C: er-i-* individual cluster
Edges inherit owner confidence

Limitations:

Multi-grantee names like "URBAN SITES OF AMERICA LLC; WILLIAMS MARY; WILLIAMS-DAUGHTRY EBETTE" are treated as one entity. Splitting at ; and re-resolving is a v1.2 improvement.
We do NOT resolve beneficial ownership (LLC → controlling natural person). That requires Secretary of State filings + FinCEN BOI access — slated for the Person entity in v1.1.

Address comparison (for is_owner_occupied / is_absentee)

PE provides two addresses that are always semantically meaningful but format-incompatible:

Situs: '1230 DRESDEN DR WEST CHARLOTTE NC' (street + city + state)
Mailing: '1230 DRESDEN DR WEST 28205' (street + zip5)

Naïve string comparison fails. Our methodology:

Strip non-alphanumeric, upper-case both
Take leading 16 chars (≈ street_number + 1-2 words of street name)
If prefixes equal → owner-occupied
If situs_zip5 + mail_zip5 + LEFT(prefix,8) all match → owner-occupied (weak match)
Otherwise → absentee

Absentee classification:

OUT_OF_STATE — mailing state ≠ situs state (highest signal — likely investor)
LOCAL_ABSENTEE — mailing in same ZIP but different street (often a property manager or relative)
OUT_OF_COUNTY — mailing in same state but different county
PO_BOX — mailing is a P.O. box (deliberate obfuscation)

Confidence scales with how many corroborating fields agree (street + zip + state).

Institutional buyer classification

Three tiers, in order of confidence:

Tier A (conf 1.00): name matches a curated list of known institutional buyers — Invitation Homes, AMH, Tricon, Blackstone, etc. Twenty-one canonical patterns shipped in v1.
Tier B (conf 0.90): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 10 (multi-property corporate ownership).
Tier C (conf 0.75): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 5.

Customer queries should default to Tier A+B (confidence_score >= 0.90). Tier C is for screen-out / lead-gen workflows tolerant of noise.

The curated list will grow. Customers requesting additions can suggest patterns; we ship them in the next release.

Recent transfer recency

Source: PROPZILLA.CURATED.PARCEL_ENRICHED.LAST_SALE_DATE + LAST_SALE_PRICE.

Buckets: LAST_90 / LAST_180 / LAST_365 / OVER_365 / NEVER.

Known data-quality issue: PE has ~313K parcels (Mecklenburg alone) with LAST_SALE_DATE set to dates in 2026 — these are upstream backfill artifacts, not real recent sales. Customers should filter confidence_score >= 0.95 (which requires both date AND price) for clean signal. Working with upstream to clean for v1.1.

Geometry storage

Parcel boundaries stored as GEOGRAPHY type (Snowflake's WGS84 spatial type). Coverage:

167.8M of 230M+ parcels have polygons (~73%)
has_polygon boolean lets customers fast-filter
Centroid lat/lon also exposed for non-spatial queries

Why INSURANCE covers 255.7M parcels (wider than Core's 230M+)

The Core / Ontology / Owner-Graph bundles are scoped to the canonical SILVER.PARCEL set — 230M+ parcels that have a confirmed matching SILVER.PROPERTY record. The INSURANCE bundle is intentionally wider: 255.7M parcels including every row in CURATED.PARCEL_ENRICHED, even those without a v1-canonical property record yet.

Why the gap? Property classification (year_built, building area, owner-occupancy) requires county assessor data that's published on a slower cadence than the underlying parcel boundaries. Some 25M parcels — primarily in fast-growing exurbs, recently-subdivided land, agricultural and vacant lots, and special-purpose parcels (cemeteries, easements, right-of-ways) — have NRI / NFIP / NOAA peril coverage but no synced property record yet. For carrier underwriting and reinsurance cat-modeling, the wider grain is the right tradeoff: every parcel with a meaningful peril vector belongs in the cohort, even if its property attributes are still pending.

What this means for customer queries:

INSURANCE.PARCEL_RISK and INSURANCE.PARCEL_HIGH_PERIL_FLAGS cover the full 255.7M.
Joining INSURANCE entities to SILVER.PARCEL via prpv_parcel_id will produce ~92% match rate (230M+ of 255.7M). The unmatched 25M are parcels with peril data but no canonical property — filter on EXISTS (SELECT 1 FROM SILVER.PARCEL ...) if you want only fully-canonicalized parcels.
The passthrough views INSURANCE.PARCEL / INSURANCE.PROPERTY / INSURANCE.OWNER are pinned to the 230M+ canonical set (they project SILVER directly), so customers who want narrower coverage can use those instead of PARCEL_RISK.

This is deliberate v1 scoping. v1.2 will close the gap by accelerating property-record ingestion for the missing 25M.

Known coverage gaps (v1 accepted)

Field	Coverage	Why	Workaround
`SILVER.OWNER.name_raw`	74% nationally	County feeds in NJ, ME, NH, RI, CA Prop 13 partially redact owner names at the publishing layer. Cannot enrich — the data isn't published upstream.	Filter to `name_raw IS NOT NULL` for owner-keyed analytics; expect lower coverage in affected states.
`SILVER.TAX_RECORD.assessed_value_total = 0`	~6.56M parcels in MD + ND	MD and ND county feeds publish TAV=0 for tax-exempt parcels (government, religious, nonprofit).	Filter `assessed_value_total > 0` to exclude exempt parcels (usually the desired behavior).
`SILVER.PERMIT → SILVER.PARCEL` FK	73.9% resolves	Some `MATCHED_PARCEL_ID` values point to parcels in the wider `CURATED.PARCEL_ENRICHED` set (255.7M) that aren't in canonical `SILVER.PARCEL` (230M+) — see the INSURANCE-bundle wider-grain note above.	For full permit-to-parcel reach, use the legacy `PROPRAVEN_PARCEL_INTELLIGENCE` share (over PROPZILLA.CURATED).
`SILVER.PROPERTY.property_type_class` normalization	MISMO codes verbatim	The normalized abbreviations (SFR / MF / CRE) ship in v1.1. Current values: `SingleFamily`, `VacantLand`, `Commercial`, `Condominium`, etc.	Filter on the MISMO values for now (see sample queries Q1, Q16).

What we DO NOT include

Per v1 PII boundary (locked in plan §13):

No DOB, SSN, email, phone, or device IDs on any individual
No image data (parcel photos)
No proprietary AVM (we expose source-county assessed values; customers run their own AVMs)
No personal addresses for natural persons (only for entities)

Revisit each in v2 with legal review.

Data freshness SLA (v1)

Tier	Refresh cadence
Ontology (PARCEL/PROPERTY/TAX_RECORD/PERMIT/OWNER/GEOGRAPHY/TRANSACTION/LOAN/ADDRESS + edges + signals)	MONTHLY
LENDING / SFR / CRE / GOVTECH bundles	MONTHLY
INSURANCE bundle (FEMA NRI annual; NFIP / NOAA quarterly)	QUARTERLY
BANKING_MARKETPLACE (FDIC SOD annual; analytics monthly)	MONTHLY

Source data inherits its publishing cadence — county assessor rolls publish monthly to annually, federal data (FEMA / Census / FDIC) annually. The _LOADED_AT and _SOURCE_AS_OF columns expose actual freshness per row.

META.SOURCE_LINEAGE.last_refreshed_at shows when each entity last updated.

Versioning

Every row carries _resolution_version (e.g., v1.0.0-national). Pin queries to a version for reproducibility:

SELECT * FROM SILVER.OWNER WHERE _resolution_version = 'v1.0.0-national';

Major versions (v1 → v2) are breaking. Old major lives in a parallel listing for 90 days. Minor versions (v1.1) are additive — new entities, new columns, refined methodology — and don't break existing queries.