PropRaven Marketplace v1 — Methodology

How each canonical entity and intelligence signal is produced. Customers building decisioning systems should read this before relying on any individual field.


Stable IDs (PRPV_*)

PropRaven IDs are deterministic 16-char hex hashes of canonicalized natural keys.

prpv_id = 'PRPV_' + entity + '_' + UPPER(SUBSTR(SHA2(canonical_key, 256), 1, 16))

Where canonical_key joins the natural-key fields with CHR(31) (ASCII unit separator), and each field is UPPER(TRIM(...))-ed (NULLs become CHR(0)).

Properties:

  • Idempotent: same input → same ID, every time
  • Whitespace + case tolerant: ' 37 ' and '37' produce the same ID
  • 64 bits of entropy → P(collision in 256M rows) ≈ 1e-3 in any single entity (acceptable for v1)

What this means for you:

  • Your queries can JOIN on PRPV_* IDs with confidence
  • Source-system APN format changes do NOT change the PRPV ID (unless the field actually changes)
  • Within a major version, IDs are forever-stable — your warehouse can store them

Owner entity resolution

Every parcel has at most one canonical prpv_owner_id via the EDGE_OWNER_PARCEL table.

Pipeline:

  1. Source cluster IDPROPZILLA.CURATED.PARCEL_ENRICHED.OWNER_ENTITY_ID carries an internal cluster ID (er- prefix for entities, er-i- for individuals) covering 99.7% of parcels.
  2. Aggregate to one row per cluster: most-frequent name, classified entity_type, parcel_count rollup.
  3. Name-classifier override — owners whose name contains LLC, CORP, TRUST, COUNTY, CITY OF, etc. get their entity_type reclassified accordingly. This corrects PE's tendency to mark obvious-corporate names as INDIVIDUAL.
  4. Canonical merge (v1.1) — corporate entities (LLC/CORP/LP/GOVT/NONPROFIT) with the same normalized name + entity_type collapse into a single prpv_owner_id. The one with the most parcels wins. INDIVIDUAL and TRUST entities are NOT merged (too prone to first-name-only false positives like "John" matching across families).
  5. Placeholder drop — owners whose name is empty, "Not Available", "Current Owner", "Multiple Owners", "Unknown", "[Limited Access]" are mapped to NULL canonical and their edges dropped. ~22K placeholder clusters removed.

Confidence scoring:

  • 1.00 — Tier A: known-institutional list match (e.g., "Invitation Homes", "Blackstone")
  • 0.95 — Tier B: name-classified entity type (regex-matched corporate suffix)
  • 0.85 — Tier C: er-i-* individual cluster
  • Edges inherit owner confidence

Limitations:

  • Multi-grantee names like "URBAN SITES OF AMERICA LLC; WILLIAMS MARY; WILLIAMS-DAUGHTRY EBETTE" are treated as one entity. Splitting at ; and re-resolving is a v1.2 improvement.
  • We do NOT resolve beneficial ownership (LLC → controlling natural person). That requires Secretary of State filings + FinCEN BOI access — slated for the Person entity in v1.1.

Address comparison (for is_owner_occupied / is_absentee)

PE provides two addresses that are always semantically meaningful but format-incompatible:

  • Situs: '1230 DRESDEN DR WEST CHARLOTTE NC' (street + city + state)
  • Mailing: '1230 DRESDEN DR WEST 28205' (street + zip5)

Naïve string comparison fails. Our methodology:

  1. Strip non-alphanumeric, upper-case both
  2. Take leading 16 chars (≈ street_number + 1-2 words of street name)
  3. If prefixes equal → owner-occupied
  4. If situs_zip5 + mail_zip5 + LEFT(prefix,8) all match → owner-occupied (weak match)
  5. Otherwise → absentee

Absentee classification:

  • OUT_OF_STATE — mailing state ≠ situs state (highest signal — likely investor)
  • LOCAL_ABSENTEE — mailing in same ZIP but different street (often a property manager or relative)
  • OUT_OF_COUNTY — mailing in same state but different county
  • PO_BOX — mailing is a P.O. box (deliberate obfuscation)

Confidence scales with how many corroborating fields agree (street + zip + state).


Institutional buyer classification

Three tiers, in order of confidence:

  • Tier A (conf 1.00): name matches a curated list of known institutional buyers — Invitation Homes, AMH, Tricon, Blackstone, etc. Twenty-one canonical patterns shipped in v1.
  • Tier B (conf 0.90): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 10 (multi-property corporate ownership).
  • Tier C (conf 0.75): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 5.

Customer queries should default to Tier A+B (confidence_score >= 0.90). Tier C is for screen-out / lead-gen workflows tolerant of noise.

The curated list will grow. Customers requesting additions can suggest patterns; we ship them in the next release.


Recent transfer recency

Source: PROPZILLA.CURATED.PARCEL_ENRICHED.LAST_SALE_DATE + LAST_SALE_PRICE.

Buckets: LAST_90 / LAST_180 / LAST_365 / OVER_365 / NEVER.

Known data-quality issue: PE has ~313K parcels (Mecklenburg alone) with LAST_SALE_DATE set to dates in 2026 — these are upstream backfill artifacts, not real recent sales. Customers should filter confidence_score >= 0.95 (which requires both date AND price) for clean signal. Working with upstream to clean for v1.1.


Geometry storage

Parcel boundaries stored as GEOGRAPHY type (Snowflake's WGS84 spatial type). Coverage:

  • 167.8M of 230M+ parcels have polygons (~73%)
  • has_polygon boolean lets customers fast-filter
  • Centroid lat/lon also exposed for non-spatial queries

Why INSURANCE covers 255.7M parcels (wider than Core's 230M+)

The Core / Ontology / Owner-Graph bundles are scoped to the canonical SILVER.PARCEL set — 230M+ parcels that have a confirmed matching SILVER.PROPERTY record. The INSURANCE bundle is intentionally wider: 255.7M parcels including every row in CURATED.PARCEL_ENRICHED, even those without a v1-canonical property record yet.

Why the gap? Property classification (year_built, building area, owner-occupancy) requires county assessor data that's published on a slower cadence than the underlying parcel boundaries. Some 25M parcels — primarily in fast-growing exurbs, recently-subdivided land, agricultural and vacant lots, and special-purpose parcels (cemeteries, easements, right-of-ways) — have NRI / NFIP / NOAA peril coverage but no synced property record yet. For carrier underwriting and reinsurance cat-modeling, the wider grain is the right tradeoff: every parcel with a meaningful peril vector belongs in the cohort, even if its property attributes are still pending.

What this means for customer queries:

  • INSURANCE.PARCEL_RISK and INSURANCE.PARCEL_HIGH_PERIL_FLAGS cover the full 255.7M.
  • Joining INSURANCE entities to SILVER.PARCEL via prpv_parcel_id will produce ~92% match rate (230M+ of 255.7M). The unmatched 25M are parcels with peril data but no canonical property — filter on EXISTS (SELECT 1 FROM SILVER.PARCEL ...) if you want only fully-canonicalized parcels.
  • The passthrough views INSURANCE.PARCEL / INSURANCE.PROPERTY / INSURANCE.OWNER are pinned to the 230M+ canonical set (they project SILVER directly), so customers who want narrower coverage can use those instead of PARCEL_RISK.

This is deliberate v1 scoping. v1.2 will close the gap by accelerating property-record ingestion for the missing 25M.


Known coverage gaps (v1 accepted)

FieldCoverageWhyWorkaround
SILVER.OWNER.name_raw74% nationallyCounty feeds in NJ, ME, NH, RI, CA Prop 13 partially redact owner names at the publishing layer. Cannot enrich — the data isn't published upstream.Filter to name_raw IS NOT NULL for owner-keyed analytics; expect lower coverage in affected states.
SILVER.TAX_RECORD.assessed_value_total = 0~6.56M parcels in MD + NDMD and ND county feeds publish TAV=0 for tax-exempt parcels (government, religious, nonprofit).Filter assessed_value_total > 0 to exclude exempt parcels (usually the desired behavior).
SILVER.PERMIT → SILVER.PARCEL FK73.9% resolvesSome MATCHED_PARCEL_ID values point to parcels in the wider CURATED.PARCEL_ENRICHED set (255.7M) that aren't in canonical SILVER.PARCEL (230M+) — see the INSURANCE-bundle wider-grain note above.For full permit-to-parcel reach, use the legacy PROPRAVEN_PARCEL_INTELLIGENCE share (over PROPZILLA.CURATED).
SILVER.PROPERTY.property_type_class normalizationMISMO codes verbatimThe normalized abbreviations (SFR / MF / CRE) ship in v1.1. Current values: SingleFamily, VacantLand, Commercial, Condominium, etc.Filter on the MISMO values for now (see sample queries Q1, Q16).

What we DO NOT include

Per v1 PII boundary (locked in plan §13):

  • No DOB, SSN, email, phone, or device IDs on any individual
  • No image data (parcel photos)
  • No proprietary AVM (we expose source-county assessed values; customers run their own AVMs)
  • No personal addresses for natural persons (only for entities)

Revisit each in v2 with legal review.


Data freshness SLA (v1)

TierRefresh cadence
Ontology (PARCEL/PROPERTY/TAX_RECORD/PERMIT/OWNER/GEOGRAPHY/TRANSACTION/LOAN/ADDRESS + edges + signals)MONTHLY
LENDING / SFR / CRE / GOVTECH bundlesMONTHLY
INSURANCE bundle (FEMA NRI annual; NFIP / NOAA quarterly)QUARTERLY
BANKING_MARKETPLACE (FDIC SOD annual; analytics monthly)MONTHLY

Source data inherits its publishing cadence — county assessor rolls publish monthly to annually, federal data (FEMA / Census / FDIC) annually. The _LOADED_AT and _SOURCE_AS_OF columns expose actual freshness per row.

META.SOURCE_LINEAGE.last_refreshed_at shows when each entity last updated.


Versioning

Every row carries _resolution_version (e.g., v1.0.0-national). Pin queries to a version for reproducibility:

SELECT * FROM SILVER.OWNER WHERE _resolution_version = 'v1.0.0-national';

Major versions (v1v2) are breaking. Old major lives in a parallel listing for 90 days. Minor versions (v1.1) are additive — new entities, new columns, refined methodology — and don't break existing queries.