PropRaven Marketplace v1 — Methodology
How each canonical entity and intelligence signal is produced. Customers building decisioning systems should read this before relying on any individual field.
Stable IDs (PRPV_*)
PropRaven IDs are deterministic 16-char hex hashes of canonicalized natural keys.
prpv_id = 'PRPV_' + entity + '_' + UPPER(SUBSTR(SHA2(canonical_key, 256), 1, 16))
Where canonical_key joins the natural-key fields with CHR(31) (ASCII unit separator), and each field is UPPER(TRIM(...))-ed (NULLs become CHR(0)).
Properties:
- Idempotent: same input → same ID, every time
- Whitespace + case tolerant:
' 37 'and'37'produce the same ID - 64 bits of entropy → P(collision in 256M rows) ≈ 1e-3 in any single entity (acceptable for v1)
What this means for you:
- Your queries can JOIN on PRPV_* IDs with confidence
- Source-system APN format changes do NOT change the PRPV ID (unless the field actually changes)
- Within a major version, IDs are forever-stable — your warehouse can store them
Owner entity resolution
Every parcel has at most one canonical prpv_owner_id via the EDGE_OWNER_PARCEL table.
Pipeline:
- Source cluster ID —
PROPZILLA.CURATED.PARCEL_ENRICHED.OWNER_ENTITY_IDcarries an internal cluster ID (er-prefix for entities,er-i-for individuals) covering 99.7% of parcels. - Aggregate to one row per cluster: most-frequent name, classified entity_type, parcel_count rollup.
- Name-classifier override — owners whose name contains
LLC,CORP,TRUST,COUNTY,CITY OF, etc. get theirentity_typereclassified accordingly. This corrects PE's tendency to mark obvious-corporate names as INDIVIDUAL. - Canonical merge (v1.1) — corporate entities (LLC/CORP/LP/GOVT/NONPROFIT) with the same normalized name + entity_type collapse into a single
prpv_owner_id. The one with the most parcels wins. INDIVIDUAL and TRUST entities are NOT merged (too prone to first-name-only false positives like "John" matching across families). - Placeholder drop — owners whose name is empty, "Not Available", "Current Owner", "Multiple Owners", "Unknown", "[Limited Access]" are mapped to NULL canonical and their edges dropped. ~22K placeholder clusters removed.
Confidence scoring:
1.00— Tier A: known-institutional list match (e.g., "Invitation Homes", "Blackstone")0.95— Tier B: name-classified entity type (regex-matched corporate suffix)0.85— Tier C: er-i-* individual cluster- Edges inherit owner confidence
Limitations:
- Multi-grantee names like "URBAN SITES OF AMERICA LLC; WILLIAMS MARY; WILLIAMS-DAUGHTRY EBETTE" are treated as one entity. Splitting at
;and re-resolving is a v1.2 improvement. - We do NOT resolve beneficial ownership (LLC → controlling natural person). That requires Secretary of State filings + FinCEN BOI access — slated for the Person entity in v1.1.
Address comparison (for is_owner_occupied / is_absentee)
PE provides two addresses that are always semantically meaningful but format-incompatible:
- Situs:
'1230 DRESDEN DR WEST CHARLOTTE NC'(street + city + state) - Mailing:
'1230 DRESDEN DR WEST 28205'(street + zip5)
Naïve string comparison fails. Our methodology:
- Strip non-alphanumeric, upper-case both
- Take leading 16 chars (≈ street_number + 1-2 words of street name)
- If prefixes equal → owner-occupied
- If situs_zip5 + mail_zip5 + LEFT(prefix,8) all match → owner-occupied (weak match)
- Otherwise → absentee
Absentee classification:
OUT_OF_STATE— mailing state ≠ situs state (highest signal — likely investor)LOCAL_ABSENTEE— mailing in same ZIP but different street (often a property manager or relative)OUT_OF_COUNTY— mailing in same state but different countyPO_BOX— mailing is a P.O. box (deliberate obfuscation)
Confidence scales with how many corroborating fields agree (street + zip + state).
Institutional buyer classification
Three tiers, in order of confidence:
- Tier A (conf 1.00): name matches a curated list of known institutional buyers — Invitation Homes, AMH, Tricon, Blackstone, etc. Twenty-one canonical patterns shipped in v1.
- Tier B (conf 0.90): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 10 (multi-property corporate ownership).
- Tier C (conf 0.75): entity_type ∈ (LLC/CORP/LP) AND parcel_count >= 5.
Customer queries should default to Tier A+B (confidence_score >= 0.90). Tier C is for screen-out / lead-gen workflows tolerant of noise.
The curated list will grow. Customers requesting additions can suggest patterns; we ship them in the next release.
Recent transfer recency
Source: PROPZILLA.CURATED.PARCEL_ENRICHED.LAST_SALE_DATE + LAST_SALE_PRICE.
Buckets: LAST_90 / LAST_180 / LAST_365 / OVER_365 / NEVER.
Known data-quality issue: PE has ~313K parcels (Mecklenburg alone) with LAST_SALE_DATE set to dates in 2026 — these are upstream backfill artifacts, not real recent sales. Customers should filter confidence_score >= 0.95 (which requires both date AND price) for clean signal. Working with upstream to clean for v1.1.
Geometry storage
Parcel boundaries stored as GEOGRAPHY type (Snowflake's WGS84 spatial type). Coverage:
- 167.8M of 230M+ parcels have polygons (~73%)
has_polygonboolean lets customers fast-filter- Centroid lat/lon also exposed for non-spatial queries
Why INSURANCE covers 255.7M parcels (wider than Core's 230M+)
The Core / Ontology / Owner-Graph bundles are scoped to the canonical SILVER.PARCEL set — 230M+ parcels that have a confirmed matching SILVER.PROPERTY record. The INSURANCE bundle is intentionally wider: 255.7M parcels including every row in CURATED.PARCEL_ENRICHED, even those without a v1-canonical property record yet.
Why the gap? Property classification (year_built, building area, owner-occupancy) requires county assessor data that's published on a slower cadence than the underlying parcel boundaries. Some 25M parcels — primarily in fast-growing exurbs, recently-subdivided land, agricultural and vacant lots, and special-purpose parcels (cemeteries, easements, right-of-ways) — have NRI / NFIP / NOAA peril coverage but no synced property record yet. For carrier underwriting and reinsurance cat-modeling, the wider grain is the right tradeoff: every parcel with a meaningful peril vector belongs in the cohort, even if its property attributes are still pending.
What this means for customer queries:
INSURANCE.PARCEL_RISKandINSURANCE.PARCEL_HIGH_PERIL_FLAGScover the full 255.7M.- Joining INSURANCE entities to
SILVER.PARCELviaprpv_parcel_idwill produce ~92% match rate (230M+ of 255.7M). The unmatched 25M are parcels with peril data but no canonical property — filter onEXISTS (SELECT 1 FROM SILVER.PARCEL ...)if you want only fully-canonicalized parcels. - The passthrough views
INSURANCE.PARCEL/INSURANCE.PROPERTY/INSURANCE.OWNERare pinned to the 230M+ canonical set (they project SILVER directly), so customers who want narrower coverage can use those instead ofPARCEL_RISK.
This is deliberate v1 scoping. v1.2 will close the gap by accelerating property-record ingestion for the missing 25M.
Known coverage gaps (v1 accepted)
| Field | Coverage | Why | Workaround |
|---|---|---|---|
SILVER.OWNER.name_raw | 74% nationally | County feeds in NJ, ME, NH, RI, CA Prop 13 partially redact owner names at the publishing layer. Cannot enrich — the data isn't published upstream. | Filter to name_raw IS NOT NULL for owner-keyed analytics; expect lower coverage in affected states. |
SILVER.TAX_RECORD.assessed_value_total = 0 | ~6.56M parcels in MD + ND | MD and ND county feeds publish TAV=0 for tax-exempt parcels (government, religious, nonprofit). | Filter assessed_value_total > 0 to exclude exempt parcels (usually the desired behavior). |
SILVER.PERMIT → SILVER.PARCEL FK | 73.9% resolves | Some MATCHED_PARCEL_ID values point to parcels in the wider CURATED.PARCEL_ENRICHED set (255.7M) that aren't in canonical SILVER.PARCEL (230M+) — see the INSURANCE-bundle wider-grain note above. | For full permit-to-parcel reach, use the legacy PROPRAVEN_PARCEL_INTELLIGENCE share (over PROPZILLA.CURATED). |
SILVER.PROPERTY.property_type_class normalization | MISMO codes verbatim | The normalized abbreviations (SFR / MF / CRE) ship in v1.1. Current values: SingleFamily, VacantLand, Commercial, Condominium, etc. | Filter on the MISMO values for now (see sample queries Q1, Q16). |
What we DO NOT include
Per v1 PII boundary (locked in plan §13):
- No DOB, SSN, email, phone, or device IDs on any individual
- No image data (parcel photos)
- No proprietary AVM (we expose source-county assessed values; customers run their own AVMs)
- No personal addresses for natural persons (only for entities)
Revisit each in v2 with legal review.
Data freshness SLA (v1)
| Tier | Refresh cadence |
|---|---|
| Ontology (PARCEL/PROPERTY/TAX_RECORD/PERMIT/OWNER/GEOGRAPHY/TRANSACTION/LOAN/ADDRESS + edges + signals) | MONTHLY |
| LENDING / SFR / CRE / GOVTECH bundles | MONTHLY |
| INSURANCE bundle (FEMA NRI annual; NFIP / NOAA quarterly) | QUARTERLY |
| BANKING_MARKETPLACE (FDIC SOD annual; analytics monthly) | MONTHLY |
Source data inherits its publishing cadence — county assessor rolls publish monthly to annually, federal data (FEMA / Census / FDIC) annually. The _LOADED_AT and _SOURCE_AS_OF columns expose actual freshness per row.
META.SOURCE_LINEAGE.last_refreshed_at shows when each entity last updated.
Versioning
Every row carries _resolution_version (e.g., v1.0.0-national). Pin queries to a version for reproducibility:
SELECT * FROM SILVER.OWNER WHERE _resolution_version = 'v1.0.0-national';
Major versions (v1 → v2) are breaking. Old major lives in a parallel listing for 90 days. Minor versions (v1.1) are additive — new entities, new columns, refined methodology — and don't break existing queries.