Data Methodology

Every record in the GeoData dataset contains 24 structured fields. This page explains what each field contains, how our data quality pipeline works, and what you can expect in terms of completeness and accuracy.

Complete Field Reference

Field	Type	Example Value
`place_id`	id	08f2a100d8a54a0203e9e4580898d3ea
`name`	text	Bella Napoli Pizzeria
`brand`	text	empty
`basic_category`	category	Italian Restaurant
`category`	category	Restaurant
`alternate_categories`	category	Pizza Restaurant, Wine Bar
`address`	text	287 Grand St
`city`	text	New York
`state`	text	NY
`zipcode`	text	10002
`country`	text	US
`lat`	number	40.718432
`lon`	number	-73.992158
`websites`	url	https://bellanapolipizza.com
`phones`	phone	(212) 555-0847
`facebook`	social	https://facebook.com/bellanapolinyc
`operating_status`	status	Active
`location_count`	number	1
`market_name`	text	New York
`county_fips`	text	36061
`tract_geoid`	text	36061002300
`extracted_emails`	email	info@bellanapolipizza.com
`extracted_instagram`	social	bellanapolinyc
`extracted_at`	timestamp	2026-02-15T08:23:41Z

All 24 data fields shown with a sample restaurant record
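
If you load records yourself, keep the identifier-like fields as text. A minimal sketch follows, assuming the data arrives as a CSV file whose header row uses the field names above; the delivery format is an assumption for illustration, and `load_records` is a hypothetical helper, not part of any GeoData tooling. It shows why `zipcode`, `county_fips`, and `tract_geoid` should stay strings (leading zeros matter) while `lat`, `lon`, and `location_count` are converted to numbers.

```python
import csv
import io


def load_records(csv_text):
    """Parse CSV text into dicts, converting only the numeric fields.

    Assumes a header row with the field names from the reference table.
    csv.DictReader yields every value as a string, so ZIP and FIPS codes
    keep their leading zeros unless we convert them ourselves.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["lat"] = float(row["lat"])
        row["lon"] = float(row["lon"])
        # location_count may be blank in this sketch; treat blank as unknown.
        count = row.get("location_count", "").strip()
        row["location_count"] = int(count) if count else None
        records.append(row)
    return records


sample = (
    "place_id,name,zipcode,county_fips,lat,lon,location_count\n"
    "08f2a100d8a54a0203e9e4580898d3ea,Bella Napoli Pizzeria,"
    "10002,36061,40.718432,-73.992158,1\n"
)
record = load_records(sample)[0]
print(record["zipcode"], record["lat"], record["location_count"])
```

Reading these fields as integers in a spreadsheet or dataframe would turn a ZIP code such as "02139" into 2139, which is why the sketch leaves them untouched.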

Identification Fields

Place ID -- A unique identifier for each business location. This stays consistent across monthly updates, so you can track the same business over time.
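
Because the ID is stable, two monthly snapshots can be compared record by record. The sketch below is one way to do that, assuming each snapshot is a list of dicts keyed by the field names above; `diff_snapshots` is a hypothetical helper. It ignores `extracted_at` by default, since that timestamp is expected to change on every enrichment cycle.

```python
def diff_snapshots(previous, current, ignore=("extracted_at",)):
    """Return place_ids that were added, removed, or changed between snapshots."""

    def comparable(row):
        # Drop fields that change every cycle and would mark every row as changed.
        return {k: v for k, v in row.items() if k not in ignore}

    prev_by_id = {row["place_id"]: comparable(row) for row in previous}
    curr_by_id = {row["place_id"]: comparable(row) for row in current}

    shared = prev_by_id.keys() & curr_by_id.keys()
    return {
        "added": sorted(curr_by_id.keys() - prev_by_id.keys()),
        "removed": sorted(prev_by_id.keys() - curr_by_id.keys()),
        "changed": sorted(pid for pid in shared if prev_by_id[pid] != curr_by_id[pid]),
    }
```

Matching on `place_id` rather than on name or address avoids treating a renamed or re-addressed business as a new record.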

Name -- The official business name as listed in public records and on the business's own website. For franchises and chains, this is the name listed for that specific location, which is often the same as the brand name (e.g., "Starbucks" or "McDonald's").

Brand -- The parent brand or franchise name, if applicable. This field is populated for chain businesses and can be used to filter out or group by brand. Independent businesses typically have this field empty.

Category -- The primary top-level category: Restaurant, Cafe, or Bar.

Basic Category -- A more specific classification within the top-level category (e.g., "Pizza Restaurant", "Coffee Shop", "Cocktail Bar").

Alternate Categories -- Additional business classifications. A single business may have multiple categories (e.g., a location might be classified as both "Italian Restaurant" and "Wine Bar"). Comma-separated when multiple.

Location Count -- For chain brands, this indicates the number of locations that brand has in the dataset.
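
As an illustration of filtering and grouping by brand, here is a short sketch. It assumes an empty `brand` appears as an empty string or a missing value in your copy of the data; the exact representation is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def is_independent(record):
    """True when the record has no brand (empty string or missing value)."""
    return not (record.get("brand") or "").strip()


def brand_counts(records):
    """Count records per brand, skipping independent businesses."""
    return Counter(r["brand"].strip() for r in records if not is_independent(r))


records = [
    {"name": "Starbucks", "brand": "Starbucks"},
    {"name": "Bella Napoli Pizzeria", "brand": ""},
    {"name": "Starbucks", "brand": "Starbucks"},
]
independents = [r["name"] for r in records if is_independent(r)]
print(independents, brand_counts(records))
```

Counts computed this way cover only the rows you have loaded, so for a single-state file they can be lower than the dataset-wide `location_count` value.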

Contact Information

Extracted Emails -- Email addresses extracted from the business's website. These may include general inquiries, owner contacts, or booking addresses. A single record can have multiple email addresses, separated by commas.
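
Multi-value fields such as `extracted_emails` and `alternate_categories` need to be split before use. A minimal sketch, assuming values are separated by commas with optional surrounding whitespace and that no individual value contains a comma (`split_multi` is a hypothetical helper):

```python
def split_multi(value):
    """Split a comma-separated field into a list of trimmed, non-empty values."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


emails = split_multi("info@bellanapolipizza.com, bookings@bellanapolipizza.com")
categories = split_multi("Pizza Restaurant, Wine Bar")
print(emails, categories)
```

The second email address in the example is made up for illustration.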

Phones -- Business phone numbers extracted from the business's website. Formatted consistently for easy use in outreach tools.

Websites -- The primary website URL for the business. These are validated monthly -- we check that the URL returns a successful response and isn't a dead link.

Extracted Instagram -- The business's Instagram handle, extracted from their website's social links.

Facebook -- The business's Facebook page URL, extracted from their website's social links.

Location Fields

Address -- The street address of the business location.

City -- The city name.

State -- Two-letter state abbreviation (e.g., "CA", "NY", "TX").

Zipcode -- The 5-digit ZIP code.

Country -- Always "US" for the current dataset.

Lat / Lon -- Precise geographic coordinates from the Overture Maps Foundation. These are accurate enough for map plotting, distance calculations, and geofencing applications.
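
For distance calculations and simple radius checks, the haversine formula is a common choice. The sketch below is an illustration using the standard formula on a spherical Earth, not something shipped with the dataset; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(record, center_lat, center_lon, radius_km):
    """True when the record's coordinates fall inside a circular geofence."""
    return haversine_km(record["lat"], record["lon"], center_lat, center_lon) <= radius_km
```

The spherical approximation is usually adequate for city-scale filtering; survey-grade work would call for an ellipsoidal method instead.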

Market Name -- The metropolitan market or DMA (Designated Market Area) the business falls within (e.g., "New York", "Los Angeles", "Chicago").

County FIPS -- The 5-digit FIPS code identifying the US county the business is located in (e.g., "06037" for Los Angeles County, CA).

Census Tract (`tract_geoid`) -- The 11-digit GEOID identifying the US Census tract the business falls within (e.g., "06037264100"). Census tracts are small geographic areas defined by the Census Bureau, typically containing 1,200–8,000 residents.
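
An 11-digit tract GEOID is built from a 2-digit state FIPS code, a 3-digit county code, and a 6-digit tract code, so its first five digits match the county FIPS. A small sketch of splitting one apart (`split_tract_geoid` is a hypothetical helper):

```python
def split_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID into its component codes.

    Values are kept as strings so leading zeros are preserved.
    """
    if len(geoid) != 11 or not (geoid.isascii() and geoid.isdigit()):
        raise ValueError(f"expected an 11-digit GEOID, got {geoid!r}")
    return {
        "state_fips": geoid[:2],    # e.g. "06" for California
        "county_fips": geoid[:5],   # state + county, same shape as county_fips
        "tract_code": geoid[5:],    # 6-digit tract code within the county
    }


parts = split_tract_geoid("06037264100")
print(parts)
```

This is useful for joining records to Census tables that are keyed by state, county, or tract.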

Metadata Fields

Operating Status -- Whether the business is currently active or has closed. The vast majority of records are active businesses.

Extracted At -- The timestamp of the most recent enrichment cycle that processed this record. Since we re-enrich the entire dataset monthly, most records will show a date within the last 30 days.
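
To check how recent a record is, you can parse `extracted_at` as an ISO 8601 UTC timestamp, the format shown in the sample record. The sketch below is an illustration; the 30-day threshold mirrors the monthly cycle described above, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_extracted_at(value):
    """Parse a timestamp such as '2026-02-15T08:23:41Z' into an aware datetime."""
    # Replace the trailing "Z" so this also works on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_fresh(value, now=None, max_age_days=30):
    """True when the record was processed within the last max_age_days."""
    now = now or datetime.now(timezone.utc)
    return now - parse_extracted_at(value) <= timedelta(days=max_age_days)


reference = datetime(2026, 3, 1, tzinfo=timezone.utc)
print(is_fresh("2026-02-15T08:23:41Z", now=reference))
```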

Data Quality Pipeline

Our data quality is the product of a multi-stage pipeline that runs every month:

Stage 1: Authoritative Source Data

The base layer comes from the Overture Maps Foundation. This provides us with verified business names, addresses, coordinates, categories, and brand information. Overture's data is aggregated from multiple authoritative sources and goes through its own quality validation before we ever see it.

Stage 2: Website Crawling

We visit every business website in the dataset -- over 1.7 million URLs. We check for:

Whether the website is still live and returning a valid response
Updated content that may contain new contact information
Social media links in headers, footers, and contact pages

Stage 3: AI-Powered Extraction

The crawled website content is processed through our AI extraction pipeline. This isn't simple regex matching -- we use large language models to understand the context of the page and extract structured data:

Email addresses are distinguished from login forms or unsubscribe links
Phone numbers are identified as business lines rather than fax numbers or support queues
Social media handles are matched to the correct platform

Stage 4: Validation & Deduplication

Every extracted piece of data goes through validation:

Email addresses are checked for valid format and domain
Phone numbers are normalized to a consistent format
URLs are validated for live responses
Duplicate records are identified and merged
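
To give a feel for what format checks and normalization can look like, here is a simplified sketch. It is not our production validation logic: the regular expression is a deliberately loose format check, the domain check is omitted, and the phone output format is assumed from the sample record. Both function names are hypothetical.

```python
import re

# Loose format check only: local part, "@", and a dotted domain.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def looks_like_email(value):
    """True when the value has a plausible email shape."""
    return bool(EMAIL_RE.match(value.strip()))


def normalize_us_phone(value):
    """Normalize a US phone number to '(NNN) NNN-NNNN', or return None."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


print(looks_like_email("info@bellanapolipizza.com"))
print(normalize_us_phone("+1 212.555.0847"))
```

A format check like this only catches malformed strings; confirming that a domain exists or a number is in service needs separate checks.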

Enrichment Coverage

Not every business has a website, and not every website lists an email address. Here are the approximate field coverage percentages across the full dataset:

Field	Records	Coverage
Name, Address, City, State, Zip	1.7M	100%
Lat, Lon	1.7M	100%
Category, Basic Category	1.7M	100%
Phone	~1.6M	~95%
Website	~1.5M	~85%
Facebook	~1.3M	~75%
Instagram	~525K	~31%
Email	~350K	~20%

Coverage varies by state and category. Urban areas, where more businesses maintain an online presence, tend to have higher enrichment rates. You can check state-specific enrichment statistics by exploring any state in the Data Shop.
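
Once you have a file in hand, you can compute the same kind of coverage figures for your own subset. A minimal sketch, assuming records are dicts of string values in which an empty string or missing key means "not populated" (`field_coverage` is a hypothetical helper):

```python
def field_coverage(records, fields):
    """Return {field: (populated_count, fraction_populated)} for each field."""
    total = len(records)
    coverage = {}
    for field in fields:
        populated = sum(1 for r in records if (r.get(field) or "").strip())
        coverage[field] = (populated, populated / total if total else 0.0)
    return coverage


records = [
    {"websites": "https://bellanapolipizza.com", "extracted_emails": "info@bellanapolipizza.com"},
    {"websites": "https://example.com", "extracted_emails": ""},
    {"websites": "", "extracted_emails": ""},
    {"websites": "https://example.org"},
]
for field, (count, share) in field_coverage(records, ["websites", "extracted_emails"]).items():
    print(f"{field}: {count} records, {share:.0%}")
```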

Monthly Updates

The entire dataset is refreshed monthly. Each update cycle includes:

New Overture release ingestion -- picking up newly opened businesses and address changes
Full website re-crawl -- checking every URL for updated content
AI re-extraction -- processing all crawled content through the latest extraction models
Validation pass -- removing closed businesses, deduplicating, and normalizing

When you purchase a state dataset, you receive the data as of the most recent completed enrichment cycle. The Extracted At field (`extracted_at`) on each record tells you exactly when it was last processed.