Data Methodology
Every record in the GeoData dataset contains 24 structured fields. This page explains what each field holds, how our data quality pipeline works, and what you can expect in terms of completeness and accuracy.
Complete Field Reference
| Field | Type | Example Value |
|---|---|---|
| place_id | id | 08f2a100d8a54a0203e9e4580898d3ea |
| name | text | Bella Napoli Pizzeria |
| brand | text | (empty) |
| basic_category | category | Italian Restaurant |
| category | category | Restaurant |
| alternate_categories | category | Pizza Restaurant, Wine Bar |
| address | text | 287 Grand St |
| city | text | New York |
| state | text | NY |
| zipcode | text | 10002 |
| country | text | US |
| lat | number | 40.718432 |
| lon | number | -73.992158 |
| websites | url | https://bellanapolipizza.com |
| phones | phone | (212) 555-0847 |
| facebook | social | https://facebook.com/bellanapolinyc |
| operating_status | status | Active |
| location_count | number | 1 |
| market_name | text | New York |
| county_fips | text | 36061 |
| tract_geoid | text | 36061002300 |
| extracted_emails | email | info@bellanapolipizza.com |
| extracted_instagram | social | bellanapolinyc |
| extracted_at | timestamp | 2026-02-15T08:23:41Z |
All 24 data fields shown with a sample restaurant record
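When loading records, it helps to keep identifier-like fields such as zipcode, county_fips, and tract_geoid as text so that leading zeros survive (the California examples further down this page start with "06"). The following is a minimal sketch, assuming the data is delivered as CSV with the field names above as column headers. The delivery format, the `parse_record` helper, and the sample CSV excerpt are assumptions for illustration, not a description of the actual export.

```python
import csv
import io

# Fields typed as "number" in the field reference table.
FLOAT_FIELDS = ("lat", "lon")
INT_FIELDS = ("location_count",)

def parse_record(row):
    """Convert numeric fields; leave everything else (IDs, ZIP, FIPS) as text."""
    record = dict(row)
    for field in FLOAT_FIELDS:
        if record.get(field):
            record[field] = float(record[field])
    for field in INT_FIELDS:
        if record.get(field):
            record[field] = int(record[field])
    return record

# Hypothetical CSV excerpt with a subset of the 24 columns.
sample_csv = io.StringIO(
    "name,zipcode,county_fips,lat,lon,location_count\n"
    "Bella Napoli Pizzeria,10002,36061,40.718432,-73.992158,1\n"
)
records = [parse_record(row) for row in csv.DictReader(sample_csv)]
print(records[0]["county_fips"], records[0]["lat"])
```

Reading every column as text first and converting only the known numeric fields avoids a loader guessing that a ZIP or FIPS code is an integer.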
Identification Fields
Place ID -- A unique identifier for each business location. This stays consistent across monthly updates, so you can track the same business over time.
Name -- The official business name as listed in public records and on the business's own website. For franchises and chains, this is the name used by the individual location, which is usually the brand name itself (e.g., "Starbucks" or "McDonald's").
Brand -- The parent brand or franchise name, if applicable. This field is populated for chain businesses and can be used to filter out or group by brand. Independent businesses typically have this field empty.
Category -- The primary top-level category: Restaurant, Cafe, or Bar.
Basic Category -- A more specific classification within the top-level category (e.g., "Pizza Restaurant", "Coffee Shop", "Cocktail Bar").
Alternate Categories -- Additional business classifications. A single business may have multiple categories (e.g., a location might be classified as both "Italian Restaurant" and "Wine Bar"). Comma-separated when multiple.
Location Count -- For chain brands, this indicates the number of locations that brand has in the dataset.
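Because Place ID stays consistent across monthly updates, two monthly deliveries can be compared by keying on it. The sketch below assumes each delivery has been loaded as a list of dictionaries; the `diff_snapshots` helper and the short IDs in the sample data are hypothetical. It ignores extracted_at by default, since that timestamp is expected to change on every enrichment cycle.

```python
def diff_snapshots(previous, current, ignore=("extracted_at",)):
    """Return (added, removed, changed) place_ids between two snapshots."""
    def comparable(record):
        # Drop fields that change every cycle so they do not count as edits.
        return {k: v for k, v in record.items() if k not in ignore}

    prev_by_id = {r["place_id"]: comparable(r) for r in previous}
    curr_by_id = {r["place_id"]: comparable(r) for r in current}
    added = sorted(curr_by_id.keys() - prev_by_id.keys())
    removed = sorted(prev_by_id.keys() - curr_by_id.keys())
    changed = sorted(
        pid for pid in prev_by_id.keys() & curr_by_id.keys()
        if prev_by_id[pid] != curr_by_id[pid]
    )
    return added, removed, changed

# Made-up records; real place_id values are 32-character identifiers.
january = [
    {"place_id": "a1", "name": "Old Diner", "extracted_at": "2026-01-15T00:00:00Z"},
    {"place_id": "b2", "name": "Cafe Uno", "extracted_at": "2026-01-15T00:00:00Z"},
]
february = [
    {"place_id": "b2", "name": "Cafe Uno NYC", "extracted_at": "2026-02-15T00:00:00Z"},
    {"place_id": "c3", "name": "New Bar", "extracted_at": "2026-02-15T00:00:00Z"},
]
added, removed, changed = diff_snapshots(january, february)
print(added, removed, changed)
```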
Contact Information
Extracted Emails -- Email addresses extracted from the business's website. These may include general inquiries, owner contacts, or booking addresses. A single record can have multiple email addresses, separated by commas.
Phones -- Business phone numbers extracted from the business's website, formatted consistently for easy use in outreach tools.
Websites -- The primary website URL for the business. These are validated monthly -- we check that the URL returns a successful response and isn't a dead link.
Extracted Instagram -- The business's Instagram handle, extracted from their website's social links.
Facebook -- The business's Facebook page URL, extracted from their website's social links.
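Multi-valued fields such as extracted_emails (and alternate_categories earlier on this page) are comma-separated, and extracted_instagram holds a bare handle rather than a URL. A small sketch of how these might be unpacked follows; the helper names and the example.com addresses are illustrative assumptions, and the exact representation of an empty field in a delivery is not specified here, so the code treats `None` and the empty string as empty.

```python
def split_multi(value):
    """Split a comma-separated field into a list of trimmed, non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]

def instagram_url(handle):
    """Build a profile URL from a bare handle; returns None when empty."""
    handle = (handle or "").strip().lstrip("@")
    return f"https://www.instagram.com/{handle}/" if handle else None

emails = split_multi("info@example.com, bookings@example.com")
categories = split_multi("Pizza Restaurant, Wine Bar")
profile = instagram_url("bellanapolinyc")
print(emails, categories, profile)
```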
Location Fields
Address -- The street address of the business location.
City -- The city name.
State -- Two-letter state abbreviation (e.g., "CA", "NY", "TX").
Zipcode -- The 5-digit ZIP code.
Country -- Always "US" for the current dataset.
Lat / Lon -- Precise geographic coordinates from the Overture Maps Foundation. These are accurate enough for map plotting, distance calculations, and geofencing applications.
Market Name -- The metropolitan market or DMA (Designated Market Area) the business falls within (e.g., "New York", "Los Angeles", "Chicago").
County FIPS -- The 5-digit FIPS code identifying the US county the business is located in (e.g., "06037" for Los Angeles County, CA).
Census Tract (tract_geoid) -- The 11-digit GEOID identifying the US Census tract the business falls within (e.g., "06037264100"). Census tracts are small geographic areas defined by the Census Bureau, typically containing 1,200–8,000 residents.
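Two common uses of the location fields are distance calculations from Lat / Lon and joining to Census data through the tract GEOID. The sketch below shows one way to do each: a great-circle distance using the haversine formula, and a split of the 11-digit GEOID into its standard parts (2-digit state FIPS, 3-digit county code, 6-digit tract code). The function names are hypothetical, and the haversine formula treats the Earth as a sphere, which is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract parts."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit GEOID, got {geoid!r}")
    return {
        "state_fips": geoid[:2],    # e.g. "06" for California
        "county_fips": geoid[:5],   # state + county, matches the County FIPS field
        "tract_code": geoid[5:],    # 6-digit tract code within the county
    }

parts = split_tract_geoid("06037264100")
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(parts, round(one_degree, 1))
```

Note that the first five digits of the tract GEOID are the County FIPS code, so the two fields on a record should agree.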
Metadata Fields
Operating Status -- Whether the business is currently active or has closed. The vast majority of records are active businesses.
Extracted At -- The timestamp of the most recent enrichment cycle that processed this record. Since we re-enrich the entire dataset monthly, most records will show a date within the last 30 days.
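To check how recently a record was enriched, the Extracted At value can be parsed as a UTC timestamp and compared to a reference time. This is a minimal sketch that assumes every value follows the format of the sample record (2026-02-15T08:23:41Z); the helper names and the 30-day default are illustrative.

```python
from datetime import datetime, timedelta, timezone

def parse_extracted_at(value):
    """Parse a timestamp such as 2026-02-15T08:23:41Z into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def is_fresh(value, now, max_age_days=30):
    """True if the record was processed within max_age_days of `now`."""
    return now - parse_extracted_at(value) <= timedelta(days=max_age_days)

# A fixed reference time keeps the example reproducible.
reference = datetime(2026, 3, 1, tzinfo=timezone.utc)
print(is_fresh("2026-02-15T08:23:41Z", reference))
```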
Data Quality Pipeline
Our data quality is the product of a multi-stage pipeline that runs every month:
Stage 1: Authoritative Source Data
The base layer comes from the Overture Maps Foundation. This provides us with verified business names, addresses, coordinates, categories, and brand information. Overture's data is aggregated from multiple authoritative sources and goes through its own quality validation before we ever see it.
Stage 2: Website Crawling
We visit every business website in the dataset -- over 1.7 million URLs. We check for:
- Whether the website is still live and returning a valid response
- Updated content that may contain new contact information
- Social media links in headers, footers, and contact pages
Stage 3: AI-Powered Extraction
The crawled website content is processed through our AI extraction pipeline. This isn't simple regex matching -- we use large language models to understand the context of the page and extract structured data:
- Email addresses are distinguished from login forms or unsubscribe links
- Phone numbers are identified as business lines rather than fax numbers or support queues
- Social media handles are matched to the correct platform
Stage 4: Validation & Deduplication
Every extracted piece of data goes through validation:
- Email addresses are checked for valid format and domain
- Phone numbers are normalized to a consistent format
- URLs are validated for live responses
- Duplicate records are identified and merged
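The checks above can be illustrated with a small sketch. This is not the pipeline's actual code: the regular expression is a deliberately simple format check, the domain check is omitted because it needs a network lookup, and the phone formatter assumes US numbers in the same style as the sample record. Both helper names are hypothetical.

```python
import re

# Simple shape check: local part, "@", and a dotted domain. Not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def looks_like_email(value):
    """Format-only check; does not verify that the domain exists."""
    return bool(EMAIL_RE.match(value.strip()))

def normalize_us_phone(value):
    """Return a US number as (XXX) XXX-XXXX, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(looks_like_email("info@bellanapolipizza.com"))
print(normalize_us_phone("+1 212-555-0847"))
```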
Enrichment Coverage
Not every business has a website, and not every website lists an email address. Here are the approximate field coverage percentages across the full dataset:
| Field | Records | Coverage |
|---|---|---|
| Name, Address, City, State, Zip | 1.7M | 100% |
| Lat, Lon | 1.7M | 100% |
| Category, Basic Category | 1.7M | 100% |
| Phone | ~1.6M | ~95% |
| Website | ~1.5M | ~85% |
| | ~1.3M | ~75% |
| | ~525K | ~31% |
| | ~350K | ~20% |
Coverage varies by state and category. Urban areas with more digitally-present businesses tend to have higher enrichment rates. You can check state-specific enrichment statistics on each state's detail page in the Data Shop.
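Coverage for any field can be recomputed on a purchased file by counting records with a non-empty value. The sketch below assumes records are loaded as dictionaries with text values; the `field_coverage` helper and the four sample rows are made up for illustration.

```python
def field_coverage(records, field):
    """Share of records (0.0 to 1.0) with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)

# Hypothetical rows: three of four have a website, one has an email.
sample = [
    {"websites": "https://example.com", "extracted_emails": "info@example.com"},
    {"websites": "https://example.org", "extracted_emails": ""},
    {"websites": "https://example.net", "extracted_emails": None},
    {"websites": "", "extracted_emails": ""},
]
website_share = field_coverage(sample, "websites")
email_share = field_coverage(sample, "extracted_emails")
print(f"{website_share:.0%} websites, {email_share:.0%} emails")
```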
Monthly Updates
The entire dataset is refreshed monthly. Each update cycle includes:
- New Overture release ingestion -- picking up newly opened businesses and address changes
- Full website re-crawl -- checking every URL for updated content
- AI re-extraction -- processing all crawled content through the latest extraction models
- Validation pass -- removing closed businesses, deduplicating, and normalizing
When you purchase a state dataset, you receive the data as of the most recent completed enrichment cycle. The Extracted At field (extracted_at) on each record tells you exactly when it was last processed.