Extracting building permits from 100+ US municipal portals
Built a production scraping system that extracts building permit, inspection, and contractor data from 13+ distinct portal types across 100+ US cities and counties — each with different tech stacks, authentication schemes, and data models.
US building permit data lives behind hundreds of independent municipal websites — each running different software (Accela, Click2Gov, Posse, custom systems), with different data models, authentication, anti-bot protections, and page structures. There is no unified API. The client needed a production system that could extract the full historical backlog of permits, inspections, fees, contractors, and property data from all of them — then keep running incrementally to capture new records daily, reliably, at scale, with no data gaps.
I built a modular, configuration-driven scraping framework where each portal type is a self-contained module with its own spiders, item definitions, and settings — but all share common infrastructure for state management, crash recovery, and data assembly. The architecture adapts its discovery strategy to each portal's behavior automatically.
The portal landscape
US building permits aren’t centralized. Every city, county, and state runs its own permitting system — and there are only a handful of commercial platforms that power them. Each platform has its own quirks, and even sites running the same software can look and behave completely differently depending on how the jurisdiction configured it.
Here’s what I had to scrape across 13+ portal types:
Accela Citizen Access — one-to-many (40+ jurisdictions)
The most common government permitting platform in the US. One codebase serves municipalities from Arapahoe County, CO to Solano County, CA — but each agency customizes their fields, page structure, and available data differently.
- Scale: 40+ jurisdictions from a single scraping framework
- Challenge: Even though they all run Accela, every agency renames fields, restructures pages, and enables different data tabs. A rigid scraper breaks the moment you onboard a new city.
- Strategy: Configuration-driven architecture — each jurisdiction is defined as a config profile with its own parsing rules and feature flags. Adding a new city requires minimal code instead of building from scratch each time.
- Data: Permits, inspections, contractors, applicants, property info, fees, licensed professionals
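A jurisdiction profile of this kind could look roughly like the sketch below. The class name, field labels, flags, and URL are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JurisdictionProfile:
    """Hypothetical per-agency config; every name here is illustrative."""
    slug: str
    base_url: str
    # Maps the label an agency shows on its pages to a canonical field name.
    field_aliases: dict[str, str] = field(default_factory=dict)
    # Feature flags: which optional data tabs this agency exposes.
    enabled_tabs: frozenset[str] = frozenset({"inspections", "fees"})


def normalize(profile: JurisdictionProfile, raw: dict[str, str]) -> dict[str, str]:
    """Rename agency-specific labels to canonical names; keep unmapped labels as-is."""
    return {profile.field_aliases.get(label, label): value for label, value in raw.items()}


example = JurisdictionProfile(
    slug="example-county",
    base_url="https://aca.example.gov",  # placeholder URL
    field_aliases={"Record Number": "permit_number", "Job Valuation": "job_value"},
)
row = normalize(example, {"Record Number": "B-001", "Job Valuation": "1500", "Status": "Issued"})
```

With this shape, onboarding another agency means writing a new profile rather than a new parser.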
Click2Gov (40+ jurisdictions)
A permit management system deployed across Florida, California, North Carolina, Texas, and 10+ other states.
- Scale: 40+ municipalities across 12+ states
- Challenge: No bulk search API exists. Permits can only be looked up individually, and a naive approach would take weeks per city.
- Strategy: Designed a smart search algorithm that efficiently discovers all permits in a fraction of the time — turning a weeks-long crawl into an under-an-hour job per jurisdiction.
- Data: 7 distinct record types per permit (including applications, structures, inspections, fees, and plan tracking) with 40+ fields each.
Broward County, FL (Posse system)
A single-jurisdiction portal running proprietary software, with permit records going back to 1965.
- Challenge: Every permit requires extracting data from 6+ separate views (master permit, plan review, individual permits, inspections, fees, documents), each behind a separate form submission.
- Strategy: Built a multi-step extraction pipeline that navigates through all views per permit and assembles them into a single hierarchical record.
- Data: 60+ years of permit history with full document attachments
MaintStar (8 jurisdictions across 4 states)
An API-driven permit management system used by cities in Florida, California, Illinois, and Colorado.
- Scale: 8 active jurisdictions — Plant City, Palm Springs, Waukegan, La Mesa, Novato, Alameda County, City of Orange, Mesa County
- Challenge: The API requires authenticated sessions with token management, and the system detects automated access patterns.
- Strategy: Built transparent authentication that manages sessions across multiple accounts with automatic rotation. The system tracks exactly where it left off between runs, so daily updates only fetch new permits — no redundant work.
- Data: 60+ fields per permit including contacts, fees, inspections, timeline events, and entry forms
Pierce County, WA (PALS Online)
A REST API serving permit data across non-contiguous ID ranges.
- Challenge: Data spread across multiple endpoints per permit. Each permit requires 7 separate API calls to collect all related records (inspections, reviews, status, documents).
- Strategy: Parallel extraction — fetch the permit header first, then fire off concurrent requests for all related data types. Found the optimal QPS for each endpoint to maximize throughput without getting blocked.
- Data: 7 child record types assembled into a single parent record
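The header-first, then-fan-out pattern can be sketched with `asyncio`. The endpoint paths, the `fetch_json` stand-in, and the child type names are assumptions for illustration. The semaphore caps in-flight requests, which is a simpler control than true per-endpoint QPS tuning.

```python
import asyncio

CHILD_TYPES = ["inspections", "reviews", "status", "documents"]  # illustrative subset


async def fetch_json(path: str) -> dict:
    """Stand-in for a real HTTP GET; swap in an actual async client call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"path": path}


async def fetch_permit(permit_id: str, limiter: asyncio.Semaphore) -> dict:
    async def limited(path: str) -> dict:
        async with limiter:  # cap the number of concurrent requests
            return await fetch_json(path)

    record = await limited(f"/permits/{permit_id}")  # permit header first
    children = await asyncio.gather(  # then all child types concurrently
        *(limited(f"/permits/{permit_id}/{kind}") for kind in CHILD_TYPES)
    )
    record["children"] = dict(zip(CHILD_TYPES, children))
    return record


async def main() -> dict:
    return await fetch_permit("12345", asyncio.Semaphore(4))


result = asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why zipping with `CHILD_TYPES` is safe here.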
Massachusetts Professional Licenses
A statewide license database (electricians, accountants, contractors) whose licensees are located across all 50 US states.
- Challenge: The portal chokes on large result sets — querying an entire state at once causes server-side timeouts.
- Strategy: Partitioned searches by geographic region (state-level for smaller states, zip-code prefix splits for dense states like Massachusetts) to keep result sets within the portal’s limits. 76 targeted queries cover all 50 states.
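One way to derive such a partition is to split adaptively: query a whole state, and subdivide by ZIP prefix only when the result count exceeds the portal's limit. This is a sketch under assumptions. The source does not say how the 76 queries were chosen, and `count` is a hypothetical callback that returns the number of matches for a (state, ZIP prefix) pair.

```python
from typing import Callable, Iterator

CountFn = Callable[[str, str], int]  # (state, zip_prefix) -> number of matching licenses


def plan_queries(state: str, count: CountFn, limit: int, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (state, zip_prefix) queries whose result sets fit within `limit`."""
    n = count(state, prefix)
    if n == 0:
        return  # nothing to fetch for this prefix
    if n <= limit or len(prefix) == 5:  # a full 5-digit ZIP cannot be split further
        yield state, prefix
        return
    for digit in "0123456789":
        yield from plan_queries(state, count, limit, prefix + digit)


# Toy data standing in for the portal: licensee ZIP codes per state.
FAKE = {"MA": ["02101"] * 3 + ["02139"] * 2 + ["01002"], "VT": ["05401"]}


def fake_count(state: str, prefix: str) -> int:
    return sum(z.startswith(prefix) for z in FAKE.get(state, []))


queries = [q for s in FAKE for q in plan_queries(s, fake_count, limit=3)]
```

In the toy data, the sparse state needs a single query while the dense one is split into three prefix queries.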
Timmons Group (2 jurisdictions)
A permit platform where each jurisdiction uses a completely different access model — one requires login authentication, the other is fully public.
- Scale: Rogers, AR and Winston-Salem, NC
- Challenge: The same platform behaves like two different products. Rogers requires login for both search and detail pages; Winston-Salem exposes detail pages publicly but has no search API. A single rigid approach can’t handle both.
- Strategy: Built a pluggable authentication layer — the same spider adapts its discovery approach based on what each deployment exposes. Session expiration is detected automatically and re-authentication happens transparently.
- Data: Full permit lifecycle — inspections, contractors, fees, payments, documents, contacts, and conditions
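A pluggable authentication layer with transparent re-login might be structured as below. All class and function names are hypothetical, and the login itself is stubbed out. A real version would submit credentials and detect expiry from the portal's responses.

```python
from typing import Callable, Optional, Protocol


class AuthStrategy(Protocol):
    def login(self) -> Optional[str]: ...


class NoAuth:
    def login(self) -> Optional[str]:
        return None  # public deployment: no session token needed


class FormLogin:
    def __init__(self, username: str, password: str) -> None:
        self.username, self.password = username, password
        self.logins = 0

    def login(self) -> Optional[str]:
        self.logins += 1  # a real version would POST the credentials here
        return f"token-{self.logins}"


class SessionExpired(Exception):
    """Raised by the fetch callable when the portal rejects the session."""


def fetch_with_reauth(auth: AuthStrategy, fetch: Callable[[Optional[str]], str],
                      max_attempts: int = 2) -> str:
    """Call fetch(token); on session expiry, log in again and retry."""
    token = auth.login()
    for _ in range(max_attempts):
        try:
            return fetch(token)
        except SessionExpired:
            token = auth.login()  # refresh the session, then retry
    raise SessionExpired(f"still expired after {max_attempts} attempts")


def demo_fetch(token: Optional[str]) -> str:
    if token == "token-1":  # pretend the first session has already expired
        raise SessionExpired
    return f"ok:{token}"


private = FormLogin("user", "secret")
private_result = fetch_with_reauth(private, demo_fetch)
public_result = fetch_with_reauth(NoAuth(), demo_fetch)
```

The spider only sees `fetch_with_reauth`, so the same crawl logic works whether a deployment is public or login-gated.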
MyGovernmentOnline / MGConnect (415 jurisdictions across 28 states)
One of the largest portal types by jurisdiction count — 415 cities and counties across 28 US states, concentrated in Louisiana, Illinois, and Texas. The public-facing MyGovernmentOnline site has disabled permit searching, so discovery and detail extraction run through the MGConnect API.
- Scale: The single largest portal integration in the system by number of jurisdictions
- Challenge: With permit search disabled on the public site, the standard approach of searching the portal itself doesn’t work. All discovery and extraction had to be built against the MGConnect API, with authenticated access and session management across hundreds of jurisdictions.
- Strategy: Built a unified approach that covers all 415 jurisdictions from one framework, with parallel data fetching per permit (details, contacts, addresses, inspections, documents) and smart resume logic for interrupted runs.
And more portals
Jacksonville (JAXEPICS), Logis, Miami-Dade, Munis, and Massgov — each with its own scraping strategy, data model, and deployment patterns.
Architecture
The system is organized as one module per portal type (a portal-per-package layout). Each module owns its own search strategy, data parsing, record definitions, and settings, while all modules share common infrastructure for state management, crash recovery, and record assembly.
What’s shared across all portals
- State management — persistent checkpoints for crash recovery
- Record assembly — a pipeline that links child records (inspections, fees, documents) to their parent permits
- Base behavior — logging, error handling, retry logic, QPS management
- Data validation — consistency checks, placeholder detection, schema enforcement
What’s unique per portal
- Discovery strategy — how permits are found (date-range search, binary search, sequential enumeration, API calls)
- Data extraction — how records are parsed from the source (HTML, JSON, form submissions)
- Record structure — which fields exist and how they relate to each other
- Tuning — concurrency, request pacing, proxy strategy
This separation means a bug fix in Accela cannot break Click2Gov. A new portal is a new module — no changes to shared infrastructure. And onboarding a new jurisdiction within an existing portal type is often just a configuration change.
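The shared-versus-unique split maps naturally onto a base class plus self-registering subclasses. The sketch below is an assumption about how such a layout could look. `PortalModule`, `REGISTRY`, and `DemoPortal` are invented names, not the project's code.

```python
from abc import ABC, abstractmethod
from typing import Iterator

REGISTRY: dict[str, type] = {}  # portal name -> module class


class PortalModule(ABC):
    """Hypothetical base class: shared behavior lives here, portal logic in subclasses."""
    name: str

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.name] = cls  # each new module registers itself

    @abstractmethod
    def discover(self) -> Iterator[str]:
        """Yield permit identifiers using this portal's discovery strategy."""

    @abstractmethod
    def extract(self, permit_id: str) -> dict:
        """Return one parsed permit record."""

    def run(self) -> list[dict]:
        # Shared driver; a real one would add retries, pacing, and checkpoints.
        return [self.extract(pid) for pid in self.discover()]


class DemoPortal(PortalModule):
    name = "demo"

    def discover(self) -> Iterator[str]:
        yield from ["P-1", "P-2"]

    def extract(self, permit_id: str) -> dict:
        return {"permit_number": permit_id, "source": self.name}


records = REGISTRY["demo"]().run()
```

Adding a portal here means adding one subclass. Nothing in the base class or the registry has to change.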
Key innovations
Smart permit discovery
Some portals have no search API — permits can only be looked up individually. A brute-force approach would take weeks per city. I designed efficient search algorithms tailored to each portal’s behavior, turning multi-week crawls into jobs that finish in under an hour per jurisdiction.
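The source does not spell out the algorithms, so the following is only one plausible building block: finding the highest valid permit number with a galloping probe followed by binary search. It assumes numbers are issued sequentially, so a hypothetical `exists(n)` lookup is true up to some maximum and false afterwards. Portals with gaps in their numbering would need extra tolerance logic.

```python
from typing import Callable


def find_last_id(exists: Callable[[int], bool], start: int = 1) -> int:
    """Return the largest n >= start with exists(n) True, or start - 1 if none.

    Assumes exists() is True for a contiguous run beginning at `start`
    and False for everything after it.
    """
    if not exists(start):
        return start - 1
    lo, step = start, 1
    # Gallop: double the step until a probe overshoots the last valid ID.
    while exists(lo + step):
        lo += step
        step *= 2
    hi = lo + step  # known to be invalid
    # Binary search between the last known-valid and first known-invalid ID.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    return lo


probes: list[int] = []


def demo_exists(n: int) -> bool:
    probes.append(n)  # count how many lookups the search needs
    return n <= 100_000


last_id = find_last_id(demo_exists)
```

The number of lookups grows logarithmically with the size of the ID space, instead of one lookup per candidate ID.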
Configuration-driven multi-tenancy
Accela and Click2Gov each power dozens of jurisdictions, but every city customizes their deployment differently — different field names, different page layouts, different available data.
Rather than building a separate scraper per city, I built a configuration-driven system where each jurisdiction is a config profile. Adding a new city takes minutes, not days, and often requires no code changes.
Intelligent address parsing
US addresses are deceptively hard to parse. In “123 Main CT”, is “CT” the street suffix for Court or the state code for Connecticut? I built a validation layer that resolves these ambiguities accurately, without relying on heavyweight NLP tools.
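A lightweight rule for this kind of ambiguity is to use context: treat a trailing token such as CT (Court or Connecticut) as a state only when a ZIP code follows it or a street suffix has already appeared. The sketch below is a simplified heuristic with abbreviated lookup sets, offered as an assumption about the general approach rather than the actual validation layer.

```python
import re
from typing import Optional

STATE_CODES = {"CT", "FL", "MA", "RI"}  # abbreviated list for the example
STREET_SUFFIXES = {"CT", "RD", "ST", "AVE", "DR"}  # likewise abbreviated
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def split_state(address: str) -> tuple[str, Optional[str]]:
    """Return (rest_of_address, state), accepting a state only when context supports it."""
    tokens = address.replace(",", " ").upper().split()
    had_zip = bool(tokens) and bool(ZIP_RE.match(tokens[-1]))
    if had_zip:
        tokens = tokens[:-1]
    if len(tokens) < 2 or tokens[-1] not in STATE_CODES:
        return " ".join(tokens), None
    last = tokens[-1]
    earlier_suffix = any(t in STREET_SUFFIXES for t in tokens[1:-1])
    # A token valid as both suffix and state needs a ZIP or an earlier suffix.
    if last in STREET_SUFFIXES and not (had_zip or earlier_suffix):
        return " ".join(tokens), None
    return " ".join(tokens[:-1]), last


ambiguous = split_state("123 Main CT")
resolved = split_state("123 Main St, Hartford CT 06103")
plain = split_state("45 Oak Ave Boston MA")
```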
Zero data gaps on crash recovery
Long-running jobs crash — it’s inevitable. The system saves progress continuously, so when a job restarts it picks up exactly where it left off. It also handles the edge case where new permits are filed during downtime, ensuring nothing falls through the cracks between the crash and the recovery.
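One common way to close the gap between a crash and its recovery is to rewind the resume point slightly and de-duplicate on the way back in. The sketch below assumes date-based checkpoints and a `permit_number` key. Both are illustrative choices, not details confirmed by the source.

```python
from datetime import date, timedelta


def resume_window(last_checkpoint: date, today: date, overlap_days: int = 1) -> tuple[date, date]:
    """Date range to re-query after a restart, rewound so nothing filed around the crash is missed."""
    return last_checkpoint - timedelta(days=overlap_days), today


def new_records(fetched: list[dict], seen_ids: set[str]) -> list[dict]:
    """Drop records already stored, so the overlap does not create duplicates."""
    fresh = [r for r in fetched if r["permit_number"] not in seen_ids]
    seen_ids.update(r["permit_number"] for r in fresh)
    return fresh


window = resume_window(date(2024, 3, 10), date(2024, 3, 14))
seen = {"A"}
fresh = new_records([{"permit_number": "A"}, {"permit_number": "B"}], seen)
```

The overlap trades a small amount of repeated fetching for the guarantee that records filed near the checkpoint boundary are picked up.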
Data model
Every permit produces a rich, hierarchical data structure. A single permit can generate 6+ linked records — the permit itself, plus its inspections, fees, plan reviews, contractor details, and attached documents. The system assembles all of these into a single, complete permit profile ready for the client’s database.
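The assembly step can be sketched as a grouping pass that attaches child records to their parent by permit number. The key names (`permit_number`, `kind`, `data`) are assumptions made for the example.

```python
from collections import defaultdict


def assemble(parents: list[dict], children: list[dict]) -> list[dict]:
    """Attach each child record to its parent permit, grouped by record type."""
    grouped: dict[str, dict[str, list[dict]]] = defaultdict(lambda: defaultdict(list))
    for child in children:
        grouped[child["permit_number"]][child["kind"]].append(child["data"])
    # Children whose parent is absent are simply not emitted here.
    return [{**parent, **grouped[parent["permit_number"]]} for parent in parents]


profiles = assemble(
    parents=[{"permit_number": "B-1", "status": "Issued"}],
    children=[
        {"permit_number": "B-1", "kind": "inspections", "data": {"result": "Pass"}},
        {"permit_number": "B-1", "kind": "fees", "data": {"amount": "50.00"}},
        {"permit_number": "B-9", "kind": "fees", "data": {}},  # orphan, ignored
    ],
)
```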
Across all portal types, common fields include:
| Category | Fields |
|---|---|
| Permit | Number, type, status, issue date, expiration, description, job value |
| Property | Street address, city, state, zip, APN/parcel number |
| Owner | Name, mailing address, phone |
| Contractor | Name, license number, business name, contact info |
| Inspections | Type, scheduled date, result, inspector, comments |
| Fees | Description, amount, date, payment status |
Portal-specific fields push the count to 100+ per permit for richer sources like Jacksonville and Pierce County.
Resilience
Crash recovery
Jobs crash — network failures, server restarts, memory issues. The system saves progress continuously, so every restart picks up exactly where it left off. No duplicate work, no missed records, no manual intervention.
Multiple portals run concurrently without interfering with each other’s state.
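Keeping one checkpoint file per portal and writing it atomically is a simple way to get both properties. This is a minimal sketch assuming JSON files on local disk. The function names and file layout are illustrative.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state_dir: Path, portal: str, state: dict) -> Path:
    """Write one portal's state atomically so a crash never leaves a half-written file."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{portal}.json"
    fd, tmp = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes reach the disk first
    os.replace(tmp, target)  # atomic rename within the same filesystem
    return target


def load_checkpoint(state_dir: Path, portal: str) -> dict:
    """Return the saved state for a portal, or an empty dict on first run."""
    path = state_dir / f"{portal}.json"
    return json.loads(path.read_text()) if path.exists() else {}


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    save_checkpoint(root, "accela", {"last_id": 41})
    save_checkpoint(root, "click2gov", {"last_id": 7})
    loaded = (load_checkpoint(root, "accela"), load_checkpoint(root, "click2gov"),
              load_checkpoint(root, "missing"))
```

Because each portal writes only its own file, concurrent jobs never contend for the same state.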
Authentication and session handling
Each portal type has its own authentication requirements — CSRF tokens, session cookies, multi-step form workflows, or API keys. The system manages all of these transparently, including gracefully skipping permits that are behind access restrictions rather than crashing the entire job.
I also reverse-engineered the optimal request pacing for each portal — fast enough for production throughput, respectful enough to never trigger blocks.
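Once a safe rate for a portal is known, enforcing it can be as simple as a minimum-interval throttle. The `Pacer` class below is a hypothetical sketch. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable


class Pacer:
    """Enforce a minimum interval between requests (1 / qps seconds)."""

    def __init__(self, qps: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 1.0 / qps
        self._clock, self._sleep = clock, sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


# Demo with a fake clock: three back-to-back requests at 2 QPS.
fake_now = [100.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


pacer = Pacer(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [pacer.wait() for _ in range(3)]
```

Each portal module would hold its own `Pacer` instance, so one source's rate limit never slows another.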
Quality validation
Data goes through multiple validation layers before delivery:
- Empty and placeholder values are filtered out automatically
- Cross-field consistency checks catch malformed records
- Typed schemas enforce structure at the extraction layer
- Comprehensive test suites cover 98% of critical extraction logic
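The first two layers can be illustrated with a short sketch. The placeholder list and the specific checks are assumptions chosen for the example. The field names follow the common-fields table.

```python
from datetime import date
from typing import Optional

PLACEHOLDERS = {"", "n/a", "na", "none", "null", "-", "--", "tbd"}  # illustrative list


def clean(record: dict) -> dict:
    """Drop None values and empty or placeholder strings."""
    return {
        k: v for k, v in record.items()
        if not (v is None or (isinstance(v, str) and v.strip().lower() in PLACEHOLDERS))
    }


def consistency_errors(record: dict) -> list[str]:
    """Cross-field checks; returns a list of human-readable problems."""
    errors = []
    issued: Optional[date] = record.get("issue_date")
    expires: Optional[date] = record.get("expiration")
    if issued and expires and expires < issued:
        errors.append("expiration precedes issue_date")
    if "permit_number" not in record:
        errors.append("missing permit_number")
    return errors


cleaned = clean({
    "permit_number": "B-1",
    "owner": "N/A",
    "description": "  ",
    "issue_date": date(2024, 5, 1),
    "expiration": date(2024, 1, 1),
})
problems = consistency_errors(cleaned)
```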
Results
The system runs in production daily, powering the client’s construction analytics platform:
- 13+ portal types handled from a single unified codebase
- 100+ US jurisdictions covered across 15+ states
- Millions of permit records extracted and kept up to date
- Rich data — 40 to 120 fields per permit depending on the source
- Zero data gaps across months of continuous operation
- Fast onboarding — new jurisdictions onboarded with minimal code instead of building from scratch each time
- Fully tested — 98% coverage on critical extraction logic
The architecture scales horizontally. Each portal runs independently, so adding capacity or debugging one source never affects the others. The client gets clean, structured permit data delivered on schedule — without needing to understand what’s happening under the hood.