
Healthcare Data Pipeline

A production ETL pipeline at Sift Healthcare processing claims data through ingestion, transformation, validation, and loading into an analytics warehouse.

Production ETL system
Python · SQL · Airflow · AWS · PostgreSQL · pandas

Environment: Production
Orchestration: Airflow
Infrastructure: AWS
Database: PostgreSQL

The Problem

Healthcare claims data is messy in ways that other industries rarely encounter. Different payers submit data in different formats. Field names vary across systems. Codes change between ICD-10 revisions. A single patient can appear under multiple identifiers across different providers. And the stakes are high — incorrect data transformations can produce wrong analytics outputs that affect actual healthcare decisions.

At Sift Healthcare, we built and maintained the ETL pipeline that takes raw claims data from multiple sources, transforms it into a consistent schema, validates it against business rules, and loads it into the analytics warehouse that downstream teams depend on.

The Approach

The pipeline follows a classic Extract-Transform-Load pattern, but the complexity lives in the details. Ingestion handles multiple file formats and delivery mechanisms — SFTP drops, API pulls, and batch file uploads. Each source has its own parser because healthcare data standardization is more aspirational than real.
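One common way to structure per-source parsing is a registry that dispatches on source name. This is a minimal sketch of that pattern — the source names and field layouts below are invented for illustration, not the real feeds:

```python
# Hypothetical sketch: per-source parsers dispatched through a registry.
# Source names and field layouts are illustrative, not the real feeds.
from typing import Callable

PARSERS: dict[str, Callable[[str], dict]] = {}

def parser(source: str):
    """Register a parse function for one claims source."""
    def register(fn):
        PARSERS[source] = fn
        return fn
    return register

@parser("payer_a")
def parse_payer_a(line: str) -> dict:
    # Payer A delivers pipe-delimited rows: claim_id|member_id|amount
    claim_id, member_id, amount = line.split("|")
    return {"claim_id": claim_id, "member_id": member_id, "amount": float(amount)}

@parser("payer_b")
def parse_payer_b(line: str) -> dict:
    # Payer B uses commas and a different field order: member,claim,amount
    member_id, claim_id, amount = line.split(",")
    return {"claim_id": claim_id, "member_id": member_id, "amount": float(amount)}

def ingest(source: str, line: str) -> dict:
    return PARSERS[source](line)
```

Adding a new payer means registering one more function; the rest of the pipeline never sees format differences.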

The transformation layer is where most of the engineering effort sits. Claims records go through normalization (standardizing codes, dates, and identifiers), deduplication (the same claim can arrive through multiple channels), enrichment (joining against reference data for provider and facility information), and validation (checking that required fields exist, codes are valid, and relationships between fields make sense).
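The four stages compose in a fixed order — normalize first so deduplication compares canonical values, enrich before validation so validation can check the joined fields. A simplified sketch over dicts, with invented field names and reference data:

```python
# Illustrative sketch of the four transformation stages over claim dicts.
# Field names and the reference table are invented for the example.

REFERENCE_PROVIDERS = {"P100": "General Hospital"}  # hypothetical lookup

def normalize(claims):
    # Standardize codes and identifiers (here: strip whitespace, upper-case).
    return [{**c, "icd_code": c["icd_code"].strip().upper()} for c in claims]

def dedupe(claims):
    # The same claim can arrive through multiple channels; keep the first.
    seen, out = set(), []
    for c in claims:
        if c["claim_id"] not in seen:
            seen.add(c["claim_id"])
            out.append(c)
    return out

def enrich(claims):
    # Join against reference data for provider information.
    return [{**c, "provider_name": REFERENCE_PROVIDERS.get(c["provider_id"])}
            for c in claims]

def validate(claims):
    # Required-field check; invalid records are flagged, never dropped.
    for c in claims:
        c["valid"] = bool(c["icd_code"]) and c["provider_name"] is not None
    return claims

def transform(claims):
    return validate(enrich(dedupe(normalize(claims))))
```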

Healthcare data has a unique problem: you cannot just drop invalid records. A claim with a suspicious-looking code might be a data entry error, or it might be a legitimate edge case in a specialty you have never seen before. The pipeline flags anomalies for review rather than silently discarding them — because in healthcare, the weird data is often the most important data.
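In code, that policy is a triage split rather than a filter: every record lands in either the clean path or a review queue, and nothing is discarded. A sketch with a placeholder code set (the real rules come from clinical reference data):

```python
# Sketch of flag-for-review rather than drop. The code set below is a
# placeholder; real rules come from clinical/business reference data.

KNOWN_CODES = {"E11.9", "Z00.0"}  # hypothetical known-good code set

def triage(records):
    clean, review = [], []
    for r in records:
        if r["icd_code"] in KNOWN_CODES:
            clean.append(r)
        else:
            # A suspicious code could be an entry error or a legitimate
            # edge case -- route it to a review queue, never discard it.
            review.append({**r, "reason": f"unknown code {r['icd_code']}"})
    return clean, review
```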

Technical Decisions

Airflow for orchestration over a cron-based approach. The pipeline has dependencies that cron cannot express cleanly — validation cannot start until all source ingestions complete, and loading cannot start until validation passes. Airflow DAGs make these dependencies explicit, provide retry logic with exponential backoff for transient failures, and give visibility into what ran, when, and whether it succeeded. When something fails at 3 AM, the Airflow UI tells us exactly which task in which DAG failed and why.
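Airflow expresses all of this declaratively in a DAG; the dependency logic itself can be sketched in plain Python to show why cron cannot model it (task names here are illustrative):

```python
# Minimal sketch of the dependency logic an Airflow DAG expresses:
# validation waits for all ingestions, loading waits for validation.
# Task names are illustrative.
from graphlib import TopologicalSorter

DAG = {
    "ingest_payer_a": set(),
    "ingest_payer_b": set(),
    "validate": {"ingest_payer_a", "ingest_payer_b"},
    "load_warehouse": {"validate"},
}

order = list(TopologicalSorter(DAG).static_order())

def retry_delay(attempt: int, base: int = 60) -> int:
    # Exponential backoff for transient failures: 60s, 120s, 240s, ...
    return base * 2 ** attempt
```

Cron can schedule each step at a fixed time, but it cannot express "run when all upstreams succeed" — which is exactly what the DAG encodes.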

PostgreSQL for the warehouse, not a dedicated OLAP engine. For the current data volume, PostgreSQL with proper indexing, partitioning, and materialized views handles the analytical query patterns well. The decision was pragmatic — the team already knows PostgreSQL, the operational overhead is lower, and migrating to something like Redshift or BigQuery is straightforward if we outgrow it. Over-engineering the storage layer before you have a scaling problem is a common trap.

Pandas for transformation with SQL for bulk operations. Complex row-level transformations (code mapping, date normalization, deduplication logic) run in pandas where the logic is easier to express and test. Bulk operations (inserts, updates, aggregations) use raw SQL for performance. The hybrid approach means we are not forcing either tool into a use case it handles poorly.
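The row-level side of that split looks like this — a sketch of date normalization and deduplication in pandas, with illustrative column names (bulk inserts and aggregations would go through SQL instead):

```python
# Sketch of a row-level transformation in pandas: normalize mixed date
# formats, then drop duplicate claims. Column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "claim_id": ["C1", "C1", "C2"],
    "service_date": ["2024-01-05", "01/05/2024", "2024-02-10"],
})

# Sources deliver dates in different formats; parse each value
# individually into one canonical datetime column.
df["service_date"] = df["service_date"].apply(pd.to_datetime)

# The same claim arrived through two channels; keep one copy.
df = df.drop_duplicates(subset=["claim_id", "service_date"])
```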

Validation as a first-class pipeline stage, not an afterthought. Every record passes through validation rules before it touches the warehouse. Rules check field-level constraints (valid date ranges, known code sets, required fields), cross-field consistency (procedure codes matching diagnosis codes, provider IDs existing in reference tables), and statistical anomaly detection (sudden volume spikes or drops that might indicate a data feed problem). Failed validations produce structured error reports, not silent drops.
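The three rule tiers can be sketched as functions that return structured errors rather than raising or dropping — the specific rules and thresholds below are placeholders:

```python
# Sketch of validation producing structured error reports instead of
# silent drops. Rules, code sets, and thresholds are placeholders.
import datetime

VALID_CODES = {"E11.9", "Z00.0"}  # hypothetical known code set

def validate_record(rec):
    errors = []
    # Field-level: required fields and known code sets.
    if not rec.get("claim_id"):
        errors.append({"field": "claim_id", "rule": "required"})
    if rec.get("icd_code") not in VALID_CODES:
        errors.append({"field": "icd_code", "rule": "known_code_set",
                       "value": rec.get("icd_code")})
    # Cross-field consistency: a service date cannot be in the future.
    svc = rec.get("service_date")
    if svc is not None and svc > datetime.date.today():
        errors.append({"field": "service_date", "rule": "not_in_future",
                       "value": str(svc)})
    return errors

def volume_anomaly(today_count, trailing_avg, threshold=0.5):
    # Statistical check: flag when today's volume deviates more than
    # 50% from the trailing average -- a possible data feed problem.
    return abs(today_count - trailing_avg) / trailing_avg > threshold
```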

What We Learned

The biggest lesson was that data pipeline reliability is harder than data pipeline correctness. Getting the transformations right is straightforward engineering. Keeping the pipeline running reliably when sources change their formats without warning, files arrive late or corrupted, and upstream systems have unannounced maintenance — that is the real job. Idempotent operations, comprehensive logging, and alerting on anomalies matter more than clever transformation logic.
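Idempotency in practice usually means keyed upserts: re-running a day's load after a partial failure converges to the same end state instead of duplicating rows. A toy sketch, with an in-memory dict standing in for a warehouse table with a primary key:

```python
# Sketch of an idempotent load via keyed upserts. The dict stands in
# for a warehouse table with claim_id as its primary key.

warehouse = {}

def load(records):
    for r in records:
        warehouse[r["claim_id"]] = r  # upsert keyed on claim_id

batch = [{"claim_id": "C1", "amount": 120.5},
         {"claim_id": "C2", "amount": 80.0}]
load(batch)
load(batch)  # re-run after a failure: same end state, no duplicates
```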

The second lesson was about testing data pipelines. Unit tests for individual transformations are necessary but insufficient. The failures that actually hurt are integration-level: a new code that does not exist in the reference table, a date format that changed in one source but not others, a field that used to be required becoming optional. Testing with realistic data samples from each source, refreshed regularly, catches the problems that unit tests miss.
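The shape of such an integration-level test: feed a realistic sample record whose code is absent from the reference table and assert that the pipeline flags it rather than crashing. Everything here — the sample, the reference set, the function names — is illustrative:

```python
# Sketch of an integration-level check against a realistic sample record.
# The sample record and reference table are illustrative.

REFERENCE_CODES = {"E11.9"}  # hypothetical reference table

def classify(record):
    # A code missing from the reference table must be flagged, not raise.
    if record["icd_code"] not in REFERENCE_CODES:
        return "needs_review"
    return "ok"

def test_new_code_is_flagged_not_fatal():
    sample = {"claim_id": "C9", "icd_code": "U07.1"}  # absent from refs
    assert classify(sample) == "needs_review"
```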

This is production software that runs daily and feeds real analytics. The code is proprietary — we cannot share it — but the engineering patterns are the same ones we apply to every pipeline we build.


Need a production ETL pipeline? Book a free call to discuss your data infrastructure.
