
Data engineering — moving data at scale

Data engineers build reliable paths from raw sources to analytics and ML consumers: ingestion, transformation, storage, and delivery—with clear contracts (schemas), observability, and cost control.

01 ETL vs ELT

ETL (extract → transform → load) applies business rules before the data lands in the warehouse, which works well when the target schema is fixed. ELT loads raw or lightly cleaned data first, then transforms it inside the warehouse with SQL or Spark; this takes advantage of the scalable compute in modern warehouses and keeps raw history available for reprocessing.

Figure — two pipeline shapes

ELT keeps raw data close to the warehouse so you can replay transformations when definitions change—common in analytics engineering and feature pipelines.
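A minimal sketch of the ELT shape, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table names (raw_orders, daily_revenue) and the sample rows are hypothetical; the point is only that raw data lands untouched and the transform is a SQL step that can be rerun.

```python
import sqlite3

# Hypothetical raw order events; amounts arrive as strings, as they might from a CSV export.
RAW_ORDERS = [
    ("o-1", "2024-01-01", "19.90"),
    ("o-2", "2024-01-01", "5.00"),
    ("o-3", "2024-01-02", "12.50"),
]


def load_raw(conn: sqlite3.Connection) -> None:
    """Extract + load: land the data as-is in a raw table, with no business rules applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, order_date TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)


def transform_daily_revenue(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Transform: rebuild the derived table from raw data with SQL inside the database."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_orders
        GROUP BY order_date
        """
    )
    return conn.execute(
        "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse
load_raw(conn)
daily_revenue = transform_daily_revenue(conn)
```

Because raw_orders is never modified, changing the revenue definition only means editing the SQL and calling transform_daily_revenue again; no re-extraction from the source is needed.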

02 Medallion architecture

A layered lakehouse pattern: Bronze = immutable raw ingest (append-only), Silver = cleaned, conformed, deduplicated entities, Gold = business aggregates and curated marts for BI/ML. Each layer reduces ambiguity and cost for downstream users.

Figure — bronze → silver → gold

Not every org names layers this way, but the progressive refinement idea is universal: isolate ingestion bugs in bronze before they pollute gold KPIs.
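The progressive refinement can be sketched in plain Python. The event fields (event_id, user, amount) and the cleaning rules are illustrative assumptions, not a prescribed schema; a real lakehouse would use tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw events kept exactly as received, including a duplicate and a malformed row.
bronze = [
    {"event_id": "e1", "user": " Alice ", "amount": "10.0"},
    {"event_id": "e1", "user": " Alice ", "amount": "10.0"},  # duplicate delivery
    {"event_id": "e2", "user": "bob", "amount": "4.5"},
    {"event_id": "e3", "user": "bob", "amount": "n/a"},  # malformed amount
]


def to_silver(rows):
    """Clean, conform, and deduplicate on event_id; skip rows that fail parsing."""
    seen = set()
    silver_rows = []
    for row in rows:
        if row["event_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would likely route this to a quarantine table
        seen.add(row["event_id"])
        silver_rows.append(
            {
                "event_id": row["event_id"],
                "user": row["user"].strip().lower(),
                "amount": amount,
            }
        )
    return silver_rows


def to_gold(rows):
    """Aggregate cleaned events into a business-level total per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is left untouched, so if the deduplication or parsing rule turns out to be wrong, silver and gold can be rebuilt from the same raw input.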

03 Batch vs streaming

Batch jobs (hourly, daily) maximize throughput and simplify correctness: each run owns a partition that can be recomputed and overwritten idempotently. Streaming (event-by-event or micro-batch) minimizes latency for alerts and near-real-time features, at the cost of operational complexity (watermarks, late data, exactly-once semantics).
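One way to get idempotent partitions is to have each run overwrite its own date partition rather than append to it. The sketch below assumes a local directory layout of dt=YYYY-MM-DD and a single JSON file per partition; both are illustrative choices, not a required format.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition(base_dir: Path, run_date: str, rows: list[dict]) -> Path:
    """Overwrite the partition for run_date so that reruns do not duplicate rows."""
    partition_dir = base_dir / f"dt={run_date}"
    partition_dir.mkdir(parents=True, exist_ok=True)
    target = partition_dir / "part-0000.json"
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=str(partition_dir), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(rows, handle)
    os.replace(tmp_name, target)  # replaces any previous output for this date
    return target


def read_partition(base_dir: Path, run_date: str) -> list[dict]:
    with open(base_dir / f"dt={run_date}" / "part-0000.json") as handle:
        return json.load(handle)


base = Path(tempfile.mkdtemp())
rows = [{"order_id": "o-1", "amount": 19.9}]
write_partition(base, "2024-01-01", rows)
write_partition(base, "2024-01-01", rows)  # rerun, e.g. a retry or a backfill
stored = read_partition(base, "2024-01-01")
```

Running the job twice for the same date leaves one file with one copy of the rows, which is what makes retries and backfills safe.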

Figure — latency vs simplicity (conceptual)

Hybrid is common: stream into a real-time store for online features, batch into the warehouse for training sets and reporting.
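To make the watermark and late-data cost on the streaming side concrete, here is a simplified sketch of tumbling-window counting in plain Python. The window size, allowed lateness, and event tuples are assumptions chosen for illustration; engines such as Flink or Spark Structured Streaming provide their own watermark mechanisms, and this is not their API.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window size (assumed)
ALLOWED_LATENESS = 30  # the watermark trails the max event time by this much (assumed)


def window_start(event_time: int) -> int:
    return event_time - event_time % WINDOW_SECONDS


def process(events):
    """Count events per window and emit a window once the watermark passes its end.

    events: iterable of (event_time_seconds, key) in arrival order.
    Returns (emitted, dropped_late); emitted holds (window_start, count) pairs.
    Windows still open when the input ends are not emitted.
    """
    open_windows = defaultdict(int)
    emitted, dropped_late = [], []
    max_event_time = float("-inf")
    watermark = float("-inf")
    for event_time, key in events:
        start = window_start(event_time)
        if start + WINDOW_SECONDS <= watermark:
            dropped_late.append((event_time, key))  # its window was already finalized
            continue
        open_windows[start] += 1
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        closable = sorted(w for w in open_windows if w + WINDOW_SECONDS <= watermark)
        for w in closable:
            emitted.append((w, open_windows.pop(w)))
    return emitted, dropped_late


# Arrival order differs from event-time order: 40 arrives after 70, and 20 after 95.
events = [(5, "a"), (50, "b"), (70, "c"), (40, "d"), (95, "e"), (20, "f")]
emitted, dropped_late = process(events)
```

The event at time 40 arrives out of order but before the watermark passes the end of its window, so it is counted. The event at time 20 arrives after window [0, 60) has been emitted, so it is dropped. A larger allowed lateness would accept it at the price of later results.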

Airflow / Dagster / dbt · Kafka / Flink / Spark Streaming · Data contracts · Observability (data quality SLAs)

04 Mental model

Good data engineering makes ML and analytics possible: without reliable grain, timestamps, and joins, models learn noise. Invest in lineage (where did this column come from?), tests (unique keys, not-null), and cost-aware storage (partitioning, clustering, lifecycle policies).
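The two tests named above (unique keys, not-null) can be sketched as small Python checks. The function names and the sample orders are hypothetical; in practice these checks are often declared in a tool such as dbt rather than hand-written.

```python
from collections import Counter


def check_unique(rows, key):
    """Return the key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def check_not_null(rows, column):
    """Return the indexes of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


orders = [
    {"order_id": "o-1", "customer_id": "c-9", "created_at": "2024-01-01T10:00:00Z"},
    {"order_id": "o-2", "customer_id": None, "created_at": "2024-01-01T11:00:00Z"},
    {"order_id": "o-2", "customer_id": "c-4", "created_at": "2024-01-01T11:05:00Z"},
]

# An empty list for a check means it passed.
failures = {
    "unique(order_id)": check_unique(orders, "order_id"),
    "not_null(customer_id)": check_not_null(orders, "customer_id"),
}
```

Running checks like these before publishing a table catches a broken grain (two rows for o-2) or a join key that would silently drop rows (a null customer_id) before a model or dashboard consumes them.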

Common pitfall

Training-serving skew: features computed differently offline and online break deployed models. Wherever possible, have the offline (training) and online (serving) paths share a single implementation of the transformation logic.
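A minimal sketch of sharing transformation logic, assuming a single hypothetical feature (days_since_signup) and hypothetical builder functions for the two paths. The feature is defined once and called from both the batch job and the request handler.

```python
from datetime import datetime, timezone


def days_since_signup(signup_at: datetime, as_of: datetime) -> int:
    """Single definition of the feature, used by both offline and online paths."""
    return max((as_of - signup_at).days, 0)


def build_training_rows(users, snapshot_at):
    """Offline path: compute the feature for a batch of users at a snapshot time."""
    return [
        {
            "user_id": u["user_id"],
            "days_since_signup": days_since_signup(u["signup_at"], snapshot_at),
        }
        for u in users
    ]


def build_online_features(user, request_at):
    """Online path: compute the same feature for one request."""
    return {
        "user_id": user["user_id"],
        "days_since_signup": days_since_signup(user["signup_at"], request_at),
    }


user = {"user_id": "u-1", "signup_at": datetime(2024, 1, 1, tzinfo=timezone.utc)}
as_of = datetime(2024, 1, 11, tzinfo=timezone.utc)

offline = build_training_rows([user], as_of)
online = build_online_features(user, as_of)
```

For the same user and the same point in time, both paths return identical values by construction, because neither one reimplements the calculation.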