01 ETL vs ELT
ETL (extract → transform → load) applies business rules before the data lands in the warehouse, which works well when the target schema is fixed. ELT loads raw or lightly cleaned data first, then transforms it inside the warehouse using SQL or Spark. This scales with modern warehouses and keeps raw history for reprocessing.
Because ELT keeps the raw data next to the transformation engine, you can replay transformations when definitions change. The pattern is common in analytics engineering and feature pipelines.
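A minimal sketch of the ELT idea, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table names (raw_orders, daily_revenue), the sample rows, and the revenue definition are all hypothetical, chosen only to show raw data landing untouched and the transform being replayed in SQL.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from a source system.
RAW_ORDERS = [
    ("o1", "2024-01-01", "19.99", "usd"),
    ("o2", "2024-01-01", "5.00", "USD"),
    ("o3", "2024-01-02", "12.50", "Usd"),
]

def load_raw(conn, rows):
    """The 'L' step: land rows untouched, every column as text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, order_date TEXT, amount TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)

def transform(conn):
    """The 'T' step: rebuild the derived table from raw data using SQL."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               UPPER(currency) AS currency,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_orders
        GROUP BY order_date, UPPER(currency)
        """
    )

conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_ORDERS)
transform(conn)
transform(conn)  # replaying the transform is safe: raw_orders is never modified

daily_revenue = conn.execute(
    "SELECT order_date, currency, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

If the revenue definition changes later, only transform needs editing; the raw table still holds the original strings, including the inconsistent currency casing.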
02 Medallion architecture
A layered lakehouse pattern: Bronze = immutable raw ingest (append-only), Silver = cleaned, conformed, deduplicated entities, Gold = business aggregates and curated marts for BI/ML. Each layer reduces ambiguity and cost for downstream users.
Not every org names layers this way, but the progressive refinement idea is universal: isolate ingestion bugs in bronze before they pollute gold KPIs.
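The progressive refinement described above can be sketched in plain Python. The records, field names, and the "latest valid record per id wins" deduplication rule are assumptions made for illustration; a real lakehouse would usually express these steps in Spark or SQL over stored tables.

```python
from datetime import datetime

# Bronze: append-only raw records, kept exactly as received (hypothetical data).
bronze = [
    {"id": "u1", "email": " A@Example.com ", "spend": "10.0", "ingested_at": "2024-01-01T00:00:00"},
    {"id": "u1", "email": "a@example.com", "spend": "15.0", "ingested_at": "2024-01-02T00:00:00"},
    {"id": "u2", "email": "b@example.com", "spend": "oops", "ingested_at": "2024-01-02T00:00:00"},
    {"id": "u3", "email": "c@example.com", "spend": "7.5", "ingested_at": "2024-01-02T00:00:00"},
]

def to_silver(rows):
    """Clean, type, and deduplicate: keep the latest valid record per id."""
    latest = {}
    for row in rows:
        try:
            spend = float(row["spend"])
        except ValueError:
            continue  # the bad record stays in bronze for debugging
        cleaned = {
            "id": row["id"],
            "email": row["email"].strip().lower(),
            "spend": spend,
            "ingested_at": datetime.fromisoformat(row["ingested_at"]),
        }
        current = latest.get(cleaned["id"])
        if current is None or cleaned["ingested_at"] > current["ingested_at"]:
            latest[cleaned["id"]] = cleaned
    return list(latest.values())

def to_gold(rows):
    """Aggregate silver entities into a small business-facing summary."""
    return {"customers": len(rows), "total_spend": sum(r["spend"] for r in rows)}

silver = to_silver(bronze)
gold = to_gold(silver)
```

The malformed u2 record never reaches the gold summary, yet it remains in bronze, so the ingestion bug can be investigated without a KPI having been affected.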
03 Batch vs streaming
Batch jobs (hourly, daily) maximize throughput and simplify correctness: each run can overwrite a whole partition, so reruns are idempotent. Streaming (event-by-event or micro-batch) minimizes latency for alerts and near-real-time features, at the cost of operational complexity (watermarks, late data, exactly-once semantics).
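One way to picture an idempotent partition is a batch job that overwrites the output for its run date instead of appending to it. The sketch below uses local JSON files; the dt=YYYY-MM-DD directory layout, the part-0000.json file name, and the helper names are assumptions, not a prescribed format.

```python
import json
import os
import tempfile
from pathlib import Path

def write_partition(root, run_date, rows):
    """Overwrite the partition for run_date, so reruns do not duplicate data."""
    partition_dir = Path(root) / f"dt={run_date}"
    partition_dir.mkdir(parents=True, exist_ok=True)
    final_path = partition_dir / "part-0000.json"
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=partition_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(rows, handle)
    os.replace(tmp_name, final_path)
    return final_path

def read_all(root):
    """Read every partition file under root into one list of rows."""
    rows = []
    for path in sorted(Path(root).glob("dt=*/part-*.json")):
        rows.extend(json.loads(path.read_text()))
    return rows

root = tempfile.mkdtemp()
rows = [{"order_id": "o1", "amount": 20}, {"order_id": "o2", "amount": 5}]
write_partition(root, "2024-01-01", rows)
write_partition(root, "2024-01-01", rows)  # rerun of the same day
total_rows = len(read_all(root))
```

Running the job twice for the same date leaves two rows, not four. An append-based writer would need separate deduplication to give the same guarantee.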
Hybrid is common: stream into a real-time store for online features, batch into the warehouse for training sets and reporting.
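The watermark and late-data concerns on the streaming side can be illustrated with a toy event-time counter. This is a simplified sketch, not how any particular stream processor implements it: the window size, the allowed lateness, and the WindowedCounter class are hypothetical, and event times are plain integers in seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window size (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the max event time (assumed)

class WindowedCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> event count
        self.max_event_time = 0
        self.emitted = []  # finalized (window_start, count) pairs
        self.dropped = []  # events that arrived after their window closed

    @property
    def watermark(self):
        return self.max_event_time - ALLOWED_LATENESS

    def process(self, event_time):
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self.watermark:
            self.dropped.append(event_time)  # too late: window already finalized
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every window that ends at or before the watermark.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.emitted.append((start, self.open_windows.pop(start)))

counter = WindowedCounter()
for t in [5, 50, 70, 40, 95, 20, 130]:  # events arrive out of order
    counter.process(t)
```

In this run the event at time 40 arrives out of order but is still counted, because its window is open when it shows up. The event at time 20 arrives after the watermark has passed the end of its window and is dropped. A batch job over the same data would see all events at once, which is part of why the hybrid setup keeps a batch path for training sets and reporting.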
04 Mental model
Good data engineering makes ML and analytics possible: without reliable grain, timestamps, and joins, models learn noise. Invest in lineage (where did this column come from?), tests (unique keys, not-null), and cost-aware storage (partitioning, clustering, lifecycle policies).
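The unique-key and not-null tests mentioned above can be sketched as two small checks over rows held as dictionaries. The function names and the sample orders are hypothetical; in practice such checks are often declared in a testing or data-quality framework and run against warehouse tables.

```python
def check_not_null(rows, column):
    """Return the indexes of rows where column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def check_unique(rows, key_columns):
    """Return the key tuples that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        key = tuple(row.get(c) for c in key_columns)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)

# Hypothetical table with one null customer_id and one duplicated order_id.
orders = [
    {"order_id": 1, "customer_id": "c1", "ts": "2024-01-01T10:00:00"},
    {"order_id": 2, "customer_id": None, "ts": "2024-01-01T11:00:00"},
    {"order_id": 2, "customer_id": "c3", "ts": "2024-01-01T12:00:00"},
]

null_rows = check_not_null(orders, "customer_id")
duplicate_keys = check_unique(orders, ["order_id"])
```

Both checks return the offending rows or keys instead of raising, so a pipeline can decide whether to fail the run, quarantine the rows, or only log a warning.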