Analytics engineering & this guide
Role fitWhat analytics engineers are tested on
Analytics engineers sit between data platform and stakeholders: they ship trustworthy metrics and models (often in dbt or the warehouse) and answer ad hoc questions without breaking grain or double-counting. Interviews usually stress SQL depth, dimensional thinking (facts vs dimensions, grain), window functions for cohorts and rolling KPIs, and knowing when a query is wrong before it is slow.
- Core SQL: execution order, joins, aggregates,
NULLsemantics, portable patterns across warehouses. - Advanced: CTEs for readability, window frames, data-quality checks,
EXPLAIN/ profiles for regressions. - Context here: downtime, machines, shifts, and OEE mirror real-world operational analytics; reuse the same SQL ideas for any event stream and dimension tables.
Fundamentals
Q1–Q5WHERE filters rows before aggregation. HAVING filters groups after aggregation. You cannot use aggregate functions inside WHERE.
Execution order visualization
-- ❌ WRONG — can't use COUNT() in WHERE SELECT machine_id, COUNT(*) FROM downtime_events WHERE COUNT(*) > 5 GROUP BY machine_id; -- ✅ CORRECT SELECT machine_id, COUNT(*) AS event_count FROM downtime_events WHERE shift_date >= '2024-01-01' -- row-level filter ✓ GROUP BY machine_id HAVING COUNT(*) > 5; -- group-level filter ✓
Visual — shaded region = rows returned
-- INNER: only machines that have had downtime events SELECT m.machine_name, COUNT(d.event_id) FROM machines m INNER JOIN downtime_events d ON m.id = d.machine_id GROUP BY m.machine_name; -- LEFT: ALL machines, NULL count if no downtime events SELECT m.machine_name, COUNT(d.event_id) AS event_count FROM machines m LEFT JOIN downtime_events d ON m.id = d.machine_id GROUP BY m.machine_name; -- Zero-downtime machines show count = 0, not missing
UNION removes duplicates — requires a sort pass, slower. UNION ALL keeps all rows including duplicates — no sort, faster. Use UNION ALL by default unless you specifically need deduplication.
-- Combine downtime events from two shifts -- UNION ALL: keep all (may include duplicates if data is duped) SELECT machine_id, event_time, reason FROM shift_1_events UNION ALL SELECT machine_id, event_time, reason FROM shift_2_events; -- UNION: deduplicate (slower — only use when needed) SELECT machine_id, event_time, reason FROM shift_1_events UNION SELECT machine_id, event_time, reason FROM shift_2_events;
Returns the first non-NULL value from its argument list. Essential for handling sensor dropouts, missing readings, or LEFT JOIN NULLs.
-- Sensor dropout: fall back to previous reading, then to 0 SELECT sensor_id, recorded_at, COALESCE(current_reading, prev_reading, 0) AS display_value, CASE WHEN current_reading IS NULL THEN 'IMPUTED' ELSE 'ACTUAL' END AS data_source FROM sensor_readings; -- NULL-safe division (avoid division-by-zero) COALESCE(units_good * 1.0 / NULLIF(units_total, 0), 0) AS quality_rate
COALESCE(value / NULLIF(denominator, 0), 0)| Command | What it removes | Rollback? | Speed | Factory rule |
|---|---|---|---|---|
| DELETE | Specific rows (WHERE) | Yes | Slow (logged) | Use for archiving specific records |
| TRUNCATE | All rows, keeps structure | No (usually) | Fast | Almost never — destroys raw event history |
| DROP | Entire table + structure | No | Instant | Never on production tables |
Aggregation & Grouping
Q6–Q9SELECT shift_id, machine_id, total_downtime, rnk FROM ( SELECT shift_id, machine_id, SUM(down_mins) AS total_downtime, RANK() OVER ( PARTITION BY shift_id ORDER BY SUM(down_mins) DESC ) AS rnk FROM production_events GROUP BY shift_id, machine_id ) ranked WHERE rnk <= 3 ORDER BY shift_id, rnk; -- Snowflake shorthand: QUALIFY instead of subquery SELECT shift_id, machine_id, SUM(down_mins) AS total_downtime, RANK() OVER (PARTITION BY shift_id ORDER BY SUM(down_mins) DESC) AS rnk FROM production_events GROUP BY shift_id, machine_id QUALIFY rnk <= 3;
Given values: 100, 100, 90 — how each function ranks them
| Score | ROW_NUMBER() | RANK() | DENSE_RANK() |
|---|---|---|---|
| 100 | 1 | 1 | 1 |
| 100 | 2 | 1 | 1 |
| 90 | 3 | 3 (skipped 2) | 2 |
ROW_NUMBER — always unique. Tie-breaking is arbitrary (non-deterministic). RANK — ties get same rank, then skips. DENSE_RANK — ties get same rank, no skip. For factory use: use DENSE_RANK when ranking machines by defect count — rank 2 should mean second worst, not third worst because two tied for first.
-- ROW_NUMBER: deduplicate — get exactly one row per machine SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY machine_id ORDER BY ts DESC) AS rn FROM sensor_readings ) WHERE rn = 1; -- latest reading per machine -- DENSE_RANK: rank machines by defect count DENSE_RANK() OVER (ORDER BY defect_count DESC) AS defect_rank
ROWS frame — what's "in the window" for row 5 (today)
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
SELECT event_date, daily_defects, AVG(daily_defects) OVER ( ORDER BY event_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS rolling_7day_avg, -- Calendar-based (handles missing dates differently): AVG(daily_defects) OVER ( ORDER BY event_date RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW ) AS rolling_7day_calendar_avg FROM daily_quality_summary ORDER BY event_date;
-- Step 1: Find duplicates SELECT machine_id, event_timestamp, COUNT(*) AS occurrences FROM sensor_readings GROUP BY machine_id, event_timestamp HAVING COUNT(*) > 1; -- Step 2: Keep only the latest row per (machine, timestamp) WITH deduped AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY machine_id, event_timestamp ORDER BY ingested_at DESC ) AS rn FROM sensor_readings ) SELECT * FROM deduped WHERE rn = 1;
Window Functions
Q10–Q12-- ROW_NUMBER approach (preferred — deterministic) SELECT machine_id, sensor_type, reading_value, recorded_at FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY machine_id, sensor_type ORDER BY recorded_at DESC ) AS rn FROM sensor_readings WHERE dq_flag = 'GOOD' ) t WHERE rn = 1; -- Snowflake shorthand with QUALIFY SELECT machine_id, sensor_type, reading_value, recorded_at FROM sensor_readings WHERE dq_flag = 'GOOD' QUALIFY ROW_NUMBER() OVER (PARTITION BY machine_id, sensor_type ORDER BY recorded_at DESC) = 1;
LAG() accesses the previous row's value. LEAD() accesses the next row's value. Both accept an offset and a default for when no prior/next row exists.
SELECT machine_id, event_start, LAG(event_start) OVER ( PARTITION BY machine_id ORDER BY event_start ) AS prev_event_start, DATEDIFF('minute', LAG(event_start) OVER (PARTITION BY machine_id ORDER BY event_start), event_start ) AS mins_since_last_downtime, LEAD(event_start) OVER ( PARTITION BY machine_id ORDER BY event_start ) AS next_event_start FROM downtime_events ORDER BY machine_id, event_start;
AVG(mins_since_last_downtime) on top of this query — a key factory reliability metric.ROWS is physical — counts actual row positions. RANGE is logical — groups rows with the same ORDER BY value, or uses an interval for date/numeric ranges.
ROWS vs RANGE — 3-day window with a missing date
| Date | Value | ROWS (3 PREC) | RANGE (2 days) |
|---|---|---|---|
| Jan 1 | 10 | AVG(10) | AVG(10) |
| Jan 2 | 20 | AVG(10,20) | AVG(10,20) |
| Jan 4 (gap!) | 30 | AVG(10,20,30) | AVG(20,30) only |
| Jan 5 | 40 | AVG(20,30,40) | AVG(30,40) |
-- ROWS: exactly 3 preceding rows regardless of date gaps AVG(reading) OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) -- RANGE: all rows within a 3-hour interval (handles gaps correctly) AVG(reading) OVER ( ORDER BY ts RANGE BETWEEN INTERVAL '3 hours' PRECEDING AND CURRENT ROW )
CTEs & Subqueries
Q13–Q14A CTE (WITH clause) names an intermediate result you can reference multiple times. Prefer CTEs when: the same intermediate result is used more than once, the logic is complex enough that naming it improves readability, or you need recursion (CTEs only).
WITH shift_oee AS ( SELECT pe.shift_id, pe.machine_id, SUM(pe.run_mins) * 1.0 / MAX(s.planned_mins) AS availability, SUM(pe.units_good) * 1.0 / NULLIF(SUM(pe.units_total), 0) AS quality, SUM(pe.units_total * dm.ideal_cycle_secs) / NULLIF(SUM(pe.run_mins) * 60, 0) AS performance FROM production_events pe JOIN dim_shift s ON pe.shift_id = s.shift_id JOIN dim_machine dm ON pe.machine_id = dm.machine_id GROUP BY pe.shift_id, pe.machine_id ) SELECT shift_id, machine_id, ROUND(availability * 100, 1) AS avail_pct, ROUND(performance * 100, 1) AS perf_pct, ROUND(quality * 100, 1) AS qual_pct, ROUND(availability * performance * quality * 100, 1) AS oee_pct FROM shift_oee WHERE availability < 0.75 -- flag underperforming shifts ORDER BY oee_pct;
How the recursion works
WITH RECURSIVE org AS ( -- Anchor: root nodes (no manager) SELECT employee_id, employee_name, manager_id, 0 AS depth, employee_name AS path FROM employees WHERE manager_id IS NULL UNION ALL -- Recursive: each employee joins to their manager's row SELECT e.employee_id, e.employee_name, e.manager_id, o.depth + 1, o.path || ' > ' || e.employee_name FROM employees e JOIN org o ON e.manager_id = o.employee_id ) SELECT employee_id, employee_name, depth, path FROM org ORDER BY path; -- Sample output: -- Alice | depth 0 | Alice -- Bob | depth 1 | Alice > Bob -- Dana | depth 2 | Alice > Bob > Dana -- Eve | depth 2 | Alice > Bob > Eve -- Carol | depth 1 | Alice > Carol
Data Quality & Real-World Problems
Q15–Q17Classic islands and gaps technique. Subtracting two row numbers creates a constant group ID for consecutive identical values.
How the group ID works
| Value | RN overall | RN per value | RN - RN_val = group_id |
|---|---|---|---|
| 21.5 | 1 | 1 | 0 |
| 21.5 | 2 | 2 | 0 |
| 21.5 | 3 | 3 | 0 |
| 22.1 | 4 | 1 | 3 (new group) |
| 21.5 | 5 | 4 | 1 (new group) |
WITH grouped AS ( SELECT sensor_id, reading_value, recorded_at, ROW_NUMBER() OVER (PARTITION BY sensor_id ORDER BY recorded_at) - ROW_NUMBER() OVER (PARTITION BY sensor_id, reading_value ORDER BY recorded_at) AS grp_id FROM sensor_readings ), consecutive_counts AS ( SELECT sensor_id, reading_value, grp_id, COUNT(*) AS consecutive_count, MIN(recorded_at) AS streak_start, MAX(recorded_at) AS streak_end FROM grouped GROUP BY sensor_id, reading_value, grp_id ) SELECT sensor_id, reading_value, consecutive_count, streak_start, streak_end FROM consecutive_counts WHERE consecutive_count >= 10 ORDER BY consecutive_count DESC;
WITH sensor_stats AS ( -- Compute baseline stats on GOOD readings only SELECT sensor_id, AVG(reading_value) AS mean_val, STDDEV(reading_value) AS stddev_val FROM sensor_readings WHERE dq_flag = 'GOOD' GROUP BY sensor_id ) SELECT r.sensor_id, r.reading_value, r.recorded_at, ROUND(s.mean_val, 3) AS mean_val, ROUND(ABS(r.reading_value - s.mean_val) / NULLIF(s.stddev_val, 0), 2) AS z_score, CASE WHEN ABS(r.reading_value - s.mean_val) / NULLIF(s.stddev_val, 0) > 3 THEN 'SUSPECT' ELSE 'GOOD' END AS dq_flag FROM sensor_readings r JOIN sensor_stats s ON r.sensor_id = s.sensor_id;
Two intervals [A_start, A_end] and [B_start, B_end] overlap if and only if: A_start < B_end AND A_end > B_start
Overlap condition — visualized
SELECT a.machine_id, a.event_id AS event_a, b.event_id AS event_b, a.start_time AS a_start, a.end_time AS a_end, b.start_time AS b_start, b.end_time AS b_end FROM downtime_events a JOIN downtime_events b ON a.machine_id = b.machine_id AND a.event_id < b.event_id -- avoid self-join & duplicate pairs AND a.start_time < b.end_time -- overlap condition part 1 AND a.end_time > b.start_time; -- overlap condition part 2
Performance & Design
Q18–Q20Clustered index — physically sorts table rows by the index key. One per table. Non-clustered index — separate structure pointing back to rows. Multiple allowed.
In Snowflake, data is stored in micro-partitions automatically. The equivalent is a cluster key — defining one causes Snowflake to sort data within micro-partitions, enabling more aggressive partition pruning on your most common filter columns.
-- Define cluster key on most common filter columns ALTER TABLE production_events CLUSTER BY (shift_date, machine_id); -- Check clustering information SELECT SYSTEM$CLUSTERING_INFORMATION('production_events');
shift_date and machine_id — these are the most common filter columns in OEE, downtime, and quality queries. Snowflake will prune micro-partitions that don't match, making queries dramatically faster at scale.| Command | Runs query? | Shows | When to use |
|---|---|---|---|
| EXPLAIN | No | Estimated plan, estimated row counts | Query too expensive to run speculatively |
| EXPLAIN ANALYZE | Yes | Actual plan, actual row counts, actual time per node, spills | Diagnosing a slow query already in production |
| Snowflake Query Profile | Yes (post-run) | Node-level stats, spill to disk, partition pruning ratio | Always use after any Snowflake query optimization |
Factory Analytics Patterns
OEE + Star SchemaOEE = Availability × Performance × Quality
Star schema — fact_production_event
id · date · planned_mins
id · name · ideal_rate
run_mins · down_mins
units_good · actual_rate
sku · spec_rate
code · category · planned?
WITH oee_calc AS ( SELECT s.shift_date, pe.machine_id, m.machine_name, SUM(pe.run_mins) * 1.0 / MAX(s.planned_mins) AS availability, SUM(pe.units_total * m.ideal_cycle_secs / 60.0) / NULLIF(SUM(pe.run_mins), 0) AS performance, SUM(pe.units_good) * 1.0 / NULLIF(SUM(pe.units_total), 0) AS quality, COUNT(CASE WHEN pe.dq_flag = 'GOOD' THEN 1 END) * 1.0 / NULLIF(COUNT(*), 0) AS data_quality_score FROM fact_production_event pe JOIN dim_shift s ON pe.shift_id = s.shift_id JOIN dim_machine m ON pe.machine_id = m.machine_id WHERE s.shift_date >= DATEADD('day', -30, CURRENT_DATE()) GROUP BY s.shift_date, pe.machine_id, m.machine_name ) SELECT shift_date, machine_id, machine_name, ROUND(availability * 100, 1) AS avail_pct, ROUND(performance * 100, 1) AS perf_pct, ROUND(quality * 100, 1) AS qual_pct, ROUND(availability * performance * quality * 100, 1) AS oee_pct, ROUND(data_quality_score * 100, 1) AS dq_score_pct, -- Rolling 7-day OEE trend AVG(availability * performance * quality) OVER ( PARTITION BY machine_id ORDER BY shift_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) * 100 AS oee_7day_rolling_avg FROM oee_calc ORDER BY shift_date, oee_pct;
Cheat Sheet
| Function | Purpose | Key gotcha |
|---|---|---|
| WHERE | Filter rows before GROUP BY | Cannot use aggregate functions |
| HAVING | Filter groups after GROUP BY | Requires GROUP BY (or window context) |
| COALESCE(a,b,c) | First non-NULL value | Evaluates left to right, stops at first non-NULL |
| NULLIF(a,b) | Return NULL if a=b, else a | Pair with COALESCE for safe division |
| ROW_NUMBER() | Unique row rank (no ties) | Tie-breaking is non-deterministic |
| RANK() | Rank with ties, skips next | Gaps in sequence when ties occur |
| DENSE_RANK() | Rank with ties, no skip | Use for factory machine rankings |
| LAG(col, n, default) | Previous row value | Returns NULL if no prior row (set default) |
| LEAD(col, n, default) | Next row value | Returns NULL if no next row (set default) |
| ROWS BETWEEN | Physical row count window | Always exact N rows regardless of date gaps |
| RANGE BETWEEN | Logical value-based window | Use with INTERVAL for time-series |
| WITH RECURSIVE | Hierarchical traversal | Must use UNION ALL, not UNION |
| QUALIFY | Filter on window function result | Snowflake / BigQuery only |
| UNION ALL | Combine results, keep dupes | Faster than UNION — use by default |
| EXPLAIN ANALYZE | Actual execution stats | Actually runs the query — use on slow prod queries |
| SUM(CASE WHEN …) | Conditional aggregation (all warehouses) | Prefer COUNT_IF / BOOL_AND in Snowflake when it reads cleaner |
| FILTER (WHERE …) | Postgres-style conditional agg | Not everywhere; CASE/SUM is the portable default |
| Surrogate key | Stable id for a dimension row | Often ROW_NUMBER() or hash — document grain and SCD rules |
| SCD Type 2 | History when attributes change | valid_from / valid_to or is_current; join on “as-of” date |
| Incremental model | dbt-style merge by key + watermark | Align keys with business grain; test uniqueness |
Live-coding tip
Many analytics engineering loops end with one end-to-end metric query—here, OEE from a star schema with a rolling trend. Practice explaining grain, joins, and why each term is multiplied so you can adapt the same structure to revenue, retention, or funnel conversion.