ML/AI Engineer Interview

LLM Fundamentals 5 questions ▶

01 What is tokenization, and how does it affect generation? ▶

Tokenization converts raw text into integer IDs from a fixed vocabulary (e.g., 50K tokens for GPT-4 using BPE). The model never sees characters — only token IDs and their embeddings.

Why it matters for generation:

Cost: billing is per token. "manufacturing" might be 1 token; "Weirton" might be 2–3. Know your domain vocabulary.
Arithmetic: models are notoriously bad at math because "1234" may be tokenized as ["12", "34"] — the model sees two chunks, not a single number.
Context window: 1 token ≈ 0.75 English words. 128K context ≠ 128K words.
Rare/domain terms: "FeO" or "OEE" may split unexpectedly. Test your prompts with a tokenizer (tiktoken for OpenAI).

For factory AI: sensor tags like CELL_3_TEMP_AVG are often split 4–6 ways. Pre-processing matters.

02 How do embeddings really work? ▶

An embedding is a dense vector (e.g., 768 or 1536 dimensions) that encodes semantic meaning. Tokens → embeddings via a learned lookup table. The transformer then operates in this vector space.

Key intuitions:

Semantically similar text → nearby vectors (cosine similarity)
Embeddings capture context: "iron" near "battery" is different from "iron" near "steel"
They're not human-interpretable — there's no "dimension 42 = temperature"

In RAG systems: you embed chunks at index time and embed queries at retrieval time, then find nearest neighbors. Embedding model consistency matters — switching models requires full reindex.

⚡ INTERVIEWER TRAP: Don't confuse token embeddings (learned lookup table) with sentence/document embeddings (pooled or special [CLS] token from an encoder model like BERT/E5).

03 What's the role of attention and positional encoding? ▶

Attention lets each token attend to every other token in the sequence. For each token, you compute a weighted sum of all other token values, where weights come from query-key dot products (scaled, softmaxed).

Multi-head attention runs this in parallel across multiple subspaces, allowing the model to simultaneously capture different types of relationships (syntactic, semantic, coreference).

Positional encoding solves the problem that attention itself is order-agnostic — the same set of tokens in any order would produce the same output. Positional encodings inject token position information into the embedding before attention. Modern models use RoPE (Rotary Position Embedding) which applies rotation to Q/K vectors, enabling better length extrapolation (and tricks like NTK-aware / YaRN scaling to stretch context at inference).

Staff-level depth: full attention is O(n²) in sequence length, which is why context is expensive. The key optimizations to name:

KV-cache: past K/V computed once, reused per decode step — turns generation from O(n²) recompute into O(n) incremental.
FlashAttention: IO-aware kernel that tiles the attention computation in SRAM and never materializes the full n×n matrix — same math, far less memory bandwidth.
GQA / MQA (Grouped/Multi-Query Attention): share K/V heads across query heads to shrink the KV-cache — the main lever for serving long context cheaply (Llama 3, Mistral use GQA).

⚡ Interviewer trap: "why is the KV-cache the bottleneck for long-context serving?" → it grows linearly with sequence × layers × heads and lives in VRAM; GQA and PagedAttention (see Staff-Level §) exist to manage exactly this.

04 What changes during fine-tuning? (optimizers, schedulers, layer freezing) ▶

Fine-tuning updates model weights on a task-specific dataset. Key decisions:

What to freeze: early layers encode general syntax/features; later layers are task-specific. Freeze early layers, train later layers to preserve general capability.
Optimizer: AdamW (Adam + weight decay) is standard. Weight decay prevents overfitting on small datasets. Use smaller LR than pretraining (e.g., 1e-5 vs 3e-4).
LR Scheduler: warmup (linear or cosine) prevents early large updates from destabilizing pretrained weights. Cosine annealing for smooth decay.
Catastrophic forgetting: fine-tuning too aggressively destroys pretrained capability. Addressed by LoRA (see next question) or lower LR + fewer steps.
Data quality: 10K high-quality examples beats 1M noisy ones.

05 LoRA vs QLoRA vs full fine-tune — tradeoffs? ▶

Full fine-tune: update all weights. Best quality, but requires massive GPU memory and compute. Risk of catastrophic forgetting. Only justified if you have massive domain data.

LoRA (Low-Rank Adaptation): freeze base weights, inject small trainable rank-decomposition matrices (A×B) into attention layers. Train only ~0.1–1% of parameters. Nearly equivalent quality to full fine-tune for most tasks.

QLoRA: LoRA on a 4-bit quantized base model. Reduces VRAM from ~80GB to ~12GB for a 13B model. Some quality degradation from quantization but acceptable for most applications. The practical choice for most teams.

Decision matrix:

Limited GPU, domain adaptation → QLoRA
Good GPU, production model → LoRA
Massive domain shift, unlimited compute → Full fine-tune
No data, just behavior tuning → Prompt engineering first

⚡ Factory AI angle: for anomaly detection or alert classification on sensor data, QLoRA on Llama 3.1 8B is my default. Cheaper than GPT-4, deployable on-prem.

PROMPT Prompting & Context Engineering 5 questions ▶

06 Few-shot vs zero-shot — which works better where? ▶

Zero-shot: just a task description. Works well for large models (GPT-4, Claude 3.5+) on common tasks. Faster iteration, less prompt engineering.

Few-shot: include 3–8 input/output examples in the prompt. Critical when: (1) output format is non-standard, (2) domain terminology is unusual, (3) task requires specific reasoning patterns, or (4) using smaller/open-source models.

When few-shot wins: classification of manufacturing defect codes, extracting structured JSON from sensor logs, domain-specific summarization. The examples calibrate format + tone.

Pitfall: poor-quality examples are worse than zero-shot. Garbage few-shot = garbage output.

⚡ Chain-of-thought prompting (add "think step by step") is its own category — empirically improves multi-step reasoning even at zero-shot.

07 How do you design system prompts that are robust across users? ▶

Robust system prompts must handle adversarial inputs, edge cases, and diverse user styles without breaking. Key principles:

Explicit over implicit: don't say "be helpful" — say exactly what actions are in-scope and out-of-scope
Constrain output format: specify JSON schema, length limits, field names. Don't leave format to the model's discretion in production
Persona + boundary: "You are a manufacturing process assistant. You only answer questions about this factory's equipment. If asked about unrelated topics, respond: 'I can only help with plant floor questions.'"
Inject context slots: use template variables ({{user_role}}, {{shift}}, {{equipment_id}}) — same base prompt, parameterized per user
Test with adversarial prompts: jailbreak attempts, out-of-distribution queries, empty inputs

08 How do you make LLM output deterministic? ▶

Temperature = 0 is the primary lever. This makes sampling greedy (always pick highest probability token). Not fully deterministic in practice due to floating-point nondeterminism across GPU runs, but functionally stable.

Additional controls:

top_p = 1.0, top_k = disabled — no sampling variation
Seed parameter (supported by OpenAI API) — same seed + same prompt ≈ same output
Output structure: JSON mode or function calling constrain the output space mechanically
Constrained decoding: libraries like outlines or guidance enforce grammar-based output — guarantees valid JSON/SQL regardless of temperature

Honest answer for interviews: you can get high reproducibility but not true determinism across different hardware/versions. Design systems to handle occasional variation.

09 How do you track, version, and backfill changing context? ▶

Context evolves: system prompts change, docs get updated, new equipment is added. Treat context like code:

Prompt versioning: store prompts in Git (not hardcoded). Tag releases. Log which prompt version generated each output.
Context metadata: each stored context chunk gets version, created_at, source, hash fields
Backfill strategy: when a source document updates, mark old chunks as stale, re-embed, swap atomically (blue/green embedding index)
Audit trail: every LLM call logs (prompt_version, context_chunks_used, model_version, output_hash) — essential for debugging regressions

10 How do you build and maintain memory in an LLM system? ▶

Memory types and when to use each:

In-context (short-term): include recent conversation turns in the prompt. Cheap but limited by context window. Truncate with sliding window or summarization.
External vector store (long-term episodic): embed past interactions, retrieve relevant ones. Good for "what did we discuss about Cell 3 last week?"
Structured memory (entity/state): maintain a JSON/DB record of known entities. "User is a process engineer, prefers metric units, works Line 2." Update programmatically, not via LLM.
Semantic compression: periodically summarize old context into a compact representation, discard raw turns

Anti-pattern: using the LLM itself to manage memory state. LLMs hallucinate; your memory store must be a reliable database with deterministic reads/writes.

RAG RAG Systems 4 questions ▶

11 What's your chunking strategy — by length, semantics, or structure? ▶

No single strategy is universal. Match chunking to document structure:

Fixed-size with overlap: simple, 512 tokens, 50-token overlap. Baseline. Works for homogeneous docs.
Semantic chunking: split on embedding similarity drops between sentences. Preserves topical coherence. More expensive to index but better retrieval.
Structural chunking: for docs with known structure (SOPs, manuals, tables) — split by section headers, table rows, or numbered steps. Best precision for technical docs.
Hierarchical (parent-child): index small chunks for retrieval, but return larger parent chunk as context. Best of both worlds — precision in retrieval, completeness in generation.

My default for factory SOPs: structural chunking by section, with metadata injection (doc title, section header, last updated) into every chunk.

12 How do you choose a vector DB — Chroma, Pinecone, OpenSearch? ▶

Chroma: local dev, prototyping. No ops overhead. Not for production scale.
Pinecone: managed, fast, good filtering. But vendor lock-in, no on-prem. Good for cloud-native teams.
OpenSearch / Elasticsearch: hybrid search (BM25 + vector) in one system. Strong if you already run ES. Operationally complex.
pgvector: PostgreSQL extension. Best if you're already on Postgres and scale is ≤10M vectors. Zero new infrastructure.
Weaviate / Qdrant: open-source, self-hosted, production-grade. Qdrant is particularly fast with good filtering support.

Decision criteria: (1) do you need on-prem? → Qdrant/Weaviate, (2) already on Postgres? → pgvector, (3) need hybrid BM25+vector? → OpenSearch, (4) cloud-only, fast start? → Pinecone

⚡ For FactoryOps: on-prem requirement + ClickHouse already in stack → pgvector or Qdrant. Don't add Pinecone to an air-gapped factory network.

13 Can you update or backfill embeddings with zero downtime? ▶

Yes — treat it like a database migration:

Dual-index strategy: keep the old index live while building the new one in parallel. Route queries to old index until new is ready.
Shadow indexing: new documents go into both old and new indexes simultaneously during migration window
Atomic cutover: flip router to new index, verify, then decommission old
Incremental backfill: batch-process stale chunks in background, mark each as re_embedded=true when done

Key constraint: you cannot mix embeddings from different models in the same index (different vector spaces). Plan model upgrades as full migrations, not in-place updates.

14 How do you evaluate retrieval quality — precision@k, reranking, citation? ▶

Offline metrics:

Precision@k: of top-k retrieved chunks, what % are relevant? Build a golden QA dataset with labeled relevant chunks.
Recall@k: does the correct answer appear in top-k? More important for RAG than precision alone.
MRR (Mean Reciprocal Rank): rewards systems that put the best chunk at position 1

Reranking: after initial vector retrieval (top-20), use a cross-encoder (e.g., Cohere rerank, BGE-reranker) that scores query+chunk jointly. More expensive but significantly improves precision.

Citation tracking: every answer chunk gets source_doc, chunk_id, page metadata. Expose citations in the UI — the best hallucination detector is asking users "does this match the source?"

Online evaluation: thumbs up/down on answers, correction tracking, time-to-answer as proxy for user confidence.

MLOPS MLOps & LLMOps 4 questions ▶

15 Sketch a pipeline: raw data → model → serving → feedback ▶

Data: ingest → validate schema → clean → version with DVC or Delta Lake

Training: feature store → model training → experiment tracking (MLflow/W&B) → model registry with metadata (dataset hash, metrics, training config)

Serving: containerized model (Docker) → serving layer (FastAPI/Triton/vLLM) → load balancer → API gateway with auth + rate limiting

Observability: log every prediction (input, output, latency, model version) → Prometheus metrics → Grafana dashboard → alerting on latency spikes or confidence drops

Feedback loop: user corrections → label store → periodic retraining trigger → A/B test new model → canary deploy (5% traffic) → full rollout

⚡ For LLMs specifically: add prompt/output logging to S3/ClickHouse, LLM-as-judge evals on a sample, and human review queue for low-confidence outputs.

16 How would you monitor performance drift or hallucinations? ▶

Drift detection:

Statistical tests (PSI, KL divergence) on input feature distributions vs training baseline
Output distribution shifts — if answer length, topic distribution, or confidence scores change significantly
Model accuracy on a held-out reference set run weekly

Hallucination detection:

LLM-as-judge: secondary call to Claude/GPT-4 asking "does this answer contradict the provided context?" on a sample
Factual consistency score: NLI models (e.g., DeBERTa) to check entailment between retrieved context and generated answer
Citation grounding: structured outputs that include source reference — verify reference exists and contains claimed content
Human review queue: route low-confidence outputs (or random 2%) to human reviewers

17 How do you log prompts and outputs for debugging and auditing? ▶

Every LLM call should emit a structured log event with:

request_id, session_id, user_id
model, model_version, prompt_version
system_prompt_hash, rendered_prompt (or hash + pointer to S3)
context_chunks: list of chunk IDs used
raw_output, parsed_output
latency_ms, input_tokens, output_tokens, cost_usd
temperature, timestamp

Store full prompts in object storage (S3/GCS), log only the pointer. ClickHouse is excellent for aggregating LLM telemetry at scale — fast on cost analysis, latency percentiles, error rates by prompt version.

18 CI/CD for LLM workflows — what's different from standard ML? ▶

Standard ML CI/CD: test code, validate model metrics, deploy container. LLM adds:

Prompt regression testing: a golden test suite of input/expected_output pairs run against every prompt change. Catch regressions before deploy.
Eval harness: automated LLM-as-judge scoring on your test suite (not just unit tests — semantic correctness)
Non-determinism: run each test case 3–5x at temp=0 and check consistency. Flag high variance outputs.
Model version pinning: gpt-4o-2024-11-20 not gpt-4o. API model aliases change. Pin exact versions in prod.
Shadow deployment: new prompt version serves 5% traffic, compare output quality vs control before full rollout

COST Cost & Latency Tradeoffs 4 questions ▶

19 How do you reduce token usage? ▶

Compress system prompts: audit prompts for redundancy. "Please be concise and helpful and professional" → cut to task definition only.
Selective retrieval: retrieve top-3 chunks, not top-10. Each chunk = ~200 tokens. Know what you're spending.
Output constraints: max_tokens limit + format constraints. JSON outputs are cheaper than prose for structured tasks.
Smaller model routing: classify query complexity first (fast, cheap), route simple queries to GPT-3.5/Haiku, complex to GPT-4/Sonnet
Cache identical prompts: semantic deduplication. Same question asked 50 times → serve cached response (see Q21)
Summarize history: instead of appending full conversation, maintain a 200-token running summary

20 When should you quantize a model? ▶

Quantization reduces weight precision (FP32 → INT8/INT4) to shrink model size and speed up inference.

Use when:

Deploying on-prem with limited GPU VRAM (quantize to fit a 7B/13B on a single GPU)
Latency is critical and you can accept minor quality degradation
Cost optimization on self-hosted models (more requests per GPU)

Avoid when:

Task requires high numerical precision (math, code)
You're already using a tiny model (further quantization hits quality cliff)

Methods: GPTQ (post-training, accurate), AWQ (fast inference), GGUF (CPU-friendly via llama.cpp). INT8 is safe for most tasks; INT4 needs testing.

21 What's your batching and caching strategy to reduce latency? ▶

KV-cache (built-in): transformers cache key-value pairs for the prompt prefix. Constant system prompts are cached automatically — keep them at the top of the prompt.

Semantic caching: embed incoming queries, check vector similarity against recent queries. If cosine sim > 0.97, serve cached response. Redis + pgvector works well. Reduces API cost dramatically for repetitive use cases.

Request batching: group multiple requests into a single model call (for offline/async tasks). vLLM's continuous batching is the standard for high-throughput serving.

Streaming: use SSE/streaming responses to reduce perceived latency — user sees first tokens in 200ms instead of waiting 5s for full response.

Speculative decoding: small draft model generates tokens, large model verifies in parallel. 2–3x throughput improvement for large models.

22 When to use hosted APIs vs open-source models? ▶

Hosted API (OpenAI, Anthropic, Gemini):

Best model quality today, no infra overhead
Use for: prototypes, low-volume production, tasks needing frontier capability
Avoid for: data privacy requirements, air-gapped environments, high-volume low-margin workloads

Open-source (Llama 3, Mistral, Qwen):

Self-host on-prem or cloud VMs. Full data control.
Use for: factory/OT environments with data sovereignty, high-volume inference where cost matters, fine-tuning for domain specialization
Requires: GPU infra, model serving (vLLM/Triton), ongoing ops

My rule: prototype on hosted API, productionize on open-source if volume justifies it or data can't leave the building. For Form Energy: everything stays on-prem.

SYSTEM System Design Thinking 4 questions ▶

23 How do you make an AI system more deterministic and less brittle? ▶

Constrained outputs: JSON schema validation, function calling, constrained decoding (outlines/guidance). Never trust free-text output as structured data.
Parse → validate → act: three-layer pipeline. LLM generates, deterministic parser validates, business logic executes. Failures at layer 2 never reach layer 3.
Idempotent tools: if the LLM calls a tool twice, the second call should be safe. Write-once, append-only designs reduce blast radius.
Explicit state machine: for agentic workflows, maintain system state in a deterministic FSM. The LLM suggests transitions; the FSM validates and executes.
Fallback chains: LLM call fails → structured fallback → human escalation. Never a dead end.
Input normalization: pre-process user input (lowercase, strip noise, canonical forms) before it hits the model. Garbage in = garbage out.

24 What fallback do you use if the LLM fails mid-task? ▶

Fail modes: API timeout, malformed JSON output, rate limit, context overflow, hallucinated tool call.

Fallback strategy:

Retry with backoff: transient errors (timeout, 429). Exponential backoff, max 3 attempts.
Model fallback: primary GPT-4 fails → fallback Claude Haiku or local Llama. Accept lower quality over downtime.
Prompt retry: if output failed schema validation, retry with explicit error feedback in the prompt: "Your previous response was invalid JSON. Fix it."
Graceful degradation: if LLM copilot is down, fall back to keyword search + static templates. System still works, just dumber.
Human escalation: for high-stakes decisions (maintenance approval, alert acknowledgment), route to human if LLM confidence is low or system has retried 3+ times.

25 Can you solve this without an LLM or vector DB? ▶

This is the most important question an AI engineer can ask. LLMs are expensive, slow, and non-deterministic. Before reaching for one, check:

Can a regex or rule engine handle this? → Yes: use it.
Can a traditional ML model (XGBoost, logistic regression) solve it? → Yes: use it. Faster, cheaper, explainable.
Can BM25 full-text search replace the vector DB? → For many doc retrieval tasks, yes.
Is the "AI" feature actually just a pre-built NLP library? spaCy, HuggingFace classifiers, sentence-transformers?

Use LLMs when: the task requires reasoning, generation, or language understanding that structured approaches genuinely can't provide. Not just because it's cool.

⚡ Interviewers love this mindset. It signals engineering judgment over hype.

26 What's the right database for this task — SQL, NoSQL, or vector? ▶

SQL (PostgreSQL, ClickHouse): structured data, ACID transactions, complex joins, analytics. Time-series OEE data → ClickHouse. Transactional records → PostgreSQL.
NoSQL (MongoDB, Cassandra): schema-flexible documents, high write throughput, no complex joins needed. Event logs, session data, device metadata.
Vector DB (pgvector, Qdrant): semantic similarity search. Required for RAG. Not for structured queries.
Graph DB (Neo4j): highly connected data, relationship traversal. Equipment dependency graphs, supply chain linkages.
Hybrid: most real systems use 2–3. PostgreSQL + pgvector covers both relational and semantic. ClickHouse + Redis covers analytics + low-latency cache.

The anti-pattern: using a vector DB as a general-purpose store because it's "AI." Know what each database is optimized for.

INDUSTRIAL Industrial & Time-Series ML 7 questions ▶

27 How do you frame predictive maintenance as an ML problem? ▶

"Predict failures" is not one problem — it's a family of framings, and choosing the right one is the senior signal:

Anomaly detection (unsupervised): when you have almost no labeled failures. Learn "normal" from healthy operation, flag deviations. Most factory starting points live here because failures are rare and labels are weak.
Binary classification (will it fail in horizon H?): needs labeled failure events. Define a prediction window and a lead time; beware label leakage from post-failure sensor readings.
Remaining Useful Life (RUL regression): predict time-to-failure. Powerful but needs run-to-failure histories (rare) — often synthesized or transfer-learned.
Survival analysis: models censored data correctly (most assets haven't failed yet). Underused but the statistically honest choice.

The hard part is labels, not models. Failure labels come from CMMS / work orders and are noisy: wrong timestamps, "replaced X but Y was the cause," preventive vs reactive maintenance mixed together. Spend your effort on label quality and a leakage-free temporal split (train on past, validate on future), not on swapping XGBoost for a transformer.

⚡ Factory angle: I'd start with anomaly detection + a rules baseline on the top 5 failure modes by downtime cost, prove ROI, then graduate to RUL only where run-to-failure data exists.

28 Time-series anomaly detection on sensor data — methods and pitfalls? ▶

Methods, roughly increasing in power/cost:

Statistical / control charts: rolling mean ± kσ, EWMA, CUSUM, Western Electric rules. Cheap, explainable, and often the right answer on the plant floor.
Decomposition: STL / seasonal-trend decomposition, then threshold the residual — handles shift patterns and seasonality.
Distance / density: Isolation Forest, matrix profile (great for motifs/discords in univariate signals).
Reconstruction-based: autoencoders / LSTM-AE / VAE for multivariate; high reconstruction error = anomaly. Captures cross-sensor correlations a single-tag threshold misses.
Forecast-residual: predict next value, alarm on large residual.

Pitfalls that sink projects:

No ground truth → you can't compute precision/recall. Build a labeled incident set with operators or you're flying blind.
Concept drift: "normal" changes with product mix, tooling, season. A static threshold becomes an alarm flood.
Alarm fatigue: precision matters more than recall here. One credible alert beats 50 noisy ones. Add hysteresis / minimum-duration / debounce.
Sensor faults masquerade as anomalies: a stuck or dropped sensor looks "anomalous" — validate data quality before model output.

29 Data drift vs model drift in OT environments — detect and handle? ▶

Distinguish three things:

Covariate / data drift: input distribution moves (new raw material lot, ambient temperature, recalibrated sensor). Detect with PSI / KS test / population stability on key features.
Concept drift: the input→output relationship changes (same vibration profile now means something different after a retrofit). Only detectable with fresh labels or proxy outcomes.
Data-quality drift: missing tags, unit changes, clock skew, OPC-UA gaps. In OT this is the most common and most ignored.

Handling: schema + range validation at ingest (Great Expectations-style), a drift monitor that compares live windows to the training baseline, scheduled re-fit cadence tied to known process changes (tool change, line requal), and a fast rollback path. Tie drift alerts to a human-in-the-loop review, not auto-retrain — auto-retraining on drifted/garbage data is how you bake a bad state in permanently.

⚡ Senior framing: in manufacturing, "the model degraded" is usually a data pipeline or instrumentation problem, not a modeling problem. Instrument the pipeline first.

30 Edge vs cloud inference for factory ML — how do you decide? ▶

Decide on four axes — latency, connectivity, data gravity, and safety:

Edge when: control-loop or closed-loop latency (<10–100ms), intermittent/air-gapped network, the line must keep running if the cloud is down, or data volume is too large to ship (high-rate vibration, vision frames). Run quantized models on an industrial PC / Jetson / PLC-adjacent gateway.
Cloud when: heavy models, cross-line/fleet aggregation, retraining, and analytics that tolerate seconds of latency.
Hybrid (the real answer): inference at the edge for real-time decisions, stream features/results to cloud for fleet learning, push model updates back down (MLOps over OT). This is the classic OT/IT split.

OT realities to name: determinism > raw accuracy near safety functions, change control / validation (you can't hot-swap a model on a safety-rated line casually), and the IT/OT trust boundary (a model service should never have write access to control logic without an interlock).

31 How do you handle extreme class imbalance (rare defects/failures)? ▶

Pick the right metric first: accuracy is meaningless at 0.1% defect rate. Use PR-AUC, recall at a fixed precision, or cost-weighted metrics tied to scrap/escape cost. Align the threshold to the business cost ratio of false-alarm vs missed-defect.
Resampling: class weights or focal loss before SMOTE; SMOTE on high-dim sensor/image data often fabricates unrealistic samples. Undersample the majority for speed.
Reframe as anomaly detection: if defects are rare and varied, model "good" and flag outliers rather than learning every defect class.
Threshold + human review: ship a high-recall model that routes flagged units to inspection, not auto-reject. Capture the inspector's verdict to grow the labeled set (active learning).

⚡ The escape (missed defect reaching the customer) usually costs 10–100× the false reject. Make that ratio explicit and let it set the operating point.

32 How do you build a feature store / pipeline for streaming sensor data? ▶

Sensor ML lives or dies on point-in-time correctness and train/serve consistency:

Ingest: PLC/OPC-UA → MQTT/Kafka → a time-series store (TimescaleDB/ClickHouse) for raw, plus a feature layer.
Windowed features: rolling stats (mean/std/min/max/RMS), FFT bands for vibration, rate-of-change, time-since-last-event. Define them once and compute the same way offline (training) and online (serving) — the #1 source of training/serving skew.
Point-in-time joins: when labeling a failure, only join features available before the event. No peeking into the future = no leakage.
Online/offline parity: a feature store (Feast-style) or a shared transformation library guarantees the serving path uses identical logic.
Late / out-of-order data: OT networks drop and reorder. Use event-time windows with watermarks, not processing-time.

33 Where does an LLM/RAG layer actually add value on the plant floor? ▶

Be the engineer who knows where not to use an LLM. Real-time control and anomaly detection are classical/streaming ML — an LLM in a control loop is an anti-pattern. LLMs earn their place at the human-interface and knowledge layer:

Tribal-knowledge RAG: query SOPs, maintenance manuals, and historical work orders in natural language ("what's the fix when Cell 3 throws torque fault 0x4?").
Alert enrichment: when the anomaly model fires, an LLM drafts the likely root cause + recommended action by retrieving similar past incidents — the classical model decides, the LLM explains.
Text-to-SQL / conversational analytics over OEE and downtime data for engineers who don't write SQL.
Shift handover & 8D/5-Why drafting from structured event logs.

Architecture: deterministic ML and rules make decisions; the LLM sits on top for retrieval, explanation, and natural-language access — always with citations back to the source SOP/record.

⚡ This is exactly the QualityMind-RAG / AEGIS split: streaming correlation engine for detection, RAG + Text-to-SQL for the human layer.

STAFF Staff-Level: Serving, Training & Advanced RAG 6 questions ▶

34 How does vLLM / PagedAttention improve throughput? ▶

The bottleneck in LLM serving is the KV-cache, not FLOPs. Naive serving pre-allocates a contiguous cache for max sequence length per request → massive internal fragmentation and wasted VRAM, which caps batch size.

PagedAttention: treats the KV-cache like OS virtual memory — non-contiguous fixed-size blocks (pages) with a block table. Near-zero fragmentation → fit far more concurrent sequences in the same VRAM.
Continuous (in-flight) batching: instead of waiting for a whole batch to finish, finished sequences are evicted and new requests join mid-flight every step. Keeps the GPU saturated → 10–20× throughput vs static batching under real traffic.
Prefix caching: shared prompt prefixes (your big system prompt, few-shot examples) share KV blocks across requests — pay for the prefix once.

⚡ Crisp takeaway: vLLM wins by managing memory, not by faster matmuls. Pair with GQA models to shrink the cache further.

35 Distributed training: DP vs TP vs PP vs FSDP/ZeRO — when each? ▶

Data Parallel (DDP): replicate the full model on each GPU, split the batch, all-reduce gradients. Use when the model fits on one GPU and you just want to go faster. Simplest, communication-light.
FSDP / ZeRO: shard optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across GPUs, gather just-in-time. The default for training models that don't fit in one GPU's memory — keeps DP's simplicity, removes the redundant memory.
Tensor Parallel (TP): split individual matmuls (attention/MLP) across GPUs. High communication → keep it within a node over NVLink. Needed when a single layer is too big.
Pipeline Parallel (PP): split layers into stages across nodes; micro-batches keep stages busy (watch the "bubble"). Scales across nodes where bandwidth is lower.

Reality: large training is 3D parallelism — TP within a node, PP across nodes, DP/FSDP on top. For most teams fine-tuning, FSDP (or DeepSpeed ZeRO-3) + LoRA covers 90% of needs without touching TP/PP.

36 Advanced RAG beyond top-k: what's in a production pipeline? ▶

"Embed → top-k → stuff" is a demo. Production retrieval is a pipeline:

Query transformation: rewriting, decomposition (split multi-hop questions), and HyDE (generate a hypothetical answer, embed that to retrieve) for vague queries.
Hybrid retrieval: dense (vectors) + sparse (BM25), fused with Reciprocal Rank Fusion. Sparse catches exact part numbers / error codes that embeddings blur.
Reranking: cross-encoder over the top-N candidates to re-order before context assembly — biggest precision win per dollar.
Metadata filtering: pre-filter by equipment/line/doc-version so retrieval is scoped, not global.
GraphRAG: build an entity/relationship graph for multi-hop and "summarize across many docs" questions that flat chunk retrieval can't answer.
Context assembly: parent-child expansion, dedup, and ordering (put the strongest evidence where the model attends — not lost in the middle).

⚡ Don't add all of this at once. Instrument retrieval (recall@k on a golden set), then add the cheapest component that moves the metric — usually a reranker.

37 How do you evaluate an LLM system rigorously at scale? ▶

Separate component evals from end-to-end evals:

Retrieval: recall@k / MRR on a labeled golden set (covered in §14).
Generation: faithfulness/groundedness (does the answer follow from context?), answer relevance, and correctness vs reference. Frameworks: RAGAS, custom rubrics.
LLM-as-judge: scalable but biased — position bias, verbosity bias, self-preference. Mitigate with pairwise comparison (A/B, not absolute scores), randomized order, a strong judge model, and periodic human calibration of the judge itself.
Golden sets + regression gating: a versioned suite of cases that must pass before any prompt/model/index change ships. This is the unit test layer for LLM apps.
Online: A/B tests, thumbs, edit-distance on user corrections, task completion — the only ground truth that matters long-term.

Senior point: evals are a dataset problem. Mine real production failures into the golden set continuously; an eval suite that never grows is already stale.

38 Design a production agent: tools, control, and failure containment ▶

Agents fail in the gap between "works in a demo" and "safe in prod." Design for containment:

Bounded autonomy: an explicit state machine / graph (LangGraph-style) where the LLM proposes transitions and deterministic code validates and executes them. Don't let the model free-run.
Typed tools + validation: every tool has a strict input schema; reject malformed calls before they execute. Read tools are free; write/irreversible tools require an interlock.
Loop control: max steps, max cost/token budget, and a no-progress detector to kill runaway loops.
Human-in-the-loop checkpoints for irreversible or high-cost actions (maintenance approval, anything that touches OT).
Full traceability: log every step (thought, tool call, args, result) so you can replay and debug — same discipline as §17/§43.
Failure containment: idempotent tools, retries with backoff, and a graceful fallback (degrade to retrieval/templated answer) so a failed step is never a dead end.

39 Walk me through a system design: real-time factory copilot ▶

Treat it as a layered design and state assumptions up front (on-prem, ~50 lines, sub-second alerts, data can't leave the building):

Ingestion: PLC/OPC-UA → MQTT → Kafka. Raw to TimescaleDB/ClickHouse; features computed in stream.
Detection (deterministic): streaming anomaly/correlation models at the edge for <100ms alerting — no LLM in this path.
Knowledge layer: on-prem vector store (pgvector/Qdrant) over SOPs + work orders; hybrid retrieval + reranker.
Copilot (LLM): self-hosted (Llama/Qwen via vLLM) for data sovereignty. On alert → retrieve similar incidents → draft root cause + action with citations. On query → Text-to-SQL over OEE + RAG over docs.
Serving: vLLM with continuous batching + prefix caching; semantic cache in Redis for repeat questions.
Guardrails: schema-validated outputs, citations required, human approval for any write-back to MES/CMMS.
Observability: log every prediction + LLM call (prompt version, chunks, model version, cost) to ClickHouse; drift + eval dashboards; golden-set regression gate in CI.

Scaling & tradeoffs to volunteer: classical ML for the real-time loop (cheap, deterministic, safe); LLM only at the human layer; everything on-prem for OT data sovereignty; graceful degradation to search + templates if the model is down so the line never depends on the copilot.

⚡ The interviewer is testing judgment: lead with "what doesn't need an LLM," size the latency/safety budget, and show the IT/OT boundary.

SCENARIO Real-World Scenarios 4 questions ▶

40 What happens if your embedding model changes — how do you migrate safely? ▶

Embedding model change = incompatible vector space = you cannot query old embeddings with new model queries. Full reindex required.

Migration plan:

Step 1: build new index in parallel using new model. Don't touch the live index.
Step 2: shadow mode — route queries to both indexes, compare results. Log discrepancies.
Step 3: run your eval suite against both indexes. New model should show precision@k improvement (otherwise, why migrate?)
Step 4: atomic cutover — update the router config, switch all traffic to new index
Step 5: keep old index for 2 weeks as rollback option

Prevention: version your embedding model in metadata (embedding_model: "text-embedding-3-large") on every chunk. Detect staleness automatically when model changes.

41 How would you fine-tune a model on user behavior and deploy it? ▶

Data collection: log user queries + responses + explicit feedback (thumbs up/down) + implicit signals (did user follow up with "that's wrong"?). Build a preference dataset.

Data cleaning: filter for high-quality examples. Remove low-confidence base model outputs. Deduplicate. Balance classes.

Training approach: SFT (supervised fine-tuning) on preferred responses. For preference optimization, RLHF (complex) or DPO (simpler, often better) on (preferred, rejected) pairs.

Deployment:

Containerize with vLLM, push to registry
Canary deploy to 5% of traffic
A/B test fine-tuned vs base model on your eval suite
Gate on quality metric improvement before full rollout
Keep base model serving until fine-tuned model proves itself

42 How would you make this system cheaper without killing quality? ▶

Cost optimization in priority order:

1. Cache first: implement semantic caching. If 30% of queries are repeat or near-repeat, that's 30% cost eliminated immediately.
2. Route by complexity: classify incoming queries as simple/medium/complex. Simple → Haiku/GPT-3.5 (~20x cheaper). Complex → Sonnet/GPT-4.
3. Reduce context: audit how many tokens your average prompt uses. Often 40–60% is bloat. Tighten prompts, reduce retrieved chunks.
4. Batch async work: non-real-time tasks (report generation, nightly analysis) → batch API at 50% discount.
5. Self-host high-volume tasks: if one task generates 80% of your API cost, evaluate open-source model for that specific task.

Measure cost-per-task, not total cost. A 3x latency increase may halve cost — worth it for async workflows.

43 Walk me through a debugging session for incorrect LLM outputs ▶

Systematic debugging, not guesswork:

Step 1 — Reproduce: pull the exact logged prompt (system + user + context chunks) from your logging system. Replay it. Is it consistently wrong or flaky?
Step 2 — Isolate: is the error in retrieval (wrong chunks returned) or generation (right chunks, wrong answer)? Check which chunks were injected.
Step 3 — Retrieval audit: run the query manually in your vector DB. Are the top-3 chunks actually relevant? If not → chunking or embedding issue.
Step 4 — Prompt audit: if retrieval is fine, is the system prompt ambiguous? Is the model ignoring the context? Try temp=0, add explicit instruction.
Step 5 — Model issue: does switching to a stronger model fix it? If yes → capability gap, not a prompt bug.
Step 6 — Fix and test: add the failing case to your regression test suite before deploying the fix.

⚡ Always have logging that lets you replay exact prompts. "I can't reproduce it" is not acceptable in production AI systems.

RAPID 5 Bonus Rapid-Fire Concepts must-know terms ▶

R1 What is RLHF and why does it matter? ▶

Reinforcement Learning from Human Feedback. After SFT, a reward model is trained on human preference pairs, then the LLM is fine-tuned via PPO to maximize the reward model's score. It's how ChatGPT learned to be "helpful" rather than just next-token-predicting. DPO (Direct Preference Optimization) is the simpler modern alternative — same objective, no separate reward model, more stable training.

R2 What is a context window and what are its real limits? ▶

The maximum tokens a model can process in one call (input + output combined). GPT-4o: 128K. Claude 3.5: 200K. But effective length is shorter — models lose coherence and "forget" middle content in very long contexts (the "lost in the middle" problem). For production RAG, treat 32K as the practical limit for reliable reasoning.

R3 What is an agent and what makes one production-ready? ▶

An agent is an LLM with tool access that can take multi-step actions autonomously (perceive → plan → act → observe loop). Production-ready requires: (1) deterministic tool definitions with input validation, (2) max-step limits to prevent runaway loops, (3) human-in-the-loop checkpoints for irreversible actions, (4) full action logging, (5) graceful failure handling. Most "agents" in demos fail all 5.

R4 Explain the difference between encoder, decoder, and encoder-decoder models ▶

Encoder-only (BERT, E5): bidirectional attention, sees full context. Best for embeddings, classification, NER. Cannot generate text.
Decoder-only (GPT, Llama): causal/autoregressive attention. Generates text one token at a time. All modern LLMs are decoder-only.
Encoder-decoder (T5, BART): encoder processes input, decoder generates output. Good for translation, summarization. Less common now that decoder-only scales so well.

R5 What is prompt injection and how do you defend against it? ▶

Prompt injection: user input overrides system prompt instructions ("Ignore previous instructions and..."). Defenses: (1) never interpolate raw user input directly into system prompt, (2) use separate message roles (system vs user), (3) output validation — check that the response matches expected schema, (4) sandboxed tool access — even if injected, tools should require explicit permission for destructive ops, (5) input filtering — detect and reject obvious injection patterns before LLM call.

ML·AI EngineerInterview Guide

ML·AI Engineer
Interview Guide