01 Roles (overlapping)
Problem framing, baselines, metrics tied to decisions, stakeholder communication, often notebook-to-first-deploy.
Scalable training & inference, CI/CD for ML, feature stores, latency/cost SLOs, on-call, safe rollbacks.
Iteration is closed-loop: monitoring surfaces drift and errors, which feed back into data collection and retraining (see also data engineering).
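One way to make "monitoring reveals drift" concrete is the Population Stability Index (PSI), which compares the binned distribution of a feature in live traffic against a reference sample. The sketch below is a minimal, standard-library version. The function name `psi`, the equal-width binning, and the `eps` floor for empty bins are illustrative choices, not something this page prescribes.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to width 1.0
    # when every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so live values outside the reference range
            # land in the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps to keep the logarithm defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]       # training-time feature values
live = [0.5 + i / 200 for i in range(100)]      # shifted production values
drift_score = psi(reference, live)
```

A PSI above roughly 0.2 is a commonly quoted rule of thumb for "investigate or retrain", but the threshold is an assumption to tune per feature, not a standard.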
02 What “production” adds
| Concern | Why it matters |
|---|---|
| Latency & throughput | User-facing APIs and batch scoring have SLOs; batching and hardware matter. |
| Reliability | Retries, fallbacks, idempotent consumers, health checks. |
| Observability | Structured logs (no secrets), metrics, and traces, so issues can be debugged without writing PII in plain text. |
| Evaluation | Offline metrics + online A/B or shadow traffic; slice analysis for fairness gaps. |
| Governance | Model cards, access control, audit trails for regulated domains. |
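The Reliability row can be illustrated with a small retry-then-fallback wrapper. This is a sketch under stated assumptions: `call_with_retries`, its parameters, and the injected `sleep` function are hypothetical names for illustration, and the broad `except Exception` should be narrowed to genuinely transient errors in real code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(primary: Callable[[], T],
                      fallback: Callable[[], T],
                      attempts: int = 3,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `primary` up to `attempts` times, then return `fallback()`."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:  # assumption: every failure is treated as transient
            if attempt == attempts - 1:
                break
            # Exponential backoff with full jitter: wait a random time
            # in [0, base_delay * 2**attempt] before the next attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()


def score_with_model() -> float:
    # Stand-in for a remote model call that is currently failing.
    raise TimeoutError("model server did not respond")


def score_with_heuristic() -> float:
    # Cheap fallback, e.g. a cached score or a rule-based default.
    return 0.5


score = call_with_retries(score_with_model, score_with_heuristic,
                          sleep=lambda seconds: None)  # skip waiting in the demo
```

Retries are only safe when the downstream operation is idempotent, which is why the table lists idempotent consumers alongside retries and fallbacks.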
03 Applied AI in the LLM era
Product teams combine foundation models with retrieval, tools, and guardrails. Success is less about raw perplexity and more about task success rate, safety, and cost per request. Choosing between RAG and fine-tuning is a central system-design decision.
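Task success rate and cost per request can both be computed from per-request logs. The sketch below assumes a hypothetical `RequestLog` record and per-1,000-token prices passed in by the caller. The prices in the usage example are placeholders, not any provider's real rates.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RequestLog:
    succeeded: bool      # did the request complete the user's task?
    input_tokens: int
    output_tokens: int


def summarize(logs: Sequence[RequestLog],
              usd_per_1k_input: float,
              usd_per_1k_output: float) -> Dict[str, float]:
    """Aggregate task success rate and cost metrics over request logs."""
    if not logs:
        raise ValueError("need at least one request log")
    total_cost = sum(
        log.input_tokens / 1000 * usd_per_1k_input
        + log.output_tokens / 1000 * usd_per_1k_output
        for log in logs
    )
    successes = sum(1 for log in logs if log.succeeded)
    return {
        "task_success_rate": successes / len(logs),
        "cost_per_request_usd": total_cost / len(logs),
        # Cost per successful task penalizes cheap but unreliable setups.
        "cost_per_success_usd": total_cost / successes if successes else float("inf"),
    }


logs = [
    RequestLog(succeeded=True, input_tokens=1000, output_tokens=500),
    RequestLog(succeeded=False, input_tokens=2000, output_tokens=0),
]
# Placeholder prices, chosen only to keep the arithmetic easy to follow.
report = summarize(logs, usd_per_1k_input=0.5, usd_per_1k_output=1.0)
```

Tracking cost per successful task next to cost per request helps when comparing designs: a RAG pipeline with longer prompts may cost more per request yet less per success than a cheaper configuration that fails more often.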