Tokenization converts raw text into integer IDs from a fixed vocabulary (e.g., 50K tokens for GPT-4 using BPE). The model never sees characters — only token IDs and their embeddings.
Why it matters for generation:
tiktoken for OpenAI).For factory AI: sensor tags like CELL_3_TEMP_AVG are often split 4–6 ways. Pre-processing matters.
An embedding is a dense vector (e.g., 768 or 1536 dimensions) that encodes semantic meaning. Tokens → embeddings via a learned lookup table. The transformer then operates in this vector space.
Key intuitions:
In RAG systems: you embed chunks at index time and embed queries at retrieval time, then find nearest neighbors. Embedding model consistency matters — switching models requires full reindex.
Attention lets each token attend to every other token in the sequence. For each token, you compute a weighted sum of all other token values, where weights come from query-key dot products (scaled, softmaxed).
Multi-head attention runs this in parallel across multiple subspaces, allowing the model to simultaneously capture different types of relationships (syntactic, semantic, coreference).
Positional encoding solves the problem that attention itself is order-agnostic — the same set of tokens in any order would produce the same output. Positional encodings inject token position information into the embedding before attention. Modern models use RoPE (Rotary Position Embedding) which applies rotation to Q/K vectors, enabling better length extrapolation (and tricks like NTK-aware / YaRN scaling to stretch context at inference).
Staff-level depth: full attention is O(n²) in sequence length, which is why context is expensive. The key optimizations to name:
Fine-tuning updates model weights on a task-specific dataset. Key decisions:
Full fine-tune: update all weights. Best quality, but requires massive GPU memory and compute. Risk of catastrophic forgetting. Only justified if you have massive domain data.
LoRA (Low-Rank Adaptation): freeze base weights, inject small trainable rank-decomposition matrices (A×B) into attention layers. Train only ~0.1–1% of parameters. Nearly equivalent quality to full fine-tune for most tasks.
QLoRA: LoRA on a 4-bit quantized base model. Reduces VRAM from ~80GB to ~12GB for a 13B model. Some quality degradation from quantization but acceptable for most applications. The practical choice for most teams.
Decision matrix:
Zero-shot: just a task description. Works well for large models (GPT-4, Claude 3.5+) on common tasks. Faster iteration, less prompt engineering.
Few-shot: include 3–8 input/output examples in the prompt. Critical when: (1) output format is non-standard, (2) domain terminology is unusual, (3) task requires specific reasoning patterns, or (4) using smaller/open-source models.
When few-shot wins: classification of manufacturing defect codes, extracting structured JSON from sensor logs, domain-specific summarization. The examples calibrate format + tone.
Pitfall: poor-quality examples are worse than zero-shot. Garbage few-shot = garbage output.
Robust system prompts must handle adversarial inputs, edge cases, and diverse user styles without breaking. Key principles:
{{user_role}}, {{shift}}, {{equipment_id}}) — same base prompt, parameterized per userTemperature = 0 is the primary lever. This makes sampling greedy (always pick highest probability token). Not fully deterministic in practice due to floating-point nondeterminism across GPU runs, but functionally stable.
Additional controls:
top_p = 1.0, top_k = disabled — no sampling variationoutlines or guidance enforce grammar-based output — guarantees valid JSON/SQL regardless of temperatureHonest answer for interviews: you can get high reproducibility but not true determinism across different hardware/versions. Design systems to handle occasional variation.
Context evolves: system prompts change, docs get updated, new equipment is added. Treat context like code:
version, created_at, source, hash fields(prompt_version, context_chunks_used, model_version, output_hash) — essential for debugging regressionsMemory types and when to use each:
Anti-pattern: using the LLM itself to manage memory state. LLMs hallucinate; your memory store must be a reliable database with deterministic reads/writes.
No single strategy is universal. Match chunking to document structure:
My default for factory SOPs: structural chunking by section, with metadata injection (doc title, section header, last updated) into every chunk.
Decision criteria: (1) do you need on-prem? → Qdrant/Weaviate, (2) already on Postgres? → pgvector, (3) need hybrid BM25+vector? → OpenSearch, (4) cloud-only, fast start? → Pinecone
Yes — treat it like a database migration:
re_embedded=true when doneKey constraint: you cannot mix embeddings from different models in the same index (different vector spaces). Plan model upgrades as full migrations, not in-place updates.
Offline metrics:
Reranking: after initial vector retrieval (top-20), use a cross-encoder (e.g., Cohere rerank, BGE-reranker) that scores query+chunk jointly. More expensive but significantly improves precision.
Citation tracking: every answer chunk gets source_doc, chunk_id, page metadata. Expose citations in the UI — the best hallucination detector is asking users "does this match the source?"
Online evaluation: thumbs up/down on answers, correction tracking, time-to-answer as proxy for user confidence.
Data: ingest → validate schema → clean → version with DVC or Delta Lake
Training: feature store → model training → experiment tracking (MLflow/W&B) → model registry with metadata (dataset hash, metrics, training config)
Serving: containerized model (Docker) → serving layer (FastAPI/Triton/vLLM) → load balancer → API gateway with auth + rate limiting
Observability: log every prediction (input, output, latency, model version) → Prometheus metrics → Grafana dashboard → alerting on latency spikes or confidence drops
Feedback loop: user corrections → label store → periodic retraining trigger → A/B test new model → canary deploy (5% traffic) → full rollout
Drift detection:
Hallucination detection:
Every LLM call should emit a structured log event with:
request_id, session_id, user_idmodel, model_version, prompt_versionsystem_prompt_hash, rendered_prompt (or hash + pointer to S3)context_chunks: list of chunk IDs usedraw_output, parsed_outputlatency_ms, input_tokens, output_tokens, cost_usdtemperature, timestampStore full prompts in object storage (S3/GCS), log only the pointer. ClickHouse is excellent for aggregating LLM telemetry at scale — fast on cost analysis, latency percentiles, error rates by prompt version.
Standard ML CI/CD: test code, validate model metrics, deploy container. LLM adds:
gpt-4o-2024-11-20 not gpt-4o. API model aliases change. Pin exact versions in prod.max_tokens limit + format constraints. JSON outputs are cheaper than prose for structured tasks.Quantization reduces weight precision (FP32 → INT8/INT4) to shrink model size and speed up inference.
Use when:
Avoid when:
Methods: GPTQ (post-training, accurate), AWQ (fast inference), GGUF (CPU-friendly via llama.cpp). INT8 is safe for most tasks; INT4 needs testing.
KV-cache (built-in): transformers cache key-value pairs for the prompt prefix. Constant system prompts are cached automatically — keep them at the top of the prompt.
Semantic caching: embed incoming queries, check vector similarity against recent queries. If cosine sim > 0.97, serve cached response. Redis + pgvector works well. Reduces API cost dramatically for repetitive use cases.
Request batching: group multiple requests into a single model call (for offline/async tasks). vLLM's continuous batching is the standard for high-throughput serving.
Streaming: use SSE/streaming responses to reduce perceived latency — user sees first tokens in 200ms instead of waiting 5s for full response.
Speculative decoding: small draft model generates tokens, large model verifies in parallel. 2–3x throughput improvement for large models.
Hosted API (OpenAI, Anthropic, Gemini):
Open-source (Llama 3, Mistral, Qwen):
My rule: prototype on hosted API, productionize on open-source if volume justifies it or data can't leave the building. For Form Energy: everything stays on-prem.
Fail modes: API timeout, malformed JSON output, rate limit, context overflow, hallucinated tool call.
Fallback strategy:
This is the most important question an AI engineer can ask. LLMs are expensive, slow, and non-deterministic. Before reaching for one, check:
Use LLMs when: the task requires reasoning, generation, or language understanding that structured approaches genuinely can't provide. Not just because it's cool.
The anti-pattern: using a vector DB as a general-purpose store because it's "AI." Know what each database is optimized for.
"Predict failures" is not one problem — it's a family of framings, and choosing the right one is the senior signal:
The hard part is labels, not models. Failure labels come from CMMS / work orders and are noisy: wrong timestamps, "replaced X but Y was the cause," preventive vs reactive maintenance mixed together. Spend your effort on label quality and a leakage-free temporal split (train on past, validate on future), not on swapping XGBoost for a transformer.
Methods, roughly increasing in power/cost:
Pitfalls that sink projects:
Distinguish three things:
Handling: schema + range validation at ingest (Great Expectations-style), a drift monitor that compares live windows to the training baseline, scheduled re-fit cadence tied to known process changes (tool change, line requal), and a fast rollback path. Tie drift alerts to a human-in-the-loop review, not auto-retrain — auto-retraining on drifted/garbage data is how you bake a bad state in permanently.
Decide on four axes — latency, connectivity, data gravity, and safety:
OT realities to name: determinism > raw accuracy near safety functions, change control / validation (you can't hot-swap a model on a safety-rated line casually), and the IT/OT trust boundary (a model service should never have write access to control logic without an interlock).
Sensor ML lives or dies on point-in-time correctness and train/serve consistency:
Be the engineer who knows where not to use an LLM. Real-time control and anomaly detection are classical/streaming ML — an LLM in a control loop is an anti-pattern. LLMs earn their place at the human-interface and knowledge layer:
Architecture: deterministic ML and rules make decisions; the LLM sits on top for retrieval, explanation, and natural-language access — always with citations back to the source SOP/record.
The bottleneck in LLM serving is the KV-cache, not FLOPs. Naive serving pre-allocates a contiguous cache for max sequence length per request → massive internal fragmentation and wasted VRAM, which caps batch size.
Reality: large training is 3D parallelism — TP within a node, PP across nodes, DP/FSDP on top. For most teams fine-tuning, FSDP (or DeepSpeed ZeRO-3) + LoRA covers 90% of needs without touching TP/PP.
"Embed → top-k → stuff" is a demo. Production retrieval is a pipeline:
Separate component evals from end-to-end evals:
Senior point: evals are a dataset problem. Mine real production failures into the golden set continuously; an eval suite that never grows is already stale.
Agents fail in the gap between "works in a demo" and "safe in prod." Design for containment:
Treat it as a layered design and state assumptions up front (on-prem, ~50 lines, sub-second alerts, data can't leave the building):
Scaling & tradeoffs to volunteer: classical ML for the real-time loop (cheap, deterministic, safe); LLM only at the human layer; everything on-prem for OT data sovereignty; graceful degradation to search + templates if the model is down so the line never depends on the copilot.
Embedding model change = incompatible vector space = you cannot query old embeddings with new model queries. Full reindex required.
Migration plan:
Prevention: version your embedding model in metadata (embedding_model: "text-embedding-3-large") on every chunk. Detect staleness automatically when model changes.
Data collection: log user queries + responses + explicit feedback (thumbs up/down) + implicit signals (did user follow up with "that's wrong"?). Build a preference dataset.
Data cleaning: filter for high-quality examples. Remove low-confidence base model outputs. Deduplicate. Balance classes.
Training approach: SFT (supervised fine-tuning) on preferred responses. For preference optimization, RLHF (complex) or DPO (simpler, often better) on (preferred, rejected) pairs.
Deployment:
Cost optimization in priority order:
Measure cost-per-task, not total cost. A 3x latency increase may halve cost — worth it for async workflows.
Systematic debugging, not guesswork:
Reinforcement Learning from Human Feedback. After SFT, a reward model is trained on human preference pairs, then the LLM is fine-tuned via PPO to maximize the reward model's score. It's how ChatGPT learned to be "helpful" rather than just next-token-predicting. DPO (Direct Preference Optimization) is the simpler modern alternative — same objective, no separate reward model, more stable training.
The maximum tokens a model can process in one call (input + output combined). GPT-4o: 128K. Claude 3.5: 200K. But effective length is shorter — models lose coherence and "forget" middle content in very long contexts (the "lost in the middle" problem). For production RAG, treat 32K as the practical limit for reliable reasoning.
An agent is an LLM with tool access that can take multi-step actions autonomously (perceive → plan → act → observe loop). Production-ready requires: (1) deterministic tool definitions with input validation, (2) max-step limits to prevent runaway loops, (3) human-in-the-loop checkpoints for irreversible actions, (4) full action logging, (5) graceful failure handling. Most "agents" in demos fail all 5.
Encoder-only (BERT, E5): bidirectional attention, sees full context. Best for embeddings, classification, NER. Cannot generate text.
Decoder-only (GPT, Llama): causal/autoregressive attention. Generates text one token at a time. All modern LLMs are decoder-only.
Encoder-decoder (T5, BART): encoder processes input, decoder generates output. Good for translation, summarization. Less common now that decoder-only scales so well.
Prompt injection: user input overrides system prompt instructions ("Ignore previous instructions and..."). Defenses: (1) never interpolate raw user input directly into system prompt, (2) use separate message roles (system vs user), (3) output validation — check that the response matches expected schema, (4) sandboxed tool access — even if injected, tools should require explicit permission for destructive ops, (5) input filtering — detect and reject obvious injection patterns before LLM call.