Comprehensive glossary of AI and Large Language Model terms, concepts, and techniques
Parameters: Trainable numerical values (weights and biases) inside a model that encode learned knowledge.
Parameter Count: Total number of trainable parameters in a model, often used as a proxy for model capacity.
Overparameterization: Condition where a model has more parameters than strictly needed to fit the training data, which in deep learning often improves generalization.
Model Capacity: Ability of a model to fit complex functions, influenced by depth, width, and parameter count.
Scaling Laws: Empirical relationships showing that performance improves predictably as data, parameters, and compute increase.
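The power-law form behind scaling laws can be sketched numerically. The constants `n_c` and `alpha` below are illustrative placeholders, not fitted values for any real model family:

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Illustrative scaling-law curve: loss falls as a power law in
    parameter count N, L(N) = (N_c / N) ** alpha.  The constants are
    placeholders, not measurements of any real model family."""
    return (n_c / n_params) ** alpha

# Doubling parameters lowers predicted loss by a constant factor
# (0.5 ** alpha), independent of where you start on the curve.
small = power_law_loss(1e9)
large = power_law_loss(2e9)
ratio = large / small
```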
Foundation Model: Large pre-trained model adaptable to many downstream tasks through prompting or fine-tuning.
Frontier Models: State-of-the-art large-scale models at the cutting edge of capability, cost, and performance.
Base Model: Pre-trained model without task-specific alignment or instruction tuning.
Instruct Model: Model fine-tuned to follow human instructions and prompts.
Chat Model: Instruction-tuned model further optimized for conversational interactions.
Pre-training: Initial large-scale training phase in which a model learns general language patterns from broad corpora.
Fine-tuning: Additional training on task- or domain-specific data.
Instruction Tuning: Fine-tuning on prompt–response pairs to improve usability.
Alignment: Process of shaping model behavior to match human intent and values.
Reinforcement Learning from Human Feedback (RLHF): Training method using human preference judgments as reward signals.
Low-Rank Adaptation (LoRA): Parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices.
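A minimal numerical sketch of the LoRA update, assuming a single square weight matrix and an arbitrary rank `r = 4`:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size and (assumed) low rank, r << d

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, init 0

# Effective weight during fine-tuning; only A and B receive gradients.
# With B initialized to zero, the adapted model starts identical to
# the pretrained one.
W_adapted = W + B @ A

# Trainable parameters shrink from d*d to 2*d*r.
full_params = d * d          # 4096
lora_params = 2 * d * r      # 512
```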
Parameter-Efficient Fine-Tuning (PEFT): Methods that adapt models while updating only a small fraction of parameters.
Inference: Running a trained model to generate predictions or outputs.
Latency: Time taken for a model to return a response.
Throughput: Number of requests (or tokens) processed per unit time.
Batching: Grouping multiple requests together to improve GPU utilization.
KV Cache: Cached key and value attention tensors reused across steps to speed up autoregressive generation.
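A toy single-head sketch of KV caching, where `q`, `k`, and `v` stand in for the real learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension

def attend(q, k_cache, v_cache):
    """One decoding step of single-head attention over the cached K/V."""
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

k_cache = np.empty((0, d))
v_cache = np.empty((0, d))

# Each step appends one new key/value row instead of recomputing
# K and V for the entire prefix from scratch.
for step in range(5):
    x = rng.standard_normal(d)   # current token's hidden state
    q, k, v = x, x, x            # stand-in projections
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    out = attend(q, k_cache, v_cache)
```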
Quantization: Reducing the numerical precision of model weights (and sometimes activations) to improve speed and reduce memory usage.
FP32 / FP16 / BF16: Floating-point precision formats used during training and inference.
INT8 / INT4: Low-precision integer formats used in quantized models.
Post-Training Quantization (PTQ): Quantization applied after a model is fully trained.
Quantization-Aware Training (QAT): Training process that simulates quantization effects so the model learns to compensate for them.
Model Compression: Techniques to reduce model size, including quantization, pruning, and distillation.
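As a sketch of post-training quantization, symmetric per-tensor int8 rounding looks like this (a minimal PTQ scheme, not any specific library's implementation):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map the largest
    absolute weight to 127 and round everything else onto that grid."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Round-trip error is bounded by about half a quantization step.
max_err = np.abs(w - w_hat).max()
```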
Context Window: Maximum number of tokens a model can process at once.
Truncation: Removal of older tokens when the context exceeds the window limit.
Long-Context Models: Models designed to handle very large context windows.
Sliding-Window Attention: Attention mechanism that limits computation to a fixed neighborhood of nearby tokens.
Autoregressive Generation: Generating tokens one at a time, each conditioned on previously generated output.
Temperature: Sampling parameter controlling the randomness of generated outputs; lower values make generation more deterministic.
Top-K Sampling: Limits token selection to the K most probable tokens.
Top-P (Nucleus) Sampling: Samples from the smallest set of tokens whose cumulative probability exceeds P.
Beam Search: Deterministic decoding strategy that explores multiple candidate sequences in parallel and keeps the highest-scoring ones.
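The sampling controls above compose naturally. This sketch applies temperature, then top-k, then top-p to one logit vector; the default values are common choices, not canonical ones:

```python
import numpy as np

def sample(logits, temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Sample one token id using temperature scaling, a top-k cutoff,
    and a top-p (nucleus) cutoff, in that order."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]   # token ids, most probable first
    order = order[:top_k]             # top-k cutoff
    cum = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability reaches top_p:
    keep = order[: int(np.searchsorted(cum, top_p) + 1)]

    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

token = sample([2.0, 1.0, 0.2, -1.0], rng=np.random.default_rng(0))
```

With `top_k=1` the procedure degenerates to greedy decoding, since only the argmax survives the cutoff.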
Hallucination: Confident but incorrect or fabricated model output.
Groundedness: Degree to which generated content is supported by the provided context.
Ground Truth: Correct reference data used for evaluation.
Guardrails: Constraints ensuring safe and correct model behavior.
Evaluation Harness: Framework for systematically testing model outputs.
Model Serving: Infrastructure for deploying models behind APIs.
Cold Start: Initial latency incurred when a model is first loaded.
Autoscaling: Automatically adjusting compute resources based on load.
Observability: Monitoring inputs, outputs, latency, errors, and drift in production.
Prompt Versioning: Tracking changes to prompts over time.
Chain-of-Thought (CoT): Intermediate reasoning steps generated by a model before its final answer.
Self-Consistency: Sampling multiple reasoning paths and selecting the most common final answer.
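Once final answers are extracted from each sampled reasoning path, self-consistency reduces to a majority vote. The `votes` below are hypothetical sampled answers:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independently sampled
    chain-of-thought completions (a minimal self-consistency sketch)."""
    return Counter(answers).most_common(1)[0][0]

# Final answers extracted from, say, five sampled reasoning paths:
votes = ["42", "41", "42", "42", "17"]
best = self_consistency(votes)
```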
Tool-Using Agent: LLM that can invoke external tools during inference.
Multi-Agent Systems: Systems composed of multiple autonomous agents coordinating their actions.
Synthetic Data: Artificially generated data used for training or evaluation.
Knowledge Distillation: Training a smaller student model to mimic a larger teacher model.
Mixture of Experts (MoE): Architecture in which a router activates only a subset of expert networks per input, enabling larger models without a proportional increase in per-token compute.
QLoRA (Quantized LoRA): Combines quantization of the frozen base model with LoRA adapters, enabling efficient fine-tuning of large models on consumer hardware.
FlashAttention: Memory-efficient exact attention algorithm that reduces attention memory usage from O(n²) to O(n), enabling longer context windows.
Speculative Decoding: Technique in which a smaller model drafts tokens and a larger model verifies them, speeding up generation without changing the output distribution.
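The draft-and-verify loop can be sketched with toy stand-in models. The accept/reject rule below is the standard one (accept a drafted token with probability min(1, p_target/p_draft), otherwise resample from the residual distribution); everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

def toy_model(bias):
    """Stand-in for a language model: maps a prefix to next-token probs."""
    def probs(prefix):
        logits = np.cos(bias + np.arange(VOCAB) + len(prefix))
        e = np.exp(logits)
        return e / e.sum()
    return probs

draft, target = toy_model(0.0), toy_model(0.1)

def speculative_step(prefix, k=4):
    """Draft up to k tokens cheaply, accepting each with probability
    min(1, p_target / p_draft); on rejection, resample from the
    residual distribution and stop."""
    out = list(prefix)
    for _ in range(k):
        p_d, p_t = draft(out), target(out)
        t = rng.choice(VOCAB, p=p_d)
        if rng.random() < min(1.0, p_t[t] / p_d[t]):
            out.append(int(t))                 # accepted draft token
        else:
            resid = np.maximum(p_t - p_d, 0)   # residual distribution
            resid /= resid.sum()
            out.append(int(rng.choice(VOCAB, p=resid)))
            break
    return out

seq = speculative_step([1, 2])
```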
Function Calling: Capability of LLMs to recognize when to call external functions or APIs and to format the requests correctly.
Tool Use: LLM's ability to use external tools such as calculators, APIs, or databases during reasoning.
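On the application side, a function-calling round trip reduces to parsing the model's structured output and dispatching it. The registry and JSON shape below are illustrative, not any provider's actual API:

```python
import json

# Hypothetical tool registry; names and argument schema are made up
# for illustration.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def dispatch(model_output):
    """Parse a model-emitted tool call and execute the named tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A model trained for function calling emits structured JSON like:
result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```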
Multimodal Models: Models that can process and generate multiple types of data (text, images, audio, video).
Vision-Language Models: Models that understand and generate both visual and textual content, enabling image understanding and generation.
Sparse Attention: Attention mechanism that computes attention for only a subset of token pairs, reducing computational cost.
Gradient Checkpointing: Memory optimization that trades compute for memory by recomputing activations during backpropagation instead of storing them.
Constitutional AI: Training approach in which models critique and revise their own outputs against a set of principles, or "constitution".
Direct Preference Optimization (DPO): Simplified alternative to RLHF that optimizes the model directly on human preference pairs, without a separate reward model or reinforcement learning.
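The per-pair DPO objective can be written out directly. The log-probabilities below are hypothetical sums over response tokens, not outputs of any real model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where *_w is the chosen response, *_l the rejected one, and ref_*
    come from the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no margin the loss sits at log(2); it falls as the policy
# prefers the chosen response more strongly than the reference does.
worse = dpo_loss(-10.0, -10.0, -10.0, -10.0)
better = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```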
Retrieval-Augmented Fine-Tuning: Fine-tuning approach that incorporates retrieval mechanisms directly into the training process.
Tree of Thoughts (ToT): Reasoning framework in which models explore multiple reasoning paths in a tree structure before selecting the best solution.
ReAct: Framework combining reasoning and acting, in which agents interleave reasoning traces with actions in an environment.
Reinforcement Learning from AI Feedback (RLAIF): Training method using AI-generated feedback instead of human feedback, enabling more scalable alignment.
Mixture of Depths: Efficiency technique in which different tokens pass through different numbers of layers, reducing compute.
PEFT Methods: Family of techniques (LoRA, QLoRA, AdaLoRA) that enable fine-tuning with minimal parameter updates.
Prompt Compression: Techniques to reduce prompt size while preserving information, enabling longer effective context windows.