AI Glossary

LLM Terminology Guide

Comprehensive glossary of AI and Large Language Model terms, concepts, and techniques

🏗️ Model Architecture & Capacity

Model Parameters

Trainable numerical values (weights and biases) inside a model that encode learned knowledge.

Parameter Count

Total number of trainable parameters in a model, often used as a proxy for model capacity.

Overparameterization

Condition where a model has more parameters than strictly needed to fit the training data, often correlated with better generalization in deep learning.

Model Capacity

Ability of a model to fit complex functions, influenced by depth, width, and parameters.

Scaling Laws

Empirical relationships showing that performance improves predictably with more data, parameters, and compute.
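Scaling laws are often written as a parametric loss in parameter count N and training tokens D. A minimal sketch, using the functional form L = E + A/N^α + B/D^β with fitted constants reported for Chinchilla-style models; treat the exact values here as illustrative, not authoritative:

```python
def scaling_loss(n_params, n_tokens,
                 E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric scaling law: L = E + A/N^alpha + B/D^beta.

    E is the irreducible loss; the other terms shrink as
    parameters (N) and training tokens (D) grow.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

small = scaling_loss(1e9, 1e11)    # ~1B params, ~100B tokens
large = scaling_loss(7e10, 1.4e12) # ~70B params, ~1.4T tokens
```

Plugging in larger N and D predictably lowers the modeled loss, which is the core claim of scaling laws.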

📊 Model Types & Categories

Foundation Model

Large pre-trained model adaptable to many downstream tasks through prompting or fine-tuning.

Frontier Model

State-of-the-art large-scale model at the cutting edge of capability, typically also the most expensive to train and run.

Base Model

Pre-trained model without task-specific alignment or instruction tuning.

Instruction-Tuned Model

Model fine-tuned to follow human instructions and prompts.

Chat Model

Instruction-tuned model optimized for conversational interactions.

🎓 Training & Adaptation Techniques

Pretraining

Initial large-scale training phase where a model learns general language patterns.

Fine-Tuning

Additional training on task- or domain-specific data.

Instruction Tuning

Fine-tuning using prompt–response pairs to improve usability.

Alignment

Process of shaping model behavior to align with human intent and values.

RLHF

Reinforcement Learning from Human Feedback: Training method using human preferences as reward signals.

LoRA

Low-Rank Adaptation: Parameter-efficient fine-tuning technique updating low-rank matrices.
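The core of LoRA is that the frozen weight W is augmented with a trainable low-rank product BA, so only r·(d_in + d_out) parameters train instead of d_in·d_out. A minimal NumPy sketch (variable names are illustrative; real implementations live in libraries such as PEFT):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                           # hidden size, low rank (r << d)
W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Adapted layer: frozen base path plus the low-rank update B @ A.
    return x @ W.T + x @ (B @ A).T

x = rng.standard_normal((1, d))
```

With B zero-initialized, the adapter starts as an exact no-op, so fine-tuning begins from the pretrained model's behavior.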

PEFT

Parameter-Efficient Fine-Tuning: Methods that adapt models while updating minimal parameters.

⚡ Inference & Performance Optimization

Inference

Running a trained model to generate predictions or outputs.

Latency

Time taken for a model to return a response.

Throughput

Number of requests processed per unit time.

Batching

Grouping multiple requests to improve GPU utilization.

KV Cache

Cached key-value attention tensors to speed up autoregressive generation.
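The point of the KV cache is that each decode step computes keys and values only for the new token and reuses everything cached from earlier steps. A toy single-head sketch in NumPy (the projection matrices and list-based cache are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d = 16
rng = np.random.default_rng(1)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []

def attend_step(x_t):
    """One decode step: project K/V for the new token only, reuse the cache."""
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    K = np.stack(k_cache)          # (t, d) — grows by one row per step
    V = np.stack(v_cache)
    q = x_t @ Wq
    w = softmax(q @ K.T / np.sqrt(d))
    return w @ V

outputs = [attend_step(rng.standard_normal(d)) for _ in range(4)]
```

Without the cache, every step would recompute K and V for the entire prefix, making generation quadratic in sequence length.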

💾 Quantization & Efficiency

Quantization

Reducing numerical precision of model weights to improve speed and reduce memory usage.
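A common simple scheme is symmetric per-tensor int8 quantization: scale weights so the largest magnitude maps to 127, round, and store the scale for dequantization. A minimal NumPy sketch (production systems typically use per-channel scales and calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(2).standard_normal(1024).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

The int8 tensor uses a quarter of the float32 memory, and the round-trip error is bounded by half a quantization step.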

FP32 / FP16 / BF16

Floating-point precision formats used during training and inference.

INT8 / INT4

Low-precision integer formats used in quantized models.

Post-Training Quantization

Quantization applied after model training.

Quantization-Aware Training

Training process that accounts for quantization effects.

Model Compression

Techniques to reduce model size, including quantization and pruning.

🧠 Memory & Context Management

Context Window

Maximum number of tokens a model can process at once.

Prompt Truncation

Removal of older tokens when context exceeds window limit.
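The usual policy keeps the most recent tokens and drops the oldest. A minimal sketch (real systems often also pin a system prompt at the front before truncating the middle):

```python
def truncate_prompt(tokens, max_len):
    """Keep only the most recent max_len tokens when the prompt is too long."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[-max_len:]

kept = truncate_prompt(list(range(10)), 4)
```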

Long-Context Model

Models designed to handle very large context windows.

Sliding Window Attention

Attention mechanism that limits computation to nearby tokens.
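A sliding window combines the causal constraint (no attending to the future) with a locality constraint (no attending further back than the window). A minimal NumPy sketch of the boolean attention mask:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where token i may attend only to tokens i-window+1 .. i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

m = sliding_window_mask(6, 3)
```

Each row has at most `window` True entries, so attention cost grows linearly in sequence length instead of quadratically.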

🎲 Generation & Decoding

Autoregressive Generation

Generating tokens one at a time based on previous outputs.

Temperature

Controls randomness of generated outputs.
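Temperature divides the logits before the softmax: T < 1 sharpens the distribution toward the top token, T > 1 flattens it. A minimal NumPy sketch:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Lower T sharpens the distribution; higher T flattens it."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
```

At T → 0 sampling approaches greedy decoding; at large T it approaches uniform sampling.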

Top-K Sampling

Limits token selection to top K most probable tokens.

Top-P (Nucleus Sampling)

Samples from the smallest set of most-probable tokens whose cumulative probability reaches P.
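Where top-K keeps a fixed *count* of tokens, top-P keeps a variable-size *probability mass*. A minimal NumPy sketch of the nucleus filter applied to a probability vector (function name is illustrative):

```python
import numpy as np

def nucleus_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]          # most probable first
    cum = np.cumsum(probs[order])
    # Keep a token if the mass accumulated *before* it is still below p.
    keep = np.concatenate(([True], cum[:-1] < p))
    kept = np.zeros_like(probs)
    kept[order[keep]] = probs[order[keep]]
    return kept / kept.sum()                 # renormalize over the nucleus

probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = nucleus_filter(probs, 0.8)
```

Here tokens 0 and 1 (mass 0.8) survive and are renormalized; the tail is zeroed out before sampling.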

Beam Search

Deterministic decoding strategy that keeps the top-scoring candidate sequences (the "beams") at each step instead of committing to a single token.

🛡️ Reliability, Safety & Evaluation

Hallucination

Confident but incorrect model output.

Faithfulness

Degree to which generated content is grounded in provided context.

Ground Truth

Correct reference data used for evaluation.

Guardrails

Constraints ensuring safe and correct model behavior.

Evaluation Harness

Framework for systematically testing model outputs.

🚀 Deployment & Operations

Model Serving

Infrastructure for deploying models as APIs.

Cold Start

Initial latency when a model is first loaded.

Autoscaling

Automatically adjusting resources based on load.

Observability

Monitoring inputs, outputs, latency, errors, and drift.

Prompt Versioning

Tracking changes to prompts over time.

🔮 Advanced & Emerging Concepts

Chain-of-Thought (CoT)

Intermediate reasoning steps generated by a model.

Self-Consistency

Generating multiple reasoning paths and selecting the best answer.
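In the common formulation, "selecting the best answer" is a majority vote over the final answers extracted from independently sampled reasoning chains. A minimal sketch (the vote list stands in for answers parsed from real model samples):

```python
from collections import Counter

def self_consistent_answer(answers):
    """Majority vote over final answers from independently sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# Final answers extracted from, say, five sampled chains of thought.
votes = ["42", "41", "42", "42", "40"]
best = self_consistent_answer(votes)
```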

Tool-Augmented LLM

LLM that can invoke external tools during inference.

Agentic AI

Systems composed of autonomous agents coordinating actions.

Synthetic Data

Artificially generated data used for training or evaluation.

Model Distillation

Training a smaller model to mimic a larger one.
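A standard distillation loss matches the student's output distribution to the teacher's temperature-softened distribution via KL divergence. A minimal NumPy sketch of that loss term (real training combines it with the ordinary cross-entropy loss):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [3.0, 1.0, 0.2]
loss_far = distill_loss([0.0, 2.0, 1.0], teacher)   # student disagrees
loss_near = distill_loss([2.9, 1.1, 0.2], teacher)  # student nearly matches
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the distributions diverge.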

🚀 Latest AI Developments (2024-2025)

Mixture of Experts (MoE)

Architecture where different expert networks handle different inputs, enabling larger models with lower compute costs.

QLoRA

Quantized LoRA: Combines quantization with LoRA for efficient fine-tuning of large models on consumer hardware.

Flash Attention

Exact, IO-aware attention algorithm that avoids materializing the full attention matrix, reducing memory usage from O(n²) to O(n) and enabling longer context windows.

Speculative Decoding

Technique using a smaller model to draft tokens and a larger model to verify, speeding up generation.

Function Calling

Capability of LLMs to identify when to call external functions/APIs and format requests correctly.

Tool Use

LLM's ability to use external tools like calculators, APIs, or databases during reasoning.

Multimodal Models

Models that can process and generate multiple types of data (text, images, audio, video) simultaneously.

Vision-Language Models

Models that understand and generate both visual and textual content, enabling image understanding and generation.

Sparse Attention

Attention mechanism that only computes attention for a subset of tokens, reducing computational cost.

Gradient Checkpointing

Memory optimization technique that trades compute for memory by recomputing activations during backpropagation.

Constitutional AI

Training approach where models critique and revise their own outputs based on a set of principles or "constitution".

Direct Preference Optimization (DPO)

Simplified alternative to RLHF that optimizes the policy directly on human preference pairs with a classification-style loss, avoiding a separate reward model and reinforcement learning loop.

Retrieval-Augmented Fine-Tuning (RAFT)

Fine-tuning approach that incorporates retrieval mechanisms directly into the model training process.

Tree of Thoughts (ToT)

Reasoning framework where models explore multiple reasoning paths in a tree structure before selecting the best solution.

ReAct (Reasoning + Acting)

Framework combining reasoning and acting, where agents interleave reasoning traces with actions in the environment.

Reinforcement Learning from AI Feedback (RLAIF)

Training method using AI-generated feedback instead of human feedback, enabling scalable alignment.

Mixture of Depths (MoD)

Efficiency technique where different tokens are processed with different numbers of layers, reducing compute.

Long Context Models

Models designed to handle extremely long context windows (100K+ tokens) using techniques like sliding window attention.

Efficient Fine-Tuning

Collection of techniques (LoRA, QLoRA, AdaLoRA) that enable fine-tuning with minimal parameter updates.

Prompt Compression

Techniques to reduce prompt size while preserving information, enabling longer effective context windows.