01 Neuron & layer
A neuron computes a weighted sum of its inputs, adds a bias, and applies a nonlinear activation σ: y = σ(w·x + b). Stacking neurons into layers, and layers into a network, lets the model represent complex functions. Without the nonlinearity, any stack of layers would collapse into a single linear (affine) map, so depth would add nothing.
In hidden layers σ is typically ReLU, which gives sparse activations and a constant gradient of 1 for positive inputs. The output layer uses softmax for multi-class classification or a linear (identity) activation for regression.
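As a minimal sketch of the neuron described above, the following plain-Python snippet computes the weighted sum, adds the bias, and applies an activation. The function names (`neuron`, `relu`, `softmax`) and the weight values are illustrative assumptions, not part of any particular library.

```python
import math

def relu(z):
    """ReLU activation: keeps positive values, maps the rest to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, -2.0, 0.5]
w = [0.4, 0.3, -0.2]  # hand-picked for illustration, not learned
b = 0.1

# Pre-activation is 0.4 - 0.6 - 0.1 + 0.1 = -0.2, so ReLU outputs 0.0
print(neuron(x, w, b))

# Class probabilities for three output scores
print(softmax([2.0, 1.0, 0.1]))
```

Swapping `activation` for another function (for example `lambda z: z` for a linear output unit) changes the neuron's role without touching the weighted-sum logic.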
02 MLP stack
A multilayer perceptron alternates linear maps and activations. Width and depth trade off expressivity, data requirements, and compute. The universal approximation theorem says that a network with a single, sufficiently wide hidden layer can approximate any continuous function on a compact domain; in practice, depth helps capture hierarchical structure, often with fewer units.
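The alternation of linear maps and activations can be sketched in a few lines of plain Python. The example also illustrates why nonlinearities matter: with identity activations, the two layers give the same output as one precomputed affine map. The helper names (`linear`, `mlp_forward`) and all parameter values are assumptions chosen for illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of values."""
    return [max(0.0, a) for a in v]

def identity(v):
    """No-op activation, used to show the linear collapse."""
    return list(v)

def linear(W, b, x):
    """Affine map W @ x + b, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_forward(layers, x, activation=relu):
    """Alternate affine maps and activations; the last layer stays linear."""
    for i, (W, b) in enumerate(layers):
        x = linear(W, b, x)
        if i < len(layers) - 1:
            x = activation(x)
    return x

# Hand-picked parameters: 2 inputs -> 2 hidden units -> 1 output
layers = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.5]),
]
W1, b1 = layers[0]
W2, b2 = layers[1]

# Collapsed single layer: W = W2 @ W1 and b = W2 @ b1 + b2
W = [[sum(W2[i][k] * W1[k][j] for k in range(len(W1)))
      for j in range(len(W1[0]))]
     for i in range(len(W2))]
b = linear(W2, b2, b1)

x = [1.0, -1.0]
print(mlp_forward(layers, x, activation=identity))  # [7.5]
print(linear(W, b, x))                              # [7.5], same single affine map
print(mlp_forward(layers, x))                       # [2.5], ReLU changes the function
```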
03 Training via backpropagation
Backprop applies the chain rule to propagate loss gradients backward through the computation graph. Frameworks such as PyTorch, JAX, and TensorFlow record or trace that graph (dynamically at run time, or statically before execution) and use reverse-mode automatic differentiation to compute ∂L/∂w efficiently. Key practical concerns are vanishing/exploding gradients (mitigated by careful initialization, skip connections, and normalization), learning rate schedules, and batch size, which controls how noisy each update is.
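To make the chain rule concrete, here is a hand-written backward pass for a deliberately tiny network: one input, one hidden ReLU unit, one linear output, and a squared-error loss. It is a sketch under those assumptions, not how a framework implements autodiff; the function names and parameter values are hypothetical. A central finite-difference check is included to confirm the analytic gradients.

```python
def forward(params, x):
    """Forward pass: z = w1*x + b1, h = relu(z), y = w2*h + b2."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = max(0.0, z)
    y = w2 * h + b2
    return z, h, y

def loss(params, x, t):
    """Squared-error loss 0.5 * (y - t)^2 for a single example."""
    _, _, y = forward(params, x)
    return 0.5 * (y - t) ** 2

def backward(params, x, t):
    """Chain rule applied from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    dy = y - t                          # dL/dy
    dw2 = dy * h                        # dL/dw2
    db2 = dy                            # dL/db2
    dh = dy * w2                        # dL/dh
    dz = dh * (1.0 if z > 0 else 0.0)   # ReLU passes gradient only when z > 0
    dw1 = dz * x                        # dL/dw1
    db1 = dz                            # dL/db1
    return [dw1, db1, dw2, db2]

def numeric_grad(params, x, t, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, x, t) - loss(down, x, t)) / (2 * eps))
    return grads

params = [0.5, 0.1, -1.5, 0.2]  # w1, b1, w2, b2 (illustrative values)
x, t = 2.0, 1.0

analytic = backward(params, x, t)
numeric = numeric_grad(params, x, t)
print(analytic)
print(numeric)

# One plain gradient-descent step with an assumed learning rate of 0.01
lr = 0.01
updated = [p - lr * g for p, g in zip(params, analytic)]
print(loss(params, x, t), loss(updated, x, t))  # loss before and after the step
```

In a real model the same backward logic is generated automatically, and the update step is usually averaged over a mini-batch rather than computed from a single example.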
04 Beyond MLPs (pointers)
- CNNs: local filters + pooling → hierarchical visual features.
- RNNs/LSTMs: hidden state carried over time (largely replaced by attention in many NLP settings).
- Transformers: self-attention + position encoding → LLMs and vision transformers.
See also RAG vs fine-tuning for adapting large pretrained models.