Deep learning & neural networks

Networks compose differentiable layers so gradient descent can tune millions of parameters. Representations are learned end-to-end from data rather than hand-engineered for each task.

01 Neuron & layer

A neuron computes a weighted sum of inputs, adds a bias, and applies a nonlinear activation σ. Stacking neurons and layers allows modeling of complex functions; without nonlinearities, depth would collapse to a single linear map.
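As a minimal sketch of the description above, the following pure-Python snippet computes one neuron and then checks the collapse claim on scalars. The function names (`neuron`, `affine`) and all numeric values are illustrative assumptions, not part of any library.

```python
def relu(z):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)

def neuron(x, w, b, activation=relu):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Hypothetical inputs, weights, and bias chosen only for illustration.
x = [1.0, -2.0, 0.5]
w = [0.5, 0.25, 1.0]
b = 0.1
out = neuron(x, w, b)

def affine(x, w, b):
    """A scalar 'layer' with no activation."""
    return w * x + b

# Two stacked affine maps with no nonlinearity in between...
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5
stacked = affine(affine(1.5, w1, b1), w2, b2)

# ...equal one affine map with weight w2*w1 and bias w2*b1 + b2.
collapsed = affine(1.5, w2 * w1, w2 * b1 + b2)
```

Here `stacked` and `collapsed` agree, which is the scalar version of the statement that depth without nonlinearities adds no expressive power.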

Figure — one neuron (affine + activation)

In hidden layers σ is typically ReLU, which gives sparse activations and a constant gradient of 1 for positive inputs. The output layer uses softmax for multi-class classification or a linear (identity) activation for regression.
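The two activations named here can be sketched in a few lines. Subtracting the maximum before exponentiating is a common numerical-stability trick and is an addition of this example, not something the text prescribes.

```python
import math

def relu(z):
    """Elementwise ReLU over a list of pre-activations."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(z)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.5, 0.0, 2.0])     # negative inputs are zeroed
probs = softmax([2.0, 1.0, 0.1])    # larger score, larger probability
stable = softmax([1000.0, 1000.0])  # would overflow without the shift
```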

02 MLP stack

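A multilayer perceptron alternates linear maps and activations. Width and depth trade off expressivity, data requirements, and compute. The universal approximation theorem says that a single hidden layer, if wide enough, can approximate any continuous function on a compact domain to arbitrary accuracy, but it gives no bound on the width required or guarantee that training will find the weights. In practice, depth helps because many tasks have hierarchical structure that deeper networks represent more economically.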
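A rough sketch of an MLP forward pass follows, assuming ReLU between layers and a linear output. The layer sizes, the He-style weight scaling, and the helper names (`init_mlp`, `mlp_forward`, `count_params`) are illustrative assumptions.

```python
import random

def linear(x, W, b):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_mlp(sizes, seed=0):
    """Random weights for consecutive layer sizes, e.g. [4, 8, 8, 3]."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # He-style scaling (an assumed choice)
        W = [[rng.gauss(0.0, scale) for _ in range(n_in)]
             for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers

def mlp_forward(x, layers):
    """Alternate linear maps and ReLU; the last layer stays linear."""
    for i, (W, b) in enumerate(layers):
        x = linear(x, W, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x

def count_params(layers):
    """Weights plus biases across all layers."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)

layers = init_mlp([4, 8, 8, 3])
y = mlp_forward([0.5, -1.0, 2.0, 0.0], layers)
n_params = count_params(layers)  # (4*8+8) + (8*8+8) + (8*3+3) = 139
```

Changing the `sizes` list is the simplest way to see how width and depth affect the parameter count.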

Figure — depth as hierarchy of features

03 Training via backpropagation

Backprop applies the chain rule to propagate loss gradients backward through the computation graph, from the loss to every parameter. Frameworks such as PyTorch, JAX, and TensorFlow build that graph dynamically or statically and use it to compute ∂L/∂w efficiently. Key practical concerns are vanishing or exploding gradients (mitigated by careful initialization, skip connections, and normalization), learning rate schedules, and batch size, which controls how noisy each update is.
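To make the chain rule concrete, here is a hand-written backward pass for a tiny network, checked against finite differences and then used for plain gradient descent. The network (one tanh hidden unit, squared-error loss), the parameter values, and the learning rate are assumptions chosen for illustration; a framework would derive the same gradients automatically.

```python
import math

def forward(params, x, y):
    """h = tanh(w1*x + b1), y_hat = w2*h + b2, loss = 0.5*(y_hat - y)^2."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y_hat = w2 * h + b2
    loss = 0.5 * (y_hat - y) ** 2
    return loss, (z, h, y_hat)

def backward(params, x, y):
    """Chain rule applied from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (z, h, y_hat) = forward(params, x, y)
    d_yhat = y_hat - y          # dL/dy_hat
    d_w2 = d_yhat * h           # dL/dw2
    d_b2 = d_yhat               # dL/db2
    d_h = d_yhat * w2           # dL/dh
    d_z = d_h * (1.0 - h * h)   # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x              # dL/dw1
    d_b1 = d_z                  # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to verify backward()."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        diff = forward(up, x, y)[0] - forward(down, x, y)[0]
        grads.append(diff / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]  # hypothetical starting weights
x, y = 2.0, 1.0                 # a single made-up training example
analytic = backward(params, x, y)
numeric = numeric_grad(params, x, y)

# Plain gradient descent on that one example.
lr = 0.1
losses = []
for _ in range(50):
    losses.append(forward(params, x, y)[0])
    grads = backward(params, x, y)
    params = [p - lr * g for p, g in zip(params, grads)]
```

The gradient check is a useful habit whenever a backward pass is written by hand: `analytic` and `numeric` should agree to several decimal places.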

Inductive bias

CNNs bake in translation equivariance for images; Transformers use self-attention for set-like data with global interactions. A good architecture matches the structure of its domain.
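Translation equivariance can be checked directly: shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below assumes a 1D signal with circular (wrap-around) boundaries so the equality is exact; real CNN layers use padding, so the property holds only away from the borders. As in most deep learning frameworks, the "convolution" here is a sliding dot product (cross-correlation).

```python
def circular_conv1d(signal, kernel):
    """Slide the kernel over the signal, wrapping around at the end."""
    n = len(signal)
    return [sum(k * signal[(i + j) % n] for j, k in enumerate(kernel))
            for i in range(n)]

def shift(signal, s):
    """Circularly shift the signal right by s positions."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]

signal = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0]  # made-up input
kernel = [1.0, -1.0]                      # simple difference filter

conv_then_shift = shift(circular_conv1d(signal, kernel), 2)
shift_then_conv = circular_conv1d(shift(signal, 2), kernel)
```

The two results are identical, which is why a filter learned at one position detects the same pattern everywhere else.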

04 Beyond MLPs (pointers)

CNNs: local filters + pooling → hierarchical visual features.
RNNs/LSTMs: hidden state carried over time (largely replaced by attention-based models in many NLP settings).
Transformers: self-attention + position encoding → LLMs and vision transformers.
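For the Transformer pointer, a minimal single-head scaled dot-product self-attention can be sketched as follows. The toy input and the identity projection matrices are assumptions for illustration; real models learn the projections and use multiple heads. The final lines show why position encoding is needed: without it, reordering the input tokens simply reorders the outputs.

```python
import math

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Each token attends to every token, weighted by query-key similarity."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, V), weights

# Three toy tokens of dimension 2; identity projections keep it readable.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(X, I2, I2, I2)

# Reordering the tokens reorders the outputs in the same way.
X_perm = [X[2], X[0], X[1]]
out_perm, _ = self_attention(X_perm, I2, I2, I2)
```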

See also RAG vs fine-tuning for adapting large pretrained models.