
Deep learning & neural networks

Networks compose differentiable layers so gradient descent can tune millions of parameters. Representations are learned end-to-end from data rather than hand-engineered for each task.

01 Neuron & layer

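A neuron computes a weighted sum of its inputs, adds a bias, and applies a nonlinear activation σ: h = σ(w·x + b). Stacking neurons into layers, and layers into networks, lets the model represent complex functions. Without nonlinearities, depth would add nothing: two stacked linear layers W₂(W₁x + b₁) + b₂ equal the single affine map (W₂W₁)x + (W₂b₁ + b₂).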

Figure: one neuron (affine + activation). Inputs x₁, x₂ are weighted by w₁, w₂, summed with bias b (Σ + b), and passed through σ to produce h.
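As a rough illustration of the neuron described above, here is a minimal sketch in plain Python. The function names (`neuron`, `relu`) and the input, weight, and bias values are hypothetical and chosen only for the example.

```python
def relu(z):
    """ReLU activation for a single pre-activation value."""
    return max(0.0, z)

def neuron(x, w, b, activation):
    """Return activation(w . x + b) for one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return activation(z)

# Hypothetical inputs, weights, and bias, for illustration only.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1

h = neuron(x, w, b, relu)                   # positive pre-activation passes through
h_off = neuron(x, [-1.0, 0.0], 0.0, relu)   # negative pre-activation is clipped to 0
```

Swapping `relu` for another callable (for example `math.tanh`) changes the nonlinearity without touching the affine part.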

In hidden layers σ is typically ReLU, which gives sparse activations and a constant gradient of 1 for positive pre-activations. The output layer uses softmax for multi-class classification or a linear (identity) activation for regression.
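The two activations named above can be sketched as follows. This is one common way to write them, not the only one; the helper names `relu_all` and `softmax` and the example vectors are assumptions made for this illustration.

```python
import math

def relu_all(z):
    """Elementwise ReLU: negative entries become 0, positive entries are unchanged."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu_all([-1.5, 0.0, 2.0])   # sparse: only the positive entry survives
probs = softmax([2.0, 1.0, 0.1])      # larger score -> larger probability
```

For regression the output layer would simply return its pre-activation unchanged.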

02 MLP stack

A multilayer perceptron (MLP) alternates linear maps and activations. Width and depth trade off expressivity, data requirements, and compute. The universal approximation theorem says that a sufficiently wide network with one hidden layer can approximate any continuous function on a compact domain; in practice, depth helps because it can capture hierarchical structure, often with fewer units.

Figure: depth as a hierarchy of features (in → L₁ → L₂ → out).
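A forward pass through such a stack might look like the sketch below. The layer sizes, the random initialization scale, and the helper names (`linear`, `init_layer`, `mlp_forward`) are illustrative assumptions, not a recommended configuration.

```python
import random

def linear(x, W, b):
    """Affine map: one output per row of W."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu_all(v):
    return [max(0.0, vi) for vi in v]

def init_layer(n_in, n_out, rng):
    # Small Gaussian weights and zero biases; a placeholder, not a tuned scheme.
    W = [[rng.gauss(0.0, 1.0 / n_in ** 0.5) for _ in range(n_in)]
         for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def mlp_forward(x, layers):
    """Apply linear + ReLU for every layer except the last, which stays linear."""
    h = x
    for W, b in layers[:-1]:
        h = relu_all(linear(h, W, b))
    W, b = layers[-1]
    return linear(h, W, b)

rng = random.Random(0)
sizes = [3, 4, 4, 2]  # in -> L1 -> L2 -> out (hypothetical widths)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
y = mlp_forward([0.5, -1.0, 2.0], layers)
```

Changing `sizes` is the simplest way to experiment with the width/depth trade-off in this sketch.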

03 Training via backpropagation

Backpropagation applies the chain rule to propagate loss gradients backward through the computation graph, reusing intermediate results so that every ∂L/∂w is obtained in a single backward pass. Frameworks such as PyTorch, JAX, and TensorFlow build dynamic or static graphs to do this automatically. Key practical concerns are vanishing or exploding gradients (addressed by careful initialization, skip connections, and normalization), learning-rate schedules, and batch size, which controls the noise in each update.
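To make the chain rule concrete, the following sketch backpropagates by hand through a tiny one-hidden-unit network and checks the result against finite differences. The network shape, the tanh activation, the squared-error loss, the parameter values, and the learning rate are all assumptions chosen to keep the example small; real frameworks automate the `backward` step.

```python
import math

def forward(params, x, t):
    w1, b1, w2, b2 = params
    z = w1 * x + b1            # hidden pre-activation
    h = math.tanh(z)           # hidden activation
    y = w2 * h + b2            # linear output
    loss = 0.5 * (y - t) ** 2  # squared error against target t
    return loss, (z, h, y)

def backward(params, x, t):
    """Chain rule applied from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (z, h, y) = forward(params, x, t)
    dy = y - t                 # dL/dy
    dw2 = dy * h               # dL/dw2
    db2 = dy                   # dL/db2
    dh = dy * w2               # dL/dh
    dz = dh * (1.0 - h * h)    # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x               # dL/dw1
    db1 = dz                   # dL/db1
    return [dw1, db1, dw2, db2]

def numerical_grad(params, x, t, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((forward(up, x, t)[0] - forward(down, x, t)[0]) / (2 * eps))
    return grads

params = [0.3, -0.1, 0.8, 0.05]  # hypothetical starting values
x, t = 1.5, 0.7
analytic = backward(params, x, t)
numeric = numerical_grad(params, x, t)

# One gradient-descent step with a hypothetical learning rate.
lr = 0.1
new_params = [p - lr * g for p, g in zip(params, analytic)]
```

The finite-difference check is a common debugging aid; it is far too slow to use for training, which is why the backward pass matters.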

Inductive bias
CNNs bake in translation equivariance for images; Transformers use self-attention for set-like data with global interactions, relying on position encodings to recover order. A good architecture matches the structure of the domain.
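Translation equivariance can be checked directly on a toy 1D signal: shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below assumes a "valid" cross-correlation and zero-padded shifts; the signal and the edge-like kernel are made-up values, and the equality holds here because the signal is zero near its borders.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide the same kernel over every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(signal, steps):
    """Shift with zero padding on the left, dropping values that fall off the end."""
    return [0.0] * steps + signal[:len(signal) - steps]

signal = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical edge-like filter

out = conv1d(signal, kernel)
out_of_shifted = conv1d(shift_right(signal, 2), kernel)
shifted_out = shift_right(out, 2)
equivariant = out_of_shifted == shifted_out
```

Because the same kernel weights are reused at every position, the layer does not need to relearn a feature for each location.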

04 Beyond MLPs (pointers)

  • CNNs: local filters + pooling → hierarchical visual features.
  • RNNs/LSTMs: hidden state carried over time (largely replaced by attention in many NLP settings).
  • Transformers: self-attention + position encoding → LLMs and vision transformers.
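As a pointer for the last item, here is a stripped-down sketch of scaled dot-product self-attention. It assumes queries, keys, and values are all the raw token vectors (no learned projections, no multiple heads, no masking), which real Transformers do not do; the token values are hypothetical. It also illustrates why position encoding is needed: without it, permuting the tokens only permutes the outputs.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (simplifying assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [dot(q, k) / math.sqrt(d) for k in X]  # similarity to every token
        weights = softmax(scores)                       # attention weights sum to 1
        out.append([sum(w * v[c] for w, v in zip(weights, X)) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token embeddings
Y = self_attention(tokens)

# Reorder the tokens: the outputs are reordered the same way, nothing else changes.
perm = [2, 0, 1]
Y_perm = self_attention([tokens[i] for i in perm])
```

Each output row is a weighted average of all token vectors, which is the "global interaction" that distinguishes attention from local filters.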

See also RAG vs fine-tuning for adapting large pretrained models.