01
Tensors & autograd
torch.Tensor, devices, gradients
Like NumPy with GPU and derivatives
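A tensor is a multi-dimensional array. Set requires_grad=True to track operations for reverse-mode autodiff (.backward()). Use device="cuda" when a GPU is available, and keep tensors that interact on the same device: PyTorch raises an error on most cross-device operations rather than copying silently, and explicit transfers with .to(device) cost time.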
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.linspace(-1, 1, steps=100, device=device, requires_grad=True)
y = (x * x).sum()
y.backward()  # x.grad holds ∂y/∂x
```
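As a quick sanity check on what .backward() produces, the sketch below compares x.grad with the analytic derivative 2x and shows two behaviors that often come up: gradients accumulate across backward() calls, and torch.no_grad() stops tracking. It runs on CPU with a small tensor; the variable names are illustrative only.

```python
import torch

# CPU is enough for this check; the same code works with device="cuda".
x = torch.linspace(-1, 1, steps=5, requires_grad=True)

y = (x * x).sum()
y.backward()
# Analytically, d/dx sum(x^2) = 2x.
grad_matches = torch.allclose(x.grad, 2 * x.detach())

# Gradients accumulate across backward() calls until they are reset.
y2 = (x * x).sum()
y2.backward()
grad_accumulated = torch.allclose(x.grad, 4 * x.detach())

# Inside no_grad, operations are not recorded, so the result carries no graph.
with torch.no_grad():
    z = x * 2
tracks_grad = z.requires_grad
```

The accumulation behavior is why a training loop resets gradients before each backward pass.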
02
nn.Module & building blocks
Subclassing nn.Module
Layers register parameters automatically
```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        # Submodules assigned as attributes are registered automatically.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```
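To make "register parameters automatically" concrete, here is a small sketch that rebuilds the same MLP and inspects what nn.Module has collected. The sizes (10, 32, 3) and the batch of 8 are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Illustrative sizes, not taken from the text above.
model = MLP(in_dim=10, hidden=32, out_dim=3)

# named_parameters() walks every registered submodule; ReLU contributes nothing.
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
num_params = sum(p.numel() for p in model.parameters())

# A batch of 8 inputs maps to 8 rows of out_dim scores.
out = model(torch.randn(8, 10))
```

Because the optimizer receives model.parameters(), anything registered this way is trained without extra bookkeeping.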
Call model.train() before training and model.eval() before validation or inference so dropout and batch norm behave correctly. Save checkpoints with torch.save(model.state_dict(), ...).
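The following sketch illustrates both points under some assumptions: a small hypothetical model with dropout stands in for a real network, an in-memory buffer stands in for a checkpoint path, and weights_only=True is passed to torch.load, which assumes a reasonably recent PyTorch release.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)


def build_model() -> nn.Module:
    # Hypothetical architecture, used only for this illustration.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))


model = build_model()
x = torch.randn(3, 4)

# eval() disables dropout, so repeated forward passes give identical outputs.
model.eval()
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)
eval_is_deterministic = torch.equal(out_a, out_b)

# Save only the parameters (state_dict); the buffer stands in for a file path.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Restore into a freshly constructed model with the same architecture.
restored = build_model()
restored.load_state_dict(torch.load(buffer, weights_only=True))
restored.eval()
with torch.no_grad():
    restored_matches = torch.equal(restored(x), out_a)
```

Saving the state_dict rather than the whole module keeps the checkpoint independent of the class definition's import path, at the cost of having to rebuild the architecture before loading.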
03
Training loop & DataLoader
Standard supervised loop
Mini-batches, loss, backward, step
```python
# Assumes MLP, device, train_loader, in_dim, hidden, num_classes,
# and epochs are defined earlier.
model = MLP(in_dim, hidden, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

for epoch in range(epochs):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
```
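The loop above takes its loader as given and has no evaluation counterpart. As a rough sketch, a DataLoader can be built from in-memory tensors with TensorDataset, and a validation pass wraps the forward computation in model.eval() and torch.no_grad(). The synthetic data, the nn.Linear placeholder model, and the names val_loader, val_loss, and val_acc are assumptions made for this example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 samples, 10 features, 3 classes (all hypothetical).
features = torch.randn(64, 10)
labels = torch.randint(0, 3, (64,))
val_loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 3).to(device)  # placeholder for the MLP
# Sum per batch so the final average is over samples, not over batches.
criterion = nn.CrossEntropyLoss(reduction="sum")

model.eval()
total_loss, correct, seen = 0.0, 0, 0
with torch.no_grad():  # no graph is built, which saves memory and time
    for xb, yb in val_loader:
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)
        total_loss += criterion(logits, yb).item()
        correct += (logits.argmax(dim=1) == yb).sum().item()
        seen += yb.size(0)

val_loss = total_loss / seen
val_acc = correct / seen
```

For a training loader the usual difference is shuffle=True, so each epoch sees the mini-batches in a different order.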
04
Theory — what to articulate in interviews
Loss, optimization & generalization
Maps directly to knobs in PyTorch
| Topic | Short idea | PyTorch hook |
|---|---|---|
| Empirical risk | Train loss approximates expected loss over the data distribution | CrossEntropyLoss, MSELoss |
| SGD / Adam | Stochastic estimates of the gradient; Adam adapts per-parameter steps | torch.optim.* |
| Overfitting | Low train error, high val error — memorization | Dropout, weight decay, more data, simpler model |
| Regularization | Add penalty or noise so weights stay small / robust | weight_decay, dropout, early stopping |
| Learning rate | Too large: unstable; too small: slow | Schedulers, warmup (see docs), monitor val loss |
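The learning-rate row can be illustrated with a schedule that warms up linearly and then decays along a cosine curve. This is one common shape, not the only option; the sketch uses torch.optim.lr_scheduler.LambdaLR with a hand-written factor function, and warmup_steps, total_steps, and lr_factor are hypothetical names and values.

```python
import math

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

warmup_steps, total_steps = 5, 20  # hypothetical values


def lr_factor(step: int) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear warmup from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 towards 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    # A real loop would compute the loss and call loss.backward() here.
    optimizer.step()
    scheduler.step()  # called after optimizer.step()
```

Plotting or printing lrs is a cheap way to confirm a schedule does what was intended before spending compute on a run.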
For production, you also care about latency, numerical stability (mixed precision with torch.cuda.amp, exposed as torch.amp in recent releases), and reproducibility (torch.manual_seed, seeded DataLoader shuffling and workers).
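For the reproducibility point, a minimal sketch: reseeding with torch.manual_seed reproduces random draws, and passing a seeded torch.Generator to DataLoader reproduces the shuffle order. The helper shuffled_order is a hypothetical name for this example, and full determinism on GPU generally needs more than this (for instance deterministic algorithm settings), which the sketch does not cover.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def shuffled_order(seed: int) -> list:
    """Return the order in which a shuffled DataLoader yields items 0..9."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    dataset = TensorDataset(torch.arange(10))
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=generator)
    return [int(i) for (batch,) in loader for i in batch]


# Reseeding the global generator reproduces the same random draws.
torch.manual_seed(0)
a = torch.randn(3)
torch.manual_seed(0)
b = torch.randn(3)
same_random_draws = torch.equal(a, b)

# The same generator seed reproduces the same shuffle order.
order_1 = shuffled_order(123)
order_2 = shuffled_order(123)
same_shuffle = order_1 == order_2
```

With num_workers greater than zero, per-worker seeding of other libraries such as NumPy is typically handled through the DataLoader's worker_init_fn argument.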