Deep Learning
From Neuron to Transformer
Every concept from a single perceptron to modern Transformers, with animated visualizations, full PyTorch code, and interview prep. No topic skipped. Builds directly on your Python + NumPy knowledge.
The Neuron & Perceptron
Everything in deep learning descends from one idea: a neuron receives inputs, multiplies each by a learned weight, sums them, passes through an activation function, and fires an output. Stack millions of these and you get GPT-4.
z = w₁x₁ + w₂x₂ + w₃x₃ + b. Then output = σ(z). Watch the signal propagate and see z and the activation computed live.
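The weighted sum and activation above can be computed in a few lines of NumPy. A minimal sketch; the input, weight, and bias values are illustrative, not learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron with 3 inputs: weights, bias, activation
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.5])   # learned weights
b = 0.1                          # bias

z = w @ x + b                    # z = w1*x1 + w2*x2 + w3*x3 + b = -0.7
a = sigmoid(z)                   # the neuron's output, approx 0.332
print(z, a)
```

Swapping `sigmoid` for ReLU or tanh changes only the last step; the weighted sum is the same in every neuron of every network.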
Activation Functions
Activations introduce non-linearity; without them, depth is meaningless. Each has different properties. Knowing which to use where is one of the most practical DL skills.
ReLU: max(0,z). Simple, fast. Default for hidden layers. Can cause "dying neurons" if weights push all inputs negative.
| Function | Formula | Range | Use When | Watch Out |
|---|---|---|---|---|
| ReLU | max(0, z) | [0, ∞) | Hidden layers: default choice | Dead neurons if z always < 0 |
| Sigmoid | 1/(1+e⁻ᶻ) | (0, 1) | Binary output layer only | Vanishing gradients in deep nets |
| Tanh | (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | (−1, 1) | Hidden layers, RNNs | Vanishing gradients |
| Leaky ReLU | max(0.01z, z) | (−∞, ∞) | When ReLU causes dead neurons | Extra hyperparameter α |
| GELU | z·Φ(z) | (−∞, ∞) | Transformers (BERT, GPT) | Slightly slower to compute |
| Softmax | eᶻᵢ/Σⱼeᶻⱼ | (0, 1), sums to 1 | Multi-class output layer only | Only at the final layer! |
Hidden layers → ReLU (default). Binary output → Sigmoid. Multi-class output → Softmax. Regression output → no activation (linear). Transformers → GELU. These 5 rules cover 99% of cases.
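Each function in the table is one call in PyTorch. A quick sketch on a fixed input vector, using `torch.nn.functional`:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out    = F.relu(z)              # clips negatives to 0 (hidden-layer default)
sigmoid_out = torch.sigmoid(z)       # squashes to (0, 1): binary output layer
tanh_out    = torch.tanh(z)          # squashes to (-1, 1)
leaky_out   = F.leaky_relu(z, 0.01)  # small slope keeps negative units alive
gelu_out    = F.gelu(z)              # smooth ReLU variant used in Transformers
softmax_out = F.softmax(z, dim=0)    # positive values that sum to 1

print(relu_out)  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(softmax_out.sum())
```

Note that softmax needs a `dim` argument: it normalizes across one axis (the class axis), not the whole tensor.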
Forward Pass & Loss Functions
The forward pass computes the prediction. The loss function measures how wrong the prediction is: a single number that summarizes the error across the whole batch. Backprop then minimizes this number.
Input → Linear → ReLU → Linear → ReLU → Output → Loss. Each layer transforms the data. The loss is computed at the end.
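That pipeline is a few lines in PyTorch. A minimal sketch with made-up sizes (4 features, 3 classes, batch of 5):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 3),               # 3-class logits: no final activation
)

X = torch.randn(5, 4)              # batch of 5 samples, 4 features each
y = torch.tensor([0, 2, 1, 1, 0])  # true class labels

logits = model(X)                  # forward pass -> (5, 3)
criterion = nn.CrossEntropyLoss()  # softmax + negative log-likelihood in one op
loss = criterion(logits, y)        # one scalar summarizing the batch error
print(logits.shape, loss.item())   # torch.Size([5, 3]) and a positive scalar
```

Note that `CrossEntropyLoss` expects raw logits, which is why the output layer has no softmax.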
Backpropagation
Backprop is THE algorithm of deep learning. It uses the chain rule of calculus to compute how much each weight contributed to the loss, efficiently, for millions of weights simultaneously. PyTorch automates it, but you must understand it.
Red glow = gradient signal flowing backward. Each layer: receives ∂L/∂output, computes ∂L/∂weights via the chain rule, passes ∂L/∂input to the previous layer.
∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w. Each term is a simple derivative. Multiply them together to get the gradient for any weight at any depth. For a 100-layer network, the chain has 100 links. PyTorch tracks the computation graph automatically.
```python
import torch

# Simple example: compute the gradient of L = (wx - y)^2
w = torch.tensor([2.0], requires_grad=True)   # track gradients
x = torch.tensor([3.0])
y = torch.tensor([9.0])

# Forward pass: builds the computation graph
pred = w * x
loss = (pred - y) ** 2

# Backward pass: computes all gradients via the chain rule
loss.backward()
print(w.grad)  # tensor([-18.]) = 2*(wx-y)*x = 2*(6-9)*3 = -18

# Manual check: dL/dw = 2(wx-y)*x = 2*(2*3-9)*3 = 2*(-3)*3 = -18 ✓

# ─── In a real network ───────────────────────────────────────
optimizer.zero_grad()               # clear gradients from last step
output = model(X_batch)             # forward pass: builds graph
loss = criterion(output, y_batch)
loss.backward()                     # backward: computes all .grad attributes
optimizer.step()                    # use .grad to update weights
```
Optimizers
The optimizer takes the gradients computed by backprop and decides how to update each weight. SGD is the foundation. Adam is the default that just works. AdamW is what modern LLMs use.
Adam finds the minimum fast and reliably. Momentum accelerates SGD. Plain SGD is noisy but can generalize better with careful tuning.
| Optimizer | Key Idea | Default Params | Use For |
|---|---|---|---|
| SGD | w -= lr Β· grad | lr=0.01 | Computer vision with fine-tuning; needs careful lr schedule |
| SGD + Momentum | Accumulates velocity in consistent gradient directions | momentum=0.9 | Most CV training; faster than plain SGD |
| Adam | Adaptive per-weight learning rate using gradient moments | lr=1e-3, β₁=0.9, β₂=0.999 | Default for most DL; NLP; quick experiments |
| AdamW | Adam with decoupled weight decay (L2 applied correctly) | lr=1e-4, wd=0.01 | Transformers, LLMs β BERT/GPT use this |
| RMSProp | Adaptive lr using running average of squared gradients | lr=1e-3, α=0.99 | RNNs; non-stationary problems |
Don't use a fixed learning rate. Start higher, decay over time. CosineAnnealingLR: smoothly decays, a great default. ReduceLROnPlateau: reduces when val loss stops improving. OneCycleLR: warm up then cool down, best for fast convergence.
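A minimal sketch of cosine decay; the bare `optimizer.step()` stands in for a real training epoch:

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Cosine decay from 1e-3 down toward 0 over 50 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

lrs = []
for epoch in range(50):
    # ... train one epoch here ...
    optimizer.step()       # placeholder step, so the step order is valid
    scheduler.step()       # decay the lr once per epoch, after optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0], lrs[-1])     # starts near 1e-3, ends near 0
```

The order matters: calling `scheduler.step()` before the first `optimizer.step()` triggers a PyTorch warning, and calling it per batch instead of per epoch decays 50x too fast (OneCycleLR is the exception, which does step per batch).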
PyTorch Tensors & Autograd
PyTorch is the primary DL framework used by most researchers and companies. Tensors are like NumPy arrays, with a near-identical API, but they can live on the GPU and track gradients automatically. Nearly all your NumPy knowledge transfers directly.
```python
import torch
import torch.nn as nn
import numpy as np

# ─── Tensors: near-identical to NumPy ndarray ────────────────
x = torch.tensor([1.0, 2.0, 3.0])
X = torch.randn(64, 20)          # batch=64, features=20
W = torch.zeros(20, 10)          # weight matrix
I = torch.eye(5)                 # identity matrix

# The same NumPy-style operations work on tensors:
X @ W                            # matrix multiply (the forward pass!)
X.mean(dim=0)                    # mean per feature
X.sum(); X.max(); X.abs()
X.reshape(-1); X.T; X[:10]

# ─── GPU: one line to move everything ────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
X = X.to(device)                 # move tensor to GPU (or stay on CPU)
model = model.to(device)         # same for any nn.Module

# ─── NumPy interop ───────────────────────────────────────────
arr = np.array([1, 2, 3], dtype=np.float32)
t = torch.from_numpy(arr)        # shares memory with arr!
t.detach().cpu().numpy()         # safe Tensor -> NumPy conversion

# ─── DataLoader: automatic batching ──────────────────────────
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(
    torch.FloatTensor(X_train),
    torch.LongTensor(y_train),   # LongTensor for class labels
)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

for X_batch, y_batch in loader:
    # X_batch: (64, 20), y_batch: (64,)
    pass                         # your training code here
```
Building an MLP in PyTorch
nn.Module is the base class for every PyTorch model, from a 2-layer MLP to GPT-4. The pattern: define layers in __init__, define the computation in forward(). That's it.
Input layer → hidden layers (Linear + BN + ReLU + Dropout) → output layer. Each circle is a neuron. Each line is a weight. The network learns by adjusting all weights simultaneously.
```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, dropout=0.3):
        super().__init__()            # ALWAYS call this first
        layers = []
        prev = in_dim
        for h in hidden:
            layers += [
                nn.Linear(prev, h),   # y = xW^T + b
                nn.BatchNorm1d(h),    # normalize activations
                nn.ReLU(),            # non-linearity
                nn.Dropout(dropout),  # regularization
            ]
            prev = h
        layers.append(nn.Linear(prev, out_dim))  # output (no activation!)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)            # Sequential handles the rest

# Instantiate
model = MLP(in_dim=20, hidden=[128, 64, 32], out_dim=2, dropout=0.3).to(device)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")

# Weight initialization: He (Kaiming) for ReLU networks
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

model.apply(init_weights)
```
Random normal initialization causes vanishing or exploding activations in deep networks. Xavier/Glorot: for sigmoid/tanh. He/Kaiming: for ReLU (PyTorch's nn.Linear default is a Kaiming variant). Correct init means the forward pass starts with reasonably scaled activations, so training converges much faster.
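The vanishing-activation effect is easy to demonstrate. A sketch that pushes a batch through 20 Linear+ReLU layers and compares a naive small-std normal init against He init; the depth, width, and std values are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def deep_relu_pass(init_fn, depth=20, width=256):
    """Run a batch through `depth` Linear+ReLU layers; return the final std."""
    x = torch.randn(128, width)
    for _ in range(depth):
        layer = nn.Linear(width, width, bias=False)
        init_fn(layer.weight)
        x = torch.relu(layer(x))
    return x.std().item()

naive = deep_relu_pass(lambda w: nn.init.normal_(w, std=0.01))
he    = deep_relu_pass(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'))
print(f"naive normal init: {naive:.2e}")  # activations collapse toward 0
print(f"He init:           {he:.2e}")     # activations keep a healthy scale
```

With std=0.01 each layer shrinks the signal by roughly sqrt(256 * 0.01^2 / 2) ≈ 0.11, so after 20 layers almost nothing survives; He init's 2/fan_in variance makes that factor approximately 1.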
The Complete Training Loop
The training loop is the same for every DL model β from MLP to GPT. Understand each line. The most common bugs all live here.
```python
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
best_val, history = float('inf'), {'train': [], 'val': [], 'acc': []}

for epoch in range(100):
    # ─── TRAIN ───────────────────────────────────────────────
    model.train()                         # enables dropout + BN train mode
    train_loss = 0.0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)
        optimizer.zero_grad()             # 1. CLEAR old gradients: never forget
        logits = model(Xb)                # 2. Forward pass
        loss = criterion(logits, yb)      # 3. Compute loss
        loss.backward()                   # 4. Backprop: computes all grads
        nn.utils.clip_grad_norm_(         # 5. Clip gradients (optional)
            model.parameters(), 1.0)
        optimizer.step()                  # 6. Update weights with grads
        train_loss += loss.item()

    # ─── VALIDATE ────────────────────────────────────────────
    model.eval()                          # disables dropout; BN uses running stats
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():                 # no graph built: far less memory
        for Xv, yv in val_loader:
            Xv, yv = Xv.to(device), yv.to(device)
            logits = model(Xv)
            val_loss += criterion(logits, yv).item()
            correct += (logits.argmax(1) == yv).sum().item()
            total += yv.size(0)

    scheduler.step()
    if val_loss < best_val:
        best_val = val_loss
        torch.save(model.state_dict(), 'best.pt')  # save best checkpoint
    if epoch % 10 == 0:
        print(f"[{epoch:3d}] train={train_loss/len(train_loader):.4f} "
              f"val={val_loss/len(val_loader):.4f} acc={correct/total:.3f}")
```
1. zero_grad(): clear old gradients (miss this and gradients accumulate, a very common bug). 2. Forward pass. 3. Compute loss. 4. loss.backward(). 5. (Optional) clip gradients. 6. optimizer.step(). Every DL training loop you ever write follows this exact sequence.
Regularization β Preventing Overfitting
Deep networks can memorize millions of training examples perfectly. These techniques force the network to learn generalizable patterns instead of memorizing noise.
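The main tools: dropout, weight decay, and early stopping. A sketch combining all three; the layer sizes, the `EarlyStopping` helper, and the fake validation losses are illustrative:

```python
import torch
import torch.nn as nn

# Dropout and BatchNorm go inside the model:
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalization also has a mild regularizing effect
    nn.ReLU(),
    nn.Dropout(0.3),      # zero 30% of activations at random (train mode only)
    nn.Linear(64, 2),
)

# Weight decay (L2), decoupled correctly via AdamW:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

class EarlyStopping:
    """Stop when val loss hasn't improved for `patience` consecutive epochs."""
    def __init__(self, patience=5):
        self.best, self.patience, self.bad = float('inf'), patience, 0
    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
            return False                      # improved: keep training
        self.bad += 1
        return self.bad >= self.patience      # True -> stop

stopper = EarlyStopping(patience=3)
fake_val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]  # plateaus after epoch 2
stopped_at = None
for epoch, vl in enumerate(fake_val_losses):
    if stopper.step(vl):
        stopped_at = epoch
        break
print(f"early stop at epoch {stopped_at}")  # epoch 5
```

Dropout is why `model.train()` / `model.eval()` matters: forgetting `eval()` at validation time leaves 30% of activations randomly zeroed and silently hurts your metrics.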
Convolutional Neural Networks (CNNs)
CNNs exploit the structure of spatial data: nearby pixels are related, and the same pattern (edge, texture) can appear anywhere in an image. Convolution shares weights across all positions β far more efficient than a fully connected layer.
A 3×3 filter slides across the image. At each position: element-wise multiply with the covered patch, then sum to get one output value. The filter's weights are learned to detect specific features (edges, curves, textures).
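The slide-multiply-sum can be checked by hand with a fixed filter. A sketch using `F.conv2d` with a Sobel-style vertical-edge kernel (hand-picked, not learned) on a tiny half-dark image:

```python
import torch
import torch.nn.functional as F

img = torch.zeros(1, 1, 6, 6)    # (batch, channels, H, W)
img[..., 3:] = 1.0               # left half dark, right half bright

kernel = torch.tensor([[[[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]]])   # (out_ch, in_ch, 3, 3)

out = F.conv2d(img, kernel, padding=1)       # "same" padding keeps 6x6
print(out.shape)     # torch.Size([1, 1, 6, 6])
print(out[0, 0, 2])  # strong response only at the edge (columns 2-3)
```

Flat regions multiply-and-sum to zero; only positions where the patch straddles the dark-to-bright boundary fire. A trained CNN learns kernels like this one automatically, in its first layer.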
```python
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: (3, 224, 224) -> (32, 112, 112)
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # "same" padding
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                              # halve spatial dims
            # Block 2: -> (64, 56, 56)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: -> (128, 4, 4) via adaptive pooling
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),                 # always (128, 4, 4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # (128, 4, 4) -> (2048,)
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Output size formula: out = (in + 2*pad - kernel) / stride + 1
# (224 + 2*1 - 3) / 1 + 1 = 224 -> "same" padding keeps the spatial size
```
RNNs, LSTMs & GRUs
Designed for sequential data: text, time series, audio. They process one timestep at a time, maintaining a hidden state that acts as memory. LSTMs and GRUs solve the vanishing gradient problem of vanilla RNNs.
Each timestep receives the input token and the previous hidden state. The LSTM gates decide what to remember, what to forget, and what to output.
```python
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden, n_classes, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, n_layers,
                            batch_first=True,        # (batch, seq, feat)
                            dropout=0.3, bidirectional=True)
        self.fc = nn.Linear(hidden * 2, n_classes)   # *2 for bidirectional

    def forward(self, x):
        # x: (batch, seq_len)
        emb = self.embed(x)               # (batch, seq_len, emb_dim)
        out, (h_n, c_n) = self.lstm(emb)
        # h_n: (n_layers*2, batch, hidden) holds the final hidden states
        h = torch.cat([h_n[-2], h_n[-1]], dim=1)  # concat fwd + bwd
        return self.fc(h)                 # (batch, n_classes)
```
Transformers & Self-Attention
The architecture behind GPT-4, Claude, BERT, and every state-of-the-art model. Instead of processing sequentially like RNNs, Transformers attend to ALL positions simultaneously with learned attention weights, enabling massive parallelism and long-range dependencies.
Q (Query) × K (Key) → similarity scores → softmax → attention weights → weighted sum of V (Values) = context-aware output for each token.
```python
import math
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        def split_heads(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q = split_heads(self.W_q(x))
        K = split_heads(self.W_k(x))
        V = split_heads(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, T, C)
        return self.W_o(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(),
                                nn.Linear(ff_dim, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.drop(self.attn(self.ln1(x)))  # residual + attention
        x = x + self.drop(self.ff(self.ln2(x)))    # residual + feedforward
        return x
```
Transfer Learning & Fine-Tuning
Training large models from scratch requires weeks and millions of dollars. Transfer learning: take a model pretrained on massive data, adapt it to your task. This is how 90% of real-world DL projects work.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# ─── Load pretrained BERT ────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,                 # replaces the classification head
).to(device)

# ─── Tokenize ────────────────────────────────────────────────
texts = ["Great product!", "Terrible experience"]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

# ─── Discriminative learning rates ───────────────────────────
optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},  # early layers
    {"params": model.bert.encoder.parameters(),    "lr": 2e-5},  # middle
    {"params": model.classifier.parameters(),      "lr": 5e-5},  # new head
])

# ─── Training step ───────────────────────────────────────────
labels = torch.tensor([1, 0])
outputs = model(**inputs.to(device), labels=labels.to(device))
loss = outputs.loss               # HuggingFace computes the loss for you
preds = outputs.logits.argmax(dim=1)
```
Practical DL β Debugging & Production
The gap between "model trains" and "model works well" is all here. These are hard-won lessons that only come from debugging real models; read them carefully.
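The single most common production bug is a loss that goes NaN. A debugging sketch; the `check_batch` helper and its thresholds are illustrative, not a standard API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def check_batch(model, loss):
    """Cheap per-batch sanity checks that catch NaN/exploding loss early."""
    if torch.isnan(loss) or torch.isinf(loss):
        raise RuntimeError(f"bad loss: {loss.item()}")
    # Total gradient norm: a spike here usually precedes a NaN loss
    total_norm = torch.sqrt(sum(
        p.grad.norm() ** 2 for p in model.parameters() if p.grad is not None
    ))
    return total_norm.item()

# Usual suspects when the loss goes NaN (check in roughly this order):
# 1. Learning rate too high                      -> lower lr 10x
# 2. Exploding gradients                         -> clip_grad_norm_(params, 1.0)
# 3. log(0) or division by zero in a custom loss -> add a small eps
# 4. Bad input data (NaN/inf features)           -> torch.isnan(X).any()
# 5. fp16 overflow in mixed precision            -> use GradScaler or bf16

model = nn.Linear(4, 2)
out = model(torch.randn(8, 4))
loss = nn.functional.cross_entropy(out, torch.randint(0, 2, (8,)))
loss.backward()
gnorm = check_batch(model, loss)
print(f"grad norm: {gnorm:.4f}")
```

Logging the gradient norm every batch costs almost nothing and turns "loss became NaN at step 40,000" into a graph that points at the exact step where training went unstable.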
You've mastered DL fundamentals when you can:
- Explain backpropagation using the chain rule, without notes
- Build and train an MLP in PyTorch from scratch, including the 6-step loop
- Explain vanishing gradients and name 3 solutions
- Build a CNN with Conv2d β BatchNorm2d β ReLU β MaxPool blocks
- Explain Q, K, V in self-attention using the database analogy
- Fine-tune a pretrained HuggingFace model on custom text data
- Debug a NaN loss: enumerate 5 possible causes in order
- Know when to use MLP, CNN, LSTM, or Transformer for a given task