Complete Deep Learning Curriculum

Deep Learning
From Neuron to Transformer

Every concept from a single perceptron to modern Transformers — with animated visualizations, full PyTorch code, and interview prep. No topic skipped. Builds directly on your Python + NumPy knowledge.

14 Modules
PyTorch Code
Live Animations
CNN · LSTM · Transformer
Backprop by Hand
Transfer Learning
Module 01

The Neuron & Perceptron

Everything in deep learning descends from one idea: a neuron receives inputs, multiplies each by a learned weight, sums them, passes through an activation function, and fires an output. Stack millions of these and you get GPT-4.

Single Neuron — Animated Forward Pass Interactive

z = w₁x₁ + w₂x₂ + w₃x₃ + b. Then output = σ(z). Watch the signal propagate and see z and the activation computed live.

Weighted sum: z = Σ wᵢxᵢ + b = w·x + b
Activation:   a = f(z), where f is ReLU, Sigmoid, tanh, ...
Output:       ŷ = a (prediction)
Inputs (x)
The features of one data sample. 20 features → 20 inputs. Each pixel in an image is an input. Each embedding dimension in NLP is an input. Your NumPy X matrix rows become the input to neurons.
x ∈ ℝⁿ
Weights (w) & Bias (b)
Weights: learnable, one per input. Positive = input pushes output up, negative = pushes down. Bias: one extra learnable number that shifts the threshold. Without bias, decision boundary must pass through origin.
Learned by backprop · Random init
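The weighted sum and activation above are two lines of code. A minimal sketch with arbitrary illustrative values:

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0])   # one sample with 3 features
w = torch.tensor([0.8, 0.2, -0.5])   # one learnable weight per input
b = torch.tensor(0.1)                # bias shifts the firing threshold

z = w @ x + b                        # weighted sum: w·x + b = -0.7
a = torch.sigmoid(z)                 # activation: squash z into (0, 1)
print(a.item())                      # ≈ 0.332
```

In a real network, w and b start random and are adjusted by backprop; here they are fixed so you can verify the arithmetic by hand.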
From Neuron → MLP → Deep Network
One neuron = one linear boundary. Stack neurons in a layer = multiple boundaries. Stack layers = hierarchical features. Layer 1: edges. Layer 2: shapes. Layer 3: objects. Depth creates abstraction.
Universal Approximator
Why Not Just One Layer?
Without hidden layers and activations, stacking is pointless — a composition of linear functions is still one linear function. The XOR problem cannot be solved by a single-layer network. Non-linearity from activation functions is what makes depth meaningful.
Key Insight
Module 02

Activation Functions

Activations introduce non-linearity — without them, depth is meaningless. Each has different properties. Knowing which to use where is one of the most practical DL skills.

Activation Functions — Visualized Switch between them

ReLU: max(0,z). Simple, fast. Default for hidden layers. Can cause "dying neurons" if weights push all inputs negative.

Function | Formula | Range | Use When | Watch Out
ReLU | max(0, z) | [0, ∞) | Hidden layers — default choice | Dead neurons if z always < 0
Sigmoid | 1/(1+e⁻ᶻ) | (0, 1) | Binary output layer only | Vanishing gradients in deep nets
Tanh | (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ) | (−1, 1) | Hidden layers, RNNs | Vanishing gradients
Leaky ReLU | max(0.01z, z) | (−∞, ∞) | When ReLU causes dead neurons | Extra hyperparameter α
GELU | z·Φ(z) | (−∞, ∞) | Transformers (BERT, GPT) | Slightly slower to compute
Softmax | eᶻᵢ/Σⱼeᶻⱼ | (0, 1), sum = 1 | Multi-class output layer only | Only at the final layer!
💡 Golden Rules — Memorize These

Hidden layers → ReLU (default). Binary output → Sigmoid. Multi-class output → Softmax. Regression output → no activation (linear). Transformers → GELU. These 5 rules cover 99% of cases.
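The table maps directly onto PyTorch calls — a quick check of the three most common ones:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 0.0, 3.0])

print(F.relu(z))              # tensor([0., 0., 3.]) — negatives clipped to 0
print(torch.sigmoid(z))       # each value squashed into (0, 1)

probs = F.softmax(z, dim=0)   # only at the output layer, multi-class
print(probs)                  # sums to 1 — a probability distribution
```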

Module 03

Forward Pass & Loss Functions

The forward pass computes the prediction. The loss function measures how wrong the prediction is — a single number that summarizes the error across the whole batch. Backprop then minimizes this number.

Data Flows Through the Network Animated Forward Pass

Input → Linear → ReLU → Linear → ReLU → Output → Loss. Each layer transforms the data. The loss is computed at the end.

Binary CE:        L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
Categorical CE:   L = −Σ yᵢ·log(ŷᵢ)   (one-hot y, softmax ŷ)
MSE (regression): L = (1/n)·Σ(yᵢ − ŷᵢ)²
MAE:              L = (1/n)·Σ|yᵢ − ŷᵢ|   (robust to outliers)
Binary Cross-Entropy
y=1 and ŷ=0.01 → huge loss. y=1 and ŷ=0.99 → tiny loss. Penalizes confident wrong predictions exponentially harder than uncertain ones. Always use with a Sigmoid output. In PyTorch: nn.BCEWithLogitsLoss (more numerically stable).
Binary (0 or 1)
CrossEntropyLoss in PyTorch
nn.CrossEntropyLoss combines LogSoftmax + NLLLoss. Pass RAW LOGITS (no softmax applied!). Expects integer class labels (not one-hot). This is the most common mistake: applying softmax before CrossEntropyLoss.
Multi-class · No softmax before!
MSE vs MAE
MSE squares errors → penalizes large errors much more → sensitive to outliers. MAE takes the absolute value → treats all errors linearly → robust to outliers. Use MSE by default; switch to MAE if your target has outliers.
Regression
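The "no softmax before CrossEntropyLoss" rule can be verified directly — a sketch with made-up logits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5]])     # raw model output, one sample
target = torch.tensor([0])                     # integer class index, not one-hot
ce = nn.CrossEntropyLoss()

right = ce(logits, target)                     # pass raw logits — correct
wrong = ce(F.softmax(logits, dim=1), target)   # softmax applied twice — silently off
print(right.item(), wrong.item())              # the two losses disagree
```

The wrong version raises no error — it just trains a worse model, which is why this bug is so easy to miss.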
Module 04

Backpropagation

Backprop is THE algorithm of deep learning. It uses the chain rule of calculus to compute how much each weight contributed to the loss — efficiently, for millions of weights simultaneously. PyTorch automates it, but you must understand it.

Gradient Flows Backward Through Layers Step by Step
Click to flow gradients backward

Red glow = gradient signal flowing backward. Each layer: receives ∂L/∂output, computes ∂L/∂weights via the chain rule, passes ∂L/∂input to the previous layer.

🔗 Chain Rule — The Core

∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w. Each term is a simple derivative. Multiply them together = the gradient for any weight at any depth. For a 100-layer network, the chain has 100 links. PyTorch tracks the computation graph automatically.

Vanishing Gradients
Sigmoid/tanh saturate → derivatives near 0 → the gradient shrinks exponentially as it flows back through layers → early layers barely update → the network doesn't learn deep features. Fix: ReLU, batch normalization, residual connections.
Deep network problem · ReLU fixes
Exploding Gradients
Gradients multiply at each layer → they can grow exponentially → NaN weights. Especially in RNNs. Fix: gradient clipping (cap the gradient norm at a max value, e.g. 1.0). One line: nn.utils.clip_grad_norm_(model.parameters(), 1.0).
RNN problem · Clip gradients
PyTorch Autograd
PyTorch builds a dynamic computational graph on every forward pass. loss.backward() traverses this graph in reverse, computing all gradients via the chain rule automatically. tensor.grad holds the result. You never compute ∂L/∂w by hand.
Automatic differentiation
04_autograd_demo.py
python
import torch

# Simple example: compute gradient of L = (wx - y)Β²
w = torch.tensor([2.0], requires_grad=True)   # track gradients
x = torch.tensor([3.0])
y = torch.tensor([9.0])

# Forward pass β€” builds computation graph
pred = w * x
loss = (pred - y) ** 2

# Backward pass β€” computes all gradients via chain rule
loss.backward()

print(w.grad)    # → tensor([-18.]) = 2*(wx-y)*x = 2*(6-9)*3 = -18
# Manual check: dL/dw = 2(wx-y)·x = 2(2·3-9)·3 = 2(-3)(3) = -18 ✓

# ─── In a real network ────────────────────────────────────
optimizer.zero_grad()   # clear gradients from last step
output = model(X_batch)  # forward pass β€” builds graph
loss = criterion(output, y_batch)
loss.backward()          # backward β€” computes all .grad attributes
optimizer.step()         # use .grad to update weights
Module 05

Optimizers

The optimizer takes the gradients computed by backprop and decides how to update each weight. SGD is the foundation. Adam is the default that just works. AdamW is what modern LLMs use.

Optimizer Paths on a Loss Surface Compare convergence

Adam finds the minimum fast and reliably. Momentum accelerates SGD. Plain SGD is noisy but can generalize better with careful tuning.

Optimizer | Key Idea | Default Params | Use For
SGD | w -= lr · grad | lr=0.01 | Computer vision with fine-tuning; needs a careful lr schedule
SGD + Momentum | Accumulates velocity in consistent gradient directions | momentum=0.9 | Most CV training; faster than plain SGD
Adam | Adaptive per-weight learning rate using gradient moments | lr=1e-3, β₁=0.9, β₂=0.999 | Default for most DL; NLP; quick experiments
AdamW | Adam with decoupled weight decay (L2 applied correctly) | lr=1e-4, wd=0.01 | Transformers, LLMs — BERT/GPT use this
RMSProp | Adaptive lr using a running average of squared gradients | lr=1e-3, α=0.99 | RNNs; non-stationary problems
💡 Learning Rate Schedulers

Don't use a fixed learning rate. Start higher, decay over time. CosineAnnealingLR — smoothly decays, great default. ReduceLROnPlateau — reduces when val loss stops improving. OneCycleLR — warm up then cool down, best for fast convergence.
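Wiring an optimizer to a scheduler is two lines — a sketch with a toy one-layer model (in a real loop, optimizer.step() sits inside the batch loop and does actual gradient updates):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

lrs = []
for epoch in range(50):
    optimizer.step()       # stands in for a full training epoch
    scheduler.step()       # decay the lr once per epoch
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0], lrs[-1])     # smoothly decays from ~1e-3 toward 0
```

Note the order: scheduler.step() comes after optimizer.step(), once per epoch.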

Module 06

PyTorch Tensors & Autograd

PyTorch is the primary DL framework used by most researchers and companies. Tensors are like NumPy arrays — identical API — but they can live on the GPU and track gradients automatically. All your NumPy knowledge transfers directly.

06_pytorch_fundamentals.py
python
import torch
import torch.nn as nn
import numpy as np

# ─── Tensors — identical to NumPy ndarray ─────────────────
x = torch.tensor([1.0, 2.0, 3.0])
X = torch.randn(64, 20)           # batch=64, features=20
W = torch.zeros(20, 10)            # weight matrix
I = torch.eye(5)                   # identity matrix

# Same NumPy operations work on PyTorch tensors:
X @ W                               # matrix multiply (forward pass!)
X.mean(dim=0)                     # mean per feature
X.sum(); X.max(); X.abs()
X.reshape(-1); X.T; X[:10]

# ─── GPU — one line to move everything ────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
X = X.to(device)     # move tensor to GPU (or stay on CPU)
model = model.to(device)

# ─── NumPy interop ────────────────────────────────────────
arr = np.array([1, 2, 3], dtype=np.float32)
t = torch.from_numpy(arr)          # shared memory!
t.detach().cpu().numpy()           # safe Tensor β†’ NumPy

# ─── DataLoader — automatic batching ──────────────────────
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(
    torch.FloatTensor(X_train),
    torch.LongTensor(y_train)        # LongTensor for class labels
)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

for X_batch, y_batch in loader:    # X_batch: (64, 20), y_batch: (64,)
    pass                             # your training code here
Module 07

Building an MLP in PyTorch

nn.Module is the base class for every PyTorch model — from a 2-layer MLP to GPT-4. The pattern: define layers in __init__, define the computation in forward(). That's it.

MLP Architecture — Layer by Layer Visual

Input layer → hidden layers (Linear + BN + ReLU + Dropout) → output layer. Each circle is a neuron. Each line is a weight. The network learns by adjusting all weights simultaneously.

07_mlp_pytorch.py
python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, dropout=0.3):
        super().__init__()                # ALWAYS call this first
        layers = []
        prev = in_dim
        for h in hidden:
            layers += [
                nn.Linear(prev, h),        # y = xW^T + b
                nn.BatchNorm1d(h),          # normalize activations
                nn.ReLU(),                  # non-linearity
                nn.Dropout(dropout),        # regularization
            ]
            prev = h
        layers.append(nn.Linear(prev, out_dim))  # output (no activation!)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)               # Sequential handles the rest

# Instantiate
model = MLP(in_dim=20, hidden=[128, 64, 32], out_dim=2, dropout=0.3).to(device)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")

# Weight initialization — He (Kaiming) for ReLU networks
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)
model.apply(init_weights)
💡 Weight Initialization Matters

Random normal initialization causes vanishing/exploding activations in deep networks. Xavier/Glorot: for sigmoid/tanh. He/Kaiming: for ReLU (the PyTorch default for nn.Linear). Correct init means the forward pass starts with reasonable activations — training converges much faster.

Module 08

The Complete Training Loop

The training loop is the same for every DL model — from MLP to GPT. Understand each line. The most common bugs all live here.

08_training_loop.py — Production quality
python
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val, history = float('inf'), {'train':[], 'val':[], 'acc':[]}

for epoch in range(100):

    # ═══ TRAIN ══════════════════════════════════════════════
    model.train()                      # enables dropout + BN train mode
    train_loss = 0.0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)

        optimizer.zero_grad()          # 1. CLEAR old gradients ← never forget
        logits = model(Xb)              # 2. Forward pass
        loss = criterion(logits, yb)    # 3. Compute loss
        loss.backward()               # 4. Backprop — computes all grads
        nn.utils.clip_grad_norm_(      # 5. Clip gradients (optional)
            model.parameters(), 1.0)
        optimizer.step()              # 6. Update weights with grads
        train_loss += loss.item()

    # ═══ VALIDATE ═══════════════════════════════════════════
    model.eval()                       # disables dropout; BN uses running stats
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():              # no graph built — activations aren't stored, big memory savings
        for Xv, yv in val_loader:
            Xv, yv = Xv.to(device), yv.to(device)
            logits = model(Xv)
            val_loss += criterion(logits, yv).item()
            correct += (logits.argmax(1) == yv).sum().item()
            total += yv.size(0)

    scheduler.step()
    if val_loss < best_val:
        best_val = val_loss
        torch.save(model.state_dict(), 'best.pt')   # save best checkpoint

    if epoch % 10 == 0:
        print(f"[{epoch:3d}] train={train_loss/len(train_loader):.4f} val={val_loss/len(val_loader):.4f} acc={correct/total:.3f}")
✅ The 6 Steps — Tattoo These

1. zero_grad() — clear old gradients (miss this and gradients accumulate — a very common bug). 2. Forward pass. 3. Compute loss. 4. loss.backward(). 5. (Optional) clip gradients. 6. optimizer.step(). Every DL training loop you ever write follows this exact sequence.

Module 09

Regularization — Preventing Overfitting

Deep networks can memorize millions of training examples perfectly. These techniques force the network to learn generalizable patterns instead of memorizing noise.

Dropout
Randomly zeroes a fraction p of neurons on each forward pass during training. Forces redundancy — the network can't rely on any single neuron. Acts like training an ensemble of 2ⁿ subnetworks. p=0.2–0.5. Always disabled by model.eval().
Most common · nn.Dropout(0.3)
Batch Normalization
Normalizes each layer's inputs to mean 0, std 1 within the mini-batch. Allows much higher learning rates, dramatically stabilizes training. Adds 2 learnable params (γ, β) per feature. Goes between Linear and activation.
Use always · nn.BatchNorm1d
Weight Decay (L2)
Adds λΣw² to the loss → pushes all weights toward zero. In PyTorch: the weight_decay argument of AdamW. Always use AdamW (not Adam) for this — Adam couples L2 with its adaptive moments, which weakens the decay; AdamW decouples it.
Simple · weight_decay=1e-4
Early Stopping
Monitor val loss. Stop training when it hasn't improved for N epochs (the patience). Save the best checkpoint. Free regularization — always do it. One of the most effective techniques for preventing overfitting.
Free · Always use
Data Augmentation
Create modified copies of training data: flip, rotate, crop, color jitter (images); synonym replacement, back-translation (text); adding noise, time warping (audio/time series). Most effective regularization when data is limited.
Best for small data
Layer Normalization
Like BatchNorm but normalizes across features within each sample (not across batch). Works for batch_size=1. Required for Transformers β€” BatchNorm doesn't work with variable-length sequences. nn.LayerNorm in PyTorch.
Transformers only · nn.LayerNorm
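Dropout's train/eval split from the cards above can be seen directly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()              # training mode: zero ~half the activations,
out_train = drop(x)       # scale survivors by 1/(1-p) to preserve the mean
drop.eval()               # eval mode: dropout is a no-op
out_eval = drop(x)

print(out_train)          # each value is either 0.0 or 2.0
print(out_eval)           # all ones — unchanged
```

The 2.0s are the 1/(1-p) scaling — it keeps the expected activation the same in both modes, so no rescaling is needed at inference.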
Module 10

Convolutional Neural Networks (CNNs)

CNNs exploit the structure of spatial data: nearby pixels are related, and the same pattern (edge, texture) can appear anywhere in an image. Convolution shares weights across all positions — far more efficient than a fully connected layer.

Convolution — Filter Sliding Over Input Animated

A 3×3 filter slides across the image. At each position: element-wise multiply with the covered patch, then sum → one output value. The filter's weights are learned to detect specific features (edges, curves, textures).

10_cnn_pytorch.py
python
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: (3,224,224) → (32,112,112)
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # same padding
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                            # halve spatial dims
            # Block 2: → (64, 56, 56)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: → (128, 4, 4) via adaptive pooling
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),              # always (128,4,4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                              # (128,4,4) → (2048,)
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )
    def forward(self, x): return self.classifier(self.features(x))

# Output size formula: out = (in + 2*pad - kernel) / stride + 1
# (224 + 2*1 - 3)/1 + 1 = 224  ← "same" padding keeps spatial size
MaxPool vs AvgPool
MaxPool: take the maximum in each window. Detects the presence of a feature regardless of exact position (spatial invariance). AvgPool: average — smoother. AdaptiveAvgPool(size): always produces a fixed output size regardless of input — very useful.
Downsampling
Residual Connections (ResNet)
y = F(x) + x. Add the input directly to the output (a "skip connection"). Prevents vanishing gradients in very deep networks. Enables training of 50–200+ layer networks. Every modern architecture uses this. LeNet → AlexNet → VGG → ResNet was the progression.
Modern standard
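A minimal residual block sketch — not torchvision's BasicBlock, but the same y = F(x) + x idea:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: gradient flows through "+ x" unchanged

block = ResidualBlock(64)
x = torch.randn(2, 64, 8, 8)
print(block(x).shape)               # torch.Size([2, 64, 8, 8]) — shape preserved
```

The "+ x" requires input and output shapes to match; real ResNets use a 1×1 conv on the skip path when channels or stride change.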
Using Pretrained CNNs
torchvision.models.resnet50(weights='IMAGENET1K_V1'). Replace model.fc with nn.Linear(2048, your_classes). Freeze backbone for small datasets, fine-tune for larger. State-of-the-art in 5 lines of code.
Fastest path to results
Module 11

RNNs, LSTMs & GRUs

Designed for sequential data — text, time series, audio. Process one timestep at a time, maintaining a hidden state that acts as memory. LSTMs and GRUs solve the vanishing gradient problem of vanilla RNNs.

LSTM Unrolled — How Hidden State Propagates Visual

Each timestep receives the input token and the previous hidden state. The LSTM gates decide what to remember, what to forget, and what to output.

Vanilla RNN
h_t = tanh(W_h·h_{t-1} + W_x·x_t + b). Hidden state carries context. Problem: vanishing gradients → forgets dependencies beyond ~10 timesteps. The weight matrix is multiplied in at every step → gradients either vanish or explode over long sequences.
Short memory
LSTM
3 gates: Forget (what to erase from memory), Input (what new info to write), Output (what to read). Separate cell state C_t can carry information across hundreds of timesteps unchanged. The standard choice for sequence modeling.
Long sequences · nn.LSTM
GRU
Simplified LSTM: 2 gates (Reset, Update). Fewer parameters → faster training. Similar performance to LSTM on most tasks. Good choice when LSTM overfits or is too slow. nn.GRU in PyTorch.
Efficient · nn.GRU
Bidirectional RNN
Process sequence both forward and backward, concatenate hidden states. Each token sees both past AND future context. Essential for understanding tasks (classification, NER). Set bidirectional=True in nn.LSTM. Double the hidden_dim for output.
bidirectional=True
11_lstm_classifier.py
python
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden, n_classes, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, n_layers,
                            batch_first=True,          # (batch, seq, feat)
                            dropout=0.3,
                            bidirectional=True)
        self.fc = nn.Linear(hidden * 2, n_classes)   # *2 for bidirectional

    def forward(self, x):               # x: (batch, seq_len)
        emb = self.embed(x)             # (batch, seq_len, emb_dim)
        out, (h_n, c_n) = self.lstm(emb)
        # h_n: (n_layers*2, batch, hidden) — final hidden states
        h = torch.cat([h_n[-2], h_n[-1]], dim=1)   # concat fwd + bwd
        return self.fc(h)               # (batch, n_classes)
Module 12

Transformers & Self-Attention

The architecture behind GPT-4, Claude, BERT, and every state-of-the-art model. Instead of processing sequentially like RNNs, Transformers attend to ALL positions simultaneously with learned attention weights — enabling massive parallelism and long-range dependencies.

Self-Attention — Tokens Attending to Each Other Animated
Each token creates Q, K, V vectors. Bright lines = high attention weight.

Q (Query) × K (Key) → similarity scores → softmax → attention weights → weighted sum of V (Values) = context-aware output for each token.

Attention(Q,K,V) = softmax( QKᵀ / √d_k ) · V
Where: Q = XW_Q (queries), K = XW_K (keys), V = XW_V (values)
d_k = dimension of the keys (the scaling prevents large dot products)
Q, K, V β€” The Database Analogy
Q = "what am I looking for?" K = "what do I contain?" V = "what do I return?" Each token queries all other tokens. High Q·K similarity → high attention weight → that token's V contributes more to the output. This is how "cat" attends to "sat" in "the cat sat."
Core mechanism
Multi-Head Attention
Run h parallel attention heads with different W_Q, W_K, W_V projections. Each head can attend to different relationship types: syntax, semantics, coreference, etc. Concatenate all heads, project with W_O. num_heads is a key hyperparameter.
Parallel attention
Positional Encoding
Attention is order-agnostic — "cat sat" = "sat cat" without positional info. Add positional vectors to the token embeddings. Sinusoidal (original paper). Learned (GPT). RoPE — Rotary (modern LLMs). Without this, the model can't learn sequence order.
Required
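A sketch of the sinusoidal variant from the original paper (the function name is mine):

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=128)
# Used as: x = token_embeddings + pe  (added, not concatenated)
print(pe.shape)                          # torch.Size([50, 128])
```

Each position gets a unique pattern of sines and cosines at different frequencies, so nearby positions have similar encodings and relative offsets are easy for attention to pick up.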
Encoder vs Decoder
Encoder (BERT): bidirectional attention, sees full sequence. For understanding: classification, NER, Q&A. Decoder (GPT): causal/masked attention, only sees past tokens. For generation. Encoder-Decoder (T5, BART): for translation/summarization.
Architecture choice
12_transformer_block.py — From-scratch implementation
python
import math
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        def split_heads(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split_heads(self.W_q(x)), split_heads(self.W_k(x)), split_heads(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None: scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, T, C)
        return self.W_o(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff   = nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(), nn.Linear(ff_dim, d_model))
        self.ln1  = nn.LayerNorm(d_model)
        self.ln2  = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.drop(self.attn(self.ln1(x)))   # residual + attention
        x = x + self.drop(self.ff(self.ln2(x)))     # residual + feedforward
        return x
Module 13

Transfer Learning & Fine-Tuning

Training large models from scratch requires weeks and millions of dollars. Transfer learning: take a model pretrained on massive data, adapt it to your task. This is how 90% of real-world DL projects work.

Feature Extraction
Freeze all pretrained weights. Only train a new output head. Fast, cheap, works when your data is similar to pretrained data. Use when you have <1000 samples. PyTorch: for param in model.parameters(): param.requires_grad = False. Then unfreeze the head only.
Frozen backbone · Fast
Fine-Tuning
Unfreeze some or all pretrained layers. Train end-to-end with small lr (1e-4 to 1e-5). Unfreeze gradually from the top. Discriminative lr: lower rate for early layers (they're already good), higher for later layers. Best performance on sufficient data.
Best performance · Needs more data
HuggingFace Transformers
The industry library for pretrained NLP. AutoModel, BertForSequenceClassification, GPT2LMHeadModel. Thousands of pretrained models. Fine-tune BERT on your text in ~20 lines. You used sentence-transformers in your RAG system — that's HuggingFace.
Industry standard
torchvision Pretrained CNNs
torchvision.models.resnet50(weights='IMAGENET1K_V1'). Replace model.fc. Freeze and train the head. 10 lines to state-of-the-art on custom image data. Also: EfficientNet, ViT, ConvNeXt — all available in torchvision.
Vision
13_finetune_bert.py
python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# ─── Load pretrained BERT ──────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2    # replace classification head
).to(device)

# ─── Tokenize ─────────────────────────────────────────────
texts = ["Great product!", "Terrible experience"]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

# ─── Discriminative learning rates ────────────────────────
optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},   # early
    {"params": model.bert.encoder.parameters(),    "lr": 2e-5},   # middle
    {"params": model.classifier.parameters(),      "lr": 5e-5},   # head
])

# ─── Training step ────────────────────────────────────────
labels = torch.tensor([1, 0])
outputs = model(**inputs.to(device), labels=labels.to(device))
loss = outputs.loss       # HuggingFace computes loss for you
preds = outputs.logits.argmax(dim=1)
Module 14

Practical DL β€” Debugging & Production

The gap between "model trains" and "model works well" is all here. These are hard-won lessons that only come from debugging real models — read them carefully.

Loss is NaN — Debug Checklist
Check in order: (1) Learning rate too high. (2) Forgot zero_grad(). (3) log(0) in the loss — add a 1e-8 epsilon. (4) Exploding gradients — add clip_grad_norm_. (5) Wrong loss function — CrossEntropyLoss needs raw logits, NOT softmax output. (6) NaN in the input data.
Most common crash
Model Not Learning
(1) Learning rate too low — try 10x higher. (2) Data not shuffled. (3) Labels wrong — check class indices (0-indexed). (4) Model too small — add layers. (5) Not training long enough. (6) Bug in the data pipeline — visualize batches.
Common issue
Sanity Check — Overfit One Batch
Before full training: take one batch, train on it for 100 steps. Loss should reach near-zero. If it doesn't, your model has a bug (wrong loss, wrong labels, architecture issue). This 2-minute test saves hours of wasted training.
Do this always
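The sanity check as code — a toy model and a random batch stand in for your real ones:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
Xb = torch.randn(32, 20)                 # ONE batch, reused every step
yb = torch.randint(0, 2, (32,))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(Xb), yb)
    loss.backward()
    optimizer.step()

print(loss.item())   # should be near zero — if it plateaus, the pipeline has a bug
```

A healthy model can memorize 32 samples easily; if even that fails, suspect the loss function, the labels, or the architecture before blaming the data.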
Mixed Precision Training
Use float16 for forward pass, float32 for gradients. 2x memory reduction, 2-3x speedup on modern GPUs. Add 5 lines: from torch.cuda.amp import autocast, GradScaler. Wrap forward+loss in with autocast():. Completely free speedup.
Free 2x speedup
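The five AMP lines in context — a sketch using the torch.cuda.amp import named above; the enabled flag makes autocast and scaling no-ops on CPU, so the same code runs anywhere:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"               # float16 autocast needs a GPU
model = nn.Linear(20, 2).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = GradScaler(enabled=use_amp)

Xb = torch.randn(32, 20, device=device)
yb = torch.randint(0, 2, (32,), device=device)

optimizer.zero_grad()
with autocast(enabled=use_amp):          # forward + loss in float16
    loss = criterion(model(Xb), yb)
scaler.scale(loss).backward()            # scale loss so fp16 grads don't underflow
scaler.step(optimizer)                   # unscales grads, then optimizer.step()
scaler.update()
```

Newer PyTorch versions also expose this under torch.amp; the mechanics are identical.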
Experiment Tracking — W&B
pip install wandb. wandb.init(project="my-model"). wandb.log({"loss": loss, "acc": acc}). Automatically tracks every metric, saves model checkpoints, lets you compare runs visually. Industry standard alongside MLflow. Never lose results again.
Industry standard
model.train() vs model.eval()
Forgetting this is a very common bug. model.train(): enables dropout, BatchNorm uses batch stats. model.eval(): disables dropout, BatchNorm uses running stats. ALWAYS call model.eval() before validation/inference. ALWAYS call model.train() before the next training epoch.
Common bug · Critical
Interview Q
"Explain backpropagation. Why is it necessary?"
To train a network, we need to know how much each weight contributed to the error — so we can adjust it in the right direction. Backprop computes this efficiently using the chain rule of calculus. In the forward pass, we compute the prediction and loss. In the backward pass, we compute ∂Loss/∂weight for every weight simultaneously, by flowing the gradient backward from the output to the input layer. At each layer: local gradient = gradient from the next layer × derivative of this layer's operation. PyTorch automates this with .backward(). Without backprop, training deep networks would require computing each weight's gradient separately — O(n) forward passes, completely infeasible for millions of weights.
Interview Q
"What is the vanishing gradient problem and how do you solve it?"
In deep networks using sigmoid/tanh activations, gradients shrink at each layer during backprop because these functions have derivatives much less than 1 over most of their range. After 10+ layers, the gradient signal reaching early layers is near zero — those layers barely update and don't learn. Solutions: (1) ReLU activation — its derivative is either 0 or 1, so no shrinking. (2) Batch Normalization — keeps activations in a healthy range. (3) Residual connections — the gradient can flow directly through skip connections, bypassing potentially vanishing paths. Modern architectures (ResNet, Transformer) use all three.
🧠 Deep Learning Mastery Checkpoint

You've mastered DL fundamentals when you can:

  • Explain backpropagation using the chain rule, without notes
  • Build and train an MLP in PyTorch from scratch — including the 6-step loop
  • Explain vanishing gradients and name 3 solutions
  • Build a CNN with Conv2d → BatchNorm2d → ReLU → MaxPool blocks
  • Explain Q, K, V in self-attention using the database analogy
  • Fine-tune a pretrained HuggingFace model on custom text data
  • Debug a NaN loss — enumerate 5 possible causes in order
  • Know when to use an MLP, CNN, LSTM, or Transformer for a given task