Complete Deep Learning Curriculum

Deep Learning
From Neuron to Transformer

Every concept from a single perceptron to modern Transformers — with animated visualizations, full PyTorch code, and interview prep. No topic skipped. Builds directly on your Python + NumPy knowledge.

14 Modules

PyTorch Code

Live Animations

CNN · LSTM · Transformer

Backprop by Hand

Transfer Learning

Module 01

The Neuron & Perceptron

Everything in deep learning descends from one idea: a neuron receives inputs, multiplies each by a learned weight, sums them, passes through an activation function, and fires an output. Stack millions of these and you get GPT-4.

Single Neuron — Animated Forward Pass Interactive

z = w₁x₁ + w₂x₂ + w₃x₃ + b. Then output = σ(z). Watch the signal propagate and see z and activation computed live.

Weighted sum: z = Σ wᵢxᵢ + b = w·x + b Activation: a = f(z) where f is ReLU, Sigmoid, tanh... Output: ŷ = a (prediction)

Inputs (x)

The features of one data sample. 20 features → 20 inputs. Each pixel in an image is an input. Each embedding dimension in NLP is an input. Your NumPy X matrix rows become the input to neurons.

x ∈ ℝⁿ

Weights (w) & Bias (b)

Weights: learnable, one per input. Positive = input pushes output up, negative = pushes down. Bias: one extra learnable number that shifts the threshold. Without bias, decision boundary must pass through origin.

Learned by backpropRandom init

From Neuron → MLP → Deep Network

One neuron = one linear boundary. Stack neurons in a layer = multiple boundaries. Stack layers = hierarchical features. Layer 1: edges. Layer 2: shapes. Layer 3: objects. Depth creates abstraction.

Universal Approximator

Why Not Just One Layer?

Without hidden layers + activations, stacking is pointless — n linear functions is still one linear function. The XOR problem cannot be solved with a single-layer network. Non-linearity from activation functions is what makes depth meaningful.

Key Insight

Module 02

Activation Functions

Activations introduce non-linearity — without them, depth is meaningless. Each has different properties. Knowing which to use where is one of the most practical DL skills.

Activation Functions — Visualized Switch between them

ReLU: max(0,z). Simple, fast. Default for hidden layers. Can cause "dying neurons" if weights push all inputs negative.

Function	Formula	Range	Use When	Watch Out
ReLU	max(0, z)	[0, ∞)	Hidden layers — default choice	Dead neurons if z always < 0
Sigmoid	1/(1+e⁻ᶻ)	(0, 1)	Binary output layer only	Vanishing gradients in deep nets
Tanh	(eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ)	(−1, 1)	Hidden layers, RNNs	Vanishing gradients
Leaky ReLU	max(0.01z, z)	(−∞, ∞)	When ReLU causes dead neurons	Extra hyperparameter α
GELU	z·Φ(z)	(−∞, ∞)	Transformers (BERT, GPT)	Slightly slower to compute
Softmax	eᶻᵢ/Σeᶻⱼ	(0,1) sum=1	Multi-class output layer only	Only at final layer!

💡 Golden Rules — Memorize These

Hidden layers → ReLU (default). Binary output → Sigmoid. Multi-class output → Softmax. Regression output → no activation (linear). Transformers → GELU. These 5 rules cover 99% of cases.

Module 03

Forward Pass & Loss Functions

The forward pass computes the prediction. The loss function measures how wrong the prediction is — a single number that summarizes the error across the whole batch. Backprop then minimizes this number.

Data Flows Through the Network Animated Forward Pass

Input → Linear → ReLU → Linear → ReLU → Output → Loss. Each layer transforms the data. The loss is computed at the end.

Binary CE: L = −[y·log(ŷ) + (1−y)·log(1−ŷ)] Categorical CE: L = −Σ yᵢ·log(ŷᵢ) (one-hot y, softmax ŷ) MSE (regression):L = (1/n)·Σ(yᵢ − ŷᵢ)² MAE: L = (1/n)·Σ|yᵢ − ŷᵢ| (robust to outliers)

Binary Cross-Entropy

y=1 and ŷ=0.01 → huge loss. y=1 and ŷ=0.99 → tiny loss. Penalizes confident wrong predictions exponentially harder than uncertain ones. Always use with Sigmoid output. In PyTorch: nn.BCEWithLogitsLoss (more numerically stable).

Binary (0 or 1)

CrossEntropyLoss in PyTorch

nn.CrossEntropyLoss combines LogSoftmax + NLLLoss. Pass RAW LOGITS (no softmax applied!). Expects integer class labels (not one-hot). This is the most common mistake: applying softmax before CrossEntropyLoss.

Multi-classNo softmax before!

MSE vs MAE

MSE squares errors → penalizes large errors much more → sensitive to outliers. MAE takes absolute value → treats all errors linearly → robust to outliers. Use MSE by default; switch to MAE if your target has outliers.

Regression

Module 04

Backpropagation

Backprop is THE algorithm of deep learning. It uses the chain rule of calculus to compute how much each weight contributed to the loss — efficiently, for millions of weights simultaneously. PyTorch automates it, but you must understand it.

Gradient Flows Backward Through Layers Step by Step

Click to flow gradients backward

Red glow = gradient signal flowing backward. Each layer: receives ∂L/∂output, computes ∂L/∂weights via chain rule, passes ∂L/∂input to previous layer.

🔗 Chain Rule — The Core

∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w. Each term is a simple derivative. Multiply them together = gradient for any weight at any depth. For a 100-layer network, the chain has 100 links. PyTorch tracks the computation graph automatically.

Vanishing Gradients

Sigmoid/tanh saturate → derivatives near 0 → gradient shrinks exponentially as it flows back through layers → early layers barely update → network doesn't learn deep features. Fix: ReLU, batch normalization, residual connections.

Deep network problemReLU fixes

Exploding Gradients

Gradients multiply at each layer → can grow exponentially → NaN weights. Especially in RNNs. Fix: gradient clipping (cap gradient norm at a max value, e.g. 1.0). One line: nn.utils.clip_grad_norm_(model.parameters(), 1.0).

RNN problemClip gradients

PyTorch Autograd

PyTorch builds a dynamic computational graph on every forward pass. loss.backward() traverses this graph in reverse, computing all gradients via chain rule automatically. tensor.grad holds the result. You never compute ∂L/∂w by hand.

Automatic differentiation

04_autograd_demo.py

python

import torch

# Simple example: compute gradient of L = (wx - y)²
w = torch.tensor([2.0], requires_grad=True)   # track gradients
x = torch.tensor([3.0])
y = torch.tensor([9.0])

# Forward pass — builds computation graph
pred = w * x
loss = (pred - y) ** 2

# Backward pass — computes all gradients via chain rule
loss.backward()

print(w.grad)    # → tensor([-18.]) = 2*(wx-y)*x = 2*(6-9)*3 = -18
# Manual check: dL/dw = 2(wx-y)·x = 2(2·3-9)·3 = 2(-3)(3) = -18 ✓

# ─── In a real network ────────────────────────────────────
optimizer.zero_grad()   # clear gradients from last step
output = model(X_batch)  # forward pass — builds graph
loss = criterion(output, y_batch)
loss.backward()          # backward — computes all .grad attributes
optimizer.step()         # use .grad to update weights

Module 05

Optimizers

The optimizer takes the gradients computed by backprop and decides how to update each weight. SGD is the foundation. Adam is the default that just works. AdamW is what modern LLMs use.

Optimizer Paths on a Loss Surface Compare convergence

Adam finds the minimum fast and reliably. Momentum accelerates SGD. Plain SGD is noisy but can generalize better with careful tuning.

Optimizer	Key Idea	Default Params	Use For
SGD	w -= lr · grad	lr=0.01	Computer vision with fine-tuning; needs careful lr schedule
SGD + Momentum	Accumulates velocity in consistent gradient directions	momentum=0.9	Most CV training; faster than plain SGD
Adam	Adaptive per-weight learning rate using gradient moments	lr=1e-3, β₁=0.9, β₂=0.999	Default for most DL; NLP; quick experiments
AdamW	Adam with decoupled weight decay (L2 applied correctly)	lr=1e-4, wd=0.01	Transformers, LLMs — BERT/GPT use this
RMSProp	Adaptive lr using running average of squared gradients	lr=1e-3, α=0.99	RNNs; non-stationary problems

💡 Learning Rate Schedulers

Don't use a fixed learning rate. Start higher, decay over time. CosineAnnealingLR — smoothly decays, great default. ReduceLROnPlateau — reduces when val loss stops improving. OneCycleLR — warm up then cool down, best for fast convergence.

Module 06

PyTorch Tensors & Autograd

PyTorch is the primary DL framework used by most researchers and companies. Tensors are like NumPy arrays — identical API — but they can live on GPU and track gradients automatically. All your NumPy knowledge transfers directly.

06_pytorch_fundamentals.py

python

import torch
import torch.nn as nn
import numpy as np

# ─── Tensors — identical to NumPy ndarray ─────────────────
x = torch.tensor([1.0, 2.0, 3.0])
X = torch.randn(64, 20)           # batch=64, features=20
W = torch.zeros(20, 10)            # weight matrix
I = torch.eye(5)                   # identity matrix

# Same NumPy operations work on PyTorch tensors:
X @ W                               # matrix multiply (forward pass!)
X.mean(dim=0)                     # mean per feature
X.sum(); X.max(); X.abs()
X.reshape(-1); X.T; X[:10]

# ─── GPU — one line to move everything ────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
X = X.to(device)     # move tensor to GPU (or stay on CPU)
model = model.to(device)

# ─── NumPy interop ────────────────────────────────────────
arr = np.array([1, 2, 3], dtype=np.float32)
t = torch.from_numpy(arr)          # shared memory!
t.detach().cpu().numpy()           # safe Tensor → NumPy

# ─── DataLoader — automatic batching ──────────────────────
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(
    torch.FloatTensor(X_train),
    torch.LongTensor(y_train)        # LongTensor for class labels
)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

for X_batch, y_batch in loader:    # X_batch: (64, 20), y_batch: (64,)
    pass                             # your training code here

Module 07

Building an MLP in PyTorch

nn.Module is the base class for every PyTorch model — from a 2-layer MLP to GPT-4. The pattern: define layers in __init__, define the computation in forward(). That's it.

MLP Architecture — Layer by Layer Visual

Input layer → hidden layers (Linear + BN + ReLU + Dropout) → output layer. Each circle is a neuron. Each line is a weight. The network learns by adjusting all weights simultaneously.

07_mlp_pytorch.py

python

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, dropout=0.3):
        super().__init__()                # ALWAYS call this first
        layers = []
        prev = in_dim
        for h in hidden:
            layers += [
                nn.Linear(prev, h),        # y = xW^T + b
                nn.BatchNorm1d(h),          # normalize activations
                nn.ReLU(),                  # non-linearity
                nn.Dropout(dropout),        # regularization
            ]
            prev = h
        layers.append(nn.Linear(prev, out_dim))  # output (no activation!)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)               # Sequential handles the rest

# Instantiate
model = MLP(in_dim=20, hidden=[128, 64, 32], out_dim=2, dropout=0.3).to(device)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")

# Weight initialization — He (Kaiming) for ReLU networks
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)
model.apply(init_weights)

💡 Weight Initialization Matters

Random normal initialization causes vanishing/exploding activations in deep networks. Xavier/Glorot: for sigmoid/tanh. He/Kaiming: for ReLU (default in PyTorch's nn.Linear). Correct init means the forward pass starts with reasonable activations — training converges much faster.

Module 08

The Complete Training Loop

The training loop is the same for every DL model — from MLP to GPT. Understand each line. The most common bugs all live here.

08_training_loop.py — Production quality

python

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_val, history = float('inf'), {'train':[], 'val':[], 'acc':[]}

for epoch in range(100):

    # ═══ TRAIN ══════════════════════════════════════════════
    model.train()                      # enables dropout + BN train mode
    train_loss = 0.0
    for Xb, yb in train_loader:
        Xb, yb = Xb.to(device), yb.to(device)

        optimizer.zero_grad()          # 1. CLEAR old gradients ← never forget
        logits = model(Xb)              # 2. Forward pass
        loss = criterion(logits, yb)    # 3. Compute loss
        loss.backward()               # 4. Backprop — computes all grads
        nn.utils.clip_grad_norm_(      # 5. Clip gradients (optional)
            model.parameters(), 1.0)
        optimizer.step()              # 6. Update weights with grads
        train_loss += loss.item()

    # ═══ VALIDATE ═══════════════════════════════════════════
    model.eval()                       # disables dropout; BN uses running stats
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():              # no graph built — 4x less memory
        for Xv, yv in val_loader:
            Xv, yv = Xv.to(device), yv.to(device)
            logits = model(Xv)
            val_loss += criterion(logits, yv).item()
            correct += (logits.argmax(1) == yv).sum().item()
            total += yv.size(0)

    scheduler.step()
    if val_loss < best_val:
        best_val = val_loss
        torch.save(model.state_dict(), 'best.pt')   # save best checkpoint

    if epoch % 10 == 0:
        print(f"[{epoch:3d}] train={train_loss/len(train_loader):.4f} val={val_loss/len(val_loader):.4f} acc={correct/total:.3f}")

✅ The 6 Steps — Tattoo These

1. zero_grad() — clear old gradients (miss this and gradients accumulate — a very common bug). 2. Forward pass. 3. Compute loss. 4. loss.backward(). 5. (Optional) clip gradients. 6. optimizer.step(). Every DL training loop you ever write follows this exact sequence.

Module 09

Regularization — Preventing Overfitting

Deep networks can memorize millions of training examples perfectly. These techniques force the network to learn generalizable patterns instead of memorizing noise.

Dropout

Randomly zeroes p% of neurons each forward pass during training. Forces redundancy — network can't rely on any single neuron. Acts like training an ensemble of 2ⁿ subnetworks. p=0.2–0.5. Always disabled with model.eval().

Most commonnn.Dropout(0.3)

Batch Normalization

Normalizes each layer's inputs to mean 0, std 1 within the mini-batch. Allows much higher learning rates, dramatically stabilizes training. Adds 2 learnable params (γ, β) per feature. Goes between Linear and activation.

Use alwaysnn.BatchNorm1d

Weight Decay (L2)

Adds λΣw² to the loss → pushes all weights toward zero. In PyTorch: weight_decay argument in AdamW. Always use AdamW (not Adam) for this — Adam's L2 implementation is subtly incorrect.

Simpleweight_decay=1e-4

Early Stopping

Monitor val loss. Stop training when it hasn't improved for N epochs (patience). Save the best checkpoint. Free regularization — always do it. One of the most effective techniques for preventing overfitting.

FreeAlways use

Data Augmentation

Create modified copies of training data: flip, rotate, crop, color jitter (images); synonym replacement, back-translation (text); adding noise, time warping (audio/time series). Most effective regularization when data is limited.

Best for small data

Layer Normalization

Like BatchNorm but normalizes across features within each sample (not across batch). Works for batch_size=1. Required for Transformers — BatchNorm doesn't work with variable-length sequences. nn.LayerNorm in PyTorch.

Transformers onlynn.LayerNorm

Module 10

Convolutional Neural Networks (CNNs)

CNNs exploit the structure of spatial data: nearby pixels are related, and the same pattern (edge, texture) can appear anywhere in an image. Convolution shares weights across all positions — far more efficient than a fully connected layer.

Convolution — Filter Sliding Over Input Animated

A 3×3 filter slides across the image. At each position: element-wise multiply with the covered patch, sum → one output value. The filter's weights are learned to detect specific features (edges, curves, textures).

10_cnn_pytorch.py

python

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: (3,224,224) → (32,112,112)
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # same padding
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                            # halve spatial dims
            # Block 2: → (64, 56, 56)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: → (128, 4, 4) via adaptive pooling
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),              # always (128,4,4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                              # (128,4,4) → (2048,)
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )
    def forward(self, x): return self.classifier(self.features(x))

# Output size formula: out = (in + 2*pad - kernel) / stride + 1
# 224 + 2*1 - 3 / 1 + 1 = 224  ← "same" padding keeps spatial size

MaxPool vs AvgPool

MaxPool: take maximum in each window. Detects presence of a feature regardless of exact position (spatial invariance). AvgPool: average — smoother. AdaptiveAvgPool(size): always produces a fixed output size regardless of input — very useful.

Downsampling

Residual Connections (ResNet)

y = F(x) + x. Add input directly to the output ("skip connection"). Prevents vanishing gradients in very deep networks. Enables training of 50–200+ layer networks. Every modern architecture uses this. LeNet→AlexNet→VGG→ResNet was the progression.

Modern standard

Using Pretrained CNNs

torchvision.models.resnet50(weights='IMAGENET1K_V1'). Replace model.fc with nn.Linear(2048, your_classes). Freeze backbone for small datasets, fine-tune for larger. State-of-the-art in 5 lines of code.

Fastest path to results

Module 11

RNNs, LSTMs & GRUs

Designed for sequential data — text, time series, audio. Process one timestep at a time, maintaining a hidden state that acts as memory. LSTMs and GRUs solve the vanishing gradient problem of vanilla RNNs.

LSTM Unrolled — How Hidden State Propagates Visual

Each timestep receives the input token and the previous hidden state. The LSTM gates decide what to remember, what to forget, and what to output.

Vanilla RNN

h_t = tanh(W_h·h_{t-1} + W_x·x_t + b). Hidden state carries context. Problem: vanishing gradients → forgets dependencies beyond ~10 timesteps. The weight matrix is multiplied at every step → gradients either vanish or explode over long sequences.

Short memory

LSTM

3 gates: Forget (what to erase from memory), Input (what new info to write), Output (what to read). Separate cell state C_t can carry information across hundreds of timesteps unchanged. The standard choice for sequence modeling.

Long sequencesnn.LSTM

GRU

Simplified LSTM: 2 gates (Reset, Update). Fewer parameters → faster training. Similar performance to LSTM in most tasks. Good choice when LSTM overfits or is too slow. nn.GRU in PyTorch.

Efficientnn.GRU

Bidirectional RNN

Process sequence both forward and backward, concatenate hidden states. Each token sees both past AND future context. Essential for understanding tasks (classification, NER). Set bidirectional=True in nn.LSTM. Double the hidden_dim for output.

bidirectional=True

11_lstm_classifier.py

python

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden, n_classes, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, n_layers,
                            batch_first=True,          # (batch, seq, feat)
                            dropout=0.3,
                            bidirectional=True)
        self.fc = nn.Linear(hidden * 2, n_classes)   # *2 for bidirectional

    def forward(self, x):               # x: (batch, seq_len)
        emb = self.embed(x)             # (batch, seq_len, emb_dim)
        out, (h_n, c_n) = self.lstm(emb)
        # h_n: (n_layers*2, batch, hidden) — final hidden states
        h = torch.cat([h_n[-2], h_n[-1]], dim=1)   # concat fwd + bwd
        return self.fc(h)               # (batch, n_classes)

Module 12

Transformers & Self-Attention

The architecture behind GPT-4, Claude, BERT, and every state-of-the-art model. Instead of processing sequentially like RNNs, Transformers attend to ALL positions simultaneously with learned attention weights — enabling massive parallelism and long-range dependencies.

Self-Attention — Tokens Attending to Each Other Animated

Each token creates Q, K, V vectors. Bright lines = high attention weight.

Q (Query) × K (Key) → similarity scores → softmax → attention weights → weighted sum of V (Values) = context-aware output for each token.

Attention(Q,K,V) = softmax( QKᵀ / √d_k ) · V Where: Q = XW_Q (queries) K = XW_K (keys) V = XW_V (values) d_k = dimension of keys (scaling prevents large dot products)

Q, K, V — The Database Analogy

Q = "what am I looking for?" K = "what do I contain?" V = "what do I return?" Each token queries all other tokens. High Q·K similarity → high attention weight → token's V contributes more to the output. This is how "cat" attends to "sat" in "the cat sat."

Core mechanism

Multi-Head Attention

Run h parallel attention heads with different W_Q, W_K, W_V projections. Each head can attend to different relationship types: syntax, semantics, coreference, etc. Concatenate all heads, project with W_O. num_heads is a key hyperparameter.

Parallel attention

Positional Encoding

Attention is order-agnostic — "cat sat" = "sat cat" without positional info. Add positional vectors to token embeddings. Sinusoidal (original paper). Learned (GPT). RoPE — Rotary (modern LLMs). Without this, the model can't learn sequence order.

Required

Encoder vs Decoder

Encoder (BERT): bidirectional attention, sees full sequence. For understanding: classification, NER, Q&A. Decoder (GPT): causal/masked attention, only sees past tokens. For generation. Encoder-Decoder (T5, BART): for translation/summarization.

Architecture choice

12_transformer_block.py — From scratch implementation

python

import math
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        def split_heads(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split_heads(self.W_q(x)), split_heads(self.W_k(x)), split_heads(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None: scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, T, C)
        return self.W_o(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff   = nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(), nn.Linear(ff_dim, d_model))
        self.ln1  = nn.LayerNorm(d_model)
        self.ln2  = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.drop(self.attn(self.ln1(x)))   # residual + attention
        x = x + self.drop(self.ff(self.ln2(x)))     # residual + feedforward
        return x

Module 13

Transfer Learning & Fine-Tuning

Training large models from scratch requires weeks and millions of dollars. Transfer learning: take a model pretrained on massive data, adapt it to your task. This is how 90% of real-world DL projects work.

Feature Extraction

Freeze all pretrained weights. Only train a new output head. Fast, cheap, works when your data is similar to pretrained data. Use when you have <1000 samples. PyTorch: for param in model.parameters(): param.requires_grad = False. Then unfreeze the head only.

Frozen backboneFast

Fine-Tuning

Unfreeze some or all pretrained layers. Train end-to-end with small lr (1e-4 to 1e-5). Unfreeze gradually from the top. Discriminative lr: lower rate for early layers (they're already good), higher for later layers. Best performance on sufficient data.

Best performanceNeeds more data

HuggingFace Transformers

The industry library for pretrained NLP. AutoModel, BertForSequenceClassification, GPT2LMHeadModel. Thousands of pretrained models. Fine-tune BERT on your text in ~20 lines. You used sentence-transformers in your RAG system — that's HuggingFace.

Industry standard

torchvision Pretrained CNNs

torchvision.models.resnet50(weights='IMAGENET1K_V1'). Replace model.fc. Freeze and train head. 10 lines to state-of-the-art on custom image data. Also: EfficientNet, ViT, ConvNeXt — all available in torchvision.

Vision

13_finetune_bert.py

python

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# ─── Load pretrained BERT ──────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2    # replace classification head
).to(device)

# ─── Tokenize ─────────────────────────────────────────────
texts = ["Great product!", "Terrible experience"]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="pt")

# ─── Discriminative learning rates ────────────────────────
optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},   # early
    {"params": model.bert.encoder.parameters(),    "lr": 2e-5},   # middle
    {"params": model.classifier.parameters(),      "lr": 5e-5},   # head
])

# ─── Training step ────────────────────────────────────────
labels = torch.tensor([1, 0])
outputs = model(**inputs.to(device), labels=labels.to(device))
loss = outputs.loss       # HuggingFace computes loss for you
preds = outputs.logits.argmax(dim=1)

Module 14

Practical DL — Debugging & Production

The gap between "model trains" and "model works well" is all here. These are hard-won lessons that only come from debugging real models — read them carefully.

Loss is NaN — Debug Checklist

Check in order: (1) Learning rate too high. (2) Forgot zero_grad(). (3) log(0) in loss — add 1e-8 epsilon. (4) Exploding gradients — add clip_grad_norm_. (5) Wrong loss function — CrossEntropyLoss needs raw logits, NOT softmax output. (6) NaN in input data.

Most common crash

Model Not Learning

(1) Learning rate too low — try 10x higher. (2) Data not shuffled. (3) Labels wrong — check class indices (0-indexed). (4) Model too small — add layers. (5) Not training long enough. (6) Bug in data pipeline — visualize batches.

Common issue

Sanity Check — Overfit One Batch

Before full training: take one batch, train on it for 100 steps. Loss should reach near-zero. If it doesn't, your model has a bug (wrong loss, wrong labels, architecture issue). This 2-minute test saves hours of wasted training.

Do this always

Mixed Precision Training

Use float16 for forward pass, float32 for gradients. 2x memory reduction, 2-3x speedup on modern GPUs. Add 5 lines: from torch.cuda.amp import autocast, GradScaler. Wrap forward+loss in with autocast():. Completely free speedup.

Free 2x speedup

Experiment Tracking — W&B

pip install wandb. wandb.init(project="my-model"). wandb.log({"loss": loss, "acc": acc}). Automatically tracks every metric, saves model checkpoints, lets you compare runs visually. Industry standard alongside MLflow. Never lose results again.

Industry standard

model.train() vs model.eval()

Forgetting this is a very common bug. model.train(): enables dropout, BatchNorm uses batch stats. model.eval(): disables dropout, BatchNorm uses running stats. ALWAYS call model.eval() before validation/inference. ALWAYS call model.train() before the next training epoch.

Common bugCritical

Interview Q

"Explain backpropagation. Why is it necessary?"

To train a network, we need to know how much each weight contributed to the error — so we can adjust it in the right direction. Backprop computes this efficiently using the chain rule of calculus. In the forward pass, we compute the prediction and loss. In the backward pass, we compute ∂Loss/∂weight for every weight simultaneously, by flowing the gradient backward from output to input layer. At each layer: local gradient = gradient from next layer × derivative of this layer's operation. PyTorch automates this with .backward(). Without backprop, training deep networks would require computing each weight's gradient separately — O(n) forward passes, completely infeasible for millions of weights.

Interview Q

"What is the vanishing gradient problem and how do you solve it?"

In deep networks using sigmoid/tanh activations, gradients shrink at each layer during backprop because these functions have derivatives much less than 1 in most of their range. After 10+ layers, the gradient signal reaching early layers is near zero — those layers barely update and don't learn. Solutions: (1) ReLU activation — derivative is either 0 or 1, no shrinking. (2) Batch Normalization — keeps activations in a healthy range. (3) Residual connections — gradient can flow directly through skip connections, bypassing potentially vanishing paths. Modern architectures (ResNet, Transformer) use all three.

🧠 Deep Learning Mastery Checkpoint

You've mastered DL fundamentals when you can:

Explain backpropagation using the chain rule, without notes
Build and train an MLP in PyTorch from scratch — including the 6-step loop
Explain vanishing gradients and name 3 solutions
Build a CNN with Conv2d → BatchNorm2d → ReLU → MaxPool blocks
Explain Q, K, V in self-attention using the database analogy
Fine-tune a pretrained HuggingFace model on custom text data
Debug a NaN loss — enumerate 5 possible causes in order
Know when to use MLP, CNN, LSTM, or Transformer for a given task

Deep LearningFrom Neuron to Transformer

The Neuron & Perceptron

Activation Functions

Forward Pass & Loss Functions

Backpropagation

Optimizers

PyTorch Tensors & Autograd

Building an MLP in PyTorch

The Complete Training Loop

Regularization — Preventing Overfitting

Convolutional Neural Networks (CNNs)

RNNs, LSTMs & GRUs

Transformers & Self-Attention

Transfer Learning & Fine-Tuning

Practical DL — Debugging & Production

You've mastered DL fundamentals when you can:

Deep Learning
From Neuron to Transformer