Every topic from the iSCALE syllabus — plus all the critical ones that were missing. Deep explanations, interactive visualizations, real code, and interview answers at every step.
8 Modules · 60+ Topics · 12 Live Visualizations · ∞ Depth
Module 01
Mathematical Foundations
ML is applied math. You don't need a PhD — but you need to understand why gradient descent works, what an eigenvector means intuitively, and why probabilities matter. This module covers only what you'll actually use.
⚠️ Don't Skip This
Every algorithm in this guide has a mathematical core. Understanding the math is what separates someone who uses sklearn from someone who can debug, tune, and explain it in interviews.
The ball rolls down the loss curve. Each step = θ := θ − α·∇L(θ). Watch it converge to the minimum.
Linear Algebra — Matrices & Vectors
Your dataset IS a matrix. Rows = samples, columns = features. Matrix multiplication is how weights are applied. Dot product measures similarity (used in cosine similarity for RAG!).
In Slides · Core
Calculus — Derivatives & Gradients
A gradient points in the direction of steepest ascent. We go the opposite direction to minimize loss. This single idea trains every neural network, logistic regression, and SVM.
In Slides · Core
Probability & Bayes' Theorem
P(A|B) = P(B|A)·P(A) / P(B). This is Naive Bayes, Bayesian Optimization, and the probabilistic interpretation of logistic regression — all from one formula.
In Slides · Core
Probability Distributions
Gaussian: natural phenomena. Bernoulli: coin flip / binary classification. Multinomial: word counts / Naive Bayes. Poisson: event counts. Your data follows one of these — knowing which changes everything.
In Slides
Eigenvalues & Eigenvectors
A vector that only scales — doesn't rotate — when a transformation is applied. PCA finds eigenvectors of the covariance matrix. These are the directions of maximum variance in your data.
In Slides · PCA Key
Information Theory — Entropy
H(X) = −Σ p(x)log p(x). Measures uncertainty. Decision trees use Information Gain (drop in entropy) to pick splits. Cross-entropy is the loss function for every classifier.
Not in Slides! · Interview
Common Probability Distributions
Recognizing your data's distribution tells you which model, imputation method, and transformation to use.
"Explain gradient descent intuitively. What is the learning rate?"
Imagine you're blindfolded on a hilly landscape trying to find the lowest point. At each step, you feel the slope under your feet and take a step in the downhill direction. The learning rate α is your step size. Too large → overshoot the minimum and diverge. Too small → reach the minimum very slowly. The gradient tells you the direction and slope; learning rate tells you how far to step.
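The update rule θ := θ − α·∇L(θ) fits in a few lines of plain Python (a minimal sketch; the quadratic objective is an invented example):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat theta := theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2*theta (minimum at 0)
theta_min = gradient_descent(lambda t: 2 * t, theta0=5.0)
print(round(theta_min, 6))  # 0.0 (converged to the minimum)
```

Try lr=1.1 in the same sketch and the iterates blow up: that is the "too large → diverge" failure mode from the answer above.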
Module 02
Data Engineering & Preprocessing
In real projects, 70–80% of time is here — not on models. A well-preprocessed dataset beats a fancy model on messy data every time. This is where real ML skill shows.
Feature Scaling — Why It Matters
Without scaling, KNN and SVM are dominated by the feature with the largest range (e.g., salary in thousands vs age in tens).
Data Imputation
Mean: numeric, symmetric data. Median: numeric, skewed data (robust to outliers). Mode: categorical. KNN Imputer: uses K nearest neighbors — smartest option. Always impute AFTER splitting to avoid leakage.
In Slides
Outlier Detection
Z-Score: outlier if |z| > 3. IQR: outlier if x < Q1−1.5·IQR or x > Q3+1.5·IQR. Isolation Forest: best for high dimensions. Box plots are your first tool — always look at them.
In Slides
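The IQR rule takes a few lines of numpy (a minimal sketch; the sample values are invented):

```python
import numpy as np

def iqr_outliers(x):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

x = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is the obvious outlier
print(x[iqr_outliers(x)])  # [95]
```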
Feature Scaling
Min-Max: x'=(x−min)/(max−min) → [0,1]. Standard Scaler: x'=(x−μ)/σ → mean 0, std 1. Required for: KNN, SVM, Neural Nets, PCA. NOT required for: Random Forest, XGBoost, Decision Trees.
In Slides · Critical
Feature Transformation
Log Transform: fixes right skew (income, house prices). Box-Cox: generalized log. Square root: moderate skew. These make distributions more Gaussian, helping linear models dramatically.
In Slides
Handling Imbalanced Data
SMOTE: creates synthetic minority samples by interpolating between real ones. Undersampling: removes majority samples. class_weight='balanced' in sklearn adjusts loss automatically. ROC/F1, not accuracy, for imbalanced data.
In Slides · Real-World
Dimensionality Reduction
PCA: unsupervised, maximizes variance, use for preprocessing. LDA: supervised, maximizes class separation, use before classifiers. t-SNE/UMAP: visualization ONLY — not for modeling features.
In Slides
Encoding Categorical Features
Label Encoding: ordinal data (small<med<large). One-Hot Encoding: nominal data (city, color) — creates binary columns. Target Encoding: mean of target per category — powerful but leaks if not done carefully.
Missing from Slides! · Critical
Data Leakage — The #1 Mistake
Fitting a scaler on ALL data before splitting means test data influenced training — your model is "cheating." Always split first, then fit preprocessors on train only, transform both. Use sklearn Pipeline to enforce this automatically.
Missing from Slides! · Interview
PCA — Finding Principal Components
The first principal component (red arrow) points in the direction of maximum variance. The second is perpendicular to it. PCA rotates your data to align with these directions, then you can drop low-variance components.
preprocessing_pipeline.py
python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer, SimpleImputer
num_pipeline = Pipeline([
    ('impute', KNNImputer(n_neighbors=5)),
    ('scale', StandardScaler())
])

cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# numeric_cols / categorical_cols are your lists of column names
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numeric_cols),
    ('cat', cat_pipeline, categorical_cols)
])

# ↑ This is the CORRECT pattern — preprocessor is fitted ONLY on X_train
# Later: full_pipeline = Pipeline([('prep', preprocessor), ('model', clf)])
Module 03
Supervised Learning: Regression
Predicts a continuous number. Every algorithm is a different strategy for learning the function y = f(X). Start with linear regression as your baseline before trying anything complex.
Regression Models — Compare Fits
Watch how high-degree polynomials perfectly fit training points but oscillate wildly. Ridge regularization tames this by penalizing large coefficients.
Simple & Multiple Linear Regression
y = β₀ + β₁x₁ + ... + βₙxₙ. Minimizes Sum of Squared Errors. Has a closed-form solution: β = (XᵀX)⁻¹Xᵀy. Check assumptions: linearity, no multicollinearity (use VIF), homoscedasticity.
In Slides · Baseline
Polynomial Regression
Adds x², x³ features to capture curves. Use PolynomialFeatures in sklearn. Warning: degree ≥ 5 almost always overfits. Combine with Ridge regularization for safety.
In Slides
Ridge (L2) Regularization
Loss = MSE + λΣβ². Shrinks weights toward zero but never exactly. Handles multicollinearity well. α (sklearn) = λ. Cross-validate α. When features are correlated, always prefer Ridge over plain linear.
In Slides · Must Know
Lasso (L1) Regularization
Loss = MSE + λΣ|β|. Can force weights to EXACTLY zero — built-in feature selection! When you have many irrelevant features, Lasso is better than Ridge. Sparse solutions are interpretable.
In Slides · Feature Selection
SVR — Support Vector Regression
Fits the widest possible ε-insensitive tube around data. Points inside = no penalty. Points outside = penalized. Kernel trick (RBF) handles non-linear patterns. Requires scaling.
In Slides
Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Noise. High bias = underfitting (model too simple). High variance = overfitting (model memorized training data). Regularization, ensemble methods, and more data all help.
Not in Slides! · Interview #1
Ridge: L = Σ(yᵢ − ŷᵢ)² + λΣβⱼ² (L2 — shrinks weights)
Lasso: L = Σ(yᵢ − ŷᵢ)² + λΣ|βⱼ| (L1 — zeroes weights)
R²: 1 − SS_residual / SS_total (1 = perfect, 0 = mean baseline)
RMSE: √(1/n · Σ(yᵢ − ŷᵢ)²) (same units as target)
Interview Q
"What is the difference between Ridge and Lasso? When would you use each?"
Both prevent overfitting by adding a penalty to large weights. Ridge (L2) adds the sum of squared weights — it shrinks all weights proportionally but never to zero. Lasso (L1) adds the sum of absolute weights — it can force some weights to exactly zero, effectively removing features. Use Lasso when you suspect many features are irrelevant and want automatic feature selection. Use Ridge when features are correlated (it handles multicollinearity better). ElasticNet combines both and is a safe default.
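The zeroing behavior is easy to see on synthetic data where only two of five features matter (a sketch; the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # features 2-4 are noise

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))  # all shrunk, none exactly zero
print(np.round(lasso.coef_, 2))  # irrelevant features driven to exactly 0.0
```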
Module 04
Supervised Learning: Classification
Predicts a category. Each algorithm learns a different kind of decision boundary — from a line, to a curve, to a Voronoi region, to a hyperplane. Knowing which to use is the skill.
Decision Boundaries — How Classifiers Think
Logistic = straight line. KNN = jagged, local. SVM RBF = smooth curves. Decision Tree = axis-aligned rectangles. Each shape has different strengths.
Logistic Regression
Uses sigmoid σ(z)=1/(1+e⁻ᶻ) to output probabilities. Trained with cross-entropy loss. Linear decision boundary. Highly interpretable — coefficients tell you feature impact. Best baseline for any classification problem.
In Slides · Always Start Here
Naive Bayes (Gaussian / Multinomial / Bernoulli)
Applies Bayes assuming features are independent. "Naive" because independence is rarely true — but it works surprisingly well. Gaussian: real-valued features. Multinomial: word counts (NLP). Bernoulli: binary features.
In Slides
KNN — K Nearest Neighbors
No training — just memorizes data. Prediction = majority vote of K nearest points (Euclidean distance). Sensitive to scale — ALWAYS scale first. High K = smoother boundary (less overfit). Slow at inference on large datasets.
In Slides
Decision Tree Classification
Splits using Gini Impurity or Information Gain. Gini = 1−Σpᵢ². Deeply interpretable — you can export and read every rule. Tends to overfit without max_depth or min_samples_leaf constraints.
In Slides · Interpretable
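The Gini formula in code (a minimal sketch):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0 (pure node)
print(gini([0, 0, 1, 1]))  # 0.5 (maximally mixed binary node)
```

A tree picks the split whose weighted child impurity drops the most below the parent's.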
SVM — Support Vector Machine
Finds the hyperplane that maximizes the margin between classes. C parameter: large C = less tolerance for misclassification (might overfit). Support vectors are the only training points that matter.
In Slides
Kernel SVM
The kernel trick maps data to a higher dimension — without computing it explicitly — making non-linear data linearly separable. RBF kernel: most common, great default. γ controls how local the influence of each point is.
In Slides · Kernel Trick
Random Forest Classification
Ensemble of trees trained on bootstrap samples with random feature subsets. Final prediction = majority vote. Variance reduction vs single tree. Built-in feature_importances_. First model to try after logistic regression.
In Slides · Go-To
Sigmoid vs Softmax
Sigmoid: binary classification output → single probability [0,1]. Softmax: multi-class → vector of probabilities summing to 1. This difference matters when building neural networks and understanding logistic regression outputs.
Not in Slides! · Neural Net Key
Module 05
Unsupervised Learning
No labels. The algorithm finds structure in data by itself. Used for customer segmentation, anomaly detection, topic modeling, and exploring new datasets before you know what to predict.
Clustering Algorithms Compared
K-Means: fast, assumes spherical clusters, needs K upfront. DBSCAN: finds arbitrary shapes, marks noise as -1, no K needed.
K-Means Clustering
Assign points to nearest centroid, recompute centroids, repeat. Sensitive to initialization → use K-Means++. Sensitive to scale → StandardScaler first. Sensitive to outliers → remove them first or use DBSCAN.
In Slides · Core
Elbow Method & Silhouette Score
Elbow: plot WCSS vs K — pick the "elbow." Silhouette: s = (b−a)/max(a,b), where a = mean intra-cluster distance and b = mean distance to the nearest other cluster. Range [−1,1]. Use BOTH together — elbow identifies candidates, silhouette picks the best.
In Slides
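Both diagnostics in one loop with sklearn (toy blobs; `inertia_` is the WCSS the elbow plot uses):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow: watch WCSS (inertia_) flatten. Silhouette: pick the K that scores highest.
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```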
DBSCAN
Core points: ≥ min_samples neighbors within ε radius. Border points: within ε of a core point. Noise points: labeled −1. Finds arbitrary cluster shapes. Doesn't need K. Two params: ε (try KNN distance plot) and min_samples (try log(N)).
In Slides · Outlier Detection
Hierarchical Clustering
Agglomerative: start with N clusters, merge closest pair at each step. Dendrogram shows the merge history — cut it at any height to get any number of clusters. Ward linkage minimizes within-cluster variance.
In Slides
Gaussian Mixture Models (GMM)
Soft clustering — each point gets a probability of belonging to each cluster. Data modeled as mixture of K Gaussians. Trained with EM algorithm (Expectation-Maximization). More flexible shapes than K-Means.
In Slides
Anomaly Detection
Isolation Forest: anomalies get isolated in fewer random splits (anomaly score = average path length). One-Class SVM: learns a boundary around normal data. LOF: compares local density to neighbors. Used in fraud, network intrusion, manufacturing QC.
In Slides
Association Rules — Apriori & Eclat
Market basket analysis. Support = P(X∩Y). Confidence = P(Y|X). Lift = Confidence/P(Y). Lift > 1 means the items appear together more than by chance. Apriori: breadth-first. Eclat: depth-first with vertical format (faster).
In Slides
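These three metrics can be computed by hand on a toy basket list (items invented for illustration):

```python
# Rule "bread -> milk" on five tiny baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    """Fraction of baskets containing all the given items."""
    return sum(items <= b for b in baskets) / len(baskets)

confidence = support({"bread", "milk"}) / support({"bread"})  # P(milk | bread)
lift = confidence / support({"milk"})  # here < 1: together LESS than chance
print(round(confidence, 3), round(lift, 3))  # 0.667 0.833
```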
Module 06
Ensemble Methods & Boosting
Combine weak learners into a strong one. XGBoost and LightGBM dominate Kaggle competitions and real-world tabular data problems. Understanding these deeply is a superpower.
Boosting — Sequential Error Correction
Round 0 — All weights equal
Each round focuses on the mistakes of the previous. Misclassified points get larger circles (higher weight). The final model is a weighted sum of all rounds.
Bagging & Random Forest
Train N trees on random bootstrap samples. Each tree also uses a random feature subset (sqrt(n_features) for classification). Average predictions. Reduces variance while keeping bias low. Works in parallel — fast to train.
In Slides · Core
AdaBoost
Train a stump → identify misclassified points → upweight them → train next stump on reweighted data. Final prediction = weighted vote of all stumps. Sensitive to noisy data and outliers (they get very high weights).
In Slides
Gradient Boosting (GBM)
Each tree fits the residuals (negative gradient) of the previous ensemble. Learning rate shrinks each tree's contribution, preventing overfitting. The most important hyperparams: n_estimators, learning_rate, max_depth. Lower learning_rate → needs more trees, but generalizes better.
In Slides · Powerful
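The residual-fitting loop can be sketched directly with sklearn's DecisionTreeRegressor (synthetic data; hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Squared loss: the negative gradient IS the residual, so each tree fits residuals
lr, pred = 0.1, np.zeros_like(y)
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += lr * tree.predict(X)  # learning_rate shrinks each tree's contribution

print(round(float(np.mean((y - pred) ** 2)), 4))  # training MSE falls toward the noise floor
```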
XGBoost
Optimized GBM: second-order Taylor expansion, L1+L2 regularization, parallel tree building, column subsampling. Key params: n_estimators, max_depth (3–6), learning_rate (0.01–0.3), subsample, colsample_bytree. Use early_stopping_rounds to avoid overfitting automatically.
In Slides · Industry Standard
LightGBM
Leaf-wise growth (vs level-wise in XGBoost) → more accurate trees but more prone to overfit without tuning. GOSS + EFB make it 10x faster on large datasets. num_leaves is the most important param (more than max_depth).
In Slides · Fast
CatBoost
Handles raw categorical features — just pass them as strings. Ordered boosting reduces overfitting. Symmetric (oblivious) trees → fast inference. Best when you have many high-cardinality categoricals and don't want to encode them.
In Slides
Stacking & Blending
Stacking: Level-0 models make predictions → these become features for Level-1 meta-learner. Blending: simpler — use held-out validation set for meta features. Often adds 0.5–2% on competitions. Risk: complexity and overfitting.
In Slides
Voting Classifier
Hard voting: majority class. Soft voting: average probabilities — usually more accurate when models output calibrated probabilities, because it uses each model's confidence. Combine diverse models (SVM + RF + Logistic) for best results.
In Slides
| Algorithm | Speed | Categoricals | Key Tuning Param | Best For |
|---|---|---|---|---|
| Random Forest | Fast (parallel) | Encode first | n_estimators | General baseline, high dims |
| XGBoost | Medium | Encode first | learning_rate + early_stop | Structured data, competitions |
| LightGBM | Very fast | Limited native | num_leaves | Large datasets (1M+ rows) |
| CatBoost | Medium | Native (best) | iterations + depth | High-cardinality categoricals |
| AdaBoost | Fast | Encode first | n_estimators | Simple, interpretable ensembles |
Module 07
Model Evaluation & Hyperparameter Tuning
The model is only as good as how you measure it. Wrong metrics lead to wrong decisions. This is also where you learn if your model is actually learning or just memorizing.
Confusion Matrix → All Metrics Derived
Every classification metric comes from 4 numbers: TP, TN, FP, FN. Each cell plays a role in Precision, Recall, F1, and Accuracy.
Accuracy, Precision, Recall, F1
Accuracy = (TP+TN)/N — useless for imbalanced. Precision = TP/(TP+FP) — "how often are positive predictions correct?" Recall = TP/(TP+FN) — "how many positives did we catch?" F1 = harmonic mean of both.
In Slides · Core
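All four metrics from the four cells (the counts are an invented example):

```python
def classification_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# 80 frauds caught, 900 clean passed, 20 false alarms, 10 frauds missed
p, r, f1, acc = classification_metrics(tp=80, tn=900, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))  # 0.8 0.889 0.842 0.97
```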
ROC Curve & AUC
Plots TPR vs FPR at every threshold. AUC = 0.5 is random; AUC = 1.0 is perfect. Model-comparison tool — works regardless of threshold choice. For very imbalanced data, use Precision-Recall curve instead.
Not in Slides! · Must Know
K-Fold & Stratified K-Fold
K-Fold splits into K folds, trains K times (each fold as test once), averages scores. Stratified preserves class ratio in each fold — always use for classification. K=5 or K=10 are standard choices.
In Slides
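A minimal stratified cross-validation sketch (synthetic imbalanced data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 80/20 imbalanced synthetic data; stratification keeps that ratio in every fold
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.round(3), round(float(scores.mean()), 3))  # 5 fold scores and their average
```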
GridSearchCV & RandomizedSearchCV
GridSearch: exhaustive, guaranteed to find best in grid, O(n_combos × K) fits. RandomizedSearch: random sample, often finds near-optimal at 10% cost. For large search spaces always use Randomized. n_iter controls budget.
In Slides
Bayesian Optimization (Optuna)
Builds a surrogate model of the objective function. Each trial chooses params where improvement is most likely. Far more efficient than random search. Industry standard for expensive models. pip install optuna — 5 lines to use.
Not in Slides! · Industry
Learning Curves
Plot train and val scores vs training set size. Both low → underfit (need better model). Large gap → overfit (need more data or regularization). Converging → more data won't help. Essential diagnostic before any optimization.
Not in Slides!
ROC Curve — Compare 3 Models
The closer the curve hugs the top-left corner, the better. AUC = area under the curve. The diagonal is a random classifier (AUC=0.5). Always compare models with AUC, not accuracy.
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
F1 Score: F1 = 2·P·R / (P+R)
Accuracy: A = (TP + TN) / (TP + TN + FP + FN)
R² Score: R² = 1 − SS_res / SS_tot
Interview Q
"You have 99% fraud-free data. Your model gets 99% accuracy. Is it good?"
No — a model that predicts "not fraud" for every single transaction would achieve 99% accuracy. Accuracy is meaningless on imbalanced datasets. You should look at Precision and Recall for the minority class (fraud). If recall for fraud is 0%, the model is useless despite 99% accuracy. Use F1-score or AUC-ROC for imbalanced problems, and consider using SMOTE or class_weight='balanced'.
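The trap is easy to demonstrate with sklearn's DummyClassifier (synthetic 99:1 data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 990 + [1] * 10)  # 99% clean, 1% fraud
X = np.zeros((1000, 1))             # features are irrelevant to the dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))  # 0.99 (looks great)
print(recall_score(y, pred))    # 0.0  (catches zero fraud)
```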
Module 08
Critical Topics NOT in the Slides
The syllabus images covered classical ML well. But these topics were absent — and every real ML role will need them. Do not skip this module.
⚠️ 16 Topics Missing from Your Syllabus
These were not visible in either slide image but are required for real-world ML work:
SHAP Explainability: required in regulated industries
sklearn Pipelines: production-grade code pattern
Bayesian Optimization: Optuna, industry-standard tuning
Time Series ML: ARIMA, Prophet, lag features
NLP Basics: TF-IDF, embeddings, BERT
Model Deployment: FastAPI + Docker (you know this!)
Experiment Tracking: MLflow, Weights & Biases
Categorical Encoding: Label vs OHE vs Target encoding
Learning Curves: diagnose under/overfitting
Regression Metrics: MAE, MSE, RMSE, R²
Train/Val/Test Split: test set touched ONCE only
Neural Networks — MLP
Layers of weighted sums + activations. Forward pass, layer by layer: Z = XW + b → A = activation(Z). Backprop: chain rule computes gradients. ReLU for hidden layers, Sigmoid for binary output, Softmax for multi-class. Implemented in sklearn as MLPClassifier.
Not in Slides! · Critical
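A forward pass is just matrix multiplies plus nonlinearities; a minimal numpy sketch with one hidden layer (shapes and weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # hidden layer: 5 units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # output layer: 1 unit

hidden = relu(X @ W1 + b1)         # Z = XW + b, then activation
probs = sigmoid(hidden @ W2 + b2)  # sigmoid output for binary classification
print(probs.shape)  # (4, 1), every value strictly in (0, 1)
```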
SHAP Values — Explainability
Assigns each feature a contribution value for each individual prediction. "Feature X pushed this prediction +0.3 above baseline." Works for any model. Required in finance, healthcare, and legal contexts. pip install shap → 3 lines to use.
Not in Slides! · Industry Must
ML Pipelines with sklearn
Chain preprocessor + model into one Pipeline object. Prevents data leakage automatically. Enables correct cross-validation. Deploy one object — pickle it, load it, call .predict(). This is how production ML code looks.
Not in Slides! · Production
NLP — TF-IDF & Embeddings
TF-IDF: term frequency × inverse document frequency — numeric representation of text. Word2Vec/GloVe: dense word embeddings capturing semantic meaning. BERT/sentence-transformers: contextual embeddings (you used this in your RAG system!).
Not in Slides!
Experiment Tracking — MLflow
Log params, metrics, models, and plots automatically across runs. Compare experiments visually. Reproduce any result. Register and version models. mlflow.autolog() with sklearn adds tracking in one line.
Not in Slides! · MLOps
Model Deployment with FastAPI
You already know FastAPI from the RAG system. Save model with joblib.dump(). Load in FastAPI. Wrap .predict() in a POST endpoint. Containerize with Docker. Same pattern as your RAG backend — ML deployment is just another API.
Not in Slides! · You Know This!
production_ml_api.py — Deploy Any ML Model
python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
app = FastAPI()
pipeline = joblib.load("model.pkl")  # a full sklearn Pipeline (prep + model)

class PredictRequest(BaseModel):
    features: list[float]  # your feature vector

class PredictResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str = "v1.0"

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    X = np.array(req.features).reshape(1, -1)
    pred = pipeline.predict(X)[0]
    prob = pipeline.predict_proba(X)[0].max()
    return PredictResponse(prediction=int(pred), probability=float(prob))
# ↑ Same FastAPI pattern as your RAG system. ML deployment = just another endpoint.
🎓 Complete Curriculum Checkpoint
You've mastered the course when you can:
Explain every algorithm's intuition without code or notes
Choose the right algorithm for any given problem type
Build a leak-free sklearn Pipeline from raw data to prediction
Explain why accuracy fails on imbalanced data and what to use instead
Describe XGBoost's key improvements over vanilla GBM
Deploy any trained model as a FastAPI endpoint (you already know this!)
Explain SHAP and why it matters for model trust in production
Handle missing values, outliers, and categorical features correctly