Every topic from the iSCALE syllabus — plus all the critical ones that were missing. Deep explanations, interactive visualizations, real code, and interview answers at every step.
8 Modules · 60+ Topics · 12 Live Visualizations · ∞ Depth
Module 01
Mathematical Foundations
ML is applied math. You don't need a PhD — but you need to understand why gradient descent works, what an eigenvector means intuitively, and why probabilities matter. This module covers only what you'll actually use.
⚠️ Don't Skip This
Every algorithm in this guide has a mathematical core. Understanding the math is what separates someone who uses sklearn from someone who can debug, tune, and explain it in interviews.
The ball rolls down the loss curve. Each step = θ := θ − α·∇L(θ). Watch it converge to the minimum.
Linear Algebra — Matrices & Vectors
Your dataset IS a matrix. Rows = samples, columns = features. Matrix multiplication is how weights are applied. Dot product measures similarity (used in cosine similarity for RAG!).
In Slides · Core
Calculus — Derivatives & Gradients
A gradient points in the direction of steepest ascent. We go the opposite direction to minimize loss. This single idea trains every neural network, logistic regression, and SVM.
In Slides · Core
Probability & Bayes' Theorem
P(A|B) = P(B|A)·P(A) / P(B). This is Naive Bayes, Bayesian Optimization, and the probabilistic interpretation of logistic regression — all from one formula.
In Slides · Core
Probability Distributions
Gaussian: natural phenomena. Bernoulli: coin flip / binary classification. Multinomial: word counts / Naive Bayes. Poisson: event counts. Your data follows one of these — knowing which changes everything.
In Slides
Eigenvalues & Eigenvectors
A vector that only scales — doesn't rotate — when a transformation is applied. PCA finds eigenvectors of the covariance matrix. These are the directions of maximum variance in your data.
In Slides · PCA Key
Information Theory — Entropy
H(X) = −Σ p(x)log p(x). Measures uncertainty. Decision trees use Information Gain (drop in entropy) to pick splits. Cross-entropy is the loss function for every classifier.
Not in Slides! · Interview
Common Probability Distributions
Recognizing your data's distribution tells you which model, imputation method, and transformation to use.
"Explain gradient descent intuitively. What is the learning rate?"
Imagine you're blindfolded on a hilly landscape trying to find the lowest point. At each step, you feel the slope under your feet and take a step in the downhill direction. The learning rate α is your step size. Too large → overshoot the minimum and diverge. Too small → reach the minimum very slowly. The gradient tells you the direction and slope; learning rate tells you how far to step.
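The update rule θ := θ − α·∇L(θ) fits in a few lines of plain Python (a minimal sketch; the quadratic objective is an invented example):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat theta := theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2*theta (minimum at 0)
theta_min = gradient_descent(lambda t: 2 * t, theta0=5.0)
print(round(theta_min, 6))  # 0.0 (converged to the minimum)
```

Try lr=1.1 in the same sketch and the iterates blow up: that is the "too large → diverge" failure mode from the answer above.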
Module 02
Data Engineering & Preprocessing
In real projects, 70–80% of time is here — not on models. A well-preprocessed dataset beats a fancy model on messy data every time. This is where real ML skill shows.
Feature Scaling — Why It Matters
Without scaling, KNN and SVM are dominated by the feature with the largest range (e.g., salary in thousands vs age in tens).
Data Imputation
Mean: numeric, symmetric data. Median: numeric, skewed data (robust to outliers). Mode: categorical. KNN Imputer: uses K nearest neighbors — smartest option. Always impute AFTER splitting to avoid leakage.
In Slides
Outlier Detection
Z-Score: outlier if |z| > 3. IQR: outlier if x < Q1−1.5·IQR or x > Q3+1.5·IQR. Isolation Forest: best for high dimensions. Box plots are your first tool — always look at them.
In Slides
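The IQR rule takes a few lines of numpy (a minimal sketch; the sample values are invented):

```python
import numpy as np

def iqr_outliers(x):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

x = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is the obvious outlier
print(x[iqr_outliers(x)])  # [95]
```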
Feature Scaling
Min-Max: x'=(x−min)/(max−min) → [0,1]. Standard Scaler: x'=(x−μ)/σ → mean 0, std 1. Required for: KNN, SVM, Neural Nets, PCA. NOT required for: Random Forest, XGBoost, Decision Trees.
In Slides · Critical
Feature Transformation
Log Transform: fixes right skew (income, house prices). Box-Cox: generalized log. Square root: moderate skew. These make distributions more Gaussian, helping linear models dramatically.
In Slides
Handling Imbalanced Data
SMOTE: creates synthetic minority samples by interpolating between real ones. Undersampling: removes majority samples. class_weight='balanced' in sklearn adjusts loss automatically. ROC/F1, not accuracy, for imbalanced data.
In Slides · Real-World
Dimensionality Reduction
PCA: unsupervised, maximizes variance, use for preprocessing. LDA: supervised, maximizes class separation, use before classifiers. t-SNE/UMAP: visualization ONLY — not for modeling features.
In Slides
Encoding Categorical Features
Label Encoding: ordinal data (small<med<large). One-Hot Encoding: nominal data (city, color) — creates binary columns. Target Encoding: mean of target per category — powerful but leaks if not done carefully.
Missing from Slides! · Critical
Data Leakage — The #1 Mistake
Fitting a scaler on ALL data before splitting means test data influenced training — your model is "cheating." Always split first, then fit preprocessors on train only, transform both. Use sklearn Pipeline to enforce this automatically.
Missing from Slides! · Interview
PCA — Finding Principal Components
The first principal component (red arrow) points in the direction of maximum variance. The second is perpendicular to it. PCA rotates your data to align with these directions, then you can drop low-variance components.
preprocessing_pipeline.py
python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer, SimpleImputer
num_pipeline = Pipeline([
    ('impute', KNNImputer(n_neighbors=5)),
    ('scale', StandardScaler())
])

cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# numeric_cols / categorical_cols are your lists of column names
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numeric_cols),
    ('cat', cat_pipeline, categorical_cols)
])

# ↑ This is the CORRECT pattern — preprocessor is fitted ONLY on X_train
# Later: full_pipeline = Pipeline([('prep', preprocessor), ('model', clf)])
Module 03
Supervised Learning: Regression
Predicts a continuous number. Every algorithm is a different strategy for learning the function y = f(X). Start with linear regression as your baseline before trying anything complex.
Regression Models — Compare Fits
Watch how high-degree polynomials perfectly fit training points but oscillate wildly. Ridge regularization tames this by penalizing large coefficients.
Simple & Multiple Linear Regression
y = β₀ + β₁x₁ + ... + βₙxₙ. Minimizes Sum of Squared Errors. Has a closed-form solution: β = (XᵀX)⁻¹Xᵀy. Check assumptions: linearity, no multicollinearity (use VIF), homoscedasticity.
In Slides · Baseline
Polynomial Regression
Adds x², x³ features to capture curves. Use PolynomialFeatures in sklearn. Warning: degree ≥ 5 almost always overfits. Combine with Ridge regularization for safety.
In Slides
Ridge (L2) Regularization
Loss = MSE + λΣβ². Shrinks weights toward zero but never exactly. Handles multicollinearity well. α (sklearn) = λ. Cross-validate α. When features are correlated, always prefer Ridge over plain linear.
In Slides · Must Know
Lasso (L1) Regularization
Loss = MSE + λΣ|β|. Can force weights to EXACTLY zero — built-in feature selection! When you have many irrelevant features, Lasso is better than Ridge. Sparse solutions are interpretable.
In Slides · Feature Selection
SVR — Support Vector Regression
Fits the widest possible ε-insensitive tube around data. Points inside = no penalty. Points outside = penalized. Kernel trick (RBF) handles non-linear patterns. Requires scaling.
In Slides
Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Noise. High bias = underfitting (model too simple). High variance = overfitting (model memorized training data). Regularization, ensemble methods, and more data all help.
Not in Slides! · Interview #1
Ridge: L = Σ(yᵢ − ŷᵢ)² + λΣβⱼ² (L2 — shrinks weights)
Lasso: L = Σ(yᵢ − ŷᵢ)² + λΣ|βⱼ| (L1 — zeroes weights)
R²: 1 − SS_residual / SS_total (1 = perfect, 0 = mean baseline)
RMSE: √(1/n · Σ(yᵢ − ŷᵢ)²) (same units as target)
Interview Q
"What is the difference between Ridge and Lasso? When would you use each?"
Both prevent overfitting by adding a penalty to large weights. Ridge (L2) adds the sum of squared weights — it shrinks all weights proportionally but never to zero. Lasso (L1) adds the sum of absolute weights — it can force some weights to exactly zero, effectively removing features. Use Lasso when you suspect many features are irrelevant and want automatic feature selection. Use Ridge when features are correlated (it handles multicollinearity better). ElasticNet combines both and is a safe default.
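The zeroing behavior is easy to see on synthetic data where only two of five features matter (a sketch; the alpha values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # features 2-4 are noise

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))  # all shrunk, none exactly zero
print(np.round(lasso.coef_, 2))  # irrelevant features driven to exactly 0.0
```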
Module 04
Supervised Learning: Classification
Predicts a category. Each algorithm learns a different kind of decision boundary — from a line, to a curve, to a Voronoi region, to a hyperplane. Knowing which to use is the skill.
Decision Boundaries — How Classifiers Think
Logistic = straight line. KNN = jagged, local. SVM RBF = smooth curves. Decision Tree = axis-aligned rectangles. Each shape has different strengths.
Logistic Regression
Uses sigmoid σ(z)=1/(1+e⁻ᶻ) to output probabilities. Trained with cross-entropy loss. Linear decision boundary. Highly interpretable — coefficients tell you feature impact. Best baseline for any classification problem.
In Slides · Always Start Here
Naive Bayes (Gaussian / Multinomial / Bernoulli)
Applies Bayes assuming features are independent. "Naive" because independence is rarely true — but it works surprisingly well. Gaussian: real-valued features. Multinomial: word counts (NLP). Bernoulli: binary features.
In Slides
KNN — K Nearest Neighbors
No training — just memorizes data. Prediction = majority vote of K nearest points (Euclidean distance). Sensitive to scale — ALWAYS scale first. High K = smoother boundary (less overfit). Slow at inference on large datasets.
In Slides
Decision Tree Classification
Splits using Gini Impurity or Information Gain. Gini = 1−Σpᵢ². Deeply interpretable — you can export and read every rule. Tends to overfit without max_depth or min_samples_leaf constraints.
In Slides · Interpretable
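The Gini formula in code (a minimal sketch):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # 0.0 (pure node)
print(gini([0, 0, 1, 1]))  # 0.5 (maximally mixed binary node)
```

A tree picks the split whose weighted child impurity drops the most below the parent's.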
SVM — Support Vector Machine
Finds the hyperplane that maximizes the margin between classes. C parameter: large C = less tolerance for misclassification (might overfit). Support vectors are the only training points that matter.
In Slides
Kernel SVM
The kernel trick maps data to a higher dimension — without computing it explicitly — making non-linear data linearly separable. RBF kernel: most common, great default. γ controls how local the influence of each point is.
In Slides · Kernel Trick
Random Forest Classification
Ensemble of trees trained on bootstrap samples with random feature subsets. Final prediction = majority vote. Variance reduction vs single tree. Built-in feature_importances_. First model to try after logistic regression.
In Slides · Go-To
Sigmoid vs Softmax
Sigmoid: binary classification output → single probability [0,1]. Softmax: multi-class → vector of probabilities summing to 1. This difference matters when building neural networks and understanding logistic regression outputs.
Not in Slides! · Neural Net Key
Module 05
Unsupervised Learning
No labels. The algorithm finds structure in data by itself. Used for customer segmentation, anomaly detection, topic modeling, and exploring new datasets before you know what to predict.
Clustering Algorithms Compared
K-Means: fast, assumes spherical clusters, needs K upfront. DBSCAN: finds arbitrary shapes, marks noise as -1, no K needed.
K-Means Clustering
Assign points to nearest centroid, recompute centroids, repeat. Sensitive to initialization → use K-Means++. Sensitive to scale → StandardScaler first. Sensitive to outliers → remove them first or use DBSCAN.
In Slides · Core
Elbow Method & Silhouette Score
Elbow: plot WCSS vs K — pick the "elbow." Silhouette: s = (b−a)/max(a,b), where a = mean intra-cluster distance and b = mean distance to the nearest other cluster. Range [−1,1]. Use BOTH together — elbow identifies candidates, silhouette picks the best.
In Slides
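Both diagnostics in one loop with sklearn (toy blobs; `inertia_` is the WCSS the elbow plot uses):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow: watch WCSS (inertia_) flatten. Silhouette: pick the K that scores highest.
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```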
DBSCAN
Core points: ≥ min_samples neighbors within ε radius. Border points: within ε of a core point. Noise points: labeled −1. Finds arbitrary cluster shapes. Doesn't need K. Two params: ε (try KNN distance plot) and min_samples (try log(N)).
In Slides · Outlier Detection
Hierarchical Clustering
Agglomerative: start with N clusters, merge closest pair at each step. Dendrogram shows the merge history — cut it at any height to get any number of clusters. Ward linkage minimizes within-cluster variance.
In Slides
Gaussian Mixture Models (GMM)
Soft clustering — each point gets a probability of belonging to each cluster. Data modeled as mixture of K Gaussians. Trained with EM algorithm (Expectation-Maximization). More flexible shapes than K-Means.
In Slides
Anomaly Detection
Isolation Forest: anomalies get isolated in fewer random splits (anomaly score = average path length). One-Class SVM: learns a boundary around normal data. LOF: compares local density to neighbors. Used in fraud, network intrusion, manufacturing QC.
In Slides
Association Rules — Apriori & Eclat
Market basket analysis. Support = P(X∩Y). Confidence = P(Y|X). Lift = Confidence/P(Y). Lift > 1 means the items appear together more than by chance. Apriori: breadth-first. Eclat: depth-first with vertical format (faster).
In Slides
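These three metrics can be computed by hand on a toy basket list (items invented for illustration):

```python
# Rule "bread -> milk" on five tiny baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    """Fraction of baskets containing all the given items."""
    return sum(items <= b for b in baskets) / len(baskets)

confidence = support({"bread", "milk"}) / support({"bread"})  # P(milk | bread)
lift = confidence / support({"milk"})  # here < 1: together LESS than chance
print(round(confidence, 3), round(lift, 3))  # 0.667 0.833
```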
Module 06
Ensemble Methods & Boosting
Combine weak learners into a strong one. XGBoost and LightGBM dominate Kaggle competitions and real-world tabular data problems. Understanding these deeply is a superpower.
Boosting — Sequential Error Correction
Round 0 — All weights equal
Each round focuses on the mistakes of the previous. Misclassified points get larger circles (higher weight). The final model is a weighted sum of all rounds.
Bagging & Random Forest
Train N trees on random bootstrap samples. Each tree also uses a random feature subset (sqrt(n_features) for classification). Average predictions. Reduces variance while keeping bias low. Works in parallel — fast to train.
In Slides · Core
AdaBoost
Train a stump → identify misclassified points → upweight them → train next stump on reweighted data. Final prediction = weighted vote of all stumps. Sensitive to noisy data and outliers (they get very high weights).
In Slides
Gradient Boosting (GBM)
Each tree fits the residuals (negative gradient) of the previous ensemble. Learning rate shrinks each tree's contribution, preventing overfitting. The most important hyperparams: n_estimators, learning_rate, max_depth. Lower learning_rate → needs more trees, but generalizes better.
In Slides · Powerful
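The residual-fitting loop can be sketched directly with sklearn's DecisionTreeRegressor (synthetic data; hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Squared loss: the negative gradient IS the residual, so each tree fits residuals
lr, pred = 0.1, np.zeros_like(y)
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += lr * tree.predict(X)  # learning_rate shrinks each tree's contribution

print(round(float(np.mean((y - pred) ** 2)), 4))  # training MSE falls toward the noise floor
```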
XGBoost
Optimized GBM: second-order Taylor expansion, L1+L2 regularization, parallel tree building, column subsampling. Key params: n_estimators, max_depth (3–6), learning_rate (0.01–0.3), subsample, colsample_bytree. Use early_stopping_rounds to avoid overfitting automatically.
In Slides · Industry Standard
LightGBM
Leaf-wise growth (vs level-wise in XGBoost) → more accurate trees but more prone to overfit without tuning. GOSS + EFB make it 10x faster on large datasets. num_leaves is the most important param (more than max_depth).
In Slides · Fast
CatBoost
Handles raw categorical features — just pass them as strings. Ordered boosting reduces overfitting. Symmetric (oblivious) trees → fast inference. Best when you have many high-cardinality categoricals and don't want to encode them.
In Slides
Stacking & Blending
Stacking: Level-0 models make predictions → these become features for Level-1 meta-learner. Blending: simpler — use held-out validation set for meta features. Often adds 0.5–2% on competitions. Risk: complexity and overfitting.
In Slides
Voting Classifier
Hard voting: majority class. Soft voting: average probabilities — usually more accurate when models output calibrated probabilities, because it uses each model's confidence. Combine diverse models (SVM + RF + Logistic) for best results.
In Slides
| Algorithm | Speed | Categoricals | Key Tuning Param | Best For |
|---|---|---|---|---|
| Random Forest | Fast (parallel) | Encode first | n_estimators | General baseline, high dims |
| XGBoost | Medium | Encode first | learning_rate + early_stop | Structured data, competitions |
| LightGBM | Very fast | Limited native | num_leaves | Large datasets (1M+ rows) |
| CatBoost | Medium | Native (best) | iterations + depth | High-cardinality categoricals |
| AdaBoost | Fast | Encode first | n_estimators | Simple, interpretable ensembles |
Module 07
Model Evaluation & Hyperparameter Tuning
The model is only as good as how you measure it. Wrong metrics lead to wrong decisions. This is also where you learn if your model is actually learning or just memorizing.
Confusion Matrix → All Metrics Derived
Every classification metric comes from 4 numbers: TP, TN, FP, FN. Each cell plays a role in Precision, Recall, F1, and Accuracy.
Accuracy, Precision, Recall, F1
Accuracy = (TP+TN)/N — useless for imbalanced. Precision = TP/(TP+FP) — "how often are positive predictions correct?" Recall = TP/(TP+FN) — "how many positives did we catch?" F1 = harmonic mean of both.
In Slides · Core
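All four metrics from the four cells (the counts are an invented example):

```python
def classification_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# 80 frauds caught, 900 clean passed, 20 false alarms, 10 frauds missed
p, r, f1, acc = classification_metrics(tp=80, tn=900, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))  # 0.8 0.889 0.842 0.97
```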
ROC Curve & AUC
Plots TPR vs FPR at every threshold. AUC = 0.5 is random; AUC = 1.0 is perfect. Model-comparison tool — works regardless of threshold choice. For very imbalanced data, use Precision-Recall curve instead.
Not in Slides! · Must Know
K-Fold & Stratified K-Fold
K-Fold splits into K folds, trains K times (each fold as test once), averages scores. Stratified preserves class ratio in each fold — always use for classification. K=5 or K=10 are standard choices.
In Slides
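A minimal stratified cross-validation sketch (synthetic imbalanced data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 80/20 imbalanced synthetic data; stratification keeps that ratio in every fold
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.round(3), round(float(scores.mean()), 3))  # 5 fold scores and their average
```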
GridSearchCV & RandomizedSearchCV
GridSearch: exhaustive, guaranteed to find best in grid, O(n_combos × K) fits. RandomizedSearch: random sample, often finds near-optimal at 10% cost. For large search spaces always use Randomized. n_iter controls budget.
In Slides
Bayesian Optimization (Optuna)
Builds a surrogate model of the objective function. Each trial chooses params where improvement is most likely. Far more efficient than random search. Industry standard for expensive models. pip install optuna — 5 lines to use.
Not in Slides! · Industry
Learning Curves
Plot train and val scores vs training set size. Both low → underfit (need better model). Large gap → overfit (need more data or regularization). Converging → more data won't help. Essential diagnostic before any optimization.
Not in Slides!
ROC Curve — Compare 3 Models
The closer the curve hugs the top-left corner, the better. AUC = area under the curve. The diagonal is a random classifier (AUC=0.5). Always compare models with AUC, not accuracy.
Precision: P = TP / (TP + FP)
Recall: R = TP / (TP + FN)
F1 Score: F1 = 2·P·R / (P+R)
Accuracy: A = (TP + TN) / (TP + TN + FP + FN)
R² Score: R² = 1 − SS_res / SS_tot
Interview Q
"You have 99% fraud-free data. Your model gets 99% accuracy. Is it good?"
No — a model that predicts "not fraud" for every single transaction would achieve 99% accuracy. Accuracy is meaningless on imbalanced datasets. You should look at Precision and Recall for the minority class (fraud). If recall for fraud is 0%, the model is useless despite 99% accuracy. Use F1-score or AUC-ROC for imbalanced problems, and consider using SMOTE or class_weight='balanced'.
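The trap is easy to demonstrate with sklearn's DummyClassifier (synthetic 99:1 data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 990 + [1] * 10)  # 99% clean, 1% fraud
X = np.zeros((1000, 1))             # features are irrelevant to the dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))  # 0.99 (looks great)
print(recall_score(y, pred))    # 0.0  (catches zero fraud)
```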
Module 08
Critical Topics NOT in the Slides
The syllabus images covered classical ML well. But these topics were absent — and every real ML role will need them. Do not skip this module.
⚠️ 16 Topics Missing from Your Syllabus
These were not visible in either slide image but are required for real-world ML work:
SHAP Explainability: required in regulated industries
sklearn Pipelines: production-grade code pattern
Bayesian Optimization: Optuna, industry-standard tuning
Time Series ML: ARIMA, Prophet, lag features
NLP Basics: TF-IDF, embeddings, BERT
Model Deployment: FastAPI + Docker (you know this!)
Experiment Tracking: MLflow, Weights & Biases
Categorical Encoding: Label vs OHE vs Target encoding
Learning Curves: diagnose under/overfitting
Regression Metrics: MAE, MSE, RMSE, R²
Train/Val/Test Split: test set touched ONCE only
Neural Networks — MLP
Layers of weighted sums + activations. Forward pass, layer by layer: Z = XW + b → A = activation(Z). Backprop: chain rule computes gradients. ReLU for hidden layers, Sigmoid for binary output, Softmax for multi-class. Implemented in sklearn as MLPClassifier.
Not in Slides! · Critical
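A forward pass is just matrix multiplies plus nonlinearities; a minimal numpy sketch with one hidden layer (shapes and weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # hidden layer: 5 units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # output layer: 1 unit

hidden = relu(X @ W1 + b1)         # Z = XW + b, then activation
probs = sigmoid(hidden @ W2 + b2)  # sigmoid output for binary classification
print(probs.shape)  # (4, 1), every value strictly in (0, 1)
```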
SHAP Values — Explainability
Assigns each feature a contribution value for each individual prediction. "Feature X pushed this prediction +0.3 above baseline." Works for any model. Required in finance, healthcare, and legal contexts. pip install shap → 3 lines to use.
Not in Slides! · Industry Must
ML Pipelines with sklearn
Chain preprocessor + model into one Pipeline object. Prevents data leakage automatically. Enables correct cross-validation. Deploy one object — pickle it, load it, call .predict(). This is how production ML code looks.
Not in Slides! · Production
NLP — TF-IDF & Embeddings
TF-IDF: term frequency × inverse document frequency — numeric representation of text. Word2Vec/GloVe: dense word embeddings capturing semantic meaning. BERT/sentence-transformers: contextual embeddings (you used this in your RAG system!).
Not in Slides!
Experiment Tracking — MLflow
Log params, metrics, models, and plots automatically across runs. Compare experiments visually. Reproduce any result. Register and version models. mlflow.autolog() with sklearn adds tracking in one line.
Not in Slides! · MLOps
Model Deployment with FastAPI
You already know FastAPI from the RAG system. Save model with joblib.dump(). Load in FastAPI. Wrap .predict() in a POST endpoint. Containerize with Docker. Same pattern as your RAG backend — ML deployment is just another API.
Not in Slides! · You Know This!
production_ml_api.py — Deploy Any ML Model
python
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
app = FastAPI()
pipeline = joblib.load("model.pkl")  # a full sklearn Pipeline (prep + model)

class PredictRequest(BaseModel):
    features: list[float]  # your feature vector

class PredictResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str = "v1.0"

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    X = np.array(req.features).reshape(1, -1)
    pred = pipeline.predict(X)[0]
    prob = pipeline.predict_proba(X)[0].max()
    return PredictResponse(prediction=int(pred), probability=float(prob))
# ↑ Same FastAPI pattern as your RAG system. ML deployment = just another endpoint.
🎓 Complete Curriculum Checkpoint
You've mastered the course when you can:
Explain every algorithm's intuition without code or notes
Choose the right algorithm for any given problem type
Build a leak-free sklearn Pipeline from raw data to prediction
Explain why accuracy fails on imbalanced data and what to use instead
Describe XGBoost's key improvements over vanilla GBM
Deploy any trained model as a FastAPI endpoint (you already know this!)
Explain SHAP and why it matters for model trust in production
Handle missing values, outliers, and categorical features correctly