Machine Learning from Scratch
You're an Angular dev with 3 years of experience. You can read Python. This guide is built for YOU: it connects ML to things you already know, explains the math simply, and shows real code.
⚠️ What the syllabus images don't fully show (you'd need to add):
The visible slides were cut off before showing: F1-Score, ROC-AUC, Log Loss, R² / Adjusted R², and the MAE/MSE/RMSE metrics. These are covered in the Evaluation section of this guide.
Math Foundations
You don't need a math degree. You need to understand 4 things: Linear Algebra, Calculus, Probability, and Statistics. Here's each one in plain English.
What it is: Math for working with tables of numbers. A vector is one column of data. A matrix is a whole table. In ML, your entire dataset is a matrix.
Key concepts: Dot product (multiplying two vectors = one number), Matrix multiplication (transforming data), Eigenvalues (used in PCA – finding the "most important directions" in data).
import numpy as np

# A vector – one row of your dataset (one sample)
sample = np.array([28, 75000, 3])  # age, salary, years_exp

# A matrix – your whole dataset
dataset = np.array([
    [28, 75000, 3],
    [35, 90000, 8],
    [22, 45000, 1]
])

# Shape = (rows=samples, cols=features)
print(dataset.shape)  # (3, 3)

# Dot product
weights = np.array([0.5, 0.3, 0.2])
result = np.dot(sample, weights)  # 14 + 22500 + 0.6
What it is: A derivative tells you "if I change X a tiny bit, how much does Y change?" In ML, this is how the model learns – it adjusts its weights in the direction that reduces error.
Gradient Descent: The core learning algorithm. Move weights in the direction that reduces loss.
import numpy as np

# Imagine y = 3x + noise (we want to find w=3)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y_true = 3 * x + rng.normal(0, 0.1, 100)

w = 0.0    # start guess
lr = 0.01  # learning rate – step size

for epoch in range(100):
    prediction = w * x
    error = prediction - y_true
    gradient = 2 * np.mean(error * x)  # dLoss/dw
    w = w - lr * gradient              # step downhill

# After 100 steps, w ≈ 3.0
What it is: P(A|B) = "the probability of A, GIVEN that B is true." This is how Naive Bayes classifiers work – they use conditional probability to classify.
Why it matters: Used directly in Naive Bayes. Underlies all probabilistic ML. Understanding priors vs posteriors helps you understand how models update beliefs.
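A tiny worked example of Bayes' theorem makes priors and posteriors concrete. All the probabilities below are invented for illustration:

```python
# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
# Invented numbers: 20% of mail is spam; the word "free" appears
# in 60% of spam and in 5% of legit mail.
p_spam = 0.20
p_word_given_spam = 0.60
p_word_given_ham = 0.05

# Total probability of seeing the word at all
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: the belief AFTER seeing the evidence
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # prior 0.20 -> posterior 0.75
```

Seeing one word moved the belief from 20% to 75% – this update step is exactly what Naive Bayes repeats for every feature.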
- Normal/Gaussian – bell curve, most natural data
- Bernoulli – yes/no outcomes (click or no click)
- Multinomial – multiple categories (Naive Bayes for text)
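A quick way to build intuition for these three distributions is to sample from them with NumPy. The parameters and sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Normal: bell curve around a mean (e.g. heights, measurement noise)
heights = rng.normal(loc=170, scale=10, size=10_000)

# Bernoulli: a single yes/no trial (a Binomial with n=1)
clicks = rng.binomial(n=1, p=0.3, size=10_000)  # ~30% ones

# Multinomial: 100 draws spread over 3 categories (e.g. word counts)
word_counts = rng.multinomial(n=100, pvals=[0.5, 0.3, 0.2])

print(round(heights.mean(), 1))  # close to 170
print(round(clicks.mean(), 2))   # close to 0.30
print(word_counts.sum())         # always 100
```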
Descriptive: Summarizing data you have. Mean, median, mode, variance, standard deviation. These are also used directly in preprocessing (Z-Score needs mean + std).
Inferential: Drawing conclusions about a whole population from a sample. Used in hypothesis testing, A/B tests, and understanding model performance.
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# Descriptive stats in one line
df.describe()  # count, mean, std, min, max, quartiles

# Individual stats
mean = df['salary'].mean()
std = df['salary'].std()
med = df['salary'].median()
var = df['salary'].var()

# Correlation matrix – which features relate?
df.corr(numeric_only=True)  # values close to 1 or -1 = strong relationship
Data Preprocessing
Garbage in = garbage out. 80% of real ML work is here. This is where your dev background actually gives you a big advantage – it's mostly data wrangling.
The problem: Real data always has missing values. You can't just delete all affected rows – you'd lose too much data.
- Mean imputation – fill with the average (numeric, no outliers)
- Median imputation – fill with the middle value (better when outliers exist)
- Mode imputation – fill with the most frequent value (categorical data)
- KNN Imputer – fill using the k nearest neighbours' values (smartest)
from sklearn.impute import SimpleImputer, KNNImputer
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'salary': [50000, 60000, np.nan, 45000, 70000]
})

# Mean imputation
imputer = SimpleImputer(strategy='mean')
df_filled = imputer.fit_transform(df)

# KNN Imputer (uses nearby rows to guess the missing value)
knn_imp = KNNImputer(n_neighbors=2)
df_knn = knn_imp.fit_transform(df)
What it is: Outliers are data points so far from the rest that they'd skew your model. Like one employee earning ₹10 Crore in a list of ₹5-15 Lakh salaries.
import numpy as np
from scipy import stats

# Z-Score method
z_scores = np.abs(stats.zscore(df['salary']))
df_clean = df[z_scores < 3]  # keep only non-outliers

# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_iqr = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]
Why it matters: If age is 0-100 and salary is 0-10,000,000, the model thinks salary is a million times more important. Scaling fixes this.
When to use which: Min-Max when you know the bounds and there are no big outliers. StandardScaler (Z-score standardization) when outliers are present or the model expects roughly zero-mean, unit-variance features (note: it recenters and rescales, it does not make data Gaussian).
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[25, 50000], [30, 90000], [22, 30000]]

# Min-Max: scales to 0-1
mm = MinMaxScaler()
X_mm = mm.fit_transform(X)
# age: [0.375, 1.0, 0.0], salary: [0.333, 1.0, 0.0]

# Standard Scaler: mean=0, std=1
ss = StandardScaler()
X_ss = ss.fit_transform(X)
What it is: Many ML models assume data is normally distributed (bell curve). But real data like salaries, prices, or website visits is skewed. Transformations fix the shape.
import numpy as np
from scipy.stats import boxcox

salaries = [30000, 45000, 60000, 500000, 1200000]

# Log transform – great for right-skewed data
log_sal = np.log1p(salaries)  # log1p = log(x+1), handles 0

# Box-Cox – finds the BEST power transformation automatically
bc_sal, lambda_ = boxcox(salaries)  # lambda_ is the optimal power
The problem: You have 9,900 legit transactions and 100 fraudulent ones. If your model just predicts "not fraud" for everything, it's 99% accurate but totally useless.
- SMOTE – creates synthetic minority samples (oversampling)
- Random Undersampling – removes majority class samples
- class_weight='balanced' – tells the model to care more about the minority class
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

# Now the minority class has synthetic samples added
print(y_res.value_counts())  # should be balanced now
What it is: Your dataset might have 500 features (columns). Many are redundant. Reduce to 2-10 key dimensions without losing important information.
PCA (Principal Component Analysis) – unsupervised, finds the directions of maximum variance. LDA (Linear Discriminant Analysis) – supervised, finds the directions that best separate classes. t-SNE – great for visualization only (2D/3D plots of high-dimensional data).
from sklearn.decomposition import PCA

# Reduce 500 features → 10 principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

# How much variance does each component explain?
print(pca.explained_variance_ratio_)
# [0.32, 0.18, 0.12, ...] – first 3 explain 62% of the variance
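PCA ignores labels; LDA uses them. A minimal LDA sketch on sklearn's built-in iris data (with 3 classes, LDA can keep at most n_classes - 1 = 2 components):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# LDA keeps at most (n_classes - 1) components: 3 classes -> 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # note: needs y, unlike PCA

print(X_lda.shape)  # (150, 2)
```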
Regression
Predicting a continuous number – house prices, salaries, temperatures. Output is a number, not a category.
Simple: One input feature → one output. Draw the best straight line through the data points.
Multiple: Many input features → one output. The line becomes a plane (or hyperplane).
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.3f}")  # 1.0 = perfect
print(f"RMSE: {mean_squared_error(y_test, y_pred)**0.5:.2f}")
- MAE – Mean Absolute Error (average distance from the truth, easy to interpret)
- MSE – Mean Squared Error (punishes big errors more)
- RMSE – Root MSE (same unit as the target, most common)
- R² – 0 to 1, how much variance the model explains
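To see all four metrics side by side, here's a hand-checkable example (the numbers are invented so the arithmetic stays easy):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100, 200, 300, 400])
y_pred = np.array([110, 190, 310, 380])

mae = mean_absolute_error(y_true, y_pred)  # (10+10+10+20)/4 = 12.5
mse = mean_squared_error(y_true, y_pred)   # (100+100+100+400)/4 = 175
rmse = mse ** 0.5                          # ~13.23, same unit as y
r2 = r2_score(y_true, y_pred)              # close to 1 = good fit

print(mae, mse, round(rmse, 2), round(r2, 3))  # 12.5 175.0 13.23 0.986
```

Notice how the one big error (20) pushes RMSE above MAE – that's MSE's squaring at work.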
What it is: Linear regression can only fit straight lines. Polynomial regression fits curves by adding x², x³ etc. as features.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Degree 2 = add x² features, degree 3 = add x², x³
poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression())
])
poly_model.fit(X_train, y_train)

# ⚠️ High degree = overfitting risk!
# degree=10 fits training data perfectly but fails on new data
The problem it solves: Overfitting – the model learns the training data too well (memorizes noise) and fails on new data.
L1 (Lasso): Can reduce some weights to exactly 0 – effectively removing useless features. Good for feature selection.
L2 (Ridge): Shrinks all weights toward 0 but never exactly 0. Good when all features matter a little.
ElasticNet: Mix of both L1 + L2.
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# alpha = regularization strength (higher = more penalty)
lasso = Lasso(alpha=0.1)   # some weights become exactly 0
ridge = Ridge(alpha=1.0)   # all weights shrink but stay non-zero
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)  # 50/50 mix

lasso.fit(X_train, y_train)
print(lasso.coef_)  # see which features got zeroed out
Decision Tree Regression: Instead of fitting a line, it splits data into boxes and predicts the average value in each box.
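The "boxes and averages" idea is easiest to see on a tiny invented dataset: with max_depth=1 the tree makes exactly one split and predicts the mean of each side.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: y jumps at x=5, so one split is enough
X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array([10, 12, 11, 50, 52, 51])

# max_depth=1: a single split -> two "boxes", each predicting its mean
tree = DecisionTreeRegressor(max_depth=1)
tree.fit(X, y)

print(tree.predict([[2.5]]))  # mean of the left box: (10+12+11)/3 = 11
print(tree.predict([[7.5]]))  # mean of the right box: (50+52+51)/3 = 51
```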
Random Forest Regression: 100+ decision trees, each trained on random data subsets. Final prediction = average of all trees. Much more accurate and robust.
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

rf = RandomForestRegressor(
    n_estimators=100,  # 100 trees
    max_depth=10,      # max tree depth
    random_state=42
)
rf.fit(X_train, y_train)

# Which features matter most?
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).plot.bar()
What it is: Find a "tube" (epsilon-tube) around the prediction line. Only data points OUTSIDE the tube contribute to the error. Good for non-linear data with kernel trick.
- Small to medium datasets (gets slow on large data)
- Non-linear relationships (use RBF kernel)
- When you want robustness to outliers
from sklearn.svm import SVR

# IMPORTANT: SVR needs scaled features!
svr = SVR(kernel='rbf', C=100, epsilon=0.1)
svr.fit(X_train_scaled, y_train)
y_pred = svr.predict(X_test_scaled)
Classification
Predicting a category – spam vs not spam, disease vs healthy, cat vs dog. Output is a class label, not a number.
⚠️ Confusing name: Despite "Regression" in the name, this is a CLASSIFIER. It outputs a probability (0-1) and you set a threshold (usually 0.5) to classify.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Probabilities (not just 0/1)
probs = model.predict_proba(X_test)  # [[0.27, 0.73], ...]
preds = model.predict(X_test)        # [1, 0, 1, 1, ...]

# Multiclass is handled automatically (multinomial) in recent
# sklearn; older versions needed multi_class='multinomial'
What it is: Uses Bayes' Theorem assuming all features are independent. "Naive" because real features are rarely truly independent, but it still works surprisingly well.
Gaussian NB: continuous features (age, salary). Multinomial NB: count data (word frequencies). Bernoulli NB: binary features (word present/absent).
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian: for continuous features like age, salary
gnb = GaussianNB()

# Multinomial: for text classification (word counts)
mnb = MultinomialNB(alpha=1.0)  # alpha = smoothing

# Bernoulli: for binary features (word present: yes/no)
bnb = BernoulliNB()

gnb.fit(X_train, y_train)
accuracy = gnb.score(X_test, y_test)
- Text classification (spam, sentiment)
- Real-time predictions (very fast)
- Small datasets where it often beats complex models
What it is: For a new data point, find the K most similar points in training data. Majority class among those K neighbors = prediction.
Choosing K: Low K (1-3) = flexible, may overfit. High K = smooth, may underfit. Usually try K=5, then tune with cross-validation. K should be odd for binary classification to avoid ties.
from sklearn.neighbors import KNeighborsClassifier

# MUST scale features! KNN is distance-based
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Find the optimal K
scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    scores.append(knn.score(X_test_scaled, y_test))
# Pick the k with the highest score
What it is: Find the "best dividing line" (hyperplane) between classes, maximizing the margin (gap) between the line and the nearest points of each class. Those nearest points are the "support vectors".
Kernel SVM: When data isn't linearly separable, the kernel trick maps data to a higher dimension where it IS separable – without actually computing that transformation (it's a mathematical shortcut).
- Linear kernel – for linearly separable data
- RBF (Radial Basis Function) – most common, works for most non-linear data
- Polynomial kernel – for polynomial boundaries
from sklearn.svm import SVC

# C = margin hardness (high C = fewer margin violations)
# gamma = RBF influence radius (high = complex boundary)
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_train_scaled, y_train)

# probability=True enables predict_proba()
probs = svm.predict_proba(X_test_scaled)
What it is: A flowchart of yes/no questions. At each node, it splits data to maximize "purity" (all one class together). Leaves are final class predictions.
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(
    max_depth=5,         # prevents overfitting
    min_samples_leaf=5,  # min 5 samples per leaf
    criterion='gini'
)
dt.fit(X_train, y_train)

# Print the tree rules – interpretable!
print(export_text(dt, feature_names=feature_names))
What it is: An ensemble of decision trees. Each tree is trained on a random subset of data and features. Final prediction = majority vote of all trees.
Why it's better than one tree: Reduces overfitting (high variance → low variance), handles missing values, gives feature importances, rarely needs tuning.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,     # more trees = more stable
    max_features='sqrt',  # each tree sees √(n_features)
    oob_score=True,       # out-of-bag score (free validation)
    n_jobs=-1             # use all CPU cores
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
Clustering (Unsupervised)
No labels, no supervision. Find natural groups in data. Nobody told the model what groups to look for – it discovers them.
What it is: Group data into K clusters. Each point belongs to the cluster with the nearest center (centroid). Centers are updated iteratively until stable.
K-Means++: Better initialization – centroids start spread out, not random. Converges faster and to better solutions.
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3,
    init='k-means++',  # smarter init (default)
    n_init=10,         # run 10 times, pick best
    random_state=42
)
labels = km.fit_predict(X_scaled)
centers = km.cluster_centers_
inertia = km.inertia_  # sum of squared distances to centers
The problem: K-Means needs you to choose K. How do you know the right number of clusters?
Silhouette Score: For each point, measures "how well does it fit its own cluster vs the nearest other cluster". Range: -1 (bad) to +1 (perfect). Higher = better clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias = []
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot inertia – look for the "elbow"
plt.plot(range(2, 11), inertias, marker='o')
plt.title('Elbow Method')

# Best K = highest silhouette score
best_k = range(2, 11)[sil_scores.index(max(sil_scores))]
What it is: Builds a tree (dendrogram) showing how data points merge into clusters at different scales. You don't need to specify K upfront.
Agglomerative (bottom-up): Start with every point as its own cluster, merge closest pairs repeatedly. Divisive (top-down): Start with one big cluster, split repeatedly.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Draw the dendrogram to see the natural clusters
Z = linkage(X_scaled, method='ward')
dendrogram(Z, truncate_mode='level', p=5)
plt.title('Dendrogram')
plt.show()

# Cut at K=3 clusters
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hc.fit_predict(X_scaled)
What it is: Clusters are "dense regions" of points. Works on any cluster shape (not just circles). Automatically identifies outliers as noise (label = -1).
Parameters: eps = max distance between neighbors, min_samples = min points to form a cluster core.
- DBSCAN: no need to specify K, finds arbitrary shapes, handles noise
- K-Means: faster, works better on spherical clusters, needs K upfront
- Use DBSCAN when clusters are irregular or you have many outliers
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)  # -1 = outlier/noise
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
What it is: Assumes data comes from a mix of K Gaussian (normal) distributions. "Soft clustering" – each point gets a probability of belonging to each cluster, not a hard assignment.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full')
gmm.fit(X_scaled)

# Hard labels
labels = gmm.predict(X_scaled)

# Soft probabilities – unique to GMM!
probs = gmm.predict_proba(X_scaled)
# Each row sums to 1.0, showing cluster membership probability
Anomaly Detection & Association Rules
Finding the unusual. Finding what happens together. Two different but powerful unsupervised techniques.
What it is: Find data points that don't fit the normal pattern. Used for fraud detection, network intrusion, manufacturing defects, medical anomalies.
- Isolation Forest – isolates anomalies with random splits (fast, scalable)
- One-Class SVM – learns the "normal" boundary, flags anything outside it
- Statistical: flag Z-score > 3 as an anomaly
- Autoencoders (deep learning) – high reconstruction error = anomaly
from sklearn.ensemble import IsolationForest

iso = IsolationForest(
    contamination=0.05,  # expect ~5% anomalies
    random_state=42
)
iso.fit(X_train)

predictions = iso.predict(X_test)
# Returns: 1 = normal, -1 = anomaly
anomalies = X_test[predictions == -1]
What it is: Find which items appear together frequently. Classic use case: Market Basket Analysis – "customers who buy diapers also buy beer on Fridays".
from mlxtend.frequent_patterns import apriori, association_rules

# data = one-hot encoded: rows=transactions, cols=items
# e.g. df[['bread','butter','milk','eggs']] = True/False
frequent_items = apriori(df, min_support=0.05, use_colnames=True)

rules = association_rules(frequent_items,
                          metric='confidence', min_threshold=0.5)

# Top rules by lift
rules.sort_values('lift', ascending=False).head(10)
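The three rule metrics (support, confidence, lift) are easy to compute by hand. A sketch over five invented transactions for the rule {bread} → {butter}:

```python
# 5 invented transactions; does {bread} -> {butter} hold up?
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'eggs'},
    {'milk', 'eggs'},
    {'bread', 'butter', 'eggs'},
]
n = len(transactions)

support_bread = sum('bread' in t for t in transactions) / n             # 4/5
support_both = sum({'bread', 'butter'} <= t for t in transactions) / n  # 3/5
support_butter = sum('butter' in t for t in transactions) / n           # 3/5

confidence = support_both / support_bread  # P(butter | bread)
lift = confidence / support_butter         # > 1 = positive association

print(round(confidence, 2), round(lift, 2))  # 0.75 1.25
```

Lift of 1.25 means buying bread makes butter 25% more likely than its baseline rate – exactly what the mlxtend `lift` column reports.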
Ensemble Methods & Boosting
Combining many weak models to build a powerful one. These are the techniques that win Kaggle competitions and dominate real-world ML benchmarks.
Bagging: Train multiple models on different random SAMPLES (with replacement) of training data. Combine with voting/averaging. Reduces variance (overfitting).
Pasting: Same idea but sampling WITHOUT replacement. Less diversity but sometimes better.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # 'base_estimator' in sklearn < 1.2
    n_estimators=100,
    max_samples=0.8,  # 80% of data per tree
    bootstrap=True,   # False = Pasting
    oob_score=True
)
bag.fit(X_train, y_train)
Random Subspaces: Each model is trained on all samples but only a random subset of features. Reduces feature correlation between trees.
Random Patches: Each model sees random subset of BOTH samples AND features. Maximum diversity.
# Random Subspaces: full samples, subset of features
rs = BaggingClassifier(bootstrap=False, max_features=0.5)

# Random Patches: subset of both samples AND features
rp = BaggingClassifier(bootstrap=True, max_features=0.5)
What it is: Combine fundamentally different models (SVM + Random Forest + Logistic Regression) and take majority vote (hard) or average probabilities (soft).
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('svm', SVC(probability=True))
    ],
    voting='soft'  # 'soft' averages probabilities (better)
)
voting.fit(X_train, y_train)
What it is: Trees are built SEQUENTIALLY. Each new tree corrects the errors of the previous. The model "boosts" by focusing on where it was wrong before.
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,  # smaller = more trees needed but better
    max_depth=3,         # shallow trees work best for boosting
    subsample=0.8        # stochastic gradient boosting
)
gbm.fit(X_train, y_train)
What it is: XGBoost = GBM + regularization + parallelization + hardware optimization. Dominated ML competitions for years. Still one of the best algorithms for tabular data.
- Built-in L1 + L2 regularization (prevents overfitting)
- Parallel tree building (fast even on large data)
- Handles missing values automatically
- Early stopping (stop when validation score stops improving)
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,     # random feature subset per tree
    reg_alpha=0.1,            # L1 regularization
    reg_lambda=1.0,           # L2 regularization
    early_stopping_rounds=50,
    eval_metric='logloss'
    # (use_label_encoder was removed in recent XGBoost versions)
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)], verbose=100)
LightGBM: Microsoft's GBM. Grows trees leaf-wise (not level-wise) โ faster, more accurate for large datasets. Great when you have millions of rows.
CatBoost: Yandex's GBM. Handles categorical features natively without encoding. Reduces overfitting on small datasets. Very easy to use.
import lightgbm as lgb
from catboost import CatBoostClassifier

# LightGBM
lgbm = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31  # key param: controls complexity
)

# CatBoost – no encoding needed for categorical columns!
cat_features = ['city', 'category', 'brand']  # column names
cb = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    cat_features=cat_features,  # just tell it which are categorical
    verbose=100
)
What it is: The original boosting algorithm. Misclassified samples get higher weights in the next iteration. Each new model focuses more on hard cases.
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    algorithm='SAMME.R'  # uses probabilities (deprecated in sklearn >= 1.4)
)
ada.fit(X_train, y_train)
Stacking: Train Level-1 models. Their predictions become features for a Level-2 "meta-model". The meta-model learns how to best combine Level-1 predictions.
Blending: Simpler version – Level-1 models predict on a holdout set, and the meta-model trains on those predictions.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb

level1 = [
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('xgb', xgb.XGBClassifier()),
    ('svm', SVC(probability=True))
]
meta_model = LogisticRegression()

stacking = StackingClassifier(
    estimators=level1,
    final_estimator=meta_model,
    cv=5  # cross-validation to avoid overfitting
)
stacking.fit(X_train, y_train)
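Blending has no dedicated sklearn class, but it's short to hand-roll. A minimal sketch on synthetic data (the model choices and split sizes are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)

# Split train -> (base, holdout); the holdout feeds the meta-model
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Level-1 models train on the base split only
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Their holdout predictions become the meta-model's features
meta_X = np.column_stack([
    rf.predict_proba(X_hold)[:, 1],
    lr.predict_proba(X_hold)[:, 1],
])
meta_model = LogisticRegression().fit(meta_X, y_hold)
print(meta_model.score(meta_X, y_hold))  # meta-model fit on the holdout
```

The key difference from stacking: each Level-1 model predicts on data it never trained on, but without the 5-fold machinery – one holdout split does the job.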
Model Evaluation
How do you know if your model is actually good? This is what separates real ML engineers from people who just run code. These metrics are heavily asked in interviews.
What it is: A 2×2 table showing the 4 types of predictions. The foundation of ALL classification metrics.
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
# [[TN, FP],
#  [FN, TP]]

# Full report in one line!
print(classification_report(y_test, y_pred))
- Accuracy – balanced classes, general purpose
- Precision – when FP is costly (spam filter, false police alerts)
- Recall – when FN is costly (cancer detection, fraud)
- F1 – imbalanced classes, need a balance of P & R
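All four metrics fall straight out of the confusion matrix counts. A hand computation with invented counts (50 TN, 10 FP, 5 FN, 35 TP):

```python
# Invented counts for illustration
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 85/100 = 0.85
precision = tp / (tp + fp)                          # 35/45 ~ 0.778
recall = tp / (tp + fn)                             # 35/40 = 0.875
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.824

print(accuracy, round(precision, 3), recall, round(f1, 3))
```

Note that F1 (0.824) sits between precision and recall but closer to the worse of the two – the harmonic mean punishes imbalance between them.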
ROC: Receiver Operating Characteristic curve – plots True Positive Rate vs False Positive Rate at every threshold. AUC: Area Under the Curve. AUC=1.0 is perfect, AUC=0.5 is random (useless).
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probabilities, not binary predictions
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc:.3f}")  # 0.5=random, 1.0=perfect

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC={auc:.2f}')
plt.plot([0, 1], [0, 1], '--')  # random baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
What it is: Instead of one train/test split, use K different splits. Each slice gets to be the test set once. Average K scores = more reliable estimate of real performance.
Stratified K-Fold: Each fold maintains the same class ratio as the original data. Essential for imbalanced datasets.
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Simple K-Fold (5 folds)
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Stratified (preserves the class ratio per fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
Hyperparameter Tuning
Model parameters are learned from data. Hyperparameters are settings YOU choose before training. Tuning them is the difference between a decent model and a great one.
What it is: Try every single combination of hyperparameters you specify. Guaranteed to find the best within the grid, but SLOW for large grids.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}
# 3 × 4 × 3 = 36 combinations × 5 CV folds = 180 fits

gs = GridSearchCV(
    RandomForestClassifier(), param_grid,
    cv=5, scoring='f1', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print(gs.best_params_)  # {'max_depth': 7, 'n_estimators': 200, ...}
print(gs.best_score_)   # 0.923
best_model = gs.best_estimator_
What it is: Instead of trying every combination, randomly sample N combinations. Faster than GridSearch and often just as good, or better, because it explores more of the space.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Can use distributions, not just lists!
param_dist = {
    'n_estimators': randint(100, 500),    # random int 100-500
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.2),  # random float
    'subsample': uniform(0.6, 0.4)
}

rs = RandomizedSearchCV(
    xgb.XGBClassifier(), param_dist,
    n_iter=50,  # try 50 random combinations (vs 36 exhaustive)
    cv=5, scoring='roc_auc', n_jobs=-1
)
rs.fit(X_train, y_train)
print(rs.best_params_)
The right workflow: Never tune on your test set. Use train → validation → test. A Pipeline prevents data leakage by ensuring preprocessing is learned from training data only.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ✅ Scaler is INSIDE the pipeline – no leakage!
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100))
])

# GridSearch the whole pipeline
param_grid = {
    'model__n_estimators': [100, 200],  # prefix = pipeline step name
    'model__max_depth': [5, 10]
}
gs = GridSearchCV(pipeline, param_grid, cv=5)
gs.fit(X_train, y_train)

# Final test set evaluation – only done ONCE
final_score = gs.score(X_test, y_test)