45+ Topics Covered · 8 Learning Modules · 100% From Your Syllabus

Learning Roadmap

Step 1: Math Foundations
Step 2: Data Preprocessing
Step 3: Regression
Step 4: Classification
Step 5: Clustering
Step 6: Ensemble Methods
Step 7: Evaluation

โš ๏ธ What the syllabus images don't fully show (you'd need to add):

The visible slides were cut off before showing: F1-Score, ROC-AUC, Log Loss, Rยฒ / Adjusted Rยฒ, MAE/MSE/RMSE metrics. These are covered in the Evaluation section of this guide.

Linear Algebra — Matrices & Vectors
FOUNDATION

What it is: Math for working with tables of numbers. A vector is one column of data. A matrix is a whole table. In ML, your entire dataset is a matrix.

Your Angular component's @Input() properties are like a vector. The full component tree is like a matrix — rows = components, columns = properties.

Key concepts: Dot product (multiplying two vectors = one number), Matrix multiplication (transforming data), Eigenvalues (used in PCA — finding the "most important directions" in data).

dot_product = Σ(a[i] × b[i]) for each index i
matrix_multiply: C[i][j] = Σ A[i][k] × B[k][j]
Python — numpy basics (numpy)
import numpy as np

# A vector — one row of your dataset (one sample)
sample = np.array([28, 75000, 3])  # age, salary, years_exp

# A matrix — your whole dataset
dataset = np.array([
    [28, 75000, 3],
    [35, 90000, 8],
    [22, 45000, 1]
])

# Shape = (rows=samples, cols=features)
print(dataset.shape)  # (3, 3)

# Dot product
weights = np.array([0.5, 0.3, 0.2])
result = np.dot(sample, weights)  # 14 + 22500 + 0.6 = 22514.6
Calculus — Derivatives & Gradients
FOUNDATION

What it is: A derivative tells you "if I change X a tiny bit, how much does Y change?" In ML, this is how the model learns — it adjusts its weights in the direction that reduces error.

Imagine you're lost on a hilly field in fog and want to reach the lowest valley. You feel the slope under your feet (that's the gradient) and take a step downhill. That's exactly how neural networks learn — it's called "Gradient Descent".

Gradient Descent: The core learning algorithm. Move weights in the direction that reduces loss.

new_weight = old_weight - (learning_rate × gradient)
gradient = dLoss/dWeight (how much loss changes per weight change)
Python — Gradient Descent from scratch (concept)
import numpy as np

# Imagine y = 3x + noise (we want to find w=3)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y_true = 3 * x + rng.normal(0, 0.5, 100)

w = 0.0    # start guess
lr = 0.01  # learning rate — step size

for epoch in range(100):
    prediction = w * x
    error = prediction - y_true
    gradient = 2 * np.mean(error * x)  # dLoss/dw
    w = w - lr * gradient              # step downhill

# After 100 steps, w ≈ 3.0
Probability & Bayes' Theorem
FOUNDATION

What it is: P(A|B) = "the probability of A, GIVEN that B is true." This is how Naive Bayes classifiers work โ€” they use conditional probability to classify.

You get an email with the word "FREE MONEY". What's the probability it's spam given it has those words? That's exactly Bayes' Theorem in action.
P(Spam | "FREE MONEY") = P("FREE MONEY" | Spam) × P(Spam) / P("FREE MONEY")
                       = Likelihood × Prior / Evidence

Why it matters: Used directly in Naive Bayes. Underlies all probabilistic ML. Understanding priors vs posteriors helps you understand how models update beliefs.

Key Distributions to Know
  • Normal/Gaussian — bell curve, most natural data
  • Bernoulli — yes/no outcomes (click or no click)
  • Multinomial — multiple categories (Naive Bayes for text)
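Bayes' Theorem from the formula above can be checked with plain arithmetic. A minimal sketch of the spam example, with illustrative probabilities (the 0.3 / 0.6 / 0.02 values are made up for demonstration):

```python
# Hypothetical numbers for the spam example (illustrative only)
p_spam = 0.3                # P(Spam), the prior
p_words_given_spam = 0.6    # P("FREE MONEY" | Spam), the likelihood
p_words_given_ham = 0.02    # P("FREE MONEY" | Not Spam)

# Evidence: P("FREE MONEY") via the law of total probability
p_words = p_words_given_spam * p_spam + p_words_given_ham * (1 - p_spam)

# Posterior: Bayes' Theorem
p_spam_given_words = p_words_given_spam * p_spam / p_words
print(f"P(Spam | 'FREE MONEY') = {p_spam_given_words:.3f}")  # 0.928
```

Even with a modest prior (30% of mail is spam), a word combination that is 30x more likely in spam pushes the posterior above 90%.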
Descriptive & Inferential Statistics
FOUNDATION

Descriptive: Summarizing data you have. Mean, median, mode, variance, standard deviation. These are also used directly in preprocessing (Z-Score needs mean + std).

Inferential: Drawing conclusions about a whole population from a sample. Used in hypothesis testing, A/B tests, and understanding model performance.

Python — Basic stats (numpy/pandas)
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# Descriptive stats in one line
df.describe()  # count, mean, std, min, max, quartiles

# Individual stats
mean = df['salary'].mean()
std  = df['salary'].std()
med  = df['salary'].median()
var  = df['salary'].var()

# Correlation matrix — which features relate?
df.corr(numeric_only=True)  # values close to 1 or -1 = strong relationship
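The block above covers the descriptive half. For the inferential half, a minimal sketch of a two-sample t-test with scipy, on synthetic salary data (the group names and numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(50000, 5000, 100)  # e.g. salaries at branch A
group_b = rng.normal(52000, 5000, 100)  # salaries at branch B

# Null hypothesis: both samples come from populations with equal means
t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value < 0.05:
    print("Reject the null: the means likely differ")
else:
    print("Not enough evidence that the means differ")
```

The same pattern (state a null hypothesis, compute a p-value, compare to a threshold) underlies A/B testing and significance checks on model comparisons.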
Data Imputation (Handling Missing Values)
PREPROCESSING

The problem: Real data always has missing values. You can't just delete all rows โ€” you'd lose too much data.

A spreadsheet where some cells are blank. You need to fill them in intelligently, not just put "0" everywhere.
Strategies
  • Mean imputation — fill with average (for numeric, no outliers)
  • Median imputation — fill with middle value (better when outliers exist)
  • Mode imputation — fill with most frequent (for categorical data)
  • KNN Imputer — fill using k nearest neighbours' values (smartest)
Python — Imputation methods (sklearn)
from sklearn.impute import SimpleImputer, KNNImputer
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'salary': [50000, 60000, np.nan, 45000, 70000]
})

# Mean imputation
imputer = SimpleImputer(strategy='mean')
df_filled = imputer.fit_transform(df)

# KNN Imputer (uses nearby rows to guess missing value)
knn_imp = KNNImputer(n_neighbors=2)
df_knn = knn_imp.fit_transform(df)
Outlier Detection — Z-Score & IQR
PREPROCESSING

What it is: Outliers are data points so far from the rest that they'd skew your model. Like one employee earning ₹10 Crore in a list of ₹5-15 Lakh salaries.

Imagine rating a restaurant. 99 people give 4/5. One bitter person gives 0/5. Their score is an outlier that pulls the average down unfairly.
Z-Score: z = (value - mean) / std_dev → |z| > 3 usually means outlier
IQR Method: IQR = Q3 - Q1 → Outlier if value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR
Python — Outlier Detection (scipy/pandas)
from scipy import stats
import numpy as np

# Z-Score method
z_scores = np.abs(stats.zscore(df['salary']))
df_clean = df[z_scores < 3]  # keep only non-outliers

# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_iqr = df[(df['salary'] >= Q1 - 1.5*IQR) &
            (df['salary'] <= Q3 + 1.5*IQR)]
Feature Scaling — Min-Max & Standard Scaler
PREPROCESSING

Why it matters: If age is 0-100 and salary is 0-10,000,000, the model thinks salary is a million times more important. Scaling fixes this.

Converting all measurements to the same unit. You wouldn't compare a person's height in feet with their weight in grams without converting first.
Min-Max: x_scaled = (x - min) / (max - min) → range [0, 1]
Standard Scaler: x_scaled = (x - mean) / std_dev → mean=0, std=1

When to use which: Min-Max when you know bounds and no big outliers. Standard Scaler (Z-score normalization) when you have outliers or need Gaussian distribution.

Python — Feature Scaling (sklearn)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[25, 50000], [30, 90000], [22, 30000]]

# Min-Max: scales to 0-1
mm = MinMaxScaler()
X_mm = mm.fit_transform(X)
# age: [0.375, 1.0, 0.0], salary: [0.333, 1.0, 0.0]

# Standard Scaler: mean=0, std=1
ss = StandardScaler()
X_ss = ss.fit_transform(X)
Feature Transformation — Log & Box-Cox
PREPROCESSING

What it is: Many ML models assume data is normally distributed (bell curve). But real data like salaries, prices, or website visits is skewed. Transformations fix the shape.

Imagine plotting salaries of 1000 people. Most earn 30k-80k, but a few billionaires make the x-axis stretch to millions. Log transform squishes it back to a readable bell shape.
Python — Log & Box-Cox Transform (scipy)
import numpy as np
from scipy.stats import boxcox

salaries = [30000, 45000, 60000, 500000, 1200000]

# Log transform — great for right-skewed data
log_sal = np.log1p(salaries)  # log1p = log(x+1), handles 0

# Box-Cox — finds the BEST power transformation automatically
bc_sal, lambda_ = boxcox(salaries)  # lambda_ is the optimal power
Handling Imbalanced Data — SMOTE & Undersampling
PREPROCESSING

The problem: You have 9,900 legit transactions and 100 fraudulent ones. If your model just predicts "not fraud" for everything, it's 99% accurate but totally useless.

A class with 99 right-handed and 1 left-handed student. If a teacher treats everyone as right-handed, they're "99% accurate" but failing that one student completely.
Solutions
  • SMOTE — creates synthetic minority samples (oversampling)
  • Random Undersampling — removes majority class samples
  • class_weight='balanced' — tells the model to care more about minority
Python — SMOTE (imbalanced-learn)
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
# Now minority class has synthetic samples added
print(y_res.value_counts())  # should be balanced now
Dimensionality Reduction — PCA, LDA, t-SNE
PREPROCESSING

What it is: Your dataset might have 500 features (columns). Many are redundant. Reduce to 2-10 key dimensions without losing important information.

A 3D sculpture's shadow on a wall is a 2D projection. You lose the depth, but you still see the shape. PCA finds the "best angle" to project from to keep maximum information.

  • PCA (Principal Component Analysis) — unsupervised, finds directions of max variance
  • LDA (Linear Discriminant Analysis) — supervised, finds directions that best separate classes
  • t-SNE — great for visualization only (2D/3D plots of high-dim data)

Python — PCA (sklearn)
from sklearn.decomposition import PCA

# Reduce 500 features → 10 principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

# How much variance does each component explain?
print(pca.explained_variance_ratio_)
# [0.32, 0.18, 0.12, ...]  → first 3 explain 62% of info
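LDA and t-SNE follow the same fit_transform pattern as the PCA block above. A minimal sketch on sklearn's built-in iris dataset (note: LDA needs the labels y, and t-SNE output should only be used for plotting, never as model input):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# LDA is supervised: max components = n_classes - 1 (here 3 - 1 = 2)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# t-SNE: 2D embedding for visualization only
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

print(X_lda.shape, X_tsne.shape)  # (150, 2) (150, 2)
```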
Simple & Multiple Linear Regression
REGRESSION

Simple: One input feature → one output. Draw the best straight line through data points.

Multiple: Many input features → one output. The line becomes a plane (or hyperplane).

Predicting your electric bill. Simple: bill depends on AC hours. Multiple: bill depends on AC hours + number of people + season + appliances.

y = w0 + w1×x1 + w2×x2 + ... + wn×xn + error
Goal: minimize SSE (Sum of Squared Errors) = Σ(actual - predicted)²
Python — Linear Regression (sklearn)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"R²: {r2_score(y_test, y_pred):.3f}")        # 1.0 = perfect
print(f"RMSE: {mean_squared_error(y_test, y_pred)**0.5:.2f}")
Evaluation Metrics for Regression
  • MAE — Mean Absolute Error (avg distance from truth, easy to interpret)
  • MSE — Mean Squared Error (punishes big errors more)
  • RMSE — Root MSE (same unit as target, most common)
  • R² — 0 to 1, how much variance the model explains
Polynomial Regression
REGRESSION

What it is: Linear regression can only fit straight lines. Polynomial regression fits curves by adding x², x³ etc. as features.

Speed vs fuel efficiency isn't a straight line — it curves. At 40 km/h it's good, at 80 km/h still ok, at 160 km/h it tanks. You need a curve, not a line.

y = w0 + w1×x + w2×x² + w3×x³ + ...
(Still "linear regression" because the weights w are linear — only x is polynomial)
Python — Polynomial Regression (sklearn)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Degree 2 = add x² features, degree 3 = add x², x³
poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression())
])
poly_model.fit(X_train, y_train)

# โš ๏ธ High degree = overfitting risk!
# degree=10 fits training data perfectly but fails on new data
Regularization — L1 Lasso, L2 Ridge, ElasticNet
REGRESSION

The problem it solves: Overfitting — model learns training data too well (memorizes noise) and fails on new data.

A student who memorizes past exam papers vs one who understands the subject. Regularization penalizes "too much complexity" — like a teacher saying "explain it simply, don't recite 100 rules".

L1 (Lasso): Can reduce some weights to exactly 0 — effectively removes useless features. Good for feature selection.

L2 (Ridge): Shrinks all weights toward 0 but never exactly 0. Good when all features matter a little.

ElasticNet: Mix of both L1 + L2.

Loss(Lasso) = MSE + α × Σ|weights| ← L1
Loss(Ridge) = MSE + α × Σ(weights²) ← L2
Loss(Elastic) = MSE + α×L1 + β×L2
Python — Regularization (sklearn)
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# alpha = regularization strength (higher = more penalty)
lasso = Lasso(alpha=0.1)    # some weights become exactly 0
ridge = Ridge(alpha=1.0)    # all weights shrink but stay non-zero
enet  = ElasticNet(alpha=0.5, l1_ratio=0.5)  # 50/50 mix

lasso.fit(X_train, y_train)
print(lasso.coef_)  # see which features got zero'd out
Decision Tree & Random Forest Regression
REGRESSION

Decision Tree Regression: Instead of fitting a line, it splits data into boxes and predicts the average value in each box.

A flowchart: "Is it a house or flat? If house → is it > 3 BHK? If yes → predict ₹1.2Cr. If no → predict ₹80L". Each box predicts a number.

Random Forest Regression: 100+ decision trees, each trained on random data subsets. Final prediction = average of all trees. Much more accurate and robust.

Python — Random Forest Regression (sklearn)
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,   # 100 trees
    max_depth=10,        # max tree depth
    random_state=42
)
rf.fit(X_train, y_train)

# Which features matter most?
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).plot.bar()
Support Vector Regression (SVR)
REGRESSION

What it is: Find a "tube" (epsilon-tube) around the prediction line. Only data points OUTSIDE the tube contribute to the error. Good for non-linear data with kernel trick.

You're drawing a pipe through a cloud of points. Points inside the pipe are "fine". Only the ones sticking out matter — those are the "support vectors".
When to use SVR
  • Small to medium datasets (gets slow on large data)
  • Non-linear relationships (use RBF kernel)
  • When you want robustness to outliers
Python — SVR (sklearn)
from sklearn.svm import SVR

# IMPORTANT: SVR needs scaled features!
svr = SVR(kernel='rbf', C=100, epsilon=0.1)
svr.fit(X_train_scaled, y_train)
y_pred = svr.predict(X_test_scaled)
Logistic Regression
CLASSIFICATION

โš ๏ธ Confusing name: Despite "Regression" in the name, this is a CLASSIFIER. It outputs a probability (0-1) and you set a threshold (usually 0.5) to classify.

You're deciding whether to give a loan. Logistic regression outputs "73% chance they'll repay". You decide: above 60% = approve. It's giving you a probability, not a category directly.
sigmoid(z) = 1 / (1 + e^(-z)) → squishes output to [0,1]
z = w0 + w1×x1 + w2×x2 + ...
If sigmoid(z) > 0.5 → class 1, else class 0
Python — Logistic Regression (sklearn)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Probabilities (not just 0/1)
probs = model.predict_proba(X_test)   # [[0.27, 0.73], ...]
preds = model.predict(X_test)         # [1, 0, 1, 1, ...]

# Multiclass is handled automatically; multi_class= is deprecated in recent sklearn
Naive Bayes (Gaussian, Multinomial, Bernoulli)
CLASSIFICATION

What it is: Uses Bayes' Theorem assuming all features are independent. "Naive" because real features are rarely truly independent, but it still works surprisingly well.

Spam filter: it checks each word separately — "FREE", "MONEY", "CLICK" each raise the spam probability independently. It ignores that "FREE MONEY" together is more suspicious than each word alone.

Gaussian NB: continuous features (age, salary). Multinomial NB: count data (word frequencies). Bernoulli NB: binary features (word present/absent).

Python — Naive Bayes (sklearn)
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian: for continuous features like age, salary
gnb = GaussianNB()

# Multinomial: for text classification (word counts)
mnb = MultinomialNB(alpha=1.0)  # alpha = smoothing

# Bernoulli: for binary features (word present: yes/no)
bnb = BernoulliNB()

gnb.fit(X_train, y_train)
accuracy = gnb.score(X_test, y_test)
Best for
  • Text classification (spam, sentiment)
  • Real-time predictions (very fast)
  • Small datasets where it often beats complex models
K-Nearest Neighbors (KNN)
CLASSIFICATION

What it is: For a new data point, find the K most similar points in training data. Majority class among those K neighbors = prediction.

You move to a new city and wonder which restaurant to try. You ask your 5 nearest neighbours (K=5). 3 recommend "Restaurant A", 2 say "B". You go to A. That's KNN.

Choosing K: Low K (1-3) = flexible, may overfit. High K = smooth, may underfit. Usually try K=5, then tune with cross-validation. K should be odd for binary classification to avoid ties.

Distance (Euclidean) = √(Σ(xi - xj)²)
Predict: majority class among K nearest points
Python — KNN Classifier (sklearn)
from sklearn.neighbors import KNeighborsClassifier

# MUST scale features! KNN is distance-based
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Find optimal K
scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    scores.append(knn.score(X_test_scaled, y_test))
# Pick k with highest score
Support Vector Machines (SVM) & Kernel SVM
CLASSIFICATION

What it is: Find the "best dividing line" (hyperplane) between classes, maximizing the margin (gap) between the line and the nearest points of each class. Those nearest points are the "support vectors".

Drawing a line between red and blue dots on paper. You want the line that has the maximum "no man's land" gap on either side. The dots right on the edge of each side are the support vectors.

Kernel SVM: When data isn't linearly separable, the kernel trick maps data to a higher dimension where it IS separable — without actually computing that transformation (it's a mathematical shortcut).

Kernels
  • Linear kernel — for linearly separable data
  • RBF (Radial Basis Function) — most common, works for most non-linear data
  • Polynomial kernel — for polynomial boundaries
Python — SVM (sklearn)
from sklearn.svm import SVC

# C = margin hardness (high C = fewer margin violations)
# gamma = RBF influence radius (high = complex boundary)
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_train_scaled, y_train)

# probability=True enables predict_proba()
probs = svm.predict_proba(X_test_scaled)
Decision Tree Classification
CLASSIFICATION

What it is: A flowchart of yes/no questions. At each node, it splits data to maximize "purity" (all one class together). Leaves are final class predictions.

The game "20 Questions". "Is it alive? → Is it bigger than a dog? → Does it have stripes?" Each question narrows down the answer. The tree learns which questions are most useful.

Gini Impurity = 1 - Σ(pi²) (0 = pure, 0.5 = max impurity)
Entropy = -Σ(pi × log2(pi))
Information Gain = Entropy(parent) - weighted_avg(Entropy(children))
Python — Decision Tree (sklearn)
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(
    max_depth=5,          # prevents overfitting
    min_samples_leaf=5,   # min 5 samples per leaf
    criterion='gini'
)
dt.fit(X_train, y_train)

# Print the tree rules — interpretable!
print(export_text(dt, feature_names=feature_names))
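The impurity formulas above are easy to verify by hand. A tiny sketch (helper functions written here for illustration, not sklearn internals):

```python
import numpy as np

def gini(p):
    """Gini impurity for a list of class probabilities p."""
    p = np.asarray(p)
    return 1 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits for a list of class probabilities p."""
    p = np.asarray(p)
    p = p[p > 0]  # skip zero-probability classes to avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5 -> max impurity for 2 classes
print(gini([1.0, 0.0]))     # 0.0 -> pure node
print(entropy([0.5, 0.5]))  # 1.0 bit -> max uncertainty for 2 classes
```

A split is "good" when the weighted impurity of the children is much lower than the parent's (that difference is the Information Gain).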
Random Forest Classification
CLASSIFICATION

What it is: An ensemble of decision trees. Each tree is trained on a random subset of data and features. Final prediction = majority vote of all trees.

Instead of asking one expert, you ask 100 experts. Each sees slightly different data. You take the majority vote. This "wisdom of crowds" approach is far more accurate than any single tree.

Why it's better than one tree: Reduces overfitting (high variance → low variance), handles missing values, gives feature importances, rarely needs tuning.

Python — Random Forest (sklearn)
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,    # more trees = more stable
    max_features='sqrt', # each tree sees √(n_features)
    oob_score=True,      # out-of-bag score (free validation)
    n_jobs=-1            # use all CPU cores
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
K-Means Clustering & K-Means++
CLUSTERING

What it is: Group data into K clusters. Each point belongs to the cluster with the nearest center (centroid). Centers are updated iteratively until stable.

You have 1000 customers and want to group them into 3 types. K-Means finds 3 "average customers" and assigns everyone to their closest type. Think of it as finding 3 representative profiles automatically.

K-Means++: Better initialization — centroids start spread out, not random. Converges faster and to better solutions.

Step 1: Place K centroids randomly
Step 2: Assign each point to nearest centroid
Step 3: Move each centroid to the mean of its assigned points
Repeat 2-3 until centroids stop moving
Python — K-Means Clustering (sklearn)
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3,
    init='k-means++',  # smarter init (default)
    n_init=10,         # run 10 times, pick best
    random_state=42
)
labels = km.fit_predict(X_scaled)
centers = km.cluster_centers_
inertia = km.inertia_  # sum of squared distances to centers
Elbow Method & Silhouette Score
CLUSTERING

The problem: K-Means needs you to choose K. How do you know the right number of clusters?

Elbow method: Imagine adding more employees to a team. First 5 reduce workload a lot. Adding employees 6-10 helps less. After 15, it's chaos. The "elbow" is where adding more stops helping much.

Silhouette Score: For each point, measures "how well does it fit its own cluster vs the nearest other cluster". Range: -1 (bad) to +1 (perfect). Higher = better clusters.

Python — Finding optimal K (sklearn)
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias = []
sil_scores = []

for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot inertia — look for the "elbow"
plt.plot(range(2, 11), inertias, marker='o')
plt.title('Elbow Method')

# Best K = highest silhouette score
best_k = range(2, 11)[sil_scores.index(max(sil_scores))]
Hierarchical Clustering & Dendrograms
CLUSTERING

What it is: Builds a tree (dendrogram) showing how data points merge into clusters at different scales. You don't need to specify K upfront.

Think of evolution's family tree. Individual species → subspecies → species → genus → family → order. You can "cut" the tree at any level to get any number of clusters.

Agglomerative (bottom-up): Start with every point as its own cluster, merge closest pairs repeatedly. Divisive (top-down): Start with one big cluster, split repeatedly.

Python — Hierarchical Clustering (sklearn/scipy)
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Draw the dendrogram to see natural clusters
Z = linkage(X_scaled, method='ward')
dendrogram(Z, truncate_mode='level', p=5)
plt.title('Dendrogram')
plt.show()

# Cut at K=3 clusters
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hc.fit_predict(X_scaled)
DBSCAN (Density-Based Clustering)
CLUSTERING

What it is: Clusters are "dense regions" of points. Works on any cluster shape (not just circles). Automatically identifies outliers as noise (label = -1).

Imagine people standing in a park. Groups of people close together are clusters. Lone people standing far away from any group are "noise points" — outliers.

Parameters: eps = max distance between neighbors, min_samples = min points to form a cluster core.

DBSCAN vs K-Means
  • DBSCAN: no need to specify K, finds arbitrary shapes, handles noise
  • K-Means: faster, works better on spherical clusters, needs K upfront
  • Use DBSCAN when clusters are irregular or you have many outliers
Python — DBSCAN (sklearn)
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)  # -1 = outlier/noise
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
Gaussian Mixture Models (GMM)
CLUSTERING

What it is: Assumes data comes from a mix of K Gaussian (normal) distributions. "Soft clustering" — each point gets a probability of belonging to each cluster, not a hard assignment.

K-Means is like forcing everyone into exactly one political party. GMM is like saying "this voter is 60% party A, 30% party B, 10% party C" — they might switch depending on context.
Python — Gaussian Mixture Model (sklearn)
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full')
gmm.fit(X_scaled)

# Hard labels
labels = gmm.predict(X_scaled)

# Soft probabilities — unique to GMM!
probs = gmm.predict_proba(X_scaled)
# Each row sums to 1.0, shows cluster membership probability
Anomaly Detection
UNSUPERVISED

What it is: Find data points that don't fit the normal pattern. Used for fraud detection, network intrusion, manufacturing defects, medical anomalies.

Security camera watching a parking lot. Every day 50 cars park normally. One day a car parks backwards and stays for 12 hours at 3am. That's an anomaly — it stands out from normal behavior.
Common Methods
  • Isolation Forest — isolates anomalies with random splits (fast, scalable)
  • One-Class SVM — learns the "normal" boundary, flags outside it
  • Statistical: flag Z-score > 3 as anomaly
  • Autoencoders (deep learning) — high reconstruction error = anomaly
Python — Isolation Forest (sklearn)
from sklearn.ensemble import IsolationForest

iso = IsolationForest(
    contamination=0.05,  # expect ~5% anomalies
    random_state=42
)
iso.fit(X_train)
predictions = iso.predict(X_test)
# Returns: 1 = normal, -1 = anomaly
anomalies = X_test[predictions == -1]
Association Rule Learning — Apriori & Eclat
UNSUPERVISED

What it is: Find which items appear together frequently. Classic use case: Market Basket Analysis — "customers who buy diapers also buy beer on Fridays".

You run a supermarket. You look at thousands of receipts and notice: 73% of people who buy bread also buy butter in the same trip. That's an association rule. You move butter next to bread → sales increase.
Support = (transactions with {A and B}) / (total transactions)
Confidence = support({A,B}) / support({A})
Lift = Confidence / support({B}) → Lift > 1 means buying A INCREASES chance of buying B
Python — Apriori Algorithm (mlxtend)
from mlxtend.frequent_patterns import apriori, association_rules

# data = one-hot encoded: rows=transactions, cols=items
# e.g. df[['bread','butter','milk','eggs']] = True/False

frequent_items = apriori(df, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric='confidence', min_threshold=0.5)

# Top rules by lift
rules.sort_values('lift', ascending=False).head(10)
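The three formulas above can also be computed by hand, which makes them concrete. A sketch on five made-up transactions (the item names and counts are illustrative):

```python
# Five illustrative transactions (sets of items bought together)
transactions = [
    {'bread', 'butter'},
    {'bread', 'butter', 'milk'},
    {'bread', 'eggs'},
    {'milk', 'eggs'},
    {'bread', 'butter', 'eggs'},
]
n = len(transactions)

support_a  = sum('bread' in t for t in transactions) / n              # P(bread) = 0.8
support_b  = sum('butter' in t for t in transactions) / n             # P(butter) = 0.6
support_ab = sum({'bread', 'butter'} <= t for t in transactions) / n  # P(both) = 0.6

confidence = support_ab / support_a  # P(butter | bread) = 0.75
lift = confidence / support_b        # 1.25 > 1: bread buyers favour butter

print(f"support={support_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

apriori/association_rules do exactly this bookkeeping, just efficiently over thousands of transactions and all itemsets above min_support.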
Bagging (Bootstrap Aggregating) & Pasting
ENSEMBLE

Bagging: Train multiple models on different random SAMPLES (with replacement) of training data. Combine with voting/averaging. Reduces variance (overfitting).

Pasting: Same idea but sampling WITHOUT replacement. Less diversity but sometimes better.

You want to predict election results. Instead of one giant poll, you run 100 small independent polls in random samples of voters. You average all results → more reliable than any single poll.
Python — BaggingClassifier (sklearn)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator before sklearn 1.2
    n_estimators=100,
    max_samples=0.8,       # 80% of data per tree
    bootstrap=True,        # False = Pasting
    oob_score=True
)
bag.fit(X_train, y_train)
Random Patches & Random Subspaces
ENSEMBLE

Random Subspaces: Each model is trained on all samples but only a random subset of features. Reduces feature correlation between trees.

Random Patches: Each model sees random subset of BOTH samples AND features. Maximum diversity.

Random Subspaces: Each expert sees the full patient list but only some of the symptoms. Random Patches: Each expert sees only some patients AND only some symptoms. Both force diversity.
Python — Random Subspaces/Patches (sklearn)
# Random Subspaces: ALL samples, random subset of features
rs = BaggingClassifier(bootstrap=False, max_samples=1.0, max_features=0.5)

# Random Patches: random subset of both samples AND features
rp = BaggingClassifier(bootstrap=True, max_samples=0.7, max_features=0.5)
Voting Classifier / Regressor
ENSEMBLE

What it is: Combine fundamentally different models (SVM + Random Forest + Logistic Regression) and take majority vote (hard) or average probabilities (soft).

Asking 3 doctors — a GP, a specialist, and a radiologist — for a diagnosis. Each sees from a different angle. If all 3 agree, you're probably right. Even when they disagree, the majority vote is usually better than any one alone.
Python — VotingClassifier (sklearn)
from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('svm', SVC(probability=True))
    ],
    voting='soft'  # 'soft' averages probabilities (better)
)
voting.fit(X_train, y_train)
Gradient Boosting (GBM)
BOOSTING

What it is: Trees are built SEQUENTIALLY. Each new tree corrects the errors of the previous. The model "boosts" by focusing on where it was wrong before.

A teacher gives a test. After marking, they identify the hardest questions. Next class, they focus ONLY on those hard questions. Next test: focus on the new weak spots. Each iteration targets remaining weaknesses.
F_m(x) = F_{m-1}(x) + η × h_m(x)
where h_m = new tree fitted to RESIDUALS of previous model
η = learning rate (how much to trust each new tree)
Python — GradientBoosting (sklearn)
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,  # smaller = more trees needed but better
    max_depth=3,          # shallow trees work best for boosting
    subsample=0.8         # stochastic gradient boosting
)
gbm.fit(X_train, y_train)
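The update rule above can be demonstrated from scratch: keep a running prediction F, fit each new shallow tree to the residuals, and add a small fraction of it. A sketch on synthetic data (all values here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

lr = 0.1                          # eta, the learning rate
F = np.full_like(y, y.mean())     # F_0: start from the mean prediction
trees = []

for _ in range(100):
    residuals = y - F                     # what the model still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                # h_m fitted to the residuals
    F = F + lr * tree.predict(X)          # F_m = F_{m-1} + eta * h_m
    trees.append(tree)

print(np.mean((y - F) ** 2))  # training MSE, far below the initial variance
```

Each round shrinks the residuals a little; the learning rate keeps any single tree from dominating, which is why boosting uses many shallow trees.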
XGBoost — Extreme Gradient Boosting
BOOSTING

What it is: XGBoost = GBM + regularization + parallelization + hardware optimization. Dominated ML competitions for years. Still one of the best algorithms for tabular data.

XGBoost extras vs vanilla GBM
  • Built-in L1 + L2 regularization (prevents overfitting)
  • Parallel tree building (fast even on large data)
  • Handles missing values automatically
  • Early stopping (stop when validation score stops improving)
Python — XGBoost (xgboost)
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,  # random feature subset per tree
    reg_alpha=0.1,         # L1 regularization
    reg_lambda=1.0,        # L2 regularization
    early_stopping_rounds=50,
    eval_metric='logloss'
    # note: use_label_encoder was removed in xgboost 2.0
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=100)
LightGBM & CatBoost
BOOSTING

LightGBM: Microsoft's GBM. Grows trees leaf-wise (not level-wise) โ†’ faster, more accurate for large datasets. Great when you have millions of rows.

CatBoost: Yandex's GBM. Handles categorical features natively without encoding. Reduces overfitting on small datasets. Very easy to use.

Python — LightGBM & CatBoost (lightgbm / catboost)
import lightgbm as lgb
from catboost import CatBoostClassifier

# LightGBM
lgbm = lgb.LGBMClassifier(
    n_estimators=500, learning_rate=0.05,
    num_leaves=31  # key param: controls complexity
)

# CatBoost โ€” no encoding needed for categorical columns!
cat_features = ['city', 'category', 'brand']  # column names
cb = CatBoostClassifier(
    iterations=500, learning_rate=0.05,
    cat_features=cat_features,  # just tell it which are categorical
    verbose=100
)
AdaBoost
BOOSTING

What it is: The original boosting algorithm. Misclassified samples get higher weights in the next iteration. Each new model focuses more on hard cases.

A quiz where wrong answers get 2x points in the next round. You'd study harder for questions you got wrong. AdaBoost does the same โ€” increases the importance of data points the model keeps getting wrong.
Python — AdaBoost (sklearn)
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    algorithm='SAMME.R'  # uses probabilities (deprecated in newer sklearn, which defaults to 'SAMME')
)
ada.fit(X_train, y_train)
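The reweighting idea can be sketched with one round of AdaBoost-style updates on toy labels (a simplified sketch of the classic update, not sklearn's internals):

```python
import numpy as np

# Toy round: a weak learner got samples 1 and 3 wrong
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])

w = np.full(len(y_true), 1 / len(y_true))    # start with uniform weights
err = w[y_pred != y_true].sum()              # weighted error = 0.4
alpha = 0.5 * np.log((1 - err) / err)        # this learner's "say"

# Up-weight mistakes, down-weight correct answers, renormalize
w = w * np.exp(alpha * np.where(y_pred != y_true, 1, -1))
w = w / w.sum()
# misclassified samples now carry more weight than correct ones
```

The next weak learner trains on these weights, so it is pulled toward exactly the samples the ensemble keeps getting wrong.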
Stacking & Blending
ENSEMBLE

Stacking: Train Level-1 models. Their predictions become features for a Level-2 "meta-model". The meta-model learns how to best combine Level-1 predictions.

Blending: Simpler version โ€” Level-1 models predict on a holdout set, meta-model trains on those predictions.

You ask 5 specialist doctors (Level-1) for their diagnosis. Then a senior consultant (meta-model/Level-2) reviews all 5 opinions and makes the final call โ€” knowing each doctor's strengths and biases.
Python — StackingClassifier (sklearn)
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb

level1 = [
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('xgb', xgb.XGBClassifier()),
    ('svm', SVC(probability=True))
]
meta_model = LogisticRegression()

stacking = StackingClassifier(
    estimators=level1,
    final_estimator=meta_model,
    cv=5  # cross-validation to avoid overfitting
)
stacking.fit(X_train, y_train)
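What cv=5 does under the hood can be sketched with cross_val_predict: each level-1 model produces out-of-fold predictions, and those become the meta-model's training features (a simplified sketch of StackingClassifier's behaviour):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               DecisionTreeClassifier(max_depth=3, random_state=0)]

# Out-of-fold probabilities: every row is predicted by a model that
# never saw it during training, so nothing leaks into the meta-model
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])
meta_model = LogisticRegression().fit(meta_X, y)
```

This is why the cv parameter matters: if the meta-model trained on in-fold predictions, it would learn to trust overfit level-1 outputs.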
Confusion Matrix โ€” TP, TN, FP, FN
EVALUATION

What it is: A 2ร—2 table showing the 4 types of predictions. The foundation of ALL classification metrics.

             Predicted YES      Predicted NO
Actual YES   TP (True Pos)      FN (False Neg)
Actual NO    FP (False Pos)     TN (True Neg)
Pregnancy test: TP = test says pregnant, IS pregnant. TN = test says not pregnant, is NOT pregnant. FP = test says pregnant but is NOT (false alarm). FN = test says not pregnant but IS (missed!). In medical tests, FN is usually more dangerous than FP.
Python — Confusion Matrix (sklearn)
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
# [[TN, FP],
#  [FN, TP]]

# Full report in one line!
print(classification_report(y_test, y_pred))
Accuracy, Precision, Recall, F1-Score
EVALUATION
Accuracy  = (TP + TN) / Total  ← overall correct
Precision = TP / (TP + FP)     ← of all predicted YES, how many are actually YES?
Recall    = TP / (TP + FN)     ← of all actual YES, how many did we catch?
F1-Score  = 2 × (Precision × Recall) / (Precision + Recall)
Email spam filter: Precision = "of emails I flagged as spam, how many actually were?" (don't want to flag real emails). Recall = "of all actual spam emails, how many did I catch?" (don't want spam getting through). F1 = balance of both when you can't maximize one without hurting the other.
When to use which metric
  • Accuracy โ€” balanced classes, general purpose
  • Precision โ€” when FP is costly (spam filter, false police alerts)
  • Recall โ€” when FN is costly (cancer detection, fraud)
  • F1 โ€” imbalanced classes, need balance of P & R
ROC-AUC Curve
EVALUATION

ROC: Receiver Operating Characteristic curve โ€” plots True Positive Rate vs False Positive Rate at every threshold. AUC: Area Under the Curve. AUC=1.0 is perfect, AUC=0.5 is random (useless).

You have a sorting machine separating apples from oranges. Raise the sensitivity → it catches more apples but also mislabels more oranges as apples. ROC-AUC measures how well you can separate the two across ALL sensitivity levels — not just one threshold.
Python — ROC AUC (sklearn)
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probabilities, not binary predictions
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc:.3f}")  # 0.5=random, 1.0=perfect

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC={auc:.2f}')
plt.plot([0,1],[0,1],'--')  # random baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
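A useful way to internalize AUC: it equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. A brute-force check of that fact on toy scores (values are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 0, 1, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.7, 0.2, 0.9])

# Fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a point
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_manual = np.mean(pairs)

assert np.isclose(auc_manual, roc_auc_score(y_true, y_score))
```

This ranking view also explains why AUC ignores the threshold entirely: it only cares about the ordering of the scores.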
K-Fold & Stratified K-Fold Cross Validation
EVALUATION

What it is: Instead of one train/test split, use K different splits. Each slice gets to be the test set once. Average K scores = more reliable estimate of real performance.

Testing a new medicine on patients. Instead of using the same 20% as test group every time (which might be unusual patients), you rotate โ€” everyone gets to be in the test group once across 5 trials. Much fairer estimate of how the medicine works overall.

Stratified K-Fold: Each fold maintains the same class ratio as the original data. Essential for imbalanced datasets.

Python — Cross Validation (sklearn)
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Simple K-Fold (5 folds)
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} ยฑ {scores.std():.3f}")

# Stratified (preserves class ratio per fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
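To see the "same class ratio per fold" claim concretely, you can inspect the folds directly on an imbalanced toy label vector (a small sketch; the features are dummies since only y matters for stratification):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 10 + [0] * 90)   # 10% positive, imbalanced
X = np.zeros((100, 1))              # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for _, test_idx in skf.split(X, y):
    # every test fold keeps exactly 10% positives (2 of 20),
    # matching the full dataset's class ratio
    assert y[test_idx].mean() == 0.1
```

With plain KFold and shuffle=False, the first fold here would contain all 10 positives and the rest none, which is exactly the failure mode stratification prevents.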
GridSearchCV
TUNING

What it is: Try every single combination of hyperparameters you specify. Guaranteed to find the best within the grid, but SLOW for large grids.

Buying a new phone: you try every combination of brand ร— color ร— storage โ†’ brand(3) ร— color(4) ร— storage(3) = 36 combinations. GridSearch is that exhaustive โ€” it tries them all.
Python — GridSearchCV (sklearn)
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}
# 3 ร— 4 ร— 3 = 36 combinations ร— 5 CV folds = 180 fits

gs = GridSearchCV(
    RandomForestClassifier(), param_grid,
    cv=5, scoring='f1', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print(gs.best_params_)   # {'max_depth': 7, 'n_estimators': 200, ...}
print(gs.best_score_)    # 0.923
best_model = gs.best_estimator_
RandomizedSearchCV
TUNING

What it is: Instead of trying every combination, randomly sample N combinations. Faster than GridSearch and often just as good โ€” or better โ€” because it explores more of the space.

Same phone buying but instead of testing all 36 combinations, you randomly try 15. You'll likely find something great much faster, and often the extra 21 tests wouldn't change your pick anyway.
Python — RandomizedSearchCV (sklearn)
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Can use distributions, not just lists!
param_dist = {
    'n_estimators': randint(100, 500),       # random int in [100, 500)
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.2),     # uniform(loc, scale) → [0.01, 0.21]
    'subsample': uniform(0.6, 0.4)           # → [0.6, 1.0]
}

rs = RandomizedSearchCV(
    xgb.XGBClassifier(), param_dist,
    n_iter=50,   # sample 50 random combinations from the distributions
    cv=5, scoring='roc_auc', n_jobs=-1
)
rs.fit(X_train, y_train)
print(rs.best_params_)
Model Selection & Evaluation Pipeline
PIPELINE

The right workflow: Never tune on your test set. Use train โ†’ validation โ†’ test. Pipeline prevents data leakage by ensuring preprocessing is learned from train data only.

CORRECT:
  Train set → fit scaler, fit model
  Val set   → tune hyperparameters (never touch the test set here!)
  Test set  → final evaluation only (use ONCE at the very end)
Python — Full Pipeline (sklearn)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# โœ… Scaler is INSIDE the pipeline โ†’ no leakage!
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100))
])

# GridSearch the whole pipeline
param_grid = {
    'model__n_estimators': [100, 200],  # prefix = pipeline step name
    'model__max_depth': [5, 10]
}
gs = GridSearchCV(pipeline, param_grid, cv=5)
gs.fit(X_train, y_train)

# Final test set evaluation โ€” only done ONCE
final_score = gs.score(X_test, y_test)
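For contrast, here is the leaky pattern the pipeline prevents: fitting the scaler on ALL the data lets test-set statistics leak into training (a minimal sketch on random data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).normal(size=(200, 3))
X_train, X_test = train_test_split(X, random_state=0)

# ❌ Leaky: the scaler's mean/std are computed from test rows too
leaky = StandardScaler().fit(X)
X_train_leaky = leaky.transform(X_train)

# ✅ Correct (what Pipeline does inside each CV fold):
# fit on train only, then transform both splits with those stats
scaler = StandardScaler().fit(X_train)
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)
```

The two scaled train sets differ because the leaky scaler's mean and std were contaminated by test rows; putting the scaler inside the Pipeline makes the correct version automatic, including during GridSearchCV's internal folds.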