Machine Learning from Scratch
You're an Angular dev with 3 years of experience. You can read Python. This guide is built for YOU: it connects ML to things you already know, explains the math simply, and shows real code.
⚠️ What the syllabus images don't fully show (you'd need to add):
The visible slides were cut off before showing: F1-Score, ROC-AUC, Log Loss, R² / Adjusted R², and the MAE/MSE/RMSE metrics. These are covered in the Evaluation section of this guide.
Math Foundations
You don't need a math degree. You need to understand 4 things: Linear Algebra, Calculus, Probability, and Statistics. Here's each one in plain English.
What it is: Math for working with tables of numbers. A vector is one column of data. A matrix is a whole table. In ML, your entire dataset is a matrix.
Key concepts: Dot product (multiplying two vectors = one number), Matrix multiplication (transforming data), Eigenvalues (used in PCA – finding the "most important directions" in data).
import numpy as np

# A vector – one row of your dataset (one sample)
sample = np.array([28, 75000, 3])  # age, salary, years_exp

# A matrix – your whole dataset
dataset = np.array([
    [28, 75000, 3],
    [35, 90000, 8],
    [22, 45000, 1]
])

# Shape = (rows=samples, cols=features)
print(dataset.shape)  # (3, 3)

# Dot product
weights = np.array([0.5, 0.3, 0.2])
result = np.dot(sample, weights)  # 14 + 22500 + 0.6
What it is: A derivative tells you "if I change X a tiny bit, how much does Y change?" In ML, this is how the model learns – it adjusts its weights in the direction that reduces error.
Gradient Descent: The core learning algorithm. Move weights in the direction that reduces loss.
import numpy as np

# Imagine y = 3x + noise (we want to find w=3)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y_true = 3 * x + rng.normal(0, 0.1, 100)

w = 0.0    # start guess
lr = 0.01  # learning rate – step size

for epoch in range(100):
    prediction = w * x
    error = prediction - y_true
    gradient = 2 * np.mean(error * x)  # dLoss/dw
    w = w - lr * gradient              # step downhill

# After 100 steps, w ≈ 3.0
What it is: P(A|B) = "the probability of A, GIVEN that B is true." This is how Naive Bayes classifiers work – they use conditional probability to classify.
Why it matters: Used directly in Naive Bayes. Underlies all probabilistic ML. Understanding priors vs posteriors helps you understand how models update beliefs.
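A tiny worked example of Bayes' theorem makes priors and posteriors concrete. All the probabilities below are invented for illustration:

```python
# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
# Invented numbers: 20% of mail is spam; the word "free" appears
# in 60% of spam and in 5% of legit mail.
p_spam = 0.20
p_word_given_spam = 0.60
p_word_given_ham = 0.05

# Total probability of seeing the word at all
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: the belief AFTER seeing the evidence
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # prior 0.20 -> posterior 0.75
```

Seeing one word moved the belief from 20% to 75% – this update step is exactly what Naive Bayes repeats for every feature.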
- Normal/Gaussian – bell curve, most natural data
- Bernoulli – yes/no outcomes (click or no click)
- Multinomial – multiple categories (Naive Bayes for text)
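A quick way to build intuition for these three distributions is to sample from them with NumPy. The parameters and sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Normal: bell curve around a mean (e.g. heights, measurement noise)
heights = rng.normal(loc=170, scale=10, size=10_000)

# Bernoulli: a single yes/no trial (a Binomial with n=1)
clicks = rng.binomial(n=1, p=0.3, size=10_000)  # ~30% ones

# Multinomial: 100 draws spread over 3 categories (e.g. word counts)
word_counts = rng.multinomial(n=100, pvals=[0.5, 0.3, 0.2])

print(round(heights.mean(), 1))  # close to 170
print(round(clicks.mean(), 2))   # close to 0.30
print(word_counts.sum())         # always 100
```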
Descriptive: Summarizing data you have. Mean, median, mode, variance, standard deviation. These are also used directly in preprocessing (Z-Score needs mean + std).
Inferential: Drawing conclusions about a whole population from a sample. Used in hypothesis testing, A/B tests, and understanding model performance.
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# Descriptive stats in one line
df.describe()  # count, mean, std, min, max, quartiles

# Individual stats
mean = df['salary'].mean()
std = df['salary'].std()
med = df['salary'].median()
var = df['salary'].var()

# Correlation matrix – which features relate?
df.corr(numeric_only=True)  # values close to 1 or -1 = strong relationship
Data Preprocessing
Garbage in = garbage out. 80% of real ML work is here. This is where your dev background actually gives you a big advantage – it's mostly data wrangling.
The problem: Real data always has missing values. You can't just delete all affected rows – you'd lose too much data.
- Mean imputation – fill with the average (numeric, no outliers)
- Median imputation – fill with the middle value (better when outliers exist)
- Mode imputation – fill with the most frequent value (categorical data)
- KNN Imputer – fill using the k nearest neighbours' values (smartest)
from sklearn.impute import SimpleImputer, KNNImputer
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'salary': [50000, 60000, np.nan, 45000, 70000]
})

# Mean imputation
imputer = SimpleImputer(strategy='mean')
df_filled = imputer.fit_transform(df)

# KNN Imputer (uses nearby rows to guess the missing value)
knn_imp = KNNImputer(n_neighbors=2)
df_knn = knn_imp.fit_transform(df)
What it is: Outliers are data points so far from the rest that they'd skew your model. Like one employee earning ₹10 Crore in a list of ₹5-15 Lakh salaries.
import numpy as np
from scipy import stats

# Z-Score method
z_scores = np.abs(stats.zscore(df['salary']))
df_clean = df[z_scores < 3]  # keep only non-outliers

# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_iqr = df[(df['salary'] >= Q1 - 1.5*IQR) & (df['salary'] <= Q3 + 1.5*IQR)]
Why it matters: If age is 0-100 and salary is 0-10,000,000, the model thinks salary is a million times more important. Scaling fixes this.
When to use which: Min-Max when you know the bounds and there are no big outliers. StandardScaler (Z-score standardization) when outliers are present or the model expects roughly zero-mean, unit-variance features (note: it recenters and rescales, it does not make data Gaussian).
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[25, 50000], [30, 90000], [22, 30000]]

# Min-Max: scales to 0-1
mm = MinMaxScaler()
X_mm = mm.fit_transform(X)
# age: [0.375, 1.0, 0.0], salary: [0.333, 1.0, 0.0]

# Standard Scaler: mean=0, std=1
ss = StandardScaler()
X_ss = ss.fit_transform(X)
What it is: Many ML models assume data is normally distributed (bell curve). But real data like salaries, prices, or website visits is skewed. Transformations fix the shape.
import numpy as np
from scipy.stats import boxcox

salaries = [30000, 45000, 60000, 500000, 1200000]

# Log transform – great for right-skewed data
log_sal = np.log1p(salaries)  # log1p = log(x+1), handles 0

# Box-Cox – finds the BEST power transformation automatically
bc_sal, lambda_ = boxcox(salaries)  # lambda_ is the optimal power
The problem: You have 9,900 legit transactions and 100 fraudulent ones. If your model just predicts "not fraud" for everything, it's 99% accurate but totally useless.
- SMOTE – creates synthetic minority samples (oversampling)
- Random Undersampling – removes majority class samples
- class_weight='balanced' – tells the model to care more about the minority class
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

# Now the minority class has synthetic samples added
print(y_res.value_counts())  # should be balanced now
What it is: Your dataset might have 500 features (columns). Many are redundant. Reduce to 2-10 key dimensions without losing important information.
PCA (Principal Component Analysis) – unsupervised, finds the directions of maximum variance. LDA (Linear Discriminant Analysis) – supervised, finds the directions that best separate classes. t-SNE – great for visualization only (2D/3D plots of high-dimensional data).
from sklearn.decomposition import PCA

# Reduce 500 features → 10 principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

# How much variance does each component explain?
print(pca.explained_variance_ratio_)
# [0.32, 0.18, 0.12, ...] – first 3 explain 62% of the variance
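PCA ignores labels; LDA uses them. A minimal LDA sketch on sklearn's built-in iris data (with 3 classes, LDA can keep at most n_classes - 1 = 2 components):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# LDA keeps at most (n_classes - 1) components: 3 classes -> 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # note: needs y, unlike PCA

print(X_lda.shape)  # (150, 2)
```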
Regression
Predicting a continuous number – house prices, salaries, temperatures. Output is a number, not a category.
Simple: One input feature → one output. Draw the best straight line through the data points.
Multiple: Many input features → one output. The line becomes a plane (or hyperplane).
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²: {r2_score(y_test, y_pred):.3f}")  # 1.0 = perfect
print(f"RMSE: {mean_squared_error(y_test, y_pred)**0.5:.2f}")
- MAE – Mean Absolute Error (average distance from the truth, easy to interpret)
- MSE – Mean Squared Error (punishes big errors more)
- RMSE – Root MSE (same unit as the target, most common)
- R² – 0 to 1, how much variance the model explains
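To see all four metrics side by side, here's a hand-checkable example (the numbers are invented so the arithmetic stays easy):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100, 200, 300, 400])
y_pred = np.array([110, 190, 310, 380])

mae = mean_absolute_error(y_true, y_pred)  # (10+10+10+20)/4 = 12.5
mse = mean_squared_error(y_true, y_pred)   # (100+100+100+400)/4 = 175
rmse = mse ** 0.5                          # ~13.23, same unit as y
r2 = r2_score(y_true, y_pred)              # close to 1 = good fit

print(mae, mse, round(rmse, 2), round(r2, 3))  # 12.5 175.0 13.23 0.986
```

Notice how the one big error (20) pushes RMSE above MAE – that's MSE's squaring at work.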
What it is: Linear regression can only fit straight lines. Polynomial regression fits curves by adding x², x³ etc. as features.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Degree 2 = add x² features, degree 3 = add x², x³
poly_model = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression())
])
poly_model.fit(X_train, y_train)

# ⚠️ High degree = overfitting risk!
# degree=10 fits training data perfectly but fails on new data
The problem it solves: Overfitting – the model learns the training data too well (memorizes noise) and fails on new data.
L1 (Lasso): Can reduce some weights to exactly 0 – effectively removing useless features. Good for feature selection.
L2 (Ridge): Shrinks all weights toward 0 but never exactly 0. Good when all features matter a little.
ElasticNet: Mix of both L1 + L2.
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# alpha = regularization strength (higher = more penalty)
lasso = Lasso(alpha=0.1)   # some weights become exactly 0
ridge = Ridge(alpha=1.0)   # all weights shrink but stay non-zero
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)  # 50/50 mix

lasso.fit(X_train, y_train)
print(lasso.coef_)  # see which features got zeroed out
Decision Tree Regression: Instead of fitting a line, it splits data into boxes and predicts the average value in each box.
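The "boxes and averages" idea is easiest to see on a tiny invented dataset: with max_depth=1 the tree makes exactly one split and predicts the mean of each side.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: y jumps at x=5, so one split is enough
X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array([10, 12, 11, 50, 52, 51])

# max_depth=1: a single split -> two "boxes", each predicting its mean
tree = DecisionTreeRegressor(max_depth=1)
tree.fit(X, y)

print(tree.predict([[2.5]]))  # mean of the left box: (10+12+11)/3 = 11
print(tree.predict([[7.5]]))  # mean of the right box: (50+52+51)/3 = 51
```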
Random Forest Regression: 100+ decision trees, each trained on random data subsets. Final prediction = average of all trees. Much more accurate and robust.
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

rf = RandomForestRegressor(
    n_estimators=100,  # 100 trees
    max_depth=10,      # max tree depth
    random_state=42
)
rf.fit(X_train, y_train)

# Which features matter most?
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).plot.bar()
What it is: Find a "tube" (epsilon-tube) around the prediction line. Only data points OUTSIDE the tube contribute to the error. Good for non-linear data with kernel trick.
- Small to medium datasets (gets slow on large data)
- Non-linear relationships (use RBF kernel)
- When you want robustness to outliers
from sklearn.svm import SVR

# IMPORTANT: SVR needs scaled features!
svr = SVR(kernel='rbf', C=100, epsilon=0.1)
svr.fit(X_train_scaled, y_train)
y_pred = svr.predict(X_test_scaled)
Classification
Predicting a category – spam vs not spam, disease vs healthy, cat vs dog. Output is a class label, not a number.
⚠️ Confusing name: Despite "Regression" in the name, this is a CLASSIFIER. It outputs a probability (0-1) and you set a threshold (usually 0.5) to classify.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Probabilities (not just 0/1)
probs = model.predict_proba(X_test)  # [[0.27, 0.73], ...]
preds = model.predict(X_test)        # [1, 0, 1, 1, ...]

# Multiclass is handled automatically (multinomial) in recent
# sklearn; older versions needed multi_class='multinomial'
What it is: Uses Bayes' Theorem assuming all features are independent. "Naive" because real features are rarely truly independent, but it still works surprisingly well.
Gaussian NB: continuous features (age, salary). Multinomial NB: count data (word frequencies). Bernoulli NB: binary features (word present/absent).
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Gaussian: for continuous features like age, salary
gnb = GaussianNB()

# Multinomial: for text classification (word counts)
mnb = MultinomialNB(alpha=1.0)  # alpha = smoothing

# Bernoulli: for binary features (word present: yes/no)
bnb = BernoulliNB()

gnb.fit(X_train, y_train)
accuracy = gnb.score(X_test, y_test)
- Text classification (spam, sentiment)
- Real-time predictions (very fast)
- Small datasets where it often beats complex models
What it is: For a new data point, find the K most similar points in training data. Majority class among those K neighbors = prediction.
Choosing K: Low K (1-3) = flexible, may overfit. High K = smooth, may underfit. Usually try K=5, then tune with cross-validation. K should be odd for binary classification to avoid ties.
from sklearn.neighbors import KNeighborsClassifier

# MUST scale features! KNN is distance-based
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Find the optimal K
scores = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    scores.append(knn.score(X_test_scaled, y_test))
# Pick the k with the highest score
What it is: Find the "best dividing line" (hyperplane) between classes, maximizing the margin (gap) between the line and the nearest points of each class. Those nearest points are the "support vectors".
Kernel SVM: When data isn't linearly separable, the kernel trick maps data to a higher dimension where it IS separable – without actually computing that transformation (it's a mathematical shortcut).
- Linear kernel – for linearly separable data
- RBF (Radial Basis Function) – most common, works for most non-linear data
- Polynomial kernel – for polynomial boundaries
from sklearn.svm import SVC

# C = margin hardness (high C = fewer margin violations)
# gamma = RBF influence radius (high = complex boundary)
svm = SVC(kernel='rbf', C=10, gamma='scale', probability=True)
svm.fit(X_train_scaled, y_train)

# probability=True enables predict_proba()
probs = svm.predict_proba(X_test_scaled)
What it is: A flowchart of yes/no questions. At each node, it splits data to maximize "purity" (all one class together). Leaves are final class predictions.
from sklearn.tree import DecisionTreeClassifier, export_text

dt = DecisionTreeClassifier(
    max_depth=5,         # prevents overfitting
    min_samples_leaf=5,  # min 5 samples per leaf
    criterion='gini'
)
dt.fit(X_train, y_train)

# Print the tree rules – interpretable!
print(export_text(dt, feature_names=feature_names))
What it is: An ensemble of decision trees. Each tree is trained on a random subset of data and features. Final prediction = majority vote of all trees.
Why it's better than one tree: Reduces overfitting (high variance → low variance), handles missing values, gives feature importances, rarely needs tuning.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,     # more trees = more stable
    max_features='sqrt',  # each tree sees √(n_features)
    oob_score=True,       # out-of-bag score (free validation)
    n_jobs=-1             # use all CPU cores
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.3f}")
Clustering (Unsupervised)
No labels, no supervision. Find natural groups in data. Nobody told the model what groups to look for – it discovers them.
What it is: Group data into K clusters. Each point belongs to the cluster with the nearest center (centroid). Centers are updated iteratively until stable.
K-Means++: Better initialization – centroids start spread out, not random. Converges faster and to better solutions.
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3,
    init='k-means++',  # smarter init (default)
    n_init=10,         # run 10 times, pick best
    random_state=42
)
labels = km.fit_predict(X_scaled)
centers = km.cluster_centers_
inertia = km.inertia_  # sum of squared distances to centers
The problem: K-Means needs you to choose K. How do you know the right number of clusters?
Silhouette Score: For each point, measures "how well does it fit its own cluster vs the nearest other cluster". Range: -1 (bad) to +1 (perfect). Higher = better clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias = []
sil_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot inertia – look for the "elbow"
plt.plot(range(2, 11), inertias, marker='o')
plt.title('Elbow Method')

# Best K = highest silhouette score
best_k = range(2, 11)[sil_scores.index(max(sil_scores))]
What it is: Builds a tree (dendrogram) showing how data points merge into clusters at different scales. You don't need to specify K upfront.
Agglomerative (bottom-up): Start with every point as its own cluster, merge closest pairs repeatedly. Divisive (top-down): Start with one big cluster, split repeatedly.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Draw the dendrogram to see the natural clusters
Z = linkage(X_scaled, method='ward')
dendrogram(Z, truncate_mode='level', p=5)
plt.title('Dendrogram')
plt.show()

# Cut at K=3 clusters
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = hc.fit_predict(X_scaled)
What it is: Clusters are "dense regions" of points. Works on any cluster shape (not just circles). Automatically identifies outliers as noise (label = -1).
Parameters: eps = max distance between neighbors, min_samples = min points to form a cluster core.
- DBSCAN: no need to specify K, finds arbitrary shapes, handles noise
- K-Means: faster, works better on spherical clusters, needs K upfront
- Use DBSCAN when clusters are irregular or you have many outliers
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)  # -1 = outlier/noise
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
What it is: Assumes data comes from a mix of K Gaussian (normal) distributions. "Soft clustering" – each point gets a probability of belonging to each cluster, not a hard assignment.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full')
gmm.fit(X_scaled)

# Hard labels
labels = gmm.predict(X_scaled)

# Soft probabilities – unique to GMM!
probs = gmm.predict_proba(X_scaled)
# Each row sums to 1.0, showing cluster membership probability
Anomaly Detection & Association Rules
Finding the unusual. Finding what happens together. Two different but powerful unsupervised techniques.
What it is: Find data points that don't fit the normal pattern. Used for fraud detection, network intrusion, manufacturing defects, medical anomalies.
- Isolation Forest – isolates anomalies with random splits (fast, scalable)
- One-Class SVM – learns the "normal" boundary, flags anything outside it
- Statistical: flag Z-score > 3 as an anomaly
- Autoencoders (deep learning) – high reconstruction error = anomaly
from sklearn.ensemble import IsolationForest

iso = IsolationForest(
    contamination=0.05,  # expect ~5% anomalies
    random_state=42
)
iso.fit(X_train)

predictions = iso.predict(X_test)
# Returns: 1 = normal, -1 = anomaly
anomalies = X_test[predictions == -1]
What it is: Find which items appear together frequently. Classic use case: Market Basket Analysis – "customers who buy diapers also buy beer on Fridays".
from mlxtend.frequent_patterns import apriori, association_rules

# data = one-hot encoded: rows=transactions, cols=items
# e.g. df[['bread','butter','milk','eggs']] = True/False
frequent_items = apriori(df, min_support=0.05, use_colnames=True)

rules = association_rules(frequent_items,
                          metric='confidence', min_threshold=0.5)

# Top rules by lift
rules.sort_values('lift', ascending=False).head(10)
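The three rule metrics (support, confidence, lift) are easy to compute by hand. A sketch over five invented transactions for the rule {bread} → {butter}:

```python
# 5 invented transactions; does {bread} -> {butter} hold up?
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter'},
    {'bread', 'eggs'},
    {'milk', 'eggs'},
    {'bread', 'butter', 'eggs'},
]
n = len(transactions)

support_bread = sum('bread' in t for t in transactions) / n             # 4/5
support_both = sum({'bread', 'butter'} <= t for t in transactions) / n  # 3/5
support_butter = sum('butter' in t for t in transactions) / n           # 3/5

confidence = support_both / support_bread  # P(butter | bread)
lift = confidence / support_butter         # > 1 = positive association

print(round(confidence, 2), round(lift, 2))  # 0.75 1.25
```

Lift of 1.25 means buying bread makes butter 25% more likely than its baseline rate – exactly what the mlxtend `lift` column reports.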
Ensemble Methods & Boosting
Combining many weak models to build a powerful one. These are the techniques that win Kaggle competitions and dominate real-world ML benchmarks.
Bagging: Train multiple models on different random SAMPLES (with replacement) of training data. Combine with voting/averaging. Reduces variance (overfitting).
Pasting: Same idea but sampling WITHOUT replacement. Less diversity but sometimes better.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # 'base_estimator' in sklearn < 1.2
    n_estimators=100,
    max_samples=0.8,  # 80% of data per tree
    bootstrap=True,   # False = Pasting
    oob_score=True
)
bag.fit(X_train, y_train)
Random Subspaces: Each model is trained on all samples but only a random subset of features. Reduces feature correlation between trees.
Random Patches: Each model sees random subset of BOTH samples AND features. Maximum diversity.
# Random Subspaces: full samples, subset of features
rs = BaggingClassifier(bootstrap=False, max_features=0.5)

# Random Patches: subset of both samples AND features
rp = BaggingClassifier(bootstrap=True, max_features=0.5)
What it is: Combine fundamentally different models (SVM + Random Forest + Logistic Regression) and take majority vote (hard) or average probabilities (soft).
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('svm', SVC(probability=True))
    ],
    voting='soft'  # 'soft' averages probabilities (better)
)
voting.fit(X_train, y_train)
What it is: Trees are built SEQUENTIALLY. Each new tree corrects the errors of the previous. The model "boosts" by focusing on where it was wrong before.
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,  # smaller = more trees needed but better
    max_depth=3,         # shallow trees work best for boosting
    subsample=0.8        # stochastic gradient boosting
)
gbm.fit(X_train, y_train)
What it is: XGBoost = GBM + regularization + parallelization + hardware optimization. Dominated ML competitions for years. Still one of the best algorithms for tabular data.
- Built-in L1 + L2 regularization (prevents overfitting)
- Parallel tree building (fast even on large data)
- Handles missing values automatically
- Early stopping (stop when validation score stops improving)
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,     # random feature subset per tree
    reg_alpha=0.1,            # L1 regularization
    reg_lambda=1.0,           # L2 regularization
    early_stopping_rounds=50,
    eval_metric='logloss'
    # (use_label_encoder was removed in recent XGBoost versions)
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)], verbose=100)
LightGBM: Microsoft's GBM. Grows trees leaf-wise (not level-wise) โ faster, more accurate for large datasets. Great when you have millions of rows.
CatBoost: Yandex's GBM. Handles categorical features natively without encoding. Reduces overfitting on small datasets. Very easy to use.
import lightgbm as lgb
from catboost import CatBoostClassifier

# LightGBM
lgbm = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31  # key param: controls complexity
)

# CatBoost – no encoding needed for categorical columns!
cat_features = ['city', 'category', 'brand']  # column names
cb = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    cat_features=cat_features,  # just tell it which are categorical
    verbose=100
)
What it is: The original boosting algorithm. Misclassified samples get higher weights in the next iteration. Each new model focuses more on hard cases.
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    algorithm='SAMME.R'  # uses probabilities (deprecated in sklearn >= 1.4)
)
ada.fit(X_train, y_train)
Stacking: Train Level-1 models. Their predictions become features for a Level-2 "meta-model". The meta-model learns how to best combine Level-1 predictions.
Blending: Simpler version – Level-1 models predict on a holdout set, and the meta-model trains on those predictions.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb

level1 = [
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('xgb', xgb.XGBClassifier()),
    ('svm', SVC(probability=True))
]
meta_model = LogisticRegression()

stacking = StackingClassifier(
    estimators=level1,
    final_estimator=meta_model,
    cv=5  # cross-validation to avoid overfitting
)
stacking.fit(X_train, y_train)
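Blending has no dedicated sklearn class, but it's short to hand-roll. A minimal sketch on synthetic data (the model choices and split sizes are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)

# Split train -> (base, holdout); the holdout feeds the meta-model
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Level-1 models train on the base split only
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Their holdout predictions become the meta-model's features
meta_X = np.column_stack([
    rf.predict_proba(X_hold)[:, 1],
    lr.predict_proba(X_hold)[:, 1],
])
meta_model = LogisticRegression().fit(meta_X, y_hold)
print(meta_model.score(meta_X, y_hold))  # meta-model fit on the holdout
```

The key difference from stacking: each Level-1 model predicts on data it never trained on, but without the 5-fold machinery – one holdout split does the job.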
Model Evaluation
How do you know if your model is actually good? This is what separates real ML engineers from people who just run code. These metrics are heavily asked in interviews.
What it is: A 2×2 table showing the 4 types of predictions. The foundation of ALL classification metrics.
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
# [[TN, FP],
#  [FN, TP]]

# Full report in one line!
print(classification_report(y_test, y_pred))
- Accuracy – balanced classes, general purpose
- Precision – when FP is costly (spam filter, false police alerts)
- Recall – when FN is costly (cancer detection, fraud)
- F1 – imbalanced classes, need a balance of P & R
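All four metrics fall straight out of the confusion matrix counts. A hand computation with invented counts (50 TN, 10 FP, 5 FN, 35 TP):

```python
# Invented counts for illustration
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 85/100 = 0.85
precision = tp / (tp + fp)                          # 35/45 ~ 0.778
recall = tp / (tp + fn)                             # 35/40 = 0.875
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.824

print(accuracy, round(precision, 3), recall, round(f1, 3))
```

Note that F1 (0.824) sits between precision and recall but closer to the worse of the two – the harmonic mean punishes imbalance between them.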
ROC: Receiver Operating Characteristic curve – plots True Positive Rate vs False Positive Rate at every threshold. AUC: Area Under the Curve. AUC=1.0 is perfect, AUC=0.5 is random (useless).
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probabilities, not binary predictions
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc:.3f}")  # 0.5=random, 1.0=perfect

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC={auc:.2f}')
plt.plot([0, 1], [0, 1], '--')  # random baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
What it is: Instead of one train/test split, use K different splits. Each slice gets to be the test set once. Average K scores = more reliable estimate of real performance.
Stratified K-Fold: Each fold maintains the same class ratio as the original data. Essential for imbalanced datasets.
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Simple K-Fold (5 folds)
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Stratified (preserves the class ratio per fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
Hyperparameter Tuning
Model parameters are learned from data. Hyperparameters are settings YOU choose before training. Tuning them is the difference between a decent model and a great one.
What it is: Try every single combination of hyperparameters you specify. Guaranteed to find the best within the grid, but SLOW for large grids.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}
# 3 × 4 × 3 = 36 combinations × 5 CV folds = 180 fits

gs = GridSearchCV(
    RandomForestClassifier(), param_grid,
    cv=5, scoring='f1', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print(gs.best_params_)  # {'max_depth': 7, 'n_estimators': 200, ...}
print(gs.best_score_)   # 0.923
best_model = gs.best_estimator_
What it is: Instead of trying every combination, randomly sample N combinations. Faster than GridSearch and often just as good, or better, because it explores more of the space.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Can use distributions, not just lists!
param_dist = {
    'n_estimators': randint(100, 500),    # random int 100-500
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.2),  # random float
    'subsample': uniform(0.6, 0.4)
}

rs = RandomizedSearchCV(
    xgb.XGBClassifier(), param_dist,
    n_iter=50,  # try 50 random combinations (vs 36 exhaustive)
    cv=5, scoring='roc_auc', n_jobs=-1
)
rs.fit(X_train, y_train)
print(rs.best_params_)
The right workflow: Never tune on your test set. Use train → validation → test. A Pipeline prevents data leakage by ensuring preprocessing is learned from training data only.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ✅ Scaler is INSIDE the pipeline – no leakage!
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=100))
])

# GridSearch the whole pipeline
param_grid = {
    'model__n_estimators': [100, 200],  # prefix = pipeline step name
    'model__max_depth': [5, 10]
}
gs = GridSearchCV(pipeline, param_grid, cv=5)
gs.fit(X_train, y_train)

# Final test set evaluation – only done ONCE
final_score = gs.score(X_test, y_test)