Your ML Foundation — Recap + Deep Dive

Python · NumPy · Pandas
Everything You Need for ML

A comprehensive recap of Python fundamentals, then deep dives into NumPy (matrices, broadcasting, linear algebra) and Pandas (DataFrames, cleaning, EDA). Every concept you'll use daily in ML — with live examples and interactive exercises.

Python 3.11+
NumPy 1.26
Pandas 2.x
13 Modules
Interactive Examples
ML Context Always
Module 01 · Python

Data Types & Variables

Python is dynamically typed: you don't declare types, Python infers them. But in ML you ALWAYS care about types, because the wrong type (str vs float) silently breaks your entire pipeline.

Python Type Hierarchy (interactive)
01_data_types.py
python
# ─── The 5 core types you use in ML ─────────────────────
x = 42          # int   — index, count, label (0 or 1)
y = 3.14        # float — price, probability, weight
b = True        # bool  — is_fraud, has_feature (True=1, False=0)
s = "income"    # str   — column name, category label
n = None        # NoneType — missing value (like NaN before pandas)

# ─── Type checking and conversion ───────────────────────
type(x)             # → <class 'int'>
isinstance(x, int)  # → True  (use this, not type() == int)

float("3.14")    # → 3.14  (string to float — common when reading CSVs)
int(3.9)         # → 3     (truncates toward zero, does NOT round)
str(42)          # → "42"
bool(0)          # → False  (0, 0.0, "", [], None are all falsy)

# ─── ML gotcha: int division vs float division ──────────
7 / 2     # → 3.5   (true division, always float)
7 // 2    # → 3     (floor division, int result)
7 % 2     # → 1     (modulo — remainder)
2 ** 10   # → 1024  (power — used for learning rate schedules)

# ─── F-strings — use these everywhere ───────────────────
acc = 0.9342
epochs = 10
print(f"Epoch {epochs}: accuracy = {acc:.2f}")
# → Epoch 10: accuracy = 0.93
print(f"Loss: {0.00234:.4e}")  # → Loss: 2.3400e-03
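A minimal sketch of the type gotcha mentioned above, on made-up data: a stray string from a CSV only fails when you finally do math on it, so convert types up front.

```python
values = ["3.5", 2.0, 4.5]            # "3.5" arrived as a string from a CSV
cleaned = [float(v) for v in values]  # normalize everything to float first
mean = sum(cleaned) / len(cleaned)

assert cleaned == [3.5, 2.0, 4.5]
assert int(3.9) == 3                  # truncation, not rounding
assert 7 // 2 == 3 and 7 % 2 == 1
```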
Quick Check: What does int(3.9) return?
Module 02 · Python

Control Flow & Loops

Conditionals and loops are the skeleton of every ML preprocessing script, training loop, and evaluation pipeline. Master them completely; you'll write them thousands of times.

02_control_flow.py
python
# ─── if / elif / else ───────────────────────────────────
score = 0.87

if score >= 0.9:
    print("Excellent model!")
elif score >= 0.7:
    print("Good — consider tuning")
else:
    print("Needs work")

# ─── for loop — the workhorse ───────────────────────────
losses = [0.9, 0.7, 0.5, 0.3, 0.15]

for i, loss in enumerate(losses):       # enumerate gives (index, value)
    print(f"Epoch {i+1}: loss={loss:.2f}")

# ─── range() — for training epochs ──────────────────────
for epoch in range(1, 101):    # 1 to 100 inclusive
    pass                       # pass = placeholder

# ─── while loop — for convergence checks ────────────────
loss = 1.0
threshold = 0.01
steps = 0

while loss > threshold:
    loss *= 0.85   # simulate loss decreasing
    steps += 1
    if steps > 1000:
        print("Did not converge!")
        break      # break exits the loop
print(f"Converged in {steps} steps")

# ─── zip() — iterate two lists together ─────────────────
features = ["age", "income", "score"]
weights  = [0.3, 0.5, 0.2]

for feat, w in zip(features, weights):
    print(f"{feat}: weight = {w}")
Simulated Training Loop — Loss Over Epochs (live animation)

This is literally what a Python for loop over epochs looks like: each step calls backward(), optimizer.step(), and appends the loss to a list, exactly like the code above.
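The loop being animated can be sketched in plain Python; here `loss *= 0.8` is a hypothetical stand-in for a real backward/optimizer step.

```python
losses = []      # loss history, exactly like the list in the code above
loss = 1.0
for epoch in range(1, 11):
    loss *= 0.8  # stand-in for backward() + optimizer.step()
    losses.append(loss)
    print(f"Epoch {epoch}: loss={loss:.3f}")

assert len(losses) == 10
assert losses[-1] < losses[0]   # the loss actually decreased
```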

Module 03 · Python

Functions & Scope

In ML, functions are everywhere: preprocessing steps, loss functions, metric calculators, data loaders. Understanding scope, default arguments, and *args/**kwargs is essential.

03_functions.py
python
# ─── Basic function ─────────────────────────────────────
def accuracy(y_true: list, y_pred: list) -> float:
    """Calculate classification accuracy.

    Args:
        y_true: actual labels
        y_pred: predicted labels
    Returns:
        accuracy score between 0 and 1
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# ─── Default arguments ──────────────────────────────────
def normalize(x: float, method: str = "minmax", clip: bool = False) -> float:
    # Parameters with defaults must come AFTER those without
    pass

normalize(5.0)                   # uses defaults
normalize(5.0, method="zscore")  # keyword arg
normalize(5.0, "zscore", True)   # positional

# ─── *args and **kwargs ─────────────────────────────────
def log_metrics(*metrics, **extra_info):
    # *metrics: any number of positional args → tuple
    # **extra_info: any keyword args → dict
    for m in metrics:
        print(f"metric: {m}")
    for k, v in extra_info.items():
        print(f"{k} = {v}")

log_metrics(0.92, 0.87, epoch=10, lr=0.001)

# ─── Returning multiple values (returns a tuple) ────────
def train_test_split_manual(data, ratio=0.8):
    split = int(len(data) * ratio)
    return data[:split], data[split:]  # returns a tuple

train, test = train_test_split_manual([1, 2, 3, 4, 5])
# train=[1,2,3,4]  test=[5]
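A quick sanity check of the two functions above, redefined here so the snippet runs standalone.

```python
def accuracy(y_true, y_pred):
    # fraction of positions where prediction matches the label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def train_test_split_manual(data, ratio=0.8):
    split = int(len(data) * ratio)
    return data[:split], data[split:]

assert accuracy([1, 0, 1, 1], [1, 0, 0, 1]) == 0.75   # 3 of 4 correct

train, test = train_test_split_manual([1, 2, 3, 4, 5])
assert train == [1, 2, 3, 4] and test == [5]
```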
Module 04 · Python

Lists, Dicts, Tuples & Sets

Lists store sequences of features. Dicts store hyperparameters and configs. Tuples store immutable shapes. Sets check membership. You use all four daily in ML.

04_collections.py
python
# ─── LIST — ordered, mutable, allows duplicates ─────────
features = ["age", "income", "score", "age"]
features.append("city")      # add to end
features.remove("score")     # remove first occurrence
features.sort()              # sort in place
features[0]                  # → "age"  (first item)
features[-1]                 # → last item
features[1:3]                # → items at index 1 and 2 (slicing)
len(features)                # → count

# ─── DICT — key-value pairs, fast lookup ────────────────
# (dicts preserve insertion order since Python 3.7)
hyperparams = {
    "lr": 0.001,
    "epochs": 100,
    "batch_size": 32,
    "optimizer": "adam"
}
hyperparams["lr"]                # → 0.001
hyperparams.get("dropout", 0.5)  # → 0.5 (default if missing)
"epochs" in hyperparams          # → True (key lookup)
hyperparams.keys()               # → dict_keys(["lr", "epochs", ...])
hyperparams.values()             # → dict_values([0.001, 100, ...])
hyperparams.items()              # → for k, v in ...  ← use this!

# ─── TUPLE — immutable, used for shapes and coordinates ─
shape = (64, 3, 224, 224)   # batch, channels, H, W (image tensor shape)
batch_size, channels, h, w = shape  # unpacking

# ─── SET — unique items, fast membership check ──────────
seen_labels = {"cat", "dog", "bird"}
"cat" in seen_labels         # → True  (O(1) vs O(n) for list)
seen_labels.add("fish")      # add element
set(features)                # remove duplicates from a list
List Slicing — Critical for ML
a[2:5]    → index 2, 3, 4
a[:3]     → first 3 elements
a[-3:]    → last 3 elements
a[::2]    → every other element
a[::-1]   → reversed list
a[1:8:2]  → start:stop:step

Dict Patterns in ML
d.get(k, default)    → safe lookup
k in d               → key exists?
{**d1, **d2}         → merge two dicts
d.update({k: v})     → add/update key
sorted(d, key=...)   → sort by value
collections.Counter  → count occurrences
Module 05 · Python

OOP & Classes

Every sklearn model is a class. Every PyTorch model is a class. Understanding __init__, methods, inheritance, and dunder methods lets you write custom models and transformers.

05_oop.py — Custom sklearn-style transformer
python
# ─── The pattern ALL sklearn models follow ──────────────
class MinMaxScaler:
    """Scales features to [0, 1] range."""

    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range
        self.min_ = None     # trailing underscore = "set by fit()"
        self.max_ = None

    def fit(self, X):
        """Learn min and max from training data."""
        self.min_ = min(X)
        self.max_ = max(X)
        return self          # returning self enables chaining

    def transform(self, X):
        """Apply learned scaling to new data."""
        if self.min_ is None:
            raise RuntimeError("Call fit() before transform()")
        return [(x - self.min_) / (self.max_ - self.min_) for x in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)   # method chaining!

    def __repr__(self):
        return f"MinMaxScaler(feature_range={self.feature_range})"

# ─── Inheritance ────────────────────────────────────────
class ClippingScaler(MinMaxScaler):    # inherits from MinMaxScaler
    def transform(self, X):            # override parent method
        scaled = super().transform(X)  # call parent's transform
        return [max(0, min(1, v)) for v in scaled]  # then clip

scaler = MinMaxScaler()
scaler.fit_transform([10, 20, 30, 40])  # → [0.0, 0.33, 0.67, 1.0] (rounded)
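The habit this fit/transform pattern enables: learn statistics on training data only, then reuse them on anything. A condensed redefinition so the snippet runs standalone.

```python
class MinMaxScaler:
    def __init__(self):
        self.min_ = self.max_ = None
    def fit(self, X):
        # learn the scaling stats from training data only
        self.min_, self.max_ = min(X), max(X)
        return self
    def transform(self, X):
        return [(x - self.min_) / (self.max_ - self.min_) for x in X]

scaler = MinMaxScaler().fit([10, 20, 30, 40])  # stats come from "train"
assert scaler.transform([25]) == [0.5]         # new data reuses those stats
assert scaler.transform([70]) == [2.0]         # values outside [0, 1] are expected
```

Transforming test data with train statistics (not refitting) is exactly how sklearn avoids data leakage.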
Module 06 · Python

Comprehensions, Lambdas & Functional Tools

Comprehensions make data processing code dramatically shorter. Lambdas are for quick one-liner functions. map/filter/sorted are used constantly in feature engineering.

06_comprehensions.py
python
# ─── List comprehension — the most used Python pattern ──
squares = [x**2 for x in range(10)]
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# With a filter condition:
pos_features = [x for x in values if x > 0]

# Transform and filter at once:
clean = [float(x) for x in raw_data if x is not None]

# ─── Dict comprehension ─────────────────────────────────
feature_names = ["age", "income", "score"]
feature_index = {name: i for i, name in enumerate(feature_names)}
# → {"age": 0, "income": 1, "score": 2}

# Invert a dictionary:
index_feature = {v: k for k, v in feature_index.items()}

# ─── Lambda — anonymous one-liner function ──────────────
square = lambda x: x**2
square(5)    # → 25

# Most common use: as the key= argument for sorting
models = [("RF", 0.92), ("SVM", 0.87), ("LR", 0.81)]
sorted(models, key=lambda x: x[1], reverse=True)
# → [("RF", 0.92), ("SVM", 0.87), ("LR", 0.81)]

# ─── map() and filter() ─────────────────────────────────
doubled = list(map(lambda x: x*2, [1, 2, 3]))             # [2, 4, 6]
positive = list(filter(lambda x: x > 0, [-1, 2, -3, 4]))  # [2, 4]

# ─── any() and all() — used in validation ───────────────
any([False, True, False])     # → True (at least one True)
all([True, True, True])       # → True (all True)
any(x > 0.9 for x in scores)  # any score above 0.9?
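Combining the tools above on made-up raw data: comprehension with filter, lambda as a key function, and any/all for validation.

```python
raw = ["3.5", None, "7.25", "", "1.0"]
clean = [float(x) for x in raw if x not in (None, "")]  # filter + transform
assert clean == [3.5, 7.25, 1.0]

models = [("RF", 0.92), ("SVM", 0.87), ("LR", 0.81)]
best = max(models, key=lambda m: m[1])   # lambda picks what to compare
assert best == ("RF", 0.92)
assert any(score > 0.9 for _, score in models)
assert all(score > 0.5 for _, score in models)
```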
Module 07 · NumPy

Arrays, Operations & Vectorization

NumPy is the numerical backbone of ML. A NumPy array is like a Python list, but up to 100x faster because operations happen in compiled C. Your ML dataset is an ndarray. Every sklearn model input is an ndarray.

💡 The Core Idea — Vectorization

Never loop over array elements in Python. NumPy operates on entire arrays at once in compiled C code. A Python loop over 1M elements is slow; the equivalent NumPy operation is near-instant. This principle applies everywhere in ML.
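A tiny equivalence check (not a benchmark): the vectorized expression computes exactly what the Python loop would, in one compiled call.

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)
vec = x * 2 + 1                            # one vectorized expression over 1M elements

loop_first5 = [v * 2 + 1 for v in x[:5]]   # the slow per-element way, on 5 values only
assert vec[:5].tolist() == loop_first5
```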

NumPy Array Shapes — Visualized (1D · 2D · 3D)
07_numpy_arrays.py
python
import numpy as np   # always aliased as np

# ─── Creating arrays ────────────────────────────────────
a = np.array([1, 2, 3, 4, 5])         # from Python list
b = np.zeros((5, 3))                  # 5×3 zeros
c = np.ones((3, 3))                   # 3×3 ones
I = np.eye(4)                         # 4×4 identity matrix
r = np.random.randn(100, 10)          # 100 samples, 10 features (Gaussian)
u = np.random.uniform(0, 1, (50, 5))  # uniform random, 50×5
s = np.linspace(0, 1, 100)            # 100 evenly spaced from 0 to 1

# ─── Shape, dtype, size ─────────────────────────────────
X = np.random.randn(1000, 20)  # 1000 samples, 20 features
X.shape   # → (1000, 20)
X.ndim    # → 2
X.dtype   # → float64
X.size    # → 20000 (total elements)

# ─── VECTORIZED operations — FAST ───────────────────────
X + 10         # add 10 to every element
X * 2          # multiply every element by 2
X ** 2         # square every element
np.exp(X)      # e^x for every element (sigmoid needs this!)
np.log(X)      # natural log (cross-entropy loss needs this!)
np.sqrt(X)     # square root
np.abs(X)      # absolute value (MAE loss!)

# ─── Aggregate operations ───────────────────────────────
X.sum()              # total sum
X.sum(axis=0)        # sum down each column (per-feature totals)
X.sum(axis=1)        # sum across each row (per-sample totals)
X.mean(), X.std()    # global mean and std
X.max(axis=0)        # max per feature
X.argmax()           # flat index of the maximum (predicted class!)

# ─── dtype conversion ───────────────────────────────────
X.astype(np.float32)  # float32 uses half the memory of float64
X.astype(int)         # convert to integer
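The axis convention trips everyone up, so here is a concrete 3×2 example of the aggregates above.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3 samples, 2 features

assert X.sum(axis=0).tolist() == [9.0, 12.0]        # collapse rows → per-feature
assert X.sum(axis=1).tolist() == [3.0, 7.0, 11.0]   # collapse columns → per-sample
assert X.argmax() == 5                              # flat index of the largest element
assert X.astype(np.float32).nbytes == X.nbytes // 2 # float32 halves the memory
```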
Module 08 · NumPy

Indexing, Slicing & Broadcasting

Broadcasting is NumPy's most powerful (and confusing) feature. It lets you do operations between arrays of different shapes without writing loops. Understanding it unlocks everything in deep learning.

Broadcasting — How Shapes Align (visual)

A (3×3) matrix minus a (3,) vector: the vector is automatically "broadcast" across all rows. This is how you subtract the mean from every sample in one line.

08_indexing_broadcasting.py
python
import numpy as np

X = np.random.randn(100, 5)   # 100 samples, 5 features

# ─── Basic indexing ─────────────────────────────────────
X[0]            # first row (sample 0)
X[0, 2]         # row 0, column 2
X[-1]           # last row
X[5:10]         # rows 5-9
X[:, 0]         # entire first column (feature 0)
X[:, :3]        # first 3 features only
X[10:20, 1:4]   # rows 10-19, columns 1-3

# ─── Boolean indexing — most important for filtering ────
X[X > 0]                   # all positive values (flattens!)
mask = X[:, 0] > 0         # mask: True/False for each sample
X[mask]                    # samples where feature 0 is positive
X[X[:, 0] > 0]             # one-liner version

# ─── Fancy indexing — select specific rows ──────────────
indices = np.array([0, 5, 10, 50])
X[indices]   # select exactly those rows

# ─── Reshaping — critical for deep learning ─────────────
flat = X.reshape(-1)        # flatten to 1D → (500,)
col = X.reshape(-1, 1)      # column vector → (500, 1)
img = np.zeros((28*28,)).reshape(1, 28, 28)  # flat MNIST pixels → image
X_T = X.T                   # transpose

# ─── BROADCASTING — the magic ───────────────────────────
mean = X.mean(axis=0)   # shape (5,) — per-feature mean
std  = X.std(axis=0)    # shape (5,)
X_norm = (X - mean) / std  # (100,5)-(5,) → broadcast! StandardScaler in one line

# Broadcasting rules:
# 1. Align shapes from the RIGHT
# 2. Dimensions of size 1 stretch to match
# (100,5) - (5,)  →  (100,5) - (1,5)  →  (100,5) ✓

# ─── np.where — vectorized if/else ──────────────────────
labels = np.where(X[:, 0] > 0, 1, 0)  # 1 if positive else 0
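Verifying the one-line StandardScaler from above on random data: after the broadcast subtraction and division, every feature has mean ~0 and std ~1.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 5))

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # (100,5) - (5,) broadcasts row-wise
assert np.allclose(X_norm.mean(axis=0), 0.0)    # each feature centered
assert np.allclose(X_norm.std(axis=0), 1.0)     # each feature unit-scaled

labels = np.where(X[:, 0] > 5.0, 1, 0)          # vectorized if/else
assert set(labels.tolist()) <= {0, 1}
```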
Module 09 · NumPy

Linear Algebra & Statistics

np.linalg and np.dot are the building blocks of every ML algorithm. Matrix multiplication IS neural network forward pass. Eigendecomposition IS PCA. These aren't optional.

09_linalg.py
python
import numpy as np

# ─── Matrix multiplication — the core of forward pass ───
X = np.random.randn(64, 20)   # batch of 64, 20 features
W = np.random.randn(20, 10)   # weight matrix
b = np.zeros(10)              # bias vector

Z = X @ W + b                 # → shape (64, 10) — one linear layer!
# Same as: Z = np.dot(X, W) + b

# ─── Dot product — similarity between vectors ───────────
a = np.array([1, 2, 3])
b_v = np.array([4, 5, 6])
np.dot(a, b_v)     # → 32   (1*4 + 2*5 + 3*6)

# Cosine similarity (used in the RAG you already built!):
cos_sim = np.dot(a, b_v) / (np.linalg.norm(a) * np.linalg.norm(b_v))

# ─── Essential linalg ───────────────────────────────────
np.linalg.norm(a)           # Euclidean norm (length of vector)
np.linalg.det(W[:10, :10])  # determinant
np.linalg.inv(W[:10, :10])  # matrix inverse (the OLS formula needs this)
vals, vecs = np.linalg.eig(np.cov(X.T))           # eigenvalues/vectors → PCA!
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # SVD → dimensionality reduction

# ─── Statistics ─────────────────────────────────────────
np.mean(X, axis=0)     # per-feature mean
np.std(X, axis=0)      # per-feature std
np.median(X, axis=0)   # per-feature median (robust to outliers)
np.percentile(X, [25, 75], axis=0)  # Q1 and Q3 for IQR
np.corrcoef(X.T)       # correlation matrix (feature correlations)
np.cov(X.T)            # covariance matrix (input to PCA)

# ─── Random with seed — reproducibility ─────────────────
np.random.seed(42)                  # legacy way
rng = np.random.default_rng(42)     # modern way
rng.integers(0, 100, size=10)       # random ints
rng.shuffle(X)                      # shuffle rows in place
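Checking the numbers and shapes above: the dot product, the cosine similarity, and the linear-layer output shape.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b_v = np.array([4.0, 5.0, 6.0])
assert np.dot(a, b_v) == 32.0          # 1*4 + 2*5 + 3*6

cos_sim = np.dot(a, b_v) / (np.linalg.norm(a) * np.linalg.norm(b_v))
assert 0.97 < cos_sim < 0.98           # nearly parallel vectors

X = np.random.randn(64, 20)
W = np.random.randn(20, 10)
b = np.zeros(10)
Z = X @ W + b
assert Z.shape == (64, 10)             # (64,20) @ (20,10) → (64,10)
```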
Module 10 · Pandas

Series & DataFrame — The Foundation

Pandas is your primary tool for loading, exploring, and cleaning tabular data before it goes into any ML model. A DataFrame is like an Excel sheet with superpowers, and the Python you already know transfers directly.

DataFrame Anatomy — What Every Part Is (visual)
10_pandas_basics.py
python
import pandas as pd    # always aliased as pd
import numpy as np

# ─── Series — 1D labeled array ──────────────────────────
ages = pd.Series([25, 30, 35, 28], index=["Alice", "Bob", "Charlie", "Diana"])
ages["Bob"]      # → 30  (label-based)
ages.iloc[1]     # → 30  (position-based; plain ages[1] is deprecated in pandas 2.x)
ages > 27        # → boolean Series (True/False)
ages.mean()      # → 29.5

# ─── DataFrame — the main tool ──────────────────────────
df = pd.DataFrame({
    "age":    [25, 30, 35, 28, 45],
    "income": [50000, 75000, 90000, 62000, 120000],
    "label":  [0, 1, 1, 0, 1]
})

# ─── First things you do with ANY new dataset ───────────
df.shape           # → (5, 3) — rows, columns
df.head(5)         # first 5 rows
df.tail(3)         # last 3 rows
df.info()          # column types + null counts ← RUN THIS FIRST
df.describe()      # count, mean, std, min, quartiles, max
df.dtypes          # data type of each column
df.isnull().sum()  # missing values per column ← ALWAYS CHECK
df.columns         # column names

# ─── Column access ──────────────────────────────────────
df["age"]              # → Series (single column)
df[["age", "income"]]  # → DataFrame (multiple columns)
df.age                 # dot notation (only if the name has no spaces)

# ─── .loc vs .iloc — IMPORTANT distinction ──────────────
df.loc[0]                # row by INDEX LABEL
df.iloc[0]               # row by POSITION  ← usually what you want
df.loc[0:2, "age"]       # rows 0-2 (inclusive!), column "age"
df.iloc[0:2, 0]          # rows 0-1 (exclusive), column 0

# ─── Load CSV — your most used line ─────────────────────
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", sep=";", encoding="utf-8", low_memory=False)
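The loc/iloc distinction made concrete with a non-default index, where label and position genuinely differ.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 35]}, index=[10, 20, 30])

assert df.loc[20, "age"] == 30                    # by index LABEL
assert df.iloc[1]["age"] == 30                    # by POSITION
assert df.loc[10:20, "age"].tolist() == [25, 30]  # .loc slices are inclusive
assert df.iloc[0:1, 0].tolist() == [25]           # .iloc slices are exclusive
```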
Module 11 · Pandas

Filtering, Sorting & GroupBy

Filtering lets you select relevant subsets. GroupBy lets you compute stats per category, like mean income per label or fraud rate per city. These are the EDA workhorses.

11_filter_groupby.py
python
# ─── Boolean filtering ──────────────────────────────────
young = df[df["age"] < 30]
high_income = df[df["income"] > 80000]
fraud = df[(df["label"] == 1) & (df["income"] < 40000)]  # & not 'and'
either = df[(df["age"] > 50) | (df["label"] == 1)]       # | not 'or'

# .query() method — sometimes cleaner
df.query("age < 30 and income > 50000")
df.query("label in [0, 1] and age > @threshold")  # @ for Python variables

# .isin() — filter by a list of values
df[df["city"].isin(["Mumbai", "Delhi", "Bangalore"])]

# ─── Sorting ────────────────────────────────────────────
df.sort_values("income", ascending=False)  # by income, descending
df.sort_values(["label", "income"], ascending=[True, False])

# ─── GroupBy — the most important Pandas operation ──────
# Mean income per label (class 0 vs class 1)
df.groupby("label")["income"].mean()

# Multiple aggregations at once
df.groupby("label").agg({
    "income": ["mean", "std", "count"],
    "age":    ["mean", "min", "max"]
})

# Count per category
df["label"].value_counts()               # class distribution
df["city"].value_counts(normalize=True)  # as proportions

# GroupBy + transform — add the group mean as a new column
df["mean_income_by_label"] = df.groupby("label")["income"].transform("mean")
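A self-contained check of groupby, transform, and value_counts on a tiny made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"label":  [0, 0, 1, 1],
                   "income": [40, 60, 100, 120]})

means = df.groupby("label")["income"].mean()
assert means[0] == 50.0 and means[1] == 110.0       # per-class means

df["grp_mean"] = df.groupby("label")["income"].transform("mean")
assert df["grp_mean"].tolist() == [50.0, 50.0, 110.0, 110.0]

dist = df["label"].value_counts(normalize=True)
assert dist[0] == 0.5 and dist[1] == 0.5            # balanced classes
```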
Module 12 · Pandas

Data Cleaning & Merging

Real datasets are dirty. Missing values, wrong types, duplicates, inconsistent strings. This is where 70% of real ML time goes. Master these patterns and you'll work 10x faster.

12_cleaning_merging.py
python
# ─── Missing values ─────────────────────────────────────
df.isnull().sum()            # null count per column
df.isnull().mean() * 100     # % missing per column

df.dropna()                   # drop ALL rows with any null
df.dropna(subset=["income"])  # drop only if income is null
df.dropna(thresh=3)           # keep rows with ≥3 non-null values

df["income"].fillna(df["income"].median())  # fill with median
df["city"].fillna("Unknown")                # fill categorical
df.ffill()  # forward fill (time series) — fillna(method="ffill") is deprecated

# ─── Duplicates ─────────────────────────────────────────
df.duplicated().sum()             # how many duplicate rows?
df.drop_duplicates(inplace=True)  # remove them

# ─── Type conversion ────────────────────────────────────
df["age"] = df["age"].astype(int)
df["date"] = pd.to_datetime(df["date"])  # parse dates
df["income"] = pd.to_numeric(df["income"], errors="coerce")  # "coerce" → NaN on error

# ─── String cleaning ────────────────────────────────────
df["name"].str.lower()              # lowercase
df["name"].str.strip()              # remove surrounding whitespace
df["name"].str.replace(",", "")     # remove commas
df["city"].str.contains("Mumbai")   # boolean mask
df["email"].str.extract(r"(\w+)@")  # regex extract

# ─── Merging (like SQL JOINs) ───────────────────────────
result = pd.merge(df1, df2, on="user_id", how="left")
# how: "inner" (both), "left" (all left), "right", "outer" (all)

# Concatenate DataFrames
combined = pd.concat([df_train, df_val], ignore_index=True)  # stack rows
wide = pd.concat([df1, df2], axis=1)                         # side by side

# ─── apply() — apply any function to rows/columns ───────
df["income_log"] = df["income"].apply(np.log1p)  # log transform
df["age_bucket"] = df["age"].apply(lambda x: "young" if x < 30 else "senior")
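A round-trip of the cleaning patterns above on deliberately dirty toy data: strip commas, coerce bad values to NaN, then impute with the median.

```python
import pandas as pd

df = pd.DataFrame({"income": ["50,000", "bad", "70,000"]})

# remove commas, then coerce: "bad" cannot be parsed and becomes NaN
df["income"] = pd.to_numeric(df["income"].str.replace(",", ""), errors="coerce")
assert df["income"].isnull().sum() == 1

# impute the NaN with the median of the valid values (60000)
df["income"] = df["income"].fillna(df["income"].median())
assert df["income"].tolist() == [50000.0, 60000.0, 70000.0]
```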
Module 13 · Pandas

EDA & Pre-ML Workflow

Exploratory Data Analysis (EDA) is the process of understanding your dataset BEFORE building any model. This is the complete workflow, from loading a CSV to handing off a clean array to sklearn.

Feature Correlation Matrix — EDA Tool (live)

High correlation (near ±1) between two features means multicollinearity; consider dropping one. High correlation with the target means a useful feature.

13_eda_workflow.py — The complete pre-ML pipeline
python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# โ•โ•โ• STEP 1: Load and initial inspection โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
df = pd.read_csv("data.csv")
print(f"Shape: {df.shape}")
print(df.info())            # types + null counts
print(df.describe())        # stats
print(df.isnull().sum())   # missing per column

# โ•โ•โ• STEP 2: Separate target and features โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
X = df.drop("label", axis=1)   # features DataFrame
y = df["label"]               # target Series
print(f"Class distribution:\n{y.value_counts()}")

# โ•โ•โ• STEP 3: Identify column types โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
print(f"Numeric: {numeric_cols}")
print(f"Categorical: {cat_cols}")

# โ•โ•โ• STEP 4: Check for problems โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
for col in numeric_cols:
    skewness = X[col].skew()
    if abs(skewness) > 1.0:
        print(f"{col}: skewness={skewness:.2f} โ†’ consider log transform")

# Correlation with target
correlations = df[numeric_cols + ["label"]].corr()["label"].abs().sort_values(ascending=False)
print("Feature-target correlations:")
print(correlations)

# โ•โ•โ• STEP 5: Feature engineering โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
X["income_log"] = np.log1p(X["income"])        # log transform skewed
X["income_per_age"] = X["income"] / X["age"]  # interaction feature

# โ•โ•โ• STEP 6: Split (BEFORE any scaling!) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y   # stratify for imbalanced!
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

# โ•โ•โ• STEP 7: Now feed into sklearn Pipeline (from ML guide)
# The preprocessor fits on X_train only โ€” no leakage!
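Steps 2-3 of the workflow, exercised on a tiny hypothetical frame (no sklearn needed for this part).

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age":   [25, 30, 40, 22],
                   "city":  ["A", "B", "A", "C"],
                   "label": [0, 1, 1, 0]})

X = df.drop("label", axis=1)   # features
y = df["label"]                # target

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
assert numeric_cols == ["age"] and cat_cols == ["city"]

counts = y.value_counts()      # class distribution, used to decide on stratify=
assert counts[0] == 2 and counts[1] == 2
```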
🎓 Python + NumPy + Pandas Complete

You're ML-ready when you can:

  • Write list/dict/set comprehensions without thinking
  • Create and manipulate NumPy arrays with broadcasting
  • Perform matrix multiplication for a linear layer: X @ W + b
  • Load a CSV, run info()/describe()/isnull(), split X/y correctly
  • Filter, groupby, and clean a DataFrame in under 5 minutes
  • Hand off a clean DataFrame to sklearn without data leakage