Python · NumPy · Pandas
Everything You Need for ML
A comprehensive recap of Python fundamentals, then deep dives into NumPy (matrices, broadcasting, linear algebra) and Pandas (DataFrames, cleaning, EDA). Every concept you'll use daily in ML, with live examples and interactive exercises.
Data Types & Variables
Python is dynamically typed: you don't declare types, Python infers them. But in ML you ALWAYS care about types, because the wrong type (str vs float) breaks your entire pipeline silently.
```python
# ─── The 5 core types you use in ML ───────────────────────
x = 42         # int → index, count, label (0 or 1)
y = 3.14       # float → price, probability, weight
b = True       # bool → is_fraud, has_feature (True=1, False=0)
s = "income"   # str → column name, category label
n = None       # NoneType → missing value (like NaN before pandas)

# ─── Type checking and conversion ─────────────────────────
type(x)              # → <class 'int'>
isinstance(x, int)   # → True (use this, not type() == int)
float("3.14")        # → 3.14 (string to float, common when reading CSVs)
int(3.9)             # → 3 (truncates, does NOT round)
str(42)              # → "42"
bool(0)              # → False (0, 0.0, "", [], None are all falsy)

# ─── ML gotcha: int division vs float division ────────────
7 / 2     # → 3.5 (true division, always float)
7 // 2    # → 3 (floor division, int result)
7 % 2     # → 1 (modulo: remainder)
2 ** 10   # → 1024 (power, used for learning-rate schedules)

# ─── F-strings: use these everywhere ──────────────────────
acc = 0.9342
epochs = 10
print(f"Epoch {epochs}: accuracy = {acc:.2f}")  # → Epoch 10: accuracy = 0.93
print(f"Loss: {0.00234:.4e}")                   # → Loss: 2.3400e-03
```
Control Flow & Loops
Conditionals and loops are the skeleton of every ML preprocessing script, training loop, and evaluation pipeline. Master them completely; you'll write them thousands of times.
```python
# ─── if / elif / else ─────────────────────────────────────
score = 0.87
if score >= 0.9:
    print("Excellent model!")
elif score >= 0.7:
    print("Good, consider tuning")
else:
    print("Needs work")

# ─── for loop: the workhorse ──────────────────────────────
losses = [0.9, 0.7, 0.5, 0.3, 0.15]
for i, loss in enumerate(losses):   # enumerate gives (index, value)
    print(f"Epoch {i+1}: loss={loss:.2f}")

# ─── range(): for training epochs ─────────────────────────
for epoch in range(1, 101):   # 1 to 100 inclusive
    pass                      # pass = placeholder

# ─── while loop: for convergence checks ───────────────────
loss = 1.0
threshold = 0.01
steps = 0
while loss > threshold:
    loss *= 0.85   # simulate loss decreasing
    steps += 1
    if steps > 1000:
        print("Did not converge!")
        break      # break exits the loop
print(f"Converged in {steps} steps")

# ─── zip(): iterate two lists together ────────────────────
features = ["age", "income", "score"]
weights = [0.3, 0.5, 0.2]
for feat, w in zip(features, weights):
    print(f"{feat}: weight = {w}")
```
A real training loop is exactly this: a Python for loop over epochs where each step calls backward(), optimizer.step(), and appends the loss to a list, just like the code above.
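Stripped of any framework, that shape can be sketched in plain Python. The `train_step` function below is a hypothetical stand-in for a framework's backward() + optimizer.step(); only the loop structure mirrors real training code.

```python
def train_step(w, lr=0.1):
    """Hypothetical update rule: shrink a single weight toward 0.
    Stands in for a framework's backward() + optimizer.step()."""
    return w * (1 - lr)

w = 1.0
losses = []
for epoch in range(1, 6):
    w = train_step(w)      # update parameters
    loss = w ** 2          # pretend the loss is the squared weight
    losses.append(loss)    # record losses for plotting later
    print(f"Epoch {epoch}: loss={loss:.4f}")
# the recorded losses decrease monotonically: 0.8100, 0.6561, ...
```

Swap in a real model, loss, and optimizer and the loop itself barely changes.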
Functions & Scope
In ML, functions are everywhere: preprocessing steps, loss functions, metric calculators, data loaders. Understanding scope, default arguments, and *args/**kwargs is essential.
```python
# ─── Basic function ───────────────────────────────────────
def accuracy(y_true: list, y_pred: list) -> float:
    """Calculate classification accuracy.

    Args:
        y_true: actual labels
        y_pred: predicted labels
    Returns:
        accuracy score between 0 and 1
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# ─── Default arguments ────────────────────────────────────
def normalize(x: float, method: str = "minmax", clip: bool = False) -> float:
    # Default args must come AFTER positional args
    pass

normalize(5.0)                    # uses defaults
normalize(5.0, method="zscore")   # keyword arg
normalize(5.0, "zscore", True)    # positional

# ─── *args and **kwargs ───────────────────────────────────
def log_metrics(*metrics, **extra_info):
    # *args: any number of positional args → tuple
    # **kwargs: any keyword args → dict
    for m in metrics:
        print(f"metric: {m}")
    for k, v in extra_info.items():
        print(f"{k} = {v}")

log_metrics(0.92, 0.87, epoch=10, lr=0.001)

# ─── Returning multiple values (returns a tuple) ──────────
def train_test_split_manual(data, ratio=0.8):
    split = int(len(data) * ratio)
    return data[:split], data[split:]   # returns a tuple

train, test = train_test_split_manual([1, 2, 3, 4, 5])
# train=[1, 2, 3, 4]  test=[5]
```
Lists, Dicts, Tuples & Sets
Lists store sequences of features. Dicts store hyperparameters and configs. Tuples store immutable shapes. Sets check membership. You use all four daily in ML.
```python
# ─── LIST: ordered, mutable, allows duplicates ────────────
features = ["age", "income", "score", "age"]
features[0]               # → "age" (first item)
features[-1]              # → "age" (last item)
features[1:3]             # → ["income", "score"] (slicing)
len(features)             # → 4
features.append("city")   # add to end
features.remove("score")  # remove first occurrence
features.sort()           # sort in place

# ─── DICT: key-value pairs, fast lookup ───────────────────
hyperparams = {
    "lr": 0.001,
    "epochs": 100,
    "batch_size": 32,
    "optimizer": "adam",
}
hyperparams["lr"]                # → 0.001
hyperparams.get("dropout", 0.5)  # → 0.5 (default if missing)
"epochs" in hyperparams          # → True (key lookup)
hyperparams.keys()               # → dict_keys(['lr', 'epochs', ...])
hyperparams.values()             # → dict_values([0.001, 100, ...])
hyperparams.items()              # → for k, v in ...: use this!

# ─── TUPLE: immutable, used for shapes and coordinates ────
shape = (64, 3, 224, 224)            # batch, channels, H, W (image tensor shape)
batch_size, channels, h, w = shape   # unpacking

# ─── SET: unique items, fast membership check ─────────────
seen_labels = {"cat", "dog", "bird"}
"cat" in seen_labels     # → True (O(1) vs O(n) for a list)
seen_labels.add("fish")  # add element
set(features)            # remove duplicates from a list
```
OOP & Classes
Every sklearn model is a class. Every PyTorch model is a class. Understanding __init__, methods, inheritance, and dunder methods lets you write custom models and transformers.
```python
# ─── The pattern ALL sklearn models follow ────────────────
class MinMaxScaler:
    """Scales features to the [0, 1] range."""

    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range
        self.min_ = None   # trailing underscore = "set by fit()"
        self.max_ = None

    def fit(self, X):
        """Learn min and max from training data."""
        self.min_ = min(X)
        self.max_ = max(X)
        return self   # returning self enables chaining

    def transform(self, X):
        """Apply learned scaling to new data."""
        if self.min_ is None:
            raise RuntimeError("Call fit() before transform()")
        return [(x - self.min_) / (self.max_ - self.min_) for x in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)   # method chaining!

    def __repr__(self):
        return f"MinMaxScaler(feature_range={self.feature_range})"

# ─── Inheritance ──────────────────────────────────────────
class ClippingScaler(MinMaxScaler):      # inherits from MinMaxScaler
    def transform(self, X):              # override parent method
        scaled = super().transform(X)    # call parent's transform
        return [max(0, min(1, v)) for v in scaled]   # then clip

scaler = MinMaxScaler()
scaler.fit_transform([10, 20, 30, 40])   # → [0.0, 0.333..., 0.666..., 1.0]
```
Comprehensions, Lambdas & Functional Tools
Comprehensions make data processing code dramatically shorter. Lambdas are for quick one-liner functions. map/filter/sorted are used constantly in feature engineering.
```python
# ─── List comprehension: the most used Python pattern ─────
squares = [x**2 for x in range(10)]   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# With a filter condition:
pos_features = [x for x in values if x > 0]
# Transform and filter at once:
clean = [float(x) for x in raw_data if x is not None]

# ─── Dict comprehension ───────────────────────────────────
feature_names = ["age", "income", "score"]
feature_index = {name: i for i, name in enumerate(feature_names)}
# → {"age": 0, "income": 1, "score": 2}
# Invert a dictionary:
index_feature = {v: k for k, v in feature_index.items()}

# ─── Lambda: anonymous one-liner function ─────────────────
square = lambda x: x**2
square(5)   # → 25
# Most common use: as the key= argument for sorting
models = [("SVM", 0.87), ("RF", 0.92), ("LR", 0.81)]
sorted(models, key=lambda x: x[1], reverse=True)
# → [("RF", 0.92), ("SVM", 0.87), ("LR", 0.81)]

# ─── map() and filter() ───────────────────────────────────
doubled = list(map(lambda x: x*2, [1, 2, 3]))             # [2, 4, 6]
positive = list(filter(lambda x: x > 0, [-1, 2, -3, 4]))  # [2, 4]

# ─── any() and all(): used in validation ──────────────────
any([False, True, False])     # → True (at least one True)
all([True, True, True])       # → True (all True)
any(x > 0.9 for x in scores)  # any score above 0.9?
```
Arrays, Operations & Vectorization
NumPy is the numerical backbone of ML. A NumPy array is like a Python list, but often 100x faster, because operations happen in compiled C. Your ML dataset is an ndarray. Every sklearn model input is an ndarray.
Never loop over array elements in Python. NumPy operates on entire arrays at once in compiled C code: a Python loop over 1M elements is slow, while the equivalent NumPy operation is near-instant. This principle applies everywhere in ML.
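You can measure the gap yourself. A quick sketch: sum one million numbers with a Python loop, then with `x.sum()`. Exact timings depend entirely on your machine; only the ordering is predictable.

```python
import time
import numpy as np

x = np.random.randn(1_000_000)

# Pure-Python loop: each element passes through interpreted bytecode
t0 = time.perf_counter()
total_loop = 0.0
for v in x:
    total_loop += v
loop_time = time.perf_counter() - t0

# Vectorized: one call into compiled C
t0 = time.perf_counter()
total_np = x.sum()
np_time = time.perf_counter() - t0

print(f"loop: {loop_time:.4f}s   numpy: {np_time:.6f}s")
# Same result, but the NumPy version is typically orders of magnitude faster.
```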
```python
import numpy as np   # always aliased as np

# ─── Creating arrays ──────────────────────────────────────
a = np.array([1, 2, 3, 4, 5])          # from a Python list
b = np.zeros((5, 3))                   # 5×3 zeros
c = np.ones((3, 3))                    # 3×3 ones
I = np.eye(4)                          # 4×4 identity matrix
r = np.random.randn(100, 10)           # 100 samples, 10 features (Gaussian)
u = np.random.uniform(0, 1, (50, 5))   # uniform random, 50×5
s = np.linspace(0, 1, 100)             # 100 evenly spaced values from 0 to 1

# ─── Shape, dtype, size ───────────────────────────────────
X = np.random.randn(1000, 20)   # 1000 samples, 20 features
X.shape   # → (1000, 20)
X.ndim    # → 2
X.dtype   # → float64
X.size    # → 20000 (total elements)

# ─── VECTORIZED operations: FAST ──────────────────────────
X + 10       # add 10 to every element
X * 2        # multiply every element by 2
X ** 2       # square every element
np.exp(X)    # e^x for every element (sigmoid needs this!)
np.log(X)    # natural log (cross-entropy needs this; NaN for negative inputs)
np.sqrt(X)   # square root
np.abs(X)    # absolute value (MAE loss!)

# ─── Aggregate operations ─────────────────────────────────
X.sum()            # total sum
X.sum(axis=0)      # sum down each column (per-feature sums)
X.sum(axis=1)      # sum across each row (per-sample totals)
X.mean(), X.std()  # global mean and std
X.max(axis=0)      # max per feature
X.argmax()         # index of the maximum (predicted class!)

# ─── dtype conversion ─────────────────────────────────────
X.astype(np.float32)   # float32 uses half the memory of float64
X.astype(int)          # convert to integer
```
Indexing, Slicing & Broadcasting
Broadcasting is NumPy's most powerful (and confusing) feature. It lets you do operations between arrays of different shapes without writing loops. Understanding it unlocks everything in deep learning.
A (3×3) matrix minus a (3,) vector: the vector is automatically "broadcast" across all rows. This is how you subtract the mean from every sample in one line.
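A minimal sketch of exactly that: subtract a (3,)-shaped per-feature mean from a (3×3) matrix, centering every column in one line.

```python
import numpy as np

M = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])   # shape (3, 3): 3 samples × 3 features
col_mean = M.mean(axis=0)      # shape (3,): per-feature mean → [4., 5., 6.]

centered = M - col_mean        # (3,3) - (3,): vector broadcast across all rows
print(centered)
# [[-3. -3. -3.]
#  [ 0.  0.  0.]
#  [ 3.  3.  3.]]
print(centered.mean(axis=0))   # → [0. 0. 0.]: every feature now has zero mean
```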
```python
import numpy as np
X = np.random.randn(100, 5)   # 100 samples, 5 features

# ─── Basic indexing ───────────────────────────────────────
X[0]            # first row (sample 0)
X[0, 2]         # row 0, column 2
X[-1]           # last row
X[5:10]         # rows 5-9
X[:, 0]         # entire first column (feature 0)
X[:, :3]        # first 3 features only
X[10:20, 1:4]   # rows 10-19, columns 1-3

# ─── Boolean indexing: most important for filtering ───────
X[X > 0]             # all positive values (flattens!)
mask = X[:, 0] > 0   # mask: True/False for each sample
X[mask]              # samples where feature 0 is positive
X[X[:, 0] > 0]       # one-liner version

# ─── Fancy indexing: select specific rows ─────────────────
indices = np.array([0, 5, 10, 50])
X[indices]   # select exactly those rows

# ─── Reshaping: critical for deep learning ────────────────
flat = X.reshape(-1)     # flatten to 1D → shape (500,)
col = X.reshape(-1, 1)   # column vector → shape (500, 1)
img = np.zeros((28*28,)).reshape(1, 28, 28)   # MNIST: flat pixels → image
X_T = X.T                # transpose

# ─── BROADCASTING: the magic ──────────────────────────────
mean = X.mean(axis=0)       # shape (5,): per-feature mean
std = X.std(axis=0)         # shape (5,)
X_norm = (X - mean) / std   # (100,5)-(5,) broadcasts: StandardScaler in one line
# Broadcasting rules:
# 1. Align shapes from the RIGHT
# 2. Dimensions of size 1 stretch to match
# (100,5) - (5,) → (100,5) - (1,5) → (100,5) ✓

# ─── np.where: vectorized if/else ─────────────────────────
labels = np.where(X[:, 0] > 0, 1, 0)   # 1 if positive else 0
```
Linear Algebra & Statistics
np.linalg and np.dot are the building blocks of every ML algorithm. Matrix multiplication IS the neural-network forward pass. Eigendecomposition IS PCA. These aren't optional.
```python
import numpy as np

# ─── Matrix multiplication: the core of the forward pass ──
X = np.random.randn(64, 20)   # batch of 64, 20 features
W = np.random.randn(20, 10)   # weight matrix
b = np.zeros(10)              # bias vector
Z = X @ W + b                 # → shape (64, 10): one linear layer!
# Same as: Z = np.dot(X, W) + b

# ─── Dot product: similarity between vectors ──────────────
a = np.array([1, 2, 3])
b_v = np.array([4, 5, 6])
np.dot(a, b_v)   # → 32 (1*4 + 2*5 + 3*6)
# Cosine similarity (used in the RAG you already built!):
cos_sim = np.dot(a, b_v) / (np.linalg.norm(a) * np.linalg.norm(b_v))

# ─── Essential linalg ─────────────────────────────────────
np.linalg.norm(a)            # Euclidean norm (length of a vector)
np.linalg.det(W[:10, :10])   # determinant
np.linalg.inv(W[:10, :10])   # matrix inverse (the OLS formula needs this)
vals, vecs = np.linalg.eig(np.cov(X.T))   # eigenvalues/vectors → PCA!
U, S, Vt = np.linalg.svd(X, full_matrices=False)   # SVD → dimensionality reduction

# ─── Statistics ───────────────────────────────────────────
np.mean(X, axis=0)     # per-feature mean
np.std(X, axis=0)      # per-feature std
np.median(X, axis=0)   # per-feature median (robust to outliers)
np.percentile(X, [25, 75], axis=0)   # Q1 and Q3 for the IQR
np.corrcoef(X.T)       # correlation matrix (feature correlations)
np.cov(X.T)            # covariance matrix (input to PCA)

# ─── Random with a seed: reproducibility ──────────────────
np.random.seed(42)                # old way
rng = np.random.default_rng(42)   # modern way
rng.integers(0, 100, size=10)     # random ints
rng.shuffle(X)                    # shuffle rows in place
```
Series & DataFrame: The Foundation
Pandas is your primary tool for loading, exploring, and cleaning tabular data before it goes into any ML model. A DataFrame is like an Excel sheet with superpowers, and the Python you already know transfers directly.
```python
import pandas as pd   # always aliased as pd
import numpy as np

# ─── Series: 1D labeled array ─────────────────────────────
ages = pd.Series([25, 30, 35, 28],
                 index=["Alice", "Bob", "Charlie", "Diana"])
ages["Bob"]    # → 30 (label-based)
ages.iloc[1]   # → 30 (position-based; plain ages[1] is deprecated)
ages > 27      # → boolean Series (True/False)
ages.mean()    # → 29.5

# ─── DataFrame: the main tool ─────────────────────────────
df = pd.DataFrame({
    "age": [25, 30, 35, 28, 45],
    "income": [50000, 75000, 90000, 62000, 120000],
    "label": [0, 1, 1, 0, 1],
})

# ─── First things you do with ANY new dataset ─────────────
df.shape            # → (5, 3): rows, columns
df.head(5)          # first 5 rows
df.tail(3)          # last 3 rows
df.info()           # column types + null counts: RUN THIS FIRST
df.describe()       # count, mean, std, min, quartiles, max
df.dtypes           # data type of each column
df.isnull().sum()   # missing values per column: ALWAYS CHECK
df.columns          # column names

# ─── Column access ────────────────────────────────────────
df["age"]              # → Series (single column)
df[["age", "income"]]  # → DataFrame (multiple columns)
df.age                 # dot notation (only if no spaces in the name)

# ─── .loc vs .iloc: an IMPORTANT distinction ──────────────
df.loc[0]            # row by INDEX LABEL
df.iloc[0]           # row by POSITION: usually what you want
df.loc[0:2, "age"]   # rows 0-2 (inclusive!), column "age"
df.iloc[0:2, 0]      # rows 0-1 (exclusive), column 0

# ─── Load a CSV: your most used line ──────────────────────
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", sep=";", encoding="utf-8", low_memory=False)
```
Filtering, Sorting & GroupBy
Filtering lets you select relevant subsets. GroupBy lets you compute stats per category, like mean income per label or fraud rate per city. These are the EDA workhorses.
```python
# ─── Boolean filtering ────────────────────────────────────
young = df[df["age"] < 30]
high_income = df[df["income"] > 80000]
fraud = df[(df["label"] == 1) & (df["income"] < 40000)]   # & not 'and'
either = df[(df["age"] > 50) | (df["label"] == 1)]        # | not 'or'

# .query() method: sometimes cleaner
df.query("age < 30 and income > 50000")
df.query("label in [0, 1] and age > @threshold")   # @ references a Python variable

# .isin(): filter by a list of values
df[df["city"].isin(["Mumbai", "Delhi", "Bangalore"])]

# ─── Sorting ──────────────────────────────────────────────
df.sort_values("income", ascending=False)   # by income, descending
df.sort_values(["label", "income"], ascending=[True, False])

# ─── GroupBy: the most important Pandas operation ─────────
# Mean income per label (class 0 vs class 1)
df.groupby("label")["income"].mean()

# Multiple aggregations at once
df.groupby("label").agg({
    "income": ["mean", "std", "count"],
    "age": ["mean", "min", "max"],
})

# Count per category
df["label"].value_counts()                # class distribution
df["city"].value_counts(normalize=True)   # as proportions

# GroupBy + transform: add the group mean as a new column
df["mean_income_by_label"] = df.groupby("label")["income"].transform("mean")
```
Data Cleaning & Merging
Real datasets are dirty. Missing values, wrong types, duplicates, inconsistent strings. This is where 70% of real ML time goes. Master these patterns and you'll work 10x faster.
```python
# ─── Missing values ───────────────────────────────────────
df.isnull().sum()          # null count per column
df.isnull().mean() * 100   # % missing per column
df.dropna()                    # drop ALL rows with any null
df.dropna(subset=["income"])   # drop only if income is null
df.dropna(thresh=3)            # keep rows with ≥ 3 non-null values
df["income"].fillna(df["income"].median())   # fill with the median
df["city"].fillna("Unknown")                 # fill a categorical
df.ffill()   # forward fill (time series); fillna(method="ffill") is deprecated

# ─── Duplicates ───────────────────────────────────────────
df.duplicated().sum()              # how many duplicate rows?
df.drop_duplicates(inplace=True)   # remove them

# ─── Type conversion ──────────────────────────────────────
df["age"] = df["age"].astype(int)
df["date"] = pd.to_datetime(df["date"])   # parse dates
df["income"] = pd.to_numeric(df["income"], errors="coerce")   # "coerce" → NaN on error

# ─── String cleaning ──────────────────────────────────────
df["name"].str.lower()              # lowercase
df["name"].str.strip()              # remove surrounding whitespace
df["name"].str.replace(",", "")     # remove commas
df["city"].str.contains("Mumbai")   # boolean mask
df["email"].str.extract(r"(\w+)@")  # regex extract

# ─── Merging (like SQL JOINs) ─────────────────────────────
result = pd.merge(df1, df2, on="user_id", how="left")
# how: "inner" (both), "left" (all left), "right", "outer" (all)

# Concatenate DataFrames
combined = pd.concat([df_train, df_val], ignore_index=True)   # stack rows
wide = pd.concat([df1, df2], axis=1)                          # side-by-side columns

# ─── apply(): run any function over rows/columns ──────────
df["income_log"] = df["income"].apply(np.log1p)   # log transform
df["age_bucket"] = df["age"].apply(lambda x: "young" if x < 30 else "senior")
```
EDA & Pre-ML Workflow
Exploratory Data Analysis (EDA) is the process of understanding your dataset BEFORE building any model. This is the complete workflow, from loading a CSV to handing off a clean array to sklearn.
High correlation (near ±1) between two features indicates multicollinearity: consider dropping one. High correlation with the target indicates a useful feature.
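A quick sketch of that check on a toy DataFrame. The column names, the synthetic data, and the 0.95 cutoff are all illustrative, not a universal rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
age = rng.uniform(20, 60, 200)
df_demo = pd.DataFrame({
    "age": age,
    "age_months": age * 12 + rng.normal(0, 1, 200),   # near-duplicate of age
    "income": rng.uniform(30_000, 120_000, 200),      # unrelated feature
})

corr = df_demo.corr()
print(corr.round(2))   # age vs age_months correlates near 1.0 → multicollinear

# Flag columns whose correlation with an earlier column exceeds the cutoff
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c].abs() > 0.95).any()]
print(to_drop)   # → ['age_months']
```

Keeping only the upper triangle of the correlation matrix avoids flagging both members of a correlated pair.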
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# ─── STEP 1: Load and initial inspection ──────────────────
df = pd.read_csv("data.csv")
print(f"Shape: {df.shape}")
print(df.info())           # types + null counts
print(df.describe())       # stats
print(df.isnull().sum())   # missing per column

# ─── STEP 2: Separate target and features ─────────────────
X = df.drop("label", axis=1)   # features DataFrame
y = df["label"]                # target Series
print(f"Class distribution:\n{y.value_counts()}")

# ─── STEP 3: Identify column types ────────────────────────
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
print(f"Numeric: {numeric_cols}")
print(f"Categorical: {cat_cols}")

# ─── STEP 4: Check for problems ───────────────────────────
for col in numeric_cols:
    skewness = X[col].skew()
    if abs(skewness) > 1.0:
        print(f"{col}: skewness={skewness:.2f}, consider a log transform")

# Correlation with the target
correlations = (df[numeric_cols + ["label"]]
                .corr()["label"].abs()
                .sort_values(ascending=False))
print("Feature-target correlations:")
print(correlations)

# ─── STEP 5: Feature engineering ──────────────────────────
X["income_log"] = np.log1p(X["income"])        # log transform a skewed feature
X["income_per_age"] = X["income"] / X["age"]   # interaction feature

# ─── STEP 6: Split (BEFORE any scaling!) ──────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y,   # stratify for imbalanced classes!
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

# ─── STEP 7: Feed into a sklearn Pipeline (from the ML guide)
# The preprocessor fits on X_train only: no leakage!
```
You're ML-ready when you can:
- Write list/dict/set comprehensions without thinking
- Create and manipulate NumPy arrays with broadcasting
- Perform matrix multiplication for a linear layer: X @ W + b
- Load a CSV, run info()/describe()/isnull(), split X/y correctly
- Filter, groupby, and clean a DataFrame in under 5 minutes
- Hand off a clean DataFrame to sklearn without data leakage