NumPy — array computing
ndarray, dtypes, broadcasting
Foundation for Pandas and PyTorch tensors
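NumPy arrays are homogeneous (one dtype per array) and typically stored in a contiguous block of memory, which makes element-wise operations fast. Vectorization replaces Python loops with compiled array operations; broadcasting applies binary ops across compatible shapes (dimensions are compared from the trailing end and must be equal or 1). For 2-D arrays, axis 0 runs down the rows, so reducing over axis=0 returns one value per column.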
import numpy as np

x = np.array([[1., 2.], [3., 4.]])

# Column means: axis=0 aggregates over rows
col_mean = x.mean(axis=0)

# Broadcasting: (2, 2) + (2,) adds row-wise
y = x + np.array([10, 20])

# Masking: keep values where the condition holds
mask = x > 2
z = x[mask]
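To make the axis and broadcasting rules concrete, here is a small sketch that standardizes each column of a 2-D array without a Python loop. The array values and the names `col_std`, `standardized`, and `row_centered` are illustrative and do not come from the page itself.

```python
import numpy as np

x = np.array([[1., 2.], [3., 4.], [5., 9.]])  # shape (3, 2)

# Reducing over axis=0 collapses the rows: one value per column, shape (2,)
col_mean = x.mean(axis=0)
col_std = x.std(axis=0)

# (3, 2) - (2,) broadcasts the per-column statistics across every row
standardized = (x - col_mean) / col_std

# Per-row statistics need keepdims=True so the (3, 1) result lines up with (3, 2)
row_centered = x - x.mean(axis=1, keepdims=True)

print(standardized.mean(axis=0))  # close to 0 for each column
print(row_centered.sum(axis=1))   # close to 0 for each row
```

Without `keepdims=True`, the row means would have shape (3,), which does not broadcast against (3, 2) and raises a ValueError.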
Pandas — tables & time series
DataFrame, groupby, joins
Analytics engineering bread and butter
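A DataFrame is columnar: each column is a Series, and all columns share one index. groupby follows the split-apply-combine pattern; merge performs SQL-style joins. Always know your grain (one row = one what?) before aggregating or joining, because a join on non-unique keys multiplies rows.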
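import pandas as pd

df = pd.read_parquet("events.parquet")

# Grain after aggregation: one row per (user_id, calendar day)
daily = (
    df.groupby(["user_id", df["ts"].dt.date])
    .agg(events=("event_id", "count"))
    .reset_index()
)

# Join to dim table; mind duplicate keys.
# users is assumed to be a DataFrame with one row per user_id, loaded elsewhere.
out = daily.merge(users, on="user_id", how="left")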
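The grain and duplicate-key advice can be checked in code. The sketch below uses tiny in-memory frames in place of the parquet file; the data, the `plan` column, and the `day` column name are made up for illustration. It relies on the `validate` argument of `merge`, which raises an error when the right-hand table has repeated keys.

```python
import pandas as pd

# Hypothetical stand-ins for the events table and the users dimension table
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "event_id": [10, 11, 12, 13, 14],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 17:30",
        "2024-01-01 08:15", "2024-01-02 10:00", "2024-01-02 11:45",
    ]),
})
users = pd.DataFrame({"user_id": [1, 2], "plan": ["free", "pro"]})

# Grain after aggregation: one row per (user_id, day)
daily = (
    df.groupby(["user_id", df["ts"].dt.date.rename("day")])
    .agg(events=("event_id", "count"))
    .reset_index()
)
assert not daily.duplicated(["user_id", "day"]).any()

# validate="many_to_one" raises MergeError if users repeats a user_id
out = daily.merge(users, on="user_id", how="left", validate="many_to_one")
assert len(out) == len(daily)  # a left join onto unique keys keeps the row count
print(out)
```

If the dimension table did contain a repeated `user_id`, the merge would fail loudly instead of silently inflating the event counts downstream.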
Chained indexing (df[a][b]) can return a copy rather than a view, so an assignment through it may never reach the original DataFrame; use .loc[row, col] for assignment. For large data consider Polars (see FastAPI & engineering page).
Visualization
Matplotlib & Seaborn
EDA before modeling
Use plots to catch distribution shift, outliers, and label noise. Seaborn builds on Matplotlib with nicer defaults for statistical charts.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.histplot(df["score"], kde=True, ax=ax)
ax.set_title("Score distribution")
plt.tight_layout()
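One way to look for distribution shift is to overlay the same feature from two samples on shared bins. The sketch below uses plain Matplotlib and synthetic scores; the "recent" sample is deliberately shifted, and the sample names and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: the "recent" sample is shifted upward on purpose
train_scores = rng.normal(loc=0.0, scale=1.0, size=1000)
recent_scores = rng.normal(loc=0.8, scale=1.0, size=1000)

# Shared bin edges make the two histograms directly comparable
bins = np.histogram_bin_edges(
    np.concatenate([train_scores, recent_scores]), bins=30
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(train_scores, bins=bins, alpha=0.5, density=True, label="train")
ax.hist(recent_scores, bins=bins, alpha=0.5, density=True, label="recent")
ax.set_xlabel("score")
ax.set_ylabel("density")
ax.set_title("Score distribution: train vs. recent")
ax.legend()
fig.tight_layout()
fig.savefig("score_shift.png")  # hypothetical output path
```

Plotting with `density=True` keeps the comparison fair when the two samples differ in size.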
scikit-learn — pipelines & models
Estimator API & Pipeline
Fit on train only; preprocess inside the pipeline
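Every estimator implements fit; predictors add predict and transformers add transform. A Pipeline chains these steps so that, during cross-validation, preprocessing is fit only on each training fold and no information leaks from the validation fold. Compare models on held-out data with metrics suited to the task (ROC-AUC or log loss for classification, RMSE for regression).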
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X, y from your feature matrix and labels
# preprocessor = ColumnTransformer([...])  # numeric scale + categorical one-hot

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("prep", preprocessor),  # fit on train only inside CV
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipe.fit(X_train, y_train)
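The snippet above leaves X, y, and the preprocessor unspecified. As a minimal end-to-end sketch, the version below fills them in with synthetic data so it can run as-is; the column names (`age`, `income`, `plan`), the label rule, and the choice of 5 folds are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic feature matrix; column names are hypothetical
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "plan": rng.choice(["free", "pro", "team"], n),
})
# Made-up label rule so the model has something to learn
y = ((X["age"] > 40) ^ (X["plan"] == "pro")).astype(int)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each CV fold refits the scaler and encoder on that fold's training rows only
cv_auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")

# Final fit on the full training split, then one look at the held-out test set
pipe.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
print(f"Held-out ROC-AUC: {test_auc:.3f}")
```

Because the scaler and encoder live inside the pipeline, `cross_val_score` never sees validation rows during preprocessing. Scaling X before the split would leak those statistics into training.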