Data science & classical ML

This page covers NumPy for numerical arrays, Pandas for labeled tables, Matplotlib and Seaborn for plots, and scikit-learn for preprocessing and traditional ML models. This stack is the default for analytics engineering and feature work before deep learning.

NumPy — array computing

ndarray, dtypes, broadcasting

Foundation for Pandas and PyTorch tensors

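NumPy arrays are homogeneous (one dtype per array) and typically stored in a contiguous block of memory, which makes them fast. Vectorization replaces Python loops with whole-array operations; broadcasting applies binary operations across compatible shapes. For 2-D arrays, axis 0 runs down the rows, so reducing over axis=0 yields one value per column.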

python

import numpy as np

x = np.array([[1., 2.], [3., 4.]])
# Column means: axis=0 aggregates over rows
col_mean = x.mean(axis=0)

# Broadcasting: (2, 2) + (2,) adds the vector to every row
y = x + np.array([10, 20])

# Masking: keep values where the condition holds (result is 1-D)
mask = x > 2
z = x[mask]
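
To make the dtype and broadcasting rules above concrete, here is a minimal sketch on the same 2×2 array. The variable names (`mixed`, `row_add`, `col_add`, `centered`, `row_centered`) are illustrative only and do not come from the original example.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])

# Homogeneous dtype: mixing an int and a float upcasts the whole array.
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64

# (2, 2) + (2,): the vector aligns with the last axis, so it is added to every row.
row_add = x + np.array([10.0, 20.0])

# (2, 2) + (2, 1): a column vector is stretched across the columns instead.
col_add = x + np.array([[10.0], [20.0]])

# Center each column: mean(axis=0) has shape (2,) and broadcasts over the rows.
centered = x - x.mean(axis=0)

# Center each row: keepdims=True keeps shape (2, 1) so it broadcasts over the columns.
row_centered = x - x.mean(axis=1, keepdims=True)

print(row_add)
print(col_add)
```

The `keepdims=True` form is worth remembering: without it, `x.mean(axis=1)` has shape `(2,)` and would be broadcast along the wrong axis for a square array without raising an error.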

Tip: Prefer NumPy for numeric kernels and Pandas for labeled tables. PyTorch tensors are conceptually similar; see the PyTorch page.

Pandas — tables & time series

DataFrame, groupby, joins

Analytics engineering bread and butter

A DataFrame is a columnar table: each column is a Series, and all columns share one index. groupby follows the split-apply-combine pattern, and merge performs SQL-style joins. Before aggregating, know your grain: what does one row represent?

python

import pandas as pd

df = pd.read_parquet("events.parquet")
daily = (
    df.groupby(["user_id", df["ts"].dt.date])
      .agg(events=("event_id", "count"))
      .reset_index()
)
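# Join to dim table (users, loaded elsewhere, one row per user_id).
# validate raises MergeError if user_id is duplicated in users.
out = daily.merge(users, on="user_id", how="left", validate="many_to_one")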
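
The duplicate-key risk in that join can be shown on a few rows. This is a sketch on hypothetical toy data (the frames `daily`, `users_dup`, and the `plan` column are invented for illustration and are not the contents of `events.parquet`); it assumes the intended grain of the dimension table is one row per `user_id`.

```python
import pandas as pd

# Hypothetical toy data: daily event counts and a dimension table with a duplicated key.
daily = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": ["2024-01-01", "2024-01-02", "2024-01-01"],
    "events": [3, 1, 5],
})
users_dup = pd.DataFrame({"user_id": [1, 1, 2], "plan": ["free", "pro", "free"]})

# Grain check: does user_id identify exactly one row in the dimension table?
print(users_dup["user_id"].is_unique)  # False

# A left join on a duplicated key silently multiplies rows on the left side.
fanned = daily.merge(users_dup, on="user_id", how="left")
print(len(daily), len(fanned))  # 3 5

# validate="many_to_one" turns the silent fan-out into an error.
try:
    daily.merge(users_dup, on="user_id", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)  # True

# One possible fix (an assumption about what is wanted): keep the last row per user.
users = users_dup.drop_duplicates(subset="user_id", keep="last")
out = daily.merge(users, on="user_id", how="left", validate="many_to_one")
print(len(out))  # 3
```

Which duplicate to keep is a business decision; `keep="last"` here is only a placeholder for whatever rule defines the current row per user.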

Warning: Chained indexing (df[a][b]) may operate on a copy rather than a view, so assignments made through it can be silently lost. Use .loc[row, col] for assignment. For large data, consider Polars (see the FastAPI & engineering page).
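
As a small illustration of the assignment advice, the sketch below uses a hypothetical three-row frame (the `score` and `flag` columns are made up for the example) and sets a flag with a single `.loc` call.

```python
import pandas as pd

# Hypothetical toy frame for illustration.
df = pd.DataFrame({"score": [0.2, 0.9, 0.5], "flag": [False, False, False]})

# Chained form to avoid:
#   df[df["score"] > 0.4]["flag"] = True
# The first [] can return a temporary copy, so the assignment may never reach df.

# Single .loc call: row selector and column label in one indexing operation.
df.loc[df["score"] > 0.4, "flag"] = True
print(df["flag"].tolist())  # [False, True, True]
```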

Visualization

Matplotlib & Seaborn

EDA before modeling

Use plots to catch distribution shift, outliers, and label noise. Seaborn builds on Matplotlib with nicer defaults for statistical charts.

python

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.histplot(df["score"], kde=True, ax=ax)
ax.set_title("Score distribution")
plt.tight_layout()
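
One way to look for distribution shift and outliers is to plot two samples of the same feature side by side. The sketch below uses synthetic data (the `train` and `recent` arrays, the shift of 0.5, and the planted outliers are all assumptions made for the example) and plain Matplotlib; the `Agg` backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=500)   # synthetic stand-in for a training feature
recent = rng.normal(0.5, 1.0, size=500)  # same feature with a shifted mean
recent[:3] = [8.0, 9.0, 10.0]            # a few planted outliers

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid histograms on shared bins make a shift in location visible.
bins = np.linspace(-4, 10, 40)
ax_hist.hist(train, bins=bins, alpha=0.5, label="train")
ax_hist.hist(recent, bins=bins, alpha=0.5, label="recent")
ax_hist.set_title("Feature distribution")
ax_hist.legend()

# Box plots draw points beyond the whiskers individually, which exposes outliers.
ax_box.boxplot([train, recent])
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["train", "recent"])
ax_box.set_title("Outliers")

fig.tight_layout()
```

Sharing the bin edges between the two histograms matters: with independently chosen bins, a difference in bar heights can come from the binning rather than from the data.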

scikit-learn — pipelines & models

Estimator API & Pipeline

Fit on train only; preprocess inside the pipeline

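Every estimator implements fit; predictors add predict, and transformers add transform. A Pipeline chains these steps so that, during cross-validation, preprocessing is fitted on each training fold only and no information leaks from the validation fold. Compare models on held-out data with metrics suited to the task (ROC-AUC or log loss for classification, RMSE for regression).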

python

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X, y: your feature matrix and labels (defined elsewhere).
# preprocessor must be built before the pipeline, e.g.
# preprocessor = ColumnTransformer([...])  # scale numeric columns, one-hot encode categoricals
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("prep", preprocessor),  # refit on each training fold during CV
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipe.fit(X_train, y_train)
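
The snippet above leaves `X`, `y`, and `preprocessor` to the reader. A runnable end-to-end sketch might look like the following; the synthetic data, the column names (`age`, `income`, `plan`), and the labeling rule are all hypothetical and exist only so the pipeline has something to fit.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic table with two numeric columns and one categorical column (all hypothetical).
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "plan": rng.choice(["free", "pro", "team"], size=n),
})
# Made-up labeling rule so the model has a signal to learn.
y = ((X["income"] > 50_000) & (X["plan"] != "free")).astype(int)

# Scale numeric columns, one-hot encode the categorical one.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Each CV fold refits the scaler and encoder on that fold's training part only.
cv_auc = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print(cv_auc.mean())

# Final fit on the full training split, then one look at the held-out test split.
pipe.fit(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print(test_acc)
```

Passing the whole pipeline to `cross_val_score`, rather than pre-transforming `X_train` first, is what keeps the validation folds out of the scaler's statistics.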

Note: On tabular data of modest size, gradient boosting (XGBoost, LightGBM, CatBoost) often beats random forests. Still use pipelines and careful validation.

Related pages
← Hub & Python core · → PyTorch & AI (deep learning) · FastAPI & engineering (serve models)