Python Data Science & ML

Reproducibility Best Practices

Why Reproducibility Matters

Reproducible research lets others rerun an analysis and get the same results. That makes findings independently verifiable, builds trust in them, and speeds up collaboration.

1. Version Control Everything

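Track code, configuration, and data preprocessing scripts with Git, and write commit messages that say what changed and why. Before running git add ., list large data files and secrets in .gitignore so they are not committed by accident.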

git init
git add .
git commit -m "Initial commit: data cleaning script"
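Version control helps most when each result can be tied to the exact commit that produced it. The following is a minimal sketch of one way to do that from Python, assuming git is on the PATH; the helper names (run_git, current_commit, has_uncommitted_changes) are hypothetical and not part of any library.

```python
import subprocess
from typing import List, Optional


def run_git(args: List[str], repo_dir: str = ".") -> Optional[str]:
    """Run a git command and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, repo_dir does not exist, or it is not a repository
        return None
    return result.stdout.strip()


def current_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the hash of the checked-out commit, or None if unknown."""
    return run_git(["rev-parse", "HEAD"], repo_dir)


def has_uncommitted_changes(repo_dir: str = ".") -> Optional[bool]:
    """Return True if the working tree differs from HEAD, None if unknown."""
    status = run_git(["status", "--porcelain"], repo_dir)
    return None if status is None else bool(status)
```

Storing both values with every experiment makes it visible when a result came from code that was never committed.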

2. Manage Python Environments

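Pin exact package versions using pip or conda, and store the environment definition alongside your code. Run the export from a dedicated virtual environment, since pip freeze lists every package installed in the active environment, not just the ones your project imports.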

# pip
pip freeze > requirements.txt
# conda
conda env export > environment.yml

Recreate the environment:

pip install -r requirements.txt
# or
conda env create -f environment.yml
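A recreated environment can still drift if someone later upgrades a package by hand. As a rough sketch, the check below compares the name==version pins in a requirements file with what is actually installed, using the standard library's importlib.metadata. The function names are hypothetical, and the parser is deliberately simplified: it ignores extras, URL requirements, and any line that is not an exact pin.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Extract {package: version} from lines of the form 'name==version'."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        line = line.split(";", 1)[0]  # drop environment markers
        if "==" not in line:
            continue  # skip blank lines, options, and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return {package: (pinned, installed)} for every pin that does not match."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling find_mismatches(parse_pins(open("requirements.txt").read())) at the start of a run and failing on a non-empty result is one way to catch drift early.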

3. Fix Random Seeds

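Set seeds for every library that generates random numbers. On a GPU, seeds alone are not enough: cuDNN may pick non-deterministic algorithms unless told otherwise.

import random

import numpy as np
import torch

seed = 42
random.seed(seed)                 # Python's built-in RNG
np.random.seed(seed)              # NumPy's global RNG
torch.manual_seed(seed)           # PyTorch RNG
torch.cuda.manual_seed_all(seed)  # all CUDA devices; ignored if CUDA is unavailable

# Ask cuDNN for deterministic algorithms (may slow training down)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False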
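The seeding calls above are easy to forget in one script out of many, so it can help to wrap them in a single helper that every entry point calls first. This is only a sketch: seed_everything is a hypothetical name, and NumPy and PyTorch are treated as optional so the helper also works in environments where they are not installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Seeding twice with the same value replays the same random sequence
seed_everything(123)
first = [random.random() for _ in range(3)]
seed_everything(123)
second = [random.random() for _ in range(3)]
assert first == second
```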

4. Version Your Data

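Git handles large binary files poorly, so version large datasets with a tool such as DVC or Git LFS. With DVC, the data file itself stays out of Git; you commit a small .dvc pointer file that records its hash.

dvc init
dvc add data/raw/train.csv
# dvc add writes train.csv.dvc and a .gitignore next to the data file
git add data/raw/train.csv.dvc data/raw/.gitignore
git commit -m "Add raw training data"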
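Whatever tool tracks the data, a training script can also confirm at load time that it is reading the file it expects. The sketch below is independent of DVC and Git LFS and is not how either tool works internally; it simply compares a SHA-256 digest with one you recorded earlier. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256: str) -> None:
    """Raise ValueError if the file's hash differs from the recorded one."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(f"{path}: expected {expected_sha256}, got {actual}")
```

Failing loudly on a mismatch is a deliberate choice here: silently training on a changed file is exactly the situation data versioning is meant to prevent.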

5. Document Experiments

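Use MLflow or simple Markdown logs to capture each run's parameters, metrics, and artifacts, so that any reported result can be traced back to the settings that produced it.

import mlflow

# Everything logged inside the context manager is attached to a single run
with mlflow.start_run():
    mlflow.log_params({"lr": 0.001, "batch_size": 32})  # hyperparameters
    mlflow.log_metric("accuracy", 0.93)                  # evaluation result
    mlflow.log_artifact("model.pkl")                     # file must already exist on disk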
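For projects where a tracking server is more than you need, the simple Markdown log mentioned above could look something like the following. This is a sketch using only the standard library; log_run and the entry layout are assumptions made for illustration, not an established format.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path, params: dict, metrics: dict) -> str:
    """Append one experiment record to a Markdown file and return the entry."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [f"## Run {timestamp}", "", "Parameters:"]
    lines += [f"- {key}: {value}" for key, value in sorted(params.items())]
    lines += ["", "Metrics:"]
    lines += [f"- {key}: {value}" for key, value in sorted(metrics.items())]
    entry = "\n".join(lines) + "\n\n"

    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:  # append, never overwrite
        handle.write(entry)
    return entry
```

A call such as log_run("experiments/log.md", {"lr": 0.001, "batch_size": 32}, {"accuracy": 0.93}) appends a new section each time, and the log file itself can be committed alongside the code (the path here is hypothetical).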

6. Continuous Integration & Deployment

Automate tests and reproducibility checks with GitHub Actions.

name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run tests
        run: |
          pytest tests/
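The reproducibility checks that the workflow runs can be ordinary pytest tests. As a sketch, a file such as tests/test_reproducibility.py (a hypothetical name) might assert that two runs with the same seed agree. The train function below is a toy stand-in for a real training entry point, not part of any library.

```python
import random


def train(seed: int, n_steps: int = 100) -> float:
    """Toy stand-in for a training run: returns a seed-dependent 'metric'."""
    rng = random.Random(seed)  # local generator, independent of global state
    return sum(rng.random() for _ in range(n_steps)) / n_steps


def test_same_seed_gives_same_result():
    # Two runs with identical seeds must agree exactly
    assert train(seed=42) == train(seed=42)


def test_different_seeds_give_different_results():
    # Guards against a seed argument that is silently ignored
    assert train(seed=1) != train(seed=2)
```

With a real model, exact equality may be too strict across hardware, so a tolerance-based comparison can be a more practical check.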

7. Containerize Your Workflows

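Encapsulate the environment with Docker, which fixes the operating system and system libraries as well as the Python packages. Copying requirements.txt before the rest of the code lets Docker reuse the cached dependency layer when only source files change.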

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "train.py"]