Understanding Logistic Regression
Logistic regression is a statistical model that estimates the probability of a binary outcome (yes/no, true/false) using a logistic function. Despite its name, it is a classification algorithm, not a regression algorithm, because it predicts categorical outcomes.
The Sigmoid Function
At the heart of logistic regression is the sigmoid function, also known as the logistic function. It maps any real-valued number into a value between 0 and 1. This output can be interpreted as a probability.
The formula is: σ(z) = 1 / (1 + e^(-z))
Where 'z' is the linear combination of input features and their corresponding weights.
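To make this concrete, here is a minimal NumPy sketch of the sigmoid function (the function name and sample inputs are our own, for illustration):
import numpy as np

def sigmoid(z):
    # Map any real-valued input to the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Large negative z gives a probability near 0; large positive z gives one near 1
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ≈ [0.0067 0.5 0.9933]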
How it Works
- Linear Combination: Calculate a weighted sum of the input features (z = w0 + w1x1 + ... + wnxn).
- Sigmoid Transformation: Apply the sigmoid function to 'z' to get a probability between 0 and 1.
- Classification: A threshold (commonly 0.5) is used to classify the outcome. If the probability is greater than the threshold, it's classified as one class; otherwise, it's classified as the other.
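Putting the three steps together, here is a small sketch for a single two-feature sample (the weight and feature values are purely illustrative):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative parameters: bias w0 plus one weight per feature
w0, w1, w2 = -1.0, 0.8, 0.5
x1, x2 = 1.2, 0.4

z = w0 + w1 * x1 + w2 * x2   # Step 1: linear combination
p = sigmoid(z)               # Step 2: sigmoid transformation
label = 1 if p > 0.5 else 0  # Step 3: threshold at 0.5

print(f"z = {z:.2f}, probability = {p:.3f}, predicted class = {label}")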
Implementation with Scikit-learn
Scikit-learn provides a straightforward implementation of logistic regression through the LogisticRegression class in the sklearn.linear_model module.
Example: Binary Classification
Let's see how to train a logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Sample data (replace with your actual data)
# Feature 1: Age, Feature 2: Income
X = np.array([[25, 50000], [30, 60000], [35, 70000], [22, 45000],
[45, 90000], [50, 100000], [28, 55000], [40, 80000]])
# Target: 0 (No loan default), 1 (Loan default)
y = np.array([0, 0, 0, 0, 1, 1, 0, 1])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Logistic Regression model
# (max_iter raised because the features are unscaled, which can slow convergence)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Get predicted probabilities
y_prob = model.predict_proba(X_test)
print("Predicted Probabilities (Class 0, Class 1):")
print(y_prob)
Key Parameters in LogisticRegression
- penalty: Type of regularization term ('l1', 'l2', 'elasticnet', 'none'). Default is 'l2'.
- C: Inverse of regularization strength; smaller values specify stronger regularization.
- solver: Algorithm to use in the optimization problem ('liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga').
- max_iter: Maximum number of iterations for the solver to converge.
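For example, the snippet below shows one way these parameters combine; the specific values are illustrative, not recommendations (note that 'l1' requires a compatible solver such as 'liblinear' or 'saga'):
from sklearn.linear_model import LogisticRegression

# L1-regularized model with moderately strong regularization (C=0.5)
model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear', max_iter=1000)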
Advantages and Disadvantages
Advantages:
- Simple to implement and interpret.
- Computationally efficient and works well on linearly separable data.
- Outputs probabilities, which can be useful for ranking or for setting custom confidence thresholds, as in the sketch below.
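For instance, here is a minimal sketch of applying a stricter custom threshold, assuming the model and X_test from the example above (the 0.7 cutoff is arbitrary, chosen only for illustration):
import numpy as np

probs = model.predict_proba(X_test)[:, 1]   # probability of class 1 (default)
y_pred_strict = (probs >= 0.7).astype(int)  # flag a default only when fairly confident
print(y_pred_strict)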
Disadvantages:
- Assumes linearity between features and the log-odds of the outcome.
- May not perform well if the decision boundary is highly non-linear (a common workaround is sketched after this list).
- Sensitive to outliers.
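One common workaround for a non-linear boundary is to add engineered features before the model, here via scikit-learn's PolynomialFeatures in a pipeline (the degree and max_iter values are illustrative, and this is one approach among several):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Quadratic feature expansion lets the linear model fit a curved decision boundary
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
# pipe.fit(X_train, y_train) would then train on the expanded, scaled features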
Further Learning
Explore multi-class logistic regression, regularization techniques, and hyperparameter tuning to optimize your models.
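As a starting point, here is a minimal multi-class sketch using scikit-learn's built-in Iris dataset; recent versions of LogisticRegression handle the multinomial case automatically:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three flower species, four features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")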