Understanding Logistic Regression
Logistic regression is a statistical model that estimates the probability of a binary outcome (yes/no, true/false) using a logistic function. Despite its name, it is a classification algorithm, not a regression algorithm, because it predicts categorical outcomes.
The Sigmoid Function
At the heart of logistic regression is the sigmoid function, also known as the logistic function. It maps any real-valued number into a value between 0 and 1. This output can be interpreted as a probability.
The formula is: σ(z) = 1 / (1 + e^(-z))
Where 'z' is the linear combination of input features and their corresponding weights.
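To make this concrete, here is a minimal NumPy sketch of the sigmoid function (the function name and sample inputs are our own, for illustration):
import numpy as np

def sigmoid(z):
    # Map any real-valued input to the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Large negative z gives a probability near 0; large positive z gives one near 1
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ≈ [0.0067 0.5 0.9933]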
How it Works
- Linear Combination: Calculate a weighted sum of the input features (z = w0 + w1x1 + ... + wnxn).
- Sigmoid Transformation: Apply the sigmoid function to 'z' to get a probability between 0 and 1.
- Classification: A threshold (commonly 0.5) is used to classify the outcome. If the probability is greater than the threshold, it's classified as one class; otherwise, it's classified as the other.
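Putting the three steps together, here is a small sketch for a single two-feature sample (the weight and feature values are purely illustrative):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative parameters: bias w0 plus one weight per feature
w0, w1, w2 = -1.0, 0.8, 0.5
x1, x2 = 1.2, 0.4

z = w0 + w1 * x1 + w2 * x2   # Step 1: linear combination
p = sigmoid(z)               # Step 2: sigmoid transformation
label = 1 if p > 0.5 else 0  # Step 3: threshold at 0.5

print(f"z = {z:.2f}, probability = {p:.3f}, predicted class = {label}")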
Implementation with Scikit-learn
Scikit-learn provides a straightforward implementation of logistic regression through the LogisticRegression class in the sklearn.linear_model module.
Example: Binary Classification
Let's see how to train a logistic regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Sample data (replace with your actual data)
# Feature 1: Age, Feature 2: Income
X = np.array([[25, 50000], [30, 60000], [35, 70000], [22, 45000],
[45, 90000], [50, 100000], [28, 55000], [40, 80000]])
# Target: 0 (No loan default), 1 (Loan default)
y = np.array([0, 0, 0, 0, 1, 1, 0, 1])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Logistic Regression model
# (max_iter raised because the features are unscaled, which can slow convergence)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Get predicted probabilities
y_prob = model.predict_proba(X_test)
print("Predicted Probabilities (Class 0, Class 1):")
print(y_prob)
Key Parameters in LogisticRegression
- penalty: Type of regularization term ('l1', 'l2', 'elasticnet', 'none'). Default is 'l2'.
- C: Inverse of regularization strength; smaller values specify stronger regularization.
- solver: Algorithm to use in the optimization problem ('liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga').
- max_iter: Maximum number of iterations for the solver to converge.
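For example, the snippet below shows one way these parameters combine; the specific values are illustrative, not recommendations (note that 'l1' requires a compatible solver such as 'liblinear' or 'saga'):
from sklearn.linear_model import LogisticRegression

# L1-regularized model with moderately strong regularization (C=0.5)
model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear', max_iter=1000)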
Advantages and Disadvantages
Advantages:
- Simple to implement and interpret.
- Computationally efficient and works well on linearly separable data.
- Outputs probabilities, which can be useful for ranking or for setting custom confidence thresholds, as in the sketch below.
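For instance, here is a minimal sketch of applying a stricter custom threshold, assuming the model and X_test from the example above (the 0.7 cutoff is arbitrary, chosen only for illustration):
import numpy as np

probs = model.predict_proba(X_test)[:, 1]   # probability of class 1 (default)
y_pred_strict = (probs >= 0.7).astype(int)  # flag a default only when fairly confident
print(y_pred_strict)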
Disadvantages:
- Assumes linearity between features and the log-odds of the outcome.
- May not perform well if the decision boundary is highly non-linear (a common workaround is sketched after this list).
- Sensitive to outliers.
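One common workaround for a non-linear boundary is to add engineered features before the model, here via scikit-learn's PolynomialFeatures in a pipeline (the degree and max_iter values are illustrative, and this is one approach among several):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Quadratic feature expansion lets the linear model fit a curved decision boundary
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
# pipe.fit(X_train, y_train) would then train on the expanded, scaled features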
Further Learning
Explore multi-class logistic regression, regularization techniques, and hyperparameter tuning to optimize your models.
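As a starting point, here is a minimal multi-class sketch using scikit-learn's built-in Iris dataset; recent versions of LogisticRegression handle the multinomial case automatically:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three flower species, four features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")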