Logistic Regression
Logistic Regression is a fundamental and widely used algorithm for binary classification tasks. Despite its name, it is a classification method, not a regression method: it is a statistical model that estimates the probability of a binary outcome (e.g., yes/no, true/false, 0/1) based on one or more predictor variables.
Core Concepts
The Sigmoid Function
The key to Logistic Regression is the sigmoid function (also known as the logistic function). It maps any real-valued number to a value between 0 and 1, which can be interpreted as a probability:

σ(z) = 1 / (1 + e^(-z))

where z is a linear combination of the input features and their corresponding weights, plus a bias term: z = w1x1 + w2x2 + ... + wnxn + b.
Decision Boundary
The algorithm predicts the class based on whether the estimated probability is above or below a certain threshold, typically 0.5. The equation z = 0 defines the decision boundary, which separates the two classes.
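The sigmoid transformation and the 0.5 threshold can be sketched in a few lines of NumPy. The weights, bias, and input values below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def sigmoid(z):
    # Map any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 0.4])  # a single input example

z = np.dot(w, x) + b   # linear combination: w1*x1 + w2*x2 + b
p = sigmoid(z)         # probability of class 1
label = int(p >= 0.5)  # decision at the 0.5 threshold
```

Note that sigmoid(0) = 0.5 exactly, which is why z = 0 defines the decision boundary: points with z > 0 get p > 0.5 and are assigned class 1.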
How it Works
- Input Data: The algorithm takes a set of input features (x) and their corresponding binary labels (y).
- Linear Combination: It calculates a weighted sum of the input features, z.
- Sigmoid Transformation: The result z is passed through the sigmoid function to produce a probability p (0 ≤ p ≤ 1).
- Prediction: If p >= 0.5, the instance is classified as class 1. Otherwise, it's classified as class 0.
- Training: During training, the algorithm adjusts the weights (w) and bias (b) to minimize a cost function, such as the log loss (binary cross-entropy), which penalizes incorrect predictions. This is typically done using an optimization algorithm like Gradient Descent.
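The training step described above can be sketched as plain batch gradient descent on the log loss. This is a minimal illustration, not a production implementation: the synthetic dataset, learning rate, and iteration count are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic dataset: 200 examples, 2 features,
# labels determined by a known linear rule (so it is learnable)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + 0.5 > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on the log loss
w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean log loss w.r.t. w
    grad_b = np.mean(p - y)          # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = np.mean(preds == y)
```

The gradient of the log loss has the simple form (p - y) * x per example, which is why the update above needs only a matrix product. On this separable toy data the learned weights recover the sign pattern of the true rule.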
The Cost Function (Log Loss)
The log loss is crucial for training Logistic Regression models. For a single training example:

L = -[y * log(p) + (1 - y) * log(1 - p)]

where y is the true label (0 or 1) and p is the predicted probability of class 1.
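The per-example log loss can be evaluated directly to see how it penalizes predictions. A minimal NumPy sketch (the clipping constant is a common numerical safeguard, not part of the formula itself):

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    # Clip p away from 0 and 1 to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident and correct: true label 1, predicted p = 0.99 -> tiny loss
low = log_loss(1, 0.99)

# Confident and wrong: true label 1, predicted p = 0.01 -> large loss
high = log_loss(1, 0.01)
```

The asymmetry is the point: the loss grows without bound as a confident prediction moves toward the wrong class, which is what pushes gradient descent to correct it.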
Applications
Logistic Regression is effective for a variety of problems, including:
- Spam detection (spam vs. not spam)
- Medical diagnosis (disease present vs. absent)
- Fraud detection (fraudulent vs. legitimate transaction)
- Customer churn prediction (churn vs. not churn)
Key Strengths
- Simple and computationally efficient.
- Interpretable results, providing probabilities.
- Good baseline model for binary classification.
Limitations
- Assumes a linear relationship between features and the log-odds of the outcome.
- May not perform well on complex, non-linear decision boundaries.
- Sensitive to outliers.
Implementation Details
Example using Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Generate sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Probability of class 1
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Predicted probabilities for first 5 test samples: {y_prob[:5]}")
print(f"Model coefficients: {model.coef_}")
print(f"Model intercept: {model.intercept_}")
Example using R
# glm() is part of base R (the stats package), so no extra
# packages are needed for logistic regression.
# Generate sample data (similar to Python example)
set.seed(42)
n_samples <- 100
n_features <- 5
X <- matrix(rnorm(n_samples * n_features), ncol = n_features)
# Create a binary response variable based on a linear combination
true_weights <- rnorm(n_features)
true_intercept <- rnorm(1)
logits <- X %*% true_weights + true_intercept
probabilities <- 1 / (1 + exp(-logits))
y <- rbinom(n_samples, 1, probabilities)
# Combine features and response into a data frame
data <- as.data.frame(X)
data$y <- y
# Split data into training and testing sets (simple split)
train_indices <- sample(1:n_samples, size = floor(0.7 * n_samples))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Fit the logistic regression model
# The formula y ~ . means predict y using all other columns as predictors
model <- glm(y ~ ., data = train_data, family = binomial(link = "logit"))
# Summarize the model
summary(model)
# Make predictions on the test set
probabilities_test <- predict(model, newdata = test_data, type = "response")
# Predict class labels (threshold at 0.5, consistent with the Python example)
y_pred <- ifelse(probabilities_test >= 0.5, 1, 0)
# Evaluate the model
accuracy <- sum(y_pred == test_data$y) / nrow(test_data)
print(paste("Accuracy:", round(accuracy, 2)))
print(paste("Predicted probabilities for first 5 test samples:",
            paste(round(head(probabilities_test, 5), 3), collapse = ", ")))