Logistic Regression
Logistic Regression is a fundamental and widely used algorithm for binary classification tasks. Despite its name, it is a classification method, not a regression method: it is a statistical model that estimates the probability of a binary outcome (e.g., yes/no, true/false, 0/1) based on one or more predictor variables.
Core Concepts
The Sigmoid Function
The key to Logistic Regression is the sigmoid function (also known as the logistic function). It maps any real-valued number to a value between 0 and 1, which can be interpreted as a probability:

σ(z) = 1 / (1 + e^(-z))

where z is a linear combination of the input features and their corresponding weights, plus a bias term: z = w1x1 + w2x2 + ... + wnxn + b.
Decision Boundary
The algorithm predicts the class based on whether the estimated probability is above or below a certain threshold, typically 0.5. The equation z = 0 defines the decision boundary, which separates the two classes.
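The sigmoid transformation and the 0.5 threshold can be sketched in a few lines of NumPy. The weights, bias, and input values below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def sigmoid(z):
    # Map any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.2, 0.4])  # a single input example

z = np.dot(w, x) + b   # linear combination: w1*x1 + w2*x2 + b
p = sigmoid(z)         # probability of class 1
label = int(p >= 0.5)  # decision at the 0.5 threshold
```

Note that sigmoid(0) = 0.5 exactly, which is why z = 0 defines the decision boundary: points with z > 0 get p > 0.5 and are assigned class 1.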
How it Works
- Input Data: The algorithm takes a set of input features (x) and their corresponding binary labels (y).
- Linear Combination: It calculates a weighted sum of the input features, z.
- Sigmoid Transformation: The result z is passed through the sigmoid function to produce a probability p (0 ≤ p ≤ 1).
- Prediction: If p >= 0.5, the instance is classified as class 1. Otherwise, it's classified as class 0.
- Training: During training, the algorithm adjusts the weights (w) and bias (b) to minimize a cost function, such as the log loss (binary cross-entropy), which penalizes incorrect predictions. This is typically done using an optimization algorithm like Gradient Descent.
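The training step described above can be sketched as plain batch gradient descent on the log loss. This is a minimal illustration, not a production implementation: the synthetic dataset, learning rate, and iteration count are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic dataset: 200 examples, 2 features,
# labels determined by a known linear rule (so it is learnable)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + 0.5 > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Batch gradient descent on the log loss
w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean log loss w.r.t. w
    grad_b = np.mean(p - y)          # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = np.mean(preds == y)
```

The gradient of the log loss has the simple form (p - y) * x per example, which is why the update above needs only a matrix product. On this separable toy data the learned weights recover the sign pattern of the true rule.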
The Cost Function (Log Loss)
The log loss is crucial for training Logistic Regression models. For a single training example:

L = -[y * log(p) + (1 - y) * log(1 - p)]

where y is the true label (0 or 1) and p is the predicted probability of class 1.
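The per-example log loss can be evaluated directly to see how it penalizes predictions. A minimal NumPy sketch (the clipping constant is a common numerical safeguard, not part of the formula itself):

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    # Clip p away from 0 and 1 to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident and correct: true label 1, predicted p = 0.99 -> tiny loss
low = log_loss(1, 0.99)

# Confident and wrong: true label 1, predicted p = 0.01 -> large loss
high = log_loss(1, 0.01)
```

The asymmetry is the point: the loss grows without bound as a confident prediction moves toward the wrong class, which is what pushes gradient descent to correct it.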
Applications
Logistic Regression is effective for a variety of problems, including:
- Spam detection (spam vs. not spam)
- Medical diagnosis (disease present vs. absent)
- Fraud detection (fraudulent vs. legitimate transaction)
- Customer churn prediction (churn vs. not churn)
Key Strengths
- Simple and computationally efficient.
- Interpretable results, providing probabilities.
- Good baseline model for binary classification.
Limitations
- Assumes a linear relationship between features and the log-odds of the outcome.
- May not perform well on complex, non-linear decision boundaries.
- Sensitive to outliers.
Implementation Details
Example using Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Generate sample data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Probability of class 1
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Predicted probabilities for first 5 test samples: {y_prob[:5]}")
print(f"Model coefficients: {model.coef_}")
print(f"Model intercept: {model.intercept_}")
Example using R
# glm() is part of base R (the stats package), so no extra
# packages are needed for logistic regression.
# Generate sample data (similar to Python example)
set.seed(42)
n_samples <- 100
n_features <- 5
X <- matrix(rnorm(n_samples * n_features), ncol = n_features)
# Create a binary response variable based on a linear combination
true_weights <- rnorm(n_features)
true_intercept <- rnorm(1)
logits <- X %*% true_weights + true_intercept
probabilities <- 1 / (1 + exp(-logits))
y <- rbinom(n_samples, 1, probabilities)
# Combine features and response into a data frame
data <- as.data.frame(X)
data$y <- y
# Split data into training and testing sets (simple split)
train_indices <- sample(1:n_samples, size = floor(0.7 * n_samples))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Fit the logistic regression model
# The formula y ~ . means predict y using all other columns as predictors
model <- glm(y ~ ., data = train_data, family = binomial(link = "logit"))
# Summarize the model
summary(model)
# Make predictions on the test set
probabilities_test <- predict(model, newdata = test_data, type = "response")
# Predict class labels (threshold at 0.5, consistent with the Python example)
y_pred <- ifelse(probabilities_test >= 0.5, 1, 0)
# Evaluate the model
accuracy <- sum(y_pred == test_data$y) / nrow(test_data)
print(paste("Accuracy:", round(accuracy, 2)))
print(paste("Predicted probabilities for first 5 test samples:",
            paste(round(head(probabilities_test, 5), 3), collapse = ", ")))