What is Regression?
Regression is a supervised machine learning technique used to predict a continuous numerical value. Unlike classification, which predicts discrete categories, regression aims to find the relationship between one or more independent variables (features) and a dependent variable (the target value). This relationship is typically modeled using a line or curve.
The goal is to build a model that can accurately estimate the target variable for new, unseen data based on the patterns learned from historical data.
Key Concepts
- Independent Variables (Features): The input variables used to make predictions.
- Dependent Variable (Target): The continuous numerical output we want to predict.
- Model: The mathematical equation or algorithm that describes the relationship between independent and dependent variables.
- Coefficients: Parameters within the model that determine the strength and direction of the relationship between features and the target.
- Loss Function (Cost Function): Measures how well the model's predictions match the actual values. Common examples include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
- Optimization Algorithm: Used to minimize the loss function, thereby finding the best model parameters (see the sketch below).
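To make the last two concepts concrete, here is a minimal sketch that fits a simple linear model by gradient descent on the MSE loss. The data, learning rate, and iteration count are all illustrative assumptions, not recommendations.
import numpy as np
# Toy data lying exactly on y = 1 + 2x
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
b0, b1 = 0.0, 0.0  # initial coefficients
lr = 0.01          # learning rate (step size)
for _ in range(5000):
    error = (b0 + b1 * X) - y
    # Gradients of MSE with respect to b0 and b1
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * X).mean()
print(b0, b1)  # converges toward the true values 1.0 and 2.0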
Types of Regression
Linear Regression
Assumes a linear relationship between independent and dependent variables.
- Simple Linear Regression (one feature)
- Multiple Linear Regression (multiple features)
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
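As a brief illustration, the coefficients of a multiple linear regression can be estimated in closed form by ordinary least squares. The sketch below uses NumPy's np.linalg.lstsq on made-up two-feature data.
import numpy as np
# Hypothetical data generated from y = 1 + 2·x₁ + 3·x₂ (no noise)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0], [5.0, 4.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
# Prepend a column of ones so the first fitted coefficient acts as β₀
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # ≈ [1. 2. 3.]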
Polynomial Regression
Models the relationship as an n-th degree polynomial.
- Useful for capturing non-linear patterns.
y = β₀ + β₁x + β₂x² + ... + βₙxⁿ + ε
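In practice this is often implemented by expanding the input into polynomial features and fitting an ordinary linear model on them. A sketch using scikit-learn's PolynomialFeatures on made-up quadratic data:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Hypothetical data generated from y = 2 + 3x + 0.5x²
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2 + 3 * X.ravel() + 0.5 * X.ravel() ** 2
# Expand x into [x, x²], then fit a plain linear model on the expanded features
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # ≈ 2.0 [3.  0.5]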
Ridge Regression
A regularized form of linear regression that adds an L2 penalty term.
- Helps prevent overfitting by shrinking coefficients.
Minimize [ Σ(yᵢ - ŷᵢ)² + λ Σβⱼ² ]
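A short sketch of the shrinkage effect using scikit-learn's Ridge, where the alpha parameter plays the role of λ above; the data is synthetic and purely illustrative:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
# Hypothetical noisy data with five features, only three of which matter
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=30)
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)
print(np.linalg.norm(ols.coef_))    # coefficient norm without the penalty
print(np.linalg.norm(ridge.coef_))  # smaller: the L2 penalty shrinks coefficients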
Lasso Regression
A regularized form of linear regression that adds an L1 penalty term.
- Can perform feature selection by driving some coefficients to zero.
Minimize [ Σ(yᵢ - ŷᵢ)² + λ Σ|βⱼ| ]
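A sketch of the feature-selection effect using scikit-learn's Lasso (alpha again plays the role of λ), on made-up data where only the first of three features matters:
import numpy as np
from sklearn.linear_model import Lasso
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=50)  # features 2 and 3 are irrelevant
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients for the irrelevant features are driven to (near) zero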
Support Vector Regression (SVR)
Extends Support Vector Machines to regression tasks.
- Aims to fit a function so that prediction errors fall within a margin (epsilon); errors inside that tube incur no penalty.
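A minimal SVR sketch with scikit-learn; the kernel, epsilon, and C values are illustrative choices, not tuned recommendations:
import numpy as np
from sklearn.svm import SVR
# Hypothetical data on the line y = 2x + 1
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1
# Errors within ±0.5 of the fitted function incur no penalty
svr = SVR(kernel='linear', epsilon=0.5, C=10.0).fit(X, y)
print(svr.predict([[5.0]]))  # ≈ 11, within the epsilon tube around the true line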
Decision Tree Regression
Uses a tree-like structure to make predictions.
- Splits data based on feature values.
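A short sketch with scikit-learn's DecisionTreeRegressor on made-up sine-curve data; the depth limit is an arbitrary illustrative choice:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X.ravel())
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # limit depth to curb overfitting
print(tree.predict([[np.pi / 2]]))  # piecewise-constant estimate near the sine peak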
Random Forest Regression
An ensemble method that builds multiple decision trees.
- Averages predictions from individual trees to improve accuracy and reduce variance.
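The same hypothetical sine data fitted with scikit-learn's RandomForestRegressor, to show the averaging idea:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X.ravel())
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[np.pi / 2]]))  # ≈ 1.0, averaged across 100 randomized trees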
Applications of Regression
- Predicting house prices: Based on features like size, location, and number of bedrooms.
- Forecasting sales: Using historical sales data, marketing spend, and economic indicators.
- Estimating stock prices: Analyzing market trends, company performance, and economic news.
- Weather forecasting: Predicting temperature, rainfall, and wind speed.
- Medical diagnosis: Estimating patient risk factors or disease progression.
Evaluation Metrics
Evaluating regression models is crucial for understanding how well they perform. Common metrics include:
- Mean Squared Error (MSE): Average of the squared differences between predicted and actual values. Punishes larger errors more heavily.
- Root Mean Squared Error (RMSE): The square root of MSE. It's in the same units as the target variable, making it easier to interpret.
- Mean Absolute Error (MAE): Average of the absolute differences between predicted and actual values. Less sensitive to outliers than MSE.
- R-squared (Coefficient of Determination): Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Typically ranges from 0 to 1, where 1 indicates a perfect fit; it can be negative when a model fits worse than simply predicting the mean.
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
RMSE = √MSE
MAE = (1/n) Σ|yᵢ - ŷᵢ|
R² = 1 - (Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²)
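To see these formulas in action, here is a small sketch computing each metric with scikit-learn and NumPy on made-up predictions (RMSE is taken as the square root of MSE):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # illustrative actual values
y_pred = np.array([2.5, 5.0, 7.5, 10.0])  # illustrative predictions
mse = mean_squared_error(y_true, y_pred)
print(mse)                                  # 0.375
print(np.sqrt(mse))                         # RMSE ≈ 0.612
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(r2_score(y_true, y_pred))             # 0.925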
Example: Simple Linear Regression in Python (Conceptual)
Here's a conceptual look at how you might implement and use a simple linear regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Sample Data (replace with your actual data)
# X: Independent variable (e.g., years of experience)
# y: Dependent variable (e.g., salary)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Intercept (β₀): {model.intercept_}")
print(f"Coefficient (β₁): {model.coef_[0]}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")
# Predict a new value
new_experience = np.array([[11]])
predicted_salary = model.predict(new_experience)
print(f"Predicted salary for 11 years experience: ${predicted_salary[0]:,.2f}")
# Expected Output (the sample data lies exactly on a line, so the fit is perfect; real data would give a nonzero MSE and an R² below 1):
# Intercept (β₀): 25000.0
# Coefficient (β₁): 5000.0
# Mean Squared Error (MSE): 0.0
# R-squared (R²): 1.0
# Predicted salary for 11 years experience: $80,000.00