Regression: Predicting Continuous Values

Understanding the core of supervised learning for numerical prediction.

What is Regression?

Regression is a supervised machine learning task in which the goal is to predict a continuous numerical value. Unlike classification, which predicts discrete categories (e.g., "spam" or "not spam"), regression aims to estimate a quantity that can fall anywhere within a range.

Think about predicting:

  • The price of a house based on its features.
  • The temperature tomorrow based on historical data.
  • The sales revenue for the next quarter.
  • A person's age based on their biometrics.

The core idea is to find a relationship between one or more independent variables (features) and a dependent variable (the target value you want to predict).

How Regression Works

Regression algorithms learn a mapping function from the input variables to the output variable. This function aims to minimize the difference between the predicted values and the actual values in the training data.

The most fundamental form of regression is Linear Regression. In simple linear regression, we assume a linear relationship between a single independent variable (X) and the dependent variable (Y):

Y = β₀ + β₁X + ε

Where:

  • Y is the dependent variable (what we want to predict).
  • X is the independent variable (the input feature).
  • β₀ is the y-intercept (the value of Y when X is 0).
  • β₁ is the slope (how much Y changes for a unit change in X).
  • ε is the error term (representing variability not explained by X).

The algorithm's goal is to find the optimal values for β₀ and β₁ that best fit the training data. This is typically done by minimizing the sum of squared errors (SSE) between the actual and predicted Y values, a procedure known as ordinary least squares (OLS).
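As a concrete sketch (the data below is made up for illustration), the least-squares estimates for β₀ and β₁ have a simple closed form:

```python
import numpy as np

# Hypothetical data: a single feature X and a continuous target Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares estimates:
#   beta1 = cov(X, Y) / var(X)
#   beta0 = mean(Y) - beta1 * mean(X)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

predicted = beta0 + beta1 * X
sse = np.sum((Y - predicted) ** 2)  # the quantity being minimized
print(f"beta0={beta0:.2f}, beta1={beta1:.2f}, SSE={sse:.4f}")
```

Any other choice of β₀ and β₁ would yield a larger SSE on this training data.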

Multiple Linear Regression

When you have more than one independent variable, you use Multiple Linear Regression:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

Here, X₁, X₂, ..., Xₙ are different input features, and β₁, β₂, ..., βₙ are their corresponding coefficients.
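A minimal sketch with made-up numbers: stacking a column of ones into the design matrix lets least squares recover the intercept and all coefficients in one solve:

```python
import numpy as np

# Hypothetical design matrix: first column of ones absorbs the intercept,
# the other two columns are features X1 and X2.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 5.0, 2.0],
    [1.0, 7.0, 3.0],
    [1.0, 8.0, 1.0],
])
# Targets generated from Y = 1 + 2*X1 + 3*X2 (no noise, for illustration).
Y = np.array([8.0, 7.0, 17.0, 24.0, 20.0])

# Solve for [beta0, beta1, beta2] by least squares.
betas, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(betas)
```

With noise-free data the exact coefficients (1, 2, 3) are recovered; with real data, the solution minimizes the SSE instead.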

Common Regression Algorithms

While linear regression is foundational, many other algorithms are used for regression tasks, each with its strengths:

1. Linear Regression (and its variants)

  • Simple Linear Regression: One predictor variable.
  • Multiple Linear Regression: Two or more predictor variables.
  • Polynomial Regression: Models a non-linear relationship by using polynomial terms of the predictor variable (e.g., X²).
  • Ridge Regression & Lasso Regression: Regularized versions of linear regression that help prevent overfitting by adding penalties to the coefficients.
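Ridge regression, in particular, has a closed-form solution that makes the penalty idea concrete. The sketch below (with toy data) penalizes every coefficient for simplicity; real libraries usually leave the intercept unpenalized:

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam*I)^(-1) X^T y.
    Larger lam shrinks the coefficients toward zero."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical data generated from y = 2*x1 + 1*x2 (no noise, no intercept).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([4.0, 4.5, 7.5, 11.0])

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_coefficients(X, y, lam))
```

At λ = 0 this reduces to ordinary least squares; as λ grows, the coefficient vector shrinks, trading a little training-set fit for less overfitting.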

2. Decision Tree Regression

Decision trees partition the data space into regions and predict the average value of the target variable within each region. They can capture complex non-linear relationships.
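The region-splitting idea can be shown with the simplest possible tree, a depth-1 "stump" on a single feature (toy data, not a full tree learner):

```python
import numpy as np

def best_single_split(x, y):
    """Depth-1 regression stump: find the threshold on one feature
    that minimizes total SSE when each side of the split predicts
    the mean of its region."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_thr, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_thr, best_sse

# Two clearly separated clusters: the best split lands between them.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
thr, sse = best_single_split(x, y)
print(thr, sse)
```

A full decision tree applies this search recursively, across all features, to carve the space into finer regions.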

3. Support Vector Regression (SVR)

SVR is an extension of Support Vector Machines (SVMs) to regression. It aims to find a function that deviates from the target values by at most a given threshold (epsilon) while being as flat as possible.
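The "epsilon tube" can be made concrete by the loss function SVR optimizes; this is just the loss, not a full SVR solver, and the numbers are illustrative:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors inside the epsilon tube cost nothing; beyond it,
    the cost grows linearly with the deviation."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

losses = epsilon_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                                  np.array([1.05, 2.5, 3.0]),
                                  epsilon=0.1)
print(losses)
```

Only the middle prediction, which misses its target by 0.5, is penalized; the other two fall inside the tube.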

4. Random Forest Regression

An ensemble method that builds multiple decision trees and aggregates their predictions (e.g., by averaging) to improve accuracy and robustness.

5. Gradient Boosting Regression (e.g., XGBoost, LightGBM)

Another powerful ensemble technique that builds trees sequentially, with each new tree trying to correct the errors made by the previous ones.
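A heavily simplified sketch of this sequential idea (not XGBoost's or LightGBM's actual algorithm): each round fits a depth-1 stump to the current residuals and adds a damped copy of its predictions to the running model:

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression stump: best threshold plus per-side means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    _, thr, lmean, rmean = best
    return lambda x_new: np.where(x_new <= thr, lmean, rmean)

def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the residuals (errors) left by the
    ensemble so far, then adds a damped version of its predictions."""
    pred = np.full_like(y, y.mean())
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred
        stump = fit_stump(x, residuals)
        pred = pred + learning_rate * stump(x)
        stumps.append(stump)
    base = y.mean()
    return lambda x_new: base + learning_rate * sum(s(x_new) for s in stumps)

# Toy data with two plateaus; boosting drives the training error down.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 4.0, 4.2, 3.9])
model = gradient_boost(x, y)
fitted = model(x)
print(fitted)
```

Real gradient-boosting libraries use deeper trees, regularization, and second-order information, but the "fit the residuals, add a damped correction" loop is the core.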

Evaluating Regression Models

Assessing the performance of a regression model is crucial. Key metrics include:

  • Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values. It's easy to interpret.
    MAE = (1/n) Σ |yᵢ - ŷᵢ|
  • Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. Penalizes larger errors more heavily.
    MSE = (1/n) Σ (yᵢ - ŷᵢ)²
  • Root Mean Squared Error (RMSE): The square root of MSE. It's in the same units as the target variable, making it more interpretable than MSE.
    RMSE = √MSE
  • R-squared (R²) or Coefficient of Determination: Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A value of 1 means the model explains all the variability.
    R² = 1 - (SSE / SST)
    (Where SSE is the sum of squared errors, and SST is the total sum of squares)
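All four metrics follow directly from the residuals; a short sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical actual values
y_pred = np.array([2.5, 5.5, 7.0, 8.0])   # hypothetical model predictions

residuals = y_true - y_pred
mae  = np.mean(np.abs(residuals))              # Mean Absolute Error
mse  = np.mean(residuals ** 2)                 # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
sse  = np.sum(residuals ** 2)                  # sum of squared errors
sst  = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2   = 1 - sse / sst                           # coefficient of determination

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R2={r2}")
```

Note how the single 1.0-unit miss dominates MSE (squaring) far more than it does MAE.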

Choosing the right metric depends on the specific problem and the cost associated with different types of errors.

Illustrative Example: Predicting House Prices

Let's imagine we want to predict house prices. Our features (independent variables) might include:

  • Square footage of the house
  • Number of bedrooms
  • Location (e.g., distance from city center)
  • Age of the house

Our target variable (dependent variable) is the Price of the house.

Scenario

We have a dataset of houses with their features and actual sale prices. We train a regression model (e.g., Multiple Linear Regression) on this data. The model learns coefficients (β) that represent the impact of each feature on the price.

For instance, a model might suggest:


        Price = 50000                      // base price (intercept)
                + 150 * SquareFootage      // $150 per square foot
                + 20000 * NumBedrooms      // $20,000 per bedroom
                - 1000 * DistanceToCenter  // -$1,000 per mile from center
                - 500 * Age                // -$500 per year of age

Using this trained model, we can input the features of a new, unseen house to predict its potential sale price.
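Applying the illustrative coefficients to one hypothetical house looks like this (the house's features are made up):

```python
def predict_price(square_footage, num_bedrooms, distance_to_center, age):
    """Apply the illustrative coefficients from the example model."""
    return (50000                        # base price (intercept)
            + 150 * square_footage       # $150 per square foot
            + 20000 * num_bedrooms       # $20,000 per bedroom
            - 1000 * distance_to_center  # -$1,000 per mile from center
            - 500 * age)                 # -$500 per year of age

# A hypothetical unseen house: 1,800 sq ft, 3 bedrooms, 5 miles out, 10 years old.
price = predict_price(1800, 3, 5, 10)
print(price)  # 370000
```

Each coefficient contributes independently, which is exactly the additivity assumption of multiple linear regression.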