Logistic Regression Algorithm

The Logistic Regression algorithm is a classification algorithm that is used to predict the probability of a binary outcome. It is a supervised learning algorithm that models the relationship between a dependent binary variable and one or more independent variables by fitting a logistic function to the data.

In SQL Server Analysis Services (SSAS), the Logistic Regression algorithm is implemented as a data mining algorithm that can be used to build predictive models for binary classification tasks. It is particularly useful when you need to understand the factors that influence a specific outcome and to predict the likelihood of that outcome occurring.

How It Works

The Logistic Regression algorithm works by fitting a logistic (sigmoid) function to the input data. This function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability. The algorithm learns one coefficient per independent variable, plus an intercept, choosing the values that best predict the binary outcome.

The core of the algorithm is the logistic function:

P(Y=1|X) = 1 / (1 + exp(- (b0 + b1*X1 + b2*X2 + ... + bn*Xn)))

Where:

P(Y=1|X) is the probability of the dependent variable Y being 1, given the independent variables X.
b0 is the intercept.
b1 through bn are the coefficients for the independent variables X1 through Xn.
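
The formula above can be evaluated directly. The following is a minimal Python sketch, not SSAS code; the function name `predict_probability` and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Return P(Y=1|X) for one case using the logistic function."""
    # Linear combination b0 + b1*X1 + ... + bn*Xn
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: b0 = -1.0, b1 = 0.02, b2 = 0.5
# Hypothetical case: X1 = 40, X2 = 1, so z = -1.0 + 0.8 + 0.5 = 0.3
p = predict_probability(-1.0, [0.02, 0.5], [40, 1])
print(round(p, 3))  # prints 0.574
```

When the linear combination is 0 the function returns exactly 0.5, which is why 0 on the log-odds scale corresponds to an even chance.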

The algorithm aims to find the coefficients that maximize the likelihood of observing the given training data.
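
To illustrate what maximizing the likelihood means, the sketch below fits a one-variable model with plain gradient ascent on the log-likelihood. This is a generic textbook procedure under assumed settings (learning rate, step count, toy data), not a description of the optimizer SSAS uses internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(weights, x):
    # weights[0] is the intercept b0; weights[1:] are b1..bn
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

def log_likelihood(weights, rows, labels):
    """Sum of log P(observed label) over all training cases."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(linear_score(weights, x))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(rows, labels, learning_rate=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = y - sigmoid(linear_score(weights, x))
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights

# Toy data (hypothetical): larger X tends to go with Y = 1, with some overlap
rows = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
labels = [0, 0, 1, 0, 1, 1]

weights = fit(rows, labels)
print(log_likelihood([0.0, 0.0], rows, labels))  # likelihood before fitting
print(log_likelihood(weights, rows, labels))     # higher after fitting
```

The toy labels overlap on purpose: with perfectly separable data the likelihood has no finite maximum and the coefficients would keep growing.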

Key Concepts

Binary Classification: Predicts one of two possible outcomes (e.g., Yes/No, True/False, Spam/Not Spam).
Probability Prediction: Outputs the probability of a specific outcome.
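Feature Importance: The sign and magnitude of each coefficient indicate the direction and strength of an input variable's effect on the outcome. Magnitudes are comparable across variables only when the inputs are on similar scales.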
Odds Ratio: The exponentiated coefficients can be interpreted as odds ratios, indicating how a unit change in an independent variable affects the odds of the outcome.
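
The odds ratio interpretation can be checked numerically. In this small Python sketch the intercept and coefficient are hypothetical values; it compares the odds before and after a one-unit increase in the input with the exponentiated coefficient.

```python
import math

def probability(intercept, coefficient, x):
    return 1.0 / (1.0 + math.exp(-(intercept + coefficient * x)))

def odds(p):
    return p / (1.0 - p)

intercept, coefficient = -1.0, 0.5  # hypothetical model

p_before = probability(intercept, coefficient, 2.0)  # input at 2.0
p_after = probability(intercept, coefficient, 3.0)   # input raised by one unit

odds_ratio = odds(p_after) / odds(p_before)
print(round(odds_ratio, 4), round(math.exp(coefficient), 4))  # prints 1.6487 1.6487
```

The ratio is the same at any starting value of the input, even though the change in probability itself depends on where on the curve the case sits.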

Parameters

The Logistic Regression algorithm in SSAS has several configurable parameters that can influence the model's behavior and performance:

Parameter	Description	Allowed Values
HOLDOUT_PERCENTAGE	Specifies the percentage of training cases reserved to calculate the holdout error, which is used as part of the stopping criteria during training.	Integer (default: 30)
HOLDOUT_SEED	Specifies the seed for the pseudo-random generator that selects the holdout cases. If 0, the seed is derived from the mining model name, so the holdout set stays the same when the model is reprocessed.	Integer (default: 0)
MAXIMUM_INPUT_ATTRIBUTES	Specifies the number of input attributes the algorithm can handle before it invokes feature selection. Set to 0 to turn off feature selection.	Integer (default: 255)
MAXIMUM_OUTPUT_ATTRIBUTES	Specifies the number of output attributes the algorithm can handle before it invokes feature selection. Set to 0 to turn off feature selection.	Integer (default: 255)
MAXIMUM_STATES	Specifies the maximum number of attribute states the algorithm supports. If an attribute has more states, the algorithm keeps the most popular states and treats the rest as missing.	Integer (default: 100)
SAMPLE_SIZE	Specifies the number of cases used to train the model. The algorithm uses either this number or the number of cases left after the holdout is removed, whichever is smaller.	Integer (default: 10000)

Note: Experiment with HOLDOUT_PERCENTAGE and SAMPLE_SIZE to balance training time against how reliably the holdout error reflects generalization on your specific dataset.
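
To make the interaction between HOLDOUT_PERCENTAGE and SAMPLE_SIZE easier to picture, the sketch below reserves a percentage of cases and then caps the remaining training cases. The function `split_cases` and its arguments are hypothetical; this shows the general idea only and is not how SSAS samples cases internally.

```python
import random

def split_cases(cases, holdout_percentage, sample_size=None, seed=0):
    """Reserve a holdout set, then cap the training set at sample_size cases."""
    rng = random.Random(seed)          # fixed seed gives a repeatable split
    shuffled = list(cases)
    rng.shuffle(shuffled)

    holdout_count = len(shuffled) * holdout_percentage // 100
    holdout = shuffled[:holdout_count]
    training = shuffled[holdout_count:]

    if sample_size is not None:        # None means "no cap" in this sketch
        training = training[:sample_size]
    return training, holdout

training, holdout = split_cases(range(1000), holdout_percentage=30, sample_size=500)
print(len(training), len(holdout))  # prints 500 300
```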

Usage in SSAS

To use the Logistic Regression algorithm in SSAS:

Create a Data Mining Project: In SQL Server Data Tools (SSDT), create a new Analysis Services Multidimensional and Data Mining Project.
Create a Data Source: Define a connection to your data source (e.g., SQL Server database).
Create a Data Source View: Select the tables and columns relevant to your classification task.
Create a Mining Structure:
- Select Microsoft Logistic Regression as the algorithm for the initial mining model.
- Identify your Key Columns, Predictable Column (the binary target variable), and Input Columns.
Train the Model: Process the mining structure to train the Logistic Regression model.
Explore and Predict: Browse the model in the Microsoft Neural Network Viewer, which SSAS uses for logistic regression models, to see which input attribute values favor each outcome, and use DMX prediction queries to make predictions on new data.

Example: Creating a model with DMX

CREATE MINING MODEL [Customer Churn Model]
(
    -- Column syntax is: name, data type, content type, optional PREDICT flag
    [CustomerID] LONG KEY,
    [Churn] TEXT DISCRETE PREDICT, -- The binary outcome (e.g., 'Yes', 'No')
    [Gender] TEXT DISCRETE,
    [Age] LONG CONTINUOUS,
    [AnnualIncome] DOUBLE CONTINUOUS,
    [ContractType] TEXT DISCRETE
)
USING Microsoft_Logistic_Regression
(
    MAXIMUM_INPUT_ATTRIBUTES = 100, -- invoke feature selection above 100 inputs
    HOLDOUT_PERCENTAGE = 20         -- reserve 20% of cases for the holdout error
)

Advantages

Interpretability: Coefficients provide clear insights into the relationship between predictors and the outcome.
Efficiency: Relatively fast to train and to score, even on large datasets.
Probability Output: Produces probabilities rather than only class labels, which is useful for risk assessment and for choosing a decision threshold.
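Bounded Probability Mapping: The logistic function converts an unbounded linear combination of the inputs into a value between 0 and 1. The model itself remains linear in the log-odds, so non-linear effects must be supplied through transformed or interaction features.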

Disadvantages

Assumes Linearity: Assumes a linear relationship between the independent variables and the log-odds of the outcome.
Requires Feature Engineering: May require careful feature selection and engineering to perform well.
Sensitive to Outliers: Can be sensitive to outliers in the data.
Limited to Binary Outcomes: Primarily designed for binary classification; extensions are needed for multi-class problems.
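
One common extension for multi-class problems is multinomial (softmax) regression, which computes one linear score per class and normalizes the scores into probabilities. The Python sketch below uses hypothetical scores and is a general illustration, not a description of how SSAS handles predictable attributes with more than two states.

```python
import math

def softmax(scores):
    """Convert one linear score per class into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores for three classes
class_probabilities = softmax([2.0, 1.0, 0.1])

# With two classes and the second score fixed at 0,
# softmax reduces to the binary logistic function.
two_class = softmax([0.3, 0.0])
print(abs(two_class[0] - sigmoid(0.3)) < 1e-12)  # prints True
```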