Logistic Regression Algorithm

The Logistic Regression algorithm is a classification algorithm that is used to predict the probability of a binary outcome. It is a supervised learning algorithm that models the relationship between a dependent binary variable and one or more independent variables by fitting a logistic function to the data.

In SQL Server Analysis Services (SSAS), the Logistic Regression algorithm is implemented as a data mining algorithm that can be used to build predictive models for binary classification tasks. It is particularly useful when you need to understand the factors that influence a specific outcome and to predict the likelihood of that outcome occurring.

How It Works

The Logistic Regression algorithm works by fitting a logistic (sigmoid) function to the input data. This function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability. The algorithm learns one coefficient for each independent variable, plus an intercept, so that the resulting probabilities best predict the binary outcome.
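
As a quick illustration of that mapping, the block below evaluates the standard sigmoid at a few points. The symbols z and \sigma are introduced here only for illustration; they do not appear elsewhere in this article.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\sigma(-2) \approx 0.119, \quad
\sigma(0) = 0.5, \quad
\sigma(2) \approx 0.881
% As z grows large and positive, \sigma(z) approaches 1;
% as z grows large and negative, \sigma(z) approaches 0.
```

Large positive inputs therefore give probabilities near 1, large negative inputs give probabilities near 0, and an input of 0 gives exactly 0.5.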

The core of the algorithm is the logistic function:

P(Y=1|X) = 1 / (1 + exp(- (b0 + b1*X1 + b2*X2 + ... + bn*Xn)))

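Where:

  • P(Y=1|X) is the predicted probability that the outcome Y equals 1, given the input values X.
  • X1, X2, ..., Xn are the independent (input) variables.
  • b0 is the intercept.
  • b1, b2, ..., bn are the coefficients of the corresponding input variables.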

The algorithm aims to find the coefficients that maximize the likelihood of observing the given training data.
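
To make "maximize the likelihood" concrete, the textbook log-likelihood for a binary outcome is sketched below. The notation is an assumption made for illustration: m is the number of training cases, y_i the observed outcome of case i, and p_i the model's P(Y=1|X) for that case. This is the generic formulation; nothing here describes the specific optimization procedure that SSAS uses internally.

```latex
\ell(b_0, \dots, b_n) = \sum_{i=1}^{m} \Bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Bigr],
\qquad
p_i = \frac{1}{1 + \exp\bigl(-(b_0 + b_1 X_{i1} + \dots + b_n X_{in})\bigr)}

% Worked example with hypothetical values b_0 = -1, b_1 = 0.5, X_1 = 4:
%   b_0 + b_1 X_1 = -1 + 0.5 \cdot 4 = 1
%   p = 1 / (1 + e^{-1}) \approx 0.731
% If the observed outcome for this case is y = 1, it contributes
%   \log 0.731 \approx -0.313 to the log-likelihood.
```

Coefficients that assign high probability to the outcomes actually observed bring each term closer to 0, and so increase the total.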

Key Concepts

Parameters

The Logistic Regression algorithm in SSAS has several configurable parameters that can influence the model's behavior and performance:

Parameter | Description | Allowed Values
MAXIMUM_INPUT_ATTRIBUTES | Specifies the number of input attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection. | Integer (default: 255)
MAXIMUM_OUTPUT_ATTRIBUTES | Specifies the number of output attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection. | Integer (default: 255)
SAMPLE_SIZE | Specifies the number of cases used to train the model. The algorithm uses either this number or the number of cases that remain after the holdout set is removed, whichever is smaller. | Integer (default: 10000)
HOLDOUT_PERCENTAGE | Specifies the percentage of training cases that are held out to calculate the holdout error, which is used as a stopping criterion during training. | Integer (default: 30)

Note: Experiment with HOLDOUT_PERCENTAGE and SAMPLE_SIZE to find the best balance between model fit and generalization for your specific dataset.

Usage in SSAS

To use the Logistic Regression algorithm in SSAS:

  1. Create a Data Mining Project: In SQL Server Data Tools (SSDT), create a new Analysis Services Multidimensional and Data Mining Project.
  2. Create a Data Source: Define a connection to your data source (e.g., SQL Server database).
  3. Create a Data Source View: Select the tables and columns relevant to your classification task.
  4. Create a Mining Structure:
    • Select Microsoft Logistic Regression as the algorithm.
    • Identify your Key Columns, Predictable Column (the binary target variable), and Input Columns.
  5. Train the Model: Process the mining structure to train the Logistic Regression model.
  6. Explore and Predict: Use the mining viewer to explore the model (e.g., view coefficients, feature importance) and use prediction queries to make predictions on new data.

Example: Creating a model with DMX

CREATE MINING MODEL [Customer Churn Model]
(
    -- Each column is declared as: name, data type, content type, optional usage flag.
    [CustomerID] LONG KEY,
    [Churn] TEXT DISCRETE PREDICT, -- The binary outcome (e.g., 'Yes', 'No')
    [Gender] TEXT DISCRETE,
    [Age] LONG CONTINUOUS,
    [AnnualIncome] DOUBLE CONTINUOUS,
    [ContractType] TEXT DISCRETE
)
-- Algorithm parameters follow the algorithm name in parentheses.
USING Microsoft_Logistic_Regression
(
    MAXIMUM_INPUT_ATTRIBUTES = 100,
    HOLDOUT_PERCENTAGE = 20
)
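
Steps 5 and 6 above can also be carried out in DMX. The statements below are a minimal sketch rather than a definitive procedure. The data source name [Customer Data], the table dbo.Customers, and the input values in the prediction query are hypothetical and need to be replaced with objects and values from your own project.

```dmx
-- Train the model. [Customer Data] and dbo.Customers are assumed names.
INSERT INTO MINING MODEL [Customer Churn Model]
(
    [CustomerID], [Churn], [Gender], [Age], [AnnualIncome], [ContractType]
)
OPENQUERY([Customer Data],
    'SELECT CustomerID, Churn, Gender, Age, AnnualIncome, ContractType
     FROM dbo.Customers')
```

Once the model has been processed, a singleton prediction query can score a single new case:

```dmx
-- Predict the outcome and the probability of 'Yes' for one hypothetical customer.
SELECT
    Predict([Churn]) AS [PredictedChurn],
    PredictProbability([Churn], 'Yes') AS [ChurnProbability]
FROM [Customer Churn Model]
NATURAL PREDICTION JOIN
(
    SELECT
        'Female' AS [Gender],
        42 AS [Age],
        55000 AS [AnnualIncome],
        'Month-to-month' AS [ContractType]
) AS t
```

NATURAL PREDICTION JOIN matches the input columns to the model columns by name, so the aliases in the inner SELECT must be the same as the column names used in the model definition.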

Advantages

Disadvantages

See Also