SQL Server Analysis Services Documentation

Logistic Regression Algorithm in Analysis Services

This topic describes the logistic regression algorithm in SQL Server Analysis Services, how it works, its parameters, and how to use it to build predictive models.

Introduction to Logistic Regression

Logistic regression is a statistical method for classification problems. Unlike linear regression, which predicts a continuous value, logistic regression predicts the probability that an event occurs. In SQL Server Analysis Services, the Microsoft Logistic Regression algorithm is a data mining algorithm that models the relationship between a set of independent (input) variables and a binary dependent (predictable) variable.

It is particularly useful when you need to predict whether a customer will churn, whether a transaction is fraudulent, or whether a prospect will respond to a marketing campaign.

How the Logistic Regression Algorithm Works

The algorithm models the probability of a binary outcome (e.g., Yes/No, 1/0) using the logistic function (also known as the sigmoid function). This function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability.
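
The mapping described above can be illustrated with a minimal Python sketch. It is not Analysis Services code; the function name sigmoid is chosen here for illustration only, and the sample inputs are arbitrary.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps a real-valued z to a value between 0 and 1."""
    # Branch on the sign of z so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs approach 0, large positive inputs approach 1, and z = 0 gives 0.5.
for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"z = {z:+.1f} -> probability = {sigmoid(z):.4f}")
```

Because the output always lies between 0 and 1, it can be read directly as the probability of the positive outcome.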

The model finds the best-fitting coefficients for the independent variables that maximize the likelihood of observing the actual outcomes in the training data. The formula can be represented as:

P(Y=1|X) = 1 / (1 + exp(-(β₀ + β₁X₁ + ... + βnXn)))

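Where:

  P(Y=1|X) is the probability that the outcome Y equals 1 given the inputs X.
  β₀ is the intercept.
  β₁, ..., βn are the coefficients of the independent variables.
  X₁, ..., Xn are the values of the independent variables.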
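
To make the idea of maximizing the likelihood concrete, the following Python sketch estimates β₀ and β₁ for a single input by plain gradient ascent on the log-likelihood. It is a generic illustration under stated assumptions, not the training procedure that Analysis Services uses internally. The function names, the learning rate, the epoch count, and the toy data (months since last purchase versus churn) are all made up for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(beta, x):
    """P(Y=1|X) for coefficients beta = [b0, b1, ..., bn] and inputs x = [x1, ..., xn]."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(z)

def fit(rows, labels, learning_rate=0.1, epochs=5000):
    """Estimate coefficients by gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            # Derivative of the log-likelihood with respect to the linear term.
            error = y - predict_probability(beta, x)
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        beta = [b + learning_rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: months since last purchase -> churned (1) or not (0).
rows = [[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit(rows, labels)
print("coefficients:", beta)
print("P(churn | 2 months):", predict_probability(beta, [2.0]))
print("P(churn | 8 months):", predict_probability(beta, [8.0]))
```

The toy labels overlap on purpose (one churner at 4 months, one non-churner at 6), because perfectly separable data would push the coefficients toward infinity instead of a finite maximum.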

Key Components and Concepts

Parameters for the Logistic Regression Algorithm

The logistic regression algorithm in Analysis Services supports several parameters to control its behavior:

Parameter                   Description                                                                       Default Value
HOLDOUT_PERCENTAGE          Percentage of training cases held out to calculate the holdout error.             30
HOLDOUT_SEED                Seed for the random selection of holdout cases.                                    0
MAXIMUM_INPUT_ATTRIBUTES    Number of input attributes handled before feature selection is invoked.           255
MAXIMUM_OUTPUT_ATTRIBUTES   Number of output attributes handled before feature selection is invoked.          255
MAXIMUM_STATES              Maximum number of attribute states that the algorithm supports.                   100
SAMPLE_SIZE                 Number of cases used to train the model.                                          10000

Building a Logistic Regression Model

To build a logistic regression model, you typically follow these steps in SQL Server Data Tools (SSDT). DMX statements can also be run from SQL Server Management Studio (SSMS).

  1. Create a Data Mining Project: Start a new Analysis Services Multidimensional and Data Mining project.
  2. Define Data Sources: Connect to the data sources that contain the training data.
  3. Create Data Mining Structures: Define the structure for your mining model, specifying the key, input, and predictable columns. For a binary classification model, make sure the predictable column has exactly two states.
  4. Choose the Algorithm: Select the "Microsoft Logistic Regression" algorithm.
  5. Configure Algorithm Properties: Adjust the algorithm parameters as needed.
  6. Train the Model: Process the mining structure to train the model.
  7. Explore and Predict: Use the mining model viewer to understand the model, then run prediction queries against new data.

Example Usage in DMX

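Here's a basic example of creating a model, training it, and making a prediction using DMX (Data Mining Extensions). Run each statement separately.

-- Create a new mining model.
-- Each column needs a data type and one content type; PREDICT marks the predictable column.
-- The types below are illustrative and should match your source data.
CREATE MINING MODEL [MyLogisticRegressionModel] (
    [CustomerID] LONG KEY,
    [Demographics] TEXT DISCRETE,
    [HasChurned] BOOLEAN DISCRETE PREDICT
)
USING
    Microsoft_Logistic_Regression (
        MAXIMUM_INPUT_ATTRIBUTES = 100
    )

-- Train the model.
-- [MyDataSource] is a placeholder for a data source defined in the Analysis Services database;
-- MyDataSourceView stands for the source table or view that holds the training rows.
INSERT INTO [MyLogisticRegressionModel] (
    [CustomerID],
    [Demographics],
    [HasChurned]
)
OPENQUERY([MyDataSource],
    'SELECT CustomerID, Demographics, HasChurned FROM MyDataSourceView')

-- Make a prediction for one customer and return its probability
SELECT
    t.[CustomerID],
    Predict([HasChurned]) AS PredictedChurn,
    PredictProbability([HasChurned]) AS ChurnProbability
FROM
    [MyLogisticRegressionModel]
PREDICTION JOIN
    OPENQUERY([MyDataSource],
        'SELECT CustomerID, Demographics FROM MyDataSourceView WHERE CustomerID = 12345') AS t
ON
    [MyLogisticRegressionModel].[Demographics] = t.[Demographics]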

Pros and Cons

Advantages:

Disadvantages:

Further Reading