Logistic Regression - SQL Server Analysis Services

Logistic Regression Algorithm in Analysis Services

This topic describes the logistic regression algorithm in SQL Server Analysis Services, how it works, its parameters, and how to use it to build predictive models.

Introduction to Logistic Regression

Logistic regression is a statistical method used for classification problems. Unlike linear regression, which predicts a continuous value, logistic regression predicts the probability of an event occurring. In SQL Server Analysis Services, the logistic regression algorithm is a data mining algorithm that models the relationship between a set of independent variables and a binary dependent variable.

It is particularly useful when you need to predict whether a customer will churn, whether a transaction is fraudulent, or whether a prospect will respond to a marketing campaign.

How the Logistic Regression Algorithm Works

The algorithm models the probability of a binary outcome (e.g., Yes/No, 1/0) using the logistic function (also known as the sigmoid function). This function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability.
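As a minimal sketch of the logistic function described above (written in Python for illustration, not taken from Analysis Services itself), the helper below maps a real-valued score to a value between 0 and 1. The function name `sigmoid` and the sample inputs are assumptions chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps a real-valued z to a value between 0 and 1."""
    # Branch on the sign of z so exp() is only called on a non-positive
    # argument, which avoids OverflowError for inputs of large magnitude.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative scores approach 0, large positive scores approach 1,
# and a score of 0 maps to exactly 0.5.
for z in (-5.0, 0.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

The output can be read as a probability, which is what makes the function suitable for modeling a Yes/No outcome.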

The model finds the best-fitting coefficients for the independent variables that maximize the likelihood of observing the actual outcomes in the training data. The formula can be represented as:

P(Y=1|X) = 1 / (1 + exp(-(β₀ + β₁X₁ + ... + βₙXₙ)))

Where:

P(Y=1|X) is the probability of the dependent variable being 1 given the independent variables.
β₀ is the intercept.
β₁, ..., βₙ are the coefficients for the independent variables X₁, ..., Xₙ.
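To make the fitting step concrete, here is a small Python sketch that estimates the coefficients by maximizing the likelihood with plain batch gradient ascent, then evaluates the formula above. This is an illustration under stated assumptions: the toy data, the learning rate, the epoch count, and the helper names `predict_proba` and `fit` are made up for the example, and the optimizer that Analysis Services uses internally may differ.

```python
import math

def predict_proba(beta, x):
    """P(Y=1|X) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn))); beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.1, epochs=5000):
    """Maximize the average log-likelihood with batch gradient ascent."""
    beta = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            # (observed - predicted) is the log-likelihood gradient w.r.t. the linear score.
            err = yi - predict_proba(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Made-up toy data: one input (for example, number of support calls) and a binary outcome.
# The classes overlap on purpose so that the maximum-likelihood estimate is finite.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]

beta = fit(X, y)
print("intercept, slope:", [round(b, 3) for b in beta])
print("P(Y=1 | x=4):", round(predict_proba(beta, [4.0]), 3))
```

Because higher input values are associated with more positive outcomes in this toy data, the fitted slope is positive and the predicted probability rises with the input.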

Key Components and Concepts

Dependent Variable: Must be a discrete, binary variable. This is the variable you want to predict.
Independent Variables: Can be numeric (continuous or discrete) or categorical. These variables are used to predict the dependent variable.
Coefficients: Indicate the strength and direction of the relationship between each independent variable and the log-odds of the dependent variable.
Probability Threshold: A cutoff value (typically 0.5) used to convert probabilities into discrete class predictions.
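The last two concepts can be illustrated with a short Python sketch. It assumes hypothetical coefficient values and input names (`support_calls`, `tenure_years`), which are invented for the example. It shows that exponentiating a coefficient gives the factor by which the odds change for a one-unit increase in that input, and that a cutoff turns a probability into a class label.

```python
import math

# Hypothetical fitted coefficients, for illustration only.
coefficients = {"intercept": -2.0, "support_calls": 0.8, "tenure_years": -0.5}

def odds_ratio(beta: float) -> float:
    """A one-unit increase in the input multiplies the odds of Y=1 by exp(beta)."""
    return math.exp(beta)

def classify(probability: float, threshold: float = 0.5) -> int:
    """Convert a probability into a class label (1 or 0) using a cutoff."""
    return 1 if probability >= threshold else 0

for name, beta in coefficients.items():
    if name != "intercept":
        # A ratio above 1 raises the odds; a ratio below 1 lowers them.
        print(f"{name}: beta={beta:+.2f}, odds ratio={odds_ratio(beta):.3f}")

print(classify(0.62))                 # 1 with the default 0.5 cutoff
print(classify(0.62, threshold=0.7))  # 0 with a stricter cutoff
```

Raising the cutoff trades fewer false positives for more false negatives, so the appropriate value depends on the cost of each kind of error.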

Parameters for the Logistic Regression Algorithm

The logistic regression algorithm in Analysis Services supports several parameters to control its behavior:

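Parameter	Description	Default Value
`HOLDOUT_PERCENTAGE`	The percentage of training cases held out to calculate the holdout error, which is used as part of the stopping criteria during training.	30
`HOLDOUT_SEED`	The seed for the pseudo-random generator that selects the holdout cases. If set to 0, the algorithm generates the seed from the model name.	0
`MAXIMUM_INPUT_ATTRIBUTES`	The number of input attributes the algorithm can handle before it invokes feature selection. Set to 0 to turn off feature selection.	255
`MAXIMUM_OUTPUT_ATTRIBUTES`	The number of output attributes the algorithm can handle before it invokes feature selection. Set to 0 to turn off feature selection.	255
`MAXIMUM_STATES`	The maximum number of attribute states the algorithm supports. When an attribute has more states, the most popular states are used and the rest are treated as missing.	100
`SAMPLE_SIZE`	The number of cases used to train the model.	10000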

Building a Logistic Regression Model

To build a logistic regression model, you typically follow these steps within SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS):

Create a Data Mining Project: Start a new Analysis Services project.
Define Data Sources: Connect to your data sources containing the training data.
Create Data Mining Structures: Define the structure for your mining model, specifying the input and predictable columns. Ensure your predictable column is binary.
Choose the Algorithm: Select the Microsoft Logistic Regression algorithm (Microsoft_Logistic_Regression in DMX).
Configure Algorithm Properties: Adjust the algorithm parameters described in the previous section as needed.
Train the Model: Process the mining structure to train the model.
Explore and Predict: Use the mining viewer to understand the model and predict outcomes for new data.

Example Usage in DMX

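Here's a basic example of creating, training, and querying a model using DMX (Data Mining Extensions). The model, column, and data source names are illustrative:

-- Create a new mining model.
-- Each column needs a data type and a content type; the predictable
-- column is marked with PREDICT.
CREATE MINING MODEL [MyLogisticRegressionModel] (
    [CustomerID] LONG KEY,
    [Demographics] TEXT DISCRETE,
    [HasChurned] LONG DISCRETE PREDICT
)
USING
    Microsoft_Logistic_Regression (
        MAXIMUM_INPUT_ATTRIBUTES = 255,
        HOLDOUT_PERCENTAGE = 30
    );

-- Train the model from a relational source.
-- [MyDataSource] is a placeholder for a data source defined in the database.
INSERT INTO [MyLogisticRegressionModel] (
    [CustomerID],
    [Demographics],
    [HasChurned]
)
OPENQUERY([MyDataSource],
    'SELECT CustomerID, Demographics, HasChurned FROM MyDataSourceView');

-- Predict churn and its probability for one customer
SELECT
    t.[CustomerID],
    Predict([HasChurned]) AS PredictedChurn,
    PredictProbability([HasChurned]) AS ChurnProbability
FROM
    [MyLogisticRegressionModel]
PREDICTION JOIN
    OPENQUERY([MyDataSource],
        'SELECT CustomerID, Demographics FROM MyDataSourceView WHERE CustomerID = 12345') AS t
ON
    [MyLogisticRegressionModel].[Demographics] = t.[Demographics];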

Pros and Cons

Advantages:

Interpretability: Coefficients can be interpreted to understand feature importance and direction.
Efficiency: Generally fast to train and predict.
Probabilistic Output: Provides probabilities, which are valuable for risk assessment.
Handles Non-linear Relationships: The logistic function makes the predicted probability an S-shaped, non-linear function of the inputs, even though the log-odds remain linear in them.

Disadvantages:

Binary Output: Primarily suited for binary classification. For multi-class problems, extensions or different algorithms are needed.
Assumption of Linearity: Assumes a linear relationship between independent variables and the log-odds of the dependent variable.
Sensitivity to Outliers: Can be sensitive to extreme values in the data.
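The "Handles Non-linear Relationships" advantage and the "Assumption of Linearity" disadvantage describe the same model on two different scales. The Python sketch below, which uses made-up coefficients (`b0 = -3.0`, `b1 = 1.0`) and hypothetical helper names, shows that equal steps in an input change the probability by unequal amounts while the log-odds change by a constant amount equal to the coefficient.

```python
import math

def probability(x: float, b0: float = -3.0, b1: float = 1.0) -> float:
    """Single-input logistic model with illustrative coefficients."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_odds(p: float) -> float:
    """Logit: the natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ps = [probability(x) for x in xs]

# Change between consecutive inputs, measured on each scale.
prob_steps = [b - a for a, b in zip(ps, ps[1:])]
logit_steps = [log_odds(b) - log_odds(a) for a, b in zip(ps, ps[1:])]

print("probability steps:", [round(s, 4) for s in prob_steps])   # unequal
print("log-odds steps:   ", [round(s, 4) for s in logit_steps])  # constant, equal to b1
```

If the true relationship between an input and the log-odds is not linear, the input usually needs to be transformed or binned before modeling.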
