The Logistic Regression algorithm is a classification algorithm that is used to predict the probability of a binary outcome. It is a supervised learning algorithm that models the relationship between a dependent binary variable and one or more independent variables by fitting a logistic function to the data.
In SQL Server Analysis Services (SSAS), the Logistic Regression algorithm is implemented as a data mining algorithm that can be used to build predictive models for binary classification tasks. It is particularly useful when you need to understand the factors that influence a specific outcome and to predict the likelihood of that outcome occurring.
The algorithm fits a logistic (sigmoid) function to the input data. This function maps any real-valued input to a value between 0 and 1, which can be interpreted as a probability. During training, the algorithm learns an intercept and a coefficient for each independent variable, chosen so that the predicted probabilities match the observed binary outcomes as closely as possible.
The core of the algorithm is the logistic function:
P(Y=1|X) = 1 / (1 + exp(- (b0 + b1*X1 + b2*X2 + ... + bn*Xn)))
Where:
- P(Y=1|X) is the probability of the dependent variable Y being 1, given the independent variables X.
- b0 is the intercept.
- b1 through bn are the coefficients for the independent variables X1 through Xn.

The algorithm aims to find the coefficients that maximize the likelihood of observing the given training data.
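As a worked illustration of the formula above, consider a model with a single input X1 and coefficients b0 = -1.5 and b1 = 0.04. These values are made up for the example and do not come from a trained SSAS model. For X1 = 50, the calculation would look roughly like this, followed by the log-likelihood that maximum likelihood fitting is usually understood to maximize over N training cases (p_i is the predicted probability for case i and y_i is its observed outcome, 0 or 1):

```latex
\begin{aligned}
z &= b_0 + b_1 X_1 = -1.5 + 0.04 \times 50 = 0.5 \\
P(Y=1 \mid X) &= \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-0.5}} \approx \frac{1}{1.6065} \approx 0.622 \\
\ell(b_0, \dots, b_n) &= \sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
\end{aligned}
```

A linear score of 0 gives a probability of exactly 0.5, so the sign of b0 + b1*X1 + ... + bn*Xn determines which of the two outcomes the model considers more likely.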
The Logistic Regression algorithm in SSAS has several configurable parameters that can influence the model's behavior and performance:
| Parameter | Description | Allowed Values |
|---|---|---|
| MAXIMUM_INPUT_ATTRIBUTES | Specifies the number of input attributes that the algorithm can handle before it invokes feature selection. Setting the value to 0 turns feature selection off. | Integer (default: 255) |
| MAXIMUM_OUTPUT_ATTRIBUTES | Specifies the number of output attributes that the algorithm can handle before it invokes feature selection. Setting the value to 0 turns feature selection off. | Integer (default: 255) |
| SAMPLE_SIZE | Specifies the number of cases used to train the model. The algorithm uses either this number or the number of cases that remain after the holdout percentage is set aside, whichever is smaller. | Integer (default: 10000) |
| HOLDOUT_PERCENTAGE | Specifies the percentage of training cases set aside to calculate the holdout error, which is used as part of the stopping criteria during training. | Integer percentage (default: 30) |

Experiment with HOLDOUT_PERCENTAGE and SAMPLE_SIZE to find the best balance between model fit and generalization for your specific dataset.
To use the Logistic Regression algorithm in SSAS, define a mining model with a DMX CREATE MINING MODEL statement, for example:
CREATE MINING MODEL [Customer Churn Model]
(
    [CustomerID] LONG KEY,
    [Churn] TEXT DISCRETE PREDICT, -- The binary outcome (e.g., 'Yes', 'No')
    [Gender] TEXT DISCRETE,
    [Age] LONG CONTINUOUS,
    [AnnualIncome] DOUBLE CONTINUOUS,
    [ContractType] TEXT DISCRETE
)
USING Microsoft_Logistic_Regression
(
    -- Invoke feature selection only when there are more than 100 inputs
    MAXIMUM_INPUT_ATTRIBUTES = 100
)
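
Once the model is defined, it still has to be trained and queried. The following is a minimal sketch of those two steps, not a definitive recipe. It assumes a data source named [Customer Data Source] and a table dbo.Customers, both hypothetical names, and it assumes that [Churn] takes the states 'Yes' and 'No'. The customer values in the prediction query are made up.

```dmx
-- Step 1: train the model from a relational source.
-- [Customer Data Source] and dbo.Customers are hypothetical names.
INSERT INTO [Customer Churn Model]
(
    [CustomerID], [Churn], [Gender], [Age], [AnnualIncome], [ContractType]
)
OPENQUERY
(
    [Customer Data Source],
    'SELECT CustomerID, Churn, Gender, Age, AnnualIncome, ContractType FROM dbo.Customers'
)

-- Step 2: singleton prediction for one made-up customer.
-- Input columns are matched to model columns by name (NATURAL PREDICTION JOIN).
SELECT
    Predict([Churn]) AS [PredictedChurn],
    PredictProbability([Churn], 'Yes') AS [ChurnProbability]
FROM [Customer Churn Model]
NATURAL PREDICTION JOIN
(
    SELECT
        'Female' AS [Gender],
        42 AS [Age],
        55000 AS [AnnualIncome],
        'Month-to-month' AS [ContractType]
) AS t
```

Run each statement on its own. Predict returns the most likely state of [Churn], and PredictProbability returns the probability the model assigns to the state 'Yes' for the supplied inputs.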