Azure AI Machine Learning Designer Reference

This document provides a detailed reference for all modules available in the Azure AI Machine Learning Designer. It describes each module's purpose, parameters, and usage to help you build effective machine learning pipelines.

Data Input Modules

These modules bring data into your machine learning pipeline.

1. Datasets

Allows you to select pre-registered datasets within your Azure Machine Learning workspace.

Use Case: Load your training or testing data from your workspace's datastore.

Parameters:

Name	Description	Type	Required
Dataset	The specific dataset to load.	Dataset Reference	Yes

2. Enter Data Manually

Allows you to directly input small amounts of data in a tabular format.

Use Case: Create small, specific datasets for testing or simple tasks.

Parameters:

Name	Description	Type	Required
Data	Comma-separated values (CSV) representing the data.	String	Yes
Column Names	A comma-separated list of column headers.	String	No
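
As a rough illustration of how the Data and Column Names parameters fit together, the sketch below parses CSV text into row dictionaries using only the Python standard library. The function name parse_manual_data is hypothetical, and treating the first row as the header when Column Names is omitted is an assumption of this sketch, not documented module behavior.

```python
import csv
import io


def parse_manual_data(data, column_names=None):
    """Parse CSV text into a list of row dictionaries.

    If column_names is omitted, the first CSV row is treated as the header
    (an assumption of this sketch, not documented module behavior).
    """
    rows = list(csv.reader(io.StringIO(data.strip())))
    if column_names:
        header = [name.strip() for name in column_names.split(",")]
    else:
        header, rows = rows[0], rows[1:]
    # Values stay as strings; type conversion is left to later steps.
    return [dict(zip(header, row)) for row in rows]


records = parse_manual_data("1,red\n2,blue", column_names="id, color")
```

All values are returned as strings, so any numeric conversion would have to happen in a later transformation step.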

Data Transformation Modules

Transform and clean your data to prepare it for machine learning.

1. Select Columns in Dataset

Enables you to choose specific columns from your dataset based on various criteria.

Use Case: Remove irrelevant features or select columns for specific analysis.

Parameters:

Name	Description	Type	Required
Selection mode	Mode for selecting columns (e.g., 'With names', 'Range').	Enum	Yes
Column names	List of columns to select.	Column List	Yes (if mode is 'With names')
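
The two selection modes named above can be sketched in plain Python as follows. The helper select_columns and its start and end arguments are hypothetical, and interpreting 'Range' as 0-based inclusive column positions is an assumption made for illustration.

```python
def select_columns(rows, mode, column_names=None, start=0, end=None):
    """Return rows restricted to the chosen columns.

    mode "With names": keep the columns listed in column_names.
    mode "Range": keep columns at positions start..end (0-based, inclusive);
    these positional semantics are an assumption of this sketch.
    """
    if not rows:
        return []
    all_columns = list(rows[0].keys())
    if mode == "With names":
        if not column_names:
            raise ValueError("column_names is required when mode is 'With names'")
        missing = [c for c in column_names if c not in all_columns]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        keep = list(column_names)
    elif mode == "Range":
        stop = None if end is None else end + 1
        keep = all_columns[start:stop]
    else:
        raise ValueError(f"unsupported selection mode: {mode!r}")
    return [{c: row[c] for c in keep} for row in rows]


sample = [{"age": 30, "city": "Oslo", "label": 1}]
by_name = select_columns(sample, "With names", column_names=["age", "label"])
by_range = select_columns(sample, "Range", start=0, end=1)
```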

2. Clean Missing Data

Handles missing values in your dataset by imputation or removal.

Use Case: Address incomplete data points that could affect model performance.

Parameters:

Name	Description	Type	Required
Missing data handling mode	Method to use (e.g., 'Remove row', 'Replace with mean').	Enum	Yes
Imputation options (mean, median, etc.)	Settings for the selected imputation strategy; relevant only when the handling mode replaces values.	Number	No
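
To make the difference between the 'Remove row' and 'Replace with mean' modes concrete, here is a minimal sketch that operates on a single numeric column. The function clean_missing is hypothetical, and representing missing values as None is an assumption of this example.

```python
from statistics import mean


def clean_missing(rows, column, mode):
    """Handle None values in one numeric column.

    mode "Remove row": drop rows where the column is None.
    mode "Replace with mean": fill None with the mean of the observed values.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    if mode == "Remove row":
        return [row for row in rows if row[column] is not None]
    if mode == "Replace with mean":
        if not observed:
            raise ValueError(f"column {column!r} has no observed values")
        fill = mean(observed)
        # Build new dictionaries so the input rows are left unchanged.
        return [
            {**row, column: fill if row[column] is None else row[column]}
            for row in rows
        ]
    raise ValueError(f"unsupported mode: {mode!r}")


data = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
removed = clean_missing(data, "x", "Remove row")
imputed = clean_missing(data, "x", "Replace with mean")
```

Removal shrinks the dataset, while mean replacement keeps every row at the cost of reducing the column's variance.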

Model Training Modules

Train various machine learning models on your prepared data.

1. Train SVM Model

Trains a Support Vector Machine classifier.

Use Case: Classification tasks, especially with high-dimensional data.

Parameters:

Name	Description	Type	Required
Left dataset	Training data.	Dataset	Yes
Untrained model	An untrained SVM model object.	Model	Yes
Target column	The column to predict.	Column Name	Yes

2. Linear Regression

Trains a linear regression model for regression tasks.

Use Case: Predicting continuous values.

Parameters:

Name	Description	Type	Required
Left dataset	Training data.	Dataset	Yes
Formula	R-style formula for the model.	String	Yes
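
The Formula parameter can be illustrated with a deliberately small sketch: it accepts only formulas of the form "target ~ feature" and fits the line by ordinary least squares in plain Python. The function fit_simple_formula and the column names size and price are hypothetical, and real R-style formulas support much richer syntax than is handled here.

```python
def fit_simple_formula(rows, formula):
    """Fit y = intercept + slope * x by ordinary least squares.

    Only formulas of the form "target ~ feature" are handled; this is a
    simplification made for the sketch.
    """
    target, sep, feature = (part.strip() for part in formula.partition("~"))
    if sep != "~" or not target or not feature or "+" in feature:
        raise ValueError(f"expected 'target ~ feature', got {formula!r}")
    xs = [float(row[feature]) for row in rows]
    ys = [float(row[target]) for row in rows]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature has zero variance")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


train = [{"size": 1, "price": 3}, {"size": 2, "price": 5}, {"size": 3, "price": 7}]
intercept, slope = fit_simple_formula(train, "price ~ size")
```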

Model Evaluation Modules

Assess the performance of your trained models.

1. Score Model

Applies a trained model to a dataset to generate predictions.

Use Case: Get predictions on test data or new data.

Parameters:

Name	Description	Type	Required
Trained model	The trained model to apply.	Model	Yes
Dataset	Data to score.	Dataset	Yes

2. Evaluate Model

Calculates various metrics to evaluate the performance of a classification or regression model.

Use Case: Understand accuracy, precision, recall, AUC, RMSE, etc.

Parameters:

Name	Description	Type	Required
Scored dataset	Dataset with predicted values.	Dataset	Yes
Actual column	The column containing the true labels.	Column Name	Yes
Scored probabilities column	The column containing predicted probabilities (for classification).	Column Name	No
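
Several of the metrics mentioned above can be computed directly from the actual and predicted columns. The sketch below covers accuracy, precision, and recall for binary classification and RMSE for regression; the function names are hypothetical, and AUC is omitted because it requires the scored probabilities rather than the predicted labels.

```python
import math


def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, and recall for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when no positives are predicted/present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def rmse(actual, predicted):
    """Root mean squared error for regression predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


metrics = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
error = rmse([3.0, 5.0], [2.0, 7.0])
```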

Scoring & Deployment Modules

Prepare models for deployment and generate scoring scripts.

1. Execute Python Script

Allows you to run custom Python code within your pipeline.

Use Case: Implement complex logic not covered by built-in modules, custom data preprocessing, or post-processing.

Parameters:

Name	Description	Type	Required
Script file	The Python script file (.py) to execute.	File Path	Yes
Module input 1, 2, ...	Input datasets/models to the script.	Dataset/Model	No
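
A script for this module might look like the sketch below. It assumes the azureml_main entry-point convention, in which the function receives up to two pandas DataFrames and returns a sequence of DataFrames; that convention is not stated in this reference, so verify it against the current product documentation. The row_total column is a hypothetical post-processing step chosen only for illustration.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point assumed to be called with up to two input datasets."""
    # Work on a copy so the input dataset is not modified in place.
    result = dataframe1.copy()
    # Hypothetical post-processing: add a row-wise total of the numeric columns.
    result["row_total"] = result.select_dtypes("number").sum(axis=1)
    # Return a sequence of DataFrames (here a one-element tuple).
    return (result,)


example = pd.DataFrame({"a": [1, 2], "b": [10, 20], "name": ["x", "y"]})
(output,) = azureml_main(example)
```

Returning a copy keeps the script free of side effects on its inputs, which makes it easier to test outside the pipeline.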

Utility Modules

Commonly used modules for pipeline management and data handling.

1. Split Data

Divides a dataset into two subsets.

Use Case: Creating training and testing sets.

Parameters:

Name	Description	Type	Required
Fraction of the first subset	The proportion of data for the first output.	Number (0.0 to 1.0)	Yes
Stratified split	Whether to perform stratified sampling.	Boolean	No
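
The effect of the two parameters can be sketched with the standard library alone. The function split_data is hypothetical; instead of a Boolean flag it takes a stratify_by column name, and the seed argument is added here only to make the example reproducible. Both are assumptions of this sketch, not module parameters.

```python
import random
from collections import defaultdict


def split_data(rows, fraction, stratify_by=None, seed=0):
    """Split rows into two lists; the first receives about `fraction` of them.

    With stratify_by set, the split is applied separately to each distinct
    value of that column so both outputs keep similar class proportions.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        key = row[stratify_by] if stratify_by is not None else None
        groups[key].append(row)
    first, second = [], []
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * fraction)
        first.extend(shuffled[:cut])
        second.extend(shuffled[cut:])
    return first, second


rows = [{"id": i, "label": i % 2} for i in range(20)]
train_rows, test_rows = split_data(rows, 0.7, stratify_by="label")
```

With stratification, each label contributes the same share of its rows to the first subset, which matters most when classes are imbalanced.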