Azure Machine Learning Documentation

How-To Guides: Prepare Data

Preparing Your Data for Azure Machine Learning

Effective data preparation is a critical step in building successful machine learning models. Azure Machine Learning provides a comprehensive set of tools and services to help you load, clean, transform, and organize your data for training and deployment.

Connecting to Your Data Sources

Azure Machine Learning can connect to a variety of data sources, both within Azure and externally. Common sources include:

  • Azure Blob Storage
  • Azure Data Lake Storage Gen2
  • Azure SQL Database
  • Azure Files
  • Local files and publicly accessible web URLs

To connect, you typically create a Datastore in Azure Machine Learning, which securely references your data storage location.
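As a rough sketch, a blob datastore can be defined declaratively in a CLI (v2) YAML file like the one below; the datastore, account, and container names are placeholders for illustration:

```yaml
# Hypothetical datastore definition (all names are placeholders).
name: my_blob_datastore
type: azure_blob
description: Datastore pointing to raw customer data.
account_name: mystorageaccount
container_name: customer-data
credentials:
  account_key: "<storage-account-key>"
```

A file like this can be submitted with the Azure ML CLI, for example via `az ml datastore create --file datastore.yml`.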

Supported Data Formats

Azure Machine Learning supports a wide range of data formats. When preparing your data, consider using formats that are efficient for large-scale processing, such as:

  • Parquet: a compressed, columnar format well suited to analytical workloads
  • Delta Lake: Parquet with transactional guarantees and table versioning
  • JSON Lines: one record per line, convenient for streaming and text data
  • Delimited text (CSV/TSV): widely compatible, but less efficient at scale

Registering Data Assets

Once your data is accessible via a datastore, you can register it as a Data Asset in Azure Machine Learning. Data assets provide a versioned and discoverable way to manage your datasets. They can be registered as:

  • uri_file: a single file, such as a CSV
  • uri_folder: a folder containing one or more files
  • mltable: a tabular abstraction that includes a schema and loading instructions

Registering data assets simplifies data access in your training scripts and ensures reproducibility.
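A data asset can likewise be described in a CLI (v2) YAML file; in this sketch the datastore name and path are placeholders:

```yaml
# Hypothetical data asset definition (names and paths are placeholders).
name: raw_customer_data
version: "1"
type: uri_file
description: Raw customer CSV exported from the source system.
path: azureml://datastores/my_blob_datastore/paths/raw/customers.csv
```

Registering it with `az ml data create --file data-asset.yml` makes the versioned asset available to training jobs by name.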

Techniques for Data Preparation

Data Cleaning

Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in your dataset. This may include:

  • Handling missing values through imputation or removal
  • Removing duplicate records
  • Detecting and treating outliers
  • Correcting inconsistent formatting and data types

You can perform data cleaning using Python libraries like Pandas, or leverage Azure Machine Learning's visual designer or custom scripts.
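A minimal sketch of these cleaning steps with Pandas; the column names and quality issues are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data exhibiting common quality issues:
# a duplicate row, missing ages, inconsistent casing, and an outlier.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, np.nan, np.nan, 29, 120],
    "city": ["Seattle", "seattle", "seattle", "Austin", "Austin"],
})

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Standardize inconsistent string formatting.
df["city"] = df["city"].str.title()

# Impute missing ages with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Clip implausible outliers to a sensible range.
df["age"] = df["age"].clip(lower=0, upper=100)

print(df)
```

Each step here maps to one of the cleaning tasks above; in practice the imputation and clipping rules should come from domain knowledge rather than fixed constants.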

Data Transformation

Data transformation involves converting data from one format or structure to another to make it more suitable for modeling. Common transformations include:

  • Normalization and scaling of numeric features
  • Encoding categorical variables (for example, one-hot encoding)
  • Binning continuous values into discrete ranges
  • Aggregating records to a different granularity

Azure Machine Learning pipelines can automate these transformations, ensuring consistency between training and inference.
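One way to keep transformations consistent between training and inference is to express them as a single fitted object. A sketch using scikit-learn's ColumnTransformer, with hypothetical columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical tabular data with one numeric and one categorical column.
df = pd.DataFrame({
    "income": [40000.0, 55000.0, 70000.0, 85000.0],
    "city": ["Seattle", "Austin", "Seattle", "Denver"],
})

# Scale numeric columns and one-hot encode categoricals in one object,
# so the exact same fitted transform can be reapplied at inference time.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["income"]),
    ("encode", OneHotEncoder(), ["city"]),
])

features = preprocess.fit_transform(df)
print(features.shape)  # 4 rows; 1 scaled column + 3 encoded city columns
```

Persisting the fitted `preprocess` object alongside the model is what guarantees the training-time and inference-time transformations match.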

Feature Engineering

Feature engineering is the process of creating new features from existing ones that can improve the performance of your machine learning models. This requires domain knowledge and creativity. Examples include:

  • Extracting date parts (day of week, month) from timestamps
  • Computing ratios or interactions between existing features
  • Aggregating transactional data per entity (for example, total spend per customer)
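A hedged sketch of a few engineered features using Pandas; the order data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical transactional data.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2024-01-05", "2024-03-10", "2024-02-20"]),
    "amount": [100.0, 50.0, 200.0],
})

# Date-part features extracted from a timestamp.
orders["order_month"] = orders["order_date"].dt.month
orders["order_dayofweek"] = orders["order_date"].dt.dayofweek

# Per-customer aggregate features.
per_customer = orders.groupby("customer_id")["amount"].agg(
    total_spend="sum", avg_order_value="mean"
).reset_index()

print(per_customer)
```

The aggregate table can then be joined back onto the entity-level dataset used for modeling.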

Data Splitting

Before training, it's essential to split your data into training, validation, and testing sets. This allows you to train your model on one subset, tune hyperparameters on another, and evaluate its generalization performance on unseen data.

Common splitting strategies include:

  • Random splits (for example, 70/15/15)
  • Stratified splits that preserve class proportions
  • Time-based splits for temporal data
  • K-fold cross-validation
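A three-way stratified split can be built from two calls to scikit-learn's train_test_split; the toy data below is illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# First carve out a held-out test set, stratified to preserve class balance...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ...then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```

Fixing `random_state` makes the split reproducible, which matters when comparing runs.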

Tip: Leverage Azure ML Pipelines

Automate your data preparation steps by building Azure Machine Learning pipelines. This ensures consistency and reproducibility throughout your machine learning lifecycle.

Example Workflow (Conceptual)

  1. Connect to Data: Create a Datastore for your Azure Blob Storage container.
  2. Register Data Asset: Register a CSV file from your datastore as a Data Asset named 'raw_customer_data'.
  3. Develop Preparation Script: Write a Python script using Pandas to:
    • Load 'raw_customer_data'.
    • Handle missing 'Age' values by imputing the mean.
    • One-hot encode the 'City' column.
    • Scale the 'Income' feature.
    • Save the prepared data as 'prepared_customer_data.parquet' to a new datastore location.
  4. Create Pipeline Step: Define a pipeline step that executes your preparation script, taking 'raw_customer_data' as input and producing 'prepared_customer_data.parquet' as output.
  5. Register Prepared Data: Register 'prepared_customer_data.parquet' as a new Data Asset for training.
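The preparation script in step 3 could look roughly like the following; the inline DataFrame stands in for loading 'raw_customer_data' (in Azure ML the input and output paths would be passed to the script as command-line arguments):

```python
import pandas as pd

# Stand-in for loading the 'raw_customer_data' asset.
raw = pd.DataFrame({
    "Age": [25, None, 40],
    "City": ["Seattle", "Austin", "Seattle"],
    "Income": [50000.0, 60000.0, 70000.0],
})

# Impute missing 'Age' values with the mean.
raw["Age"] = raw["Age"].fillna(raw["Age"].mean())

# One-hot encode the 'City' column.
prepared = pd.get_dummies(raw, columns=["City"], prefix="City")

# Min-max scale the 'Income' feature to [0, 1].
income = prepared["Income"]
prepared["Income"] = (income - income.min()) / (income.max() - income.min())

# Persisting as Parquet for the next pipeline step (requires pyarrow):
# prepared.to_parquet("prepared_customer_data.parquet", index=False)

print(prepared.columns.tolist())
```

In the pipeline step, the output location would be an Azure ML output binding rather than a local path.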

Best Practices

  • Version your data assets so experiments remain reproducible.
  • Automate preparation steps as pipeline components rather than ad hoc manual scripts.
  • Apply identical preparation logic at training and inference time.
  • Validate incoming data before it enters your pipelines.

Note: Schema Drift

Be mindful of schema drift, where the structure or data types of your input data change over time. Implement checks and validation within your data preparation pipelines to detect and handle such changes gracefully.
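One lightweight form of such a check is comparing the incoming frame against an expected schema; the expected columns and dtypes below are assumptions for this sketch:

```python
import pandas as pd

# Expected schema for incoming data (hypothetical columns and dtypes).
EXPECTED_SCHEMA = {"Age": "float64", "City": "object", "Income": "float64"}

def validate_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    for column in df.columns:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems

df = pd.DataFrame({"Age": [30.0], "City": ["Seattle"], "Income": [50000.0]})
print(validate_schema(df, EXPECTED_SCHEMA))  # []
```

Running a check like this as the first step of a preparation pipeline lets the job fail fast, or branch to a handling path, instead of silently training on drifted data.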