Azure AI Machine Learning

Quickstart: Text Classification with Natural Language Processing

Introduction to Text Classification

Text classification, also known as text categorization, is the process of assigning categories or tags to text based on its content. This is a fundamental task in Natural Language Processing (NLP) with numerous applications, including sentiment analysis, spam detection, topic labeling, and intent recognition.

In this quickstart, you will learn how to use Azure AI Machine Learning to build and deploy a text classification model. We will cover the entire workflow, from setting up your environment to deploying a scalable endpoint.

Prerequisites

  • An Azure subscription. If you don't have one, you can create a free account.
  • An Azure AI Machine Learning workspace.
  • Python 3.7 or later installed on your machine.
  • The Azure CLI installed and configured.
  • The Azure ML SDK for Python installed. You can install it using pip:
    pip install azure-ai-ml azure-identity

Set Up the Azure ML Environment

First, ensure you have created an Azure AI Machine Learning workspace. You can do this via the Azure portal or programmatically. Once your workspace is ready, you can connect to it using the Azure ML SDK.


from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate and get a handle to the workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="YOUR_SUBSCRIPTION_ID",
    resource_group_name="YOUR_RESOURCE_GROUP",
    workspace_name="YOUR_WORKSPACE_NAME",
)

print(f"Connected to workspace: {ml_client.workspace_name}")
                

Replace YOUR_SUBSCRIPTION_ID, YOUR_RESOURCE_GROUP, and YOUR_WORKSPACE_NAME with your actual Azure resource details.
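
If you haven't created a workspace yet, you can also do that from the SDK. The following is a minimal sketch, assuming a client scoped to the subscription and resource group and an illustrative region:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace
from azure.identity import DefaultAzureCredential

# A client scoped to the subscription and resource group (no workspace yet)
sub_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="YOUR_SUBSCRIPTION_ID",
    resource_group_name="YOUR_RESOURCE_GROUP",
)

# Define and create the workspace; the location is illustrative
workspace = Workspace(name="YOUR_WORKSPACE_NAME", location="eastus")
sub_client.workspaces.begin_create(workspace).result()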

Data Preparation

For text classification, your data typically consists of text documents and their corresponding labels. For this quickstart, we'll use a sample dataset. You'll need to upload this data to your Azure ML workspace's default datastore or a specific datastore.

Let's assume you have a CSV file with two columns: 'text' and 'label'.
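
For reference, here is a minimal sketch of how such a file might be produced; the rows are made-up examples for a sentiment-style task:

import pandas as pd

# Hypothetical labeled examples; replace with your own data
sample = pd.DataFrame({
    "text": [
        "The battery lasts all day and charging is fast.",
        "The package arrived damaged and support never replied.",
    ],
    "label": ["positive", "negative"],
})
sample.to_csv("train.csv", index=False)

If you keep the file locally, you can also point the data asset's path at the local file (for example, path="./train.csv") when you register it below, and the SDK uploads it to the workspace datastore for you.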


from azure.ai.ml.entities import Data

# Define the data asset
text_classification_data = Data(
    path="azureml://datastores/workspaceblobstore/paths/text_classification/train.csv", # Update with your data path
    type="uri_file",
    description="Text classification dataset for training",
    name="text-classification-dataset"
)

# Create the data asset
ml_client.data.create_or_update(text_classification_data)
print(f"Data asset '{text_classification_data.name}' created.")
                

Model Training

Azure AI Machine Learning allows you to train models using various frameworks like scikit-learn, TensorFlow, or PyTorch. We'll use a simple scikit-learn pipeline for this example.

Create a Python script (e.g., train.py) that contains your training logic:


# train.py
import argparse
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import pandas as pd
import joblib

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, help="Path to the training data")
    parser.add_argument("--model_output", type=str, help="Path to save the trained model")
    args = parser.parse_args()

    # Load data
    data = pd.read_csv(args.data_path)
    X_train = data['text']
    y_train = data['label']

    # Create a pipeline
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', LogisticRegression(solver='liblinear', multi_class='auto'))
    ])

    # Train the model
    pipeline.fit(X_train, y_train)

    # Save the model into the output folder provided by the job
    os.makedirs(args.model_output, exist_ok=True)
    model_file = os.path.join(args.model_output, "model.pkl")
    joblib.dump(pipeline, model_file)
    print(f"Model saved to {model_file}")

if __name__ == "__main__":
    main()

Now, submit this script as a job to Azure ML:


from azure.ai.ml import command, Input, Output
from azure.ai.ml.entities import Environment

# Define the command job
job = command(
    code="./src",  # Directory containing train.py and potentially other scripts
    command="python train.py --data_path ${{inputs.training_data}} --model_output ${{outputs.model_output_path}}",
    inputs={
        "training_data": Input(type="uri_file", path="azureml:text-classification-dataset@latest") # Reference the registered data asset
    },
    outputs={
        "model_output_path": Output(type="uri_folder", path="azureml://datastores/workspaceblobstore/paths/models/text_classification")
    },
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest", # Use a curated environment
    compute="your-compute-cluster-name" # Your compute cluster name
)

# Submit the job
returned_job = ml_client.jobs.create_or_update(job)
print(f"Job submitted: {returned_job.name}")
ml_client.jobs.stream(returned_job.name)

Ensure ./src contains your train.py script. Also, replace your-compute-cluster-name with the name of your Azure ML compute cluster.
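
If you don't already have a compute cluster, you can create one with the SDK. A minimal sketch, assuming the illustrative name cpu-cluster and a small CPU VM size:

from azure.ai.ml.entities import AmlCompute

# Create (or update) a small autoscaling CPU cluster; the name and VM size are illustrative
cpu_cluster = AmlCompute(
    name="cpu-cluster",
    size="STANDARD_DS3_V2",
    min_instances=0,
    max_instances=2,
    idle_time_before_scale_down=120,
)
ml_client.compute.begin_create_or_update(cpu_cluster).result()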

Model Deployment

Once your model is trained, you can deploy it as a web service endpoint for real-time predictions.


from azure.ai.ml.entities import (
    Model,
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    CodeConfiguration,
)

# Register the trained model artifact
model = Model(
    name="text-classification-model",
    path="azureml://datastores/workspaceblobstore/paths/models/text_classification/model.pkl", # Path to your saved model artifact
    description="Scikit-learn text classification pipeline",
)
registered_model = ml_client.models.create_or_update(model)

print(f"Model registered: {registered_model.id}")

# Create an online endpoint
endpoint_name = "text-classification-endpoint"
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="Online endpoint for text classification",
    auth_mode="key"
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Create an online deployment
deployment_name = "text-classification-deployment"
online_deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=registered_model, # Use the registered model
    code_configuration=CodeConfiguration(code="./src", scoring_script="scoring.py"), # See the scoring script below
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest", # Replace with an environment that includes the inference server packages if needed
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(online_deployment).result()

# Route all traffic to the new deployment
endpoint.traffic = {deployment_name: 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(f"Model deployed to endpoint: {endpoint_name}")

You'll need to create a scoring.py script for your deployment that loads the model and defines the prediction logic. The deployment references it through code_configuration (the ./src folder and scoring_script name in the example above), so place scoring.py in that code folder rather than alongside the model artifact.
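
A minimal sketch of such a scoring script, assuming the model was registered as a single model.pkl file and that requests arrive as a JSON body with a "texts" field:

# scoring.py -- minimal sketch; the {"texts": [...]} payload shape is an assumption
import json
import os

import joblib

model = None

def init():
    # AZUREML_MODEL_DIR points at the folder where Azure ML mounts the registered model
    global model
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)

def run(raw_data):
    # Expects a request body such as {"texts": ["great product", "terrible service"]}
    texts = json.loads(raw_data)["texts"]
    predictions = model.predict(texts)
    return predictions.tolist()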

Next Steps

Congratulations! You've successfully deployed a text classification model on Azure AI Machine Learning. Here are some next steps:

  • Integrate with Applications: Use the endpoint's scoring URI and API key to integrate predictions into your applications; see the example after this list.
  • Monitor Performance: Monitor your deployed model's performance and retrain as needed.
  • Explore Advanced Techniques: Investigate more sophisticated NLP models like transformers for improved accuracy.
  • Batch Inference: For large datasets, consider batch inference for more efficient processing.
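
For example, once the deployment is live, you can call it from Python with the same ml_client handle. A sketch; the request body must match what scoring.py expects:

import json

# Hypothetical sample request matching the scoring script's expected payload
with open("sample-request.json", "w") as f:
    json.dump({"texts": ["The battery life on this laptop is fantastic."]}, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name="text-classification-endpoint",
    deployment_name="text-classification-deployment",
    request_file="sample-request.json",
)
print(response)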