MSDN Community Learning Resources

Scikit-learn Classification Basics

Welcome to our introductory tutorial on classification using Scikit-learn, a cornerstone library for machine learning in Python. Classification is a type of supervised learning where we predict a discrete label or category.

What is Classification?

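In classification problems, our goal is to assign an input instance to one of several predefined categories. Common examples include flagging an email as spam or not spam, recognizing handwritten digits, and identifying a flower's species from its measurements.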

Getting Started with Scikit-learn

First, ensure you have Scikit-learn and its dependencies (NumPy and SciPy) installed. The example in this tutorial also uses pandas to display results. If any of these are missing, you can install them using pip:

pip install scikit-learn numpy scipy pandas
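To confirm the installation worked, a quick check like the following can help. It is only a minimal sketch: it imports the three core packages and prints whichever versions happen to be installed on your machine.

```python
import numpy
import scipy
import sklearn

# Print the installed version of each package; an ImportError here
# means the corresponding package is not installed in this environment.
for module in (sklearn, numpy, scipy):
    print(f"{module.__name__}: {module.__version__}")
```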

A Simple Classification Example: Iris Dataset

We'll use the famous Iris dataset, which is included with Scikit-learn, to demonstrate a basic classification task. This dataset contains 150 samples of Iris flowers, each with four features (sepal length, sepal width, petal length, petal width) and three possible species (setosa, versicolor, virginica).
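Before modeling, it can be useful to confirm these numbers yourself. The short sketch below loads the dataset and prints its shape, feature names, class names, and the number of samples per class; the variable name `class_counts` is just an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 rows (samples) by 4 columns (features)
print("Feature matrix shape:", iris.data.shape)
print("Feature names:", iris.feature_names)
print("Class names:", [str(name) for name in iris.target_names])

# Count how many samples belong to each of the three species
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")
```

The classes are perfectly balanced, which is one reason plain accuracy is a reasonable first metric on this dataset.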

1. Importing Libraries and Loading Data

Let's start by importing the necessary modules and loading the dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

2. Data Preparation: Splitting and Scaling

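It's crucial to split our data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. We also scale the features so that those with larger numeric ranges don't disproportionately influence the model; this matters especially for distance-based algorithms such as KNN. The scaler is fitted on the training set only, so that no information from the test set leaks into training.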

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
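To see what the split and scaling steps actually do, the self-contained sketch below repeats them and inspects the result. After `fit_transform`, each training feature should have a mean of about 0 and a standard deviation of about 1. The test set is transformed with the training statistics, so its means will usually be close to, but not exactly, 0.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print("Train shape:", X_train_scaled.shape, "| Test shape:", X_test_scaled.shape)
print("Train means:", X_train_scaled.mean(axis=0).round(3))
print("Train stds: ", X_train_scaled.std(axis=0).round(3))
print("Test means: ", X_test_scaled.mean(axis=0).round(3))
```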

3. Choosing and Training a Classifier: K-Nearest Neighbors (KNN)

For this example, we'll use the K-Nearest Neighbors (KNN) algorithm. KNN is a simple, instance-based learning algorithm where a new data point is classified based on the majority class of its 'k' nearest neighbors in the feature space.

# Initialize the KNN classifier with k=3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train_scaled, y_train)
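To make the majority-vote idea concrete, here is a from-scratch sketch of KNN using only the Python standard library. It is an illustration of the principle, not how Scikit-learn implements `KNeighborsClassifier`; the function `knn_predict` and the toy points are hypothetical and exist only for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated clusters labeled "a" and "b"
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_points, train_labels, (1.1, 1.0))
prediction_near_b = knn_predict(train_points, train_labels, (5.0, 5.0))
print(prediction_near_a, prediction_near_b)
```

With two classes, an odd `k` such as 3 avoids tied votes. With three or more classes, ties are still possible and need a tie-breaking rule.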

4. Making Predictions

Once the model is trained, we can use it to make predictions on our test set.

# Make predictions on the test set
y_pred = knn.predict(X_test_scaled)
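Besides hard labels, `KNeighborsClassifier` also provides `predict_proba`, which reports how the neighbors voted. The self-contained sketch below assumes the same split, scaling, and `k=3` settings used in this tutorial; with the default uniform weights, each probability is the fraction of the three neighbors belonging to that class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

# One row per test sample, one column per class; each row sums to 1
proba = knn.predict_proba(X_test_scaled)
y_pred = knn.predict(X_test_scaled)

for probs, label in zip(proba[:5], y_pred[:5]):
    print(probs.round(2), "->", iris.target_names[label])
```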

5. Evaluating the Model

We assess the performance of our classifier using various metrics. Accuracy is a common metric, but it's important to also consider precision, recall, and the F1-score, especially for imbalanced datasets. The confusion matrix provides a detailed breakdown of correct and incorrect classifications.

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Display the confusion matrix
print("\nConfusion Matrix:")
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Optional: Displaying results in a DataFrame for better readability
results_df = pd.DataFrame({
    'Actual': iris.target_names[y_test],
    'Predicted': iris.target_names[y_pred]
})
print("\nSample Predictions:")
print(results_df.head())
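The metrics in the classification report can all be derived from the confusion matrix. The sketch below recomputes them by hand to show the relationship. In Scikit-learn's confusion matrix, rows are true classes and columns are predicted classes, so precision divides the diagonal by column sums and recall divides it by row sums. The `np.maximum` guards are an assumption added here to avoid division by zero if a class is never predicted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.fit_transform(X_train), y_train)
y_pred = knn.predict(scaler.transform(X_test))

cm = confusion_matrix(y_test, y_pred)
true_positives = np.diag(cm)  # correctly classified samples per class

# Precision: of everything predicted as class c, how much really was c?
precision = true_positives / np.maximum(cm.sum(axis=0), 1)
# Recall: of everything that really was class c, how much did we find?
recall = true_positives / np.maximum(cm.sum(axis=1), 1)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
# Accuracy: all correct predictions over all predictions
accuracy = true_positives.sum() / cm.sum()

print("Precision:", precision.round(3))
print("Recall:   ", recall.round(3))
print("F1-score: ", f1.round(3))
print("Accuracy: ", round(float(accuracy), 4))
```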

Tip: The `random_state` parameter in `train_test_split` ensures that the data splitting is reproducible. Using `stratify=y` is important for classification tasks to maintain the proportion of classes in both the training and testing sets.
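Both points in the tip can be checked directly. The sketch below splits the Iris data twice with the same `random_state` and confirms the splits match, then compares per-class counts in a stratified and an unstratified test set. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same random_state twice: the two splits should be identical
_, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
_, X_test_b, _, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Reproducible:", np.array_equal(X_test_a, X_test_b))

# Without stratify, class counts in the test set depend on the shuffle
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)

print("Stratified train counts: ", np.bincount(y_train_a, minlength=3))
print("Stratified test counts:  ", np.bincount(y_test_a, minlength=3))
print("Unstratified test counts:", np.bincount(y_test_plain, minlength=3))
```

Because Iris has 50 samples per class, the stratified 80/20 split keeps 40 of each class for training and 10 of each for testing.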

Other Common Classification Algorithms in Scikit-learn

Scikit-learn offers a wide array of classification algorithms, including logistic regression, support vector machines, decision trees, random forests, and naive Bayes.

Each algorithm has its own strengths and weaknesses, and the best choice often depends on the specific problem and dataset.
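One practical way to compare algorithms on a given dataset is cross-validation. The sketch below is one possible setup, not a recommendation: the four candidate models, their settings (such as `max_iter=1000`), and the 5-fold choice are assumptions made for illustration. Each model is wrapped in a pipeline so that scaling is re-fitted on the training folds only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative candidates; any Scikit-learn classifier could be added here
candidates = {
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

mean_scores = {}
for name, model in candidates.items():
    # Scaling happens inside the pipeline, so each fold is scaled independently
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5)  # accuracy per fold
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a small, clean dataset like Iris the scores tend to be close together, so differences between models here should not be over-interpreted.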

Next Steps

This tutorial covered the fundamental steps of building a classification model. To deepen your understanding, consider exploring:

Continue your learning journey with our Regression Basics tutorial!