Scikit-learn Classification Basics
Welcome to our introductory tutorial on classification using Scikit-learn, a cornerstone library for machine learning in Python. Classification is a type of supervised learning where we predict a discrete label or category.
What is Classification?
In classification problems, our goal is to assign an input instance to one of several predefined categories. Examples include:
- Spam detection (spam vs. not spam)
- Image recognition (cat, dog, bird)
- Medical diagnosis (disease present or absent)
- Customer churn prediction (churn or not churn)
Getting Started with Scikit-learn
First, ensure you have Scikit-learn and its dependencies (NumPy and SciPy) installed. This tutorial also uses pandas to display results. If needed, you can install everything with pip:
pip install scikit-learn numpy scipy pandas
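To confirm the installation worked, you can print the installed version from Python (the exact version number will depend on your environment):
import sklearn
print(sklearn.__version__)  # e.g. 1.4.2; your version may differ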
A Simple Classification Example: Iris Dataset
We'll use the famous Iris dataset, which is included with Scikit-learn, to demonstrate a basic classification task. This dataset contains 150 samples of Iris flowers, each with four features (sepal length, sepal width, petal length, petal width) and three possible species (setosa, versicolor, virginica).
1. Importing Libraries and Loading Data
Let's start by importing the necessary modules and loading the dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
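Before modeling, it is worth a quick look at the data. A short sketch using the attributes that `load_iris` provides:
# Inspect the dataset: 150 samples, 4 features, 3 classes
print(X.shape)             # (150, 4)
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']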
2. Data Preparation: Splitting and Scaling
It's crucial to split our data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. We also often scale our features to ensure that features with larger values don't disproportionately influence the model.
# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Initialize StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
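If you want to verify what the scaler learned, `StandardScaler` exposes the per-feature statistics it fitted on the training data (the exact values will vary slightly with the split):
# Per-feature mean and standard deviation learned from the training set only
print(scaler.mean_)   # mean of each of the four features
print(scaler.scale_)  # standard deviation of each feature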
3. Choosing and Training a Classifier: K-Nearest Neighbors (KNN)
For this example, we'll use the K-Nearest Neighbors (KNN) algorithm. KNN is a simple, instance-based learning algorithm where a new data point is classified based on the majority class of its `k` nearest neighbors in the feature space.
# Initialize the KNN classifier with k=3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn.fit(X_train_scaled, y_train)
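To see the mechanism at work, `KNeighborsClassifier` exposes a `kneighbors` method that returns the distances and indices of a query point's nearest training neighbors. A quick sketch, run after fitting:
# Distances and training-set indices of the 3 nearest neighbors
# of the first test sample
distances, indices = knn.kneighbors(X_test_scaled[:1])
print(distances)         # distances in the scaled feature space
print(y_train[indices])  # the neighbors' labels; the majority class wins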
4. Making Predictions
Once the model is trained, we can use it to make predictions on our test set.
# Make predictions on the test set
y_pred = knn.predict(X_test_scaled)
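Besides hard labels, KNN can also report class probabilities, computed as the fraction of the k neighbors that belong to each class. A short sketch:
# Per-class probabilities for the first three test samples
# (each row sums to 1; with k=3 the values are multiples of 1/3)
print(iris.target_names)
print(knn.predict_proba(X_test_scaled[:3]))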
5. Evaluating the Model
We assess the performance of our classifier using various metrics. Accuracy is a common metric, but it's important to also consider precision, recall, and the F1-score, especially for imbalanced datasets. The confusion matrix provides a detailed breakdown of correct and incorrect classifications.
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Display the confusion matrix
print("\nConfusion Matrix:")
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)
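The raw matrix can be hard to read because its rows and columns are just class indices. One way to label it, using the pandas import from earlier:
# Rows are actual species, columns are predicted species
labeled_cm = pd.DataFrame(conf_matrix,
                          index=iris.target_names,
                          columns=iris.target_names)
print(labeled_cm)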
# Optional: Displaying results in a DataFrame for better readability
results_df = pd.DataFrame({
    'Actual': iris.target_names[y_test],
    'Predicted': iris.target_names[y_pred]
})
print("\nSample Predictions:")
print(results_df.head())
Tip: The `random_state` parameter in `train_test_split` ensures that the data splitting is reproducible. Using `stratify=y` is important for classification tasks to maintain the proportion of classes in both the training and testing sets.
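You can check the effect of `stratify=y` by counting labels in each split; with 150 samples split 80/20, each split keeps the dataset's even class balance (a quick check using the NumPy import from earlier):
# Count samples per class in each split
print(np.bincount(y_train))  # [40 40 40]
print(np.bincount(y_test))   # [10 10 10]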
Other Common Classification Algorithms in Scikit-learn
Scikit-learn offers a wide array of classification algorithms, including:
- Logistic Regression (`LogisticRegression`)
- Support Vector Machines (`SVC`)
- Decision Trees (`DecisionTreeClassifier`)
- Random Forests (`RandomForestClassifier`)
- Naive Bayes (`GaussianNB`)
Each algorithm has its own strengths and weaknesses, and the best choice often depends on the specific problem and dataset.
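As a starting point for experimenting, here is a minimal sketch that trains two of these alternatives on the same scaled split and compares their test accuracy:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Fit and score two alternative classifiers on the same split
for clf in [LogisticRegression(max_iter=200),
            RandomForestClassifier(random_state=42)]:
    clf.fit(X_train_scaled, y_train)
    print(f"{clf.__class__.__name__}: {clf.score(X_test_scaled, y_test):.4f}")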
Next Steps
This tutorial covered the fundamental steps of building a classification model. To deepen your understanding, consider exploring:
- Different evaluation metrics and their use cases.
- Hyperparameter tuning to optimize model performance.
- Cross-validation techniques for more robust evaluation (both are sketched after this list).
- Other classification algorithms.
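As a taste of hyperparameter tuning combined with cross-validation, `GridSearchCV` can search over values of `k` using 5-fold cross-validation on the training set:
from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search over k, using the training set only
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)          # the k with the best mean CV accuracy
print(f"{grid.best_score_:.4f}")  # that mean cross-validated accuracy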
Continue your learning journey with our Regression Basics tutorial!