Data Science with Python

Supervised Learning with Python

Supervised learning is a type of machine learning in which an algorithm learns from labeled data: the training dataset provides both the input features and the corresponding correct output for each example. The goal is to learn a mapping function that can predict the output for new, unseen input data.

Figure: A typical supervised learning process.
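
To make the idea of labeled data concrete, the short sketch below builds a tiny, invented dataset in the format Scikit-learn expects: a 2-D array X of input features (one row per example) and a 1-D array y holding the known correct output for each row. The values themselves are made up purely for illustration.

import numpy as np

# Each row of X is one example; each entry of y is its known label.
X = np.array([
    [2.0, 150.0],   # features for example 1 (values are invented)
    [1.0,  80.0],   # features for example 2
    [3.0, 200.0],   # features for example 3
])
y = np.array([1, 0, 1])  # one known label per row of X

print(X.shape, y.shape)  # (3, 2) (3,)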

Key Concepts

Types of Supervised Learning

Supervised learning problems are typically divided into two main categories:

1. Classification

In classification, the target variable is a category. The algorithm learns to assign data points to predefined classes.

Common algorithms include logistic regression, k-nearest neighbors, support vector machines, decision trees, random forests, and naive Bayes.
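
These classifiers all share Scikit-learn's common estimator interface, so they can be swapped with minimal code changes. The sketch below illustrates this with two arbitrarily chosen classifiers on a small synthetic dataset; it is meant only as an illustration of the shared fit/predict pattern.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic classification problem, purely for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Different classifiers, identical fit/predict interface
for clf in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict(X[:3]))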

2. Regression

In regression, the target variable is a continuous value. The algorithm learns to predict a numerical output.

Common algorithms include linear regression, ridge and lasso regression, decision tree regression, random forest regression, support vector regression, and gradient boosting.
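
Regression estimators follow the same fit/predict pattern as classifiers but return continuous numbers rather than class labels. The sketch below uses a RandomForestRegressor on a synthetic dataset, chosen only as an example.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem, purely for illustration
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)

# Predictions are continuous values rather than class labels
print(reg.predict(X[:3]))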

Popular Python Libraries

The most prominent library for supervised learning in Python is Scikit-learn (sklearn). It provides efficient tools for data analysis and machine learning, including a wide range of supervised learning algorithms, as well as tools for preprocessing data, model selection, and evaluation.
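To show how preprocessing and a model fit together in Scikit-learn, the sketch below chains a StandardScaler and a LogisticRegression in a Pipeline; the synthetic dataset and the particular steps are assumptions made only for demonstration, not a required workflow.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and model combined into a single estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2f}")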

Example: Logistic Regression for Binary Classification

This example demonstrates how to use Scikit-learn for a simple binary classification task.


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate synthetic data for demonstration
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=0, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Predict for a new sample
new_sample = [[0.5, -0.2, 1.1, 0.0, -0.8, 0.3, 1.5, -0.1, 0.6, -0.4]] # Example new data
prediction = model.predict(new_sample)
print(f"Prediction for new sample: {prediction[0]}")
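
Beyond a single accuracy score, it is often useful to inspect predicted class probabilities and a per-class breakdown. The short continuation below reuses the model, X_test, y_test, and y_pred variables from the example above; predict_proba and classification_report are just two of several options.

from sklearn.metrics import classification_report

# Class membership probabilities for the first few test samples
print(model.predict_proba(X_test[:3]))

# Precision, recall, and F1-score per class
print(classification_report(y_test, y_pred))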

Example: Linear Regression for a Regression Task

This example demonstrates how to use Scikit-learn for a simple linear regression task.


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
import numpy as np

# Generate synthetic data for demonstration
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Predict for a new sample
new_sample = np.array([[1.5]]) # Example new data
prediction = model.predict(new_sample)
print(f"Prediction for new sample: {prediction[0]:.2f}")
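
It can also help to inspect what the model actually learned and to report R² alongside the mean squared error. The continuation below reuses the model, y_test, and y_pred variables from the example above.

from sklearn.metrics import r2_score

# The learned slope and intercept of the fitted line
print(f"Coefficient: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")

# R^2: the proportion of variance in y explained by the model
print(f"R^2 on test data: {r2_score(y_test, y_pred):.2f}")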

Model Evaluation

After training a supervised learning model, it is crucial to evaluate its performance on unseen data. The appropriate metrics depend on the task: classification models are commonly judged by accuracy, precision, recall, and F1-score, while regression models are typically judged by mean squared error, mean absolute error, and R².

Scikit-learn provides extensive tools for model evaluation within the sklearn.metrics module.
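
As a brief sketch of what is available, the snippet below calls a few commonly used functions from sklearn.metrics; the true and predicted values passed in are invented placeholders, used only to show the function calls.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix,
                             mean_squared_error, mean_absolute_error, r2_score)

# Toy classification results, invented purely to demonstrate the metric calls
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
print("Accuracy :", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall   :", recall_score(y_true_cls, y_pred_cls))
print("F1-score :", f1_score(y_true_cls, y_pred_cls))
print("Confusion matrix:\n", confusion_matrix(y_true_cls, y_pred_cls))

# Toy regression results, also invented
y_true_reg = [2.5, 0.0, 2.1, 7.8]
y_pred_reg = [3.0, -0.1, 2.0, 7.2]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))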

Next Steps

To deepen your understanding, explore the various algorithms available in Scikit-learn, learn about feature engineering, hyperparameter tuning, and cross-validation techniques. Understanding how to choose the right algorithm and evaluate its performance is key to building effective machine learning models.
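
As a starting point for cross-validation and hyperparameter tuning, the sketch below runs cross_val_score and GridSearchCV on a synthetic dataset; the model and the grid of C values are arbitrary choices made only for demonstration.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# 5-fold cross-validation of a single model
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("Cross-validation accuracy:", scores.mean())

# Grid search over the regularization strength C (values chosen arbitrarily)
grid = GridSearchCV(LogisticRegression(), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("Best C:", grid.best_params_["C"], "Best score:", round(grid.best_score_, 2))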

Continue to the Unsupervised Learning section to explore algorithms that learn from unlabeled data.