Data Science with Python

Supervised Learning with Python

Supervised learning is a type of machine learning in which an algorithm learns from labeled data: the training dataset provides both the input features and the corresponding correct output for each example. The goal is to learn a mapping function that can predict the output for new, unseen input data.

Figure: A typical supervised learning process.
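
To make the idea of labeled data concrete, the short sketch below builds a tiny, invented dataset in the format Scikit-learn expects: a 2-D array X of input features (one row per example) and a 1-D array y holding the known correct output for each row. The values themselves are made up purely for illustration.

import numpy as np

# Each row of X is one example; each entry of y is its known label.
X = np.array([
    [2.0, 150.0],   # features for example 1 (values are invented)
    [1.0,  80.0],   # features for example 2
    [3.0, 200.0],   # features for example 3
])
y = np.array([1, 0, 1])  # one known label per row of X

print(X.shape, y.shape)  # (3, 2) (3,)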

Key Concepts

Types of Supervised Learning

Supervised learning problems are typically divided into two main categories:

1. Classification

In classification, the target variable is a category. The algorithm learns to assign data points to predefined classes.

Common algorithms include logistic regression, k-nearest neighbors, support vector machines, decision trees, random forests, and naive Bayes.
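
These classifiers all share Scikit-learn's common estimator interface, so they can be swapped with minimal code changes. The sketch below illustrates this with two arbitrarily chosen classifiers on a small synthetic dataset; it is meant only as an illustration of the shared fit/predict pattern.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic classification problem, purely for illustration
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Different classifiers, identical fit/predict interface
for clf in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict(X[:3]))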

2. Regression

In regression, the target variable is a continuous value. The algorithm learns to predict a numerical output.

Common algorithms include linear regression, ridge and lasso regression, decision tree regression, random forest regression, support vector regression, and gradient boosting.
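
Regression estimators follow the same fit/predict pattern as classifiers but return continuous numbers rather than class labels. The sketch below uses a RandomForestRegressor on a synthetic dataset, chosen only as an example.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem, purely for illustration
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)

# Predictions are continuous values rather than class labels
print(reg.predict(X[:3]))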

Popular Python Libraries

The most prominent library for supervised learning in Python is Scikit-learn (sklearn). It provides efficient tools for data analysis and machine learning, including a wide range of supervised learning algorithms, as well as tools for preprocessing data, model selection, and evaluation.
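To show how preprocessing and a model fit together in Scikit-learn, the sketch below chains a StandardScaler and a LogisticRegression in a Pipeline; the synthetic dataset and the particular steps are assumptions made only for demonstration, not a required workflow.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and model combined into a single estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.2f}")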

Example: Logistic Regression for Binary Classification

This example demonstrates how to use Scikit-learn for a simple binary classification task.


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

# Generate synthetic data for demonstration
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=0, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Predict for a new sample
new_sample = [[0.5, -0.2, 1.1, 0.0, -0.8, 0.3, 1.5, -0.1, 0.6, -0.4]] # Example new data
prediction = model.predict(new_sample)
print(f"Prediction for new sample: {prediction[0]}")
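
Beyond a single accuracy score, it is often useful to inspect predicted class probabilities and a per-class breakdown. The short continuation below reuses the model, X_test, y_test, and y_pred variables from the example above; predict_proba and classification_report are just two of several options.

from sklearn.metrics import classification_report

# Class membership probabilities for the first few test samples
print(model.predict_proba(X_test[:3]))

# Precision, recall, and F1-score per class
print(classification_report(y_test, y_pred))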

Example: Linear Regression for a Regression Task

This example demonstrates how to use Scikit-learn for a simple linear regression task.


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
import numpy as np

# Generate synthetic data for demonstration
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Predict for a new sample
new_sample = np.array([[1.5]]) # Example new data
prediction = model.predict(new_sample)
print(f"Prediction for new sample: {prediction[0]:.2f}")
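
It can also help to inspect what the model actually learned and to report R² alongside the mean squared error. The continuation below reuses the model, y_test, and y_pred variables from the example above.

from sklearn.metrics import r2_score

# The learned slope and intercept of the fitted line
print(f"Coefficient: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")

# R^2: the proportion of variance in y explained by the model
print(f"R^2 on test data: {r2_score(y_test, y_pred):.2f}")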

Model Evaluation

After training a supervised learning model, it is crucial to evaluate its performance on unseen data. The appropriate metrics depend on the task: classification models are commonly judged by accuracy, precision, recall, and F1-score, while regression models are typically judged by mean squared error, mean absolute error, and R².

Scikit-learn provides extensive tools for model evaluation within the sklearn.metrics module.
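
As a brief sketch of what is available, the snippet below calls a few commonly used functions from sklearn.metrics; the true and predicted values passed in are invented placeholders, used only to show the function calls.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix,
                             mean_squared_error, mean_absolute_error, r2_score)

# Toy classification results, invented purely to demonstrate the metric calls
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
print("Accuracy :", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall   :", recall_score(y_true_cls, y_pred_cls))
print("F1-score :", f1_score(y_true_cls, y_pred_cls))
print("Confusion matrix:\n", confusion_matrix(y_true_cls, y_pred_cls))

# Toy regression results, also invented
y_true_reg = [2.5, 0.0, 2.1, 7.8]
y_pred_reg = [3.0, -0.1, 2.0, 7.2]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))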

Next Steps

To deepen your understanding, explore the various algorithms available in Scikit-learn, learn about feature engineering, hyperparameter tuning, and cross-validation techniques. Understanding how to choose the right algorithm and evaluate its performance is key to building effective machine learning models.
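
As a starting point for cross-validation and hyperparameter tuning, the sketch below runs cross_val_score and GridSearchCV on a synthetic dataset; the model and the grid of C values are arbitrary choices made only for demonstration.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# 5-fold cross-validation of a single model
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("Cross-validation accuracy:", scores.mean())

# Grid search over the regularization strength C (values chosen arbitrarily)
grid = GridSearchCV(LogisticRegression(), param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print("Best C:", grid.best_params_["C"], "Best score:", round(grid.best_score_, 2))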

Continue to the Unsupervised Learning section to explore algorithms that learn from unlabeled data.