Bagging, short for Bootstrap Aggregating, is an ensemble technique that improves the stability and accuracy of machine learning models. It reduces variance and helps prevent overfitting, especially for high-variance learners such as decision trees.
The core idea behind bagging is to train multiple instances of the same base learning algorithm on different subsets of the training data. These subsets are created using a process called bootstrapping, where samples are drawn with replacement from the original dataset. This means some data points might be selected multiple times, while others might not be selected at all for a particular subset.
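As a minimal sketch of bootstrapping itself (using NumPy, independent of scikit-learn): drawing a sample of the same size with replacement duplicates some points and leaves others out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy "dataset" of 10 points

# Bootstrap sample: same size as the original, drawn with replacement
sample = rng.choice(data, size=len(data), replace=True)

# Points never drawn are "out-of-bag" for this sample
oob = np.setdiff1d(data, sample)
print("Bootstrap sample:", sample)
print("Out-of-bag points:", oob)
```

On average, each bootstrap sample contains about 63% of the unique original points (1 - 1/e); the remainder are out-of-bag.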
For regression problems, the final prediction is typically the average of the predictions from all base learners. For classification problems, the final prediction is usually determined by a majority vote among the base learners.
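Both aggregation rules can be sketched with hypothetical predictions from three base learners (toy numbers, not tied to any dataset):

```python
import numpy as np

# Hypothetical binary class predictions from 3 learners for 4 samples
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])

# Classification: majority vote per sample (column-wise)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [0 1 1 0]

# Hypothetical numeric predictions from 3 learners for 2 samples
preds = np.array([[2.0, 3.5],
                  [2.4, 3.1],
                  [2.6, 3.0]])

# Regression: average the learners' predictions per sample
print(preds.mean(axis=0))  # roughly [2.33, 3.2]
```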
Scikit-learn provides convenient implementations for bagging, most notably through the BaggingClassifier and BaggingRegressor classes.
Let's illustrate with a classification example using a decision tree as the base learner.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the base learner (e.g., a Decision Tree)
base_tree = DecisionTreeClassifier(random_state=42)
# Initialize the Bagging Classifier
# n_estimators: The number of base estimators in the ensemble.
# max_samples: The number of samples to draw from X to train each base estimator.
# max_features: The number of features to draw from X to train each base estimator.
bagging_clf = BaggingClassifier(
    estimator=base_tree,  # named base_estimator before scikit-learn 1.2
    n_estimators=10,
    max_samples=0.8,
    max_features=0.8,
    random_state=42,
)
# Train the Bagging Classifier
bagging_clf.fit(X_train, y_train)
# Make predictions
y_pred = bagging_clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Classifier Accuracy: {accuracy:.4f}")
# You can also inspect the individual fitted estimators
# print(f"Number of fitted estimators: {len(bagging_clf.estimators_)}")
# print(f"First estimator's feature importances: {bagging_clf.estimators_[0].feature_importances_}")
Similarly, for regression tasks, you can use BaggingRegressor.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
# Generate a synthetic dataset for regression
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the base learner (e.g., a Decision Tree Regressor)
base_tree_reg = DecisionTreeRegressor(random_state=42)
# Initialize the Bagging Regressor
bagging_reg = BaggingRegressor(
    estimator=base_tree_reg,  # named base_estimator before scikit-learn 1.2
    n_estimators=10,
    max_samples=0.8,
    max_features=0.8,
    random_state=42,
)
# Train the Bagging Regressor
bagging_reg.fit(X_train, y_train)
# Make predictions
y_pred_reg = bagging_reg.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred_reg)
print(f"Bagging Regressor Mean Squared Error: {mse:.4f}")
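To see the variance reduction in action, one can compare a single tree against a bagged ensemble on the same split. The sketch below reuses the same kind of synthetic data as above; exact numbers depend on the random seed.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A single deep tree overfits; bagging many such trees averages the noise out
single = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
bagged = BaggingRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

mse_single = mean_squared_error(y_test, single.predict(X_test))
mse_bagged = mean_squared_error(y_test, bagged.predict(X_test))
print(f"Single tree MSE: {mse_single:.2f}")
print(f"Bagged (50 trees) MSE: {mse_bagged:.2f}")
```

On data like this, the bagged ensemble's test MSE is typically well below the single tree's.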
The BaggingClassifier and BaggingRegressor classes offer several parameters to control the bagging process:
estimator: The estimator used as the base learner. Defaults to a decision tree. (This parameter was named base_estimator before scikit-learn 1.2.)
n_estimators: The number of base estimators to train. More estimators generally improve performance but increase computation time.
max_samples: The number of samples to draw from the training dataset for each base estimator. Can be an integer (count) or a float (fraction).
max_features: The number of features to draw from the dataset for each base estimator. Can be an integer (count) or a float (fraction).
bootstrap: Whether samples are drawn with replacement (True) or without replacement (False). Defaults to True.
bootstrap_features: Whether features are drawn with replacement (True) or without replacement (False). Defaults to False.
oob_score: Whether to use out-of-bag samples to estimate the generalization error. This can provide an estimate of the model's performance without needing a separate validation set.
When oob_score=True, scikit-learn automatically calculates this estimate after fitting. This is particularly useful when you have limited data and want to avoid splitting it further for validation. For classifiers, it computes the average accuracy; for regressors, it computes the R-squared score.
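For example, a short sketch of using the out-of-bag estimate (no separate validation split needed; the dataset here is synthetic):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# With bootstrap=True (the default), each estimator leaves some samples
# unseen; those out-of-bag samples are used to score the ensemble.
clf = BaggingClassifier(n_estimators=50, oob_score=True, random_state=42)
clf.fit(X, y)
print(f"OOB accuracy estimate: {clf.oob_score_:.4f}")
```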
Bagging is a good choice when the base learner is powerful but unstable, that is, when small changes in the training data produce large changes in the fitted model. Deep decision trees are the classic example: individually they overfit, but averaging many bootstrapped trees cancels out much of that variance.
Bagging is a fundamental ensemble technique that forms the basis for more advanced methods like Random Forests. Understanding bagging is crucial for mastering ensemble learning in machine learning.
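As a point of comparison, a Random Forest is essentially bagging of decision trees plus extra randomness in the features considered at each split; scikit-learn provides it as a dedicated class. A minimal sketch on the same kind of synthetic data as the earlier classification example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagged trees with per-split feature subsampling built in
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.4f}")
```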