Scikit-learn: Classification
Classification is a supervised learning task where the goal is to assign an input data point to one of several predefined categories or classes. Scikit-learn provides a comprehensive suite of algorithms and tools for building and evaluating classification models.
Key Concepts in Classification
- Features: Input variables used to make predictions.
- Labels/Classes: The target categories to predict.
- Training Set: Data used to train the model.
- Testing Set: Data used to evaluate the trained model's performance.
- Overfitting: When a model performs too well on training data but poorly on unseen data.
- Underfitting: When a model is too simple to capture the underlying patterns in the data.
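Overfitting shows up in practice as a gap between training and test performance. A minimal sketch, using a synthetic dataset and an unconstrained decision tree (both chosen purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with some label noise (parameters are illustrative)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# An unconstrained tree can memorize the training set, noise included
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Train accuracy:", deep_tree.score(X_train, y_train))  # near-perfect
print("Test accuracy:", deep_tree.score(X_test, y_test))     # noticeably lower
```

The large train/test gap is the overfitting signature; constraining the model (e.g. `max_depth`) typically narrows it.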
Common Classification Algorithms in Scikit-learn
1. Logistic Regression
Despite its name, Logistic Regression is a classification algorithm. It models the probability that a given input belongs to a particular class; while it is most often introduced for binary (two-class) problems, scikit-learn's implementation also handles multiclass targets.
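The snippets in this section assume a feature matrix X and a label vector y already exist. For a runnable example, they can come from one of scikit-learn's built-in datasets, e.g. the binary breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer

# Built-in binary classification dataset: 569 samples, 30 numeric features
X, y = load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)
```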
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Assuming X contains features and y contains labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)  # raise max_iter if the solver warns about convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
2. Support Vector Machines (SVM)
SVMs find an optimal hyperplane that best separates data points of different classes. Kernel functions let them operate in an implicit high-dimensional feature space, making them effective for both linear and non-linear classification.
from sklearn.svm import SVC
svm_model = SVC(kernel='linear') # or 'rbf' for non-linear
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {accuracy_svm:.2f}")
3. Decision Trees
Decision trees create a flowchart-like structure where internal nodes represent feature tests, branches represent outcomes, and leaf nodes represent class labels.
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(max_depth=5) # Limit depth to prevent overfitting
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Accuracy: {accuracy_dt:.2f}")
4. Random Forests
An ensemble method that builds multiple decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It generally offers better accuracy and robustness than a single decision tree.
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
5. K-Nearest Neighbors (KNN)
KNN is a non-parametric, instance-based learning algorithm. It classifies a data point based on the majority class among its 'k' nearest neighbors in the feature space.
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {accuracy_knn:.2f}")
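Because KNN is distance-based, features on very different scales can dominate the neighbor computation, so it is usually paired with feature scaling. A common pattern chains a scaler and the classifier in a Pipeline (sketched here on a synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler puts features on comparable ranges before distances are computed
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_train, y_train)
print(f"Scaled KNN accuracy: {pipe.score(X_test, y_test):.2f}")
```

The same pattern applies to SVMs, which are also sensitive to feature scale.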
Evaluating Classification Models
Beyond accuracy, several metrics are crucial for evaluating classification models:
- Confusion Matrix: A table summarizing prediction results.
- Precision: The fraction of predicted positives that are truly positive, TP / (TP + FP).
- Recall (Sensitivity): The fraction of actual positives the classifier finds, TP / (TP + FN).
- F1-Score: The harmonic mean of precision and recall.
- ROC Curve and AUC: Receiver Operating Characteristic curve and Area Under the Curve, useful for visualizing and comparing binary classifiers.
from sklearn.metrics import confusion_matrix, classification_report
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
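The ROC curve and AUC mentioned above are computed from predicted probabilities rather than hard 0/1 predictions. A sketch for a binary classifier (assuming the fitted model exposes predict_proba, as LogisticRegression does):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability scores for the positive class, not hard predictions
y_scores = clf.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, y_scores):.2f}")

# fpr/tpr pairs trace the ROC curve and can be plotted with matplotlib
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
```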
Scikit-learn offers a rich ecosystem for handling various classification tasks. Experiment with different algorithms, tune hyperparameters, and use appropriate evaluation metrics to build robust predictive models.
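Hyperparameter tuning can be automated with GridSearchCV, which cross-validates every combination in a parameter grid. A sketch with an SVM (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Hypothetical search grid; ranges should be adapted to the problem at hand
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

After fitting, `search.best_estimator_` is the model refit on all data with the best-found parameters.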
For more in-depth information, refer to the official Scikit-learn documentation on Classification.