Pandas Data Manipulation
Welcome to the Pandas Data Manipulation module. Pandas is a powerful, open-source library built on top of NumPy, providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This module will guide you through the essential concepts and operations for manipulating data with Pandas.
What is Pandas?
Pandas introduces two primary data structures:
- Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
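To make the relationship between the two structures concrete, here is a minimal sketch. The names `ages`, `cities`, and `people` are illustrative and not part of the module's later examples.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
ages = pd.Series([25, 30, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages['Bob'])  # look up a value by its label

# A DataFrame built from a dictionary of Series that share the same index.
cities = pd.Series(['New York', 'Los Angeles', 'Chicago'],
                   index=['Alice', 'Bob', 'Charlie'], name='City')
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Selecting a single column of a DataFrame gives back a Series.
print(type(people['Age']))
```

Each column keeps its own data type, which is why a DataFrame can hold numbers and strings side by side.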
Getting Started
First, ensure you have Pandas installed. If not, you can install it using pip:
pip install pandas numpy
Then, import the library in your Python script:
import pandas as pd
import numpy as np
Creating DataFrames
You can create DataFrames from various sources, including dictionaries, lists of dictionaries, and NumPy arrays.
Example: DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 22, 35, 28],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
'Salary': [70000, 80000, 60000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 80000
2 Charlie 22 Chicago 60000
3 David 35 Houston 90000
4 Eve 28 Phoenix 75000
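The other sources mentioned above, a list of dictionaries and a NumPy array, work in much the same way. The following is a small sketch; the variable names and the column labels `a`, `b`, `c` are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row,
# and the dictionary keys become the column labels.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_records = pd.DataFrame(records)
print(df_records)

# From a 2-D NumPy array: column labels must be supplied explicitly,
# otherwise Pandas numbers them 0, 1, 2, ...
array = np.array([[1, 2, 3], [4, 5, 6]])
df_array = pd.DataFrame(array, columns=['a', 'b', 'c'])
print(df_array)
```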
Data Inspection
Once you have a DataFrame, you'll want to inspect its contents and structure.
- `df.head()`: Displays the first 5 rows.
- `df.tail()`: Displays the last 5 rows.
- `df.info()`: Provides a concise summary of the DataFrame, including index dtype and column dtypes, non-null values and memory usage.
- `df.describe()`: Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.
- `df.shape`: Returns a tuple representing the dimensionality of the DataFrame (rows, columns).
- `df.columns`: Returns an Index object containing the column labels.
- `df.index`: Returns an Index object representing the row labels.
Example: Inspecting the DataFrame
print("First 3 rows:\n", df.head(3))
print("\nDataFrame Info:\n")
df.info()
print("\nDescriptive Statistics:\n", df.describe())
print("\nShape:", df.shape)
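The remaining inspection tools from the list above (`tail()`, `columns`, and `index`) can be tried the same way. This sketch rebuilds the example DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 60000, 90000, 75000],
})

# Last 2 rows instead of the default 5
print("Last 2 rows:\n", df.tail(2))

# Column labels and row labels are both Index objects
print("\nColumns:", list(df.columns))
print("Index:", list(df.index))
```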
Data Selection and Indexing
Pandas provides powerful ways to select and index data.
- Selecting Columns: Use square brackets or dot notation.
- Selecting Rows: Use `.loc[]` (label-based) or `.iloc[]` (integer-location based).
- Boolean Indexing: Filter rows based on conditions.
Example: Selecting Data
# Select 'Name' column
print("Names:\n", df['Name'])
# Select 'Name' and 'Age' columns
print("\nNames and Ages:\n", df[['Name', 'Age']])
# Select rows by label (index) using .loc
print("\nRow with index 2:\n", df.loc[2])
# Select rows with labels 1 through 2 using .loc (label slices include both endpoints)
print("\nRows 1 to 2:\n", df.loc[1:2])
# Select rows by integer position using .iloc
print("\nFirst row (index 0):\n", df.iloc[0])
# Select rows at positions 0 and 1 using .iloc (the end position, 2, is excluded)
print("\nFirst 2 rows:\n", df.iloc[0:2])
# Select rows where Age > 30
print("\nEmployees older than 30:\n", df[df['Age'] > 30])
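A few variations on the selections above may be useful: dot notation for a column, selecting rows and columns together, and combining conditions in a boolean filter. This is a sketch using the same example data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 60000, 90000, 75000],
})

# Dot notation: only works when the column name is a valid Python
# identifier and does not clash with an existing DataFrame attribute.
names = df.Name

# .loc with rows and columns: labels 1 through 2 (inclusive), two columns
subset = df.loc[1:2, ['Name', 'City']]

# .iloc with rows and columns: positions 0-1 for both (end excluded)
first_two = df.iloc[0:2, 0:2]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
mask = (df['Age'] > 25) & (df['Salary'] < 85000)
filtered = df[mask]
print(filtered)
```

The parentheses around each condition matter because `&` binds more tightly than comparison operators in Python.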
Data Cleaning and Manipulation
Real-world data is often messy. Pandas offers tools to handle missing data, duplicates, and transform data.
- Handling Missing Data:
  - `df.isnull()` or `df.isna()`: Detect missing values.
  - `df.dropna()`: Remove rows or columns with missing values.
  - `df.fillna()`: Fill missing values with a specified value or method.
- Handling Duplicates:
  - `df.duplicated()`: Identify duplicate rows.
  - `df.drop_duplicates()`: Remove duplicate rows.
- Renaming Columns: Use `df.rename()`.
- Applying Functions: Use `df.apply()` to apply a function along an axis.
Example: Cleaning and Transforming Data
# Add a row with missing age
df_missing = df.copy()
df_missing.loc[5] = ['Frank', np.nan, 'Miami', 85000]
print("DataFrame with missing value:\n", df_missing)
# Fill missing Age with the mean age
mean_age = df_missing['Age'].mean()
df_filled = df_missing.fillna({'Age': mean_age})
print("\nDataFrame after filling missing Age:\n", df_filled)
# Rename a column
df_renamed = df_filled.rename(columns={'Salary': 'Annual_Salary'})
print("\nDataFrame with renamed column:\n", df_renamed)
# Apply a function to calculate bonus
df_renamed['Bonus'] = df_renamed['Annual_Salary'].apply(lambda x: x * 0.1)
print("\nDataFrame with Bonus column:\n", df_renamed)
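The detection and removal methods listed earlier (`isna()`, `dropna()`, `duplicated()`, `drop_duplicates()`) are not exercised in the example above, so here is a brief sketch. The small `df_dirty` DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one fully duplicated row
df_dirty = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Frank'],
    'Age': [25, 30, 30, np.nan],
})

# Count missing values in each column
missing_per_column = df_dirty.isna().sum()
print("Missing values per column:\n", missing_per_column)

# Drop every row that contains at least one missing value
df_no_missing = df_dirty.dropna()

# Flag rows that repeat an earlier row, then remove them
duplicate_flags = df_dirty.duplicated()
df_unique = df_dirty.drop_duplicates()
print("\nWithout duplicates:\n", df_unique)
```

Both `dropna()` and `drop_duplicates()` return new DataFrames here and leave `df_dirty` unchanged.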
Grouping and Aggregation
Pandas' `groupby()` method is essential for splitting data into groups based on some criteria and then computing aggregate statistics.
Example: Grouping and Aggregation
data_agg = {
'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
'Value': [10, 15, 12, 18, 11, 20, 22, 13, 16]
}
df_agg = pd.DataFrame(data_agg)
# Group by 'Category' and calculate the mean of 'Value'
grouped_mean = df_agg.groupby('Category')['Value'].mean()
print("Mean Value per Category:\n", grouped_mean)
# Group by 'Category' and calculate several aggregates at once
agg_results = df_agg.groupby('Category')['Value'].agg(['sum', 'count', 'mean', 'min', 'max'])
print("\nAggregation results per Category:\n", agg_results)
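To see the "splitting into groups" step on its own, one option is to iterate over the groups directly. This is an illustrative sketch of what the aggregation does, not a recommended replacement for the built-in aggregation methods.

```python
import pandas as pd

df_agg = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Value': [10, 15, 12, 18, 11, 20, 22, 13, 16],
})

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs
sums = {}
for category, group in df_agg.groupby('Category'):
    sums[category] = int(group['Value'].sum())

print(sums)

# The built-in aggregation gives the same totals in one step
builtin_sums = df_agg.groupby('Category')['Value'].sum()
print(builtin_sums)
```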
Conclusion
This module covered the fundamentals of Pandas data manipulation: creating DataFrames, inspecting data, selecting and indexing, cleaning data, and performing group-by operations. These skills form the basis of most data analysis work in Python.
Continue to the next module to explore more advanced data analysis techniques!