Pandas Data Manipulation

Welcome to the Pandas Data Manipulation module. Pandas is an open-source library built on top of NumPy that provides labeled data structures and data analysis tools for the Python programming language. This module walks through the essential concepts and operations for manipulating data with Pandas.

What is Pandas?

Pandas introduces two primary data structures:

Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
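
The relationship between the two structures can be sketched in a few lines. This is a minimal illustration, not part of the module's running example: the labels and the variable names (ages, cities, people) are made up for the sketch.

```python
import pandas as pd

# A Series: one-dimensional values paired with an index of labels.
ages = pd.Series([25, 30, 22], index=['alice', 'bob', 'charlie'], name='Age')
print(ages['bob'])  # look up a value by its label

# A DataFrame behaves like a dictionary of Series that share one index.
cities = pd.Series(['New York', 'Los Angeles', 'Chicago'],
                   index=['alice', 'bob', 'charlie'], name='City')
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Selecting a single column gives back a Series.
age_column = people['Age']
print(type(age_column).__name__)
```

Because both Series carry the same labels, Pandas aligns them row by row when building the DataFrame.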

Getting Started

First, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas numpy

Then, import the library in your Python script:

import pandas as pd
import numpy as np

Creating DataFrames

You can create DataFrames from various sources, including dictionaries, lists of dictionaries, and NumPy arrays.

Example: DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 60000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   22      Chicago   60000
3    David   35      Houston   90000
4      Eve   28      Phoenix   75000
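
The other two sources mentioned above, lists of dictionaries and NumPy arrays, work in much the same way. The sketch below uses small made-up inputs (records, array) and illustrative column labels for the array case.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row,
# and the keys become the column labels.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_records = pd.DataFrame(records)
print(df_records)

# From a 2-D NumPy array: the array has no labels of its own,
# so column names are passed explicitly.
array = np.array([[1, 2, 3], [4, 5, 6]])
df_array = pd.DataFrame(array, columns=['a', 'b', 'c'])
print(df_array)
```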

Data Inspection

Once you have a DataFrame, you'll want to inspect its contents and structure.

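df.head(): Returns the first 5 rows by default; pass a number, as in df.head(3), to change the count.
df.tail(): Returns the last 5 rows by default; it accepts the same optional count.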
df.info(): Provides a concise summary of the DataFrame, including index dtype and column dtypes, non-null values and memory usage.
df.describe(): Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. By default only numeric columns are summarized.
df.shape: Returns a tuple representing the dimensionality of the DataFrame (rows, columns).
df.columns: Returns an Index object containing the column labels.
df.index: Returns an Index object representing the row labels.

Example: Inspecting the DataFrame

print("First 3 rows:\n", df.head(3))
print("\nDataFrame Info:\n")
df.info()
print("\nDescriptive Statistics:\n", df.describe())
print("\nShape:", df.shape)
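
The remaining inspection tools from the list can be tried the same way. The sketch below rebuilds a smaller version of the example DataFrame (without the City column) so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'Salary': [70000, 80000, 60000, 90000, 75000],
})

# tail() mirrors head(): here it returns the last two rows.
last_two = df.tail(2)
print(last_two)

# columns and index are Index objects; list() makes them easy to read.
column_labels = list(df.columns)
row_labels = list(df.index)
print(column_labels, row_labels)

# describe() summarizes numeric columns only, so 'Name' is left out.
stats = df.describe()
print(stats.loc['mean'])
```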

Data Selection and Indexing

Pandas provides powerful ways to select and index data.

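Selecting Columns: Use square brackets (`df['Age']`) or dot notation (`df.Age`). Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.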
Selecting Rows: Use `.loc[]` (label-based) or `.iloc[]` (integer-location based).
Boolean Indexing: Filter rows based on conditions.

Example: Selecting Data

# Select 'Name' column
print("Names:\n", df['Name'])

# Select 'Name' and 'Age' columns
print("\nNames and Ages:\n", df[['Name', 'Age']])

# Select rows by label (index) using .loc
print("\nRow with index 2:\n", df.loc[2])

# Select rows with labels 1 through 2 using .loc (label slices include both endpoints)
print("\nRows 1 to 2:\n", df.loc[1:2])

# Select rows by integer position using .iloc
print("\nFirst row (index 0):\n", df.iloc[0])

# Select rows at positions 0 and 1 using .iloc (the stop position 2 is excluded)
print("\nFirst 2 rows:\n", df.iloc[0:2])

# Select rows where Age > 30
print("\nEmployees older than 30:\n", df[df['Age'] > 30])
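
Two points are easy to get wrong in practice: how slices differ between `.loc[]` and `.iloc[]`, and how to combine several conditions. The following sketch illustrates both on a self-contained copy of the example data (City omitted); the names mask, filtered and names_of_filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'Salary': [70000, 80000, 60000, 90000, 75000],
})

# Dot notation is shorthand for bracket selection on simple column names.
ages = df.Age

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df['Age'] > 25) & (df['Salary'] < 85000)
filtered = df[mask]
print(filtered)

# .loc takes a row selector and a column selector together.
names_of_filtered = df.loc[mask, 'Name']
print(names_of_filtered.tolist())

# Label slices with .loc include the end label;
# position slices with .iloc exclude the stop position.
loc_rows = df.loc[1:3]    # labels 1, 2, 3
iloc_rows = df.iloc[1:3]  # positions 1, 2
print(len(loc_rows), len(iloc_rows))
```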

Data Cleaning and Manipulation

Real-world data is often messy. Pandas offers tools to handle missing data, duplicates, and transform data.

Handling Missing Data:
- df.isnull() or df.isna(): Detect missing values.
- df.dropna(): Remove rows or columns with missing values.
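- df.fillna(): Fill missing values with a specified value (a scalar, or a dictionary mapping column names to values). To propagate neighboring values instead, use df.ffill() or df.bfill().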
Handling Duplicates:
- df.duplicated(): Identify duplicate rows.
- df.drop_duplicates(): Remove duplicate rows.
Renaming Columns: Use df.rename().
Applying Functions: Use df.apply() to apply a function along an axis.

Example: Cleaning and Transforming Data

# Add a row with missing age
df_missing = df.copy()
df_missing.loc[5] = ['Frank', np.nan, 'Miami', 85000]

print("DataFrame with missing value:\n", df_missing)

# Fill missing Age with the mean age (mean() skips NaN by default)
mean_age = df_missing['Age'].mean()
df_filled = df_missing.fillna({'Age': mean_age})
print("\nDataFrame after filling missing Age:\n", df_filled)

# Rename a column
df_renamed = df_filled.rename(columns={'Salary': 'Annual_Salary'})
print("\nDataFrame with renamed column:\n", df_renamed)

# Apply a function to calculate a 10% bonus
# (for simple arithmetic, df_renamed['Annual_Salary'] * 0.1 is the vectorized equivalent)
df_renamed['Bonus'] = df_renamed['Annual_Salary'].apply(lambda x: x * 0.1)
print("\nDataFrame with Bonus column:\n", df_renamed)
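
The example above does not exercise the detection and duplicate-handling methods from the list. A minimal sketch of those, using a small made-up DataFrame (df_raw) that contains one missing value and one repeated row:

```python
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Frank'],
    'Age': [25, 30, 30, np.nan],
    'City': ['New York', 'Los Angeles', 'Los Angeles', 'Miami'],
})

# Count missing values per column.
missing_per_column = df_raw.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
df_no_missing = df_raw.dropna()

# duplicated() marks a row True when an identical row appeared earlier.
duplicate_mask = df_raw.duplicated()
print(duplicate_mask.tolist())

# drop_duplicates() keeps the first occurrence of each row by default.
df_unique = df_raw.drop_duplicates()
print(df_unique)
```

Both dropna() and drop_duplicates() return new DataFrames and leave df_raw unchanged.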

Grouping and Aggregation

Pandas' `groupby()` method is essential for splitting data into groups based on some criteria and then computing aggregate statistics.

Example: Grouping and Aggregation

data_agg = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Value': [10, 15, 12, 18, 11, 20, 22, 13, 16]
}
df_agg = pd.DataFrame(data_agg)

# Group by 'Category' and calculate the mean of 'Value'
grouped_mean = df_agg.groupby('Category')['Value'].mean()
print("Mean Value per Category:\n", grouped_mean)

# Group by 'Category' and calculate the sum, count, mean, min and max of 'Value'
agg_results = df_agg.groupby('Category')['Value'].agg(['sum', 'count', 'mean', 'min', 'max'])
print("\nAggregation results per Category:\n", agg_results)
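
When the result columns need descriptive names, named aggregation is one option. The sketch below reuses the data from the example above; the output column names total and average are arbitrary choices for illustration.

```python
import pandas as pd

df_agg = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Value': [10, 15, 12, 18, 11, 20, 22, 13, 16],
})

# Named aggregation: each keyword becomes an output column,
# defined by a (source column, aggregation function) pair.
summary = df_agg.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
)
print(summary)

# The group keys form the index; reset_index() turns them back into a column.
flat = summary.reset_index()
print(flat)
```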

Conclusion

This module covered the fundamentals of Pandas data manipulation: creating DataFrames, inspecting data, selecting and indexing, cleaning data, and performing group-by operations. These operations underpin most day-to-day data work in Python.

Continue to the next module to explore more advanced data analysis techniques!