MSDN Learn Paths

Pandas Data Manipulation

Welcome to the Pandas Data Manipulation module. Pandas is a powerful, open-source library built on top of NumPy, providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This module will guide you through the essential concepts and operations for manipulating data with Pandas.

What is Pandas?

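Pandas introduces two primary data structures: the Series, a one-dimensional labeled array, and the DataFrame, a two-dimensional labeled table whose columns can hold different data types.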
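As a minimal sketch of how the two structures relate, the snippet below builds a small Series and a DataFrame. The names `ages` and `people` and the sample values are illustrative assumptions, not part of the module's dataset.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
ages = pd.Series([25, 30, 22], index=['Alice', 'Bob', 'Charlie'], name='Age')

# A DataFrame is a two-dimensional table; each column is a Series
# and all columns share the same index.
people = pd.DataFrame({'Age': ages, 'City': ['New York', 'Los Angeles', 'Chicago']})

print(ages['Bob'])          # look up a value by its label
print(type(people['Age']))  # selecting one column gives back a Series
print(people.shape)         # (number of rows, number of columns)
```

Selecting a single column from a DataFrame returns a Series, which is why most Series methods also work on DataFrame columns.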

Getting Started

First, ensure you have Pandas installed. If not, you can install it using pip:

pip install pandas numpy

Then, import the library in your Python script:

import pandas as pd
import numpy as np

Creating DataFrames

You can create DataFrames from various sources, including dictionaries, lists of dictionaries, and NumPy arrays.

Example: DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 60000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   22      Chicago   60000
3    David   35      Houston   90000
4      Eve   28      Phoenix   75000
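The other sources mentioned above, a list of dictionaries and a NumPy array, might be used as in the following sketch. The variable names (`records`, `matrix`, `df_records`, `df_array`) and the column labels `x` and `y` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row,
# and the dictionary keys become the column names.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_records = pd.DataFrame(records)

# From a NumPy array: the array holds only values,
# so column names are supplied separately.
matrix = np.arange(6).reshape(3, 2)
df_array = pd.DataFrame(matrix, columns=['x', 'y'])

print(df_records)
print(df_array)
```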

Data Inspection

Once you have a DataFrame, you'll want to inspect its contents and structure.

Example: Inspecting the DataFrame

print("First 3 rows:\n", df.head(3))
print("\nDataFrame Info:\n")
df.info()
print("\nDescriptive Statistics:\n", df.describe())
print("\nShape:", df.shape)
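A few other inspection attributes are often useful alongside `head()`, `info()`, and `describe()`. The sketch below uses a small stand-in DataFrame named `df_small` (an assumption, so the snippet runs on its own).

```python
import pandas as pd

df_small = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
})

print(df_small.columns.tolist())   # column labels as a plain list
print(df_small.dtypes)             # data type of each column
print(df_small.tail(2))            # last 2 rows
print(df_small['Age'].describe())  # summary statistics for a single column
```

Note that `describe()` on a whole DataFrame summarizes only the numeric columns by default, so text columns such as `Name` are left out unless you pass `include='all'`.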

Data Selection and Indexing

Pandas provides powerful ways to select and index data.

Example: Selecting Data

# Select 'Name' column
print("Names:\n", df['Name'])

# Select 'Name' and 'Age' columns
print("\nNames and Ages:\n", df[['Name', 'Age']])

# Select rows by label (index) using .loc
print("\nRow with index 2:\n", df.loc[2])

# Select rows with labels 1 through 2 using .loc (label slices include both endpoints)
print("\nRows 1 to 2:\n", df.loc[1:2])

# Select rows by integer position using .iloc
print("\nFirst row (index 0):\n", df.iloc[0])

# Select rows from index 0 to 2 (exclusive of 2) using .iloc
print("\nFirst 2 rows:\n", df.iloc[0:2])

# Select rows where Age > 30
print("\nEmployees older than 30:\n", df[df['Age'] > 30])
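Conditions can also be combined, and `.loc` can filter rows and pick columns in one step. The following is a minimal sketch; the DataFrame `staff` repeats part of the sample data so the snippet is self-contained, and the thresholds are arbitrary assumptions.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 35, 28],
    'Salary': [70000, 80000, 60000, 90000, 75000],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
mask = (staff['Age'] >= 28) & (staff['Salary'] < 90000)

# .loc takes a row selector and a column selector together.
selected = staff.loc[mask, ['Name', 'Salary']]
print(selected)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.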

Data Cleaning and Manipulation

Real-world data is often messy. Pandas offers tools to handle missing data, duplicates, and transform data.

Example: Cleaning and Transforming Data

# Add a row with missing age
df_missing = df.copy()
df_missing.loc[5] = ['Frank', np.nan, 'Miami', 85000]

print("DataFrame with missing value:\n", df_missing)

# Fill missing Age with the mean age
mean_age = df_missing['Age'].mean()
df_filled = df_missing.fillna({'Age': mean_age})
print("\nDataFrame after filling missing Age:\n", df_filled)

# Rename a column
df_renamed = df_filled.rename(columns={'Salary': 'Annual_Salary'})
print("\nDataFrame with renamed column:\n", df_renamed)

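# Apply a function to each salary to calculate a 10% bonus
# (for simple arithmetic, df_renamed['Annual_Salary'] * 0.1 is the faster vectorized form)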
df_renamed['Bonus'] = df_renamed['Annual_Salary'].apply(lambda x: x * 0.1)
print("\nDataFrame with Bonus column:\n", df_renamed)
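The example above fills missing values but does not show duplicate handling. One possible approach is sketched below, using a hypothetical DataFrame `raw` that contains one repeated row and one missing age.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Frank'],
    'Age': [25, 30, 30, np.nan],
})

# Count missing values in each column.
print(raw.isna().sum())

# Remove rows that exactly repeat an earlier row.
deduped = raw.drop_duplicates()

# Drop rows with a missing Age instead of filling them.
complete = deduped.dropna(subset=['Age'])
print(complete)
```

Whether to fill or drop missing values depends on the analysis: filling keeps the row count stable, while dropping avoids introducing estimated values.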

Grouping and Aggregation

Pandas' `groupby()` method is essential for splitting data into groups based on some criteria and then computing aggregate statistics.

Example: Grouping and Aggregation

data_agg = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Value': [10, 15, 12, 18, 11, 20, 22, 13, 16]
}
df_agg = pd.DataFrame(data_agg)

# Group by 'Category' and calculate the mean of 'Value'
grouped_mean = df_agg.groupby('Category')['Value'].mean()
print("Mean Value per Category:\n", grouped_mean)

# Group by 'Category' and calculate several aggregates at once
agg_results = df_agg.groupby('Category')['Value'].agg(['sum', 'count', 'mean', 'min', 'max'])
print("\nAggregation results per Category:\n", agg_results)
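When the result columns need descriptive names, named aggregation is one option. The sketch below reuses the same sample values in a DataFrame called `sales`; the output column names `total` and `n_rows` are assumptions chosen for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Value': [10, 15, 12, 18, 11, 20, 22, 13, 16],
})

# Named aggregation: each keyword is an output column name,
# mapped to a (source column, aggregation function) pair.
summary = sales.groupby('Category').agg(
    total=('Value', 'sum'),
    n_rows=('Value', 'count'),
).reset_index()  # turn the group labels back into a regular column
print(summary)
```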

Conclusion

This module covered the fundamentals of Pandas data manipulation: creating DataFrames, inspecting data, selecting and indexing, cleaning data, and performing group-by operations. These operations underpin most data analysis work in Python.

Continue to the next module to explore more advanced data analysis techniques!