Pandas Basics for Data Science with Python

Welcome to this introductory tutorial on Pandas, a cornerstone library for data manipulation and analysis in Python. This guide introduces the core concepts and functionality of Pandas, empowering you to handle structured data efficiently.

What is Pandas?

Pandas is an open-source Python library designed for data manipulation and analysis. It provides data structures and functions that make working with structured data (like tables, time series, and matrices) easy and intuitive. Its two primary data structures are:

  • Series: A one-dimensional labeled array capable of holding any data type.
  • DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or a SQL table.

Installation

If you don't have Pandas installed, you can install it using pip:

pip install pandas

Core Data Structures: Series and DataFrame

1. Series

A Series is like a column in a table. It's a one-dimensional array with an associated array of labels, called the index.

Creating a Series

import pandas as pd

# From a list
data_list = [10, 20, 30, 40, 50]
series_from_list = pd.Series(data_list)
print("Series from list:\n", series_from_list)

# From a dictionary (the keys become the index)
data_dict = {'a': 100, 'b': 200, 'c': 300}
series_from_dict = pd.Series(data_dict)
print("\nSeries from dictionary:\n", series_from_dict)

2. DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's the most commonly used Pandas object.

Creating a DataFrame

import pandas as pd

# From a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print("DataFrame from dictionary:\n", df)

# From a list of dictionaries
data_rows = [
    {'Name': 'Eve', 'Age': 22, 'City': 'Miami'},
    {'Name': 'Frank', 'Age': 40, 'City': 'Seattle'}
]
df_from_rows = pd.DataFrame(data_rows)
print("\nDataFrame from list of dictionaries:\n", df_from_rows)

Loading Data

Pandas excels at reading data from various file formats. Common ones include CSV, Excel, and SQL databases.

Reading CSV Files

Use the read_csv() function to load data from a CSV file.

Reading a CSV

# Assuming you have a file named 'data.csv' in the same directory
# df_csv = pd.read_csv('data.csv')
# print(df_csv.head()) # Display the first 5 rows

Note: The example above is commented out as it requires an actual file. Replace 'data.csv' with your file path.
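
If you want a fully runnable variant, one option is to write a small DataFrame to disk first and read it back. The file name sample.csv below is just an illustrative choice.

A Runnable CSV Round Trip

import pandas as pd

# Write a small DataFrame to CSV, then read it back
df_demo = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
df_demo.to_csv('sample.csv', index=False)  # index=False skips writing the row index

df_csv = pd.read_csv('sample.csv')
print(df_csv.head())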

Reading Excel Files

Use the read_excel() function. For .xlsx files you may need to install the openpyxl library (the older xlrd package now only reads legacy .xls files):

pip install openpyxl

Reading an Excel File

# Assuming you have an Excel file named 'data.xlsx'
# df_excel = pd.read_excel('data.xlsx')
# print(df_excel.head())

Note: This example is also commented out.
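
As with CSVs, you can make this runnable by writing a small workbook first. This sketch assumes openpyxl is installed (see above); sample.xlsx is an arbitrary file name.

A Runnable Excel Round Trip

# Write a small DataFrame to an .xlsx file, then read it back
df_demo = pd.DataFrame({'product': ['pen', 'pad'], 'price': [1.5, 3.0]})
df_demo.to_excel('sample.xlsx', index=False)

df_excel = pd.read_excel('sample.xlsx')
print(df_excel.head())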

Inspecting Data

Once data is loaded, it's crucial to inspect it to understand its structure and content.

  • df.head(n): Shows the first n rows (default 5).
  • df.tail(n): Shows the last n rows (default 5).
  • df.info(): Prints a concise summary of the DataFrame, including the index dtype, column dtypes, non-null counts, and memory usage.
  • df.describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
  • df.shape: Returns a tuple with the DataFrame's dimensions (rows, columns).
  • df.columns: Returns the column labels as an Index object.
  • df.index: Returns the row index of the DataFrame.

Inspecting a DataFrame

print("First 3 rows:\n", df.head(3))
print("\nDataFrame Info:\n")
df.info()
print("\nDescriptive Statistics:\n", df.describe())
print("\nDataFrame Shape:", df.shape)
print("\nColumn Names:", df.columns)

Data Selection and Indexing

Pandas offers powerful ways to select and index data within DataFrames.

Selecting Columns

You can select a single column as a Series or multiple columns as a new DataFrame.

Selecting Columns

# Select a single column (returns a Series)
ages = df['Age']
print("Ages column (Series):\n", ages.head())

# Select multiple columns (returns a DataFrame)
name_city = df[['Name', 'City']]
print("\nName and City columns (DataFrame):\n", name_city.head())

Selecting Rows

Use .loc[] for label-based indexing and .iloc[] for integer-position-based indexing.

Selecting Rows using .loc and .iloc

# Select row by label (index) - assuming default integer index
row_1 = df.loc[1]
print("Row at index 1 (using .loc):\n", row_1)

# Select rows by integer position
rows_0_to_2 = df.iloc[0:3] # Selects rows from index 0 up to (but not including) 3
print("\nFirst 3 rows (using .iloc):\n", rows_0_to_2)

# Select specific rows and columns
specific_selection = df.loc[df['Age'] > 28, ['Name', 'City']]
print("\nNames and Cities of people older than 28:\n", specific_selection)

Data Cleaning and Manipulation

Pandas provides tools for handling missing data, filtering, sorting, and transforming data.

Handling Missing Data

Missing data is often represented as NaN (Not a Number).

  • df.isnull(): Returns a boolean DataFrame indicating where values are NaN.
  • df.dropna(): Removes rows (by default) or columns containing NaN values.
  • df.fillna(value): Fills NaN values with a specified value.

Handling Missing Data

# Create a DataFrame with missing values for demonstration
data_with_nan = {
    'col1': [1, 2, None, 4],
    'col2': [None, 5, 6, 7]
}
df_nan = pd.DataFrame(data_with_nan)
print("DataFrame with NaN:\n", df_nan)

print("\nIs Null:\n", df_nan.isnull())

# Drop rows with any NaN values
df_dropped_rows = df_nan.dropna()
print("\nDataFrame after dropping rows with NaN:\n", df_dropped_rows)

# Fill NaN values with 0
df_filled = df_nan.fillna(0)
print("\nDataFrame after filling NaN with 0:\n", df_filled)

Filtering Data

You can filter rows based on conditions.

Filtering Data

# Filter rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:\n", older_than_30)

# Filter by multiple conditions (Age > 25 AND City is 'New York')
complex_filter = df[(df['Age'] > 25) & (df['City'] == 'New York')]
print("\nPeople older than 25 in New York:\n", complex_filter)

Sorting Data

Sort DataFrames by one or more columns.

Sorting Data

# Sort by Age in ascending order
df_sorted_age = df.sort_values(by='Age')
print("DataFrame sorted by Age:\n", df_sorted_age)

# Sort by City in descending order
df_sorted_city_desc = df.sort_values(by='City', ascending=False)
print("\nDataFrame sorted by City (descending):\n", df_sorted_city_desc)

Grouping and Aggregation

Pandas' groupby() function is powerful for splitting data into groups based on some criteria and then applying a function (like sum, mean, count) to each group independently.

Grouping and Aggregation

# Example DataFrame with more data
data_sales = {
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Home Goods', 'Clothing', 'Electronics'],
    'Sales': [1200, 300, 1500, 800, 450, 1100]
}
df_sales = pd.DataFrame(data_sales)
print("Sales Data:\n", df_sales)

# Group by 'Category' and calculate the sum of 'Sales' for each category
category_sales_sum = df_sales.groupby('Category')['Sales'].sum()
print("\nTotal Sales per Category:\n", category_sales_sum)

# Group by 'Category' and calculate the average 'Sales'
category_sales_mean = df_sales.groupby('Category')['Sales'].mean()
print("\nAverage Sales per Category:\n", category_sales_mean)

# Multiple aggregations
category_agg = df_sales.groupby('Category')['Sales'].agg(['sum', 'mean', 'count'])
print("\nMultiple Aggregations per Category:\n", category_agg)

Conclusion

You've now covered the essential Pandas functionalities for data science in Python. These include understanding Series and DataFrames, loading data, inspecting it, selecting subsets, cleaning missing values, filtering, sorting, and performing group-by operations.

Key Takeaway: Practice is key! The best way to master Pandas is to work with real-world datasets and apply these concepts. Experiment with different functions and data sources.

Continue your journey by exploring more advanced topics like merging/joining DataFrames, time series analysis, and data visualization with libraries like Matplotlib and Seaborn, which integrate seamlessly with Pandas.