Understanding the Pandas DataFrame
The DataFrame is the primary data structure in the Pandas library. It is a two-dimensional, size-mutable, tabular structure with labeled axes (rows and columns). You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects that share one index.
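To make the "dictionary of Series" view concrete, here is a minimal sketch. The names `price`, `qty`, and `df_series`, along with their values, are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical data)
price = pd.Series([1.5, 2.0, 3.25], index=['a', 'b', 'c'])
qty = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Each dictionary key becomes a column; the shared index labels the rows
df_series = pd.DataFrame({'price': price, 'qty': qty})

# Selecting one column hands back a Series again
print(type(df_series['price']))
print(df_series.shape)
```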
Creating a DataFrame
You can create a DataFrame in several ways. Here are some common methods:
1. From a Dictionary of Lists or NumPy Arrays
This is one of the most common ways to create a DataFrame. The keys of the dictionary become the column names.
import pandas as pd
import numpy as np
data = {
    'col1': [1, 2, 3, 4],
    'col2': ['A', 'B', 'C', 'D'],
    'col3': np.array([10.5, 20.2, 30.7, 40.1])
}
df = pd.DataFrame(data)
print(df)
Output:
   col1 col2  col3
0     1    A  10.5
1     2    B  20.2
2     3    C  30.7
3     4    D  40.1
2. From a List of Dictionaries
Each dictionary in the list represents a row. The columns are inferred from the keys, and a key that is missing from one dictionary produces NaN in that row.
data_list = [
    {'name': 'Alice', 'age': 30, 'city': 'New York'},
    {'name': 'Bob', 'age': 25, 'city': 'Los Angeles'},
    {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}
]
df_list = pd.DataFrame(data_list)
print(df_list)
Output:
      name  age         city
0    Alice   30     New York
1      Bob   25  Los Angeles
2  Charlie   35      Chicago
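When the dictionaries do not all carry the same keys, the columns end up as the union of the keys. The sketch below uses a hypothetical `records` list in which the second entry has no `city`.

```python
import pandas as pd

# The second record has no 'city' key (hypothetical data)
records = [
    {'name': 'Alice', 'age': 30, 'city': 'New York'},
    {'name': 'Bob', 'age': 25},
]
df_records = pd.DataFrame(records)

# Columns are the union of all keys; the missing entry shows up as NaN
print(df_records)
print(df_records['city'].isna().tolist())
```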
3. From a NumPy 2D array
You can also create a DataFrame from a NumPy array, specifying column and index names if needed.
numpy_array = np.array([[10, 20], [30, 40], [50, 60]])
df_numpy = pd.DataFrame(numpy_array, columns=['X', 'Y'], index=['row1', 'row2', 'row3'])
print(df_numpy)
Output:
       X   Y
row1  10  20
row2  30  40
row3  50  60
Basic DataFrame Operations
Once you have a DataFrame, you can perform various operations:
Viewing Data
- df.head(n): Returns the first n rows (default is 5).
- df.tail(n): Returns the last n rows (default is 5).
- df.info(): Provides a concise summary of the DataFrame, including data types and non-null values.
- df.describe(): Generates descriptive statistics (count, mean, std, min, max, quartiles).
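A short sketch of these four methods on a hypothetical ten-row frame (`df_view` and its columns are illustrative names, not part of the examples above):

```python
import pandas as pd

# Hypothetical ten-row frame, just large enough to show truncation
df_view = pd.DataFrame({'n': range(10), 'sq': [i * i for i in range(10)]})

print(df_view.head(3))   # first three rows
print(df_view.tail(2))   # last two rows
df_view.info()           # prints its summary directly and returns None
stats = df_view.describe()
print(stats.loc['mean']) # one row of the statistics table
```

Note that `info()` writes to standard output rather than returning a value, so wrapping it in `print()` would print an extra `None`.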
Column Selection
You can select columns using bracket notation or dot notation (if the column name is a valid Python identifier and doesn't conflict with DataFrame methods).
# Select a single column
print(df['col1'])
# Select multiple columns
print(df[['col1', 'col3']])
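Dot notation and its two restrictions can be sketched as follows. The frame `df_attr` and its column names are hypothetical, chosen so that one column works as an attribute and two do not.

```python
import pandas as pd

# Hypothetical columns: 'price' works as an attribute, the other two do not
df_attr = pd.DataFrame({
    'price': [9.99, 14.50],
    'unit cost': [4.00, 6.25],   # contains a space, so not a valid identifier
    'count': [3, 7],             # clashes with the DataFrame.count method
})

print(df_attr.price)          # same Series as df_attr['price']
print(df_attr['unit cost'])   # brackets are the only option here
print(df_attr['count'])       # df_attr.count would give the method instead
```

Bracket notation works in every case, which is why it is usually the safer default in scripts.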
Row Selection
Use `.loc` for label-based indexing and `.iloc` for integer-location based indexing.
# Select row by label (index)
print(df.loc[0])
# Select rows by integer position
print(df.iloc[1:3]) # Rows at positions 1 and 2 (the end position, 3, is excluded)
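With the default index, labels and positions coincide, which hides the difference between the two accessors. The sketch below assumes a hypothetical frame `df_idx` whose labels are 10, 20, and 30 so that the two can be told apart.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from row positions
df_idx = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

print(df_idx.loc[10])      # label 10 is the first row
print(df_idx.iloc[0])      # position 0 is the same first row
print(df_idx.loc[10:20])   # label slices include both endpoints
print(df_idx.iloc[0:1])    # position slices exclude the end
```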
Filtering Data
Use boolean indexing to filter rows based on conditions.
# Filter rows where col1 is greater than 2
filtered_df = df[df['col1'] > 2]
print(filtered_df)
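Conditions can also be combined. The following is a minimal sketch on a hypothetical frame `df_f` that mirrors the first two columns used above; the variable names are illustrative.

```python
import pandas as pd

df_f = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': ['A', 'B', 'C', 'D']})

# Wrap each condition in parentheses and combine with & (and), | (or), ~ (not);
# the Python keywords and / or / not do not work element-wise on a Series
both = df_f[(df_f['col1'] > 1) & (df_f['col2'] != 'D')]
either = df_f[(df_f['col1'] == 1) | (df_f['col2'] == 'D')]

# isin() tests membership in a list of allowed values
members = df_f[df_f['col2'].isin(['A', 'C'])]

print(both)
print(either)
print(members)
```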
Adding and Deleting Columns
# Add a new column
df['new_col'] = [100, 200, 300, 400]
print(df)
# Delete a column
df_dropped = df.drop('new_col', axis=1) # axis=1 targets columns; drop returns a new DataFrame and leaves df unchanged
print(df_dropped)
Data Handling Operations
- df.dropna(): Removes rows or columns with missing values.
- df.fillna(value): Fills missing values with a specified value.
- df.isnull(): Returns a DataFrame of boolean values indicating missing data.
- df.duplicated(): Returns a boolean Series indicating duplicate rows.
- df.drop_duplicates(): Removes duplicate rows.
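These methods can be tried together on a small frame. The sketch below uses a hypothetical `df_na` with one missing value and one repeated row; none of the calls modifies `df_na` itself, since each returns a new object.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 repeats row 1, and the last score is missing
df_na = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'score': [0.5, 0.7, 0.7, np.nan],
})

missing_per_column = df_na.isnull().sum()   # count of NaN values per column
complete_rows = df_na.dropna()              # drop rows containing any NaN
filled = df_na.fillna(0.0)                  # replace NaN with 0.0
dup_mask = df_na.duplicated()               # True for repeats of an earlier row
deduped = df_na.drop_duplicates()           # keep the first copy of each row

print(missing_per_column)
print(dup_mask.tolist())
print(deduped)
```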
Working with Real-World Data
Pandas excels at reading data from various sources like CSV, Excel, SQL databases, and more.
Reading from CSV
# Assuming you have a file named 'data.csv'
# df_csv = pd.read_csv('data.csv')
# print(df_csv.head())
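To try `read_csv` without a file on disk, the CSV text can be supplied through an in-memory buffer. In this sketch the contents of `csv_text` are made up, and `io.StringIO` stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "name,age,city\nAlice,30,New York\nBob,25,Los Angeles\n"

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.head())
print(df_csv.dtypes)   # column types are inferred from the text
```

The first line of the text is taken as the header row, which is the default behavior of `read_csv`.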
Common Data Cleaning Tasks
In data science, cleaning data is crucial. DataFrames provide tools for:
- Handling missing values (NaN).
- Correcting data types.
- Removing duplicates.
- Renaming columns.
- Renaming index.
- String manipulation on columns.
Example: Renaming Columns
df_renamed = df_list.rename(columns={'name': 'Full Name', 'age': 'Years Old'})
print(df_renamed.head(2))
Output:
  Full Name  Years Old         city
0     Alice         30     New York
1       Bob         25  Los Angeles
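Some of the other tasks in the list follow the same pattern. Here is a minimal sketch covering data types, string manipulation, and index renaming; the frame `df_raw`, its values, and the new index labels are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as strings and untidy names
df_raw = pd.DataFrame({
    'name': ['  alice ', 'BOB'],
    'age': ['30', '25'],
})

df_clean = df_raw.copy()

# Correct the data type: strings to integers
df_clean['age'] = df_clean['age'].astype(int)

# String manipulation: trim whitespace, then normalize capitalization
df_clean['name'] = df_clean['name'].str.strip().str.title()

# Rename index labels with a mapping of old label to new label
df_clean = df_clean.rename(index={0: 'first', 1: 'second'})

print(df_clean)
```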
The Pandas DataFrame is a powerful and versatile tool that forms the backbone of many data analysis and machine learning workflows in Python. Continue exploring its capabilities to unlock the full potential of your data.