Pandas DataFrame - MSDN Python Data Science & ML

Understanding the Pandas DataFrame

The DataFrame is the primary data structure in the Pandas library. It's a two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns). Think of it as a spreadsheet or an SQL table, or a dictionary of Series objects.
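The "dictionary of Series" view can be checked directly. The sketch below uses made-up data (the names `heights`, `weights`, and `people` are illustrative only) to show that a DataFrame built from Series aligns them on their shared index, and that selecting one column returns a Series.

```python
import pandas as pd

# Two Series sharing the same index labels (hypothetical example data).
heights = pd.Series([1.70, 1.82], index=['ann', 'bob'])
weights = pd.Series([65.0, 80.5], index=['ann', 'bob'])

# Building a DataFrame from a dict of Series aligns them on the index.
people = pd.DataFrame({'height': heights, 'weight': weights})

# Selecting a single column gives back a Series.
column = people['height']
print(type(column).__name__)
print(people.shape)
```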

Creating a DataFrame

You can create a DataFrame in several ways. Here are some common methods:

1. From a Dictionary of Lists or NumPy Arrays

This is one of the most common ways to create a DataFrame. The keys of the dictionary become the column names.

import pandas as pd
import numpy as np

data = {
    'col1': [1, 2, 3, 4],
    'col2': ['A', 'B', 'C', 'D'],
    'col3': np.array([10.5, 20.2, 30.7, 40.1])
}

df = pd.DataFrame(data)
print(df)

Output:

   col1 col2  col3
0     1    A  10.5
1     2    B  20.2
2     3    C  30.7
3     4    D  40.1

2. From a List of Dictionaries

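Each dictionary in the list represents a row. The column names are taken from the union of the dictionaries' keys; if a key is missing from one of the dictionaries, that row gets NaN in the corresponding column.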

data_list = [
    {'name': 'Alice', 'age': 30, 'city': 'New York'},
    {'name': 'Bob', 'age': 25, 'city': 'Los Angeles'},
    {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}
]

df_list = pd.DataFrame(data_list)
print(df_list)

Output:

      name  age         city
0    Alice   30     New York
1      Bob   25  Los Angeles
2  Charlie   35      Chicago
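When the dictionaries do not all share the same keys, the gaps are filled with NaN. A small sketch with hypothetical records (`records` and `df_gaps` are names chosen for this example):

```python
import pandas as pd

# The second record has no 'age' key and the first has no 'city' key.
records = [
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'city': 'Los Angeles'},
]

df_gaps = pd.DataFrame(records)
print(df_gaps)

# Count the missing entries in each column.
print(df_gaps.isna().sum())
```

Because NaN is a floating-point value, an integer column that contains a gap is typically stored as float64.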

3. From a NumPy 2D array

You can also create a DataFrame from a NumPy array, specifying column and index names if needed.

numpy_array = np.array([[10, 20], [30, 40], [50, 60]])
df_numpy = pd.DataFrame(numpy_array, columns=['X', 'Y'], index=['row1', 'row2', 'row3'])
print(df_numpy)

Output:

      X   Y
row1  10  20
row2  30  40
row3  50  60

Basic DataFrame Operations

Once you have a DataFrame, you can perform various operations:

Viewing Data

df.head(n): Returns the first n rows (default is 5).
df.tail(n): Returns the last n rows (default is 5).
df.info(): Prints a concise summary of the DataFrame, including each column's data type and non-null count.
df.describe(): Generates descriptive statistics (count, mean, std, min, max, quartiles) for the numeric columns by default.
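The viewing methods listed above can be tried on a slightly longer table. This is a minimal sketch; `df_view` is a throwaway DataFrame made up for the example.

```python
import pandas as pd

# Ten rows, so that head() and tail() return different slices.
df_view = pd.DataFrame({
    'col1': range(1, 11),
    'col2': list('ABCDEFGHIJ'),
})

first_three = df_view.head(3)   # rows at positions 0, 1, 2
last_two = df_view.tail(2)      # the final two rows
stats = df_view.describe()      # numeric columns only, by default

df_view.info()                  # prints its summary and returns None
print(first_three)
print(last_two)
print(stats)
```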

Column Selection

You can select columns using bracket notation or dot notation (if the column name is a valid Python identifier and doesn't conflict with DataFrame methods).

# Select a single column
print(df['col1'])

# Select multiple columns
print(df[['col1', 'col3']])
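Dot notation is convenient but has the limits mentioned above. The sketch below uses a hypothetical DataFrame `df_names` with one column named `count` (which collides with the `DataFrame.count` method) and one containing a space, to show when bracket notation is the only option.

```python
import pandas as pd

df_names = pd.DataFrame({
    'col1': [1, 2],
    'count': [10, 20],   # same name as the DataFrame.count method
    'my col': [5, 6],    # not a valid Python identifier
})

# For an ordinary column name, both notations give the same Series.
same = df_names.col1.equals(df_names['col1'])

# df_names.count refers to the method, not the column.
is_method = callable(df_names.count)

# Bracket notation works for every column name.
counts = df_names['count']
spaced = df_names['my col']

print(same, is_method)
print(counts.tolist(), spaced.tolist())
```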

Row Selection

Use `.loc` for label-based indexing and `.iloc` for integer-location based indexing.

# Select row by label (index)
print(df.loc[0])

# Select rows by integer position
print(df.iloc[1:3]) # Rows at positions 1 and 2 (the stop position is excluded)
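With the default integer index, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The following sketch uses a hypothetical DataFrame `df_idx` with string labels to make the distinction visible, including the fact that label slices include both endpoints while position slices exclude the stop.

```python
import pandas as pd

df_idx = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

# Label-based slice: both 'b' and 'd' are included.
by_label = df_idx.loc['b':'d']

# Position-based slice: position 3 is excluded.
by_position = df_idx.iloc[1:3]

# A single cell, addressed by row label and column label.
single = df_idx.loc['a', 'value']

print(by_label)
print(by_position)
print(single)
```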

Filtering Data

Use boolean indexing to filter rows based on conditions.

# Filter rows where col1 is greater than 2
filtered_df = df[df['col1'] > 2]
print(filtered_df)
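Conditions can be combined with `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on a hypothetical DataFrame `df_f`:

```python
import pandas as pd

df_f = pd.DataFrame({
    'col1': [1, 2, 3, 4],
    'col2': ['A', 'B', 'C', 'D'],
})

# Rows where col1 > 1 AND col2 is not 'D'.
both = df_f[(df_f['col1'] > 1) & (df_f['col2'] != 'D')]

# Rows where col1 == 1 OR col2 is one of the listed values.
either = df_f[(df_f['col1'] == 1) | df_f['col2'].isin(['D'])]

print(both)
print(either)
```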

Adding and Deleting Columns

# Add a new column
df['new_col'] = [100, 200, 300, 400]
print(df)

# Delete a column
df_dropped = df.drop('new_col', axis=1) # axis=1 targets columns; drop returns a new DataFrame and leaves df unchanged
print(df_dropped)
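A new column can also be computed from existing ones, and `drop` accepts a `columns=` keyword as an alternative to `axis=1`. The sketch below uses a hypothetical DataFrame `df_calc` and confirms that dropping does not modify the original.

```python
import pandas as pd

df_calc = pd.DataFrame({'col1': [1, 2, 3]})

# Derive a new column from an existing one (vectorized arithmetic).
df_calc['double'] = df_calc['col1'] * 2

# columns= is equivalent to passing the label with axis=1.
trimmed = df_calc.drop(columns=['double'])

print(df_calc.columns.tolist())   # the original still has 'double'
print(trimmed.columns.tolist())
```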

Data Handling Operations

df.dropna(): Returns a copy with rows (the default) or columns (axis=1) that contain missing values removed.
df.fillna(value): Fills missing values with a specified value.
df.isnull(): Returns a DataFrame of boolean values indicating missing data (an alias of df.isna()).
df.duplicated(): Returns a boolean Series indicating duplicate rows.
df.drop_duplicates(): Removes duplicate rows.
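The methods above can be seen together on a small table. This is a sketch on made-up data; `df_na` has one missing value in column `a` and one fully duplicated row.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 3.0],
    'b': ['x', 'y', 'z', 'z'],
})

missing_per_column = df_na.isnull().sum()   # missing values per column
complete_rows = df_na.dropna()              # drops the row with NaN
filled = df_na.fillna({'a': 0.0})           # per-column fill value
dup_mask = df_na.duplicated()               # True for repeats of an earlier row
deduped = df_na.drop_duplicates()           # keeps the first occurrence

print(missing_per_column)
print(complete_rows)
print(filled)
print(dup_mask.tolist())
print(deduped)
```

Each of these calls returns a new object, so `df_na` itself is left unchanged.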

Working with Real-World Data

Pandas can read data from many sources, with reader functions such as pd.read_csv for CSV files, pd.read_excel for Excel workbooks, and pd.read_sql for SQL databases.

Reading from CSV

# Assuming you have a file named 'data.csv'
# df_csv = pd.read_csv('data.csv')
# print(df_csv.head())
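To try `pd.read_csv` without a file on disk, the CSV text can be wrapped in `io.StringIO`, which `read_csv` accepts in place of a path. The CSV content below is invented for illustration.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data).
csv_text = (
    "name,age,city\n"
    "Alice,30,New York\n"
    "Bob,25,Los Angeles\n"
)

# read_csv accepts a path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())
print(df_csv.dtypes)
```

The first line is used as the header by default, and column types are inferred from the values.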

Common Data Cleaning Tasks

In data science, cleaning data is crucial. DataFrames provide tools for:

Handling missing values (NaN).
Correcting data types.
Removing duplicates.
Renaming columns.
Renaming index labels.
String manipulation on columns.
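A few of the tasks listed above, sketched on a deliberately messy table. The DataFrame `raw` and its values are made up for the example; it stores ages as strings and names with inconsistent whitespace and capitalization.

```python
import pandas as pd

raw = pd.DataFrame({
    'name': ['  alice ', 'BOB'],
    'age': ['30', '25'],   # numbers stored as text
})

clean = raw.copy()

# Correct the data type: text to integers.
clean['age'] = clean['age'].astype(int)

# String manipulation through the .str accessor.
clean['name'] = clean['name'].str.strip().str.title()

# Rename index labels with a mapping of old label to new label.
clean = clean.rename(index={0: 'first', 1: 'second'})

print(clean)
print(clean.dtypes)
```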

Example: Renaming Columns

df_renamed = df_list.rename(columns={'name': 'Full Name', 'age': 'Years Old'})
print(df_renamed.head(2))

Output:

  Full Name  Years Old         city
0     Alice         30     New York
1       Bob         25  Los Angeles

The Pandas DataFrame is a powerful and versatile tool that forms the backbone of many data analysis and machine learning workflows in Python. Continue exploring its capabilities to unlock the full potential of your data.