Pandas Documentation
Welcome to the official documentation for Pandas, a powerful and widely-used Python library for data manipulation and analysis. Pandas provides easy-to-use data structures and data analysis tools for the Python programming language.
Introduction
Pandas is built on top of the NumPy library and integrates well with other scientific computing libraries in Python. It is especially well-suited for tabular data (like spreadsheets, SQL tables, or CSV files) and time series data. Pandas offers a variety of data structures, the most important of which are:
- Series: A one-dimensional labeled array capable of holding any type of data.
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or SQL table, or a dictionary of Series objects.
Pandas aims to be the de facto standard for any data analysis task in Python. Its features include:
- Easy handling of missing data.
- Automatic and explicit data alignment: data sets can be added or merged in a way that is aware of their indices.
- Reshaping and pivoting of datasets.
- Labeling and manipulating complex data sets.
- Flexible tools for grouping, combining, and summarizing data.
- Time series functionality.
Installation
To install Pandas, you can use pip, the Python package installer:
pip install pandas
For a full list of installation options and instructions, please refer to the official installation guide.
Getting Started
Let's dive into some basic examples to get you started with Pandas.
Creating a Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
import pandas as pd
import numpy as np
# Create a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
Output:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
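The index does not have to be the default integer range. As a minimal sketch (the labels and values below are made up for illustration), a Series can be built with explicit labels and then queried by label:

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([1.5, 2.0, 0.75], index=['apple', 'banana', 'cherry'])

# Look up a single value by its label
banana_price = prices['banana']

# The index is an ordinary attribute of the Series
labels = list(prices.index)

print(banana_price)
print(labels)
```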
Creating a DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It's like a spreadsheet.
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
Output (will vary due to random numbers):
                   A         B         C         D
2023-01-01  0.123456 -0.789012  1.345678 -0.901234
2023-01-02 -0.456789  0.234567 -1.567890  0.345678
2023-01-03  1.123456 -0.567890  0.789012 -1.012345
2023-01-04 -0.987654  1.234567 -0.123456  0.567890
2023-01-05  0.345678 -0.123456  1.901234 -0.789012
2023-01-06 -1.012345  0.567890 -0.345678  1.234567
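To illustrate the "columns of potentially different types" point, here is a small sketch that builds a DataFrame from a dictionary of columns. The column names and values are hypothetical and not part of the example above:

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column
people = pd.DataFrame({
    'name': ['Ada', 'Grace', 'Linus'],
    'age': [36, 45, 28],
    'height_m': [1.65, 1.70, 1.80],
})

# Each column keeps its own dtype
print(people.dtypes)

# Selecting a single column returns a Series
ages = people['age']
print(type(ages).__name__)
```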
Reading Data
Pandas can read data from various file formats like CSV, Excel, SQL databases, and more.
# Reading from a CSV file
df_csv = pd.read_csv('your_data.csv')
# Reading from an Excel file (requires an Excel engine such as openpyxl to be installed)
df_excel = pd.read_excel('your_data.xlsx')
Relative paths are resolved against the current working directory, so either run your script from the directory that contains the data files or provide the full path to each file.
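If you want to try read_csv without a file on disk, one option is to pass it an in-memory text buffer. The sketch below uses made-up CSV content and only illustrates the round trip:

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = "city,population\nOslo,700000\nBergen,290000\n"

# read_csv accepts a path or any file-like object
cities = pd.read_csv(io.StringIO(csv_text))
print(cities)

# With no path argument, to_csv returns the CSV text as a string
round_trip = cities.to_csv(index=False)
print(round_trip)
```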
Basic Operations
Pandas provides a rich set of operations for data manipulation.
Viewing Data
You can inspect your data using methods like head(), tail(), and info().
# Display the first 5 rows
print(df.head())
# Display the last 3 rows
print(df.tail(3))
# Get a concise summary of the DataFrame
df.info()
Selection and Indexing
Accessing specific parts of your DataFrame is crucial for analysis.
# Select a column by label
print(df['A'])
# Select multiple columns
print(df[['A', 'B']])
# Select a single row by label
print(df.loc[dates[0]])
# Select a single row by integer position (the fourth row)
print(df.iloc[3])
# Select a block of rows and columns by label (both slice endpoints are included)
print(df.loc[dates[0]:dates[2], ['A', 'B']])
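One detail that often surprises newcomers is that label-based slices with loc include the end label, while position-based slices with iloc exclude the stop position. A minimal sketch, using a hypothetical frame named df_sel:

```python
import pandas as pd

# Hypothetical frame with string row labels
df_sel = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
                      index=['w', 'x', 'y', 'z'])

# Label-based slice: both 'w' and 'y' are included (3 rows)
by_label = df_sel.loc['w':'y', 'A']

# Position-based slice: the stop position 2 is excluded (2 rows)
by_position = df_sel.iloc[0:2, 0]

print(by_label)
print(by_position)
```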
Data Alignment
Pandas automatically aligns data based on labels, which is very powerful when performing operations between DataFrames or Series.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)
Output:
a     NaN
b    12.0
c    23.0
d     NaN
dtype: float64
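Notice that the result contains the union of the labels from both Series. Values are computed only for labels present in both ('b' and 'c'); the unmatched labels ('a' and 'd') produce NaN, which is also why the result dtype is float64.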
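If NaN for unmatched labels is not what you want, the add method accepts a fill_value that stands in for the missing side. A short sketch reusing the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Treat a label missing from one Series as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```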
Hierarchical Indexing
Pandas supports "multi-level" or "hierarchical" indexing, which allows you to have multiple levels of row and column labels. This is particularly useful for working with higher-dimensional data.
# Build a two-level row index from the product of two label lists
index = pd.MultiIndex.from_product([['x', 'y', 'z'], ['a', 'b']],
                                   names=['outer', 'inner'])
# Create a DataFrame whose rows are labeled by (outer, inner) pairs
df_multi = pd.DataFrame({'value': range(6)}, index=index)
print(df_multi)
# Select every row under the outer label 'y'
print("\nRows for outer label 'y':")
print(df_multi.loc['y'])
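As a further sketch of what a hierarchical index allows, the example below uses hypothetical sales figures keyed by region and quarter. It shows selection with a full key, selection with a partial key, and unstack to pivot the inner level into columns:

```python
import pandas as pd

# Hypothetical sales figures indexed by (region, quarter)
idx = pd.MultiIndex.from_tuples(
    [('north', 'Q1'), ('north', 'Q2'), ('south', 'Q1'), ('south', 'Q2')],
    names=['region', 'quarter'],
)
sales = pd.Series([100, 120, 80, 95], index=idx)

# Select one value with a full (region, quarter) key
north_q2 = sales.loc[('north', 'Q2')]

# Select all quarters for one region with a partial key
south = sales.loc['south']

# Pivot the inner level into columns to get a two-dimensional table
table = sales.unstack('quarter')

print(north_q2)
print(south)
print(table)
```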
Combining DataFrames
Pandas offers robust tools for combining DataFrames.
pd.concat()
Concatenates pandas objects along a particular axis. By default, it concatenates along rows (axis=0).
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
result = pd.concat([df1, df2])
print(result)
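Note that concat keeps the original row labels by default, so the result above has the index 0, 1, 0, 1. The sketch below shows two common variations on the same inputs: renumbering the rows with ignore_index=True and concatenating along columns with axis=1.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Default: original row labels are kept, so the index is 0, 1, 0, 1
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows 0..3
renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligning on the row index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```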
pd.merge()
Merges two DataFrames based on one or more keys. Similar to SQL joins.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K4'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result_merge = pd.merge(left, right, on='key', how='inner')
print(result_merge)
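With how='inner', only keys found in both DataFrames (K0, K1, and K2 here) survive. To see what an outer join keeps and where each row came from, the following sketch uses trimmed-down versions of the frames above together with the indicator parameter:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K4'],
                      'C': ['C0', 'C1', 'C2', 'C3']})

# Inner join: only keys present in both frames
inner = pd.merge(left, right, on='key', how='inner')

# Outer join: all keys from either frame; the _merge column records the source
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

print(inner)
print(outer)
```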
DataFrame.join()
Joins the columns of another DataFrame. By default, it performs a left join on the index.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=['K0', 'K1', 'K2'])
result_join = left.join(right)
print(result_join)
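The example above uses identical indices, so every row matches. The sketch below uses hypothetical frames whose indices only partly overlap, to show how the how parameter changes the result:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

# Default (how='left'): keep every row of `left`; K1 has no match, so C is NaN
left_join = left.join(right)

# how='inner': keep only index labels present in both frames
inner_join = left.join(right, how='inner')

print(left_join)
print(inner_join)
```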
Resampling
Pandas' time series functionality includes powerful resampling capabilities. Resampling is the process of changing the frequency of your time series data. For example, you can convert daily data to monthly data (downsampling) or monthly data to daily data (upsampling).
# Example: Downsampling to monthly frequency, taking the mean
# Assume 'df' has a DatetimeIndex
# df_monthly = df.resample('M').mean()  # pandas 2.2+ deprecates 'M'; use 'ME' (month end) there
# Example: Upsampling to daily frequency, filling with forward fill
# df_daily = df.resample('D').ffill()
Resampling requires a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or a datetime-like column passed through the on parameter.
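For a version that runs as written, here is a small sketch with made-up daily values. It uses the 'D' and '2D' frequency strings rather than a monthly alias, since the monthly alias differs between pandas versions:

```python
import pandas as pd

# Six days of hypothetical daily values: 1, 2, ..., 6
days = pd.date_range('2023-01-01', periods=6, freq='D')
daily = pd.Series([1, 2, 3, 4, 5, 6], index=days)

# Downsample: sum each two-day bin
two_day = daily.resample('2D').sum()

# Upsample back to daily frequency, carrying the last known value forward
filled = two_day.resample('D').ffill()

print(two_day)
print(filled)
```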
Time Series
Pandas is excellent for time series analysis. It provides tools for generating date ranges, time zone handling, shifting, lagging, and more.
# Create a DatetimeIndex
dates_ts = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
ts = pd.Series(np.random.randn(len(dates_ts)), index=dates_ts)
print(ts)
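Shifting and time zone handling can be sketched in a few lines. The values below are invented for illustration, and the series name ts_demo is hypothetical:

```python
import pandas as pd

# Generate a daily date range and attach made-up values
days = pd.date_range('2023-01-01', periods=4, freq='D')
ts_demo = pd.Series([10.0, 12.0, 9.0, 15.0], index=days)

# Lag the values by one period; the first entry has no predecessor (NaN)
lagged = ts_demo.shift(1)

# Day-over-day change
change = ts_demo - lagged

# Attach a time zone to a naive DatetimeIndex
ts_utc = ts_demo.tz_localize('UTC')

print(change)
print(ts_utc.index.tz)
```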
Categorical Data
Pandas offers specialized categorical data types that are useful for columns with a limited number of possible values (e.g., 'Male', 'Female', 'Other'). This can lead to significant memory savings and performance improvements.
df_cat = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
'B': ['one', 'one', 'two', 'three'],
'C': np.random.randn(4),
'D': np.random.randn(4)})
df_cat['category'] = df_cat['B'].astype('category')
print(df_cat['category'])
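To get a feel for the memory claim, the sketch below compares a plain string column with its categorical version. The data is hypothetical, and the exact byte counts depend on your pandas version and platform, so none are shown:

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct values repeated many times
labels = pd.Series(['Male', 'Female', 'Other'] * 10_000)
as_category = labels.astype('category')

# deep=True counts the memory used by the string objects themselves
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

# Each value is stored as a small integer code pointing into the categories
categories = list(as_category.cat.categories)
codes = as_category.cat.codes.head(3).tolist()
print(categories)
print(codes)
```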
Advanced Grouping
A groupby operation involves one or more of the following steps:
- Splitting the object into pieces based on some criterion.
- Applying a function to each piece independently.
- Combining the results into a data structure.
df_group = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'one', 'two',
'two', 'two', 'one', 'two'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
grouped = df_group.groupby('A')
print(grouped['C'].mean())
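Because the example above uses random numbers, its output changes on every run. The following sketch uses fixed, made-up values so the split-apply-combine steps are easier to follow, and also groups by two keys at once:

```python
import pandas as pd

df_g = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Split by 'A', apply several aggregations to 'C', combine into one table
summary = df_g.groupby('A')['C'].agg(['sum', 'mean'])

# Group by two keys at once; the result has a hierarchical index
by_two = df_g.groupby(['A', 'B'])['C'].sum()

print(summary)
print(by_two)
```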
Plotting
Pandas has basic plotting capabilities built-in, which are powered by Matplotlib. You can easily create plots directly from Series and DataFrames.
# Requires matplotlib to be installed: pip install matplotlib
# import matplotlib.pyplot as plt
# ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
# ts = ts.cumsum()
# ts.plot()
# plt.show() # To display the plot
For more advanced visualizations, consider using libraries like Seaborn or Plotly, which integrate well with Pandas DataFrames.
Performance Considerations
While Pandas is generally fast, there are some common pitfalls that can lead to suboptimal performance:
- Looping over rows: Avoid using .iterrows() or .itertuples() for large datasets. Vectorized operations are significantly faster.
- String concatenation: Use .str.cat() instead of repeated string concatenation in loops.
- Data types: Use appropriate data types (e.g., category for low-cardinality strings, int32 instead of int64 if applicable) to reduce memory usage and potentially speed up operations.
- Reindexing frequently: Repeatedly modifying indices can be costly.
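The first three points can be sketched as follows. The frame df_perf and its contents are hypothetical, and no timings are shown because they depend on your machine:

```python
import numpy as np
import pandas as pd

df_perf = pd.DataFrame({'x': np.arange(1_000, dtype='int64'),
                        'y': np.arange(1_000, dtype='int64')})

# Slow pattern: a Python-level loop over rows
loop_total = [int(row['x'] + row['y']) for _, row in df_perf.iterrows()]

# Preferred pattern: one vectorized operation over whole columns
vector_total = df_perf['x'] + df_perf['y']

# Join strings in one call instead of concatenating inside a loop
joined = pd.Series(['a', 'b', 'c']).str.cat(sep='-')

# A narrower integer dtype halves the memory used by the values
small = df_perf['x'].astype('int32')

print(vector_total.head())
print(joined)
print(df_perf['x'].nbytes, small.nbytes)
```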
API Reference
For a comprehensive list of all Pandas functions, classes, and their parameters, please consult the official Pandas API documentation.