Microsoft Developer Network - Python for Data Science and Machine Learning
Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It's built on top of the NumPy library and is essential for anyone working with data in Python.
Pandas offers two primary data structures: Series (1D labeled array) and DataFrame (2D labeled data structure with columns of potentially different types).
Pandas reads data from and writes data to various file formats such as CSV, Excel, SQL databases, JSON, and more.
It lets you handle missing data and filter, select, merge, reshape, and transform datasets with intuitive syntax.
It aligns data automatically based on labels, which simplifies operations across datasets with different index orders.
It provides robust capabilities for working with time series data, including date range generation and frequency conversion.
Many of Pandas' core algorithms are implemented in C or Cython, so vectorized operations on whole columns are typically much faster than equivalent pure-Python loops.
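As a rough sketch of a few of the features listed above (label-based alignment, missing-data handling, and time series resampling), the following assumes Pandas and NumPy are installed. The data and variable names are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Two Series with the same labels in different orders (hypothetical data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "a", "b"])

# Addition aligns on labels, not on position.
total = s1 + s2  # a: 1 + 20, b: 2 + 30, c: 3 + 10

# A DataFrame with one missing value in a numeric column.
df = pd.DataFrame({"name": ["x", "y", "z"], "score": [1.5, np.nan, 3.0]})
filled = df.fillna({"score": 0.0})  # replace missing scores with 0.0
dropped = df.dropna()               # or drop rows that contain missing values

# Time series: six daily timestamps, downsampled to 2-day sums.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series(range(6), index=idx)
two_day = daily.resample("2D").sum()

print(total)
print(filled)
print(two_day)
```

Whether to fill or drop missing values depends on the dataset; both are shown here only to illustrate the calls.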
To start using Pandas, you first need to install it. If you're using Anaconda, Pandas is usually included. Otherwise, you can install it using pip:
pip install pandas
Once installed, you can import it into your Python script:
import pandas as pd
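One quick way to confirm that the import works is to print the installed version. This is only a sanity check, and the version string you see will depend on your environment.

```python
import pandas as pd

# Print the installed version to confirm the import succeeded.
print(pd.__version__)
```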
Here's a simple example of creating and manipulating a DataFrame:
import pandas as pd
data = {'col1': [1, 2, 3, 4],
        'col2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)
print(df)
# Select a column
print("\nSelecting 'col1':")
print(df['col1'])
# Filter rows
print("\nRows where col1 > 2:")
print(df[df['col1'] > 2])
Output:
   col1 col2
0     1    A
1     2    B
2     3    C
3     4    D

Selecting 'col1':
0    1
1    2
2    3
3    4
Name: col1, dtype: int64

Rows where col1 > 2:
   col1 col2
2     3    C
3     4    D
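To connect this example with the file I/O and merging features mentioned earlier, here is a minimal sketch that rebuilds the same DataFrame, round-trips it through CSV text, and joins it with a second table. The `labels` table and the `label` and `col1_squared` columns are hypothetical additions, and an in-memory buffer stands in for a real file so the snippet runs without touching disk.

```python
import io

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["A", "B", "C", "D"]})

# Write to CSV text and read it back; StringIO stands in for a file on disk.
csv_text = df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

# Hypothetical lookup table, joined on the shared 'col2' key.
labels = pd.DataFrame({"col2": ["A", "B", "C"],
                       "label": ["alpha", "beta", "gamma"]})

# A left join keeps every row of df; 'D' has no match, so its label is missing.
merged = df.merge(labels, on="col2", how="left")

# Derive a new column with a vectorized operation instead of a Python loop.
merged["col1_squared"] = merged["col1"] ** 2

print(restored)
print(merged)
```

With a real file, you would pass a path such as a CSV filename to `to_csv` and `read_csv` instead of using the buffer.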