MSDN - Python for Data Science and Machine Learning

Pandas: Powerful Data Manipulation and Analysis

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of the NumPy library and is widely used for working with tabular and labeled data in Python.

Key Features

Data Structures

Pandas offers two primary data structures: Series (1D labeled array) and DataFrame (2D labeled data structure with columns of potentially different types).
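As a minimal sketch of the two structures, the snippet below builds one Series and one DataFrame. The labels, column names, and values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: two-dimensional, and each column may hold a different type.
# Column names "name" and "score" are hypothetical.
df = pd.DataFrame({"name": ["x", "y"], "score": [1.5, 2.5]})

print(s["b"])     # look up a Series value by its label
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one dtype per column
```

Here s["b"] looks up a value by label rather than by position, and df.dtypes shows that each column carries its own type.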

Data Input/Output

Easily read data from and write data to various file formats such as CSV, Excel, SQL databases, JSON, and more.
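The following sketch shows a CSV round trip and a JSON export. To keep it self-contained it reads from and writes to in-memory text buffers instead of files; the sample data is invented, and in real use you would normally pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk.
csv_text = "id,name\n1,alpha\n2,beta\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write the DataFrame back out as CSV, omitting the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = buffer.getvalue()

# Serialize the same data as a JSON array with one object per row.
json_text = df.to_json(orient="records")

print(df)
print(json_text)
```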

Data Cleaning and Preparation

Handle missing data, filter, select, merge, reshape, and transform datasets with intuitive syntax.
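A short sketch of three of these operations follows: filling a missing value, merging two tables on a shared key, and filtering rows. The table names, column names, and values are hypothetical, and replacing a missing amount with 0.0 is just one possible policy.

```python
import numpy as np
import pandas as pd

# Two small hypothetical tables that share an "id" column.
sales = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, np.nan, 250.0]})
regions = pd.DataFrame({"id": [1, 2, 3], "region": ["north", "south", "north"]})

# Handle missing data: replace the missing amount with 0.0.
sales["amount"] = sales["amount"].fillna(0.0)

# Merge: attach each sale's region by matching on "id".
merged = sales.merge(regions, on="id", how="left")

# Filter: keep only the rows for the north region.
north = merged[merged["region"] == "north"]

print(merged)
print(north)
```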

Data Alignment

Operations align data automatically by label, which simplifies calculations across datasets whose indexes are in different orders.
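To illustrate label alignment, the sketch below adds two Series whose labels are ordered differently and only partly overlap. The labels and values are made up.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "x", "w"])

# Values are matched by label, not by position:
# "x" pairs 1 with 20, and "z" pairs 3 with 10.
total = a + b

# "w" and "y" each appear in only one Series, so their sums are NaN.
print(total)

# Series.add with fill_value treats a missing label as 0 instead.
filled = a.add(b, fill_value=0)
print(filled)
```

The result index is the union of both sets of labels, so no data is silently dropped or mismatched.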

Time Series Functionality

Robust capabilities for working with time series data, including date range generation and frequency conversion.
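As a small sketch of both capabilities, the snippet below generates a daily date range and then converts the series to a two-day frequency by summing each bin. The dates and values are arbitrary.

```python
import pandas as pd

# Date range generation: six consecutive days starting 2024-01-01.
days = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=days)

# Frequency conversion: downsample to two-day bins and sum each bin.
two_day = daily.resample("2D").sum()

print(two_day)
```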

Performance

Many of Pandas' core algorithms are implemented in C or Cython, so operations applied to whole columns are typically much faster than equivalent loops written in pure Python.
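The sketch below contrasts a whole-column (vectorized) operation with an explicit Python loop that computes the same result. It makes no timing claims; the actual speed difference depends on the data size and environment, so measure it yourself if it matters.

```python
import pandas as pd

s = pd.Series(range(1_000))

# Vectorized: the multiplication and sum run over the whole Series at once.
vectorized = (s * 2).sum()

# Equivalent pure-Python loop, processing one element at a time.
looped = sum(value * 2 for value in s)

print(vectorized == looped)
```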

Getting Started with Pandas

To start using Pandas, you first need to install it. If you're using Anaconda, Pandas is usually included. Otherwise, you can install it using pip:

pip install pandas

Once installed, you can import it into your Python script:

import pandas as pd

Core Concepts

DataFrame: The workhorse of Pandas. Think of it like a spreadsheet or a SQL table. It has rows and columns.
Series: A single column of a DataFrame, or a standalone 1D array with an index.
Index: The labels for the rows of a DataFrame or Series (column labels are stored as an Index too), allowing for fast lookups and alignment.
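The three concepts above can be sketched together. In this example the row labels and the column names "qty" and "price" are hypothetical.

```python
import pandas as pd

# A DataFrame with a custom Index of row labels.
df = pd.DataFrame(
    {"qty": [5, 7, 9], "price": [1.0, 2.0, 3.0]},
    index=["apple", "banana", "cherry"],
)

# Selecting one column returns a Series that shares the DataFrame's Index.
qty = df["qty"]

# The Index allows lookups by label with .loc.
banana_qty = df.loc["banana", "qty"]

# A single row is also returned as a Series, labeled by column name.
cherry = df.loc["cherry"]

print(type(qty).__name__)
print(banana_qty)
print(cherry["price"])
```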

Example: Creating a DataFrame

Here's a simple example that creates a DataFrame, selects a column, and filters rows:

import pandas as pd

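# Each dict key becomes a column name; each list holds that column's values
data = {'col1': [1, 2, 3, 4],
        'col2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)

# Rows get a default integer index starting at 0
print(df)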

# Select a column
print("\nSelecting 'col1':")
print(df['col1'])

# Filter rows
print("\nRows where col1 > 2:")
print(df[df['col1'] > 2])

Output:

   col1 col2
0     1    A
1     2    B
2     3    C
3     4    D

Selecting 'col1':
0    1
1    2
2    3
3    4
Name: col1, dtype: int64

Rows where col1 > 2:
   col1 col2
2     3    C
3     4    D