Pandas Series: A Fundamental Data Structure

Welcome to the section on Pandas Series. A Series is a one-dimensional labeled array capable of holding any type of data (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively called the index.

What is a Pandas Series?

A Pandas Series is comparable to a single column of a table. It behaves partly like a one-dimensional NumPy array (values of one dtype that support vectorized operations) and partly like a dictionary (each value is reachable through a label). It is a fundamental data structure in the Pandas library, and each element has an associated index label, which is what makes label-based retrieval and alignment possible.
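The array-like and dictionary-like sides of a Series can be seen in a few lines. This is a minimal sketch; the name s_demo and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: three floats labeled 'x', 'y', 'z'.
s_demo = pd.Series([1.5, 2.5, 3.5], index=['x', 'y', 'z'])

# Array-like: arithmetic is applied to every element at once.
doubled = s_demo * 2

# Dictionary-like: membership tests look at the labels, not the values.
has_y = 'y' in s_demo
y_value = s_demo['y']
labels = list(s_demo.keys())  # keys() returns the index

print(doubled)
print(has_y, y_value, labels)
```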

Creating a Series

You can create a Series from various data structures, including Python lists, NumPy arrays, and dictionaries.

From a Python List

When you create a Series from a list, Pandas automatically assigns a default integer index starting from 0.

import pandas as pd
import numpy as np

# Creating a Series from a list
data_list = [10, 20, 30, 40, 50]
s_list = pd.Series(data_list)
print(s_list)

0    10
1    20
2    30
3    40
4    50
dtype: int64

From a NumPy Array

As with lists, a Series created from a NumPy array gets a default integer index starting from 0.

# Creating a Series from a NumPy array
data_numpy = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
s_numpy = pd.Series(data_numpy)
print(s_numpy)

0    1.1
1    2.2
2    3.3
3    4.4
4    5.5
dtype: float64

From a Dictionary

When creating a Series from a dictionary, the dictionary keys are used as the Series index.

# Creating a Series from a dictionary
data_dict = {'a': 100, 'b': 200, 'c': 300, 'd': 400}
s_dict = pd.Series(data_dict)
print(s_dict)

a    100
b    200
c    300
d    400
dtype: int64

Customizing the Index

You can provide your own index when creating a Series.

# Creating a Series with a custom index
data = [1, 2, 3, 4, 5]
index_labels = ['X', 'Y', 'Z', 'W', 'V']
s_custom_index = pd.Series(data, index=index_labels)
print(s_custom_index)

X    1
Y    2
Z    3
W    4
V    5
dtype: int64
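With a list, the labels passed through index are assigned by position, so the index must have the same length as the data. With a dictionary, index is matched against the keys instead. The sketch below illustrates that second case; the fruit names and prices are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
prices = {'apple': 1.5, 'banana': 0.5, 'cherry': 3.0}

# With a dict, index= selects and reorders entries by key.
# 'durian' is not a key in the dict, so its value becomes NaN,
# and 'banana' is dropped because it is not requested.
s_reindexed = pd.Series(prices, index=['cherry', 'apple', 'durian'])
print(s_reindexed)
```

Because NaN is a floating-point value, the resulting Series has dtype float64.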

Accessing Series Elements

You can access elements in a Series using their index label or their integer position.

Using Index Labels

print(s_dict['b'])
print(s_custom_index['Z'])

200
3

Using Integer Positions (iloc)

The iloc accessor is used for integer-location based indexing.

print(s_list.iloc[2]) # Accessing the element at index position 2
print(s_custom_index.iloc[0]) # Accessing the first element

30
1
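The difference between label-based and position-based access matters most when the labels are themselves integers. The following sketch uses a made-up Series whose integer labels run in the opposite order to the positions, and uses the explicit .loc and .iloc accessors so that the intent is unambiguous.

```python
import pandas as pd

# Hypothetical Series: labels 2, 1, 0 sit at positions 0, 1, 2.
s_int_labels = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s_int_labels.loc[0]      # the element whose label is 0
by_position = s_int_labels.iloc[0]  # the element at the first position

print(by_label)     # 30
print(by_position)  # 10
```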

Series Attributes and Methods

Pandas Series come with a rich set of attributes and methods for data analysis.

Common Attributes

.index: Returns the index of the Series.
.values: Returns the data as a NumPy array (the pandas documentation recommends the .to_numpy() method for this purpose).
.dtype: Returns the data type of the Series.
.shape: Returns a one-element tuple giving the number of elements, for example (5,) for a Series of length 5.
.name: Returns the name of the Series, or None if no name was set.
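The attributes listed above can be inspected on a small Series. This is an illustrative sketch; the name 'scores' and the values are invented, and dtype is passed explicitly so the result does not depend on the platform.

```python
import pandas as pd

# Hypothetical Series with a custom index and a name.
s_scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'],
                     name='scores', dtype='int64')

print(s_scores.index)   # the labels 'a', 'b', 'c'
print(s_scores.values)  # the underlying data as a NumPy array
print(s_scores.dtype)   # int64
print(s_scores.shape)   # (3,)
print(s_scores.name)    # scores
```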

Common Methods

.head(n): Returns the first n elements.
.tail(n): Returns the last n elements.
.describe(): Generates descriptive statistics (count, mean, std, min, max, etc.).
.mean(): Computes the mean.
.sum(): Computes the sum.
.value_counts(): Returns a Series containing counts of unique values.

Let's look at an example using .describe():

# Using .describe() on a Series of numbers
print(s_list.describe())

count     5.000000
mean     30.000000
std      15.811388
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      50.000000
dtype: float64
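The other methods in the list can be tried the same way. The sketch below uses two made-up Series, one numeric and one of strings, to show .head(), .tail(), .sum(), .mean(), and .value_counts().

```python
import pandas as pd

# Hypothetical numeric data.
s_nums = pd.Series([4, 8, 15, 16, 23, 42])

first_two = s_nums.head(2)  # elements at positions 0 and 1
last_two = s_nums.tail(2)   # elements at positions 4 and 5
total = s_nums.sum()        # 108
average = s_nums.mean()     # 18.0

# Hypothetical categorical data.
s_votes = pd.Series(['yes', 'no', 'yes', 'yes', 'no', 'maybe'])

# value_counts() returns a Series indexed by the unique values,
# sorted from the most frequent to the least frequent.
counts = s_votes.value_counts()

print(first_two.tolist(), last_two.tolist(), total, average)
print(counts)
```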

Operations on Series

You can perform various arithmetic operations on Series, and Pandas will align the data based on the index.

Element-wise Operations

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition
print("s1 + s2:")
print(s1 + s2)

# Multiplication
print("\ns1 * 2:")
print(s1 * 2)

s1 + s2:
a     NaN
b    12.0
c    23.0
d     NaN
dtype: float64

s1 * 2:
a    2
b    4
c    6
dtype: int64

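Notice that when adding s1 and s2, labels that appear in only one of the two Series ('a' and 'd') produce NaN (Not a Number) in the result. Because NaN is a floating-point value, the result has dtype float64 even though both inputs hold integers.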
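If NaN for unmatched labels is not the desired outcome, one option is the .add() method with its fill_value parameter, which substitutes a value for a label missing from one side before adding. The sketch below reuses the same s1 and s2 and assumes that treating a missing entry as 0 is appropriate for the data at hand.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# A label missing from one Series is treated as 0 instead of NaN.
total = s1.add(s2, fill_value=0)
print(total)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
# dtype: float64
```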

Filtering Series

You can filter Series based on conditions applied to its values.

# Filtering elements greater than 25
print("Elements greater than 25 in s_list:")
print(s_list[s_list > 25])

# Filtering using index labels
print("\nElements with index 'a' or 'c' in s_dict:")
print(s_dict.loc[['a', 'c']])

Elements greater than 25 in s_list:
2    30
3    40
4    50
dtype: int64

Elements with index 'a' or 'c' in s_dict:
a    100
c    300
dtype: int64
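Conditions can also be combined. A minimal sketch follows, using a made-up Series s. Note that the element-wise operators are & (and), | (or), and ~ (not), and that each comparison needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

in_range = s[(s > 15) & (s < 45)]  # both conditions hold
extremes = s[(s < 15) | (s > 45)]  # either condition holds
not_thirty = s[~(s == 30)]         # condition negated
members = s[s.isin([10, 40])]      # value is one of a given set

print(in_range.tolist())    # [20, 30, 40]
print(extremes.tolist())    # [10, 50]
print(not_thirty.tolist())  # [10, 20, 40, 50]
print(members.tolist())     # [10, 40]
```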

Handling Missing Data

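Pandas uses NaN to represent missing values in numeric data (other markers, such as None, NaT, and pd.NA, appear with other data types). You can detect missing values with .isnull(), which is an alias of .isna(), and remove them with .dropna().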

# Series with missing values
data_missing = [10, 20, np.nan, 40, 50]
s_missing = pd.Series(data_missing)

print("Original Series with NaN:")
print(s_missing)

print("\nChecking for NaN:")
print(s_missing.isnull())

print("\nDropping NaN values:")
print(s_missing.dropna())

Original Series with NaN:
0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

Checking for NaN:
0    False
1    False
2     True
3    False
4    False
dtype: bool

Dropping NaN values:
0    10.0
1    20.0
3    40.0
4    50.0
dtype: float64
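Dropping is not the only option: missing values can also be replaced with .fillna(). The sketch below shows two common choices, a constant and the mean of the remaining values. Which replacement is appropriate depends on the data, so treat both as illustrations rather than recommendations.

```python
import numpy as np
import pandas as pd

s_missing = pd.Series([10, 20, np.nan, 40, 50])

# Count the missing entries (True counts as 1 when summed).
n_missing = s_missing.isna().sum()

# Replace NaN with a constant.
filled_zero = s_missing.fillna(0)

# Replace NaN with the mean; .mean() skips NaN by default,
# so the mean here is (10 + 20 + 40 + 50) / 4 = 30.0.
filled_mean = s_missing.fillna(s_missing.mean())

print(n_missing)
print(filled_zero.tolist())
print(filled_mean.tolist())
```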

Summary

The Pandas Series is a fundamental building block for data manipulation in Python. Understanding how to create, access, operate on, and filter Series is crucial for anyone working with data in the Python ecosystem.

In the next section, we will explore Pandas DataFrames, which are two-dimensional labeled data structures with columns of potentially different types, analogous to a spreadsheet or SQL table.
