NumPy Basics for Data Science

Your essential guide to getting started with NumPy.

Introduction to NumPy

Welcome to this foundational guide on NumPy, the cornerstone library for numerical computation in Python. Data science depends on efficient manipulation and analysis of numerical data, and NumPy provides a powerful, flexible N-dimensional array object along with a large collection of mathematical functions that operate on these arrays.

What is NumPy?

NumPy stands for Numerical Python. It is an open-source library that provides a high-performance multidimensional array object and tools for working with these arrays. NumPy is the fundamental package for scientific computing with Python and forms the basis of many other scientific libraries, such as SciPy, pandas, and scikit-learn.

NumPy Arrays

The core of NumPy is the ndarray object: a table of elements, all of the same type, indexed by a tuple of non-negative integers. The number of dimensions (axes) is sometimes called the rank of the array. Because the elements share one type and are stored in a compact block of memory, NumPy arrays are typically much more efficient than Python lists for numerical operations, especially on large datasets.
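
The "same type" rule can be seen directly in a small sketch. The variable names below are illustrative only; the point is that NumPy settles on one common element type, so each element takes the same number of bytes.

```python
import numpy as np

# Mixing integers and a float: NumPy picks one common type for all elements.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)      # float64

# Every element occupies the same number of bytes,
# so the total memory used by the data is easy to predict.
ints = np.array([1, 2, 3], dtype=np.int64)
print(ints.itemsize)    # 8 (bytes per element)
print(ints.nbytes)      # 24 (3 elements * 8 bytes)
```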

Creating Arrays

There are several ways to create NumPy arrays:

From Python Lists

You can convert Python lists into NumPy arrays using the np.array() function.


import numpy as np

# 1-dimensional array
a = np.array([1, 2, 3, 4, 5])
print(a)

# 2-dimensional array
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b)
                    

Zeros and Ones

Create arrays filled with zeros or ones.


# Array of zeros with shape (3, 4)
zeros_array = np.zeros((3, 4))
print(zeros_array)

# Array of ones with shape (2, 3)
ones_array = np.ones((2, 3))
print(ones_array)
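
Both functions produce floating-point arrays unless told otherwise. As a small sketch (variable names are illustrative), the optional dtype argument selects a different element type:

```python
import numpy as np

# Without a dtype argument, the result holds 64-bit floats.
zeros_default = np.zeros((3, 4))
print(zeros_default.dtype)   # float64

# Passing dtype asks for a specific element type instead.
ones_int = np.ones((2, 3), dtype=np.int32)
print(ones_int.dtype)        # int32
```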
                    

arange()

Similar to Python's range(), but returns a NumPy array.


# Array from 0 to 9
sequence_array = np.arange(10)
print(sequence_array)

# Array from 2 to 10 with a step of 2
stepped_array = np.arange(2, 11, 2)
print(stepped_array)
                    

linspace()

Create an array with a specified number of evenly spaced values between a start and an end point. Both endpoints are included by default.


# 5 evenly spaced values between 0 and 1
linear_spaced = np.linspace(0, 1, 5)
print(linear_spaced)
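
Two optional arguments of np.linspace() are worth knowing about. The sketch below (variable names are illustrative) shows endpoint=False, which leaves out the stop value, and retstep=True, which also returns the spacing between values.

```python
import numpy as np

# Default behaviour: both 0 and 1 are part of the result.
closed = np.linspace(0, 1, 5)
print(closed)       # 0, 0.25, 0.5, 0.75, 1

# endpoint=False excludes the stop value, so the spacing becomes 0.2.
half_open = np.linspace(0, 1, 5, endpoint=False)
print(half_open)    # 0, 0.2, 0.4, 0.6, 0.8

# retstep=True returns a (values, step) pair.
values, step = np.linspace(0, 1, 5, retstep=True)
print(step)         # 0.25
```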
                    

Random Arrays

Generate arrays with random numbers.


# Array of random floats in the half-open interval [0, 1), shape (2, 3)
random_floats = np.random.rand(2, 3)
print(random_floats)

# Array of random integers between 0 and 10 (exclusive of 10), shape (3, 2)
random_ints = np.random.randint(0, 10, size=(3, 2))
print(random_ints)
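
The functions above belong to NumPy's older random interface. A hedged alternative, assuming a NumPy version that provides np.random.default_rng(), is to create a Generator object and seed it so results are reproducible. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence every time it is recreated.
rng = np.random.default_rng(seed=42)

floats = rng.random((2, 3))               # floats in [0, 1), shape (2, 3)
ints = rng.integers(0, 10, size=(3, 2))   # integers in [0, 10), shape (3, 2)
print(floats)
print(ints)

# Recreating the generator with the same seed repeats the first draw.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(floats, rng_again.random((2, 3))))  # True
```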
                    

Array Attributes

NumPy arrays have several useful attributes that provide information about the array:

ndim

The number of dimensions (axes) of the array.


arr = np.array([[1, 2], [3, 4]])
print(arr.ndim) # Output: 2
                    

shape

The dimensions of the array, as a tuple of integers. The tuple contains the size of the array along each dimension.


arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # Output: (2, 3)
                    

size

The total number of elements in the array.


arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.size) # Output: 6
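
The three attributes covered so far are related: ndim is the length of shape, and size is the product of the entries in shape. A minimal check, using math.prod from the standard library (Python 3.8 or later):

```python
import math

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# ndim counts the entries in the shape tuple.
print(arr.ndim == len(arr.shape))        # True

# size is the product of the sizes along each dimension: 2 * 3 = 6.
print(arr.size == math.prod(arr.shape))  # True
```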
                    

dtype

The data type of the elements in the array.


arr = np.array([1, 2, 3], dtype=np.int16)
print(arr.dtype) # Output: int16

arr_float = np.array([1.0, 2.5, 3.1])
print(arr_float.dtype) # Output: float64
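
An existing array can also be converted to another type with the astype() method, which returns a new array and leaves the original untouched. In this sketch (illustrative values), converting floats to integers discards the fractional part rather than rounding:

```python
import numpy as np

values = np.array([1.7, 2.2, -3.9])

# astype() builds a new array; fractional parts are truncated toward zero.
as_ints = values.astype(np.int32)
print(as_ints)         # [ 1  2 -3]
print(as_ints.dtype)   # int32

# The original array keeps its type and contents.
print(values.dtype)    # float64
```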
                    

Array Operations

NumPy supports vectorized operations, meaning you can perform operations on entire arrays without explicit looping. This is a major source of its performance advantage.
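
As a rough sketch of what "without explicit looping" means, the two computations below produce the same squares. The first fills the result one element at a time in Python; the second hands the whole array to NumPy in a single expression. Variable names are illustrative.

```python
import numpy as np

values = np.arange(5)            # [0 1 2 3 4]

# Explicit loop: one Python-level operation per element.
squares_loop = np.empty_like(values)
for i in range(values.size):
    squares_loop[i] = values[i] ** 2

# Vectorized: one expression applied to the whole array.
squares_vec = values ** 2
print(squares_vec)                                  # [ 0  1  4  9 16]
print(np.array_equal(squares_loop, squares_vec))    # True
```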

Arithmetic Operations

Element-wise arithmetic operations are straightforward.


a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a + b)  # Element-wise addition: [5 7 9]
print(a - b)  # Element-wise subtraction: [-3 -3 -3]
print(a * b)  # Element-wise multiplication: [ 4 10 18]
print(a / b)  # Element-wise division: [0.25 0.4  0.5 ]

print(a * 3)  # Scalar multiplication: [3 6 9]
                    

Broadcasting

Broadcasting is a powerful mechanism that allows NumPy to work with arrays that have different shapes during arithmetic operations. Conceptually, NumPy stretches the smaller array to match the shape of the larger one, but it does so without actually copying the data.


arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 2

# Broadcasting scalar to the array
print(arr + scalar)
# Output:
# [[3 4 5]
#  [6 7 8]]

vec = np.array([10, 20, 30])

# Broadcasting vector to the array rows
print(arr + vec)
# Output:
# [[11 22 33]
#  [14 25 36]]
                    

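Note: Broadcasting follows specific rules. NumPy compares the two shapes dimension by dimension, starting from the last (rightmost) one. Two dimensions are compatible when their sizes are equal or when one of them is 1. If one array has fewer dimensions, its missing leading dimensions are treated as size 1; a scalar is the simplest such case. If any pair of dimensions is incompatible, NumPy raises a ValueError.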
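
The sketch below walks through one compatible and one incompatible pair of shapes. The array names, and the broadcast_failed flag used to record the error, are illustrative only.

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])              # shape (3,), treated as (1, 3)

# (3, 1) and (1, 3): each pair of dimensions contains a 1, so the result is (3, 3).
result = column + row
print(result.shape)   # (3, 3)
print(result)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# (2, 3) and (2,): the last dimensions are 3 and 2, which are neither equal nor 1.
a = np.ones((2, 3))
b = np.ones((2,))
broadcast_failed = False
try:
    a + b
except ValueError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```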

Indexing and Slicing

Accessing and manipulating parts of NumPy arrays is similar to Python lists but more powerful, especially with multi-dimensional arrays.


arr = np.array([10, 20, 30, 40, 50, 60])

# Accessing elements
print(arr[2])     # Output: 30 (3rd element)
print(arr[-1])    # Output: 60 (last element)

# Slicing
print(arr[1:4])   # Elements from index 1 up to (but not including) 4: [20 30 40]
print(arr[:3])    # Elements from the beginning up to (but not including) index 3: [10 20 30]
print(arr[3:])    # Elements from index 3 to the end: [40 50 60]
print(arr[::2])   # Every second element: [10 30 50]

# 2D array indexing and slicing
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Accessing element at row index 1, column index 2 (zero-based: second row, third column)
print(matrix[1, 2]) # Output: 6

# Getting a row
print(matrix[0, :]) # First row: [1 2 3]

# Getting a column
print(matrix[:, 1]) # Second column: [2 5 8]

# Slicing a sub-matrix
print(matrix[0:2, 1:3])
# Output:
# [[2 3]
#  [5 6]]
                    

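Tip: Basic slicing (with start:stop:step) returns a view into the original array, not a copy, so modifying the view also modifies the original array. Indexing with an integer array or a boolean mask returns a copy instead. Call .copy() on a slice when you need independent data.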
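
A short sketch of the difference, using np.shares_memory() to check whether two arrays refer to the same underlying data. Variable names are illustrative.

```python
import numpy as np

original = np.array([10, 20, 30, 40, 50])

# A basic slice is a view: writing through it changes the original.
view = original[1:4]
view[0] = 99
print(original)                                  # [10 99 30 40 50]
print(np.shares_memory(original, view))          # True

# An explicit copy is independent of the original.
independent = original[1:4].copy()
independent[0] = -1
print(original)                                  # [10 99 30 40 50]
print(np.shares_memory(original, independent))   # False

# Indexing with a list of positions also produces a copy.
picked = original[[0, 2]]
picked[0] = 0
print(original[0])                               # 10
```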

Conclusion

This introduction has covered the fundamental aspects of NumPy, including its array object, creation methods, attributes, basic operations, and indexing/slicing. Mastering these basics will provide a solid foundation for more advanced data science tasks using Python. NumPy's efficiency and expressiveness make it an indispensable tool for any data scientist or numerical programmer.

Continue your journey by exploring NumPy's extensive documentation and practicing these concepts with real-world data.