NumPy - Python Data Science & Machine Learning

NumPy: The Foundation of Numerical Computing in Python

NumPy (Numerical Python) is the fundamental package for scientific computing with Python. It provides a high-performance multidimensional array object and tools for working with these arrays. NumPy is the bedrock upon which many other scientific and data analysis libraries in Python are built, including pandas, SciPy, and scikit-learn.

Why NumPy?

NumPy offers several advantages over standard Python lists for numerical operations:

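Speed: Core NumPy operations run in compiled C code over contiguous, typed buffers, so they are typically much faster than equivalent pure-Python loops, especially on large arrays.
Memory Efficiency: A NumPy array stores its elements in one contiguous block of a single fixed-size type, so it typically needs far less memory than a Python list, which holds a pointer to a separate object for each number.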
Convenience: It provides a rich set of mathematical functions and broadcasting capabilities that simplify complex operations.
Vectorization: NumPy enables element-wise operations on entire arrays, eliminating the need for explicit loops.
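
As a rough illustration of the memory point above, the sketch below compares a list of 1,000 integers with an equivalent int64 array. It assumes CPython, where sys.getsizeof reports the list's pointer storage and each integer object separately. The list figure is only an estimate and varies by interpreter and platform, so only the array size is stated.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores pointers, and each integer is a separate Python object.
# Summing the per-object sizes gives a rough upper estimate of the footprint.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw 8-byte values in one contiguous buffer.
array_bytes = arr.nbytes

print(f"List:  about {list_bytes} bytes")
print(f"Array: {array_bytes} bytes")  # 8000 (1000 elements * 8 bytes each)
```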

Creating NumPy Arrays

The core of NumPy is the ndarray object, a powerful N-dimensional array. You can create arrays from Python lists or tuples:

import numpy as np

# From a Python list
a = np.array([1, 2, 3, 4, 5])
print(a)
# Output: [1 2 3 4 5]

# From a list of lists (2D array)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b)
# Output:
# [[1 2 3]
#  [4 5 6]]

# Creating arrays with specific values
zeros_array = np.zeros((2, 3))  # Array of zeros
print(zeros_array)
# Output:
# [[0. 0. 0.]
#  [0. 0. 0.]]

ones_array = np.ones((3, 2))   # Array of ones
print(ones_array)
# Output:
# [[1. 1.]
#  [1. 1.]
#  [1. 1.]]

range_array = np.arange(0, 10, 2) # Array with a range
print(range_array)
# Output: [0 2 4 6 8]

linspace_array = np.linspace(0, 1, 5) # Array with evenly spaced values
print(linspace_array)
# Output: [0.   0.25 0.5  0.75 1.  ]
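
The examples above only use lists, so here is a small sketch of the tuple case, together with how np.array picks a data type. The dtype argument is an optional parameter of np.array that the text above does not cover. It is shown here only to make the type selection explicit.

```python
import numpy as np

# A tuple works the same way as a list
from_tuple = np.array((1, 2, 3))
print(from_tuple)      # [1 2 3]

# Mixing integers and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)     # float64

# The element type can also be requested explicitly
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float)        # [1. 2. 3.]
```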

Array Attributes

NumPy arrays have several useful attributes:

ndim: number of dimensions
shape: size of the array in each dimension (a tuple)
size: total number of elements in the array
dtype: data type of the elements

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(f"Dimensions: {arr.ndim}")    # Output: Dimensions: 2
print(f"Shape: {arr.shape}")      # Output: Shape: (2, 3)
print(f"Size: {arr.size}")        # Output: Size: 6
print(f"Data type: {arr.dtype}")  # Output: Data type: int64 (int32 on Windows with NumPy versions before 2.0)
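
To show how these attributes relate to each other, the following sketch builds a 3D array and checks that size equals the product of the entries in shape. The reshape method is used only to build the example and is not covered in the text above.

```python
import math
import numpy as np

# 24 consecutive integers arranged as 2 blocks of 3 rows and 4 columns
arr = np.arange(24).reshape(2, 3, 4)

print(arr.ndim)    # 3
print(arr.shape)   # (2, 3, 4)
print(arr.size)    # 24

# size is always the product of the dimension lengths in shape
print(arr.size == math.prod(arr.shape))  # True
```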

Basic Operations

NumPy supports element-wise arithmetic operations.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(f"a + b = {a + b}")  # Output: a + b = [5 7 9]
print(f"a * b = {a * b}")  # Output: a * b = [ 4 10 18]
print(f"a - b = {a - b}")  # Output: a - b = [-3 -3 -3]
print(f"a / b = {a / b}")  # Output: a / b = [0.25 0.4  0.5 ]

# Scalar operations
print(f"a * 2 = {a * 2}")  # Output: a * 2 = [2 4 6]
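
A few other operators follow the same element-wise pattern. The sketch below reuses the arrays a and b from the example above and shows floor division, exponentiation, and comparison. Note that true division returns floats even when both inputs are integer arrays.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print((a / b).dtype)  # float64: true division always produces floats
print(b // a)         # [4 2 2]  (floor division)
print(a ** 2)         # [1 4 9]  (element-wise power)
print(a > 1)          # [False  True  True]  (comparison yields a boolean array)
```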

Indexing and Slicing

Accessing and manipulating parts of an array is straightforward.

import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])

# Get the element at index 2
print(arr[2]) # Output: 30

# Get elements from index 1 up to, but not including, index 4
print(arr[1:4]) # Output: [20 30 40]

# For 2D arrays
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Get the element at row index 1, column index 2 (indices are zero-based)
print(arr_2d[1, 2]) # Output: 6

# Get the first row
print(arr_2d[0, :]) # Output: [1 2 3]

# Get the second column
print(arr_2d[:, 1]) # Output: [2 5 8]

# Get a sub-array
print(arr_2d[0:2, 1:3])
# Output:
# [[2 3]
#  [5 6]]
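
One detail worth knowing when manipulating slices: a basic slice is a view that shares memory with the original array, so writing through it changes the original. The sketch below shows this, along with copy() for an independent array and a boolean mask for selecting by condition. Boolean masks are not covered in the text above and appear here only as a related illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])

# A basic slice is a view: it shares memory with arr
view = arr[1:4]
view[0] = 99
print(arr)             # [10 99 30 40 50 60]

# copy() gives an independent array, so arr is left untouched
independent = arr[1:4].copy()
independent[0] = -1
print(arr)             # [10 99 30 40 50 60]

# A boolean mask selects the elements where the condition is True
print(arr[arr > 40])   # [99 50 60]
```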

Broadcasting

Broadcasting is the mechanism that lets NumPy perform arithmetic on arrays of different shapes. NumPy compares the shapes dimension by dimension, starting from the trailing (rightmost) dimension. Two dimensions are compatible when they are equal or when one of them is 1, and an array with fewer dimensions is treated as if its shape were padded with 1s on the left. For example, a (2, 3) array and a (3,) array are compatible, and the result has shape (2, 3).

import numpy as np

# Broadcasting a scalar
arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 10
print(arr + scalar)
# Output:
# [[11 12 13]
#  [14 15 16]]

# Broadcasting a 1D array to a 2D array
row_vector = np.array([10, 20, 30])
print(arr + row_vector)
# Output:
# [[11 22 33]
#  [14 25 36]]
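
The compatibility rule can be seen more fully with a column vector and a row vector, where both operands are stretched, and with a pair of shapes that fails the rule. This is a minimal sketch; the variable names are illustrative only.

```python
import numpy as np

col = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3])          # shape (3,), treated as (1, 3)

# (3, 1) and (1, 3) broadcast to (3, 3)
grid = col + row
print(grid.shape)  # (3, 3)
print(grid)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# Shapes (2, 3) and (2,) are incompatible: the trailing dimensions
# are 3 and 2, which are neither equal nor 1
incompatible = False
try:
    np.ones((2, 3)) + np.ones(2)
except ValueError as exc:
    incompatible = True
    print(f"Incompatible shapes: {exc}")
```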

Advanced Functions

NumPy provides a vast library of mathematical functions:

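Aggregation: np.sum(), np.mean(), np.std(), np.min(), np.max()
Linear Algebra: np.dot(), np.transpose(), np.linalg.inv(), np.linalg.eig()
Random Number Generation: np.random.rand(), np.random.randn(), np.random.randint() (these belong to the legacy interface; new code is generally advised to create a generator with np.random.default_rng())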

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(f"Sum of all elements: {np.sum(arr)}") # Output: Sum of all elements: 21
print(f"Sum along axis 0 (columns): {np.sum(arr, axis=0)}") # Output: Sum along axis 0 (columns): [5 7 9]
print(f"Mean: {np.mean(arr)}") # Output: Mean: 3.5

matrix = np.array([[1, 0], [0, 1]])
print(f"Dot product:\n{np.dot(matrix, matrix)}")
# Output:
# Dot product:
# [[1 0]
#  [0 1]]
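
The linear algebra and random number functions listed above are not exercised in the example, so here is a short sketch of them. The diagonal matrix A is chosen only because its inverse and eigenvalues are easy to check by hand. The random part uses np.random.default_rng() rather than the np.random.rand() family, and the seed value is arbitrary.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(arr, axis=1))   # [ 6 15]  (sum along axis 1, i.e. per row)
print(arr.T.shape)           # (3, 2)   (transpose swaps the axes)

# Inverse of a simple diagonal matrix
A = np.array([[2.0, 0.0], [0.0, 4.0]])
A_inv = np.linalg.inv(A)
print(A_inv)
# [[0.5  0.  ]
#  [0.   0.25]]
print(np.allclose(A @ A_inv, np.eye(2)))  # True

# Eigenvalues of a diagonal matrix are its diagonal entries
eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.sort(eigenvalues))  # [2. 4.]

# Seeded generator: the same seed reproduces the same samples
rng = np.random.default_rng(seed=42)
samples = rng.random((2, 3))  # uniform floats in [0, 1)
print(samples.shape)          # (2, 3)
```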

Vectorized Operations

NumPy's ability to perform operations on entire arrays without explicit loops is called vectorization. This is a key performance advantage.

import numpy as np
import time

# Using a pure-Python list comprehension
list1 = list(range(1000000))
list2 = list(range(1000000))
start_time = time.perf_counter()  # perf_counter() is the clock intended for timing short intervals
result_list = [x + y for x, y in zip(list1, list2)]
end_time = time.perf_counter()
print(f"Loop time: {end_time - start_time:.4f} seconds")

# Using NumPy
arr1 = np.arange(1000000)
arr2 = np.arange(1000000)
start_time = time.perf_counter()
result_arr = arr1 + arr2
end_time = time.perf_counter()
print(f"NumPy time: {end_time - start_time:.4f} seconds")
# Exact timings vary by machine, but the NumPy version is typically many times faster
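
A single timed run can be noisy, so a comparison like the one above is often repeated several times. The sketch below uses the standard library's timeit module for that and first checks that the loop and the vectorized expression produce the same values. The function names and the array size are illustrative choices, and no timings are quoted because they depend on the machine.

```python
import timeit
import numpy as np

n = 100_000
values = list(range(n))
# int64 is requested explicitly so the squares cannot overflow a 32-bit default
arr = np.arange(n, dtype=np.int64)

def squares_loop():
    """Square every element with a pure-Python list comprehension."""
    return [x * x for x in values]

def squares_vectorized():
    """Square every element with one vectorized NumPy expression."""
    return arr * arr

# Both approaches must agree before their speed is compared
results_match = squares_loop() == squares_vectorized().tolist()
print(f"Results match: {results_match}")

# Total time for 10 calls of each function
loop_time = timeit.timeit(squares_loop, number=10)
numpy_time = timeit.timeit(squares_vectorized, number=10)
print(f"Loop:  {loop_time:.4f} seconds for 10 runs")
print(f"NumPy: {numpy_time:.4f} seconds for 10 runs")
```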

NumPy is an indispensable tool for anyone working with data in Python. Its efficiency and extensive functionality make it the go-to library for numerical operations and array manipulation, and a building block for more complex data science tasks.