PyEarth: A Python Introduction to Earth Science

Lecture 2: NumPy and Pandas


Review of Lecture 1:

  • Introduction to Python, Jupyter, and Chatbots
  • Basic data types and structures
  • Control flow, loops, and functions

Introduction to NumPy

  • NumPy: Numerical Python
  • Fundamental package for scientific computing in Python
  • Provides support for large, multi-dimensional arrays and matrices
  • Offers a wide range of mathematical functions
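
A tiny, self-contained taste of those capabilities; the temperature values below are made up purely for illustration:

import numpy as np

# A small multi-dimensional (2x3) array of made-up temperatures
temps = np.array([[12.1, 14.3, 13.8],
                  [11.9, 15.2, 14.0]])

# Mathematical functions apply element-wise to the whole array at once
print(np.round(np.sqrt(temps), 2))
print(temps.shape)  # (2, 3)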

Why NumPy?

  • Efficient: Optimized for performance
  • Versatile: Supports various data types
  • Integrates well with other libraries
  • Essential for data analysis and scientific computing

Linear Algebra and NumPy?

Let's solve the classic "chickens and rabbits in the same cage" problem:

  • There are 35 heads and 94 legs in a cage of chickens and rabbits.
  • How many chickens and rabbits are there?

We can use linear algebra to solve this system of equations:

  1. x + y = 35 (total heads)
  2. 2x + 4y = 94 (total legs)

Where x = number of chickens, y = number of rabbits
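
For reference, the system can also be solved by hand with substitution, which gives us the answer to check the NumPy result against:

\[ y = 35 - x \quad\Rightarrow\quad 2x + 4(35 - x) = 94 \quad\Rightarrow\quad x = 23, \; y = 12 \]

So there should be 23 chickens and 12 rabbits (23 + 12 = 35 heads, 2·23 + 4·12 = 94 legs).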


Matrix Representation

We can represent this system of equations in matrix form:

\[ \begin{bmatrix} 1 & 1 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 35 \\ 94 \end{bmatrix} \]

Or more concisely:

\[ A\vec{x} = \vec{b} \]

Where:

  • \(A\) is the coefficient matrix
  • \(\vec{x}\) is the vector of unknowns (chickens and rabbits)
  • \(\vec{b}\) is the constant vector


Solving with NumPy

import numpy as np

# Define the coefficient matrix A and the constant vector b
A = np.array([[1, 1],   # Coefficients for heads equation
              [2, 4]])  # Coefficients for legs equation
b = np.array([35, 94])  # Constants (total heads and legs)

# Solve the system of equations
solution = np.linalg.solve(A, b)

print(f"Chickens: {int(solution[0])}")
print(f"Rabbits: {int(solution[1])}")

Creating NumPy Arrays

import numpy as np

# From a list
arr1 = np.array([1, 2, 3, 4, 5])

# Using NumPy functions
arr2 = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
arr3 = np.linspace(0, 1, 5)  # [0, 0.25, 0.5, 0.75, 1]
arr4 = np.zeros((3, 3))  # 3x3 array of zeros
arr5 = np.ones((2, 4))  # 2x4 array of ones
arr6 = np.random.rand(3, 3)  # 3x3 array of random values
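
Right after creating an array it is worth inspecting its basic attributes; a minimal look at the arrays defined above:

# Shape, number of dimensions, and element type
print(arr4.shape)  # (3, 3)
print(arr2.ndim)   # 1
print(arr3.dtype)  # float64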

Useful NumPy Functions

  1. Array operations (see the short demo after this list):
     • np.reshape(): Reshape an array
     • np.concatenate(): Join arrays
     • np.split(): Split an array

  2. Mathematical operations:
     • np.sum(), np.mean(), np.std(): Basic statistics
     • np.min(), np.max(): Find minimum and maximum values
     • np.argmin(), np.argmax(): Find indices of min/max values
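
A minimal sketch exercising several of these functions on a toy array (the values are chosen only for illustration):

import numpy as np

a = np.arange(6)                  # [0, 1, 2, 3, 4, 5]
m = np.reshape(a, (2, 3))         # reshape into a 2x3 array
joined = np.concatenate([a, a])   # join two arrays end to end
halves = np.split(a, 2)           # split into two arrays of length 3

print(np.sum(m), np.mean(m), np.std(m))  # 15 2.5 1.707...
print(np.min(a), np.max(a))              # 0 5
print(np.argmin(a), np.argmax(a))        # 0 5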

Useful NumPy Functions (cont.)

  1. Linear algebra (a short sketch follows this list):
     • np.dot(): Matrix multiplication
     • np.linalg.inv(): Matrix inverse
     • np.linalg.eig(): Eigenvalues and eigenvectors

  2. Array manipulation:
     • np.transpose(): Transpose an array
     • np.sort(): Sort an array
     • np.unique(): Find unique elements
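
And a short sketch of the linear-algebra and manipulation helpers, reusing the coefficient matrix from the chickens-and-rabbits problem:

import numpy as np

A = np.array([[1, 1],
              [2, 4]])

print(np.dot(A, A))            # matrix multiplication (A times A)
print(np.linalg.inv(A))        # inverse of A
vals, vecs = np.linalg.eig(A)  # eigenvalues and eigenvectors of A

print(np.transpose(A))             # rows become columns
print(np.sort([3, 1, 2]))          # [1 2 3]
print(np.unique([1, 2, 2, 3, 3]))  # [1 2 3]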

How to Find NumPy Functions

  1. GPT, Claude, and other AI assistants
  2. Use Python's built-in help function: import numpy as np; help(np.array)
  3. Use IPython/Jupyter Notebook's tab completion and the ? operator: np.array?

NumPy vs. Basic Python: Speed Comparison

Let's compare the speed of evaluating the expression x**2 + 2*x + 1 element by element over a large array:

import numpy as np
import time

# Create large arrays
size = 10000000
data = list(range(size))
np_data = np.array(data)

# Python list comprehension
start = time.time()
result_py = [x**2 + 2*x + 1 for x in data]
end = time.time()
print(f"Python time: {end - start:.6f} seconds")

# NumPy vectorized operation
start = time.time()
result_np = np_data**2 + 2*np_data + 1
end = time.time()
print(f"NumPy time: {end - start:.6f} seconds")

NumPy is significantly faster because the vectorized operation runs in optimized, compiled C code rather than a Python-level loop.
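
Single time.time() measurements can be noisy; the standard library's timeit module repeats the measurement for you. A minimal sketch of the same comparison (with a smaller array so it runs quickly):

import timeit

py_time = timeit.timeit("[x**2 + 2*x + 1 for x in data]",
                        setup="data = list(range(100000))", number=10)
np_time = timeit.timeit("np_data**2 + 2*np_data + 1",
                        setup="import numpy as np; np_data = np.arange(100000)",
                        number=10)
print(f"Python: {py_time:.4f} s, NumPy: {np_time:.4f} s (10 repetitions each)")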

Real-world Example: Analyzing Earthquake Data

We'll use NumPy to analyze earthquake data:

import numpy as np

# Load earthquake data; the first column is the UTC datetime, so we read
# only the numeric columns 1-4 as floats
earthquakes = np.loadtxt("data/earthquakes.csv", delimiter=",", skiprows=1,
                         usecols=(1, 2, 3, 4), dtype=float)

# Calculate average magnitude and depth
avg_depth = np.mean(earthquakes[:, 2])
avg_magnitude = np.mean(earthquakes[:, 3])

# Find the strongest earthquake
strongest_idx = np.argmax(earthquakes[:, 3])
strongest_magnitude = earthquakes[strongest_idx, 3]
strongest_depth = earthquakes[strongest_idx, 2]

print(f"Average magnitude: M{avg_magnitude:.2f}")
print(f"Average depth: {avg_depth:.2f} km")
print(f"Strongest earthquake: Magnitude {strongest_magnitude:.2f} at depth {strongest_depth:.2f} km")

Introduction to Pandas

  • Pandas: Python Data Analysis Library
  • Built on top of NumPy
  • Provides high-performance, easy-to-use data structures and tools
  • Essential for data manipulation and analysis

Why Pandas?

  • Handles structured data efficiently
  • Powerful data alignment and merging capabilities
  • Integrates well with other libraries
  • Excellent for handling time series data
  • Built-in tools for reading/writing various file formats

Pandas Data Structures

  1. Series: 1D labeled array
  2. DataFrame: 2D labeled data structure with columns of potentially different types

import numpy as np
import pandas as pd

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.date_range('20230101', periods=4),
    'C': pd.Series(1, index=range(4), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})
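
A quick look at the DataFrame just created shows the "columns of potentially different types" point directly:

# Each column keeps its own dtype
print(df.dtypes)

# First rows and a compact summary
print(df.head(2))
df.info()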

Useful Pandas Functions

  1. Data loading and saving:
     • pd.read_csv(), pd.read_excel(), pd.read_sql()
     • df.to_csv(), df.to_excel(), df.to_sql()

  2. Data inspection (see the short demo after this list):
     • df.head(), df.tail(): View first/last rows
     • df.info(): Summary of DataFrame
     • df.describe(): Statistical summary

  3. Data selection:
     • df['column']: Select a column
     • df.loc[]: Label-based indexing
     • df.iloc[]: Integer-based indexing
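
A minimal sketch of the inspection and selection functions on a small hand-made DataFrame (the station/temp columns are invented purely for illustration):

import pandas as pd

df = pd.DataFrame({"station": ["A", "B", "A", "B"],
                   "temp": [12.5, 14.1, 13.0, 15.2]})

print(df.head())           # first rows
print(df.describe())       # statistical summary of the numeric column
print(df["temp"])          # select a single column
print(df.loc[0, "temp"])   # label-based indexing (row label 0)
print(df.iloc[-1])         # integer-based indexing (last row)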

Useful Pandas Functions (cont.)

  1. Data manipulation (a short sketch follows this list):
     • df.groupby(): Group data
     • df.merge(): Merge DataFrames
     • df.pivot(): Reshape data

  2. Data cleaning:
     • df.dropna(): Drop missing values
     • df.fillna(): Fill missing values
     • df.drop_duplicates(): Remove duplicate rows

  3. Time series functionality:
     • pd.date_range(): Create date ranges
     • df.resample(): Resample time series data
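
And a short sketch of the manipulation, cleaning, and time-series helpers on made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"station": ["A", "B", "A", "B"],
                   "temp": [12.5, np.nan, 13.0, 15.2]})

print(df.groupby("station")["temp"].mean())  # group data by station
print(df.dropna())                           # drop the row with the missing value
print(df.fillna(df["temp"].mean()))          # ...or fill it in instead

# A daily time series resampled to monthly means
ts = pd.Series(range(60), index=pd.date_range("2023-01-01", periods=60))
print(ts.resample("M").mean())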

How to Find Pandas Functions

  1. GPT, Claude, and other AI assistants
  2. Use Python's built-in help function: import pandas as pd; help(pd.DataFrame)
  3. Use IPython/Jupyter Notebook's tab completion and the ? operator: pd.DataFrame?

Pandas vs. NumPy

  • Pandas is built on top of NumPy
  • Pandas adds functionality for handling structured data
  • Pandas excels at:
     • Handling missing data (see the sketch after this list)
     • Data alignment
     • Merging and joining datasets
     • Time series functionality
  • NumPy is better for:
     • Large numerical computations
     • Linear algebra operations
     • When you need ultimate performance
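
One concrete illustration of the missing-data point, with made-up values: NumPy's plain mean propagates NaN, while Pandas skips it by default:

import numpy as np
import pandas as pd

values = [1.0, 2.0, np.nan, 4.0]

print(np.mean(np.array(values)))     # nan (NaN propagates through np.mean)
print(np.nanmean(np.array(values)))  # 2.333... (explicit NaN-aware NumPy function)
print(pd.Series(values).mean())      # 2.333... (Pandas skips NaN by default)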

Real-world Example: Revisit the Earthquake Data

We'll use Pandas to analyze earthquake data this time:

import pandas as pd

# Load earthquake data
df = pd.read_csv("data/earthquakes.csv")

# Calculate average magnitude and depth
avg_depth = df['depth'].mean()
avg_magnitude = df['magnitude'].mean()

# Find the strongest earthquake
strongest_idx = df['magnitude'].idxmax()
strongest_magnitude = df.loc[strongest_idx, 'magnitude']
strongest_depth = df.loc[strongest_idx, 'depth']

print(f"Average magnitude: M{avg_magnitude:.2f}")
print(f"Average depth: {avg_depth:.2f} km")
print(f"Strongest earthquake: Magnitude {strongest_magnitude:.2f} at depth {strongest_depth:.2f} km")

Real-world Example: Analyzing Temperature Data

We'll use Pandas to analyze temperature data:

import pandas as pd
import matplotlib.pyplot as plt

# Load temperature data
df = pd.read_csv("data/global_temperature.csv")

# Convert date column to datetime
df["date"] = pd.to_datetime(df["date"])

# Set date as index
df.set_index("date", inplace=True)

# Find the hottest and coldest days
hottest_day = df["temperature"].idxmax()
coldest_day = df["temperature"].idxmin()

print(f"Hottest day: {hottest_day.date()} ({df.loc[hottest_day, 'temperature']:.1f}°C)")
print(f"Coldest day: {coldest_day.date()} ({df.loc[coldest_day, 'temperature']:.1f}°C)")

# Calculate yearly average temperatures
yearly_avg = df.resample("Y").mean()

# Plot yearly average temperatures
yearly_avg["temperature"].plot(figsize=(12, 6))

plt.title("Yearly Average Temperatures")
plt.ylabel("Temperature (°C)")
plt.show()

Conclusion

  • NumPy and Pandas are essential tools for data analysis in Python
  • NumPy excels at numerical computations and array operations
  • Pandas is great for structured data manipulation and analysis
  • Both libraries integrate well with other scientific Python tools
  • Practice and explore these libraries to become proficient in data analysis!