PyEarth: A Python Introduction to Earth Science

Lecture 4: Scikit-learn: Supervised Learning: Regression

What is Linear Regression?

A statistical method to model the relationship between variables
Predicts a dependent variable based on one or more independent variables
Assumes a linear relationship between variables

Examples: CO2 Concentration

bg right:65% width:800px

Global Temperature Anomalies

Arctic Sea Ice Extent

bg right:65% width:800px

Sea Level Rise

How to quantify the relationship between variables?
How to predict future values and interpret the results?
How to use Python libraries to model and analyze data?

Basic Statistics: Mean and Standard Deviation

Mean: Average of a set of numbers $\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$\)
Standard Deviation: Measure of spread in a dataset $\(\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$\)

Linear Regression

\[y = Ax + b + \epsilon\]

$y$: Dependent variable
$x$: Independent variable
$A$: Slope
$b$: Intercept
$\epsilon$: Error term

Earthquake Example: 2019 Ridgecrest Earthquakes

July 4, 2019: M6.4 earthquake
July 5, 2019: M7.1 earthquake
Location: Ridgecrest, California, USA

Link: WIKI

Ground Motion Data

Earthquake information: USGS ShakeMap: PGA

bg right:50% w:550px

Accessing Peak Ground Acceleration (PGA) Data

import json
import pandas as pd
import numpy as np
import fsspec

# M6.4 earthquake
json_url = "https://earthquake.usgs.gov/product/shakemap/ci38443183/atlas/1594160017984/download/stationlist.json"
# M7.1 earthquake
# json_url = "https://earthquake.usgs.gov/product/shakemap/ci38457511/atlas/1594160054783/download/stationlist.json"

with fsspec.open(json_url) as f:
    data = json.load(f)

def parse_data(data):
    rows = []
    for line in data["features"]:
        rows.append({
            "station_id": line["id"],
            "longitude": line["geometry"]["coordinates"][0],
            "latitude": line["geometry"]["coordinates"][1],
            "pga": line["properties"]["pga"], # unit: %g
            "pgv": line["properties"]["pgv"], # unit: cm/s
            "distance": line["properties"]["distance"],
        }
    )
    return pd.DataFrame(rows)

data = parse_data(data)
data = data[(data["pga"] != "null") & (data["pgv"] != "null")]
data = data[~data["station_id"].str.startswith("DYFI")]
data = data.dropna()
data = data.sort_values(by="distance", ascending=True)
data["logR"] = data["distance"].apply(lambda x: np.log10(float(x)))
data["logPGA"] = data["pga"].apply(lambda x: np.log10(float(x)))
data["logPGV"] = data["pgv"].apply(lambda x: np.log10(float(x)))
data

Visualizing Peak Ground Acceleration (PGA) Data

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

fig, ax = plt.subplots(figsize=(10, 6), subplot_kw={'projection': ccrs.PlateCarree()})

scatter = ax.scatter(data['longitude'], data['latitude'], c=data['pga'], 
                     cmap='viridis', transform=ccrs.PlateCarree())

ax.add_feature(cfeature.STATES)
ax.add_feature(cfeature.COASTLINE)
ax.set_extent([-120, -116, 34, 37], crs=ccrs.PlateCarree())

plt.colorbar(scatter, label='PGA %g')
plt.title('PGA Distribution - Ridgecrest Earthquake')
plt.show()

Ground Motion Prediction Equations (GMPEs)

\[\log(PGA) = a \log(R) + bM + c\]

PGA: Peak Ground Acceleration
R: Distance from earthquake source
M: Earthquake magnitude
a, b, c: Model parameters

Visualizing log(PGA) vs. log(R)

plt.figure(figsize=(10, 6))
plt.scatter(data['logR'], data['logPGA'])
plt.xlabel('log(R)')
plt.ylabel('log(PGA)')
plt.title('log(PGA) vs. log(R)')
plt.show()

Fitting a Linear Model using Scikit-learn

from sklearn.linear_model import LinearRegression

X = data[['logR']].values
y = data['logPGA'].values

model = LinearRegression()
model.fit(X, y)

print(f"Slope (a): {model.coef_[0]:.4f}")
print(f"Intercept (b): {model.intercept_:.4f}")

Visualizing the Linear Regression Model

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel('log(Distance)')
plt.ylabel('log(PGA)')
plt.title('Linear Regression Model')
plt.legend()
plt.show()

Interpreting the Results

Slope (a): Represents the rate of decay of PGA with distance
Intercept (b): Related to the earthquake's magnitude and local site conditions

Let's predict the shaking in Los Angeles:

la_distance = 177  # km (approximate)
la_logR = np.log10(la_distance)
la_prediction = model.predict([[la_logR]])
print(f"Predicted log(PGA) in LA: {la_prediction[0]:.4f}")
print(f"Predicted PGA in LA: {10**la_prediction[0]:.4f} %g")

Compare the prediction with the data

USGS PGA data: USGS

Why is the prediction different from the observed PGA in Los Angeles?

M7.08 Ridgecrest Simulation

Add the prediction to the plot

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.scatter(la_logR, la_prediction, color='green', label='LA Prediction')
plt.xlabel('log(Distance)')
plt.ylabel('log(PGA)')
plt.title('Linear Regression Model with LA Prediction')
plt.legend()
plt.show()

Residual Analysis

y_pred = model.predict(X)
residuals = y - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(X, residuals)
plt.xlabel('log(Distance)')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()

Evaluating Model Quality: R-squared (R²)

R² measures the proportion of variance in the dependent variable explained by the independent variable(s) $$ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} $$
Ranges from 0 to 1, with 1 indicating a perfect fit

from sklearn.metrics import r2_score

r2 = r2_score(y, y_pred)
print(f"R-squared: {r2:.4f}")

# Calculate R-squared manually
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - y.mean())**2)
r2_manual = 1 - ss_res / ss_tot
print(f"R-squared (manual): {r2_manual:.4f}")

Understanding R²

R² = 0: Model explains none of the variance in the dependent variable
R² = 1: Model explains all the variance in the dependent variable
R² < 0: Model performs worse than a horizontal line

Parameter Uncertainty

Standard error of the slope (a):

\[SE_a = \sqrt{\sigma^2 / \sum (x_i - \bar{x})^2}\]

where $\sigma^2 = \sum (y_i - \hat{y}_i)^2 / (n - p)$, $n$ is the number of observations, and $p$ is the number of predictors

Standard error of the intercept (b):

\[SE_b = \sqrt{\sigma^2 (1/n + \bar{x}^2 / \sum (x_i - \bar{x})^2)}\]

Calculating Parameter Uncertainty

import numpy as np
from scipy import stats

n = len(X)
p = len(model.coef_) + 1
dof = n - p

sigma = np.sum((y - y_pred)**2) / dof

se_a = np.sqrt(sigma / np.sum((X - X.mean())**2))
se_b = np.sqrt(sigma * (1/n + (X.mean()**2) / np.sum((X - X.mean())**2)) )

print(f"Standard error of slope (a): {se_a:.4f}")
print(f"Standard error of intercept (b): {se_b:.4f}")

Summary

Linear regression is a powerful tool for modeling relationships in Earth Science data
We applied it to model ground motion attenuation in earthquakes
Important considerations:
Model assumptions
Residual analysis
Model evaluation (R²)
Parameter uncertainty
Prediction intervals