Statistics 3: More about Tests, Confidence Intervals, Goodness of Fit, and Model Validation

Package/module refs:

  • scipy.stats (norm, ttest_ind, mannwhitneyu)
  • numpy, pandas, matplotlib
More about the t-Test

  • There is far more to t-tests than testing for a generic difference between two distributions
  • What about testing if one distribution is significantly greater than/less than another?
In [1]:
from scipy.stats import norm # scipy's normal distribution module
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

fig = plt.figure(figsize=(10, 4))
x = np.linspace(-4, 4, 101)
y = norm.pdf(x) # probability density function

pct95_1 = x < norm.ppf(0.025); pct975_1 = x < norm.ppf(0.0125); pct99_1 = x < norm.ppf(0.005)
pct95_2 = x > norm.ppf(0.975); pct975_2 = x > norm.ppf(0.9875); pct99_2 = x > norm.ppf(0.995)

ax = plt.subplot(111)
ax.plot(x, y)
opacity = 0.25
shading = "k"
sections = [pct95_1, pct95_2, pct975_1, pct975_2, pct99_1, pct99_2]
for section in sections:
    ax.fill_between(x[section], np.zeros(sum(section)), y[section], color=shading, alpha=opacity)

ax.text(0.1, 0.9, "Two-tailed t-Test", transform=ax.transAxes)
ax.set_ylim(0, 0.41)
plt.show()
../_images/lectures_stats_day3_2_0.png

numpy.zeros(N) produces an array of length N filled with zeros.

One and Two-Tailed t-Tests

  • Two-tailed: the basic test of difference between two samples (or one sample from the population)
  • Left-tailed: testing whether \(\mu_1 < \mu_2\) (i.e., rejecting the null hypothesis that \(\mu_1 \geq \mu_2\))
  • Right-tailed: testing whether \(\mu_1 > \mu_2\) (i.e., rejecting the null hypothesis that \(\mu_1 \leq \mu_2\))
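If your scipy is new enough (1.6+, an assumption about your environment), ttest_ind also accepts an alternative keyword that does the one-tailed bookkeeping for you, instead of halving the two-tailed p-value by hand. A minimal sketch on synthetic data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
sample1 = rng.normal(0.0, 1.0, 500)  # drawn with the lower mean
sample2 = rng.normal(0.5, 1.0, 500)  # drawn with the higher mean

t_two, p_two = ttest_ind(sample1, sample2)                        # two-tailed
t_left, p_left = ttest_ind(sample1, sample2, alternative="less")  # left-tailed

# Same t-statistic either way; because t is negative (it falls in the
# hypothesized tail), the left-tailed p-value is half the two-tailed one.
print("t = %.3f, two-tailed p = %g, left-tailed p = %g" % (t_two, p_two, p_left))
```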
In [2]:
fig = plt.figure(figsize=(10, 4))
x = np.linspace(-4, 4, 101)
y = norm.pdf(x)

pct95 = x < norm.ppf(0.05); pct975 = x < norm.ppf(0.025); pct99 = x < norm.ppf(0.01)
sections = [pct95, pct975, pct99]
ax = plt.subplot(111)
ax.plot(x, y)
for section in sections:
    ax.fill_between(x[section], np.zeros(sum(section)), y[section], color='k', alpha=0.25)

ax.text(0.1, 0.9, "Left-tailed t-Test", transform=ax.transAxes)
ax.set_ylim(0, 0.41)
plt.show()
../_images/lectures_stats_day3_5_0.png

Example

Is the temperature in Seattle during May significantly cooler than during September?

In [3]:
import pandas as pd
import os

data_dir = "../downloads/seattle_weather/"
weather_files = [
    data_dir + f for f in os.listdir(data_dir) if f.startswith("MonthlyHistory")]
seattle_weather = pd.read_csv(weather_files[0])
rename_these = {}
for col in seattle_weather.columns:
    rename_these[col] = col.replace(" ", "_")
rename_these["Mean_TemperatureF"] = "mean_temp"
seattle_weather.rename(columns=rename_these, inplace=True)

for f in weather_files[1:]:
    indata = pd.read_csv(f)
    rename_these = {}
    for col in indata.columns:
        rename_these[col] = col.replace(" ", "_")
        if col == "PDT":
            rename_these[col] = "PST"
        if col == "Mean TemperatureF":
            rename_these[col] = "mean_temp"
    indata.rename(columns=rename_these, inplace=True)
    seattle_weather = pd.concat((seattle_weather, indata), ignore_index=True)

seattle_weather["year"] = seattle_weather.PST.map(lambda x: float(x.split("-")[0]))
seattle_weather["month"] = seattle_weather.PST.map(lambda x: float(x.split("-")[1]))
seattle_weather["day"] = seattle_weather.PST.map(lambda x: float(x.split("-")[2]))
In [4]:
mays = seattle_weather.month == 5
septembers = seattle_weather.month == 9
mean_temp = "mean_temp"

plt.figure(figsize=(10, 4))
plt.hist(seattle_weather[mays][mean_temp], bins=np.arange(40, 76, 2), alpha=0.5, label="Mays")
plt.hist(seattle_weather[septembers][mean_temp], bins=np.arange(40, 76, 2), alpha=0.5, label="Septembers")
plt.legend()
plt.xlabel("Mean Temperature ($^\circ$F)")
plt.ylabel("N days")
plt.show()
../_images/lectures_stats_day3_8_0.png
In [5]:
from scipy.stats import ttest_ind

t_stat, p_value = ttest_ind(seattle_weather[mays]["mean_temp"],
                            seattle_weather[septembers]["mean_temp"])
print("Results:\n\tt-statistic: %.5f\n\tp-value: %g" % (t_stat, p_value / 2))
Results:
        t-statistic: -15.68996
        p-value: 8.98161e-48

We can reject the null hypothesis that, between 2002 and 2012, May in Seattle was not cooler than September. We can reject that hypothesis at the p < 0.01 level. Hell, at the p < 0.00000000001 level. And we know the direction of the difference because the t-statistic is negative: sample 2's (September's) values are consistently larger than sample 1's (May's).

Mann-Whitney U-test

  • Performs similarly to the t-test
  • For when the data are significantly non-normal (the t-test assumes normally distributed samples)
  • Procedure:
    • Take both sets of data and rank them together
    • Take the sum of the ranks from the first sample, and compare it to the sizes of each sample
\[U = N_1N_2 + \displaystyle\frac{N_1(N_1 + 1)}{2} - R_1\]
  • \(R_1\) is the sum of the ranks assigned to sample 1
  • With this form, \(U\) counts the cross-sample pairs in which sample 2's value is the larger one: \(U\) near \(N_1N_2\) means sample 1's distribution is to the left of (lower than) sample 2's, and \(U\) near 0 means it is to the right of (higher than) sample 2's
  • If \(U\) indicates that sample 1 is higher and \(p < \alpha\), then sample 1 is significantly larger than sample 2
  • Ref: scipy.stats.mannwhitneyu
    • Addressed in the ref: the returned p-value is one-sided, so if you're doing a two-tailed test, multiply the output p-value by 2
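To make the rank-sum formula concrete, here is a small sketch that computes \(U\) by hand, using scipy.stats.rankdata for the tie-aware ranking:

```python
import numpy as np
from scipy.stats import rankdata

def mann_whitney_u(sample1, sample2):
    """U = N1*N2 + N1*(N1 + 1)/2 - R1, where R1 is sample 1's rank sum."""
    n1, n2 = len(sample1), len(sample2)
    ranks = rankdata(np.concatenate([sample1, sample2]))  # ties share average ranks
    r1 = ranks[:n1].sum()
    return n1 * n2 + n1 * (n1 + 1) / 2.0 - r1

# Sample 1 sits entirely below sample 2: every one of the N1*N2 = 9
# cross-sample pairs has its larger value in sample 2, so U = 9.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 9.0
# Swapping the samples gives the mirror-image statistic.
print(mann_whitney_u([4, 5, 6], [1, 2, 3]))  # 0.0
```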

Example: Subway Ridership on Rainy and Non-rainy days

Hypothesis: the number of entries per hour in NYC subways is generally different between rainy and non-rainy days

Null: there is no difference between rainy and non-rainy days

In [6]:
subway_data = pd.read_csv("../downloads/turnstile_data_master_with_weather.csv")
rain = subway_data.rain == 1.0
norain = ~rain

plt.figure(figsize=(10, 4))
plt.hist(subway_data.ENTRIESn_hourly[norain], bins=np.arange(0, 20000, 500), color="b")
plt.hist(subway_data.ENTRIESn_hourly[rain], bins=np.arange(0, 20000, 500), color="r")
plt.xlabel("Subway Entries per Hour")
plt.ylabel("N Occurrences")
plt.yscale("log")
plt.show()
../_images/lectures_stats_day3_13_0.png
In [7]:
from scipy.stats import mannwhitneyu

u, p = mannwhitneyu(subway_data[rain].ENTRIESn_hourly, subway_data[norain].ENTRIESn_hourly)
print("Results:\n\tU-statistic: %.5f\n\tp-value: %g" % (u, p * 2))
Results:
        U-statistic: 1949994921.00000
        p-value: 0.0999997

We cannot reject the null hypothesis that there is no difference in ridership: the samples do differ, but not significantly at the usual \(\alpha = 0.05\) level (p ≈ 0.1).

Goodness of Fit

  • Evaluate your model of the data
  • Bias vs variance in model fitting
  • The chi-squared statistic:
\[\chi^2_\text{dof} = \displaystyle\frac{1}{N - k}\sum_{i=1}^N\frac{(x_i - \text{model}_i)^2}{\sigma_i^2}\]
  • \(k\) is the number of model constraints (e.g., fitted parameters), so \(N - k\) is the number of degrees of freedom
  • The goal:
    • have the difference between my data and my model be, on average, the same as the standard deviation (or uncertainty) of the data
    • simpler goal: get \(\chi^2_\text{dof} \approx 1\)
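A quick sanity check of that goal, on made-up data: if the model is the true mean and the assumed \(\sigma\) matches the actual scatter, the reduced chi-squared lands near 1.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0
data = rng.normal(60.0, sigma, 1000)  # synthetic "mean temperatures"

def chi_sq_dof(data, model, std, k=1):
    # average squared residual, in units of sigma^2, per degree of freedom
    return np.sum(((data - model) / std) ** 2) / (len(data) - k)

print(chi_sq_dof(data, data.mean(), sigma))  # close to 1
```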

Example: Divining the mean temperature by brute force

I want to test a number of temperatures to see which one is closest to the average of the data.

In this exercise, my model is the temperature that I’m testing, so I’m testing a range of models against my data.

In [8]:
# the data. clearly it peaks between 60 and 64 degrees
plt.figure(figsize=(10, 4))
plt.hist(seattle_weather[septembers]["mean_temp"], bins=np.arange(50, 76, 2), alpha=0.5)
plt.xlabel("Mean Temperature in Seattle Septembers")
plt.ylabel("N days")
plt.show()
../_images/lectures_stats_day3_18_0.png
In [9]:
# function for chi-squared statistic
def chi_sq(data, model, std, dof=1):
    return sum(((data - model)/std)**2) / (len(data) - dof)
In [10]:
september_temps = seattle_weather[seattle_weather.month == 9]["mean_temp"]
resolution = 5.

fig = plt.figure(figsize=(16, 4))
fig.subplots_adjust(wspace=0)

for ii in range(1, 4):
    ax = plt.subplot(1, 3, ii)
    test_temps = np.arange(50, 76, resolution / ii)
    chi_vals = np.array([chi_sq(september_temps, test, np.std(september_temps, ddof=1)) for test in test_temps])
    ax.plot(test_temps, chi_vals)
    ax.plot([september_temps.mean(), september_temps.mean()], [0, 12], linestyle="--", color='k')
    ax.plot([test_temps[chi_vals.argmin()], test_temps[chi_vals.argmin()]], [0, 12], linestyle="--", color='r')
    ax.text(0.9, 0.1, "min $\chi^2: %.3f$" % min(chi_vals), transform=ax.transAxes, horizontalalignment="right")
    ax.set_xlim(49, 73)
    if ii == 1: ax.set_ylabel("$\chi^2_{dof}$", fontsize=12)
    else: ax.set_yticklabels([])
    ax.set_xlabel("Tests for Average Temperature ($^\circ$F)")

plt.show()
../_images/lectures_stats_day3_20_0.png

Note: the best-fitting model you can calculate is not always the best model for the data you have
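One way to see that in action, as a sketch on synthetic, genuinely linear data: extra model parameters always push the raw \(\chi^2\) down, even when they are only fitting noise, which is exactly why the statistic is normalized by \(N - k\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(0.0, 0.1, x.size)  # truly linear data plus noise
sigma = 0.1

chi2_by_deg = {}
for deg in (1, 5):
    coeffs = np.polyfit(x, y, deg)                   # least-squares polynomial fit
    resid = y - np.polyval(coeffs, x)
    chi2_by_deg[deg] = np.sum((resid / sigma) ** 2)  # raw (unreduced) chi-squared
    print("degree %d: chi2 = %.2f, chi2_dof = %.2f"
          % (deg, chi2_by_deg[deg], chi2_by_deg[deg] / (x.size - (deg + 1))))

# A line is a special case of a quintic, so the degree-5 fit is guaranteed
# a lower raw chi2; that does not make it a better model of this data.
```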