Statistics 1: Descriptive Statistics

Package/module refs:

  • pandas for storing your data
  • numpy for storing data and for fast descriptive statistics, quantiles, and random number generation
  • scipy.stats for the mode and other statistical functions
  • matplotlib.pyplot for visualizing your data

External Resources:

  • Khan Academy
  • Think Bayes

Progression:

  • Average/Mean (sample vs population)
  • Median + difference from the mean
  • Mode
  • Standard Deviation & variance (sample vs population & N - 1)
  • Standard error
  • Quantiles

The Problem

  • Tons of data with a number of characteristics
  • Want to talk intelligently about data without seeing every single entry
  • Need to know about what’s most common, the ranges of distributions, etc.

The Solution (to start)

DESCRIPTIVE STATISTICS!

Averages/Means - Getting a Feeling for the Data

  • For NUMERICAL measurements, the average/mean gives you a sense of the “middle” of your data. It may not correspond to an actual data point, but rather to where the middle of the data set lies.
  • Consider the set of height measurements (in feet):
Person               Height
Dante Fierro         5.9
Sun Quan             5.5
Spike Spiegel        6.1
Commander Shepard    6.0
The Dragonborn       7.2
  • The mean height is calculated as:

    \[\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{5.9 + 5.5 + 6.1 + 6.0 + 7.2}{5} = 6.14\]

Averages/Means - Do It With Code!

In [1]:
# the height data
heights = [5.9, 5.5, 6.1, 6.0, 7.2]

# explicit sum
avg1 = (5.9 + 5.5 + 6.1 + 6.0 + 7.2) / 5
print("Explicit Average: (5.9 + 5.5 + 6.1 + 6.0 + 7.2) / 5 = {0}\n".format(avg1))

# actually use some Python
avg2 = sum(heights) / len(heights)
print("Better Average: sum(heights) / len(heights) = {0}\n".format(avg2))

# explicitly using the list and a loop
avg3 = 0
for ii in range(len(heights)):
    avg3 += heights[ii]
avg3 = avg3 / len(heights)
print("Explicitly with the list {0}\n".format(avg3))

# get more sophisticated with Numpy
import numpy as np

avg4 = np.mean(heights)
print("Numpy Average: np.mean(heights) = {0}".format(avg4))
Explicit Average: (5.9 + 5.5 + 6.1 + 6.0 + 7.2) / 5 = 6.14

Better Average: sum(heights) / len(heights) = 6.14

Explicitly with the list 6.14

Numpy Average: np.mean(heights) = 6.14

Just stick with Numpy for your means...

The convention for using the numpy package is to import it as np. Your analysis may use any number of numpy functions, and some of them share names with Python built-ins (e.g. sum vs np.sum), so import numpy as np instead of doing from numpy import * and shadowing the built-ins.
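A quick sketch of why the distinction matters (the behavior shown here is standard Python and numpy):

```python
import numpy as np

data = [1, 2, 3]

# the built-in sum works on any iterable
print(sum(data))            # 6

# np.sum does the same job, but adds array features such as an axis argument
arr = np.array([[1, 2], [3, 4]])
print(np.sum(arr))          # total of all elements: 10
print(np.sum(arr, axis=0))  # column sums: [4 6]
```

If you had done from numpy import *, the name sum would silently refer to np.sum everywhere, which is why the np prefix is worth keeping.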

In [2]:
from time import time

np.random.seed(42)

number_measurements = [10, 100, 1000, 10000,
                       100000, 1000000, 10000000, 100000000]
basic_time_taken = []
numpy_time_taken = []

for ii in range(len(number_measurements)):
    dummy_measurements = np.random.random(size=number_measurements[ii])
    t0 = time()
    dummy_calc = sum(dummy_measurements) / len(dummy_measurements)
    basic_time_taken.append(time() - t0)

    t0 = time()
    numpy_calc = np.mean(dummy_measurements)
    numpy_time_taken.append(time() - t0)

del dummy_measurements
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16, 4))
plt.plot(number_measurements, basic_time_taken,
         label="Straight Python", linewidth=5)
plt.plot(number_measurements, numpy_time_taken,
         label="Glorious Numpy", linewidth=5)
plt.plot([1E1, 1E8], [max(basic_time_taken), max(
    basic_time_taken)], linestyle="--", color='k')
plt.plot([1E1, 1E8], [max(numpy_time_taken), max(
    numpy_time_taken)], linestyle="--", color='k')
plt.text(30, 1.1 * max(basic_time_taken), "Basic: %.2f seconds" %
         (max(basic_time_taken)))
plt.text(30, 1.1 * max(numpy_time_taken), "Numpy: %.2f seconds" %
         (max(numpy_time_taken)))
plt.xscale("log")
plt.yscale("log")
plt.xlabel("N measurements")
plt.ylabel("Time (s)")
plt.legend(loc="best")
plt.show()
[Figure: log-log plot of computation time vs. N measurements for "Straight Python" and "Glorious Numpy"]

Always visualize your data

The above example is just the visualization of a time test, but the principle follows that whenever you have data that you want to know the behavior of, visualize it. Here is a basic tutorial on plotting data with matplotlib. It does scatter plots instead of the line plots above, but the principle is the same.

  • plt.figure: sets up your figure space, and lets you control how big your figure is
  • plt.title: puts a title on your figure
  • plt.plot: plots a line (but can also plot points; use “scatter” for that)
  • plt.text: allows you to place text on your graph, provided coordinates and text
  • plt.xscale/plt.yscale: switches between a linear scale (1, 2, 3, 4, 5) and a logarithmic scale (1, 10, 100, 1000)
  • plt.xlim/plt.ylim: sets the limits on the x or y axis.
  • plt.legend: lets you produce a legend for your data if you provide labels
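The calls above can be combined into a minimal, self-contained scatter-plot sketch (the data values here are made up purely for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt

# made-up example measurements
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

plt.figure(figsize=(6, 4))
plt.title("A Minimal Scatter Plot")
sc = plt.scatter(x, y, label="measurements")
plt.xlabel("x value")
plt.ylabel("y value")
plt.xlim(0, 6)
plt.ylim(0, 11)
plt.legend(loc="best")
plt.savefig("scatter_demo.png")
```

In a notebook with %matplotlib inline you would call plt.show() instead of plt.savefig.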

Averages/Means - Limitations

The mean rests on a massive, often bogus assumption: that your data is distributed “normally” around some value:

In [4]:
some_data = np.random.normal(size=1000000)
H, edges = np.histogram(some_data, bins=100)

plt.figure(figsize=(10, 4))
plt.title('The "Normal" Distribution')
plt.plot(edges[:-1], H)
plt.xlim(-4, 4)
plt.xlabel("Some Measurement"); plt.ylabel("Frequency of Measurement")
plt.show()
[Figure: histogram of the "Normal" distribution]

np.histogram - a valuable tool

A histogram shows how many values in a data set fall into each of a set of ranges (bins). np.histogram takes a list or other array-like object, plus the number of bins (or the bin edges themselves), as arguments. It returns the histogrammed data (a numpy array of frequency counts), as well as the edges of each of the bins in that histogram. If there are N bins in your histogram, the edges will be of length N + 1.
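A small sketch of that bin/edge relationship:

```python
import numpy as np

np.random.seed(0)
values = np.random.normal(size=1000)

counts, edges = np.histogram(values, bins=10)
print(len(counts))   # N = 10 bins
print(len(edges))    # N + 1 = 11 edges
print(counts.sum())  # every value lands in some bin: 1000
```

By default the bins span from the minimum to the maximum value, so every measurement is counted exactly once.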

Averages/Means - Limitations

  • Sample averages can be very sensitive to outliers (values significantly outside the norm)
  • Also sensitive to the size of data set
  • Doesn’t accurately describe data with multiple peaks and otherwise non-standard distributions
  • Recall our collections of height
In [5]:
heights = [5.9, 5.5, 6.1, 6.0, 7.2]
print("No outliers average: %.2f\n" % np.mean(heights))

heights = [5.9, 5.5, 6.1, 6.0, 7.2, 1.0]
print("Data + baby average: %.2f\n" % np.mean(heights))

heights = [5.9, 5.5, 6.1, 6.0, 7.2, 10.0]
print("Data + sasquatch average: %.2f" % np.mean(heights))
No outliers average: 6.14

Data + baby average: 5.28

Data + sasquatch average: 6.78

Median - The Literal Middle

  • Sort data from small to large
  • The median = the exact middle
  • If even number of data points, median is average of middle two
  • Also known as the 50th percentile (more later)
  • Advantage: insensitive to outliers
In [6]:
some_values = [0., 1., 2., 3., 4., 5., 6., 7.]
counts = [18, 36, 22, 58, 12, 6, 100]
data = []

for ii in range(len(counts)):
    for jj in range(counts[ii]):
        data.append(some_values[ii])

print("The average: %.2f" % np.mean(data))
print("The actual middle: %.2f" % np.median(data))
The average: 3.70
The actual middle: 3.00
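To see the outlier insensitivity claimed above, here is a quick sketch revisiting the height data with the sasquatch value that pulled the mean from 6.14 up to 6.78:

```python
import numpy as np

heights = [5.9, 5.5, 6.1, 6.0, 7.2]
print("No outliers median: %.2f" % np.median(heights))       # 6.00

# the extreme value barely moves the median
heights = [5.9, 5.5, 6.1, 6.0, 7.2, 10.0]
print("Data + sasquatch median: %.2f" % np.median(heights))  # 6.05
```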

Standard Deviation - The Spread of the Data

  • Based on the average (\(\mu\)) of the data.

  • “On average, how far is each data point from the mean?”

  • Two types to be aware of: population and sample

    • population standard deviation: for when you have every possible measurement for some data set, or you’re only interested in the sample you have and don’t wish to generalize, e.g. ALL the ages of adults in the United States:

      \[\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}\]

    • sample standard deviation: for when you only have a sampling of the total population, e.g. the ages of American males in the Pacific Northwest. When in doubt, assume the sample standard deviation (for sufficiently large N, \(N - 1 \approx N\)):

      \[s = \sqrt{\frac{1}{N - 1}\sum_{i=1}^{N} (x_i - \bar{x})^2}\]

    **note: \(\bar{x}\) is the sample mean, which follows the same equation as the population mean \(\mu\)

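A hand-rolled check of the sample formula against numpy (a sketch; ddof=1 is what makes np.std divide by N - 1 instead of N):

```python
import numpy as np

heights = [5.9, 5.5, 6.1, 6.0, 7.2, 5.1, 5.3, 6.0, 5.8, 6.0]

# sample mean
xbar = sum(heights) / len(heights)

# sample variance divides by N - 1, not N
s2 = sum((x - xbar) ** 2 for x in heights) / (len(heights) - 1)
s = s2 ** 0.5

print("By hand: %.2f" % s)
print("Numpy:   %.2f" % np.std(heights, ddof=1))
```

Both lines should agree with the np.std(heights, ddof=1) result in the next cell.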
In [7]:
heights = [5.9, 5.5, 6.1, 6.0, 7.2, 5.1, 5.3, 6.0, 5.8, 6.0]
print("Sample average: %.2f" % np.mean(heights))
print("Sample standard deviation: %.2f" % np.std(heights, ddof=1))
print("Improper standard deviation: %.2f\n" % np.std(heights))

large_num_heights = np.random.random(size=100000)*1.5 + 5.0
print("Sample average: %.2f" % np.mean(large_num_heights))
print("Sample standard deviation: %.2f" % np.std(large_num_heights, ddof=1))
print("Less improper standard deviation: %.2f" % np.std(large_num_heights))
Sample average: 5.89
Sample standard deviation: 0.57
Improper standard deviation: 0.54

Sample average: 5.75
Sample standard deviation: 0.43
Less improper standard deviation: 0.43
In [8]:
some_data = np.random.normal(size=1000000)
H, edges = np.histogram(some_data, bins=100)

plt.figure(figsize=(10, 4))
plt.title('The "Normal" Distribution with Mean & St. Dev.')
plt.plot(edges[:-1], H)
plt.plot([np.mean(some_data), np.mean(some_data)],
         [0, max(H)], linestyle="--", color="r")
plt.fill_between([np.mean(some_data) - np.std(some_data),
                  np.mean(some_data) + np.std(some_data)],
                 [0, 0],
                 [1.1 * max(H), 1.1 * max(H)], linestyle="--",
                 color='k', alpha=0.25)
plt.xlim(-4, 4)
plt.ylim(0, 1.05 * max(H))
plt.xlabel("Some Measurement")
plt.ylabel("Frequency of Measurement")
plt.show()
[Figure: normal distribution histogram with the mean (dashed red line) and the ±1 standard deviation band shaded]

plt.fill_between: lets you fill in some space on a plot. You provide the range of x values, as well as the vertical sections you wish to fill between (in this case from y = 0 up to just above the histogram's peak). For every x value you provide (in this case, two), you provide y values for the lower limit and the upper limit.

Standard Deviation - The 68-95-99 rule

  • 1 standard deviation from the mean: 68% of data in distribution
  • 2 standard deviations: 95%
  • 3 standard deviations: 99.7%
  • Applies solely to Normally-distributed data (more on that later)
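The rule can be checked empirically on simulated normal data (a sketch; the seed is arbitrary):

```python
import numpy as np

np.random.seed(0)
data = np.random.normal(size=100000)

mu = data.mean()
sigma = data.std()

for k in [1, 2, 3]:
    # fraction of measurements within k standard deviations of the mean
    frac = np.mean(np.abs(data - mu) < k * sigma)
    print("Within %d st. dev(s): %.1f%%" % (k, 100 * frac))
```

The fractions come out near 68%, 95%, and 99.7%, as advertised.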
In [9]:
some_data = np.random.normal(size=1000000)
H, edges = np.histogram(some_data, bins=100)

plt.figure(figsize=(10, 4))
plt.title('The "Normal" Distribution with Mean & St. Devs.')
plt.plot(edges[:-1], H)
plt.plot([np.mean(some_data), np.mean(some_data)], [0, max(H)], linestyle="--", color="r")
plt.fill_between([np.mean(some_data) - 3*np.std(some_data), np.mean(some_data) + 3*np.std(some_data)],
                 [0, 0],
                 [1.1*max(H), 1.1*max(H)], linestyle="--", color='k', alpha=0.25)
plt.fill_between([np.mean(some_data) - 2*np.std(some_data), np.mean(some_data) + 2*np.std(some_data)],
                 [0, 0],
                 [1.1*max(H), 1.1*max(H)], linestyle="--", color='k', alpha=0.25)
plt.fill_between([np.mean(some_data) - np.std(some_data), np.mean(some_data) + np.std(some_data)],
                 [0, 0],
                 [1.1*max(H), 1.1*max(H)], linestyle="--", color='k', alpha=0.25)
plt.xlim(-4, 4)
plt.ylim(0, 1.05*max(H))
plt.xlabel("Some Measurement"); plt.ylabel("Frequency of Measurement")
plt.show()
[Figure: normal distribution histogram with the mean and the 1, 2, and 3 standard deviation bands shaded]

Mode - The most frequent thing(s)

  • Nothing fancy here; the mode is whatever number(s) pop up most often
In [10]:
import scipy.stats as sp

some_values = [0., 1., 2., 3., 4., 5., 6., 7.]
counts1 = [18, 36, 22, 58, 12, 6, 100]
data1 = []

for ii in range(len(counts1)):
    for jj in range(counts1[ii]):
        data1.append(some_values[ii])

print("The mode: %g" % sp.mode(data1)[0])
The mode: 6

scipy’s stats module contains tons of things for statistics, including functions for reproducing some of the distributions we’ll see later
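One example of that breadth: scipy.stats.describe bundles several of the statistics from this lecture into a single call (a sketch on simulated data; the loc/scale values are arbitrary):

```python
import numpy as np
import scipy.stats as sp

np.random.seed(42)
sample = np.random.normal(loc=6.0, scale=0.5, size=1000)

# nobs, minmax, mean, variance, skewness, and kurtosis in one result object
summary = sp.describe(sample)
print(summary.nobs)               # 1000 observations
print("%.2f" % summary.mean)      # close to 6.0
print("%.2f" % summary.variance)  # close to 0.25 (sample variance, N - 1)
```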

Quantiles - Values some fraction of the way into something

  • If I’m some percentage through ordered data, what value do I get?

  • 50th percentile = median = \(q_{50}\)

  • First quartile: \(q_{25}\)

  • Interquartile range (IQR) - the difference between \(q_{75}\) and \(q_{25}\). \(IQR\) gives you a sense of the spread of the data when the data isn’t normally distributed

    \[IQR = q_{75} - q_{25}\]
In [11]:
some_values = [0., 1., 2., 3., 4., 5., 6., 7.]
counts = [18, 36, 22, 58, 12, 6, 100]
data = []

for ii in range(len(counts)):
    for jj in range(counts[ii]):
        data.append(some_values[ii])

print("The average: %.2f" % np.mean(data))
print("The actual middle: %.2f\n" % np.median(data))

print("The standard deviation: %.2f" % np.std(data))
q75, q25 = np.percentile(data, [75, 25])
print("The interquartile width: %.2f" % (q75 - q25))
The average: 3.70
The actual middle: 3.00

The standard deviation: 2.13
The interquartile width: 4.00

Descriptive Statistic Module References