Nathaniel Dake Blog

1. Introduction to Statistics

I have been meaning to get to an introductory statistics post for quite some time now! Statistics play an incredibly important role in modern day machine learning. For instance, while it is a far less "sexy" description, modern day machine learning can most often be reduced to variations of statistical learning, where a statistical model can be defined as follows:

A statistical model is a mathematical model which is modified or trained by the input of data points. Statistical models are often but not always probabilistic.

Now, most of this post will probably be familiar in some capacity, but the main goal is not to introduce new concepts. Rather, it is to make concepts that we will be using frequently more concrete.

1. Population vs. Sample

One of the first concepts that you must have a good understanding of when interacting with statistics and statistical learning is that of a population and a sample. Many analyses and algorithms rest upon the correct usage of these two different groups, so having a good intuition surrounding them both is paramount.

Simply put, a population is defined as follows:

Population: A population includes all of the elements that make up a specified group.

An example of a population in the context of US voting would be all of the citizens who will be voting in the United States. If you are familiar with set theory, you can think of this as the universe.

A sample can then be defined as:

Sample: A subgroup of the population, which can be used to describe the entire population.

In this case, if we selected 100 voters randomly from each state (5,000 voters in total), that could constitute our sample. Continuing our set theory analogy, we can think of this as a subset.

1.1 Why Sample

In general, the main reasons that we would sample from a population and work with that subgroup are as follows:

  • Impossible to get data on the entire population
  • Computationally infeasible to run analysis on the entire population

Now, the main benefit of sampling and being able to work with that subgroup is that we can then make inferences about the entire population. By gathering a sample we can use techniques and metrics described in the coming sections in order to make inferences and generate opinions about the entire population.

1.2 Sampling Visualization

To make this all a bit more concrete, let's say that we have the following data set: 1,000 people whose height and income we know. This can be visualized in the 2-D plot below:

In [1]:
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from matplotlib import rc, animation
import numpy as np
import pandas as pd
from IPython.core.display import HTML

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

fig, ax = plt.subplots(figsize=(8,5))

rng = np.random.RandomState(0)
x = rng.randn(1000)/3 + 5.5    # heights of our population of 1,000 people
y = rng.randn(1000)/0.1 + 80   # incomes of the same 1,000 people
colors = x*y                   # color each point by height * income (purely cosmetic)

ax.scatter(x, y, c=colors, alpha=0.3,
           cmap='viridis')
ax.set_xlabel('Height', fontsize=20)
ax.set_ylabel('Income', fontsize=20)
plt.show()

Now, let's say that we want to sample from this population (for one of the reasons listed above). This would look like:

In [3]:
fig, ax = plt.subplots(figsize=(8,5))

ax.scatter(x, y, c=colors, alpha=0.3,
            cmap='viridis')

# draw a sample of 100 people by picking random indices (note: randint samples with replacement)
sample_idxs = np.random.randint(0, 1000, size=100)
x_sample = x[sample_idxs]
y_sample = y[sample_idxs]
sample_colors = x_sample*y_sample
ax.scatter(x_sample, y_sample, c='red', alpha=0.7)

ax.set_xlabel('Height', fontsize=20)
ax.set_ylabel('Income', fontsize=20)

plt.show()

And if we no longer overlay the original population in the background, we can look at only our freshly sampled data set:

In [4]:
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x_sample, y_sample, c='red', alpha=0.7)
ax.set_xlabel('Height', fontsize=20)
ax.set_ylabel('Income', fontsize=20)
plt.show()

When describing populations and samples, there is additional vocabulary that we must be aware of; specifically:

  • A statistic is used to describe a sample.
  • A parameter is used to describe a population.

Now, we will pause our discussion of sampling here and move on to build up our vocabulary in several other areas of statistics before coming back. Once we have a deeper understanding of a few other areas, the need for sampling and its connection to the population will become much clearer.

2. Descriptive Statistics

With a good understanding of the differences between a population and a sample, we can now move onto descriptive statistics. Simply put:

Descriptive statistics entail the numbers and calculations that we use to summarize raw data.

This is best illustrated by an example. Imagine that we are a political entity that is trying to make decisions concerning public spending. To do this, we deem it important to have a good handle on the income of the population that we are overseeing. There are many different ways that we could go about doing this. Let's say that we are overseeing New York City, to make this more concrete. As of 2019 the population of NYC is listed at 8.6 million people. So, imagine for a second that we are fortunate enough to have income data on every single person (an unrealistic assumption, but one we will hold for now). If we want to understand the income of our population, we could go through each individual row, from the first all the way up to the 8.6 millionth, observing the income of each individual person.

Clearly, I don't think anyone in their right mind would find the above a realistic or well thought out approach! However, it is important to keep in mind that that is our entire goal when dealing with descriptive statistics; the data above can be thought of as the ground truth and we are trying to find simpler ways of representing it so that we as humans can make sense of it. The key here is that there are different ways to represent it, and depending on which we choose we can end up with very different inferences. A few common ways to describe raw data, that nearly everyone has heard of at some point, are through the use of the mean, median, and mode. Each of these has its own spin on how to determine the "middle" of a distribution.

2.1 Mean

Let's reduce our example from the above to something a bit easier to hold in our brains. Say that our population is now a group of 100 people sitting in a bar, and we are trying to make sense of their income to determine if public spending needs to be increased. To make this example easy, assume that they all make 40,000 dollars. In this case we can calculate what is known as the mean, or average, to try and summarize all of the raw incomes:

$$\frac{1}{N} \sum_{i=1}^N x_i$$

Which in this case would just be:

$$\frac{40000 + 40000 + ... + 40000}{100} = 40000$$

In this case the mean provided a great summary of our data set! We originally were dealing with 100 numbers (individual incomes) and we reduced that to a single descriptor, the mean of all of the incomes. Now, of course this was a trivial case since all of the incomes were identical, but it would apply equally well if the incomes were within a range of ~5,000 dollars.

Now, consider the following: imagine that Bill Gates then walks into the bar and is included in our analysis. I think we can still agree that the general population makes approximately 40,000 dollars. Intuitively, that is the summary of the population that we want to arrive at. However, how will having a billionaire included in this calculation change things? Well, we can determine the new mean to be:

$$\frac{40000 + 40000 + ... + 40000 + 1,000,000,000}{101} \approx 9,940,594$$
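As a quick numerical check of that arithmetic, here is a minimal NumPy sketch (the incomes array is of course hypothetical):

import numpy as np

incomes = np.full(100, 40_000)                          # 100 bar patrons, each making $40,000
incomes_with_gates = np.append(incomes, 1_000_000_000)  # Bill Gates walks in

print(np.mean(incomes))             # 40000.0
print(np.mean(incomes_with_gates))  # ~9,940,594: one outlier drags the mean way up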

And just like that we can see the potential drawback and ease of abuse of descriptive statistics! By introducing a simplification we risk misleading ourselves or others about the actual underlying data. Here, we had one outlier, Bill Gates (an outlier is an observation point that is distant from other observations), and that caused our mean income to explode. Based on this value, if you didn't have any access to the underlying data or knowledge of the population, you may think that everyone in the bar makes approximately 10 million dollars.

This brings up two important points, one related to the mean, and one related to statistics in general. As for the mean, we have just shown that it is very sensitive to outliers. This can be a drawback depending on the situation (and can be remedied by using the median, defined in the next section). Concerning statistics in general, we have seen the potential errors that can occur if we make incorrect assumptions (assuming no outliers are present and using the mean), or if we use incorrect statistics and processes altogether.

Statistics are incredibly powerful. In today's world, with incredibly large amounts of data being generated every second, we need ways to reduce that data into bits of knowledge that we as humans can comfortably work with. However, that reduction process requires great care and understanding, or else we can get into rocky waters quickly.

2.2 Median

Now, we just saw the shortcomings of the mean in the bar scenario above. There is a way to remedy this situation, however! What if we were to find the point that divides our distribution in half, so that half of the people lie on one side, and half on the other? In this case, we would order our 101 people by ascending income, which would look like:

Person 1, Person 2, Person 3, ..., Person 50, Person 51 (the median), Person 52, ..., Person 100, Bill Gates
40,000, 40,000, 40,000, ..., 40,000, 40,000, 40,000, ..., 40,000, 1,000,000,000

Ahah! By splitting the data set down the middle (after it was sorted correctly in ascending order) we were able to determine the median, which in this case gives the exact description of the underlying phenomenon that we were looking for. Even with our outlier, Bill Gates, included we still summarize the room correctly. Even if Warren Buffett came in and sat down next to Bill, our median would prove resilient and not change!

Note that in some cases we want to determine the effect of outliers. If several more billionaires entered the room, you can imagine that eventually we would want their presence to be reflected in the descriptive statistic we use to summarize the room. In this case the mean would be useful to include, if only to contrast with the median. By knowing that the mean and median differ greatly, we gain insight into the potential presence of outliers, or the possibility that we are simply dealing with a wide spread (the concept of spread will be explored further in later sections).
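To see the two statistics side by side, here is a small sketch (again with hypothetical bar incomes) showing how the median shrugs off the outliers that distort the mean:

import numpy as np

incomes = np.append(np.full(100, 40_000), 1_000_000_000)   # 100 patrons plus Bill Gates

print(np.median(incomes))   # 40000.0: unaffected by the outlier
print(np.mean(incomes))     # ~9,940,594: dragged up by the outlier

# Add a second billionaire (say, Warren Buffett): the median still does not budge
incomes = np.append(incomes, 1_000_000_000)
print(np.median(incomes))   # 40000.0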

2.3 Mode

Finally, we have the statistic known as the mode. The mode is the value in a data set that occurs most often. In other words, it is the value that is most likely to be sampled. In the bar example, the mode would be 40,000 dollars as well.
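For completeness, the mode is a one-liner with Python's built-in statistics module (hypothetical incomes once more):

from statistics import mode

incomes = [40_000] * 100 + [1_000_000_000]   # the bar, Bill Gates included
print(mode(incomes))                          # 40000, the most frequently occurring value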

A visual comparison of the above in the setting of a continuous distribution can be seen below.

2.4 Standard Deviation

Now, so far we have been discussing statistics that relate entirely to the middle of a data set. However, there is another statistic that can help us describe what may otherwise just be a jumble of numbers: the standard deviation. Before digging into what the standard deviation is, let's see why it is necessary.

Consider the following two groups:

  1. 100 runners participating in the NYC marathon.
  2. 100 people on the next incoming flight to JFK.

We will assume that the mean weight for each group is 150 pounds. Now, if you have ever been in or to a marathon, you may know that most athletes have relatively similar builds and weights. However, the standard plane may consist of babies weighing 10 lbs, and football players weighing over 300 lbs. We intuitively know that the group of 100 on the plane is more spread out, visualized like so:
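A minimal sketch that produces this kind of side-by-side comparison; both groups share a mean of 150 lbs, while the standard deviations (10 lbs for the runners, 40 lbs for the passengers) are assumed purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(1)
runners = rng.normal(150, 10, 100)      # marathon runners: tightly bunched around 150 lbs
passengers = rng.normal(150, 40, 100)   # airplane passengers: far more spread out

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(runners, bins=20, alpha=0.5, label='Marathon runners')
ax.hist(passengers, bins=20, alpha=0.5, label='Airplane passengers')
ax.set_xlabel('Weight (lbs)')
ax.set_ylabel('Count')
ax.legend()
plt.show()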

Above, we see that the marathon runners are tightly bunched around their mean, while the airplane passengers are far more dispersed. If all I told you was that we were dealing with two groups that each had a mean weight of 150 lbs, you may think that they were very similar. That is where the standard deviation comes in!

The standard deviation is a statistic whose purpose is to provide transparency into the dispersion of a data set.

In the case of the marathon runners group, there would be a low standard deviation, while the airplane passengers would have a high standard deviation. That is the beauty of the standard deviation: it allows us to assign a single number to this dispersion around the mean!

As a small note pertaining to the above visualization: if you have ever seen a histogram and wondered exactly why it was necessary, now you know! When dealing with counts along a single dimension (in this case counts of people along the dimension of weight) you quickly run into an issue of many overlapping data points. By allowing the count to become a second dimension (our y axis) we provide a much cleaner visualization.

Now, we can finally go over the exact formula for standard deviation, but I must stress that it is of far less importance than the intuition behind it! With that said, in order to define standard deviation, it helps to first define the closely related statistic of variance, whose exact definition is:

Variance: How far are a set of random numbers spread out from their average value.

We can define variance as:

$$Var(X) = \sigma^2 = \frac{1}{n} \sum_{i=1}^n \big(x_i - \mu\big)^2$$

Where $\mu$ is the mean of $X$. Now, the standard deviation is just the square root of the variance:

$$Std(X) = \sigma = \sqrt{Var(X)} = \sqrt{\frac{1}{n} \sum_{i=1}^n \big(x_i - \mu\big)^2}$$
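As a quick sanity check, the sketch below applies both formulas by hand to some hypothetical passenger weights and compares the results to NumPy's built-in np.var and np.std:

import numpy as np

rng = np.random.RandomState(2)
weights = rng.normal(150, 40, 100)        # hypothetical airplane passenger weights

mu = weights.mean()
variance = np.mean((weights - mu) ** 2)   # (1/n) * sum((x_i - mu)^2)
std = np.sqrt(variance)                   # standard deviation is the square root of the variance

print(variance, np.var(weights))          # the two agree
print(std, np.std(weights))               # so do these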

3. Random Variables

Now, in statistics (as was the case with probability) random variables play a crucial role. Random variables, also known as stochastic variables, can be defined as:

Random Variable: a variable whose possible values are outcomes of a random phenomenon.

More concretely, a random variable is defined as a function that maps the outcomes of an unpredictable process to numerical quantities, typically real numbers. It is a variable (specifically a dependent variable), in the sense that it depends on the outcome of an underlying process providing the input to this function, and it is random in the sense that the underlying process is assumed to be random.

A random variable has a probability distribution, which specifies the probability of its values. Random variables can be discrete or continuous.

For those who would like a mathematical representation (and I do recommend gaining comfort with these representations over time) we can define a function as:

$$f: X \rightarrow Y$$

And subsequently define a random variable as:

$$X : \Omega \rightarrow E$$

Where $\Omega$ represents the set of potential outcomes and $E$ represents the measurable space. Generally, $E$ is real valued, $\Re$. An example could be a die roll, where the potential outcomes are the six faces of the die, and $X$ maps each face to the number it shows. The probability measure then assigns each of these values a number between 0 and 1. This can be shown as:

$$P\big(\{\omega : X(\omega) = x \}\big) = P(X=x)$$
$$P\big(\{\omega : X(\omega) = 1 \}\big) = \frac{1}{6}$$
$$P\big(\{\omega : X(\omega) = 2 \}\big) = \frac{1}{6}$$
$$P\big(\{\omega : X(\omega) = 3 \}\big) = \frac{1}{6}$$
$$P\big(\{\omega : X(\omega) = 4 \}\big) = \frac{1}{6}$$
$$P\big(\{\omega : X(\omega) = 5 \}\big) = \frac{1}{6}$$
$$P\big(\{\omega : X(\omega) = 6 \}\big) = \frac{1}{6}$$

Where all of the above can also be written with the following (more familiar) notation:

$$P(X = 1) = \frac{1}{6}$$
$$P(X = 2) = \frac{1}{6}$$
$$P(X = 3) = \frac{1}{6}$$
$$P(X = 4) = \frac{1}{6}$$
$$P(X = 5) = \frac{1}{6}$$
$$P(X = 6) = \frac{1}{6}$$

3.1 Expected Value and the Law of Large Numbers

Now, with the basics of what a random variable represents (remember, it is a function) in place, we can look at some concepts that build upon this knowledge. Some of the things we are about to talk about are highly related to probability theory; I will again just make a point to state that statistics and probability have many common threads and overlaps, so we must wade into these non-deterministic waters for now.

The first extension of a random variable is its expected value. Intuitively, this can be thought of as the long-run average of repetitions of the same experiment it represents. In other words, whatever random process the random variable is meant to represent (rolling a six sided die, flipping a coin, playing the lottery, drawing from a deck of cards), if we repeated that process many times we would slowly converge on an average or mean value.

The expected value is, simply put, the sum of all possible outcomes, each weighted by its respective probability of occurring.

Mathematically, the expected value can be represented as:

$$E[X] = \sum_{i=1}^k x_i p_i = x_1p_1 + x_2p_2 + ... + x_kp_k$$

Before digging into the details, let's look at 2 examples, one of which is particularly relevant for anyone with a gambling problem.

3.1.1 Expected Value of a Die

The expected value of a die is relatively easy to calculate. Based on our equation presented earlier, we can show that if $X$ is a random variable representing the process of rolling a die:

$$E[X] = 1\cdot\frac{1}{6} + 2\cdot\frac{1}{6} + 3\cdot\frac{1}{6} + 4\cdot\frac{1}{6} + 5\cdot\frac{1}{6} + 6\cdot\frac{1}{6} = 3.5$$

Above, we took each outcome (the number on the die after being rolled) and multiplied it by its chance of occurring (one sixth in the case of a standard die). This results in an expected value of 3.5. Note that because all probabilities add up to 1, this is a weighted average. Now, you may be wondering/looking for proof that this happens in practice. The following simulation should demonstrate that it indeed does:

In [5]:
sample_from_outcomes = np.random.choice
potential_outcomes = [1,2,3,4,5,6]

die_outcomes = sample_from_outcomes(potential_outcomes, 10000)
print('Average after 10,000 simulated die rolls: ', die_outcomes.mean())
Average after 10,000 simulated die rolls:  3.4669
In [6]:
moving_average = []
for i, outcome in enumerate(die_outcomes):
    current_average = die_outcomes[:i+1].mean()
    moving_average.append(current_average)
    
moving_average_array = np.array(moving_average)

import seaborn as sns
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

fig, ax = plt.subplots(figsize=(10,6))

ax.axhline(y=3.5, color='b')
plt.plot(moving_average_array)
ax.set_xlabel('Number of trials', fontsize=20)
ax.set_ylabel('Average', fontsize=20, )
plt.title('Average dice roll vs. Number of trials', fontsize=20)
ax.legend(['Theoretical Expected Value', 'Observed Averages'])

plt.ylim(1,6)
plt.xlim(-100,10000)
plt.show()

Above I simulated the rolling of a die 10,000 times, and after each roll calculated the running average. Clearly, we do converge nicely to the expected value of 3.5!

3.1.2 Expected Value of a Lottery Ticket

One fantastic use case for the expected value is in the case of financial decision making. For instance, let's say that you think playing the lottery is a worthwhile investment based on the chances of winning it big. Well, the expected value can help guide your decision from a numerical view point.

We can model the process of purchasing a lottery ticket and playing (scratching the ticket to see what you won) as a random variable. A random variable, as we know, has an expected value. In this case, that is just the sum of all the different outcomes, each weighted by its probability. Just as in the die example, the expected value represents the average outcome that you will converge to in the long run, if you perform the process behind the random variable many times.

So, let's say that the lottery ticket defines the outcomes and probabilities associated as follows:

Outcome (winnings) Probability
\$2 $\frac{1}{15}$
\$4 $\frac{1}{42.86}$
\$5 $\frac{1}{75}$
\$10 $\frac{1}{200}$
\$25 $\frac{1}{300}$
\$50 $\frac{1}{1589.40}$
\$100 $\frac{1}{8000}$
\$200 $\frac{1}{16000}$
\$500 $\frac{1}{48000}$
\$1000 $\frac{1}{40000}$

And that the lottery ticket costs \$1. Now, we can calculate the expected value of the ticket based on the summation of each outcome times its respective probability:

$$E[X] = 2 \cdot \frac{1}{15} + 4 \cdot \frac{1}{42.86} + 5 \cdot \frac{1}{75} + 10 \cdot \frac{1}{200} + 25 \cdot \frac{1}{300} + 50 \cdot \frac{1}{1589.40} + 100 \cdot \frac{1}{8000} + 200 \cdot \frac{1}{16000} + 500 \cdot \frac{1}{48000} + 1000 \cdot \frac{1}{40000} \approx 0.52 $$

We arrive at an expected value of roughly \$0.52. What can we make of this result? Well, we are essentially being given concrete financial advice! In the long run, if we buy many lottery tickets, we can expect to win about \$0.52 per ticket on average. However, the cost of the ticket is \$1 every time we play. This means that in the long run we can expect to lose money by playing the lottery! We can visualize this just as we visualized the rolling of a die:
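Before running a simulation, we can evaluate that weighted sum directly; a quick sketch using the table above:

outcomes = [2, 4, 5, 10, 25, 50, 100, 200, 500, 1000]
probabilities = [1/15, 1/42.86, 1/75, 1/200, 1/300,
                 1/1589.40, 1/8000, 1/16000, 1/48000, 1/40000]

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(round(expected_value, 2))   # ~0.52, well below the $1 ticket price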

In [7]:
sample_from_outcomes = np.random.choice
potential_outcomes = [0, 2, 4, 5, 10, 25, 50, 100, 200, 500, 1000]
# The first entry is the probability of winning nothing: 1 minus the sum of all the winning probabilities
outcome_probabilities = [0.8874723872122715, (1/15), (1/42.86), (1/75), (1/200), (1/300), (1/1589.40), (1/8000), 
                        (1/16000), (1/48000), (1/40000)]

outcomes = sample_from_outcomes(potential_outcomes, size=100000, p=outcome_probabilities)
print('Average winnings after 100,000 simulated lottery tickets: ', outcomes.mean())
Average winnings after 100,000 simulated lottery tickets:  0.53735
In [8]:
moving_average = []
for i, outcome in enumerate(outcomes):
    current_average = outcomes[:i+1].mean()
    moving_average.append(current_average)
    
moving_average_array = np.array(moving_average)

fig, ax = plt.subplots(figsize=(10,6))

ax.axhline(y=1, color='b')      # cost of a ticket
ax.axhline(y=0.52, color='g')   # theoretical expected value of a ticket
plt.plot(moving_average_array)
ax.set_xlabel('Number of trials', fontsize=20)
ax.set_ylabel('Average Amount Won', fontsize=20, )
plt.title('Average Amount Won vs. Number of trials', fontsize=20)
ax.legend(['Cost of Lottery Ticket','Theoretical Expected Value', 'Observed average winnings'])

plt.show()

The visual above is very enlightening! It can clearly be seen that the average winnings per ticket (a random variable) converge to ~\$0.52 over time, while the cost of the ticket is \$1. If you don't want to lose money in the long run, avoid the lottery.

3.1.3 Law of Large Numbers

So far I have been frequently referring to the concept of performing an experiment many times, and stating that a random variable will converge to its expected value in the long run. This is formalized via the Law of Large Numbers. The law of large numbers states:

The results obtained from a large number of trials concerning a random variable should be close to the expected value, and will tend to become closer as more trials are performed.

The law of large numbers is important because it guarantees stable long-term results for the averages of some random events. For example, while a casino (or the lottery) may lose money on a single spin or ticket, the earnings will tend towards a predictable percentage over a large number of spins.

This same effect is used extensively in the insurance industry, only here it is referred to as expected loss. The idea is the same though! To determine the expected loss of insuring a 19 year old male driver, they calculate (from historical data and mathematical models) the probability that he will get into an accident, and then place a value on what it will cost to fix the car in that case. Let's say (to keep things simple) they determine he has a 10% chance of getting into an accident this year, and that the accident will cost \$3,000 to fix. The expected loss in that case is:

$$3000 \times 0.10 = 300$$

So, the insurance company will then set a monthly rate that guarantees they charge more than that expected loss over the whole of the year. Of course this is incredibly dumbed down, but it was just meant to show how widespread the use of the expected value (and expected loss) really is! For a full proof of the Law of Large Numbers, please take a look at the appendix of this post.

3.2 Variance

We now should have an intuitive grasp on what the expected value is, which allows us to dig into the details a bit further when it comes to variance and standard deviation. To begin, we can define variance with the new vocabulary we have learned concerning the expected value:

Variance: Variance is the expectation of the squared deviation of a random variable from its mean.

Now, the above brings in a very interesting idea: we are taking the expectation of an expression, not simply of a random variable:

$$Var(X) = E \big[(X - \mu)^2 \big]$$

Above we see that we are trying to find the expected value of the expression $(X - \mu)^2$. In other words, if we repeatedly evaluated that expression, what would its average/mean be in the long run? Now, before we get into the derivation, I just want to make sure the following is clear: $X$ is our random variable, and $\mu$ is its mean value. So, to build on what we intuitively mentioned earlier, the variance represents the spread of our data. If the mean of $X$ is 150, but the data is very spread out (airline passengers), then we will end up with a large variance. This intuitively made sense before, and it still does with our updated definition.

With that said, let's start dissecting the above formula. First and foremost, it should be immediately clear that $\mu$, the mean, is simply the expected value of $X$! So we can replace it in the equation to get:

$$Var(X) = E \big[(X - E[X])^2 \big]$$

We can then expand the equation as follows:

$$Var(X) = E \big[X^2 - 2XE[X] + E[X]^2 \big]$$

Now with the help of two basic properties of the expected value, we can reduce this even further. The properties are:

  1. The expected value of a constant is simply the constant. For example, if $X = c$ then $E[X] = c$. Keep in mind that in our general case, $E[X]$ is equal to the mean of $X$, $\mu_X$. In other words, $E[X]$ is itself a constant. So, based on the previous argument, if we take the expectation of an expectation, we simply get the inner expectation: $E\big[E[X]\big] = E[X]$.
  2. The expected value demonstrates linearity. This can be shown via:
    $$E[X+Y] = E[X] + E[Y]$$
    and: $$E[aX] = aE[X]$$

With these two properties in our toolbelt, we can further reduce our equation to:

$$Var(X) = E[X^2] - 2E[X]E[X] + E[X]^2$$

Which can be reduced to the final desired form:

$$Var(X) = E[X^2] - E[X]^2$$

In other words, we see that the variance of $X$ is equivalent to the mean of the square of $X$ minus the square of the mean of $X$. Now, if we think back to our earlier definition of variance, we recall it was:

$$Var(X) = \frac{1}{n} \sum_{i=1}^n \big(x_i - \mu\big)^2$$

Can we get our updated version, which is in terms of expected values, to reduce to this version? Yes! If each of the $n$ observed values $x_i$ is equally likely (probability $\frac{1}{n}$ each), then the expectation is just an average over those values:

$$Var(X) = E \big[(X - \mu)^2 \big] = \sum_{i=1}^n \frac{1}{n} (x_i - \mu)^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$$

These two viewpoints of variance will come up frequently, so it is good to be familiar with both!
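A quick numerical sketch confirming that the two forms agree (simulated die rolls are used as an arbitrary example):

import numpy as np

rng = np.random.RandomState(3)
x = rng.randint(1, 7, 100_000).astype(float)   # simulated die rolls

# Form 1: expectation of the squared deviation from the mean
form_1 = np.mean((x - x.mean()) ** 2)

# Form 2: mean of the square minus the square of the mean
form_2 = np.mean(x ** 2) - x.mean() ** 2

print(form_1, form_2, np.var(x))   # all three agree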

4. Distributions

Now we are ready to move on to distributions. A good additional resource to utilize while trying to internalize distributions would be my introduction to probability post, which touches on distributions from a more probabilistic point of view.

To start, as always, we are going to go through an example. Imagine that we are walking through Times Square in NYC and we want to get an idea of how tall people are. Let's say we collect 1000 measurements in total, the first 10 of which are shown below:

In [12]:
mu, sigma = 5.83, 0.4
normal_dist = np.random.normal(mu, sigma, 1000)
heights = np.around(normal_dist, decimals=2)
df_heights = pd.DataFrame({
     'Height (ft)':heights   
})
df_heights.index.rename('Person', inplace=True)
In [13]:
display(df_heights.head(10))
Height (ft)
Person
0 5.06
1 6.33
2 5.46
3 5.54
4 6.40
5 6.59
6 5.63
7 6.00
8 5.55
9 5.89

Now, up until this point we have been using statistics to make a large amount of raw data more digestible. In other words, and this is such a crucial component of statistics that I must make it clear, we have been intentionally reducing the amount of information we are dealing with in some way in order to make sense of it! Instead of looking at 1000 individual heights, we perform aggregations and reductions (such as finding the mean of all of the heights, or the standard deviation to see how spread out they are). We can gather those statistics for our data set easily:

In [14]:
print('Mean height: ', np.around(heights.mean(), decimals=2), 'ft')
print('Median height: ', np.around(np.median(heights), decimals=2), 'ft')
print('Standard deviation in height: ', np.around(heights.std(), decimals=2), 'ft')
Mean height:  5.84 ft
Median height:  5.83 ft
Standard deviation in height:  0.41 ft

However, there are additional methods that can be used in order to make sense of all our raw data! One of the most common is what is known as a distribution. What exactly is a distribution? Well, let's start by thinking of different ways to categorize/reduce the data that we are dealing with. Maybe we think: "Instead of having 1000 measurements, let's just categorize every person as above or below a certain height. Then, we would end up with only 3 numbers: the threshold height (say our mean height, 5.86 ft), the number of people above that height, and the number below." Visually, this would look like:

In [15]:
def normal_dist_experiment(num_bins, y_label, title, normed=False, plot_continuous_dist=False):
    fig = plt.figure(figsize=(8,5))

    count, bins, ignored = plt.hist(
        heights,
        num_bins,
        density=normed,
        color='forestgreen',
        alpha=0.5,
        linewidth=1,
        edgecolor='black'
    )
    if plot_continuous_dist:
        plt.plot(
            bins,
            1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
            linewidth=2,
            color='r'
        )
    plt.xlabel(f'Height (ft)')
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()
    
normal_dist_experiment(2, 'Number of People in group', f'Number of people in group with {2} bins')

Excellent, so we can see that from our height range of ~4.6 to 5.86 ft we have ~440 people, and in the range 5.86 to 7.2 ft we have ~560 people. The total number of people is of course 1000. In other words, we have found a reduced set of numbers to describe our data, along with a visualization.

Now, we can also use this same visualization with a different set of units on the y axis. We can update the y axis to reflect the probability that a person will have a height of less than or greater than 5.86. This would look like:

In [16]:
normal_dist_experiment(2, 'Probability of being in group', f'Probability group with {2} bins', normed=True)

What we see here is that our visual looks nearly identical, except for the y axis. We now have probabilities on the y axis, representing the probability that a random person will be above or below a height of 5.86 ft. Note that the total area of the two bins must equal 1, since we need to ensure that this is a valid probability distribution.

Now, I will admit that in this case 2 groups (bins) isn't the most valuable way to look at our data. What if we increased the number of bins that we are grouping our heights into? Let's try using 5 groups now:

In [17]:
normal_dist_experiment(5, 'Number of People in group', f'Number of people in group with {5} bins')
In [18]:
normal_dist_experiment(5, 'Probability of being in group', f'Probability group with {5} bins', normed=True)

Interesting, it appears as though most of our data is centered around the mean (5.86 ft), with heights becoming less likely the further we get from the mean. What if we increased the number of bins to 10?

In [19]:
normal_dist_experiment(10, 'Number of People in group', f'Number of people in group with {10} bins')
In [20]:
normal_dist_experiment(10, 'Probability of being in group', f'Probability group with {10} bins', normed=True)

Hmm, a distinct shape seems to be appearing. What about 25 bins?

In [21]:
normal_dist_experiment(25, 'Number of People in group', f'Number of people in group with {25} bins')
In [22]:
normal_dist_experiment(25, 'Probability of being in group', f'Probability group with {25} bins', normed=True)

What this visualization is doing is giving us a sense of how likely it is that we are going to measure someone really tall, really short, or close to the average height. In other words, just as our single-number statistics, such as the mean, gave us a way of reducing our raw data into a more manageable form, this visualization is doing the same thing. The visualization is known as a histogram.

Now, as we make the bins smaller and smaller we are able to get a more accurate and precise estimate of how heights are distributed. This brings us to our informal definition of a distribution:

A distribution allows us to see how a population is distributed.

In the case of height, we want to get a feel for how the heights are distributed. Are they evenly spread out? That would mean it is equally likely to be 5.86 ft tall as it is to be 7.2 ft tall. Clearly this is not the case. Rather, in our case it appears that the heights are distributed in a shape resembling a bell, with heights near the mean being the most likely.

What if we keep increasing the number of bins? Eventually, you can imagine increasing the number of bins to $\infty$, at which point our plot would switch from being a set of discrete bins to a continuous curve:

In [23]:
normal_dist_experiment(
    100,
    'Probability of being in group',
    f'Probability group with {100} bins',
    normed=True,
    plot_continuous_dist=True
)

This curve, shown in red, tells us the exact same thing that the histogram tells us! That is, there is a low probability that we will measure someone shorter than 5 ft or taller than 6.5 ft, while there is a high probability that we will measure someone in the range of 5.7 to 6.2 ft tall.

Both this curve and the histogram are distributions. They show us how the probabilities of measurements are distributed. We find this distribution of probabilities based on counting/grouping of measurements!

So far we have been talking about how heights are distributed, but in reality there are tons of different distributions with tons of different shapes! In general, there is the concept of an underlying function that generates our data. This is slightly abstract, but the idea is that any data set was created/generated by an underlying process. This process can be based on the laws of nature, evolution, financial markets, etc. We can often model this process mathematically. That means that we can say: "Ahah, heights are generated by an underlying function that results in a bell shaped output."

Why is this important? Well, if we know the underlying data generating function, we can get by with fewer measurements and extrapolate based on the ground truth/underlying function. An example may make this more concrete. Sometimes, we may have a large amount of raw data, and that alone can be used to estimate and visualize our distribution:

Now, let's say for instance that we know that height generation has an underlying function/process that results in bell shaped heights. Also, instead of having 1000 measurements, we only managed to get 35. Well, we can take the mean and standard deviation of those 35 measurements, and plug them into our generative function/model, and estimate a reliable final distribution:
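Here is a minimal sketch of that idea: estimate $\mu$ and $\sigma$ from just 35 measurements and plug them into the normal model to recover an entire distribution (the sample size of 35, and the parameters used to simulate the measurements, are assumptions for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.RandomState(4)
small_sample = rng.normal(5.83, 0.4, 35)      # only 35 height measurements

mu_hat, sigma_hat = small_sample.mean(), small_sample.std()

x_axis = np.linspace(4, 8, 500)
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(small_sample, bins=10, density=True, alpha=0.5, color='forestgreen')
ax.plot(x_axis, norm.pdf(x_axis, mu_hat, sigma_hat), color='r',
        label='Normal fit from 35 measurements')
ax.set_xlabel('Height (ft)')
ax.set_ylabel('Probability Density')
ax.legend()
plt.show()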

I bring this up for two reasons:

  1. It occurs incredibly often in statistical learning. We frequently will pick a certain model, and it should be clear that it is a human-made decision to pick that model. There are often inherent assumptions being made when picking that model, and they must be considered and thought through.
  2. If we know the underlying data generation process, we make our lives significantly easier. We often do not have extensive or complete data, and being able to combine this limited data with the generative process/model allows us to get incredibly far without having to gather a ton more data.

4.1 Gaussian/Normal Distribution

We have spoken a great deal about distributions at this point, and in particular I made reference to one that has a bell shape. This distribution has a specific name: the Normal Distribution, or Gaussian Distribution. It is a continuous probability distribution and represents many random variables that are seen in the social and natural sciences (such as height in our previous example). There are two reasons for the Gaussian distribution's great importance:

  1. Because many random variables (remember, random variables are functions) produce normally distributed data, this distribution appears very frequently in the natural and social sciences.
  2. It is used heavily in the central limit theorem, which we will discuss in the next section.

The normal distribution is wonderful in that it can be parameterized by only two values, the mean and standard deviation of the data. Keep in mind that this was determined by experimentation that took place during the past 300 years. This experimentation resulted in the normal distribution equation:

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$$

Where $\mu$ is the mean/expectation of the distribution, $\sigma$ is the standard deviation, and $\sigma^2$ is the variance. Remember, this formula was derived from experimentation over time, and through the use of a lot of data. For those familiar with physics, this is the same type of process that was used to determine Newton's law of Universal Gravitation:

$$F = G \frac{m_1 m_2}{r^2}$$

Where $G$ represents the gravitational constant, $6.674 \times 10^{-11} \frac{m^3}{kg \cdot s^2}$. This constant was determined via experiment, just as the leading term $\frac{1}{\sqrt{2 \pi \sigma^2}}$ was in the normal distribution equation. This is incredibly important to keep in mind, especially if you are new to the field of mathematical modeling.
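To make the formula concrete, the following sketch codes the density by hand and checks it against scipy.stats.norm (note that scipy parameterizes the distribution by the standard deviation $\sigma$, not the variance):

import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma_sq):
    """f(x | mu, sigma^2) exactly as written above."""
    return 1 / np.sqrt(2 * np.pi * sigma_sq) * np.exp(-(x - mu) ** 2 / (2 * sigma_sq))

x = np.array([-1.0, 0.0, 0.5, 2.0])
mu, sigma_sq = 0.0, 0.3

print(normal_pdf(x, mu, sigma_sq))
print(norm.pdf(x, mu, np.sqrt(sigma_sq)))   # same values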

Below, we can see several different curves that we generated based on different values of $\mu$ and $\sigma^2$:

In [24]:
from scipy.stats import norm
fig = plt.figure(figsize=(8,5))
means = [0, 0, 0, -2]
variances = [0.3, 1, 3, 0.5]
x_axis = np.arange(-5, 5, 0.001)
legend = []
for mu, var in zip(means, variances):
    # norm.pdf takes the standard deviation (scale) as its third argument, so pass sqrt(variance)
    plt.plot(x_axis, norm.pdf(x_axis, mu, np.sqrt(var)))
    legend.append(f'$\mu$ = {mu}, $\sigma^2$ = {var}')
plt.xlabel('X')
plt.ylabel('Probability Density')
plt.title('Normal/Gaussian Distribution')
plt.legend(legend)
plt.show()

What this graph actually represents is how likely it is to observe $x$ taking on a specific value, based on the corresponding parameterized normal distribution that represents the world in which $x$ occurs.

5. Central Limit Theorem

I mentioned above that one of the main uses for the Gaussian distribution is in the central limit theorem. The central limit theorem is one of the most important ideas in statistics, and truly in all of mathematics.

The Central Limit Theorem is what allows us to go from working with a sample to making inferences about the underlying population.

To use a concrete example, think back to our die experiments. We simulated rolling a die 10,000 times, and saw that the average did indeed converge to the expected value of 3.5. We can think of those 10,000 die rolls as our population. We know that the expected value (mean) of our population is 3.5. However, what if we don't have data pertaining to 10,000 die rolls? What do we do in that case? Enter the Central Limit Theorem. The central limit theorem works as follows:

  1. We choose a number $n$, which is much smaller than 10,000. Let's say we choose 10. This will represent our sample size.
  2. We roll the die 10 times, getting 10 different outcomes. We take the average of those outcomes. Let's say this average is 3.2. We will refer to this average as $\mu_{x1}$.
  3. We then repeat this process, $k$ times. Let's say we repeat it 50 times. This means that we will end up with 50 means, $\mu_{x1}...\mu_{x50}$.
  4. The central limit theorem states that the distribution of these means is going to be normally distributed. The mean of this normal distribution will converge to the theoretical expected value of the population as the sample size increases (currently the sample size is 10, we could increase it to 20).

Keep in mind that the rolling of a die is not a normally distributed process. It follows a uniform distribution, where each outcome is equally likely. Hence, the fact that the distribution of the sample means ends up being normal shows that the population distribution itself does not need to be normal. This makes the central limit theorem that much more powerful.

From a more technical standpoint, we can say that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution is. This holds especially well for sample sizes over 30. All this is saying is that as you take more samples, especially large ones, your graph of the sample means will look more and more like a normal distribution.

Many real-world observations can be approximated by, and tested against, the same expected pattern: the normal distribution. In this familiar symmetric bell-shaped pattern, most observations are close to average, and there are fewer observations further from the average. The size of flowers, the physiological response to a drug, the breaking force in a batch of steel cables — these and other observations often fit a normal distribution.

There are, however, many important things we would like to measure and test that do not follow a normal distribution. Household income doesn’t – high values are much further from the average than low values are.

But even when raw data does not fit a normal distribution, there is often a normal distribution lurking within it. This makes it possible to still use the normal distribution to test ideas about non-normal data.
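The following sketch illustrates that point, using an exponential distribution as a stand-in for a heavily skewed, income-like population (the scale parameter and sample size are assumptions for illustration): the raw draws are far from normal, yet the histogram of their sample means comes out bell shaped.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(5)
population = rng.exponential(scale=50_000, size=100_000)   # skewed, income-like raw data

# Take 1,000 samples of size 50 and record the mean of each
sample_means = [rng.choice(population, size=50).mean() for _ in range(1_000)]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(population, bins=50, color='forestgreen', alpha=0.6)
axes[0].set_title('Raw population (skewed)')
axes[1].hist(sample_means, bins=50, color='forestgreen', alpha=0.6)
axes[1].set_title('Means of samples of size 50 (roughly normal)')
plt.show()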

Let's take a look at our die example in the following set of simulations. We first use a sample size of 5 die rolls, and for each sample of 5 rolls we take the average value. This process continues for 1000 iterations, and after each set of 5 rolls we generate a new histogram with that average included:

In [86]:
n = 1000 # 1000 simulations (1000 sample means drawn)

def generate_sampled_means(sample_size):
    avg = []
    for i in range(n):                             # draw n sample means in total
        a = np.random.randint(1, 7, sample_size)   # one sample of `sample_size` die rolls
        avg.append(np.average(a))                  # record that sample's mean
    return avg
In [96]:
# Function that will plot the histogram, where current is the latest figure
def clt(current, sampled_means, sample_size, ylimit):
    plt.cla()

    if current == 1000: 
        a.event_source.stop()

    bins = np.arange(1,6,0.2)
    plt.hist(sampled_means[0:current], bins=bins, alpha=0.5, color='forestgreen', linewidth=1, edgecolor='black')

    plt.gca().set_title('Distribution of mean of die rolls')
    plt.gca().set_xlabel(f'Sample means, sample size: {sample_size} die rolls')
    plt.gca().set_ylabel('Frequency')
    plt.ylim(0,ylimit)
    plt.xlim(1,6)
    plt.axvline(x=3.5, color="r")   # theoretical expected value of 3.5
    plt.legend(['Theoretical Expected Value'])
    plt.annotate('Die roll = {}'.format(current), [1.5,250])
In [97]:
def generate_clt_animation_html(sampled_means, sample_size, ylimit):
    fig = plt.figure(figsize=(10,6))
    a = HTML(
        animation.FuncAnimation(
            fig=fig,
            func=clt,
            frames=np.arange(0,len(sampled_means)),
            fargs=(sampled_means, sample_size, ylimit),  # clt expects ylimit as well
            interval=50,
            repeat_delay=2000
        ).to_html5_video()
    )
    plt.close()
    return a
    
def generate_clt_animation_gif(sampled_means, sample_size, ylimit, file_name):
        fig = plt.figure(figsize=(10,6))
        a = animation.FuncAnimation(
                fig=fig,
                func=clt,
                frames=np.arange(0,len(sampled_means)),
                fargs=(sampled_means, sample_size, ylimit),
                interval=50,
                repeat_delay=2000
            )
        a.save(file_name, writer='imagemagick')
        plt.close()
In [93]:
# Sample size of 5
sampled_means = generate_sampled_means(5)
# generate_clt_animation_html(sampled_means, '5', 275)
# generate_clt_animation_gif(sampled_means, '5', 275, 'clt-sample-size-5.gif')
In [94]:
sampled_means = generate_sampled_means(10)
# generate_clt_animation_html(sampled_means, '10', 275)
# generate_clt_animation_gif(sampled_means, '10', 275, 'clt-sample-size-10.gif')
In [98]:
sampled_means = generate_sampled_means(50)
# generate_clt_animation_html(sampled_means, '50', 350)
# generate_clt_animation_gif(sampled_means, '50', 350, 'clt-sample-size-50.gif')

We can see that this does look like a normal distribution, one that has a larger variance and is subsequently much wider. The mean is still approximately our expected value of 3.5. If we increase our sample size to 10 die rolls, we can see that the variance decreases, and we are more confident that our expected value is 3.5. This idea of confidence will be discussed in the next section.

Finally, we can increase the sample size to 50 rolls. Again, we see a normal distribution, but now the variance is much smaller and we have much more confidence in where the true population expected value will fall.

5.1 The Normal Distribution and Standard Error

With the Central Limit Theorem at our disposal, the Normal Distribution becomes much more powerful. By definition we know what proportion of observations will lie within 1 standard deviation above or below the mean (68%), what proportion will lie within 2 standard deviations above or below the mean (95%), and so on. Visually this can be seen below:

Before discussing exactly where those percentages come from, we will introduce an additional concept: the standard error:

The standard error measures the dispersion of the sample means.

In other words, the standard error gives us a way to express quantitatively how tightly we expect the sample means to cluster around the population mean. This can definitely generate some confusion, seeing as the difference between the standard deviation and the standard error is subtle. So, to break it down:

  1. The standard deviation measures dispersion in the underlying population.
  2. The standard error measures the dispersion of the sample means.
  3. Hence, the standard error is the standard deviation of the sample means!

So, we could say that the experiment above with 5 die rolls per sample had a large standard error, while the experiment with 50 die rolls per sample had a small standard error. This pattern will always hold: sample means will cluster more tightly around the population mean as the size of each sample gets larger. At the same time, sample means will cluster less tightly around the population mean when the underlying population is more spread out. Mathematically, we define the standard error as:

$$SE = \frac{s}{\sqrt{n}}$$

Where $s$ is the standard deviation of the population and $n$ is the size of each sample. At this point we are finally reaching the payoff for all of this work. Because sample means are distributed normally around the population mean, we can harness the power of the normal curve. Based on the earlier visual, we expect that 68% of sample means will lie within one standard error of the population mean, 95% of sample means will lie within two standard errors, and 99.7% will lie within three standard errors. Note, you may ask where these percentages come from; well, we are able to calculate them based on the fact that the entire area under the curve must equal 1. Making use of integrals from calculus, we can find the area under the curve within one standard deviation of the mean like so:

$$\int_{\mu - \sigma}^{\mu + \sigma} f(x \mid \mu, \sigma^2) \, dx$$

This area is going to be ~0.68; dividing by the total area of 1 gives the percentage of the area under the curve within $[\mu - \sigma, \mu + \sigma]$, namely 68%. This is an incredibly powerful feature and the main reason the central limit theorem and normal distribution are so valuable. We are going to discuss this in depth in the next post; however, here is an example to get us started. Say that we have two groups, $A$ and $B$, who are identical in all ways except that they are shown differently styled webpages when checking out on an ecommerce site. If we observe that 10% of group $A$ end up making a purchase, and 12% of group $B$ make a purchase, we can then use everything we have just talked about to see if the different pages shown to $A$ and $B$ created a statistically significant difference in purchase likelihood. This would allow the site to find the optimal page configuration, assuming it wants to maximize purchases.
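Those coverage percentages can be recovered directly from the normal curve; a quick sketch using scipy's cumulative distribution function in place of the integral above:

from scipy.stats import norm

# Area under a standard normal curve within k standard deviations of the mean
for k in [1, 2, 3]:
    area = norm.cdf(k) - norm.cdf(-k)
    print(f'Within {k} standard deviation(s): {area:.4f}')
# ~0.6827, ~0.9545, ~0.9973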

At the risk of this post becoming far too long, I am going to leave off here and pick up in my post on Statistical Inference.

Appendix

A.1 The Weak Law of Large Numbers

This section will serve as a derivation of the Weak Law of Large Numbers, and is meant to provide a more concrete vantage point, in conjunction with the intuitive explanations from earlier.

To start, we will assume that we have a random variable, $X$, with a probability distribution that has a certain (finite) mean, $\mu$, and variance, $\sigma^2$. We then draw $n$ independent and identically distributed random variables from $X$:

$$\{ x_1, x_2,...,x_n\}$$

After drawing $n$ samples from $X$ we have a sample mean:

$$\text{sample mean} \longrightarrow M_n = \frac{x_1 + ... + x_n}{n}$$

Where the sample mean is a random variable, because it is a function of random variables. This should be distinguished from the true mean, $\mu$, which is a number; it is not random:

$$\text{true mean} \longrightarrow E[X]$$

Now, the sample mean is the most natural way to estimate the true mean. The weak law of large numbers provides support for this notion. We can calculate the expectation of the sample mean:

$$E[M_n]$$

Note, there are two different types of averaging going on here:

  1. First, the sample mean averages over the values observed in one long experiment
  2. Secondly, the expectation averages over all possible outcomes of the experiment. It is a theoretical average.

Expectation of Sample Mean

We can find the expectation as follows:

$$E[M_n] = \frac{E[x_1 + x_2 + ... + x_n]}{n}$$

Then, using linearity we can apply the fact that $E[x_i] = \mu$:

$$E[M_n] = \frac{n \mu}{n} = \mu$$

Variance of Sample Mean

We can now calculate the variance of the sample mean, $var(M_n)$:

$$Var\Big(\frac{x_1 + ... + x_n}{n}\Big)$$

The variance of a random variable divided by a number is equivalent to the variance of the random variable, divided by the square of that number. In the interest of keeping this appendix self-contained, we can quickly prove that now. We can let:

$$Y = aX$$
$$E[X] = \mu$$
$$\nu = E[Y] = a \cdot E[X] = a \cdot \mu$$
$$Var(Y) = E \big[(aX - a\mu)^2 \big] = E \big[ a^2 (X - \mu)^2 \big] = a^2 E \big[(X-\mu)^2\big] = a^2 Var(X)$$
$$\text{Rule} \longrightarrow Var(aX) = a^2 Var(X)$$

And we have the property that we were looking for. So, picking up where we left off:

$$Var\Big(\frac{x_1 + ... + x_n}{n}\Big) = Var\Big(\frac{1}{n}(x_1 + ... + x_n)\Big) = \frac{Var(x_1 + ... + x_n)}{n^2}$$

Because the $x_i$'s are independent, the variance of the sum is the sum of the variances. Hence, we end up with $n$ times the variance of each one of them:

$$Var(M_n) = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}$$
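A quick empirical check of this result; the sketch below draws many sample means of die rolls (for which $\sigma^2 \approx 2.92$) and compares their variance to $\sigma^2 / n$:

import numpy as np

rng = np.random.RandomState(6)
n = 30                                              # size of each sample
sigma_sq = np.mean((np.arange(1, 7) - 3.5) ** 2)    # variance of a single die roll, ~2.9167

# 100,000 samples of n die rolls each, reduced to their sample means
sample_means = rng.randint(1, 7, size=(100_000, n)).mean(axis=1)

print(sample_means.var())   # empirical variance of the sample mean
print(sigma_sq / n)         # theoretical sigma^2 / n, ~0.097; the two agree closely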

Apply Chebyshev Inequality

Now we are in a position to apply the Chebyshev Inequality. If you are unfamiliar, I highly recommend checking out my post on probability inequalities. Chebyshev tells us that the probability that a random variable lands farther than a certain amount from its mean is bounded above by the variance of the random variable of interest, divided by the square of that amount:

$$P \big(\big| M_n - \mu \big|\geq \epsilon \big) \; \leq \; \frac{Var(M_n)}{\epsilon^2}$$

We have already calculated the variance, and so this quantity is:

$$P \big(\big| M_n - \mu \big|\geq \epsilon \big) \; \leq \; \frac{\sigma^2}{n\epsilon^2}$$

If we consider $\epsilon$ to be a fixed number and let $n \rightarrow \infty$, then we obtain a limiting value of 0:

$$P \big(\big| M_n - \mu \big|\geq \epsilon \big) \; \leq \; \frac{\sigma^2}{n \epsilon^2} \rightarrow 0 \quad \text{as } n \rightarrow \infty$$

This shows that the probability of falling far away from the mean diminishes to zero as we draw more samples.

Weak Law of Large Numbers

So, to summarize, the Weak Law of Large Numbers is defined as:

$$\text{Weak Law of Large Numbers} \longrightarrow \text{For } \epsilon > 0 \; , \; P\big( \big|M_n - \mu \big| \geq \epsilon \big) = P \Big( \Big| \frac{X_1 + ... + X_n}{n} - \mu \Big| \geq \epsilon \Big) \rightarrow 0 \; , \; \text{as } n \rightarrow \infty$$
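As an empirical illustration of the statement above, the following sketch estimates $P(|M_n - \mu| \geq \epsilon)$ for die rolls with $\epsilon = 0.1$ and increasing $n$ (the choice of $\epsilon$ and the trial counts are arbitrary):

import numpy as np

rng = np.random.RandomState(7)
mu, epsilon, trials = 3.5, 0.1, 1_000

for n in [10, 100, 1_000, 10_000]:
    # `trials` independent sample means, each computed from n die rolls
    sample_means = rng.randint(1, 7, size=(trials, n)).mean(axis=1)
    prob_far = np.mean(np.abs(sample_means - mu) >= epsilon)
    print(f'n = {n:>6}: P(|M_n - mu| >= {epsilon}) ~ {prob_far:.3f}')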

I'd like to also offer an interpretation for the WLLN. We can think of having one long experiment, and during that experiment we draw many random variables from the same distribution. A way to think about those random variables is that each one is equal to the true mean plus some measurement noise:

$$X_i = \mu + W_i$$

Where $W_i$ has an expected value of 0, $E[W_i] = 0$, and the $W_i$ are independent. So, we have a collection of noisy measurements and we take the average of them. The WLLN tells us that the sample mean $M_n$ is unlikely to be far off from the true mean, $\mu$.

As a special case let us consider a probabilistic model in which we repeat, many times and independently, the same experiment. There is an event associated with that experiment that has a certain probability:

$$\text{Event: } A$$
$$p = P(A)$$

And $X_i$ acts as an indicator of the event's occurrence:

$$ X_i = \begin{cases} 1, & \text{if A occurs} \\ 0, & \text{if A does not occur} \end{cases} $$

Where the expected value of $X_i$ is:

$$E[X_i] = p$$

In this particular case, the sample mean is the number of times $A$ occurred out of the $n$ experiments that we carried out, divided by $n$. So, it is just the frequency with which event $A$ occurred. In other words, the sample mean $M_n$ is the empirical frequency of event $A$. The WLLN tells us that the empirical frequency will be close to the probability of the event $A$. In this way it reinforces or justifies the interpretation of probabilities as frequencies.


© 2018 Nathaniel Dake