Nathaniel Dake Blog

3. Probability Inequalities

Probability inequalities play a large role in determining an answer to the crucial question: Is learning feasible? Of course, in this context I am referring to statistical/machine learning. You may think to yourself: "Of course it is possible! We constantly hear about wonderful new algorithms and ML techniques, computer vision systems, and natural language understanding virtual assistants, many of which are discussed in depth in this blog!". In this you are most certainly correct.

However, what my question is specifically homing in on is:

Can we provide a theoretical backbone to ensure that the learning that we are doing will indeed generalize?

Answering that question is the purpose of this post. In order to get there we will actually move backwards. To start I will show the few equations that prove learning is indeed possible, and we will then go through several derivations in order to expose how we arrived at our answer.

1. Is Learning Feasible?

I want us to start by supposing the following: We are dealing with a sample from a population:

$$\mu = \text{population parameter}$$
$$\nu = \text{sample parameter}$$
$$N = \text{sample size}$$
$$\epsilon = \text{very small value}$$

Now, in a very large sample (i.e. $N$ is large) we know that $\nu$ is probably close to $\mu$ (if this is unfamiliar I recommend looking at my statistics post on the central limit theorem). Another way of saying that these two values are close is saying that they are within $\epsilon$ of each other:

$$\big| \nu - \mu \big| < \epsilon$$

Our goal though is to make this a bit more concrete; put another way, we want to be able to offer a guarantee about $\nu$. How probable is it that it is within $\epsilon$ of $\mu$? Well, we can prove that the probability that $\nu$ is not within $\epsilon$ of $\mu$ is:

$$P\big( \big| \nu - \mu \big| > \epsilon \big) \leq 2e^{-2\epsilon^2N}$$

Where the equation above is known as Hoeffding's Inequality. The statement is known as a P.A.C. statement, meaning that it is Probably Approximately Correct. In English, it says the following:

The probability that the difference between the sample and the population parameter is greater than $\epsilon$ is at most $2e^{-2\epsilon^2N}$, a quantity that decays exponentially as the sample size $N$ grows.

From a slightly more formal vantage point, it provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. This inequality belongs to the family of concentration results surrounding the Law of Large Numbers.
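To get a concrete feel for how this bound behaves, here is a minimal sketch evaluating the right hand side for a few sample sizes (the values of $\epsilon$ and $N$ are arbitrary illustrations):

In [ ]:
import numpy as np

# Hoeffding bound: P(|nu - mu| > eps) <= 2 * exp(-2 * eps^2 * N)
eps = 0.05
for N in [100, 1_000, 10_000]:
    bound = 2 * np.exp(-2 * eps**2 * N)
    print(f"N = {N:>6}: bound = {bound:.3e}")

Note that at $N = 100$ the right hand side actually exceeds 1, so the bound tells us nothing; only as $N$ grows does the guarantee become meaningful.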

1.1 Extending to Learning

With relative ease we can extend this to learning. In order to do so we will need a bit of new notation. We are going to introduce $h$, $\hat{E}(h)$ and $E(h)$, where they are defined as:

$$h = \text{Learned Hypothesis/Model}$$
$$\hat{E}(h) = \text{Training set Error}$$
$$E(h) = \text{Population Error}$$

For those familiar with machine learning this should be rather clear, but if you are not I will provide a bit of context. In a general learning scenario we acknowledge that there is a population data set that we do not have access to. It contains every single data point that exists. We only have access to a training data set, which is a very small sample from the population. If we learn a model, $h$, that makes predictions on input data points, then $\hat{E}(h)$ is the error that $h$ would make on the training data set, while $E(h)$ is the error that $h$ would make on the population data set.

What we want to ensure is that if $h$ is learned based off of our training data, it can generalize. In other words, we want to have some sort of guarantee on the difference between $\hat{E}(h)$ and $E(h)$. We will refer to the event that this difference exceeds $\epsilon$ as the bad event, since we do not want it to happen:

$$\overbrace{P\big( \big| \hat{E}(h) - E(h) \big| > \epsilon \big)}^\text{Bad event}$$

We want a bound on this bad event, a worst case guarantee. Once again we can use the Hoeffding bound (since it deals with the sum of random variables differing from their expected value; here the error on each data point is a random variable and $E(h)$ is the expected value of the error):

$$P\big( \big| \hat{E}(h) - E(h) \big| > \epsilon \big) \leq 2e^{-2\epsilon^2N}$$

Where $\hat{E}(h)$ is the estimated sample error and $E(h)$ is the true error. Now, in actuality we know that there are multiple hypotheses that we will be choosing from, so our equation can be expanded:

$$P\big( \big| \hat{E}(h) - E(h) \big| > \epsilon \big) \leq \sum_{m=1}^M P\big( \big| \hat{E}(h_m) - E(h_m) \big| > \epsilon \big) $$

Above we are simply taking the probability associated with each of $M$ hypotheses having an error greater than $\epsilon$ and summing them all up. This is known as the union bound and specifically states that:

For any finite or countable set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the individual events.

So, essentially we are taking the worst case and adding up all of the probabilities associated with the error difference between the sample and true hypothesis being greater than $\epsilon$. The reason we want the worst case is that we don't want to jeopardize the applicability of our results. I should note that while this bound is admittedly loose, it is not trivially restrictive; rather, it is intended to ensure we are not missing some scenario that could be worse than what we are willing to accept.

We can then substitute the Hoeffding bound that was shown earlier:

$$P\big( \big| \hat{E}(h) - E(h) \big| > \epsilon \big) \leq \sum_{m=1}^M 2e^{-2\epsilon^2N} $$

Which provides us with our final takeaway:

We can be confident that the probability that the difference between our training error and the population error is larger than our tolerance ($\epsilon$, the bad event), under the true learning scenario of generating $M$ hypotheses and picking the best one, is less than or equal to the summation on the right.

Note that the right hand side has an exponential in it, which is good! That will decay towards zero quickly. However, we also have an $\epsilon^2$ term which will slow the decay to zero. And even more unfortunately, there is an added factor $M$:

$$P\big( \big| \hat{E}(h) - E(h) \big| > \epsilon \big) \leq 2Me^{-2\epsilon^2N} $$

We obviously want the probability of our bad event to be small, and hence we don't like having to magnify the right hand side, because that is the probability of something bad happening. Now, we can see that if we use $M = 10$ hypotheses we are probably okay. On the other hand, if we use $M$ = one million hypotheses, we may run into trouble.
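To see the effect of $M$ numerically, here is a small sketch (again with arbitrary illustrative values of $\epsilon$ and $N$):

In [ ]:
import numpy as np

eps, N = 0.05, 1_000
for M in [10, 1_000_000]:
    # Union bound over M hypotheses: 2 * M * exp(-2 * eps^2 * N)
    bound = 2 * M * np.exp(-2 * eps**2 * N)
    print(f"M = {M:>9,}: bound = {bound:.4f}")

With $M = 10$ the guarantee is meaningful; with a million hypotheses the right hand side blows far past 1 and tells us nothing.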

Now, we will get into what happens when we have many (eventually infinitely many) hypotheses in a post on the theory of generalization, but for now I want to move on and discuss the bound above.

2. Probability Bounds and Inequalities

How exactly did we arrive at that interesting looking exponential bound defined as the Hoeffding Inequality? More generally, what is the point of a bound? The best way to answer both of the above questions is via a derivation of the Hoeffding Inequality itself. Now, in order to get there we will need to build up our knowledge of several other inequalities, so let's get started.

2.1 Markov Inequality

It may not be entirely clear what exactly the point of a probability bound is, specifically in the context of learning above. In order to gain a full appreciation, consider the following:

We may be interested in saying something about the probability of an extreme event. Suppose that unfortunately we only know a little bit about the probability distribution at hand; in this case we only know the expected value. Can we still say something about the probability of an extreme event occurring?

Above is precisely the goal of probabilistic inequalities. We want to be able to make mathematically backed claims about the probabilities of certain (often bad) events occurring. I like to think of inequalities as allowing the following transition:

$$\text{Intuitive statement} \rightarrow \text{Precise, mathematically backed statement}$$

The best way to gain an intuitive understanding of this is via an example, and one that is crucial in understanding the Hoeffding derivation: the Markov Inequality. The Markov Inequality, at its core, is trying to take an intuitive statement and provide it with a mathematical backing. The intuitive statement is as follows:

If $X \geq 0$ and $E[X]$ is small, then $X$ is unlikely to be very large.

Intuitively that should make sense! If $E[X] = 1.2$ then the probability of $X = 10,000,000$ should be incredibly small. What the Markov Inequality allows us to do is make the intuitive statement much more precise. It states:

$$ \text{Markov Inequality} \longrightarrow \text{If} \; X \geq 0 \; \text{and} \; a >0, \text{then} \; P(X \geq a) \leq \frac{E[X]}{a}$$

Derivation 1

How exactly do we arrive at the above inequality? Well, our derivation looks as follows; first, we recall the expected value of a non-negative random variable $X$ with density $f_X$:

$$E[X] = \int_{0}^{\infty} x f_X(x) dx$$

From this, we can state the following:

$$E[X] = \overbrace{\int_{0}^{\infty} x f_X(x) dx}^\text{Expected value of X} \;\; \geq \;\;\overbrace{\int_{a}^{\infty} x f_X(x) dx}^\text{Smaller bound, less area}$$

The above is true because the integrand is non-negative, so when integrating over $[a, \infty)$ we are dealing with less total area. We can then focus on this new integral:

$$\int_{a}^{\infty} x f_X(x) dx$$

Note that over these bounds of integration, $x$ is always at least as large as $a$. This allows us to write:

$$\int_{a}^{\infty} x f_X(x) dx \;\; \geq \;\; \int_{a}^{\infty} a f_X(x) dx$$

And since $a$ is a constant we can pull it out:

$$\int_{a}^{\infty} x f_X(x) dx \;\; \geq \;\; a \overbrace{\int_{a}^{\infty} f_X(x) dx}^{P(X \geq a)}$$

Substituting $P(X \geq a)$ for our right integral:

$$\overbrace{\int_{a}^{\infty} x f_X(x) dx}^\text{Integral 1} \;\; \geq \;\; aP(X \geq a)$$

Recall the first line of our derivation:

$$E[X] = \int_{0}^{\infty} x f_X(x) dx \;\; \geq \;\; \overbrace{\int_{a}^{\infty} x f_X(x) dx}^\text{Integral 1}$$

We can substitute in $aP(X \geq a)$ for Integral 1 (chaining the two inequalities together):

$$E[X] \;\; \geq \;\; aP(X \geq a)$$

Which can be written equivalently as:

$$P(X \geq a) \;\; \leq \;\; \frac{E[X]}{a}$$

And with that we have arrived at the inequality that we were trying to prove!

Derivation 2

I'd like to also walk through a second derivation which may provide a nice alternative way of looking at things. We will start by defining a new variable $Y$:

$$ Y = \begin{cases} 0, & \text{if } X < a \\ a, & \text{if } X \geq a \end{cases} $$

We can see above that in all cases $Y \leq X$ (recall that $X \geq 0$), and hence:

$$E[Y] \leq E[X]$$

Next, we can solve for the expected value of $Y$:

$$E[Y] = 0 \cdot P(X < a) + a \cdot P(X \geq a) = a \cdot P(X \geq a)$$

This allows us to rewrite our prior inequality as:

$$a \cdot P(X \geq a) \;\; \leq \;\; E[X]$$

And with a simple algebraic manipulation we again arrive at the Markov Inequality:

$$P(X \geq a) \;\; \leq \;\; \frac{E[X]}{a}$$
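Before working through a concrete distribution, we can sanity-check the inequality with a quick Monte Carlo sketch. Here I use an exponential distribution with mean 1 (the same distribution used in the example below); the sample size and values of $a$ are arbitrary choices:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1

for a in [2, 5, 10]:
    empirical = (x >= a).mean()                  # Monte Carlo estimate of P(X >= a)
    print(f"a = {a:>2}: P(X >= a) ~ {empirical:.5f} <= E[X]/a = {1/a:.3f}")

In each case the empirical tail probability sits below the Markov bound, as it must.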

Example

Now, why exactly is this useful? Consider the following example: we have a random variable $X$ that is exponentially distributed with $\lambda = 1$:

$$ f(x ; \lambda) = \begin{cases} \lambda e ^{-\lambda x}, & \text{if } x \geq 0 \\ 0, & \text{if } x <0 \end{cases} $$

And since $\lambda = 1$, we can rewrite the above as:

$$ f(x ; \lambda = 1) = \begin{cases} e ^{-x}, & \text{if } x \geq 0 \\ 0, & \text{if } x <0 \end{cases} $$

A property of the exponential distribution is that its expected value is equal to:

$$E[X] = \frac{1}{\lambda}$$

And in this case that evaluates to $E[X] = 1$. Because $X$ is a non-negative random variable, we can apply the Markov Inequality!

$$P(X \geq a) \;\; \leq \;\; \frac{E[X]}{a}$$

Substituting in our expected value of 1:

$$P(X \geq a) \;\; \leq \;\; \frac{1}{a}$$

Taking a step back, let's restate our goal here:

We are trying to bound the probability $P(X \geq a)$. Often we won't know its exact value, so knowing the worst case can be very helpful.

Now, because we know the distribution that $X$ takes on, we actually do know the true value for this probability:

$$P(X \geq a) = e^{-a}$$

However, as stated above, we often don't know the true distribution. By making use of the little bit of information that we are assuming that we do know about $X$, namely that it is a random variable and has an expected value of 1, we can apply the Markov Inequality in order to get a guarantee on the probability that $X$ is greater than or equal to $a$.

Visually, we can see the true probability and the bound generated via the Markov Inequality below (for visualization purposes $a$ is equal to 2 below):

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

%config InlineBackend.figure_format = 'retina'
%matplotlib inline
In [48]:
fig, ax = plt.subplots()
ax.set_xlim(0,1)
ax.set_ylim(-.1,.1)

plt.hlines(-0.001, 0, 1, color='black', zorder=1)

a = 2
true_prob = np.e ** (-a)
markov_inequality_bound = 1 / a


plt.scatter(true_prob,0, zorder=2, color="blue")
plt.scatter(markov_inequality_bound,0, zorder=2, color="red")

plt.axvline(1, 0.45, 0.55, color='black')
plt.axvline(0, 0.45, 0.55, color='black')

plt.annotate(
    '0',
    xy=(0, 0),
    xytext=(-.015, 0.015),
    size=20,
)
plt.annotate(
    '1',
    xy=(0, 0),
    xytext=(0.98, 0.015),
    size=20,
)

# True
plt.annotate(
    r'$e^{-a}$',
    xy=(true_prob, -0.00008),
    xytext=(true_prob, -0.04),
    size=20,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)
plt.annotate(
    'True\nValue',
    xy=(true_prob, -0.00008),
    xytext=(true_prob, 0.05),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)

# Markov
plt.annotate(
    r'$\frac{1}{a}"$',
    xy=(markov_inequality_bound, -0.00008),
    xytext=(markov_inequality_bound, -0.04),
    size=20,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)
plt.annotate(
    'Markov\nInequality',
    xy=(markov_inequality_bound, -0.00008),
    xytext=(markov_inequality_bound, 0.05),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)

plt.axis('off')
plt.show()

Now, in general we consider a bound good or useful if the bound is close to the correct value. Clearly we can see that in this case the bound is not very close to the actual value. For an idea of the numerical values involved here, when evaluated we have:

In [40]:
print("True probability: ", round(np.e**(-2), 5))
print("Markov Inequality Bound: ", 1/2)
True probability:  0.13534
Markov Inequality Bound:  0.5

Clearly the bound is correct, however it is not incredibly useful. If I told you that $P(X \geq a)$ is less than 0.99 I would be correct, but you would most likely roll your eyes and say that is obvious. The same thing is occurring here. The true probability falls off exponentially as $a$ increases in magnitude, while the Markov bound only falls off at a rate of $\frac{1}{a}$. This raises the question: Are there other inequalities that can yield a more useful bound?

2.2 Chebyshev Inequality

The answer to that question is a resounding yes. There are indeed many other inequalities that exist that can produce more informative and useful bounds. We will look at two in the remainder of this post, starting with the Chebyshev Inequality.

Before we begin our derivation and exploration of Chebyshev's Inequality I want to make clear that we are not getting something for nothing. In other words, the inequality that we are about to see is not merely stronger using the same information; rather, it exploits information that the Markov Inequality did not (the Markov Inequality only utilizes the expected value). It is by leveraging other information about the random variable that we are able to produce more useful and accurate bounds.

With that said, consider a random variable $X$ with a mean $\mu$ and a variance $\sigma^2$. The Chebyshev Inequality informally states:

If the variance is small, then $X$ is unlikely to be too far from the mean.

This can be precisely stated as:

$$\text{Chebyshev Inequality} \longrightarrow P \big(\big| X - \mu\big| \geq c \big) \; \leq \; \frac{\sigma^2}{c^2}$$

In English it can be stated as: The probability that the distance from the mean is at least $c$ is no greater than the variance divided by the square of $c$. If the variance is small, the probability of falling far from the mean is small. If $c$ is large (a large distance from the mean), then the probability falls off at a rate of at least $\frac{1}{c^2}$.

Derivation

As always the question that should be asked is how exactly did Chebyshev arrive at this inequality? The derivation proceeds as follows: First we will want to apply the Markov Inequality.

$$P(X \geq a) \;\; \leq \;\; \frac{E[X]}{a}$$

Remember, the Markov Inequality can be utilized for any non-negative random variable and any number $a$ greater than 0. So, since $|X - \mu|$ is non-negative, nothing is stopping us from letting our random variable be $|X - \mu|$ and renaming $a$ as $c$:

$$P(|X - \mu| \geq c)$$

Now, here is where the brilliance of Chebyshev comes in. He knew that he wanted to utilize information about $X$ that Markov's Inequality did not, namely the variance. In order to do that he realized that he could square both sides of the inequality inside the probability (an equivalent event, since both sides are non-negative):

$$P\big((X - \mu)^2 \geq c^2\big)$$

At which point the Inequality would look like:

$$P\big((X - \mu)^2 \geq c^2\big) \;\; \leq \;\; \frac{\overbrace{E\big[ (X-\mu)^2\big]}^\text{Variance of X}}{c^2}$$

And based on Chebyshev's choice of random variable, $(X - \mu)^2$, our numerator on the right side is actually just the variance of $X$! So, we can substitute that in:

$$P\big((X - \mu)^2 \geq c^2\big) \;\; \leq \;\; \frac{\sigma^2}{c^2}$$

And we have arrived at the definition of the Chebyshev Inequality! Note, the event that we are finding the probability of has a square on each side; these can both be removed and replaced with an absolute value in order to achieve the exact same shape as the original inequality.

Application

An application of the Chebyshev Inequality could be something along the following lines: Assume we want to determine the probability that the distance from the mean is at least $k$ standard deviations. Well, in that case we can utilize Chebyshev by setting $c$ to be $k \sigma$:

$$P \big(\big| X - \mu\big| \geq k \sigma \big) \; \leq \; \frac{\sigma^2}{k^2 \sigma^2}$$

Which reduces to:

$$P \big(\big| X - \mu\big| \geq k \sigma \big) \; \leq \; \frac{1}{k^2}$$

How can this be useful to us? Well, for example let $k=3$. This is saying that the probability that you fall 3 standard deviations or more away from the mean is less than or equal to $\frac{1}{9}$. This is true regardless of the type of distribution that is being utilized.
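As a sanity check on that claim, here is a sketch that estimates $P(|X - \mu| \geq 3\sigma)$ by simulation for a few very different distributions (the particular distributions and sample size are arbitrary choices):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, k = 1_000_000, 3

samples = {
    "normal": rng.normal(0, 1, n),
    "exponential": rng.exponential(1.0, n),
    "uniform": rng.uniform(-1, 1, n),
}

for name, x in samples.items():
    mu, sigma = x.mean(), x.std()
    tail = (np.abs(x - mu) >= k * sigma).mean()
    print(f"{name:>12}: P(|X - mu| >= {k} sigma) ~ {tail:.5f} <= 1/k^2 = {1/k**2:.4f}")

Each empirical tail sits below $\frac{1}{9}$, even though the three distributions look nothing alike.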

If it is not immediately apparent, the reason this is very useful is because we can use this bound as a guarantee of our worst case scenario. For instance, let's say that we are modeling the temperature of an actuator in an aircraft, and we know the mean and the variance of the actuator temperature. We also know that if it becomes too hot (if the temperature rises three standard deviations or more above the mean) failure occurs. Clearly we want to know how probable this event's occurrence is.

Now, remember that this is not saying that the probability is equal to $\frac{1}{9}$. Clearly if we have a very tight variance the probability of falling three standard deviations away from the mean is much less than $\frac{1}{9}$ (simply think about the normal distribution to reinforce this fact). However, that is the point of the bound. It is not giving us a point estimate, which would indeed require more information. In the case of the normal distribution we know that it is normal, and we know its actual mean and its actual variance. This will not always be available! That is why it is incredibly useful to have such a generic inequality! Keep in mind that the result above has no dependency on the value of the standard deviation or variance; the right hand side is only a function of $k$, the number of standard deviations.

While this is clearly a good thing in some ways (it allows the bound to be applied very broadly), if we have additional information the bound may very well not be utilizing it in the most effective way.

Example

Let's now return to our earlier example that we performed with the Markov Inequality. We have a random variable $X$ that is exponentially distributed with $\lambda = 1$:

$$ f(x ; \lambda = 1) = \begin{cases} e ^{-x}, & \text{if } x \geq 0 \\ 0, & \text{if } x <0 \end{cases} $$

We saw that the Markov Inequality gave us a bound of $\frac{1}{a}$ and the exact answer was $e^{-a}$. Let's see what bound we can get with the Chebyshev Inequality. To start, remember that $X$ has a mean of 1. We want to then assume that $a$ is greater than 1, which is shown visually below:

In [67]:
fig, ax = plt.subplots()
ax.set_xlim(0,1)
ax.set_ylim(-.1,.1)

plt.hlines(-0.001, 0, 1, color='black', zorder=1)

scale = 20
mean = 1 / scale
a =  5 / scale

plt.hlines(-0.001, a, 1, color='limegreen', zorder=1, lw=5)


plt.scatter(mean, 0, zorder=2, color="blue")
plt.scatter(a, 0, zorder=2, color="red")

plt.axvline(1, 0.45, 0.55, color='black')
plt.axvline(0, 0.45, 0.55, color='black')

plt.annotate(
    '0',
    xy=(0, 0),
    xytext=(-.015, 0.015),
    size=20,
)
plt.annotate(
    r'$\infty$',
    xy=(0, 0),
    xytext=(0.98, 0.015),
    size=20,
)


plt.annotate(
    r"$\mu_X = 1$",
    xy=(mean, -0.00008),
    xytext=(mean, -0.04),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)

# Markov
plt.annotate(
    r'$a$',
    xy=(a, -0.00008),
    xytext=(a, -0.04),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)


plt.axis('off')
plt.show()

We are interested in the green region to the right of $a$, specifically the probability that $X$ is greater than or equal to $a$. We are looking at this region in order to get a comparison with the bound found via the Markov Inequality. As a quick reminder, Chebyshev has the form:

$$P \big(\big| X - \mu\big| \geq c \big) \; \leq \; \frac{\sigma^2}{c^2}$$

And that is the form that we are trying to end up with. We can start by stating that the probability that $X$ falls in the green region is equivalent to:

$$P(X \geq a) \; = \; P(X - 1 \geq a - 1)$$

And since the event $X - 1 \geq a - 1$ is contained in the event $|X - 1| \geq a - 1$ (the latter also includes the left tail), we can write that:

$$P(X - 1 \geq a - 1) \; \leq \; \overbrace{P( |X - 1| \geq a - 1)}^\text{Form of Chebyshev}$$

Now the right side has the form of the Chebyshev Inequality, so we can apply it as follows:

$$P( |X - 1| \geq a - 1) \; \leq \; \frac{\sigma^2}{(a-1)^2}$$

Where $\sigma^2$ is equal to 1 (the variance of an exponential distribution is $\frac{1}{\lambda^2}$, and here $\lambda = 1$):

$$P( |X - 1| \geq a - 1) \; \leq \; \frac{1}{(a-1)^2}$$

And we can swap back in $P(X \geq a)$ from our original equality:

$$P(X \geq a) \; \leq \; \frac{1}{(a-1)^2}$$

Let's see how this bound compares with the Markov bound, as well as the true value:

In [77]:
fig, ax = plt.subplots()
ax.set_xlim(0,1)
ax.set_ylim(-.1,.1)

plt.hlines(-0.001, 0, 1, color='black', zorder=1)

a = 4    # chosen so that the Chebyshev bound below is nontrivial (it is vacuous for a <= 2)
true_prob = np.e ** (-a)
markov_inequality_bound = 1 / a
chebyshev_bound = 1 / (a - 1)**2

plt.scatter(true_prob,0, zorder=2, color="blue")
plt.scatter(markov_inequality_bound,0, zorder=2, color="red")
plt.scatter(chebyshev_bound,0, zorder=2, color="limegreen")

plt.axvline(1, 0.45, 0.55, color='black')
plt.axvline(0, 0.45, 0.55, color='black')

plt.annotate(
    '0',
    xy=(0, 0),
    xytext=(-.015, 0.015),
    size=20,
)
plt.annotate(
    '1',
    xy=(0, 0),
    xytext=(0.98, 0.015),
    size=20,
)

# True
plt.annotate(
    r'$e^{-a}$',
    xy=(true_prob, -0.00008),
    xytext=(true_prob, -0.04),
    size=12,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)
plt.annotate(
    'True\nValue',
    xy=(true_prob, -0.00008),
    xytext=(true_prob, 0.05),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)

# Markov
plt.annotate(
    r'$\frac{1}{a}"$',
    xy=(markov_inequality_bound, -0.00008),
    xytext=(markov_inequality_bound, -0.04),
    size=20,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)
plt.annotate(
    'Markov\nInequality',
    xy=(markov_inequality_bound, -0.00008),
    xytext=(markov_inequality_bound, 0.05),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)

# Chebyshev
plt.annotate(
    r'$\frac{1}{a^2}"$',
    xy=(chebyshev_bound, -0.00008),
    xytext=(chebyshev_bound, -0.04),
    size=20,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)
plt.annotate(
    'Chebyshev\nInequality',
    xy=(chebyshev_bound, -0.00008),
    xytext=(chebyshev_bound, 0.05),
    size=14,
    arrowprops=dict(color='grey',arrowstyle="-", connectionstyle="angle3,angleA=90,angleB=0"),
    zorder=1
)

plt.axis('off')
plt.show()

We can see a clear improvement in the Chebyshev bound compared to the Markov bound! (One caveat: for small $a$ this is not automatic. At $a = 2$ the Chebyshev bound $\frac{1}{(a-1)^2}$ equals 1 and is vacuous; it only overtakes Markov's $\frac{1}{a}$ once $a$ exceeds roughly 2.6.)

Generally, Chebyshev is stronger and more informative than the Markov Inequality. This is because it exploits more information about the distribution of the random variable $X$. Specifically, it uses knowledge about not just the mean of $X$, but also the variance!
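To make the comparison concrete for our exponential example, this sketch tabulates the true tail probability against both bounds for a few values of $a$:

In [ ]:
import numpy as np

print(f"{'a':>3}  {'true e^-a':>12}  {'Markov 1/a':>12}  {'Chebyshev 1/(a-1)^2':>20}")
for a in [2, 4, 8, 16]:
    print(f"{a:>3}  {np.exp(-a):>12.6f}  {1/a:>12.6f}  {1/(a-1)**2:>20.6f}")

The Chebyshev bound decays quadratically while the Markov bound decays only linearly, yet both remain far from the exponentially decaying truth. Closing that gap is exactly what the next inequality accomplishes.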

2.3 Hoeffding's Inequality

Finally, we have all of the tools that we need in order to derive the Hoeffding Inequality. I want to quickly restate the motivation for this in the context of this post. Our goal was as follows:

To bound the probability of our training error differing from our population by more than a certain amount.

Specifically, I made the claim that this probability was:

$$P\big( \big| \hat{E}(h) - E(h) \big| > \epsilon \big) \leq 2e^{-2\epsilon^2N}$$

This is a special case of Hoeffding's Inequality; the form that we will derive in this post looks like:

$$\text{Hoeffding Inequality} \longrightarrow P\big( X_1 + ... + X_n - n\mu_X \geq na \big) \leq e^{\frac{-na^2}{2}}$$

Which bounds the probability that the sum of $n$ i.i.d. random variables takes on an abnormally large value. That may sound like a strange thing to be interested in at first, but I like to simply think of it as an extension of the Markov Inequality. In the Markov Inequality we were simply trying to determine the probability that a single sampled random variable $X$ would take on a value greater than $a$:

$$P(X \geq a)$$

Now, we are extending this to $n$ variables that are randomly sampled from $X$, and trying to determine the probability of observing their sum be greater than or equal to $na$:

$$P\big( X_1 + ... + X_n \geq na \big)$$

Our goal is to find an upper bound for the above quantity, which is Hoeffding's Inequality. This is an inequality that applies to a special case, although the special case does generalize. Here is the special case that we will consider: the random variables, $X_i$'s drawn from $X$, are equally likely to take the values $+1$ or $-1$:

In [30]:
fig, ax = plt.subplots()
ax.set_xlim(0,1)
ax.set_ylim(-.1,.1)

plt.hlines(-0.001, 0, 1, color='black', zorder=1)

plt.axvline(0.25, 0.45, 0.55, color='black')
plt.axvline(0.75, 0.45, 0.55, color='black')

plt.annotate(
    '-1',
    xy=(0, 0),
    xytext=(0.25 -.04, -0.03),
    size=20,
)
plt.annotate(
    '1',
    xy=(0, 0),
    xytext=(0.73, -0.03),
    size=20,
)

plt.annotate(
    r'$P(-1) = \frac{1}{2}$',
    xy=(0.25 -.04, -0.00008),
    xytext=(0.25 -.02 - 0.08, 0.02),
    size=14,
    zorder=1
)
plt.annotate(
    r'$P(1) = \frac{1}{2}$',
    xy=(0.73, -0.00008),
    xytext=(0.68, 0.02),
    size=14,
    zorder=1
)


plt.axis('off')
plt.show()

We are interested in the random variable which is the sum of the $X$'s:

$$Y = X_1 + ... + X_n$$

Now, what do we know about $Y$? Well, because the distribution of $X$ is symmetric, its expected value is 0:

$$E[X_i] = 0$$

And by linearity of expectation we know that:

$$E[Y] = 0$$

Additionally, $X_i$ has a variance of 1:

$$Var(X_i) = 1$$

And because the $n$ $X_i$'s are independent, their variances add, so the variance of $Y$ is $n$:

$$Var(Y) = n$$

By the central limit theorem $Y$ has an approximately normal distribution with mean 0 and variance $n$, $Y \approx N(0, n)$:

In [52]:
from scipy.stats import norm

fig = plt.figure(figsize=(8, 5))   

mu = 0
sigma = 1

x_axis = np.arange(-5, 5, 0.001)
y = norm.pdf(x_axis, mu, sigma)
plt.plot(x_axis, y, color=sns.xkcd_rgb["azure"])
plt.axvline(x=0, color=sns.xkcd_rgb["dusty blue"], linestyle="dashed")

plt.annotate(
    r'$\mu_{Y}$',
    xy=(0, 0.02),
    xytext=(-1.2, 0.1),
    arrowprops=dict(facecolor='black', shrink=0.01),
    size=15
)

plt.xlabel(r"$Y$", labelpad=10)
plt.ylabel(r"$P(Y)$")
plt.title(r"Distribution of $Y$")
plt.xticks([0], [0])

plt.show()
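Before standardizing, a quick simulation can confirm the mean and variance we just derived (a minimal sketch; $n$ and the number of trials are arbitrary):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000

x = rng.choice([-1, 1], size=(trials, n))   # each row is one realization of X_1, ..., X_n
y_samples = x.sum(axis=1)                   # Y = X_1 + ... + X_n

print("E[Y]   (theory: 0):", y_samples.mean())
print("Var(Y) (theory: n):", y_samples.var())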

Now, we can standardize $Y$ by dividing it by its standard deviation, $\sqrt{n}$ (the general procedure for standardizing a random variable is shown in the appendix). Because $Y$ already has a mean of 0, after standardization it looks like:

$$\frac{Y}{\sqrt{n}}$$

And we can then state that:

$$P \Big(\frac{Y}{\sqrt{n}} \geq a\Big) \approx 1 - \Phi(a)$$

Where $\Phi$ represents the standard normal CDF. Visually, this looks like:

In [103]:
fig = plt.figure(figsize=(8, 5))   

a = 2

mu = 0
sigma = 1

x_axis = np.arange(-5, 5, 0.001)
y = norm.pdf(x_axis, mu, sigma)
plt.plot(x_axis, y, color=sns.xkcd_rgb["azure"])
plt.axvline(x=0, color=sns.xkcd_rgb["dusty blue"], linestyle="dashed")

plt.annotate(
    r'$\Phi(a)$',
    xy=(-1, 0.17),
    xytext=(-3, 0.2),
    arrowprops=dict(facecolor='black', shrink=0.01, width=3, headwidth=10),
    size=15
)

plt.annotate(
    r'$P(\frac{Y}{\sqrt{n}} \geq a)$',
    xy=(2.2, 0.02),
    xytext=(3, 0.1),
    arrowprops=dict(facecolor='black', shrink=0.01, width=3, headwidth=10),
    size=15
)

plt.fill_between(x_axis, y, where=(x_axis < a), color=sns.xkcd_rgb["azure"], alpha=0.1)
plt.fill_between(x_axis, y, where=(x_axis > a), color=sns.xkcd_rgb["red"], alpha=0.2)

plt.xlabel(r"$\frac{Y}{\sqrt{n}}$", labelpad=10)
plt.ylabel(r"$P(\frac{Y}{\sqrt{n}})$")
plt.title(r"Distribution of $\frac{Y}{\sqrt{n}}$", pad=15)
plt.xticks([0, a], [0, r"a"])

plt.show()

We know that the entire area under the curve is equal to 1, and the shaded blue region represents $\Phi(a)$, meaning the shaded red region, which is simply $1 - \Phi(a)$, represents the probability that we are looking for. Now, there are two things to notice here; first and foremost, we are able to manipulate the inequality inside of the probability to be:

$$P \Big(Y \geq \sqrt{n}a\Big) \approx 1 - \Phi(a)$$
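A quick Monte Carlo sketch can check this approximation (the values of $n$, $a$, and the number of trials are arbitrary; the two numbers should agree approximately, not exactly):

In [ ]:
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials, a = 100, 200_000, 1.5

y_samples = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
empirical = (y_samples >= np.sqrt(n) * a).mean()

print("Monte Carlo P(Y >= sqrt(n) a):", empirical)
print("CLT approximation 1 - Phi(a): ", round(1 - norm.cdf(a), 5))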

Our plot can be updated to reflect this rearranged event:

In [106]:
fig = plt.figure(figsize=(8, 5))   

a = 2

mu = 0
sigma = 1

x_axis = np.arange(-5, 5, 0.001)
y = norm.pdf(x_axis, mu, sigma)
plt.plot(x_axis, y, color=sns.xkcd_rgb["azure"])
plt.axvline(x=0, color=sns.xkcd_rgb["dusty blue"], linestyle="dashed")

plt.annotate(
    r'$\Phi(a)$',
    xy=(-1, 0.17),
    xytext=(-3, 0.2),
    arrowprops=dict(facecolor='black', shrink=0.01, width=3, headwidth=10),
    size=15
)

plt.annotate(
    r'$P(Y \geq  \sqrt{n}a)$',
    xy=(2.2, 0.02),
    xytext=(3, 0.1),
    arrowprops=dict(facecolor='black', shrink=0.01, width=3, headwidth=10),
    size=15
)

plt.fill_between(x_axis, y, where=(x_axis < a), color=sns.xkcd_rgb["azure"], alpha=0.1)
plt.fill_between(x_axis, y, where=(x_axis > a), color=sns.xkcd_rgb["red"], alpha=0.2)

plt.xlabel(r"$Y$", labelpad=10)
plt.ylabel(r"$P(Y)$")
plt.title(r"Distribution of $Y$", pad=15)
plt.xticks([0, a], [0, r"$\sqrt{n}a$"])

plt.show()

Secondly, notice that the right hand side does not depend on $n$! Regardless of the value of $n$, the probability on the left is going to be approximately equal to $1 - \Phi(a)$. From this we can infer that values of the order of $\sqrt{n}$ are fairly likely to occur. However, remember that our goal was to determine the probability of being larger than $na$, not simply $\sqrt{n}a$. In other words, we are interested in the following probability:

In [120]:
fig = plt.figure(figsize=(8, 5))   


a = 2          # same a as in the previous cells, repeated so this cell runs on its own
a_prime = 2.7

mu = 0
sigma = 1

x_axis = np.arange(-5, 5, 0.001)
y = norm.pdf(x_axis, mu, sigma)
plt.plot(x_axis, y, color=sns.xkcd_rgb["azure"])
plt.axvline(x=0, color=sns.xkcd_rgb["dusty blue"], linestyle="dashed")

plt.annotate(
    r'$\Phi(a)$',
    xy=(-1, 0.17),
    xytext=(-3, 0.2),
    arrowprops=dict(facecolor='black', shrink=0.01, width=3, headwidth=10),
    size=15
)

plt.annotate(
    r'$P(Y \geq na)$',
    xy=(2.8, 0.006),
    xytext=(3, 0.1),
    arrowprops=dict(facecolor='black', shrink=0.01, width=3, headwidth=10),
    size=15
)

plt.fill_between(x_axis, y, where=(x_axis < a_prime), color=sns.xkcd_rgb["azure"], alpha=0.1)
plt.fill_between(x_axis, y, where=(x_axis > a_prime), color=sns.xkcd_rgb["red"], alpha=0.2)

plt.xlabel(r"$Y$", labelpad=10)
plt.ylabel(r"$P(Y)$")
plt.title(r"Distribution of $Y$", pad=15)
plt.xticks([0, a, a_prime], [0, r"$\sqrt{n}a$", '$na$'], size=12)

plt.show()

So, our task now is to determine how small the probability $P(Y \geq na)$ is. At this point, we have Chebyshev's Inequality, which tells us that:

$$P(Y \geq na) \; \leq \; \frac{Var(Y)}{n^2a^2} $$

And in this case since the variance of $Y$ is $n$, the above reduces to:

$$P(Y \geq na) \; \leq \; \frac{1}{na^2} $$

So, Chebyshev's Inequality tells us that $P(Y \geq na)$ goes to 0 at least as fast as $\frac{1}{n}$ goes to 0. However, it turns out that this is extremely conservative. Hoeffding's Inequality, which we are still in the process of establishing, can tell us something much stronger. It tells us that the tail probability falls exponentially with $n$:

$$P(Y \geq na) \; \leq \; e^{\frac{-na^2}{2}}$$
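To appreciate the size of this gap, here is a minimal sketch evaluating the Chebyshev bound $\frac{1}{na^2}$ and the Hoeffding bound $e^{\frac{-na^2}{2}}$ at a fixed (arbitrary) $a$ as $n$ grows:

In [ ]:
import numpy as np

a = 0.5
for n in [10, 100, 1_000]:
    chebyshev = 1 / (n * a**2)
    hoeffding = np.exp(-n * a**2 / 2)
    print(f"n = {n:>5}: Chebyshev = {chebyshev:.3e}, Hoeffding = {hoeffding:.3e}")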

This is clearly a significant improvement over Chebyshev, so I hope that you are sufficiently motivated to see how Hoeffding arrived at this inequality. Let's waste no time, on to the derivation!

Derivation

The derivation relies on a beautiful trick. Instead of looking at our original event, $X_1 + ... + X_n \geq na$, we are going to look at the following equivalent event:

$$e^{s(X_1+...+X_n)} \geq e^{sna}$$

Where $s$ is a fixed positive number, whose choice will remain free for now. To be clear, throughout the remainder of this derivation we are assuming:

$$s > 0$$
$$a > 0$$

Because we multiplied each side of our original inequality by $s > 0$ and then exponentiated (a strictly increasing function), we can state that:

$$X_1 + ... + X_n \geq na \;\; \text{ if and only if } \;\; e^{s(X_1+...+X_n)} \geq e^{sna}$$

Since the events are mathematically equivalent, we will try and say something about the probability of the transformed event:

$$P \big( e^{s(X_1+...+X_n)} \geq e^{sna} \big)$$

We will start by utilizing the Markov Inequality! Recall, Markov's Inequality has the form:

$$P \big(Z \geq c \big) \; \leq \; \frac{E[Z]}{c}$$

So, we can let $Z = e^{s(X_1+...+X_n)}$ and $c = e^{sna}$:

$$P \big(e^{s(X_1+...+X_n)} \geq e^{sna} \big) \; \leq \; \frac{E[e^{s(X_1+...+X_n)}]}{e^{sna}}$$

On the right hand side we can factor the exponential of a sum as a product of exponentials:

$$\frac{E[e^{sX_1}\cdot...\cdot e^{sX_n}]}{e^{sna}}$$

And then we can use the assumption that the $X$'s are independent. We have the expectation of the product of independent random variables, which is equal to the product of the expectations:

$$\frac{E[e^{sX_1}]\cdot...\cdot E[e^{sX_n}]}{e^{sna}}$$

And because the $X_i$'s are identically distributed, all of these factors are equal:

$$\frac{\Big(E[e^{sX_1}] \Big) ^n}{e^{sna}}$$

Which with a bit of algebra can be written more generally as:

$$\Big(\frac{E[e^{sX_1}]}{e^{sa}}\Big) ^n$$

We can let the inner expression be equal to some number $\rho$:

$$\rho = \frac{E[e^{sX_1}]}{e^{sa}}$$

And we then have:

$$\rho^n$$

Let's take a minute to recap where we are at and the subsequent implications. We have shown that:

$$P\big( X_1 + ... + X_n \geq na \big) = P \big(e^{s(X_1+...+X_n)} \geq e^{sna} \big) \; \leq \; \rho^n $$

Now, when is this bound going to be interesting? Well, it will be interesting when $\rho$ is less than one, in which case our original probability will fall off exponentially with $n$. The key at this point is that we have the freedom to choose $s$. For any value of $s$ we obtain an upper bound:

$$\Big(\frac{E[e^{sX_1}]}{e^{sa}}\Big) ^n$$

We are going to choose $s$ to get the most informative/powerful upper bound. First, let's solve for the expectation in the numerator. Because $X_1$ takes the values $-1$ and $+1$ with equal probability, the expectation is simply:

$$E[e^{sX_1}] = \overbrace{\frac{1}{2}e^s}^{X_1 = 1} + \overbrace{\frac{1}{2}e^{-s}}^{X_1 = -1}$$
$$E[e^{sX_1}] = \frac{1}{2}(e^s + e^{-s})$$

And we can substitute that back into our earlier expression:

$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n$$

If we can now choose $s$ so that the above quantity is less than 1, then we will have achieved our objective. It is important to keep asking yourself why throughout this process. Why do we want the above quantity to be less than 1? Because that means our bound will be less than 1 and fall exponentially with $n$, ensuring that the probability of our event, $P\big( X_1 + ... + X_n \geq na \big)$, falls exponentially with $n$. The question is, how exactly should we go about choosing $s$?

Let's start by plotting both the numerator and denominator of our above expression as functions of $s$:

In [134]:
fig = plt.figure(figsize=(8, 5))   

s = np.arange(-2, 2, 0.0001)
a = 0.5

numerator = 0.5 * (np.exp(s) + np.exp(-s))

denominator = np.exp(a*s)

plt.plot(s, numerator, color=sns.xkcd_rgb["blue"])
plt.plot(s, denominator, color=sns.xkcd_rgb["red"])

plt.xlabel(r"$s$")
plt.legend(["Numerator", "Denominator"])
plt.title(r"Numerator and Denominator as function of $s$")

plt.show()

We can see that the numerator takes a value of 1 when $s=0$, and it has a derivative of 0 at $s=0$. The denominator, an exponential, also has a value of 1 when $s=0$, but it has a positive derivative! This tells us that at least in the vicinity of 0 the denominator is going to be larger than the numerator, and hence the fraction is going to be less than 1. This means that we will have achieved our goal of an exponentially decaying bound.
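We can make this concrete by evaluating $\rho$ on a grid of $s$ values (a sketch with the arbitrary choice $a = 0.5$). For these $\pm 1$ variables, setting the derivative of $\ln \rho$ to zero shows the exact minimizer is $s = \text{arctanh}(a)$, which is close to, though not exactly, the convenient choice $s = a$ used below:

In [ ]:
import numpy as np

a = 0.5
s_grid = np.linspace(0.01, 2, 10_000)
rho = 0.5 * (np.exp(s_grid) + np.exp(-s_grid)) / np.exp(s_grid * a)

print("rho at s = a:       ", 0.5 * (np.exp(a) + np.exp(-a)) / np.exp(a * a))
print("minimum rho on grid:", rho.min())
print("minimizing s:       ", s_grid[np.argmin(rho)], " arctanh(a):", np.arctanh(a))

Both choices give $\rho < 1$, which is all we need for an exponentially decaying bound.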

So, for "small" $s$ we know that $\rho$ is less than 1. However, we want an explicit value for $\rho$, which can be done by fixing an explicit value for $s$. If we set $s = a$ then the bound that we get will be:

$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \;\overbrace{ \leq \; e^{\frac{-na^2}{2}}}^\text{Hoeffding Bound!}$$

Which, remember, means that:

$$P\big( X_1 + ... + X_n \geq na \big) \; \leq \; \Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \; \leq \; e^{\frac{-na^2}{2}}$$
$$P\big( X_1 + ... + X_n \geq na \big) \; \leq \; e^{\frac{-na^2}{2}}$$

Now, clearly I skipped a step; how exactly do we show that by letting $s=a$ we end up with:

$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \; \leq \; e^{\frac{-na^2}{2}}$$

I am going to show that momentarily! First I want to highlight a few things. Notice that even if the $X$'s had a different distribution, as long as they had zero mean, the derivation up until now would have worked! We would have a slightly different expression in place of $\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n$, yet the expectation in the numerator will always have a derivative of zero at $s = 0$ (a consequence of our zero mean assumption), and because of that the expression will always be less than 1 when we choose $s$ to be suitably small.

This will give a result for more general distributions, known as the Chernoff Bound. However, we will not develop the Chernoff Bound now, but rather continue on with Hoeffding and derive our inequality:

$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \; \leq \; e^{\frac{-na^2}{2}}$$

Establishing this will complete the proof, since we have already shown that:

$$P\big( X_1 + ... + X_n - n\mu \geq na \big) \; \leq \; \overbrace{\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n}^\text{Quantity we want to bound}$$

Where we now write the sum with $n \mu$ subtracted to emphasize the zero-mean assumption (for our $\pm 1$ variables $\mu = 0$, so nothing actually changes). Looking at the numerator term, we can use a Taylor Series for the exponential function. The Taylor Series has a special case called the Maclaurin Series, where the expansion point is 0. It looks like:

$$f(x) = f(0) + \frac{f^{(1)}(0)}{1!}x + \frac{f^{(2)}(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 +... = \sum_{n=0}^{\infty} f^{(n)}(0) \cdot \frac{x^n}{n!}$$

In the specific case of our exponential, the Taylor Series takes on the form:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ... = \sum_{n=0}^\infty \frac{x^n}{n!}$$

We can then apply that to our numerator as follows:

$$\frac{1}{2}(e^s + e^{-s}) = \frac{1}{2}(1 + s + \frac{s^2}{2!} + ...) + \frac{1}{2}(1 - s + \frac{s^2}{2!} - ...)$$

Where the odd power terms will subsequently cancel out, leaving us with:

$$\frac{1}{2}(e^s + e^{-s}) = \sum_{i=0}^{\infty} \frac{s^{2i}}{(2i)!}$$

And let's try and find a bound on the term in the denominator, $(2i)!$:

$$(2i)! = \overbrace{1 \cdot 2 \cdot 3 \cdot ... \cdot i}^{i!} \cdot \overbrace{(i + 1) \cdot ... \cdot (2i)}^{\text{Each term is } \geq 2} \; \geq \; i! \cdot 2^i$$
$$(2i)! \; \geq \; i! \cdot 2^i$$

Note that we are saying on the right that each term is greater than or equal to 2, and that there are $i$ such terms. This allows us to then bound our prior Taylor series expansion. Because the term $i! \cdot 2^i$ is in the denominator, the inequality is reversed:

$$\sum_{i=0}^{\infty} \frac{s^{2i}}{(2i)!} \; \leq \; \sum_{i=0}^{\infty} \frac{s^{2i}}{i!\cdot 2^i} $$

We can rewrite the right hand side by taking the term $2^i$ in the denominator and combining it with the term in the numerator:

$$\sum_{i=0}^{\infty} \frac{s^{2i}}{i!\cdot 2^i} = \sum_{i=0}^{\infty} \frac{(\frac{s^2}{2})^i}{i!} $$

Which means our inequality now has the form:

$$\sum_{i=0}^{\infty} \frac{s^{2i}}{(2i)!} \; \leq \; \sum_{i=0}^{\infty} \frac{(\frac{s^2}{2})^i}{i!} $$

Notice that the right side now has the same form as the exponential Taylor series expansion:

$$e^s = \sum_{i=0}^{\infty} \frac{s^i}{i!}$$

Except that instead of $s$ we have $\frac{s^2}{2}$. Hence, we can rewrite:

$$\sum_{i=0}^{\infty} \frac{(\frac{s^2}{2})^i}{i!} = e^{\frac{s^2}{2}}$$

And our inequality turns into:

$$\sum_{i=0}^{\infty} \frac{s^{2i}}{(2i)!} \; \leq \; e^{\frac{s^2}{2}}$$

And we know the left hand side was representing our original numerator:

$$\frac{1}{2}(e^s + e^{-s}) = \sum_{i=0}^{\infty} \frac{s^{2i}}{(2i)!} \; \leq \; e^{\frac{s^2}{2}}$$
$$\frac{1}{2}(e^s + e^{-s}) \; \leq \; e^{\frac{s^2}{2}}$$
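This inequality holds for every real $s$ (the term-by-term comparison above did not restrict $s$); a quick numerical sketch confirms it on a grid:

In [ ]:
import numpy as np

s_grid = np.linspace(-5, 5, 1_001)
lhs = 0.5 * (np.exp(s_grid) + np.exp(-s_grid))
rhs = np.exp(s_grid**2 / 2)

print("lhs <= rhs everywhere on the grid:", bool(np.all(lhs <= rhs)))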

Now we can go back to our earlier inequality:

$$P\big( X_1 + ... + X_n - n\mu \geq na \big) \; \leq \; \Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n$$

And we can bound the right hand side as:

$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \; \leq \; \Big( \frac{e^{\frac{s^2}{2}}}{e^{sa}}\Big)^n = \big( e^{\frac{s^2}{2} - sa}\big)^n$$

If we then let $s=a$:

$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \; \leq \; \big( e^{\frac{a^2}{2} - a^2}\big)^n$$
$$\Big(\frac{\frac{1}{2}(e^s + e^{-s})}{e^{sa}}\Big) ^n \; \leq \; e^{\frac{-na^2}{2}}$$

Allowing us to reach our final conclusion:

$$\text{Hoeffding Inequality} \longrightarrow P\big( X_1 + ... + X_n - n\mu \geq na \big) \; \leq \; e^{\frac{-na^2}{2}}$$

And with that we have gone through the derivation of Hoeffding's Inequality and hopefully shed some light on its application in the study of the feasibility of learning.

Appendix

A.1 Standardizing a Random Variable

Standardizing a random variable is rather straightforward. The goal is to end up with a distribution that has a mean of 0 and a variance of 1. With that goal in mind, let's assume the variable that we are trying to standardize is $X$, and it has the distribution:

$$X \sim N(\mu , \sigma^2)$$

To standardize it, we do the following:

$$\frac{X - \mu}{\sigma}$$

The question is why? Well, remember our goal: to end up with a normal distribution with mean 0 and variance 1. Let's call our standardized $X$ variable $Y$:

$$Y = \frac{X - \mu}{\sigma}$$

What is the expected value of $Y$? Well, that is just:

$$E[Y] = E\Big[ \frac{X - \mu}{\sigma} \Big] = \frac{E[X] - \mu}{\sigma} = \frac{\mu - \mu}{\sigma} = 0$$

And, what is the variance of $Y$:

$$Var(Y) = E [(Y - \mu_Y)^2] = E[Y^2] = E \Big[ \Big( \frac{X - \mu}{\sigma}\Big)^2 \Big] = E \Big[\frac{1}{\sigma^2} (X - \mu)^2\Big] = \frac{1}{\sigma^2} E \Big[(X - \mu)^2 \Big] $$

Where in the last step we pulled the constant $\frac{1}{\sigma^2}$ out of the expectation. The rightmost expression is equivalent to:

$$\frac{1}{\sigma^2} \cdot Var(X) = \frac{1}{\sigma^2} \cdot \sigma^2$$

Leaving us with a variance for $Y$ of:

$$Var(Y) = 1$$

And hence we see that:

$$Y \sim N(0, 1)$$
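As a final sanity check, here is a minimal simulation sketch of this standardization (the particular mean and variance are arbitrary choices):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1_000_000)   # X ~ N(5, 4)

y = (x - x.mean()) / x.std()                         # standardize
print("mean of Y:", round(y.mean(), 5), " variance of Y:", round(y.var(), 5))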

© 2018 Nathaniel Dake