
4. Histogram vs. PDF vs. CDF

This post is TODO.

TODO: Fit in the content below somehow

  • A very interesting thing to keep in mind with histograms is that they do not need to be defined over an evenly spaced x axis. Let's start this idea at its most basic point and see where it takes us.
  • Consider why histograms/distributions originated in the first place: to keep track of counts and get a better idea of which events were more prevalent.
  • So, let's start with a histogram, which captures data we have already seen and allows us to visualize its spread at different levels of granularity (bin size). By design, it is not given an evenly spaced subset of $\mathbb{R}$; rather, it is based only on the data points in the sample it is visualizing. What would happen if we had an infinite number of data points and an infinite number of bins?
  • Well, this leads us to probability distributions. These are an extension of histograms; more accurately, they are what a normalized histogram approaches as the sample size grows to infinity and the bin width shrinks to zero.
  • Probability distributions are useful because they can compactly represent an infinite amount of data, assuming that the underlying data-generating process is indeed described by the distribution we select.
  • Now, probability distributions are not necessarily defined over $\mathbb{R}$; however, they are defined over a set of evenly spaced input points. For the binomial that is $\{0, 1, \dots, n\}$, where $n$ is the number of trials; for the Gaussian it is $\mathbb{R}$; and for the Poisson it is the natural numbers, $\mathbb{N}$. All of these sets are evenly spaced, even though they are not all equal to $\mathbb{R}$.
  • The main idea to keep in mind here is the interplay between a histogram and a probability distribution. A histogram, by design, is not defined over a set of evenly spaced points; rather, it captures the frequency of counts in a specific sample. If that sample were evenly spaced, our histogram would just look uniform. A probability distribution, a mathematical extension of a histogram, is defined over a set of evenly spaced points.
  • Does this seem strange? The reason has to do with limits, but a nice way to think about it is as follows. A histogram takes a set of bins (intervals), determines from the sample the count in each bin, and represents this count as a height. As we decrease the bin size to something arbitrarily small, and if we had an infinite number of data points in our sample (again, they do not need to be evenly spaced), we would eventually have no area to our bins, just a height. Our probability density function is meant to capture the height at this limit. The idea is that our sample was not evenly spaced, and that is why the height of the histogram and the probability density function is not simply uniform: it varies! The height captures this non-uniformity. (A numerical sketch of this limit follows this list.)
  • Our PDF is then able to take in a set of evenly spaced points because its structure, by design, encodes the information obtained from the histogram. It knows that some values of this evenly spaced $x$ are more common than others, and it reflects that in the height (the function's output). Again, the PDF itself is constructed in a way that captures/encodes this non-uniformity, and reflects the histogram with an infinite number of points and arbitrarily small bins.
  • This is a crucial point to keep in mind when thinking about the empirical nature of a histogram and the mathematical extension of a PDF.
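
To make this limiting process concrete, here is a minimal sketch. The standard normal data-generating process and the particular sample sizes are assumptions chosen purely for illustration; the point is only that a normalized histogram's heights hug the true PDF more tightly as the sample grows and the bins shrink.

In [ ]:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

for n, bins in [(100, 10), (10_000, 50), (1_000_000, 500)]:
    sample = rng.standard_normal(n)
    # density=True rescales the counts so the total bar area is 1,
    # which makes the histogram directly comparable to a PDF
    heights, edges = np.histogram(sample, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    # Average gap between the histogram heights and the true PDF;
    # this should shrink as n grows and the bins get finer
    gap = np.abs(heights - norm.pdf(centers)).mean()
    print(f"n={n:>9,}  bins={bins:>4}  mean |hist - pdf| = {gap:.4f}")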

  • TODO: Add to above

  • Notes to use in above:

    • A PDF is, simply described, a function that takes in a data point, $x \in X$, and transforms it into its respective probability density (a curve height). If we think about this relationship to the histogram, the PDF is really saying that it will transform $x$ into larger numbers (densities) if we expect to observe many data points nearby, and into lower numbers if we expect to see fewer data points nearby.
    • If we remember the relationship between the PDF and the CDF (the CDF is the integral of the PDF; the PDF is the derivative of the CDF), that may help us reason about this. The histogram, while generally overlaid with the PDF, actually represents chunks of area (i.e., of probability). Because they deal in area, these chunks/bins are indeed probabilities. However, if we make the bin widths finer and finer, to the point where they are infinitesimally small, each bin eventually reduces to essentially just a height: probability per unit width. At this point, that height is literally the derivative of the CDF (the rate of change of probability). That is why we call these values densities. In a CDF, a range with a large slope corresponds to a dense cluster of observations! (A numerical sketch of this follows these notes.)
    • TODO: ADD 1-D VISUALIZATION OF THIS (points scattered on line and 'densely' clustered), then have their CDF, PDF, and histogram all above it!!! THIS IS IDEAL.
    • This idea of 'density' corresponds to where the CDF has a very high slope. If the CDF has a very high slope, its derivative is high, and hence its PDF will be high. A PDF is a way of representing the slope/rate of change/derivative of the probability of observing a certain range of $x$.
    • If we take chunks of area under the PDF (i.e., integrate over bins/intervals), these areas are probabilities.
    • https://www.nathanieldake.com/Mathematics/01-Calculus-01-Fundamental-Theorem-of-Calculus.html
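
Here is a small numerical sketch of that slope-equals-density point. The two-cluster mixture sample is made up purely for illustration: points packed tightly around one location produce a steep empirical CDF there, and numerically differentiating that CDF recovers a correspondingly large density.

In [ ]:

import numpy as np

rng = np.random.default_rng(1)
# Tight cluster near -2, diffuse cluster near 3
x = np.concatenate([rng.normal(-2, 0.3, 5000), rng.normal(3, 1.0, 5000)])

# Empirical CDF evaluated on an evenly spaced grid
grid = np.linspace(-5, 7, 400)
ecdf = np.searchsorted(np.sort(x), grid, side="right") / x.size

# Numerical derivative of the ECDF: a high slope means a high density
density_estimate = np.gradient(ecdf, grid)

# The tight cluster should show a much larger slope (density)
# than the diffuse one
print("estimated density near -2:", density_estimate[np.argmin(np.abs(grid + 2))])
print("estimated density near  3:", density_estimate[np.argmin(np.abs(grid - 3))])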

We can then abstract from these messy empirical distributions to more theoretical ones, such as the normal. Touch on Taleb here.

Now, when it comes to our implementation, we need to figure out a way to computationally compare the two empirical CDFs above. Let's look at how we might do that.
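
One natural option is a minimal sketch assuming the two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap between the two empirical CDFs. (Nothing above commits us to this method; it is just one reasonable choice.) The ks_statistic helper below is illustrative; in practice, scipy.stats.ks_2samp computes the same statistic along with a p-value.

In [ ]:

import numpy as np

def ks_statistic(a, b):
    """Maximum absolute difference between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    # Evaluate both ECDFs at every observed point from either sample,
    # since the maximum gap must occur at one of these points
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(2)
same = ks_statistic(rng.standard_normal(1000), rng.standard_normal(1000))
diff = ks_statistic(rng.standard_normal(1000), rng.normal(1.0, 1.0, 1000))
print(f"same distribution:    KS = {same:.3f}")  # should be small
print(f"shifted distribution: KS = {diff:.3f}")  # should be noticeably larger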


© 2018 Nathaniel Dake