As we tread further into the twenty-first century, almost everyone is expected to memorize the mantra "we must make data-driven decisions" (well, at least most people in the technology space, and certainly data scientists). However, I want us to pause for a moment and think about what that really means.
In an idealized, rigid, platonic world this may simply mean casting our first intuitions aside and instead using metrics related to the problem at hand, the outcomes of past decisions, and so on. In a simplistic system this is a satisfactory approach. Consider the following (overly simple) scenario: you want to figure out the fastest way to drive from your house to the grocery store at rush hour (note how constrained this situation already is). You intuitively feel as though route $A$ will be faster, so in general you take that route. You know that on average it takes about 9 minutes. Then, one day your roommate decides to join you and recommends route $B$, stating that they always drive that way and that on average it takes 6 minutes. You give route $B$ a shot and, sure enough, it takes six and a half minutes.
This is an example of a very simplistic scenario that is conducive to basic data-driven decision making. You have (essentially) all the data/variables you need in order to represent the scenario at hand. In other words, the decision is based on a univariate function:
$$\text{Time spent driving to grocery store} = f(\text{route})$$

We have data regarding the driving time of both routes, and in this toy example there is really nothing else we need to consider in order to make an optimal decision that reduces driving time.
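To make concrete just how simple this toy decision is, here is a minimal sketch in code; the timing data below is made up purely for illustration:

```python
# Hypothetical drive times in minutes for each route; purely illustrative data.
route_times = {
    "A": [9.5, 8.8, 9.2, 9.1, 8.9],   # your past trips
    "B": [6.5, 6.1, 5.8, 6.4, 6.2],   # your roommate's past trips
}

# With a single explanatory variable and direct measurements of the outcome,
# the "data-driven decision" is nothing more than a comparison of sample means.
avg = {route: sum(times) / len(times) for route, times in route_times.items()}
best_route = min(avg, key=avg.get)

print(avg)         # {'A': 9.1, 'B': 6.2}
print(best_route)  # B
```

There is no uncertainty worth modeling here; the data fully determine the decision.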
It should be no surprise that this is not how things work in the real world. The real world is messy and contains an abundance of variables, and these variables manifest as uncertainty. The question that I have become obsessed with is as follows:
How can we reason optimally in complex and uncertain situations?
For instance, let's now say that your company sells widgets. You, as a person in marketing, are in charge of coming up with sales offerings around the holiday season. Your initial intuition is that if you give out 20 dollar coupons to anyone who makes a purchase, you will get more purchases. However, there is a competing hypothesis from your colleague that suggests offering a discount to customers who make over 1000 dollars in purchases would actually be more effective at generating revenue. You have some historical time series data, but none of it was collected under the conditions of either hypothesis (i.e., you have no data from a 20 dollar coupon period, or from a 1000 dollar purchase discount period). The data that you have is necessarily incomplete, and even if it weren't, we would still have to confront the following logical problem:
How do we use data (frequencies of events) to estimate plausibilities of beliefs?
In other words, how can we use the data at hand to estimate how plausible one hypothesis is compared to the other? This question is the central focus of Probability Theory: The Logic of Science, by E.T. Jaynes. Often viewed as the first text to make probability theory a "hard" branch of mathematics (compared to a group of ad hoc methods), it is an incredibly ambitious and thought-provoking book that should be on any data scientist's or statistician's bookshelf. However, at times it is rather dense, and I wanted to take the time to create a set of articles that serve as chapter summaries. Note, these are not meant to replace the original text; rather, they can be read in tandem with it to clear up any sources of confusion and ensure clear understanding.
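As a rough preview of where the book is headed, the comparison between your coupon hypothesis $H_1$ and your colleague's discount hypothesis $H_2$, given data $D$ and prior information $I$, ultimately takes the form of a ratio of plausibilities; Bayes' theorem in odds form (derived carefully in later chapters) reads:

$$\frac{P(H_1 \mid D, I)}{P(H_2 \mid D, I)} = \frac{P(H_1 \mid I)}{P(H_2 \mid I)} \cdot \frac{P(D \mid H_1, I)}{P(D \mid H_2, I)}$$

The first factor on the right is the prior odds (what we believed before seeing any data), and the second is the likelihood ratio (how strongly the data favor one hypothesis over the other). Keep this in the back of your mind; Jaynes will earn this formula rather than assume it.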
With that said, let's begin digging into the book, starting with the preface and chapter 1.
Jaynes starts the book by stating who the intended audience is; while this is generally not very informative, here it actually has a good deal of merit! He states that the material is meant to help those who are in fields such as physics, chemistry, biology, geology, medicine, economics, sociology, engineering, operations research, etc.; any field where inference is needed. He adds a footnote stating that by inference he means:
Inference: Deductive reasoning whenever enough information is at hand to permit it; inductive or plausible reasoning when the necessary information is not available (as it almost never is in real problems). If a problem can be solved with deductive reasoning, probability theory is not needed for it. Thus, the topic of this book is the optimal processing of incomplete information.
As a data scientist, this type of footnote should send tingles down your spine. In nearly every situation you encounter, you will be making inferences based on a combination of deductive and inductive reasoning; who wouldn't want to be making those inferences in an optimal way?
Now, for those unfamiliar with the concepts of deduction and induction I recommend checking out this section of my article on Bayesian Inference. But, for a quick recap, we can think of deduction as forming some hypothesis about the state of the world or its workings, gathering data about that state, and then seeing if our data confirms or denies our hypothesis. It can be visualized below:
That is deduction. We perform deduction every day with relative ease. On the other hand we have induction, which works in the opposite direction:
Here, we gather data about the world around us, and after picking up on a pattern we induce a hypothesis, then work towards establishing its validity. Now, it should be noted that this idea of reasoning from evidence to hypothesis can be thought of as a parallel of reasoning from effect to cause.
As a small example (that I came across when reading The Book of Why by Judea Pearl), consider Sherlock Holmes. Imagine that Sherlock Holmes was called to investigate a broken window. Sure, he could go about it deductively and, while driving to the location of the window (i.e. before gathering any data), generate the hypothesis that it was a group of children who broke the window. This is what humans do with ease, daily. However, what truly made Sherlock Holmes so powerful was his ability to perform induction. In this case, he arrives at the scene of the broken window, sees glass scattered about, and surely a host of other interesting things; he takes all of this data and then generates a hypothesis about how the window was broken.
If that example leads you to feel slightly intimidated by the process of induction, do not worry! Without digging into it too deeply, there is actually an asymmetry going on here that can lead you into chaos theory. For this I will borrow another example, this time from Nassim Taleb. Imagine that you want to understand what happens when you leave a block of ice out at room temperature. You hypothesize that it will melt (form your hypothesis). You then take 10 blocks of ice, leave them all out at room temperature, and see that they do indeed all melt (collect data). You just performed a very straightforward deduction, making use of what is referred to as the forward process.
Consider the inverse case now. Imagine that you walk into your kitchen and see that there is a puddle of water on the ground. You try to gather data, seeing if there is a leak in a pipe anywhere, if a cup spilled, etc. There is an incredibly long list of ways that this water could have gotten there. Trying to determine (from what is most likely incomplete data) that the water is actually there due to a block of ice that melted is incredibly challenging. This is known as the backward process. If you are interested in this type of problem I recommend looking at the Butterfly Effect; for now I will leave it here to prevent going down a rabbit hole.
To give us a north star to focus on, I want to take a moment to highlight what is essentially the theme of the book:
Probability theory as extended logic.
We will dig into this much further in subsequent posts, but what this book does is create a framework in which the rules of probability theory can be viewed as uniquely valid principles of logic in general, leaving out any reference to chance or random variables. This allows the imaginary distinction between probability theory and statistical inference to disappear, giving us logical unity as well as greater technical power.
This theme amounts to recognition that the mathematical rules of probability theory are not merely rules for calculating frequencies of "random variables"; they are also the unique and consistent rules for conducting inference (i.e. plausible reasoning) of any kind.
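For a glimpse of what those unique and consistent rules turn out to be: Jaynes derives them in chapter 2 from elementary desiderata of consistency, and they are precisely the product and sum rules of probability, now read as rules for manipulating plausibilities rather than frequencies:

$$P(AB \mid C) = P(A \mid BC)\,P(B \mid C), \qquad P(A \mid C) + P(\bar{A} \mid C) = 1$$

Everything else, Bayes' theorem included, follows from these two rules.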
This set of rules will automatically include all Bayesian calculations, as well as all frequentist calculations. Nevertheless, our basic rules are broader than either of these, and in many applications the calculations will not fit into either category. As explained by Jaynes:
The traditional frequentist methods which only use sampling distributions are usable and useful in particularly simple, idealized problems; however, they represent the most proscribed cases of probability theory, because they presuppose conditions (independent repetitions of a 'random experiment' but no relevant prior information) that are hardly ever met in real problems. This approach is quite inadequate for the current needs of science.
Jaynes proceeds to dig into the idea of prior information/knowledge, and how it is essential that it be included in data analysis and inference. He writes that a false premise built into a model which is never questioned cannot be removed by any amount of new data. The use of models which correctly represent the prior information that scientists have about the mechanism at work can prevent such folly in the future. By ignoring prior information we not only set ourselves up to fail in the inference we are trying to make, but we also risk stalling scientific progress itself. No amount of analyzing coin tossing data by a stochastic model could have led us to the discovery of Newtonian mechanics, which alone determines those data.
With our stage set, we are ready to dive into what is one of the most thought-provoking and eye-opening books in the realm of mathematics and science. Enter chapter 1.
Suppose that on some dark night a policeman walks down a street, apparently deserted. Suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn't hesitate at all in deciding that this gentleman is dishonest. But the question is: By what reasoning process does he arrive at this conclusion?
Keep in mind the key theme of this book, these articles, and more importantly data science in general: How can we reason optimally in complex and uncertain situations? It may seem as though this preface and introduction get slightly into the weeds; this is a book on probability after all, right? Jaynes goes to great lengths to work through the philosophy behind this vantage point of probability (seen as extended logic instead of being based on the standard Kolmogorov axioms); it is not unreasonable to ask why. The key lies in the fact that Jaynes is not simply trying to replace one axiomatic framework with another; rather, he is trying to build up a way to reason consistently and logically, and to apply that reasoning to any area of inference. I recommend taking a moment to appreciate that and let it sink in. Many of the problems that Jaynes works through in the forthcoming chapters could have been solved with standard orthodox or Bayesian statistics; what makes this framework so powerful is that it provides a way to reason consistently. You are not required to hold in your head a bag of ad hoc methods and techniques. Rather, you have a methodology that allows you to approach and reason about any problem logically and consistently. Surely this is what any scientifically minded problem solver would hope for!
With that said, let's return to our problem. A bit of thought makes it clear that the policeman's conclusion was not a logical deduction from the evidence; there may have been a perfectly reasonable explanation for everything! For instance, it is certainly possible that the man was the owner of the jewelry store and was coming home from a masquerade party, and as he passed his store some teenagers threw a rock at the window, and he was simply trying to protect his property.
So, clearly the policeman's reasoning process was not strictly logical deduction. However, we can grant that it did possess a high degree of validity. The evidence did not make the gentleman's dishonesty certain, but it did make it extremely plausible. This is the type of reasoning that humans became proficient with long before studying mathematical theories. We encounter an abundance of these decisions daily (will it rain or won't it?) where we do not have enough information to permit deductive reasoning; but still we must decide immediately what to do.
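Jaynes captures this kind of reasoning in chapter 1 with what he calls a weak syllogism; it mirrors the strong (deductive) syllogisms discussed below, but its conclusion is only a shift in plausibility rather than a certainty:

$$ \begin{equation} \text{If A is true, then B is true} \\ \frac{\text{B is true}} {\text{therefore, A becomes more plausible}} \end{equation} $$

In the policeman's case, A is roughly "the gentleman is a burglar" and B is "a masked man crawls out of a broken jewelry store window carrying a bag of jewelry"; observing B does not prove A, but it makes A far more plausible.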
Now, in spite of how familiar this process is to all of us, the formation of plausible conclusions is very subtle. This book allows us to replace intuitive judgements with definite theorems, and ad hoc procedures with rules that are determined uniquely by some elementary and inescapable criteria of rationality.
Now, let's take a moment to try and place this within the context of Aristotelian logic; that is the most appropriate starting point to dissect the difference between deductive reasoning and plausible reasoning.
Deductive reasoning is defined as the process of reasoning from one or more statements (premises) in order to reach a logically certain conclusion. We can think of deductive reasoning as follows:
And this is defined mathematically below (known as modus ponens, the law of detachment):
$$ \begin{equation} \text{If A is true, then B is true} \\ \frac{\text{A is true}} {\text{therefore, B is true}} \end{equation} $$

It is the primary deductive rule of inference. An example would be: if it is raining, then the streets are wet; it is raining; therefore, the streets are wet.
There is also a counterpart (known as modus tollens, the law of the contrapositive), another deductive rule of inference:
$$ \begin{equation} \text{If A is true, then B is true} \\ \frac{\text{B is false}} {\text{therefore, A is false}} \end{equation} $$

And we have the simple counterpart example: if it is raining, then the streets are wet; the streets are not wet; therefore, it is not raining.
Both of the above are known as strong syllogisms. Now, in general, deductive reasoning ("top-down logic") contrasts with inductive reasoning ("bottom-up logic") in the following way: in deductive reasoning, a conclusion is reached reductively by applying general rules which hold over the entirety of a closed domain of discourse, narrowing the range under consideration until only the conclusion(s) is left. In inductive reasoning, the conclusion is reached by generalizing or extrapolating from specific cases to general rules, i.e., there is epistemic uncertainty (for example, the black swan problem). However, the inductive reasoning mentioned here is not the same as the induction used in mathematical proofs; mathematical induction is actually a form of deductive reasoning.
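Because these strong syllogisms are purely formal, we can check them mechanically. Below is a small, self-contained sketch (the helper names `implies` and `valid` are just ad hoc choices for this illustration) that brute-forces the truth table: both strong syllogisms hold in every case, while the "reversed" form, reasoning from B back to A, does not; that gap is exactly where plausible reasoning has to take over.

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: 'if A then B' is false only when A is true and B is false."""
    return (not a) or b

def valid(premises, conclusion) -> bool:
    """An argument form is deductively valid iff the conclusion holds in every
    truth assignment where all of the premises hold."""
    return all(
        conclusion(a, b)
        for a, b in product([True, False], repeat=2)
        if all(p(a, b) for p in premises)
    )

# Modus ponens: (A -> B), A  |-  B
print(valid([lambda a, b: implies(a, b), lambda a, b: a], lambda a, b: b))          # True

# Modus tollens: (A -> B), not B  |-  not A
print(valid([lambda a, b: implies(a, b), lambda a, b: not b], lambda a, b: not a))  # True

# "Affirming the consequent": (A -> B), B  |-  A  -- not deductively valid.
# Observing B only makes A more plausible; it does not prove it.
print(valid([lambda a, b: implies(a, b), lambda a, b: b], lambda a, b: a))          # False
```

The third check failing is the whole point: deductive logic simply has nothing to say in that direction, and Jaynes' project is to extend logic so that it does.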