Nathaniel Dake Blog

1. Linear Algebra Introduction

Linear algebra is frequently utilized in the implementation of machine learning algorithms, so it is very important to have an intuitive understanding of what it represents and how it is used. I recommend reading this alongside my numpy walkthrough in the math appendix. This overview is going to consist of three main parts:



1. Linear Algebra: A Programmer's Intuition

The first section is going to go through an overview from a programming perspective about what linear algebra is all about.



2. Matrix Multiplication: A Programmer's Intuition

The second section is going to go through the visual process of matrix and vector multiplication and how to think about these processes from a programming perspective. Some terminology will be introduced here as well.



3. Dot Product: Geometric Intuition as it relates to Machine Learning

The third section is going to go through the geometric interpretation of the dot product and how it connects to the matrix multiplications that show up constantly in machine learning models.




Linear Algebra: A Programmer's Intuition

Matrix multiplication, and linear algebra in general, is often explained in one of two ways:

1. Geometrically

Intuitions are based on thinking of matrix multiplication as a linear transformation applied to a vector, and how that transformation is really the scaling/rotation/skewing of a plane.



You have an original vector, a transformation is applied, and a new vector comes out. The issue with this is that it can lead to a heavy reliance on geometry that becomes very difficult to reason with as your dimensionality increases. For instance, when there are only two dimensions, $x$ and $y$, it is simple to think of the linear transformation of a plane, just as we see in the image of the Mona Lisa above. However, if we suddenly have 500 dimensions, that geometric interpretation becomes far less intuitive.

2. Composing Linear Operations

It is true that the technical definition of matrix multiplication is that it results in the composition of the original linear functions. However, it again leaves you feeling slightly empty when it comes time to perform matrix multiplication in code. Let's look at a different intuition that is particularly useful in the context of machine learning.

Matrix Multiplication is about information flow

Before we can dig into this example, it is important to touch on the notation involved in matrix multiplication. In scalar multiplication, the order of the numbers does not matter; this is the commutative property:

$$a * b = b * a$$

However, when we get to matrix multiplication, it is often confusing that this is no longer the case. Order does matter:

$$A * B \neq B * A$$

Why is this the case? Well, the easiest way to explain this is that matrix multiplication uses the same notation that we do when writing functions. For example, if we were to write that $f$ is a function of $x$, it would look like:

$$f(x)$$

In this case $f$ is a function that is applied to the value of $x$. Order in this case clearly matters:

$$f(x) \neq x(f)$$

So, that is the first thing to always remember: when writing out matrix multiplication, the matrix on the left is the composition of functions (the operations), and the one on the right is the input.
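Since this writeup is meant to be read alongside the numpy walkthrough, a quick check in numpy makes the asymmetry concrete. The two matrices below are arbitrary, chosen purely for illustration:

```python
import numpy as np

# Two arbitrary 2 x 2 matrices, chosen purely for illustration
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)  # [[2 1]
              #  [4 3]]
print(B @ A)  # [[3 4]
              #  [1 2]]  -> a different result, so A @ B != B @ A
```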

With that said, the next piece of notation to consider is how matrices that represent functions are written. For instance, say we are dealing with a linear transformation that represents the following:

$$F(x,y,z) = 3x +4y +5z$$

Where in this case $x , y , z $ are just variables or the dimensions we are transforming, and the coefficients are the amount each dimension is being scaled by. Keep in mind in machine learning we often have hundreds or thousands of dimensions, and they are usually represented as $x_1, x_2, x_3...$, meaning our linear transformation could look like:

$$F(x_1,x_2,x_3) = 3x_1 +4x_2 +5x_3$$

Say that we then had $x_1 = 3$, $x_2 = 2$, and $x_3 = 7$, our linear transformation would result in:

$$F(x_1,x_2,x_3) = F(3,2,7)= 3*3 +4*2 +5*7 =52$$

The convention in linear algebra is to place operations (our linear transformation $F$ in this case) in rows, and our data (our inputs $x_1, x_2, x_3$) in columns. This can be visualized below. $F$, which is a linear transformation, can be thought of as an operation (remember, it is essentially a function):

Which in this specific case takes the form:

$$F = \begin{bmatrix} 3 & 4 & 5 \end{bmatrix}$$

This row is actually a (1 x 3) matrix. That notation is often used to describe matrices, and can be read as 1 row with 3 columns. We no longer need the variable names, just the coefficients, since the coefficients fully describe the transformation we are dealing with.

Again, based on standard linear algebra notation, data is going to be in the form of a column vector. So our input data, $x_1, x_2, x_3$, which we will now call $X$, is going to be a column vector:

And in this specific case will look like:

$$X = \begin{bmatrix} 3 \\ 2 \\ 7 \end{bmatrix}$$

This column vector is of shape (3 x 1), which can be read as 3 rows and 1 column. Again, we do not need to try to incorporate the $x_1, x_2, x_3$ variables, since the values and their index give us all we need to know.

If we wanted to access the value of $x_2$ in this case, we would write $X_{21}$, which means the second row and first column. Linear algebra follows the notation (R x C), which stands for row x column. So whenever you see an index or are given the shape of a matrix, the first number represents the number of rows, the second the number of columns.
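As a small aside for the programmers, here is how that convention maps onto numpy, using the concrete values from the example above. Note that numpy is 0-indexed, so the math notation $X_{21}$ becomes the index [1, 0]:

```python
import numpy as np

X = np.array([[3],
              [2],
              [7]])    # the column vector from the example above

print(X.shape)   # (3, 1) -> 3 rows, 1 column
print(X[1, 0])   # 2 -> row index 1, column index 0 in numpy's 0-indexed terms,
                 #      i.e. the entry the math notation calls X_21
```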

When we then want to apply the linear transformation $F$ to the input $X$, we perform what is known as matrix multiplication. Mathematically, that looks like:

$$F*X$$

A visualization is essential here; we can imagine taking our column vector, and placing it on its side as the linear transformation is applied:

And we end up with our resulting output, a (1 x 1) matrix, aka a scalar:

$$F * X = \begin{bmatrix} 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \\ 7 \end{bmatrix} = \begin{bmatrix} 3*3 + 4*2 + 5*7 \end{bmatrix} = \begin{bmatrix} 52 \end{bmatrix}$$

It should be noted that what we just did was the dot product, which I describe in my numpy walkthrough.
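Here is a minimal numpy sketch of the computation we just walked through, using the same $F$ and $X$ as above:

```python
import numpy as np

F = np.array([[3, 4, 5]])   # the (1 x 3) operations matrix
X = np.array([[3],
              [2],
              [7]])          # the (3 x 1) input column vector

result = F @ X               # 3*3 + 4*2 + 5*7
print(result)                # [[52]] -> a (1 x 1) matrix
print(np.dot(F.flatten(), X.flatten()))   # 52 -> the same value as a plain dot product
```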

Now, hopefully you can imagine how this concept could be extended. What if we wanted to have several linear transformations, not just $F$? That would mean that we have multiple rows in our operations matrix. For each additional linear transformation row, our output will have an additional row. In our example above, we only have 1 operation row, so our output only has 1 row. If we had had 3 operation rows, our output would have had 3 rows.

As a side note, this is a good time to bring up our matrix multiplication rule: the inner dimensions must match! Hence, when we multiplied a (1 x 3) matrix times a (3 x 1) matrix, the inner dimensions of 3 match! We can change the outer dimensions to any number we want. For the operations matrix, changing the outer dimension, the 1, simply means that our output will end up having that many rows. For the input data matrix, changing the outer dimension, 1 as well, just changes how many input vectors are going to be transformed by our operations matrix.
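The inner-dimension rule is easy to verify in code; a quick sketch with arbitrary shapes:

```python
import numpy as np

op = np.random.randn(1, 3)     # a (1 x 3) operations matrix
data = np.random.randn(3, 1)   # a (3 x 1) input vector

print((op @ data).shape)       # (1, 1) -> inner 3s match, outer dimensions remain

bad = np.random.randn(4, 1)    # inner dimensions 3 and 4 do not match
try:
    op @ bad
except ValueError as err:
    print("shapes do not align:", err)
```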

Okay great, with all of that said, it's time to bring in a very helpful visualization. Any time we have a matrix and are applying it to an input, we can imagine pouring each input vector through the operations matrix.



As an input passes down the operation matrix, it creates an output value.



A matrix is shorthand for the diagrams we have been making. We can think of it as a single variable that represents a spreadsheet of inputs or operations.

$$Inputs = A = \begin{bmatrix} input_1 \; input_2 \end{bmatrix} = \begin{bmatrix} a & x \\ b & y \\ c & z \end{bmatrix}$$

$$Operations = M = \begin{bmatrix} operation_1 \\ operation_2 \end{bmatrix} = \begin{bmatrix} 3 & 4 & 5 \\ 3 & 0 & 0 \end{bmatrix}$$



Potential Confusion

Now there are several places we may be tripped up. The first is the order in which we read this. We have already gone over that we use function notation. All this means, to recap, is that instead of writing input => matrix => output, we write the operations matrix first, followed by the input data matrix. We generally write a matrix with a capital letter (in our case earlier, $F$), and a single input vector with a lowercase letter (earlier it would have been $x$). Because we now have several inputs and outputs, they are each matrices as well.

$$MA = B$$

$$\begin{bmatrix} 3 & 4 & 5 \\ 3 & 0 & 0 \end{bmatrix} \begin{bmatrix} a & x \\ b & y \\ c & z \end{bmatrix} = \begin{bmatrix} 3a+4b+5c & 3x+4y+5z \\ 3a & 3x \end{bmatrix}$$
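As a sketch, we can reproduce the product above in numpy by substituting arbitrary numbers for the symbolic entries (a=1, b=2, c=3 and x=4, y=5, z=6 are made up purely for illustration):

```python
import numpy as np

M = np.array([[3, 4, 5],
              [3, 0, 0]])   # two operation rows, shape (2, 3)
A = np.array([[1, 4],
              [2, 5],
              [3, 6]])      # two input columns, shape (3, 2)

B = M @ A
print(B.shape)   # (2, 2): one row per operation, one column per input
print(B)         # [[26 62]
                 #  [ 3 12]]
```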

The second potentially confusing aspect of all of this is the numbering, which we have also briefly talked about. Our matrix size is going to be measured as row x column, or (R x C). However, standard notation is to refer to it as m x n. Items in the matrix are going to be referenced in the same way: $a_{ij}$ is the entry in the ith row and jth column.

The third potential source of confusion is that often when we have more than one column data vector, we start placing the data vectors as rows in a matrix. This is seen very frequently in machine learning contexts, so it is something that we definitely want to be aware of. From a visual perspective, we can imagine having one data vector, $x$:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

Above, it is still a column; however, once we have more than one data vector, we generally start placing them in rows. For instance, say we have 3 data vectors:

$$x^1 = \begin{bmatrix} x_1^1 \\ x_2^1 \\ x_3^1 \end{bmatrix},\; x^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ x_3^2 \end{bmatrix},\; x^3 = \begin{bmatrix} x_1^3 \\ x_2^3 \\ x_3^3 \end{bmatrix}$$

Note here we are using superscripts because the subscript has already been utilized; it is not referring to exponentiation. In practice, instead of combining all of these column vectors and leaving them as columns, we would make them row vectors like so:

$$X = \begin{bmatrix} x_1^1 & x_2^1 & x_3^1 \\ x_1^2 & x_2^2 & x_3^2 \\ x_1^3 & x_2^3 & x_3^3 \end{bmatrix}$$

And generally the superscript notation is not used, and instead we have two subscripts, the first representing the row, the second the column:

$$X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}$$
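In code, this row-per-sample convention usually means a transpose shows up when the operations matrix is applied. The sketch below is just an illustration, reusing the $M$ matrix from earlier with made-up data values:

```python
import numpy as np

# Three data vectors stored as rows -- the usual machine learning layout
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])   # shape (3, 3): 3 samples, 3 features (made-up values)

M = np.array([[3, 4, 5],
              [3, 0, 0]])   # shape (2, 3): 2 operations acting on 3 features

# With samples as rows, one side gets transposed so the inner dimensions match
outputs = X @ M.T            # shape (3, 2): one row of outputs per sample
print(outputs)               # [[26  3]
                             #  [62 12]
                             #  [98 21]]
```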



Matrix Multiplication: A Programmer's Intuition

Okay now we can really try and expand our intuition of matrix multiplication, since we are going to be seeing it so frequently in machine learning. As we know, by convention a vertical column is generally a data vector, and a horizontal row is typically a function:

However, and this is a key understanding if you want matrix operations to be second nature while practicing machine learning, the operation could be a column, and the data a row! A visualization should help make this clear:

The row containing a horizontal function could really be three data points (each containing a single element), and the vertical column of data could really be three distinct functions, each taking a single parameter. This is the critical thing to understand: depending on what we want the outcome to be, we can combine data and code in a different order.

Matrix Transpose

By definition, the matrix transpose swaps rows and columns. To see what that means, say we have a column vector with 3 entries:

$$x = \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}$$

The transpose of $x$, $x^T$, would then be:

$$x^T = \begin{bmatrix} 3 & 4 & 5\\ \end{bmatrix}$$

At which point it can be read either as a single function taking 3 arguments, or as 3 separate single-element data points, since the transpose laid the entries out in a row.

Similarly, if we had a row vector:

$$f = \begin{bmatrix} 3 & 4 & 5\\ \end{bmatrix}$$

Then its transpose, $f^T$:

$$f^T = \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}$$

Can either be a single data vector in a vertical column, or three separate functions, each taking a single input. Let's look at this in practice. Say we have the following equation:

$$x^T* x$$

In this case we mean that $x^T$ (as a single function) is working on $x$, a single data vector. The result is the dot product: we have applied the data to itself.



Now what if we see it reversed?

$$x * x^T$$

Here we mean that $x$, now as a set of functions, is working on $x^T$, a set of individual data points. The result is a grid where we've applied each function to each data point.
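Both orderings are a single line in numpy; the sketch below uses the column vector from the transpose example above:

```python
import numpy as np

x = np.array([[3],
              [4],
              [5]])   # a (3 x 1) column vector

inner = x.T @ x        # (1 x 3) @ (3 x 1) -> (1 x 1): the dot product
outer = x @ x.T        # (3 x 1) @ (1 x 3) -> (3 x 3): a grid of products

print(inner)   # [[50]]  (3*3 + 4*4 + 5*5)
print(outer)   # [[ 9 12 15]
               #  [12 16 20]
               #  [15 20 25]]
```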






Dot Product in Machine Learning

Now, when first diving into machine learning, it is difficult to connect the portions of linear algebra, which mainly seem to be just big matrix multiplications, to the geometric interpretation of the dot product. It may seem odd that the dot product is continually mentioned, since its geometric interpretation feels very far away from a neural network, i.e. where a vector of inputs is multiplied by a set of weights to give the value at a specific node. How does this relate to geometry and the conventional dot product?

Before we dive into it, I need to preface by saying this was a "Eureka" moment for me. I had been diving into linear algebra for months, and had been training machine learning models, going through their theoretical underpinnings and implementing them in code, but until I made this connection everything felt very divided. Without further delay, let's get into it.

Say I have an image of a dog, a cat, and a new test image that could be either a dog or a cat. I would like for my machine learning model to tell me which one it is. Well, the algorithm is going to perform a very sophisticated process which is analogous to the dot product.

Geometrical View

From the geometrical side of things, what is a dot product? Well, if I have two vectors, $a$ and $b$, the dot product will determine how similar in direction vector $a$ is to vector $b$, based on the angle between them.



If the input vector for the unknown image is closer in direction to the dog vector, it will classify the unknown image as a dog! However, if the unknown image vector is closer to the cat vector, it will classify it as a cat!
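A small sketch of this idea using cosine similarity, with feature vectors that are completely made up for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1 means identical direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical feature vectors -- not real image features
dog = np.array([0.9, 0.1, 0.4])
cat = np.array([0.2, 0.8, 0.3])
unknown = np.array([0.8, 0.2, 0.5])

sims = {"dog": cosine_similarity(unknown, dog),
        "cat": cosine_similarity(unknown, cat)}
print(sims)
print("prediction:", max(sims, key=sims.get))   # the class whose vector points
                                                # most nearly in the same direction
```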

In deep learning, this classification is the result of many layers of successive dot product classifications before an answer is produced.

This may become more concrete with a thorough example. Imagine we are dealing with a single-layer perceptron for multi-class classification. Our input vector consists of 3 dimensions, and we have 3 potential output classes.

We can see that in order to find the output node 1, 2, and 3 values, the input vector (the values of nodes 1, 2, 3 in the input layer) is multiplied by the corresponding weights.

Now, we can think about it as follows: each column in the weight matrix represents a vector that is applied to the input vector via the dot product. For instance, output node 1 is equal to:

$$x_1*W_{11}+x_2*W_{21}+x_3*W_{31}$$

And output node 2 is equal to:

$$x_1*W_{12}+x_2*W_{22}+x_3*W_{32}$$

And output node 3 is equal to:

$$x_1*W_{13}+x_2*W_{23}+x_3*W_{33}$$

What this means, in relation to what we talked about earlier, is that the greater the similarity between our input vector and the column weight vector corresponding to a specific output class, the greater the probability of the input vector belonging to that class! For instance, say that class 1 represents a dog, 2 a cat, and 3 a fish. If we make a prediction on an input dog vector, our goal would be that the input vector is most similar to the weight vector $[W_{11} \; W_{21} \; W_{31}]$, which is the weight vector mapping to output node 1, representing dog.
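A minimal sketch of this single-layer computation, assuming the weight matrix $W$ is stored with shape (3 inputs x 3 outputs) as in the equations above; the specific numbers are invented:

```python
import numpy as np

x = np.array([0.9, 0.1, 0.4])   # input vector (hypothetical feature values)

# W[i, j] is the weight connecting input node i+1 to output node j+1
W = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.1, 0.9]])

outputs = x @ W                 # each output is the dot product of x with a column of W
print(outputs)                  # [0.86 0.2  0.46]
print("predicted class:", np.argmax(outputs) + 1)   # 1 = dog, 2 = cat, 3 = fish
```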

This idea of course becomes much more difficult to follow when many hidden layers are introduced, but understanding how the dot product relates geometrically during all of the calculations that occur in neural networks is invaluable.


© 2018 Nathaniel Dake