Nathaniel Dake Blog

2. K-Means Clustering

We know that we are going to be trying to perform K-Means Clustering on data. So let's take a moment to visualize some data that we may get.

We can see that the points are all the same color. This is because we are doing unsupervised learning, and there are no classes being given to these points. Each point is just a vector, and that is all we know about each point. However, our human pattern recognition abilities allow us to see 3 distinct groups. In other words, we don't need the data set to tell us that these three groups are distinct. Our own pattern recognition abilities allow us to see this very clearly.

1.1 The Problem

What you may have noticed after looking at the data above, however, is that it is a very unique and specific situation. The first limitation of this data was that it was 2-dimensional. This is of course necessary, because if you had a 100 dimensional data set you wouldn't be able to see it. The universe itself only has 3 dimensions of space, so that is all you can see. This is problematic because most real world data sets are not 2 dimensional, and hence you cannot see most real world data. This means that your own pattern recognition skills are not useful in this scenario. It would be nice if we had some algorithm to find clusters or groups of data, that could work no matter what the dimensionality of the data is.

Here is another issue. In the original data, it was generated so that it would show 3 distinct clusters very clearly. Real data isn't so nice. We will want to be able to know:

Is the cluster we found good or not?

1.2 Intuition

So, as a first step into understanding k-means clustering, let's consider 2 fundamental truths about the clusters in a data set.

Fundamental Truth 1
Suppose we know that the yellow, purple, and green points are the center of some clusters, and we would like to know which center this new blue point belongs to. This point of course belongs to the cluster center that it is closest to. So, the decision rule for choosing which cluster a new point belongs to, is to pick the nearest cluster center.


Fundamental Truth 2
Now, lets consider the second fundamental fact about clusters. Let's say that all of the points we see below belong to the same cluster. We can refer to the data points as:

$$x_1,...,x_C$$

What is the center of this cluster? Of course, that is just the mean of all these data points. This is also referred to as the centroid, if you are thinking geometrically. So, as you know, the way to find the mean of a set of vectors, is to add them up, and divide by the number of vectors:

$$m = \frac{1}{C}\sum_{i=1}^Cx_i$$

1.3 Combining into Clustering Algorithm

Believe it or not, the two above facts are all that you need to implement k-means clustering. It turns out that if we just initialize the set of clusters randomly, if we just repeat these two steps over and over, we will converge to an answer.


Pseudocode

Initialize: pick K random points to be the cluster centers
While not converged:
    Assign each point to the nearest cluster center
    Recalculate each cluster center from points that belong to it


2. K-Means Algorithm

The input to K-Means is a matrix $X$, which is of size $N x D$. This means that we are dealing with $N$ samples and that those samples have $D$ features.

2.1 Training Algorithm

There are two main steps to the k-means clustering training algorithm.

  1. First we choose $k$ different cluster centers. We generally just assign these to random points in the data set.
  2. Then we go into our main loop. Here the first step is to decide which cluster each point in $X$ belongs to. We do that by looking at every sample, and choosing the closest cluster center. Remember, these are usually just assigned randomly to begin with. The second step is to recalculate each cluster center based on the points that were assigned to it. We do this by taking all of the samples, and calculating the mean. This is done until the algorithm converges. Generally this happens very fast, in about 5 to 15 steps. This is very different from gradient descent in deep learning, where we may have thousands of iterations before convergence.


3. Soft K-Means

One problem that we find when we do K-Means is that it is highly sensitive to its initialization. A possible resolution is to restart K-Means several times, and use whichever result gives us the best final objective. What does this tell us though? Well, it tells us that the cost function is susceptible to local minima.

One way to overcome this it have fuzzy membership to each class. This means that each data point doesn't belong to one class or another. But rather, there is an "amount" of membership. For example, it may be 60% part of cluster 1, and 40% part of cluster 2. We can get soft k means which just a small adjustment to the regular k-means algorithm.

3.1 Soft K-Means Algorithm

Pseudocode

Initialize m1...mk = random points in X
While not converged:
    Step 1: Calculate cluster responsibilities
$$r_k^{(n)} = \frac{exp\Big[-\beta d(m_k, x^{(n)})\Big]}{\sum_j exp \Big[-\beta d(m_j, x^{(n)})\Big]}$$
    Step 2: Recalculate means
$$m_k = \frac{\sum_n r_k^{(n)}x^{(n)}}{\sum_n r_k^{(n)}}$$

We can see that step 1 is where the biggest change occurs. We are now defining a term $r_k^{(n)}$, that will always be a fraction. In the case of hard k-means, aka regular k-means, that is where $r_k^{(n)}$ is exactly 0 or 1. We can see that in step 2 we are calculating a weighted mean. If $r_k^{(n)}$ is higher, then this $x_n$ matters more to this cluster K. This means that it has more influence on the calculation of its mean.

If you are familiar with deep learning, you may recognize the calculation of $r_k^{(n)}$ to be something similar to that of the softmax function.

3.2 Soft K-Means Breakdown

Let's take a minute to breakdown these two steps of soft k-means in order to understand them better. If we look at the equation for the responsibility:

$$r_k^{(n)} = \frac{exp\Big[-\beta d(m_k, x^{(n)})\Big]}{\sum_{j=1}^K exp \Big[-\beta d(m_j, x^{(n)})\Big]}$$

We can see that it depends on the distance between the point, and each cluster center. Why does this make sense? Well, you can imagine that if a data point is very close to one cluster center, and very far away from another, then we will get a data point close to 1. This means we will have a higher probability that the point belongs to the cluster with the center closest to it, will be the highest!

However, now lets say a data point is right in between two clusters. Then we should get 0.5. This makes sense because we are equally confident that this data point could belong to either of the two clusters.

So, this method allows us to quantify how confident we are in the cluster assignments, rather than simply assigning a data point to whichever cluster it is closest to.

3.3 Relationship to Gaussian

You may notice that the numerator in the term for the responsibility looks suspiciously like a Gaussian.

$$Numerator: \; exp\Big[-\beta d(m_k, x^{(n)})\Big]$$$$Gaussian \; PDF: \frac{1}{\sqrt{2 \pi \sigma^2}} exp \Big(\frac{1}{2\sigma^2}(x-\mu)^2\Big)$$

We will make this connection soon, when we discuss Gaussian Mixture Models. As you may know, the gaussian contains a variance term, which shows up in the exponent. The variance term controls how fat or skinny the PDF of the Gaussian is. And so this controls how fat or skinny the influence of each cluster is on each data point.

3.4 Calculating the Mean

In step 2 we had to calculate the mean, but it looked a little different than normal. This equation actually has a name, the weighted arithmetic mean. It is a generalization of the regular arithmetic mean, where you can think of each data point as having the same weight of 1.

$$Regular \; mean: m_k = \frac{1}{N}\sum_n x^{(n)} = \frac{1*x^{(1)} + 1*x^{(2)} + ...}{1 + 1 + ...}$$$$Weighted \; mean: m_k = \frac{\sum_n r_k^{(n)}x^{(n)}}{\sum_n r_k^{(n)}} = \frac{r_k^{(1)}x^{(1)} + r_k^{(2)}x^{(2)} + ...}{r_k^{(1)} + r_k^{(2)} + ...}$$

This also makes sense in terms of clustering, since, for example, if a data point is far away from the cluster center, then the corresponding responsibility will be close to 0, and therefore each data point shouldn't have a big influence on calculating the cluster center.

3.5 Purpose of Soft K-Means

The purpose of Soft K-Means is that it allows us to quantify our confidence in the cluster assignment. Look at the situation below:

We can see that this data point is still really in the middle of the two clusters and shouldn't really be assigned to the cluster on the right. It is still mainly in the center. However, this is exactly what hard k-means would do. It would say that this point belongs to the cluster on the right, and treat it the same way as all of the other points which are closer, and maybe more definitively belong to the cluster on the right.

Soft K-means allows us to represent the intuitive understanding that we have of the point belonging not fully to either class, with a number.

"Test point belongs to yellow with 51% probability, but may still belong to purple with 49% probability"



4. The K-Means Objective Function

As was the case with supervised learning, it is very important to talk about the objective functions that we are trying to maximize. Assuming that we are using euclidean distance as our distance measure, our objective function is:

$$J = \sum_n \sum_k r_k^{(n)} ||m_k - x^{(n)}||^2$$

In english this means we will sum over all data points $n$, and then sum over all clusters $k$, the distance between each cluster and each data point, weighted by the responsibilities. So, really this is just the squared distance weighted by the responsiblities.

So, if $x_n$ is far away from the mean of cluster k, hopefully that responsibility has been set very low. In deep learning we will use gradient descent, but we do not use that here! In this case we actually do what is called coordinate descent This means that we are moving in the direction of a smaller J, with respect to only 1 variable at a time. We can see that this is true because we only update 1 variable at a time, either $r_k^{(n)}$, or $m_k$. There is a mathematical guarantee that each iteration will result in the objective function decreasing, and thus it will always converge. However, there is no guarantee that it will converge to a global minimum.



5. K-Means In Code

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def d(u, v):
    diff = u - v
    return diff.dot(diff)
  
def cost(X, R, M):
    cost = 0
    for k in range(len(M)):
        diff = X - M[k]
        sq_distances = (diff * diff).sum(axis=1)
        cost += (R[:,k] * sq_distances).sum()
    return cost

def plot_k_means(X, K, max_iter=20, beta=1.0, show_plots=True):
    N, D = X.shape
    M = np.zeros((K, D))
    # R = np.zeros((N, K))
    exponents = np.empty((N, K))

    # initialize M to random
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # step 1: determine assignments / resposibilities
        # is this inefficient?
        for k in range(K):
            for n in range(N):
                exponents[n,k] = np.exp(-beta*d(M[k], X[n]))

        R = exponents / exponents.sum(axis=1, keepdims=True)

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break

    if show_plots:
        fig, ax = plt.subplots(figsize=(12,8))
        plt.plot(costs)
        plt.title("Costs")
        plt.show()

        random_colors = np.random.random((K, 3))
        colors = R.dot(random_colors)
        fig, ax = plt.subplots(figsize=(12,8))
        plt.scatter(X[:,0], X[:,1], c=colors)
        plt.show()

    return M, R


def get_simple_data():
    # assume 3 means
    D = 2 # so we can visualize it more easily
    s = 4 # separation so we can control how far apart the means are
    mu1 = np.array([0, 0])
    mu2 = np.array([s, s])
    mu3 = np.array([0, s])

    N = 900 # number of samples
    X = np.zeros((N, D))
    X[:300, :] = np.random.randn(300, D) + mu1
    X[300:600, :] = np.random.randn(300, D) + mu2
    X[600:, :] = np.random.randn(300, D) + mu3
    return X

def main():
    X = get_simple_data()

    # what does it look like without clustering?
    fig, ax = plt.subplots(figsize=(12,8))
    plt.scatter(X[:,0], X[:,1])
    plt.show()

    K = 3 # luckily, we already know this
    plot_k_means(X, K)

    K = 5 # what happens if we choose a "bad" K?
    plot_k_means(X, K, max_iter=30)

    K = 5 # what happens if we change beta?
    plot_k_means(X, K, max_iter=30, beta=0.3)

if __name__ == '__main__':
    main()
In [2]:
import numpy as np
import matplotlib.pyplot as plt

def d(u, v):
    diff = u - v
    return diff.dot(diff)

def cost(X, R, M):
    cost = 0
    for k in range(len(M)):
        for n in range(len(X)):
            cost += R[n,k]*d(M[k], X[n])
    return cost

def plot_k_means(X, K, max_iter=20, beta=1.0):
    N, D = X.shape
    M = np.zeros((K, D))
    R = np.ones((N, K)) / K

    # initialize M to random
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    grid_width = 5
    grid_height = max_iter / grid_width
    random_colors = np.random.random((K, 3))
    plt.figure()

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # moved the plot inside the for loop
        colors = R.dot(random_colors)
        fig, ax = plt.subplots(figsize=(12,8))
        plt.scatter(X[:,0], X[:,1], c=colors)

        # step 1: determine assignments / resposibilities
        # is this inefficient?
        for k in range(K):
            for n in range(N):
                R[n,k] = np.exp(-beta*d(M[k], X[n])) / np.sum( np.exp(-beta*d(M[j], X[n])) for j in range(K) )

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break
    plt.show()

def main():
    # assume 3 means
    D = 2 # so we can visualize it more easily
    s = 4 # separation so we can control how far apart the means are
    mu1 = np.array([0, 0])
    mu2 = np.array([s, s])
    mu3 = np.array([0, s])

    N = 900 # number of samples
    X = np.zeros((N, D))
    X[:300, :] = np.random.randn(300, D) + mu1
    X[300:600, :] = np.random.randn(300, D) + mu2
    X[600:, :] = np.random.randn(300, D) + mu3

    # what does it look like without clustering?
    fig, ax = plt.subplots(figsize=(12,8))
    plt.scatter(X[:,0], X[:,1])
    plt.show()

    K = 3 # luckily, we already know this
    plot_k_means(X, K)

if __name__ == '__main__':
    main()
<matplotlib.figure.Figure at 0x111803ef0>


6. K-Means Clustering Visualization



7. When can K-Means Fail?

In [3]:
def plot_k_means(X, K, max_iter=20, beta=1.0, show_plots=True):
    N, D = X.shape
    M = np.zeros((K, D))
    # R = np.zeros((N, K))
    exponents = np.empty((N, K))

    # initialize M to random
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # step 1: determine assignments / resposibilities
        # is this inefficient?
        for k in range(K):
            for n in range(N):
                exponents[n,k] = np.exp(-beta*d(M[k], X[n]))

        R = exponents / exponents.sum(axis=1, keepdims=True)

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break

    if show_plots:
        fig, ax = plt.subplots(figsize=(12,8))
        plt.plot(costs)
        plt.title("Costs")
        plt.show()

        random_colors = np.random.random((K, 3))
        colors = R.dot(random_colors)
        fig, ax = plt.subplots(figsize=(12,8))
        plt.scatter(X[:,0], X[:,1], c=colors)
        plt.show()

    return M, R

def donut():
    N = 1000
    D = 2

    R_inner = 5
    R_outer = 10

    # distance from origin is radius + random normal
    # angle theta is uniformly distributed between (0, 2pi)
    R1 = np.random.randn(N//2) + R_inner
    theta = 2*np.pi*np.random.random(N//2)
    X_inner = np.concatenate([[R1 * np.cos(theta)], [R1 * np.sin(theta)]]).T

    R2 = np.random.randn(N//2) + R_outer
    theta = 2*np.pi*np.random.random(N//2)
    X_outer = np.concatenate([[R2 * np.cos(theta)], [R2 * np.sin(theta)]]).T

    X = np.concatenate([ X_inner, X_outer ])
    return X

def main():
    # donut
    X = donut()
    plot_k_means(X, 2)

    # elongated clusters
    X = np.zeros((1000, 2))
    X[:500,:] = np.random.multivariate_normal([0, 0], [[1, 0], [0, 20]], 500)
    X[500:,:] = np.random.multivariate_normal([5, 0], [[1, 0], [0, 20]], 500)
    plot_k_means(X, 2)

    # different density
    X = np.zeros((1000, 2))
    X[:950,:] = np.array([0,0]) + np.random.randn(950, 2)
    X[950:,:] = np.array([3,0]) + np.random.randn(50, 2)
    plot_k_means(X, 2)

if __name__ == '__main__':
    main()


8. Disadvantages of K-Means Clustering

There are several main disadvantages when it comes K-means clustering.


1) You have to choose K
The first issue is that you have to K. In 2-D or 3-D data we can look at the data to help us choose. But what if our data was 100-D?


2) Local Minima
The second disadvantage is that k-means only converges to a local minimum. This is the same thing with deep learning, but it is a bad thing in the case of k-means. One way to remedy this is to restart k-means multiple times and chose the clustering that gives the best value for the objective.


3) Sensitive to Initial Configuration
Another disadvantage is that k-means is very sensitive to initial configuration.


4) K-Means cannot solve donut problem
K-means is not able to solve the donut problem, or even ellipses. It can only look for spherical clusters, because it is only taking into account squared distance.


5) Doesn't take density into account



9. How to Evaluate a Cluster?

We are now going to discuss why the cost function that we have been using is limited, and what some common alternatives are. We can begin by touching on the pros:


Pros

  • It decreases on every round
  • It also makes perfect sense - we want our data points to be close to the cluster center that they belong to. Therefore the squared distance should be low when the responsibility is high.
  • Another way of saying this is that we want low intra-cluster distances
  • And we want high inter-cluster distances

However, there are some pretty heavy drawbacks to this method.
Cons

  • If you have a really large data set, you may run into some issues. Because the cost is just the sum of squared distances, it grows with the size of the dataset.
  • It is also sensitive to the scale of the data. So, if your data is in the range of 0-1, then all of the squared distances will be less than 1. Data in the range of 10e6, the squared distances will be 10e12! So, the cost we get isn't a universal number like accuracy, where is would be the same no matter what data we are using; it is sensitive to the data itself.
  • Finally, it is sensitive to the number of clusters $K$. If we were to set $K$ = N, then cost would be 0!


To Summarize the Cons

  • The cost increases with the size of the data, both in number of samples and in number of features. This is easy to adjust for, since we can just divide by $N$ and $D$ to account for that.
  • The scale of the data effects the cost. We can adjust for that as well, since we can adjust all of the data before doing k-means clustering on it. For instance, you could just divide the data by the global mean and standard deviation first.
  • Disadvantage 3 is that the cost is sensitive to the number of clusters $K$. This is not very easy to adjust for. Remember, overfitting is a concept that is useful in supervised learning. This is where our model predicts too closely to the training targets and does not generalize well to the test targets. Since this is unsupervised learning, there are no targets and hence this problem remains.


9.1 Purity

One interesting way to evaluate a clustering is called a purity.

$$Purity = \frac{1}{N} \sum_{k=1}^K max_{j=1..K}|c_k \cap t_j|$$

First we divide by $N$, which balances for the number of samples that we have. Next, what is $c$ and what is $t$? $c_k$ stands for the cluster center indexed by $k$. In total we have $K$ number of clusters. $t_j$ stands for the targets in class $j$. We search for the max over $j$, which means that we find the target class that this cluster most likely belongs to, because they have the biggest intersection of points. Of course we will need to modify this a bit, since cluster membership given a data point isn't exact for soft k-means; it is rather probabilistic.

For an example, suppose we are looking at the MNIST data set, where each of the classes is a digit between 0 and 9. We have 10 cluster centers but we have no idea what they mean. To determine what they mean, we look at each of the target classes. If we find that the best intersection of data points, is with data representing the digit 5, then that means that this cluster probably represents the digit 5. Because of this, the best purity we can get is 1. That means that for each cluster, the points that are assigned to it all correspond to the same true label.

However, this exposes 1 big disadvantage of the purity measure. It requires targets, which is generally just the case for supervised learning. And if you have targets, why not just use supervised learning? Many other measures also require targets, such as the Rand Measure, F-measure, Jaccard Index, Normalized Mutual Information. These methods that require the use of a true label are called external validation methods.


9.2 Davies-Bouldin Index

In a more realistic scenario, when we are doing unsupervised learning we most likely do not have labels. However, we still want to be able to test whether or not this is a good clustering. There are internal validation methods that do not require the use of a true label. One example is the Davies-Bouldin Index.

$$DBI = \frac{1}{K} \sum_{k=1}^K max_{j \neq k} \Big[\frac{\sigma_k + \sigma_j}{d(c_k, c_j)}\Big]$$

It has similarities in appearance the equation for purity we just discussed. Given $k$, we take the maximum over $j$ of some measure. In this equation, the $\sigma_k$ represents the average distance from each data point in a cluster to it's center. You can see why $\sigma$ is appropriate, since it is kind of like the standard deviation. $\sigma_j$ is the same thing except for cluster $j$. Note that because we are using soft k-means we will need to calculate the sigmas correctly using the responsibilities, which are probabilistic. Next, $d(c_k, c_j)$ represents the distance between the cluster center $k$ and the cluster center $j$. Ideally, we want the numerator to be small and the denominator to be large. This is because we want everything within a cluster to be closer together, and we want each cluster to be far apart from other clusters. This, the lower the DBI the better!



10. K-Means on MNIST Dataset

We are now going to perform K-means clustering on the MNIST dataset. The data can be found at: https://www.kaggle.com/c/digit-recognizer. Each image is a $D = (28 x 28)$ matrix of pixel values, but we will flatten it into a 784 dimensional vector.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from kmeans import plot_k_means, get_simple_data
from datetime import datetime

# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")
In [5]:
"""Function to get and transform our data"""
def get_data(limit=None):
  print('Reading in data...')
  df = pd.read_csv('../../../data/MNIST/train.csv')
  print('Transforming the data...')
  data = df.as_matrix()
  np.random.shuffle(data)
  X = data[:, 1:] / 255.0
  Y = data[:, 0]
  if limit is not None:
    X, Y = X[:limit], Y[:limit]
  return X, Y

"""
Purity Function: maximum purity is 1, higher is better. In english, what we are doing 
is starting at cluster 1, and then we have an inner loop that will go through all of the 
target labels. We then determine for cluster 1, which of the labels is most likely associated
with it. We do this because for each data point, we know the probability it belongs to cluster
1, and we also have the true label of each data point. We can then look at all of the data
points that have a certain label, and then determine the corresponding probability that those
points are also part of cluster 1. Whichever label yields the highest total probability is the
label we associate with that cluster.
"""
def purity(Y, R):
  N, K = R.shape                          # R is cluster assignments, Y is labels
  p = 0
  for k in range(K):                      # Looping k through all of the clusters
    best_target = -1 
    max_intersection = 0
    for j in range(K):                    # Looping j through all of the target labels
      intersection = R[Y==j, k].sum() 
      if intersection > max_intersection:
        max_intersection = intersection
        best_target = j
    p += max_intersection
  return p / N

"""
DBI Function: Here lower is better. We need all data points X, and the cluster means M,
and the responsibilities R. 
"""
def DBI(X, M, R):
  K, D = M.shape                 # K-means, each dimension has a mean 
  sigma = np.zeros(K)
  for k in range(K):             # Start by calculating sigmas. Average distance between 
    diffs = X - M[k]             # all data points in cluster K and the center
    squared_distances = (diffs * diffs).sum(axis=1)
    weighted_squared_distances = R[:, k] * squared_distances
    sigma[k] = np.sqrt( weighted_squared_distances.sum() / R[:,k].sum())
    
    # Calculate Davies-Bouldin Index
    dbi = 0
    for k in range(K):
        max_ratio = 0
        for j in range(K):
            if k != j:
                numerator = sigma[k] + sigma[j]
                denominator = np.linalg.norm(M[k] - M[j])
                ratio = numerator / denominator
                if ratio > max_ratio:
                    max_ratio = ratio
        dbi += max_ratio
    return dbi / K
  
def main():
  X, Y = get_data(10000)
  
  print("Number of data points:", len(Y))
  M, R = plot_k_means(X, len(set(Y)))
  # Exercise: Try different values of K and compare the evaluation metrics
  print("Purity:", purity(Y, R))
  print("DBI:", DBI(X, M, R))
  
  # plot the mean images
  # they should look like digits
  for k in range(len(M)):
    im = M[k].reshape(28, 28)
    plt.imshow(im, cmap='gray')
    fig, ax = plt.subplots(figsize=(12,8))
    plt.show()

if __name__ == "__main__":
  main()
Reading in data...
Transforming the data...
Number of data points: 10000
Purity: 0.5856819734814049
DBI: 1.1873093601254083


11. Choosing K

You may have been wondering throughout this section about what the best way to chose $K$ is, i.e. the number of clusters. In the past we have talked about the process of cross validation, however, that is a little bit more difficult in this case because we do not have a notion of accuracy (unsupervised learning), and we don't typically use train and test sets. Although, that may be an interesting idea to look into.

11.1 Cost decreases and K increases

One thing that you will notice is that your cost is always going to decrease as K increases. Remember, it is the sum of squared error within a cluster. That means that the closer all of those points are to the center of the clusters, the lower the cost. But what happens is that as you increase the number of clusters, you always make it easier for the points to be closer to the center of a cluster.

However, something interesting happens as you go from K = 1..N. You get this sort of hockey stick shape:

This shows that at some point, increasing $K$ only gives it marginal improvements. Thus it is at that point that the data fits to the clusters nicely. So, in the image above we can see that $K=6$ is that natural number of clusters for the data.

You should keep in mind that your plot of Cost vs K will produce this shape every time.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
from kmeans import plot_k_means, get_simple_data, cost

def main():
  X = get_simple_data()
  fig, ax = plt.subplots(figsize=(12,8))
  plt.scatter(X[:,0], X[:,1])
  plt.show()

  costs = np.empty(10)
  costs[0] = None
  for k in range(1, 10):
    M, R = plot_k_means(X, k, show_plots=False)
    c = cost(X, R, M)
    costs[k] = c
    
  fig, ax = plt.subplots(figsize=(12,8))
  plt.plot(costs)
  plt.title("Cost vs K")
  plt.show()

if __name__ == '__main__':
  main()


12. K-Means Application: Related Words

In [3]:
import networkx as nx
import nltk
import numpy as np
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, LocallyLinearEmbedding as LLE
from sklearn.feature_extraction.text import TfidfTransformer

wordnet_lemmatizer = WordNetLemmatizer()

titles = [line.rstrip() for line in open('../../data/nlp/all_book_titles.txt')]

# copy tokenizer from sentiment example
stopwords = set(w.rstrip() for w in open('../../data/nlp/stopwords.txt'))
# add more stopwords specific to this problem
stopwords = stopwords.union({
    'introduction', 'edition', 'series', 'application',
    'approach', 'card', 'access', 'package', 'plus', 'etext',
    'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
    'third', 'second', 'fourth', })
def my_tokenizer(s):
    s = s.lower() # downcase
    tokens = nltk.tokenize.word_tokenize(s) # split string into words (tokens)
    tokens = [t for t in tokens if len(t) > 2] # remove short words, they're probably not useful
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # put words into base form
    tokens = [t for t in tokens if t not in stopwords] # remove stopwords
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)] # remove any digits, i.e. "3rd edition"
    return tokens


# create a word-to-index map so that we can create our word-frequency vectors later
# let's also save the tokenized versions so we don't have to tokenize again later
word_index_map = {}
current_index = 0
all_tokens = []
all_titles = []
index_word_map = []
print("num titles:", len(titles))
print("first title:", titles[0])
for title in titles:
    try:
        title = title.encode('ascii', 'ignore') # this will throw exception if bad characters
        title = title.decode('utf-8')
        all_titles.append(title)
        tokens = my_tokenizer(title)
        all_tokens.append(tokens)
        for token in tokens:
            if token not in word_index_map:
                word_index_map[token] = current_index
                current_index += 1
                index_word_map.append(token)
    except Exception as e:
        print(e)



# now let's create our input matrices - just indicator variables for this example - works better than proportions
def tokens_to_vector(tokens):
    x = np.zeros(len(word_index_map))
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    return x

N = len(all_tokens)
D = len(word_index_map)
X = np.zeros((D, N)) # terms will go along rows, documents along columns
i = 0
for tokens in all_tokens:
    X[:,i] = tokens_to_vector(tokens)
    i += 1

def d(u, v):
    diff = u - v
    return diff.dot(diff)

def cost(X, R, M):
    cost = 0
    for k in range(len(M)):
        # method 1
        # for n in range(len(X)):
        #     cost += R[n,k]*d(M[k], X[n])

        # method 2
        diff = X - M[k]
        sq_distances = (diff * diff).sum(axis=1)
        cost += (R[:,k] * sq_distances).sum()
    return cost

def plot_k_means(X, K, index_word_map, max_iter=20, beta=1.0, show_plots=True):
    N, D = X.shape
    M = np.zeros((K, D))
    R = np.zeros((N, K))
    exponents = np.empty((N, K))

    # initialize M to random
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # step 1: determine assignments / resposibilities
        # is this inefficient?
        for k in range(K):
            for n in range(N):
                # R[n,k] = np.exp(-beta*d(M[k], X[n])) / np.sum( np.exp(-beta*d(M[j], X[n])) for j in range(K) )
                exponents[n,k] = np.exp(-beta*d(M[k], X[n]))

        R = exponents / exponents.sum(axis=1, keepdims=True)

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 10e-5:
                break

    if show_plots:
        # plt.plot(costs)
        # plt.title("Costs")
        # plt.show()

        random_colors = np.random.random((K, 3))
        colors = R.dot(random_colors)
        plt.figure(figsize=(80.0, 80.0))
        plt.scatter(X[:,0], X[:,1], s=300, alpha=0.9, c=colors)
        annotate1(X, index_word_map)
        # plt.show()
        plt.savefig("test.png")


    # print out the clusters
    hard_responsibilities = np.argmax(R, axis=1) # is an N-size array of cluster identities
    # let's "reverse" the order so it's cluster identity -> word index
    cluster2word = {}
    for i in range(len(hard_responsibilities)):
      word = index_word_map[i]
      cluster = hard_responsibilities[i]
      if cluster not in cluster2word:
        cluster2word[cluster] = []
      cluster2word[cluster].append(word)

    # print out the words grouped by cluster
    for cluster, wordlist in cluster2word.items():
      print("cluster", cluster, "->", wordlist)

    return M, R

def annotate1(X, index_word_map, eps=0.1):
  N, D = X.shape
  placed = np.empty((N, D))
  for i in range(N):
    x, y = X[i]

    # if x, y is too close to something already plotted, move it
    close = []

    x, y = X[i]
    for retry in range(3):
      for j in range(i):
        diff = np.array([x, y]) - placed[j]

        # if something is close, append it to the close list
        if diff.dot(diff) < eps:
          close.append(placed[j])

      if close:
        # then the close list is not empty
        x += (np.random.randn() + 0.5) * (1 if np.random.rand() < 0.5 else -1)
        y += (np.random.randn() + 0.5) * (1 if np.random.rand() < 0.5 else -1)
        close = [] # so we can start again with an empty list
      else:
        # nothing close, let's break
        break

    placed[i] = (x, y)

    plt.annotate(
      s=index_word_map[i],
      xy=(X[i,0], X[i,1]),
      xytext=(x, y),
      arrowprops={
        'arrowstyle' : '->',
        'color' : 'black',
      }
    )

print("vocab size:", current_index)

transformer = TfidfTransformer()
X = transformer.fit_transform(X).toarray()

reducer = TSNE()
Z = reducer.fit_transform(X)
plot_k_means(Z[:,:2], current_index//10, index_word_map, show_plots=True)
num titles: 2373
first title: Philosophy of Sex and Love A Reader
vocab size: 2070
cluster 108 -> ['philosophy', 'reading', 'context', 'magic', 'witchcraft', 'criminological', 'formation', 'near', 'prophet', 'caliphate', 'wondering', 'consequence']
cluster 110 -> ['sex', 'murray', 'flechtner', 'sale', 'lease', 'national', 'transaction', 'prostitution', 'pornography', 'industry', 'addiction']
cluster 187 -> ['love', 'mar', 'venus', 'secret', 'improving', 'lasting', 'intimacy', 'fulfillment', 'giving', 'receiving', 'passion', 'martian', 'cultivating', 'act', 'afro-surinamese']
cluster 22 -> ['reader', 'testament', 'theorizing', 'conscious', 'voice', 'sailing', 'wine-dark', 'rethinking', 'modernity']
cluster 152 -> ['judaism', 'christianity', 'woman', 'religious', 'feminism', 'interpretation', 'document', 'eye', 'dubois', 'ellen', 'carol', 'torah', 'gentile']
cluster 9 -> ['islam', 'journey', 'mesoamerica', 'cosmic', 'frontier', 'workbook', 'theological', 'six-volume', 'set', 'learn', 'biblical', 'pack']
cluster 31 -> ['microprocessor', 'motorola', 'family', 'interfacing', 'hardware', 'mechanic', 'static', 'software', 'microcomputer', 'quantum', 'assembly', 'high-performance', 'accessible']
cluster 176 -> ['principle', 'anatomy', 'physiology', 'human', 'seeley', 'vander', 'mydevelopmentlab', 'caribbean', 'tortora', 'animal']
cluster 182 -> ['bernhard', 'edouard', 'fernow', 'story', 'jew', 'muslim', 'monotheistic', 'augustine', 'theologian', 'defense', 'hardcover', 'philip', 'heller']
cluster 170 -> ['north', 'everyday', 'economy', 'visualizing', 'bates', 'taking', 'intensity', 'modulated', 'radiation', 'burton', 'oncology', 'rhoades']
cluster 0 -> ['american', 'tradition', 'art', 'history', 'source', 'western', 'volume', 'transformation', 'dramaturgy', 'myartslab', 'janson', 'look', 'worldwide', 'liberty']
cluster 163 -> ['forestry', 'nature', 'personality', 'domain', 'knowledge', 'selfless', 'consciousness', 'nirvana', 'agricultural', 'woodland', 'owner', 'desiring', 'a-z', 'positive', 'impact', 'enemy', 'themselves']
cluster 131 -> ['encyclopedia', 'applied', 'expanded', 'eastern', 'immigrant', 'refugee', 'skills-based', 'zen', 'industrial/organizational']
cluster 183 -> ['buddhism', 'experience', 'phenomenon', 'introducing', 'tool', 'buddhist', 'creative', 'production', 'pre-islamic', 'fractal', 'selforganization']
cluster 10 -> ['programming', 'analog', 'cmos', 'integrated', 'circuit', 'electric', 'microelectronics', 'claim', 'ground', 'masteringengineering', 'matlab', 'microelectric']
cluster 73 -> ['anthem', 'era', 'theravada', 'benares', 'colombo', 'fox', 'mcdonald', 'speakable', 'unspeakable', 'mktg', 'transition']
cluster 63 -> ['student', 'chemical', 'engineering', 'geographic', 'resource', 'thermodynamics', 'dvd', 'thermal-fluid', 'thermostatistics', 'reactivity', 'appendix']
cluster 49 -> ['modern', 'classical', 'ghost', 'roman', 'sourcebook', 'tragedy', 'stage', 'essay', 'translation', 'play', 'authoritative', 'accompanied']
cluster 150 -> ['read', 'professor', 'lively', 'entertaining', 'line', 'expanding', 'discourse', 'transnational', 'literacy', 'ideology', 'strike', 'circular', 'migration', 'flapper', 'madcap', 'style', 'celebrity']
cluster 205 -> ['literature', 'forensics', 'criminalistics', 'ecology', 'anthology', 'core', 'element', 'cyber', 'mynursingkit', 'sun', 'norton', 'eighth', 'vols', 'feminism-art-theory', 'hard', 'disk', 'ec-council']
cluster 84 -> ['communication', 'administrative', 'research', 'method', 'criminology', 'criminal', 'justice', 'skill-building', 'qualtrics', 'dilemma', 'text/reader']
cluster 160 -> ['understanding', 'mader', 'mirror', 'humanity', 'comprehensive', 'billing', 'coding', 'reimbursement', 'encoderpro', 'com', 'demo']
cluster 64 -> ['world', 'religion', 'illustrated', 'review', 'scripture', 'public', 'you', 'texas', 'eight', 'myplate', 'masteringnutrition', 'lippincott', 'organizational']
cluster 89 -> ['africa', 'communications-', 'infrastructure', 'defeat', 'bad', 'news', 'rwanda', 'musinga', 'african', 'reversing', 'sail', 'africana']
cluster 166 -> ['america', 'microbiology', 'narrative', 'prescott', 'laboratory', 'ninth', 'mind', 'serology', 'one-volume', 'seagull']
cluster 26 -> ['hinduism', 'sacred', 'swaminarayan', 'belief', 'headline', 'confucianism', 'shintoism', 'taoism', 'intersection', 'sky', 'water', 'stream', 'running', 'shinto']
cluster 124 -> ['china', 'asian', 'east', 'head', 'dragon', 'serpent', 'tail', 'ming', 'campaign', 'commander', 'scholar', 'mongolia', 'calligraphy']
cluster 190 -> ['wisdom', 'popular', 'camphor', 'flame', 'india', 'beat', 'archetype', 'tribute', 'spanning', 'continent']
cluster 200 -> ['text', 'global', 'age', 'analytical', 'graduate', 'gene', 'protein', 'gardner', 'spacetime', 'calculus-based', 'photon']
cluster 86 -> ['soul', 'cinema', 'colonial', 'suspect', 'relation', 'resistance', 'carolina', 'body', 'meditation', 'existence', 'god', 'distinction', 'demonstrated']
cluster 128 -> ['life', 'design', 'sexuality', 'self', 'society', 'changing', 'appreciating', 'diversity', 'looseleaf', 'globalization', 'gendered', 'masteringgeography', 'embracing']
cluster 129 -> ['thriving', 'revolution', 'hole', 'hospitality', 'tourism', 'field', 'black', 'internationalism', "n't"]
cluster 11 -> ['chaos', 'stochastic', 'microelectronic', 'dynamical', 'electrical', 'nonlinear', 'time-series', 'dynamics+chaos', 'friendly', 'silent', 'edward', 'hopper', 'processes`']
cluster 53 -> ['handbook', 'medicine', 'internal', 'hospital', 'pocket', 'massachusetts', 'notebook', 'anesthesia', 'department', 'pain']
cluster 133 -> ['management', 'system', 'database', 'operating', 'material', 'implementation', 'internals', 'warehouse', 'freebsd', 'turfgrass', 'bind-in']
cluster 146 -> ['blood', 'custom', 'born', 'fire', 'german', 'dagger', 'war', 'photographic', 'dlv/nsfk', 'diplomat', 'red', 'cross', 'police', 'rlb', 'teno', 'reichsbahn', 'postal', 'hunting', 'etc']
cluster 68 -> ['relative', 'faith', 'journalist', 'investigates', 'toughest', 'objection', 'permeability', 'petroleum', 'reservoir', 'insect', 'pest', 'tropical', 'started', 'beginning', 'homeschooled', 'self-taught']
cluster 199 -> ['wheelock', 'latin', 'astronomy', 'exploration', 'beginner', 'dictionary', 'founded', 'andrew', 'freud', 'universe', 'masteringastronomy', 'solar', 'self-teaching', 'grammar/north', 'observational', 'eerdmans', 'millenium', 'distributed']
cluster 135 -> ['choice', 'pharmacology', 'adaptation', 'studying', 'nursing', 'lilley', 'private', 'mammalogy', 'assisting', 'occupation', 'patient-centered', 'kee']
cluster 65 -> ['uncertainty', 'luck', 'thrive', 'despite', 'schaum', 'roger', 'tokheim', 'shigley', 'mechanical', 'vibration', 'writing', 'capo', 'finest', 'rock', 'pop', 'jazz', 'country', 'seal']
cluster 113 -> ['relativity', 'atom', 'defined', 'explained', 'einstein', 'gravitation', 'cosmology', 'master', 'derived', 'appraised', 'special', 'physicist', 'gravity', 'deviance']
cluster 179 -> ['pure', 'marine', 'semi-riemannian', 'geometry', 'biodiversity', 'chapman', 'hall/crc', 'pension', 'fund']
cluster 70 -> ['physic', 'engineer', 'scientist', 'chap', 'conceptual', 'extended', 'secularity']
cluster 69 -> ['experiment', 'fluid', 'digital', 'foundation', 'simulation', 'evidence', 'xilinx', 'cpld', 'behavioral', 'vhdl', 'interaction', 'quantitative', 'psy', 'lan', 'wan']
cluster 71 -> ['pathophysiology', 'concept', 'variable', 'practical', 'skeptic', 'single', 'multivariable', 'altered', 'awareness', 'applying', 'porth']
cluster 122 -> ['health', 'physical', 'profession', 'mosby', 'examination', 'seidel', 'helping', 'gould']
cluster 94 -> ['care', 'power', 'electronics', 'geosystems', 'primary', 'provider', 'converter', 'diesel', 'electricity', 'anomaly', 'detection', 'colonize', 'this', 'color']
cluster 138 -> ['professional', 'grammar', 'esl', 'spelling', 'phonics', 'curriculum', 'reversible']
cluster 52 -> ['operation', 'measure', 'efficiency', 'non-linguists', 'primer', 'dependent', 'phylogenomics']
cluster 127 -> ['machine', 'learning', 'perspective', 'wardlaw', 'combo', 'sociological', 'adaptive', 'computation', 'probabilistic', 'ten', 'algorithmic', 'analytics']
cluster 72 -> ['e-commerce', 'ethical', 'business', 'obligation', 'entrepreneurial', 'standard', 'legal', 'environment', 'e-business', 'regulatory', 'setting', 'miller/cross', 'prep-u']
cluster 12 -> ['strategy', 'medical', 'picture', 'basic', 'developing', 'decision-focused', 'administering', 'mark', 'lieberman', 'nontechnical']
cluster 154 -> ['technology', 'international', 'proceeding', 'conference', '...', 'june', 'drag', 'reduction', 'ceas/dragnet', 'european', 'potsdam', 'germany', 'note', 'multidisciplinary', 'patras', 'greece']
cluster 74 -> ['real', 'analysis', 'insurance', 'estate', 'finance', 'investment', 'mcgraw-hill/irwin', 'corporate', 'alternate', 'creation', 'insureance', 'diaspora', 'commercial']
cluster 6 -> ['complex', 'sexual', 'backpack', 'fiction', 'poetry', 'drama', 'carbohydrate', 'biomedical', 'mahayana', 'doctrinal', 'researcher', 'allah', 'outlaw', 'city']
cluster 95 -> ['outline', 'loose-leaf', 'module', 'highlight', 'david', 'griffith', 'isbn', 'myers']
cluster 142 -> ['paperback', 'program', 'fitness', 'career', 'exercising', 'option', 'sukiennik', 'diane', 'author']
cluster 23 -> ['probability', 'random', 'process', 'statistic', 'elementary', 'step', 'formula', 'signal']
cluster 139 -> ['immunology', 'mucosal', 'roitt', 'kuby', 'kindt', 'veterinary']
cluster 27 -> ['security', 'stakeholder', 'homeland', 'all-hazards', 'response', 'achieving', 'sustainability', 'intervention', 'mayhem', 'terrorism', 'population', 'morality']
cluster 17 -> ['sixth', 'goodman', 'gilman', 'pharmacological', 'basis', 'therapeutic', 'oca/ocp', 'oracle', 'all-in-one', 'exam', 'cd-rom', 'pathology', 'expert', 'administration', 'correlation', 'print']
cluster 38 -> ['foreword', 'warren', 'buffett', 'matching', 'supply', 'demand', 'fallen', 'angel', 'reception', 'enochic']
cluster 175 -> ['twelfth', 'sociology', 'environmental', 'behavior', 'psychology', 'inquiry', 'abnormal', 'invitation', 'josephus', 'update', 'deviant', 'gateway', 'dsm-v']
cluster 45 -> ['press', 'moral', 'model', 'tenth', 'logic', 'reasoning', 'cognitive', 'psycholinguistic', 'acl-mit', 'doing', 'wireless', 'modelling']
cluster 99 -> ['clinical', 'lange', 'kumar', 'clark', 'license', 'cram', 'mediterranean', 'land']
cluster 5 -> ['ethic', 'contemporary', 'chemistry', 'organic', 'biochemistry', 'issue', 'bioethics', 'digital+microprocessor', 'argument', 'medicinal', 'ecoimmunology', 'masteringchemistry', 'inorganic', 'level', 'berg']
cluster 106 -> ['decision', 'structured', 'intel', 'wave', 'oscillation', 'prelude', 'informed', 'loss']
cluster 7 -> ['seventh', 'law', 'employment', 'casebook', 'aspen', 'constitutional', 'food', 'krause', 'secured', 'credit', 'regulation']
cluster 159 -> ['science', 'earth', 'political', 'mypoliscilab', 'masteringgeology']
cluster 32 -> ['understand', 'linguistics', 'teach', 'yourself', 'education', 'language', 'file', 'structural', 're-entry', 'aiaa', 'monitoring', 'blackwell', 'intake', 'sign']
cluster 98 -> ['harrison', 'self-assessment', 'time', 'nvestigative', 'reporter', 'protocol', 'action', 'hilbert', 'space', 'synchronization']
cluster 91 -> ['board', 'performance', 'record', 'computerized', 'dance', 'installation', 'kernel', 'geoffrey', 'holder', 'uc/os-iii', 'scalable', 'romable', 'preemptive', 'multitasking', 'dsps', 'included', 'discover']
cluster 194 -> ['strategic', 'marketing', 'masterplan', 'managing', 'profitable', 'customer-based', 'preface', 'service', 'channel', 'embedded', 'microcontrollers', 'arm', 'cortex']
cluster 177 -> ['starting', 'java', 'solving', 'object-oriented', 'adts', 'object', 'myprogramminglab', 'control', 'compatible', 'absraction']
cluster 21 -> ['book', 'biological', 'electronic', 'specie', 'commerce', 'spectrum', 'portable', 'carte', 'ala', 'coloring']
cluster 184 -> ['tintinalli', 'emergency', 'regional', 'united', 'promise', 'trouble', 'subversion', 'subregions']
cluster 29 -> ['manual', 'solution', 'consult', 'chapter', 'washington', 'stewart', 'mcmurry', 'devore', 'michigan', 'allergy', 'asthma', 'subspecialty']
cluster 90 -> ['pharmacotherapy', 'pathophysiologic', 'morgan', 'kaufmann', 'hardware/software', 'interface', 'fifth', 'antiquity', 'nurse', 'talon', 'eagle', 'understandable']
cluster 19 -> ['computer', 'organization', 'biology', 'network', 'molecular', 'top', 'bscs', 'architecture', 'myreligionlab', 'campbell', 'masteringbiology', 'cell', 'user', 'security+', 'cybersecurity', 'inside']
cluster 3 -> ['risk', 'calculus', 'reinforced', 'concrete', 'networking', 'top-down', 'e-marketing', 'bible', 'transcendentals', 'thomas']
cluster 137 -> ['plant', 'stern', 'introductory', 'computational', 'lecture-tutorials', 'lift', 'matlab/octave', 'raven', 'bed', 'shadow']
cluster 125 -> ['economics', 'mcgraw-hill', 'disability', 'masteringmicrobiology', 'disease', 'taxonomy', 'adult', 'acsms', 'person', 'chronic']
cluster 126 -> ['spiral', 'psychological', 'publication', 'archaeology', 'down-to-earth', 'mysoclab', 'birth', 'bce', 'association', 'developmentally', 'appropriate', 'childhood', 'serving']
cluster 97 -> ['computing', 'natural', 'advanced', 'processing', 'neuro-fuzzy', 'soft', 'intelligence', 'python', 'eaia', 'artificial', 'guarda', 'portugal', 'october', 'miguel', 'filgueiras', 'portal', 'faro']
cluster 156 -> ['planning', 'educational', 'ecological', 'myeducationlab', 'census', 'pragmatic']
cluster 13 -> ['information', 'driven', 'recent', 'advance', 'juvenile', 'delinquency', 'applicatioins', 'nldb', 'alicante', 'spain', 'klagenfurt', 'austria', 'proceed', 'nlp', 'fintal', 'turku', 'finland']
cluster 111 -> ['film', 'watching', 'advertising', 'promotion', 'criticism', 'flashback', 'implementing', 'evaluating', 'meggs', 'graphic', 'key', 'soap', 'cigarette', 'aspect', 'discussion', 'selection']
cluster 151 -> ['culture', 'government', 'sketch', 'role', 'economic', 'development', 'institutional', 'worker', 'genetics', 'growth']
cluster 35 -> ['evolution', 'challenge', 'competition', 'intermediate', 'annual', 'report', 'true', 'chinese', 'prehistory', 'trip', 'princeton', 'sense']
cluster 78 -> ['study', 'aerodynamics', 'companion', 'low-speed', 'helicopter', 'geha', 'ideal-fluid', 'getting', 'compressible', 'wonder']
cluster 60 -> ['gender', 'reconstructing', 'experiencing', 'race', 'class', 'convergence', 'ethnicity', 'multiple', 'identity', 'counseling', 'merrill', 'arab', 'syrian', 'ethnogenesis']
cluster 33 -> ['integrative', 'plus/learnsmart', 'semester', 'online', 'learnsmart', 'card/apr', 'access/phils', 'mycrimekit', 'greek-english', 'lexicon']
cluster 140 -> ['geography', 'region', 'dental', 'hygienist', 'atlas', 'histology', 'correlated', 'realm', 'netter', 'embryology', 'throughput', 'sequencing', 'acc']
cluster 80 -> ['theater', 'classroom', 'culturally', 'linguistically', 'diverse', 'esol', 'teacher', 'ethnically', 'illusive', 'utopia', 'korea', 'theory/text/performance', 'exceptional', 'teaching', 'walking', 'processing-ijcnlp', 'joint', 'jeju', 'island']
cluster 18 -> ['bioinformatics', 'genetic', 'vold', 'theoretical', 'search', 'optimization', 'geneticist', 'biologist']
cluster 161 -> ['geology', 'unfinished', 'nation', 'concise', 'people', 'exploring', 'central', 'microcase', 'project-based', 'boy']
cluster 132 -> ['communicating', 'success', 'path', 'straight', 'tangled', 'bank']
cluster 88 -> ['pharmacy', 'technician', 'environmentalism', 'multiculturalism']
cluster 20 -> ['office', 'patient', 'payment', 'graphical', 'numerical', 'algebraic']
cluster 197 -> ['unity', 'form', 'connect', 'includes', 'apr', 'phils', 'flight', 'aeronautics']
cluster 62 -> ['function', 'force', 'kinematics', 'transcendental', 'disorder', 'driving', 'nanoscience', 'updated', 'immune']
cluster 173 -> ['theory', 'practice', 'cryptography', 'game', 'rodrigues', 'econometrics', 'w/student', 'w/crunchit/eesee']
cluster 155 -> ['differential', 'equation', 'value', 'partial', 'fourier', 'boundary', 'featured', 'title']
cluster 103 -> ['vector', 'microsoft', 'excel', 'support', 'regularization', 'kernel-based', 'classify', 'vba', 'modeler']
cluster 162 -> ['dynamic', 'sustainable', 'industrial', 'allelopathy', 'springer', 'plnlp', 'agroecology', 'noncooperative', 'actuary', 'contribution', 'contingency', 'chester', 'wallace', 'jordan']
cluster 107 -> ['modeling', 'speed', 'signaling', 'jitter', 'budgeting', 'fluid-phase', 'equilibrium', 'short', 'photography', 'interpretive', 'consider', 'darkroom', 'preparation', 'image', 'archive', 'europe']
cluster 59 -> ['statistical', 'testing', 'test', 'measurement', 'inference', 'spatial', 'mental', 'diagnostic']
cluster 54 -> ['technique', 'nutritional', 'assessment', 'forensic', 'investigative', 'victim', 'offender', 'treatment', 'scientific', 'trend']
cluster 105 -> ['nutrition', 'lesikar', 'connecting', 'landscape', 'activity', 'mapping', 'performed', 'pedagogy', 'cycle', 'controversy']
cluster 167 -> ['matter', 'mysearchlab', 'politics', 'empirical', 'coursereader', 'passionate']
cluster 43 -> ['change', 'greek', 'ancient', 'athenaze', 'conformity', 'conflict', 'myanthrolab', 'particle', 'retrieving']
cluster 41 -> ['leader', 'networked', 'parallel', 'workstation', 'variation', 'theme', 'boundary-value', 'available', 'youbook', 'briefer', 'cellular', 'abbas']
cluster 174 -> ['building', 'consumer', 'perl', 'mysql', 'nonviolence', 'peace', 'individual', 'ecosystem']
cluster 44 -> ['critical', 'linear', 'algebra', 'technical', 'twenty-first', 'century', 'mycommunicationkit', 'avant-garde', 'star', 'galaxy', 'multi-variable', 'nineteenth', 'ecocriticisms', 'kuwait', 'paradox']
cluster 185 -> ['skill', 'personal', 'est', 'esta', 'focus', 'help', 'develop', 'successful', 'genderspeak', 'effectiveness']
cluster 202 -> ['investigation', 'insight', 'procedure', 'wsj', 'insert', 'homicide', 'tactic']
cluster 4 -> ['urban', 'designing', 'wildlife', 'forest', 'community', 'northeast', 'mcknight', 'consensus', 'civic', 'participation', 'architect', 'planner', 'designer', 'local', 'habitat', 'greenspaces', 'conservation']
cluster 168 -> ['broadcasting', 'cable', 'beyond', 'medium', 'operational', 'amplifier', 'neural', 'entrepreneurship', 'turing', 'limit']
cluster 119 -> ['internet', 'university', 'framework', 'english', 'preparing', 'educator', 'engage', 'placement', 'radin', 'rothchild', 'reese', 'silverman', 'emerging']
cluster 195 -> ['pathway', 'heat', 'transfer', 'ee', 'james', 'kurose', 'keith', 'ross', 'biochemical', 'oca', 'administrator', 'certified', 'associate', 'thermal', 'neutron', 'scattering']
cluster 40 -> ['college', 'discrete', 'hcs', 'oxford', 'proboscidea', 'palaeoecology', 'elephant', 'vocabulary', 'mathematics', 'actuarial', 'barnett', 'finite', 'undergraduate', 'regression', 'cdrom', 'contingent']
cluster 192 -> ['living', 'connection', 'real-world', 'crime', 'scene', 'lab', 'binder', 'ready', 'planet', 'blue', 'how-to', 'connectionist']
cluster 149 -> ['survey', 'appreciative', 'view', 'micro', 'christian', 'islamic', 'real-time', 'practitioner', 'crusade']
cluster 87 -> ['labor', 'responsibility', 'idea', 'speaking', 'reason', 'seduced', 'elite', 'exploit']
cluster 55 -> ['managerial', 'accounting', 'cost', 'balanced', 'overview']
cluster 56 -> ['algorithm', 'data', 'mining', 'structure', 'c++', 'using', 'stl', 'abstraction', 'pseudocode', 'sa']
cluster 102 -> ['vertebrate', 'comparative', 'manager', 'deconstruction', 'towards', 'semiotics', 'solvency']
cluster 180 -> ['music', 'appreciation', 'cd', 'upgrade', 'eleventh', 'musical', 'listening', 'broadway', 'recording', 'shorter', 'accompany', 'score', 'enjoyment', 'perceptive', 'reel', 'showtime']
cluster 116 -> ['semiconductor', 'device', 'solid', 'pushing', 'electron', 'thales', 'aristotle', 'total', 'artwork', 'expressionism']
cluster 57 -> ['social', 'policy', 'combined', 'civilization', 'eighteenth', 'framing', 'brooks/cole', 'empowerment', 'welfare', 'hate']
cluster 50 -> ['version', 'loose', 'leaf', 'ansi', 'conventional', 'current']
cluster 203 -> ['financial', 'money', 'market', 'banking', 'bond']
cluster 96 -> ['audio', 'code', 'mystatlab', 'terminology', 'programmed', 'simplified', 'termplus']
cluster 2 -> ['benson', 'microbiological', 'complete', 'brown', 'microbioligical', 'entrepreneur', 'guidebook', 'william', 'stalling', 'quest', 'truth', 'exile', 'stranger', 'application-oriented', 'lawrence']
cluster 16 -> ['math', 'theology', 'systematic', 'enhanced', 'rang', 'dale', 'hybrid', 'webassign', 'multi', 'term', 'constructing', 'irregular', 'bamboo', 'minjung']
cluster 24 -> ['exercise', 'sport', 'biologic', 'child', 'toolkit', 'safety', 'w/web', 'recreation', 'young', 'acsm', 'frontal']
cluster 28 -> ['active', 'lifestyle', 'wellness', 'jewishness', 'critique', 'zionism', 'direction', 'illuminated']
cluster 114 -> ['mean', 'doe', 'center', 'hold', 'analyzing', 'threat', 'vulnerability', 'countermeasure', 'reality', 'myth', 'discriminate', 'debate', 'activism']
cluster 191 -> ['statement', 'valuation', 'project', 'imagination', 'fixed', 'income', 'longitudinal', 'wiley', 'takaful', 'damodaran']
cluster 51 -> ['auditing', 'investigating', 'fraud', 'decision-making', 'integrity', 'protecting', 'accessibility']
cluster 112 -> ['construction', 'difference', 'inequality', 'meaning', 'orientation', 'prism', 'content', 'illness', 'compiler']
cluster 171 -> ['institution', 'stock', 'trak', 'coupon', 'abridged', 'stock-trak']
cluster 198 -> ['cultural', 'anthropology', 'prentice', 'hall', 'linguistic', 'condition', 'globalizing', 'textbook', 'cengage', 'advantage', 'workbook/reader', 'essence', 'problem-based', 'battery', 'guyton']
cluster 158 -> ['topic', 'dna', 'typing', 'rise', 'fall', 'athens', 'nine', 'desire', 'hidden', 'parable', 'enoch', 'jesus', 'impossible', 'queer', 'wood']
cluster 14 -> ['revised', 'christ', 'messiah', 'paul', 'precalculus', 'stewart/redlin/watson', 'catechism', 'catholic', 'church', 'accordance', 'official', 'promulgated', 'pope', 'john']
cluster 1 -> ['vision', 'practicality', 'multimedia', 'playing', 'writer', 'contract', 'integration', 'rule', 'tabbed', 'intelligent', 'grounding', 'representation', 'statutory', 'cross-border']
cluster 136 -> ['course', 'surgical', 'technologist', 'mahamudra', 'unprecedented', 'training', 'pinnacle']
cluster 186 -> ['speech', 'recognition', 'pattern', 'episode', 'medieval', 'lesson', 'table']
cluster 144 -> ['cognition', 'brain', 'conversation', 'historical', 'indian', 'textual']
cluster 42 -> ['visual', 'agenda', 'alternative', 'epilogue', 'longman', 'classic', 'sight', 'singing']
cluster 47 -> ['pearson', 'monetary', 'multinational', 'sex-based', 'discrimination', 'mymathlab', 'etex', 'variableplus', 'emphasis', 'civil']
cluster 117 -> ['crisis', 'future', 'farm', 'bill', 'hearing', 'redmond', 'oregon', 'committee', 'agriculture', 'senate', 'congress', 'session', 'august', 'consolidation', 'antitrust', 'u.s.', 'mtbe', 'renewable', 'fuel']
cluster 85 -> ['late', 'flow', 'seeing', 'ourselves', 'cross-cultural', 'period', 'cambridge', 'persian', 'roman-rabbinic', 'incompressible', 'aerospace']
cluster 169 -> ['wide', 'web', 'forty', 'changed', 'prediction', 'profiling', 'policing', 'punishing']
cluster 143 -> ['school', 'thomson', 'main', 'street', 'tibetan', 'nyingma', 'translating', 'smart', 'short-term', 'trading', 'compact', 'reacting']
cluster 165 -> ['absolute', 'democracy', 'ultimate', 'thinking', 'liberal', 'representing', 'movie', 'lehninger', 'novel', 'bedford']
cluster 145 -> ['direct', 'interactive', 'hundred', 'albert', 'including', 'ancillaries']
cluster 193 -> ['prolog', 'programmer', 'genome', 'west', 'jewish', 'psychotherapy', 'mathematical', 'error', 'correction', 'forgotten', 'atlantic', 'string', 'tree', 'sequence', 'implication', 'fundamentalism']
cluster 79 -> ['mastering', 'differentiated', 'instruction', 'cld', 'myeducationkit', 'cancer', 'mechanism', 'target', 'radiographic', 'positioning', 'related', 'monstrous-feminine', 'psychoanalysis']
cluster 123 -> ['origin', 'cult', 'territory', 'city-state']
cluster 115 -> ['virtue', 'instinct', 'cooperation', 'civilisation', 'agro-industries', 'stochastics', 'francaise', 'avant', 'republique', 'hemisphere', 'irresistible', 'shift']
cluster 157 -> ['anne', 'orthwood', 'bastard', 'virginia', 'carlos', 'aldama', 'bat', 'cuba', 'drum']
cluster 76 -> ['south-east', 'south', 'esoteric', 'tantra', 'asia', 'oriental', 'ceramic', 'thai', 'vietnamese', 'khmer', 'collection', 'gallery', 'australia', 'adelaide']
cluster 189 -> ['white', 'ware', 'found', 'philippine', 'masteringa', 'cat', 'benjamin', 'cummings', 'dissection']
cluster 34 -> ['deuteronomy', 'judaean', 'paper', 'italian', 'renaissance', 'cover', 'brock', 'microorganism', 'maya', 'guatemalan', 'root']
cluster 66 -> ['canon', 'criterion', 'father', 'maus', 'survivor', 'tale', 'bleeds']
cluster 92 -> ['evaluation', 'reliability', 'queueing', 'markov', 'chain', 'infotrac', 'microeconomic', 'extension']
cluster 178 -> ['propulsion', 'jet', 'simple', 'thermodynamic', 'engine', 'pipeline', 'chip', 'multiprocessor', 'bluebook', 'uniform', 'citation', 'plain', 'turbine', 'axial-flow', 'radial-flow']
cluster 75 -> ['question', 'mythinkinglab', 'major', 'example', 'tort', 'explanation', 'twenty', 'frequently', 'qualifying']
cluster 77 -> ['travesti', 'brazilian', 'transgendered', 'prostitute', 'miller', 'freund']
cluster 36 -> ['arabic', 'sound', 'nelson', 'japanese-english', 'character', 'based', 'alif', 'baa', 'al-kitaab']
cluster 104 -> ['ashe', 'traditional', 'healing', 'sub-saharan', 'classified', 'bibliography', 'index', 'afro-american', "'the", 'schooling', 'scandal', 'diva', 'filipino', 'gay']
cluster 8 -> ['experimental', 'distribution', 'abundance', 'piety', 'revival', 'feminist', 'subject']
cluster 172 -> ['mymicrobiologyplace', 'website', 'williams', 'diet', 'therapy', 'lpn', 'thread', 'premium', 'studyware']
cluster 15 -> ['picturing', 'empire', 'home', 'repatriation', 'reintegration', 'postwar', 'harvard', 'monograph', 'heaven', 'integrating', 'healthcare', 'letter', 'chittagong', 'couple', 'offline']
cluster 153 -> ['nuclear', 'pet/ct', 'hip', 'hop', 'youth']
cluster 134 -> ['sea', 'mediating', 'divine', 'prophecy', 'revelation', 'dead', 'scroll', 'temple', 'bridge', 'symposium', 'copenhagen', 'denmark', 'february']
cluster 81 -> ['functional', 'discovering', 'genomics', 'proteomics', 'ibm', 'spss', 'tip', 'jury', 'trial', 'green', 'synthesis']
cluster 101 -> ['base', 'aerodynamic', 'centrifugal', 'compressor', 'transport', 'aircraft']
cluster 83 -> ['japan', 'hokkeji', 'reemergence', 'female', 'monastic', 'premodern']
cluster 30 -> ['harmonic', 'real-variable', 'orthogonality', 'oscillatory', 'integral']
cluster 82 -> ['naval', 'rivalry', 'stability', 'road', 'pearl', 'harbor', 'vehicle', 'sae', 'lone', 'actor', 'diction', 'singer']
cluster 37 -> ['annotated', 'mona', 'lisa', 'crash', 'prehistoric', 'post-modern']
cluster 39 -> ['voicemale', 'husband', 'marriage', 'wife', 'housework', 'commitment', 'mutual', 'causality', 'dharma']
cluster 147 -> ['reference', 'rotating', 'frame', 'relativistic']
cluster 46 -> ['vanishing', 'status', 'wild', 'orang-utans', 'close', 'twentieth']
cluster 120 -> ['deity', 'bris', 'sku']
cluster 48 -> ['zurich', 'painting', 'icon', 'worthwhile', 'mthon', 'donldan']
cluster 204 -> ['discovered', 'anti-biblical', 'racism', 'self-worship', 'superstition', 'deceit', 'multicultural', 'sexism']
cluster 181 -> ['ebook', 'coursemate', 'geol', 'govt', 'gov', 'supplement', 'wandering', 'galilean', 'honour', 'sean', 'freyne', 'journal', 'rabbinic']
cluster 164 -> ['window', 'linux', 'mta', 'skin', 'talking']
cluster 25 -> ['manner', 'photograph', 'tina', 'barney']
Out[3]:
(array([[ 26.36498155,   4.89641519],
        [-52.42814473,  -7.25990397],
        [ 35.12224299, -64.62202135],
        [  5.60781073, -16.78100796],
        [-45.72940377, -17.60788714],
        [  3.2169369 ,  -8.89136642],
        [ 33.45413317, -51.39143181],
        [-45.77903701,   0.19884997],
        [ 81.27050669, -18.97641918],
        [ 27.26749797,  -8.51629535],
        [-23.30585184, -16.99723892],
        [-34.43359727, -12.87912757],
        [  5.45367999,   9.97273351],
        [-36.1073178 ,  35.87253999],
        [ 66.59485572,  37.35395159],
        [ 14.95646877,  62.82407201],
        [-56.42395129,  31.42205361],
        [-64.56967521, -14.57725304],
        [ 18.44887152, -24.80075469],
        [  1.47189539,  -0.12104911],
        [ 27.66443101, -29.81955387],
        [ -1.58457358, -14.57185786],
        [ 34.84888003,  11.98307032],
        [  0.55244558,  19.05671188],
        [-20.93033505,  19.10931209],
        [-49.39774513,  63.864151  ],
        [ 28.51609906,  27.18042266],
        [-18.81607974,  -3.0029328 ],
        [ 64.57752596,  12.81318803],
        [-37.80755942,  14.94668204],
        [-51.71486359, -77.10896606],
        [-17.95691824, -11.32823552],
        [-24.36650742,  29.30338792],
        [ 32.28359412, -38.9292315 ],
        [ 16.81211025, -81.54520555],
        [-10.059302  , -38.75900433],
        [-66.42748006,  46.64468765],
        [ 25.85248566,  89.73182678],
        [-72.21563797, -35.50197296],
        [ 52.54010857,  66.66598553],
        [ 14.80647759, -49.05779736],
        [ 37.44880972,  36.20559504],
        [ 40.92830251, -20.41704556],
        [-18.11726034, -36.41797878],
        [ 50.00268002,  -0.81377906],
        [ 10.41073099, -22.13246387],
        [-49.43111738,  76.89012655],
        [-62.80411301,   1.63375244],
        [  2.27479405,  89.23298508],
        [-13.06565068, -29.4074416 ],
        [-12.17294247, -47.00548598],
        [  9.2254964 ,  26.70107535],
        [ 17.22197332, -19.79293382],
        [ 16.0354826 ,  41.54667442],
        [ 14.41286172,  32.25673113],
        [ 32.37101402, -13.95033835],
        [  1.39845313, -30.41585381],
        [ 18.32766101,   4.0345973 ],
        [-66.42748006,  46.64468765],
        [ 17.90058409,  27.40757085],
        [-22.24106357, -45.05923865],
        [ 25.85248566,  89.73182678],
        [  7.63143612,  17.79894296],
        [-34.61884233,   7.2259128 ],
        [ 10.24412889,   5.03297903],
        [-45.45824495, -39.82909132],
        [-84.40082659, -31.07643536],
        [-45.72940377, -17.60788714],
        [ 78.39785242,   2.12143259],
        [-26.19870169,  -7.2435405 ],
        [ -5.44043554,  17.73531168],
        [-13.03538452,  14.1985609 ],
        [ -4.31735366,  -6.3626585 ],
        [-31.88241386, -55.91385997],
        [ -0.46489847, -21.75578593],
        [-45.78435605,  11.13188465],
        [-50.25132888,  43.48043796],
        [ 40.31313971, -76.54754967],
        [-16.71676831,  28.11525875],
        [ 67.24007093, -35.87699347],
        [-13.34687576,  51.13161262],
        [-12.44013202,  66.30704603],
        [-23.84092553,  56.97940763],
        [ 22.2835465 ,  68.26535676],
        [ -2.18996382,   6.45763572],
        [ 37.04106002, -29.66304136],
        [ -2.79919041,  56.44934416],
        [  3.93849323,  24.91203184],
        [ 56.98964574, -41.30495347],
        [ -3.1731945 , -71.80237153],
        [  6.18687063, -37.05250379],
        [-56.71486529, -28.08943222],
        [ 55.10728444, -20.0314026 ],
        [-13.06565068, -29.4074416 ],
        [-27.91584108,  15.4429681 ],
        [ -9.05768516,  30.54241624],
        [ 18.54066443,  20.6173223 ],
        [-23.64836939,  41.86916174],
        [ 38.48149872,  22.28789712],
        [-15.16886956, -63.54353857],
        [ 14.95646877,  62.82407201],
        [-34.55443382,  53.23878034],
        [ 27.64856602,  20.18199834],
        [ 13.87594368, -32.70165244],
        [ -6.95226743, -82.7274995 ],
        [-39.12699904, -22.85271051],
        [  6.94747798, -27.14449805],
        [ 16.11278141, -66.38522768],
        [-18.91901487, -24.33842513],
        [  2.27479405,  89.23298508],
        [-24.49803313, -68.92147996],
        [  2.48738519,  45.28018191],
        [ -7.45162804,  10.77726171],
        [  7.64364359, -57.42885098],
        [ 26.96338845, -20.81207851],
        [-31.6130689 ,  68.60627524],
        [-65.9849041 ,  14.48251076],
        [-33.77273179, -33.46658471],
        [-51.71486359, -77.10896606],
        [-53.99931625,   8.71550084],
        [  1.35471955,  88.2725597 ],
        [-26.19870169,  -7.2435405 ],
        [  3.38348961,  34.05560594],
        [ 63.64202595,  24.39393806],
        [ 28.59370954,  61.35217278],
        [-18.80213035,  11.18522891],
        [-74.69106821,  24.10657215],
        [ 21.56523887, -31.81482884],
        [ -8.52251138, -13.24407494],
        [ -8.2063035 , -24.70539606],
        [-45.72940377, -17.60788714],
        [ 36.11878461,  -8.87100129],
        [ 28.37063975,  36.62048597],
        [-11.35029727,  -5.15931341],
        [-25.18917025,  83.22421382],
        [ -2.05653292,  27.24349513],
        [-42.55961554,  55.25476619],
        [-15.34463156,  36.96639617],
        [ 13.27571063,  18.97757368],
        [ 11.74532542,  12.70184374],
        [-34.18831938,  25.02373283],
        [-26.19870169,  -7.2435405 ],
        [ 37.91187731,   1.9446228 ],
        [ -1.71716582,  74.09058189],
        [ 34.21898073, -19.54962986],
        [-25.86579723, -32.16314145],
        [ 82.52305924,  22.48905714],
        [ 82.15961075,  30.42002153],
        [-23.84092553,  56.97940763],
        [ 19.83425936, -13.43072056],
        [-43.74942892, -61.01399096],
        [ 47.61290811, -14.95496268],
        [ 30.8753219 ,   0.4958894 ],
        [ 43.91457978,  18.37194993],
        [-33.2051084 ,  44.94086522],
        [ 41.81199031,  -6.43680863],
        [  8.64742149,  39.4836193 ],
        [-27.08299298, -83.31162008],
        [ 49.12613433, -48.94621109],
        [ 24.04797224,  15.51845035],
        [ -3.68280826, -55.52453005],
        [ 19.14686415,  10.89657657],
        [-35.89840267,  -2.63128308],
        [-46.97617923,  21.87084658],
        [-56.32549973,  54.8863739 ],
        [ 25.86064873,  49.25400124],
        [ 15.44600336,  -2.78630977],
        [ 27.76871329,  12.79284622],
        [-28.59950598, -23.28678462],
        [ 62.07903337, -58.20426655],
        [ -5.97553235,  40.05109191],
        [ 36.94074911,  50.2387972 ],
        [ 52.71481578,  27.7779026 ],
        [  0.18296424,  12.6068063 ],
        [ 48.72465273,  36.41751934],
        [  8.73210854,  -2.54090523],
        [-11.35239165, -18.51601926],
        [ -3.34105675, -31.75357   ],
        [-61.91177546, -50.18482335],
        [ 45.66944081, -34.50032028],
        [ 46.67464161,   9.78729924],
        [ 10.96830076,  52.93253078],
        [ 68.19725446, -18.64388304],
        [ 21.32192882,  -5.39414761],
        [-30.97241085, -43.34752536],
        [ 56.70444748,  16.4083609 ],
        [ 51.07900728, -27.67473192],
        [ 52.96570663,  51.45273285],
        [ 64.57752596,  12.81318803],
        [-15.5750732 , -53.33304511],
        [ 11.5062541 ,  76.69516525],
        [ 17.95834938, -40.23245342],
        [ -6.78956802,   1.2884748 ],
        [  0.24163358, -44.67721122],
        [-14.43359256,   3.50490242],
        [-80.09506273,  -4.14618273],
        [-11.35029727,  -5.15931341],
        [ 26.55155587, -42.25387116],
        [-25.37580428,   4.18140082],
        [ 61.35447231,  -4.73341929],
        [-11.41324506,  22.22359463],
        [-84.40082659, -31.07643536],
        [ 21.43333927,  38.19808958],
        [ 39.86371899,  47.40925972],
        [ 34.76490283,  74.94012737],
        [ 12.34966244, -11.83564745],
        [ 30.8753219 ,   0.4958894 ]]),
 array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
         0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
         0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
         0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
        ...,
        [0.00000000e+000, 1.86311936e-192, 0.00000000e+000, ...,
         0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 1.07814844e-188, 0.00000000e+000, ...,
         0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 3.27741140e-182, 0.00000000e+000, ...,
         0.00000000e+000, 0.00000000e+000, 0.00000000e+000]]))
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

© 2018 Nathaniel Dake