We know that we are going to be trying to perform K-Means Clustering on data. So let's take a moment to visualize some data that we may get.
We can see that the points are all the same color. This is because we are doing unsupervised learning: no class labels are given to these points. Each point is just a vector, and that is all we know about it. However, our human pattern recognition abilities allow us to see 3 distinct groups very clearly; we don't need the data set to tell us that these groups are distinct.
What you may have noticed after looking at the data above, however, is that it represents a very specific situation. The first limitation is that the data is 2-dimensional. That was necessary for visualization: with a 100-dimensional data set there would be nothing to look at, since we can only perceive up to 3 spatial dimensions. This is problematic because most real world data sets are not 2-dimensional, and hence cannot be visualized directly, which means your own pattern recognition skills are of no use. It would be nice to have an algorithm that finds clusters or groups of data no matter what the dimensionality of the data is.
Here is another issue. In the original data, it was generated so that it would show 3 distinct clusters very clearly. Real data isn't so nice. We will want to be able to know:
Is the cluster we found good or not?
So, as a first step into understanding k-means clustering, let's consider 2 fundamental truths about the clusters in a data set.
Fundamental Truth 1
Suppose we know that the yellow, purple, and green points are the centers of some clusters, and we would like to know which center this new blue point belongs to. The point of course belongs to the cluster center that it is closest to. So, the decision rule for choosing which cluster a new point belongs to is: pick the nearest cluster center.
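This decision rule is a one-liner in NumPy. Here is a minimal sketch, using made-up center coordinates and a made-up new point:

```python
import numpy as np

# Hypothetical cluster centers (yellow, purple, green) and a new point
centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]])
new_point = np.array([0.5, 3.5])

# Squared Euclidean distance from the new point to each center
distances = ((centers - new_point) ** 2).sum(axis=1)
assignment = np.argmin(distances)  # index of the nearest center (here: the green one)
```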
Fundamental Truth 2
Now, let's consider the second fundamental fact about clusters. Let's say that all of the points we see below belong to the same cluster. We can refer to the data points as $x_1, x_2, \dots, x_C$.
What is the center of this cluster? It is just the mean of all these data points, also referred to as the centroid if you are thinking geometrically. The way to find the mean of a set of vectors is to add them up and divide by the number of vectors:
$$m = \frac{1}{C}\sum_{i=1}^Cx_i$$

Believe it or not, the two facts above are all you need to implement k-means clustering. It turns out that if we initialize the cluster centers randomly and then repeat these two steps over and over, we will converge to an answer.
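In NumPy, the centroid of a cluster stored as rows of a matrix is just a mean over axis 0. A tiny sketch with made-up points:

```python
import numpy as np

# A made-up cluster of C = 4 two-dimensional points, one per row
cluster = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 1.0], [2.0, 3.0]])

# The centroid is the mean of the points: m = (1/C) * sum_i x_i
m = cluster.mean(axis=0)  # -> [2.0, 2.0]
```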
Pseudocode
Initialize: pick K random points to be the cluster centers
While not converged:
Assign each point to the nearest cluster center
Recalculate each cluster center from points that belong to it
The input to K-Means is a matrix $X$ of size $N \times D$. This means that we are dealing with $N$ samples, each of which has $D$ features.
There are two main steps to the k-means clustering training algorithm.
- First we choose $k$ different cluster centers. We generally just assign these to random points in the data set.
- Then we go into our main loop. The first step is to decide which cluster each point in $X$ belongs to: we look at every sample and choose the closest cluster center (which, to begin with, is usually just a random point from the data). The second step is to recalculate each cluster center based on the points assigned to it, by taking the mean of those points. We repeat until the algorithm converges. Generally this happens very fast, in about 5 to 15 steps, which is very different from gradient descent in deep learning, where we may have thousands of iterations before convergence.
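The two steps above can be sketched in a few lines of NumPy. This is a bare-bones illustration of hard k-means (fixed iteration count, no convergence check), not the full implementation shown later:

```python
import numpy as np

def hard_kmeans(X, K, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random points from X to be the cluster centers
    M = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        labels = dists.argmin(axis=1)
        # Recalculate each center as the mean of the points assigned to it
        for k in range(K):
            if np.any(labels == k):
                M[k] = X[labels == k].mean(axis=0)
    return M, labels
```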
One problem that we find when we do K-Means is that it is highly sensitive to its initialization. A possible resolution is to restart K-Means several times, and use whichever result gives us the best final objective. What does this tell us though? Well, it tells us that the cost function is susceptible to local minima.
One way to overcome this is to have fuzzy membership in each cluster. Each data point no longer belongs to one cluster or another; instead, there is an "amount" of membership. For example, a point may be 60% part of cluster 1 and 40% part of cluster 2. This gives us soft k-means, which requires only a small adjustment to the regular k-means algorithm.
Pseudocode
Initialize m1...mk = random points in X
While not converged:
Step 1: Calculate cluster responsibilities
$$r_k^{(n)} = \frac{exp\Big[-\beta d(m_k, x^{(n)})\Big]}{\sum_j exp \Big[-\beta d(m_j, x^{(n)})\Big]}$$
Step 2: Recalculate means
$$m_k = \frac{\sum_n r_k^{(n)}x^{(n)}}{\sum_n r_k^{(n)}}$$

We can see that step 1 is where the biggest change occurs. We are now defining a term $r_k^{(n)}$, the responsibility, which is always a fraction between 0 and 1. Hard k-means, aka regular k-means, is just the special case where $r_k^{(n)}$ is exactly 0 or 1. In step 2 we are calculating a weighted mean: the higher $r_k^{(n)}$ is, the more $x^{(n)}$ matters to cluster $k$, meaning it has more influence on the calculation of that cluster's mean.
If you are familiar with deep learning, you may recognize the calculation of $r_k^{(n)}$ to be something similar to that of the softmax function.
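Since the responsibilities are just a softmax over negative scaled distances, step 1 vectorizes nicely. A small numerical sketch with made-up centers and points; note that the point exactly halfway between the two centers gets responsibility 0.5 for each:

```python
import numpy as np

def responsibilities(X, M, beta=1.0):
    # Squared distances from every point to every center: shape (N, K)
    dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    exponents = np.exp(-beta * dists)
    # Normalize each row so it sums to 1: a softmax over -beta * d
    return exponents / exponents.sum(axis=1, keepdims=True)

M = np.array([[0.0, 0.0], [4.0, 0.0]])
X = np.array([[0.0, 0.0],   # sits on top of center 0
              [2.0, 0.0]])  # exactly halfway between the centers
R = responsibilities(X, M)
# R[0] is heavily weighted toward cluster 0; R[1] is [0.5, 0.5]
```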
Let's take a minute to breakdown these two steps of soft k-means in order to understand them better. If we look at the equation for the responsibility:
$$r_k^{(n)} = \frac{\exp\Big[-\beta d(m_k, x^{(n)})\Big]}{\sum_{j=1}^K \exp \Big[-\beta d(m_j, x^{(n)})\Big]}$$

We can see that it depends on the distance between the point and each cluster center. Why does this make sense? Well, if a data point is very close to one cluster center and very far away from the others, its responsibility for that nearest cluster will be close to 1: the probability that the point belongs to the cluster whose center is closest will be the highest.
However, now let's say a data point is exactly halfway between two clusters. Then each responsibility should be 0.5. This makes sense because we are equally confident that the data point could belong to either of the two clusters.
So, this method allows us to quantify how confident we are in the cluster assignments, rather than simply assigning a data point to whichever cluster it is closest to.
You may notice that the numerator in the term for the responsibility looks suspiciously like a Gaussian.
$$\text{Numerator}: \; \exp\Big[-\beta d(m_k, x^{(n)})\Big]$$

$$\text{Gaussian PDF}: \; \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big)$$

We will make this connection precise soon, when we discuss Gaussian Mixture Models. As you may know, the Gaussian contains a variance term, which shows up in the exponent. The variance controls how fat or skinny the PDF of the Gaussian is, and $\beta$ plays the analogous role here: it controls how fat or skinny the influence of each cluster is on each data point.
In step 2 we had to calculate the mean, but it looked a little different than normal. This equation actually has a name, the weighted arithmetic mean. It is a generalization of the regular arithmetic mean, where you can think of each data point as having the same weight of 1.
$$\text{Regular mean}: \; m_k = \frac{1}{N}\sum_n x^{(n)} = \frac{1 \cdot x^{(1)} + 1 \cdot x^{(2)} + \dots}{1 + 1 + \dots}$$

$$\text{Weighted mean}: \; m_k = \frac{\sum_n r_k^{(n)}x^{(n)}}{\sum_n r_k^{(n)}} = \frac{r_k^{(1)}x^{(1)} + r_k^{(2)}x^{(2)} + \dots}{r_k^{(1)} + r_k^{(2)} + \dots}$$

This also makes sense in terms of clustering: if a data point is far away from the cluster center, the corresponding responsibility will be close to 0, and therefore that data point won't have a big influence on the calculation of the cluster center.
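NumPy exposes the weighted arithmetic mean directly as `np.average` with a `weights` argument. A quick sketch with made-up responsibilities for one cluster, where the far-away point barely shifts the center:

```python
import numpy as np

# Made-up points and responsibilities for one cluster k
x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
r = np.array([0.9, 0.9, 0.01])  # the far-away point barely counts

# Weighted mean: sum of r * x divided by sum of r
m_k = (r[:, None] * x).sum(axis=0) / r.sum()
# Equivalent one-liner: np.average(x, axis=0, weights=r)
```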
The purpose of Soft K-Means is that it allows us to quantify our confidence in the cluster assignment. Look at the situation below:
We can see that this data point really sits in the middle of the two clusters and shouldn't be assigned wholesale to the cluster on the right. However, that is exactly what hard k-means would do: it would say that this point belongs to the cluster on the right, and treat it the same way as all of the other points which are closer and more definitively belong to that cluster.
Soft K-means allows us to represent the intuitive understanding that we have of the point belonging not fully to either class, with a number.
"Test point belongs to yellow with 51% probability, but may still belong to purple with 49% probability"
As was the case with supervised learning, it is very important to talk about the objective function that we are trying to minimize. Assuming that we are using squared Euclidean distance as our distance measure, our objective function is:
$$J = \sum_n \sum_k r_k^{(n)} ||m_k - x^{(n)}||^2$$

In English: we sum over all data points $n$ and all clusters $k$ the squared distance between each cluster center and each data point, weighted by the responsibilities. So, really this is just the squared distance weighted by the responsibilities.
So, if $x^{(n)}$ is far away from the mean of cluster $k$, hopefully that responsibility has been set very low. In deep learning we would use gradient descent, but we do not use that here! In this case we actually do what is called coordinate descent. This means that we move in the direction of a smaller $J$ with respect to only one variable at a time: we update either $r_k^{(n)}$ or $m_k$, never both at once. There is a mathematical guarantee that each iteration decreases the objective function, and thus the algorithm always converges. However, there is no guarantee that it converges to a global minimum.
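The objective is straightforward to compute directly from its definition, which is handy for checking that each iteration really does decrease it. A minimal sketch:

```python
import numpy as np

def soft_kmeans_cost(X, R, M):
    # J = sum_n sum_k r_k^(n) * ||m_k - x^(n)||^2
    dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
    return (R * dists).sum()

# J is 0 when every point sits exactly on the center it is fully responsible for
X = np.array([[0.0, 0.0], [1.0, 0.0]])
M = np.array([[0.0, 0.0], [1.0, 0.0]])
J = soft_kmeans_cost(X, np.eye(2), M)  # -> 0.0
```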
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def d(u, v):
    diff = u - v
    return diff.dot(diff)

def cost(X, R, M):
    cost = 0
    for k in range(len(M)):
        diff = X - M[k]
        sq_distances = (diff * diff).sum(axis=1)
        cost += (R[:,k] * sq_distances).sum()
    return cost
def plot_k_means(X, K, max_iter=20, beta=1.0, show_plots=True):
    N, D = X.shape
    M = np.zeros((K, D))
    exponents = np.empty((N, K))

    # initialize M to random points in X
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # step 1: determine assignments / responsibilities
        for k in range(K):
            for n in range(N):
                exponents[n,k] = np.exp(-beta*d(M[k], X[n]))
        R = exponents / exponents.sum(axis=1, keepdims=True)

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break

    if show_plots:
        fig, ax = plt.subplots(figsize=(12,8))
        plt.plot(costs)
        plt.title("Costs")
        plt.show()

        random_colors = np.random.random((K, 3))
        colors = R.dot(random_colors)
        fig, ax = plt.subplots(figsize=(12,8))
        plt.scatter(X[:,0], X[:,1], c=colors)
        plt.show()

    return M, R
def get_simple_data():
    # 3 Gaussian clouds in 2-D so we can visualize them easily
    D = 2
    s = 4 # separation so we can control how far apart the means are
    mu1 = np.array([0, 0])
    mu2 = np.array([s, s])
    mu3 = np.array([0, s])

    N = 900 # number of samples
    X = np.zeros((N, D))
    X[:300, :] = np.random.randn(300, D) + mu1
    X[300:600, :] = np.random.randn(300, D) + mu2
    X[600:, :] = np.random.randn(300, D) + mu3
    return X
def main():
    X = get_simple_data()

    # what does it look like without clustering?
    fig, ax = plt.subplots(figsize=(12,8))
    plt.scatter(X[:,0], X[:,1])
    plt.show()

    K = 3 # luckily, we already know this
    plot_k_means(X, K)

    K = 5 # what happens if we choose a "bad" K?
    plot_k_means(X, K, max_iter=30)

    K = 5 # what happens if we change beta?
    plot_k_means(X, K, max_iter=30, beta=0.3)

if __name__ == '__main__':
    main()
import numpy as np
import matplotlib.pyplot as plt
def d(u, v):
    diff = u - v
    return diff.dot(diff)

def cost(X, R, M):
    cost = 0
    for k in range(len(M)):
        for n in range(len(X)):
            cost += R[n,k]*d(M[k], X[n])
    return cost
def plot_k_means(X, K, max_iter=20, beta=1.0):
    N, D = X.shape
    M = np.zeros((K, D))
    R = np.ones((N, K)) / K

    # initialize M to random points in X
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    random_colors = np.random.random((K, 3))
    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # plot the current (soft) assignments at every iteration
        colors = R.dot(random_colors)
        fig, ax = plt.subplots(figsize=(12,8))
        plt.scatter(X[:,0], X[:,1], c=colors)

        # step 1: determine assignments / responsibilities
        for k in range(K):
            for n in range(N):
                R[n,k] = np.exp(-beta*d(M[k], X[n])) / np.sum([np.exp(-beta*d(M[j], X[n])) for j in range(K)])

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break
    plt.show()
def main():
    # 3 Gaussian clouds in 2-D so we can visualize them easily
    D = 2
    s = 4 # separation so we can control how far apart the means are
    mu1 = np.array([0, 0])
    mu2 = np.array([s, s])
    mu3 = np.array([0, s])

    N = 900 # number of samples
    X = np.zeros((N, D))
    X[:300, :] = np.random.randn(300, D) + mu1
    X[300:600, :] = np.random.randn(300, D) + mu2
    X[600:, :] = np.random.randn(300, D) + mu3

    # what does it look like without clustering?
    fig, ax = plt.subplots(figsize=(12,8))
    plt.scatter(X[:,0], X[:,1])
    plt.show()

    K = 3 # luckily, we already know this
    plot_k_means(X, K)

if __name__ == '__main__':
    main()
def plot_k_means(X, K, max_iter=20, beta=1.0, show_plots=True):
    N, D = X.shape
    M = np.zeros((K, D))
    exponents = np.empty((N, K))

    # initialize M to random points in X
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # step 1: determine assignments / responsibilities
        for k in range(K):
            for n in range(N):
                exponents[n,k] = np.exp(-beta*d(M[k], X[n]))
        R = exponents / exponents.sum(axis=1, keepdims=True)

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break

    if show_plots:
        fig, ax = plt.subplots(figsize=(12,8))
        plt.plot(costs)
        plt.title("Costs")
        plt.show()

        random_colors = np.random.random((K, 3))
        colors = R.dot(random_colors)
        fig, ax = plt.subplots(figsize=(12,8))
        plt.scatter(X[:,0], X[:,1], c=colors)
        plt.show()

    return M, R
def donut():
    N = 1000
    D = 2
    R_inner = 5
    R_outer = 10

    # distance from origin is radius + random normal
    # angle theta is uniformly distributed between (0, 2pi)
    R1 = np.random.randn(N//2) + R_inner
    theta = 2*np.pi*np.random.random(N//2)
    X_inner = np.concatenate([[R1 * np.cos(theta)], [R1 * np.sin(theta)]]).T

    R2 = np.random.randn(N//2) + R_outer
    theta = 2*np.pi*np.random.random(N//2)
    X_outer = np.concatenate([[R2 * np.cos(theta)], [R2 * np.sin(theta)]]).T

    X = np.concatenate([ X_inner, X_outer ])
    return X
def main():
    # donut
    X = donut()
    plot_k_means(X, 2)

    # elongated clusters
    X = np.zeros((1000, 2))
    X[:500,:] = np.random.multivariate_normal([0, 0], [[1, 0], [0, 20]], 500)
    X[500:,:] = np.random.multivariate_normal([5, 0], [[1, 0], [0, 20]], 500)
    plot_k_means(X, 2)

    # different density
    X = np.zeros((1000, 2))
    X[:950,:] = np.array([0,0]) + np.random.randn(950, 2)
    X[950:,:] = np.array([3,0]) + np.random.randn(50, 2)
    plot_k_means(X, 2)

if __name__ == '__main__':
    main()
There are several main disadvantages when it comes to K-Means clustering.
1) You have to choose K
The first issue is that you have to choose K. In 2-D or 3-D data we can look at the data to help us choose, but what if our data were 100-D?
2) Local Minima
The second disadvantage is that k-means only converges to a local minimum. The same is true in deep learning, but it is a particularly bad thing in the case of k-means. One way to remedy this is to restart k-means multiple times and choose the clustering that gives the best value of the objective.
3) Sensitive to Initial Configuration
Another disadvantage is that k-means is very sensitive to initial configuration.
4) K-Means cannot solve donut problem
K-means is not able to solve the donut problem, or even elongated (elliptical) clusters. It can only look for spherical clusters, because it only takes squared distance into account.
5) Doesn't take density into account
We are now going to discuss why the cost function that we have been using is limited, and what some common alternatives are. It has some points in its favor, but there are also some pretty heavy drawbacks.
One interesting way to evaluate a clustering is called purity.

$$\text{Purity} = \frac{1}{N} \sum_{k=1}^K \max_{j=1,\dots,K}|c_k \cap t_j|$$

First we divide by $N$, which normalizes for the number of samples that we have. Next, what are $c$ and $t$? $c_k$ is the set of points assigned to cluster $k$ (in total we have $K$ clusters), and $t_j$ is the set of points whose target label is $j$. We take the max over $j$, which means that we find the target class this cluster most likely corresponds to, because they have the biggest intersection of points. Of course we will need to modify this a bit, since cluster membership given a data point isn't exact for soft k-means; it is probabilistic.
For example, suppose we are looking at the MNIST data set, where each of the classes is a digit between 0 and 9. We have 10 cluster centers but no idea what they mean. To determine what they mean, we look at each of the target classes. If the biggest intersection of data points is with data representing the digit 5, then that cluster probably represents the digit 5. The best purity we can get is 1, which means that for each cluster, the points assigned to it all correspond to the same true label.
However, this exposes one big disadvantage of the purity measure: it requires targets, which generally only exist in supervised learning. And if you have targets, why not just use supervised learning? Many other measures also require targets, such as the Rand measure, F-measure, Jaccard index, and normalized mutual information. Methods that require the use of a true label are called external validation methods.
In a more realistic scenario, when we are doing unsupervised learning we most likely do not have labels. However, we still want to be able to test whether or not this is a good clustering. There are internal validation methods that do not require the use of a true label. One example is the Davies-Bouldin Index.
$$DBI = \frac{1}{K} \sum_{k=1}^K \max_{j \neq k} \Big[\frac{\sigma_k + \sigma_j}{d(c_k, c_j)}\Big]$$

It is similar in appearance to the equation for purity we just discussed: given $k$, we take the maximum over $j$ of some measure. Here $\sigma_k$ represents the average distance from each data point in cluster $k$ to its center. You can see why $\sigma$ is an appropriate symbol, since it is kind of like a standard deviation. $\sigma_j$ is the same thing for cluster $j$. Note that because we are using soft k-means, we will need to calculate the sigmas using the responsibilities, which are probabilistic. Next, $d(c_k, c_j)$ is the distance between cluster center $k$ and cluster center $j$. Ideally, we want the numerator to be small and the denominator to be large: everything within a cluster should be close together, and each cluster should be far apart from the others. Thus, the lower the DBI the better!
We are now going to perform K-means clustering on the MNIST dataset. The data can be found at: https://www.kaggle.com/c/digit-recognizer. Each image is a $28 \times 28$ matrix of pixel values, but we will flatten it into a $D = 784$ dimensional vector.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from kmeans import plot_k_means, get_simple_data
from datetime import datetime
# Seaborn Plot Styling
sns.set(style="white", palette="husl")
sns.set_context("poster")
sns.set_style("ticks")
"""Function to get and transform our data"""
def get_data(limit=None):
    print('Reading in data...')
    df = pd.read_csv('../../../data/MNIST/train.csv')
    print('Transforming the data...')
    data = df.values # df.as_matrix() is deprecated in newer pandas
    np.random.shuffle(data)
    X = data[:, 1:] / 255.0 # scale pixel values to [0, 1]
    Y = data[:, 0] # first column is the label
    if limit is not None:
        X, Y = X[:limit], Y[:limit]
    return X, Y
"""
Purity Function: maximum purity is 1, higher is better. In English, what we are doing
is starting at cluster 1, and then we have an inner loop that will go through all of the
target labels. We then determine for cluster 1, which of the labels is most likely associated
with it. We do this because for each data point, we know the probability it belongs to cluster
1, and we also have the true label of each data point. We can then look at all of the data
points that have a certain label, and then determine the corresponding probability that those
points are also part of cluster 1. Whichever label yields the highest total probability is the
label we associate with that cluster.
"""
def purity(Y, R):
    N, K = R.shape # R is the cluster responsibilities, Y is the labels
    p = 0
    for k in range(K): # loop k through all of the clusters
        best_target = -1
        max_intersection = 0
        for j in range(K): # loop j through all of the target labels
            intersection = R[Y==j, k].sum()
            if intersection > max_intersection:
                max_intersection = intersection
                best_target = j
        p += max_intersection
    return p / N
"""
DBI Function: Here lower is better. We need all data points X, and the cluster means M,
and the responsibilities R.
"""
def DBI(X, M, R):
    K, D = M.shape # one mean per cluster, each of dimension D

    # sigma[k] = weighted average distance between the data points and cluster k's center
    sigma = np.zeros(K)
    for k in range(K):
        diffs = X - M[k]
        squared_distances = (diffs * diffs).sum(axis=1)
        weighted_squared_distances = R[:, k] * squared_distances
        sigma[k] = np.sqrt(weighted_squared_distances.sum() / R[:,k].sum())

    # calculate Davies-Bouldin Index
    dbi = 0
    for k in range(K):
        max_ratio = 0
        for j in range(K):
            if k != j:
                numerator = sigma[k] + sigma[j]
                denominator = np.linalg.norm(M[k] - M[j])
                ratio = numerator / denominator
                if ratio > max_ratio:
                    max_ratio = ratio
        dbi += max_ratio
    return dbi / K
def main():
    X, Y = get_data(10000)
    print("Number of data points:", len(Y))
    M, R = plot_k_means(X, len(set(Y)))
    # Exercise: Try different values of K and compare the evaluation metrics
    print("Purity:", purity(Y, R))
    print("DBI:", DBI(X, M, R))

    # plot the mean images - they should look like digits
    for k in range(len(M)):
        fig, ax = plt.subplots(figsize=(12,8))
        im = M[k].reshape(28, 28)
        plt.imshow(im, cmap='gray')
        plt.show()

if __name__ == "__main__":
    main()
You may have been wondering throughout this section what the best way to choose $K$, the number of clusters, is. In the past we have talked about the process of cross validation; however, that is more difficult here because we do not have a notion of accuracy (this is unsupervised learning), and we don't typically use train and test sets, although that may be an interesting idea to look into.
One thing that you will notice is that the cost always decreases as $K$ increases. Remember, it is the sum of squared errors within each cluster: the closer the points are to their cluster centers, the lower the cost. As you increase the number of clusters, you always make it easier for each point to be close to a cluster center.
However, something interesting happens as you go from K = 1..N. You get this sort of hockey stick shape:
This shows that at some point, increasing $K$ gives only marginal improvements. That elbow point is where the data fits the clusters nicely. So, in the image above we can see that $K=6$ is the natural number of clusters for the data.
You should keep in mind that a plot of cost vs. $K$ will generally produce this elbow shape.
import numpy as np
import matplotlib.pyplot as plt
from kmeans import plot_k_means, get_simple_data, cost
def main():
    X = get_simple_data()

    fig, ax = plt.subplots(figsize=(12,8))
    plt.scatter(X[:,0], X[:,1])
    plt.show()

    costs = np.empty(10)
    costs[0] = None # K = 0 is meaningless
    for k in range(1, 10):
        M, R = plot_k_means(X, k, show_plots=False)
        c = cost(X, R, M)
        costs[k] = c

    fig, ax = plt.subplots(figsize=(12,8))
    plt.plot(costs)
    plt.title("Cost vs K")
    plt.show()

if __name__ == '__main__':
    main()
import networkx as nx
import nltk
import numpy as np
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, LocallyLinearEmbedding as LLE
from sklearn.feature_extraction.text import TfidfTransformer
wordnet_lemmatizer = WordNetLemmatizer()
titles = [line.rstrip() for line in open('../../data/nlp/all_book_titles.txt')]
# copy tokenizer from sentiment example
stopwords = set(w.rstrip() for w in open('../../data/nlp/stopwords.txt'))
# add more stopwords specific to this problem
stopwords = stopwords.union({
    'introduction', 'edition', 'series', 'application',
    'approach', 'card', 'access', 'package', 'plus', 'etext',
    'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
    'third', 'second', 'fourth',
})

def my_tokenizer(s):
    s = s.lower() # downcase
    tokens = nltk.tokenize.word_tokenize(s) # split string into words (tokens)
    tokens = [t for t in tokens if len(t) > 2] # remove short words, they're probably not useful
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens] # put words into base form
    tokens = [t for t in tokens if t not in stopwords] # remove stopwords
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)] # remove any digits, i.e. "3rd edition"
    return tokens
# create a word-to-index map so that we can create our word-frequency vectors later
# let's also save the tokenized versions so we don't have to tokenize again later
word_index_map = {}
current_index = 0
all_tokens = []
all_titles = []
index_word_map = []
print("num titles:", len(titles))
print("first title:", titles[0])
for title in titles:
    try:
        title = title.encode('ascii', 'ignore') # drop non-ascii characters
        title = title.decode('utf-8')
        all_titles.append(title)
        tokens = my_tokenizer(title)
        all_tokens.append(tokens)
        for token in tokens:
            if token not in word_index_map:
                word_index_map[token] = current_index
                current_index += 1
                index_word_map.append(token)
    except Exception as e:
        print(e)

# now let's create our input matrices - just indicator variables for this example - works better than proportions
def tokens_to_vector(tokens):
    x = np.zeros(len(word_index_map))
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    return x

N = len(all_tokens)
D = len(word_index_map)
X = np.zeros((D, N)) # terms will go along rows, documents along columns
i = 0
for tokens in all_tokens:
    X[:,i] = tokens_to_vector(tokens)
    i += 1
def d(u, v):
    diff = u - v
    return diff.dot(diff)

def cost(X, R, M):
    cost = 0
    for k in range(len(M)):
        diff = X - M[k]
        sq_distances = (diff * diff).sum(axis=1)
        cost += (R[:,k] * sq_distances).sum()
    return cost
def plot_k_means(X, K, index_word_map, max_iter=20, beta=1.0, show_plots=True):
    N, D = X.shape
    M = np.zeros((K, D))
    exponents = np.empty((N, K))

    # initialize M to random points in X
    for k in range(K):
        M[k] = X[np.random.choice(N)]

    costs = np.zeros(max_iter)
    for i in range(max_iter):
        # step 1: determine assignments / responsibilities
        for k in range(K):
            for n in range(N):
                exponents[n,k] = np.exp(-beta*d(M[k], X[n]))
        R = exponents / exponents.sum(axis=1, keepdims=True)

        # step 2: recalculate means
        for k in range(K):
            M[k] = R[:,k].dot(X) / R[:,k].sum()

        costs[i] = cost(X, R, M)
        if i > 0:
            if np.abs(costs[i] - costs[i-1]) < 1e-5:
                break

    if show_plots:
        random_colors = np.random.random((K, 3))
        colors = R.dot(random_colors)
        plt.figure(figsize=(80.0, 80.0))
        plt.scatter(X[:,0], X[:,1], s=300, alpha=0.9, c=colors)
        annotate1(X, index_word_map)
        plt.savefig("test.png")

    # print out the clusters
    hard_responsibilities = np.argmax(R, axis=1) # an N-size array of cluster identities
    # "reverse" the mapping so it's cluster identity -> list of words
    cluster2word = {}
    for i in range(len(hard_responsibilities)):
        word = index_word_map[i]
        cluster = hard_responsibilities[i]
        if cluster not in cluster2word:
            cluster2word[cluster] = []
        cluster2word[cluster].append(word)

    # print out the words grouped by cluster
    for cluster, wordlist in cluster2word.items():
        print("cluster", cluster, "->", wordlist)

    return M, R
def annotate1(X, index_word_map, eps=0.1):
    N, D = X.shape
    placed = np.empty((N, D))
    for i in range(N):
        x, y = X[i]

        # if (x, y) is too close to something already plotted, move it
        close = []
        for retry in range(3):
            for j in range(i):
                diff = np.array([x, y]) - placed[j]
                # if something is close, append it to the close list
                if diff.dot(diff) < eps:
                    close.append(placed[j])
            if close:
                # the close list is not empty -> nudge the label randomly
                x += (np.random.randn() + 0.5) * (1 if np.random.rand() < 0.5 else -1)
                y += (np.random.randn() + 0.5) * (1 if np.random.rand() < 0.5 else -1)
                close = [] # start again with an empty list
            else:
                # nothing close, we can stop retrying
                break
        placed[i] = (x, y)
        plt.annotate(
            index_word_map[i], # pass the text positionally: the 's' keyword was renamed in matplotlib 3.3
            xy=(X[i,0], X[i,1]),
            xytext=(x, y),
            arrowprops={
                'arrowstyle' : '->',
                'color' : 'black',
            }
        )
print("vocab size:", current_index)
transformer = TfidfTransformer()
X = transformer.fit_transform(X).toarray()
reducer = TSNE()
Z = reducer.fit_transform(X)
plot_k_means(Z[:,:2], current_index//10, index_word_map, show_plots=True)