Nathaniel Dake Blog

2. Natural Language Corpus Data

This post was inspired by and based on chapter fourteen of the book Beautiful Data, written by Peter Norvig. The chapter can be found here. The exercise will examine data consisting of the plainest of speech: 1 trillion words of English, taken from publicly available webpages.

This data set was published by Thorsten Brants and Alex Franz of Google in 2006, and is made publicly available through the Linguistic Data Consortium here.

This data set summarizes the original texts by counting the number of appearances of each word, and of each two-, three-, four-, and five-word sequence. For example, the word "the" appears 23 billion times (2.2% of the trillion words), making it the most common word.


Technical Definitions

Before we can dig into an analysis, we must learn a few pieces of technical terminology that will serve us well in the long run.

Corpus: A collection of text is called a corpus.

Tokens: We treat the corpus as a sequence of tokens, meaning words and punctuation.

Types: Each distinct token is called a type, so the text "Run, Lola Run" has four tokens (the comma counts as one) but only three types.

Vocabulary: The set of all types is called the vocabulary.

Unigram: A 1-token sequence is a unigram.

Bigram: A 2-token sequence is a bigram.

n-gram: An n-token sequence is an n-gram.

Probability: We will refer to P as probability, as in P(the) = 0.022, which means that the probability of the token "the" is 0.022, or 2.2%. If W is a sequence of tokens, then $W_3$ is the third token, and $W_{1:3}$ is the sequence of the first through third tokens. $P(W_i = the \; | \; W_{i-1} = of)$ is the conditional probability of "the", given that "of" is the previous token.

We are now ready to look at some tasks that can be accomplished using the data!


1. Word Segmentation

In general, English readers do not need to perform the task of word segmentation, the process of deciding where word boundaries are. This is because English uses spaces between words (in contrast with, for example, Mandarin).

However, in some texts, such as URLs, spaces are not present. How could a search engine or computer program work out the intended word boundaries?

As an example, let's look at the English text

"choosespain.com"

This website is hoping to convince you to choose Spain as a travel destination, but if you segment the name wrong, you get the less appealing name:

"chooses pain"

As a human reader, you are able to make the right choice by drawing upon years of experience; you might initially guess that encoding that experience into a computer algorithm would be a nearly insurmountable task. However, there is a shortcut that works surprisingly well: we can look up each phrase in the bigram table. We see that "choose Spain" has a count of 3,210, whereas "chooses pain" does not appear at all (which means that it occurs fewer than 40 times in the trillion-word corpus). Thus "choose Spain" is at least 80 times more likely, and can safely be considered the right segmentation.
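As a rough illustration, a lookup along these lines could be run against the bigram count file loaded later in this post (data/count_2w.txt); the counts quoted above come from Norvig's chapter, so treat this as a sketch rather than a guaranteed reproduction:

```
def bigram_count(phrase, filename='data/count_2w.txt'):
  "Scan the tab-separated bigram file for a two-word phrase; 0 if unlisted."
  target = phrase.lower()
  for line in open(filename):
    bigram, count = line.split('\t')
    if bigram.lower() == target:
      return int(count)
  return 0

print(bigram_count('choose spain'))   # expected to be large (around 3,210)
print(bigram_count('chooses pain'))   # expected to be 0: below the 40-count cutoff
```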

Now suppose we are trying to interpret the phrase:

insufficientnumbers

If we add together the capitalized and lowercase versions of the words, the counts are:

insufficient numbers ->  20715
in sufficient numbers -> 32378
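For reference, the arithmetic behind the "50% more frequent" remark below is just the ratio of the two counts (roughly 1.56, i.e. a bit over 50%):

$$\frac{32378}{20715} \approx 1.56$$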

“In sufficient numbers” is about 50% more frequent than “insufficient numbers”, but that’s hardly compelling evidence. We are left in a frustrating position: we can guess, but we can’t be confident. In uncertain problems like this, we don’t have any way of calculating a definitive correct answer, and we don’t have a complete model of what makes one answer right; in fact, human experts don’t have a complete model either, and can disagree on the answer. Still, there is an established methodology for solving uncertain problems.

1.1 Methodology


1.1.1 Define a probabilistic model

We can't define all the factors (semantic, syntactic, lexical, social) that make "choose Spain" a better candidate for a domain name, but we can define a simplified model that gives approximate probabilities. For short candidates like "choose Spain" we could just look up the n-gram in the corpus data and use that as the probability. For longer candidates we will need some way of composing an answer from smaller parts. For words we haven't seen before, we'll have to estimate the probability of an unknown word. The point is that we define a language model (a probability distribution over all the strings in the language), learn the parameters of the model from our corpus data, and then use the model to define the probability of each candidate.

1.1.2 Enumerate the candidates

We may not be sure whether “insufficient numbers” or “in sufficient numbers” is more likely to be the intended phrase, but we can agree that they are both candidate segmentations, as is “in suffi cient numb ers,” but that “hello world” is not a valid candidate. In this step we withhold judgment and just enumerate possibilities—all the possibilities if we can, or else a carefully selected sample.

1.1.3 Choose the most probable candidate

Apply the language model to each candidate to get its probability, and choose the one with the highest probability.

From a mathematical perspective this can be written as:

$$best = \underset{c \in candidates}{\operatorname{argmax}} \; P(c)$$

And from a code perspective, it would be:

```
best = max(candidates, key=P)
```

Let's now apply this methodology to segmentation. We want to define a function, segment, which takes as input a string with no spaces and returns a list of words that is the best segmentation:

>>> segment('choosespain')
['choose', 'spain']

Let's start with step 1, the probabilistic language model. The probability of a sequence of words is the product of the probabilities of each word given that word's context: all the preceding words. It can be written mathematically as:

$$P(W_{1:n}) = \prod_{k=1:n}P(W_k \;|\;W_{1:k-1} )$$

As an example, if we were analyzing the sentence:

"The dog ran"

It would be broken down into:

$$P("the \; dog \; ran") = P(the) * P(dog \; | \; the) * P(ran \; | \; the dog)$$

Now, we don’t have the data to compute this exactly, so we can approximate the equation by using a smaller context. Since we have data for sequences up to 5-grams, it would be tempting to use the 5-grams, so that the probability of an n-word sequence would be the product of the probabilities of each word given the four previous words (rather than all previous words).

There are three difficulties with the 5-gram model. First, the 5-gram data is about 30 gigabytes, so it can’t all fit in RAM. Second, many 5-gram counts will be 0, and we’d need some strategy for backing off, using shorter sequences to estimate the 5-gram probabilities. Third, the search space of candidates will be large because dependencies extend up to four words away. All three of these difficulties can be managed, with some effort. But instead, let’s first consider a much simpler language model that solves all three difficulties at once: a unigram model, in which the probability of a sequence is just the product of the probability of each word by itself. In this model, the probability of each word is independent of the other words (this is the naive Bayes independence assumption):

$$P(W_{1:n}) = \prod_{k=1:n}P(W_k)$$

Now, if we were to segment wheninrome, we would consider candidates such as when in rome, and compute:

$$P(when) * P(in) * P(rome)$$

If the product is higher than that of any other candidate, then that is the best answer!

An n-character string has $2^{n-1}$ different segmentations (there are n-1 positions between characters, each of which can either be or not be a word boundary). Thus the string:

wheninthecourseofhumaneventsitbecomesnecessary

has 35 trillion segmentations. But you most definitely were able to find the right segmentation in just a few seconds; clearly, you couldn't have enumerated all the candidates. You most likely scanned "w", "wh", and "whe", and rejected them as improbable words, but accepted "when" as probable. Then you moved on to the remainder and found its best segmentation. Once we make the simplifying assumption that each word is independent of the others, it means that we don’t have to consider all combinations of words.
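To make the $2^{n-1}$ count concrete, here is a small brute-force enumerator (not part of the original post) that lists every segmentation of a tiny string; it is only feasible for very short inputs, which is exactly why the recursive approach sketched next matters:

```
def all_segmentations(text):
  "Yield every way of splitting text into a list of non-empty pieces."
  if not text:
    yield []
    return
  for i in range(1, len(text) + 1):
    for rest in all_segmentations(text[i:]):
      yield [text[:i]] + rest

candidates = list(all_segmentations('lola'))
print(len(candidates))   # 8, which is 2**(4-1)
print(candidates[:3])    # [['l', 'o', 'l', 'a'], ['l', 'o', 'la'], ['l', 'ol', 'a']]
```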

That gives us a sketch of the segment function: consider every possible way to split the text into a first word and a remaining text (we can arbitrarily limit the longest possible word to, say, L=20 letters). For each possible split, find the best way to segment the remainder. Out of all the possible candidates, the one with the highest product of P(first)× P(remaining) is the best.

Consider the possible choices for the first word, along with the probability of that word, the probability of the best segmentation of the remaining words, and the probability of the whole (which is the product of the probabilities of the first word and the remainder). The segmentation starting with “when” turns out to be 50,000 times better than the second-best candidate. (We reconstruct this comparison in code once segment is defined below.)

In only a few lines of Python, we are able to implement segment. First, our imports that will be used throughout the rest of this notebook:

In [1]:
import re, string, random, glob, operator, heapq, functools
from collections import defaultdict
from math import log10

Now we will write two support functions that will be used in our segment functions. Note that memo implements memoization, which is defined as:

An optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.

In [2]:
def memo(f):
  "Memoize function f."
  table = {}
  def fmemo(*args):
    if args not in table:
      table[args] = f(*args)
    return table[args]
  fmemo.memo = table
  return fmemo  

def product(nums):
    "Return the product of a sequence of numbers."
    return functools.reduce(operator.mul, nums, 1)
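As a quick illustration of the decorator (not part of the original post), the cached results are exposed on the wrapped function as .memo:

```
@memo
def square(x):
  "A trivially memoized function, just to show the cache filling up."
  return x * x

square(4)
square(4)            # the second call is answered from the cache
print(square.memo)   # {(4,): 16}
```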

And now we can build out segment and its related functions:

In [3]:
@memo 
def segment(text):
  "Return a list of words that is the best segmentation of text."
  if not text: return []
  candidates = ([first] + segment(rem) for first, rem in splits(text))
  return max(candidates, key=Pwords)

def splits(text, L=20):
  "Return a list of all possible (first, rem) pairs, len(first)<=L."
  return [(text[:i + 1], text[i + 1:])
         for i in range(min(len(text), L))]

def Pwords(words):
  "The Naive Bayes probability of a sequence of words."
  return product(Pw(w) for w in words)

This is the entire program! We use product as a utility function that multiplies together a list of numbers, memo is a decorator that caches the results of previous calls to a function so that they don't have to be recomputed, and Pw estimates the probability of a word by consulting the unigram count data.

Without memo, a call to segment for an n-character text makes $2^n$ recursive calls to segment; with memo it makes only n calls, which makes this a fairly efficient dynamic programming algorithm. Each of the n calls considers O(L) splits, and evaluates each split by multiplying O(n) probabilities, so the whole algorithm is O($n^2L$).
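To make the splits helper concrete, here is what it returns for a short string (an illustrative call, not from the original post):

```
print(splits('lola'))
# [('l', 'ola'), ('lo', 'la'), ('lol', 'a'), ('lola', '')]
```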

As for Pw, we read in the unigram counts from a datafile. If a word appears in the corpus, its estimated probability is Count(word)/N, where N is the corpus size. Actually, instead of using the full 13-million-type unigram datafile, I created vocab_common, which (a) is case-insensitive, so that the counts for “the”, “The”, and “THE” are added together under a single entry for “the”; (b) only has entries for words made out of letters, not numbers or punctuation (so “+170.002” is out, as is “can’t”); and (c) lists only the most common 1/3 of a million words (which together cover 98% of the tokens).
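Below is a sketch of how a vocab_common-style file might be produced from a full unigram count file. This is an assumption about the preprocessing rather than the exact script used, and the filenames count_1w_full.txt and vocab_common.txt are hypothetical:

```
from collections import Counter

def build_vocab_common(infile='data/count_1w_full.txt',
                       outfile='data/vocab_common.txt', top=333333):
  "Fold case, keep letters-only words, and write out the most common entries."
  counts = Counter()
  for line in open(infile):
    word, count = line.split('\t')
    word = word.lower()
    if word.isalpha():              # drop numbers and punctuation ("+170.002", "can't")
      counts[word] += int(count)    # merge "the", "The", "THE" into one entry
  with open(outfile, 'w') as out:
    for word, count in counts.most_common(top):
      out.write('%s\t%d\n' % (word, count))
```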

The only tricky part of Pw is when a word has not been seen in the corpus. This happens sometimes even with a trillion-word corpus, so it would be a mistake to return 0 for the probability. But what should it be? The number of tokens in the corpus, N, is about a trillion, and the least common word in vocab_common has a count of 12,711. So a previously unseen word should have a probability of somewhere between 0 and 12,710/N. Not all unseen words are equally unlikely: a random sequence of 20 letters is less likely to be a word than a random sequence of 6 letters. We will define a class for probability distributions, Pdist, which loads a datafile of (key, count) pairs. By default, the probability of an unknown word is 1/N, but each instance of a Pdist can supply a custom function to override the default. We want to avoid having too high a probability for very long words, so we (rather arbitrarily) start at a probability of 10/N, and decrease by a factor of 10 for every letter in the candidate word. We then define Pw as a Pdist:

In [4]:
class Pdist(dict):
  "A probability distribution estimated from counts in datafile."
  def __init__(self, data=[], N=None, missingfn=None):
    for key, count in data:
      self[key] = self.get(key, 0) + int(count)
    self.N = float(N or sum(self.values()))
    self.missingfn = missingfn or (lambda k, N: 1. / N)
  
  def __call__(self, key):
    if key in self: return self[key] / self.N
    else: return self.missingfn(key, self.N)
    
def datafile(name, sep='\t'):
  "Read key, value pairs from file."
  for line in open('data/' + name):
    yield line.split(sep)
    
def avoid_long_words(key, N):
  "Estimate the probability of an unknown word."
  return 10. / (N * 10 ** len(key))

N = 1024908267229 # Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)
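Before running the examples, we can loosely reconstruct the first-word comparison described earlier: the probability of each candidate first word, the probability of the best segmentation of the remainder, and their product. This is an illustrative sketch reusing splits, segment, and Pwords from above; the exact numbers depend on the count data loaded:

```
text = 'wheninthecourseofhumaneventsitbecomesnecessary'
for first, rem in splits(text, 5):   # candidate first words up to 5 letters
  rest = segment(rem)                # best segmentation of the remainder
  print(first, Pwords([first]), Pwords(rest), Pwords([first]) * Pwords(rest))
```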
In [5]:
print(segment('choosespain'))
['choose', 'spain']
In [6]:
print(segment('thisisatest'))
['this', 'is', 'a', 'test']
In [7]:
print(segment('wheninthecourseofhumaneventsitbecomesnecessary'))
['when', 'in', 'the', 'course', 'of', 'human', 'events', 'it', 'becomes', 'necessary']
In [8]:
print(segment('speedofart'))
['speed', 'of', 'art']
In [9]:
print(segment('expertsexchange'))
['experts', 'exchange']
In [10]:
print(segment('itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness'))
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']
In [11]:
print(segment('inaholeinthegroundtherelivedahobbitnotanastydirtywetholefilledwiththeendsofwormsandanoozysmellnoryetadrybaresandyholewithnothinginittositdownonortoeatitwasahobbitholeandthatmeanscomfort'))
['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit', 'not', 'a', 'nasty', 'dirty', 'wet', 'hole', 'filled', 'with', 'the', 'ends', 'of', 'worms', 'and', 'an', 'oozy', 'smell', 'nor', 'yet', 'a', 'dry', 'bare', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sitdown', 'on', 'or', 'to', 'eat', 'it', 'was', 'a', 'hobbit', 'hole', 'and', 'that', 'means', 'comfort']
In [12]:
print(segment('faroutintheunchartedbackwatersoftheunfashionableendofthewesternspiralarmofthegalaxyliesasmallunregardedyellowsun'))
['far', 'out', 'in', 'the', 'uncharted', 'backwaters', 'of', 'the', 'unfashionable', 'end', 'of', 'the', 'western', 'spiral', 'arm', 'of', 'the', 'galaxy', 'lies', 'a', 'small', 'un', 'regarded', 'yellow', 'sun']

Overall the results look good, but there are two errors: 'un','regarded' should be one word, and 'sitdown' should be two. Still, that’s a word precision rate of 157/159 = 98.7%; not too bad.

The first error is in part because “unregarded” does not appear in our 1/3-million-word vocabulary. (It is in the full 13-million-word vocabulary at position 1,005,493, with count 7,557.) If we put it in the vocabulary, we see that the segmentation comes out correct.

The second error happens because, although “sit” and “down” are common words (with probability .003% and .04%, respectively), the product of their two probabilities is just slightly less than the probability of “sitdown” by itself. However, the probability of the two-word sequence “sit down,” according to the bigram counts, is about 100 times greater. We can try to fix this problem by modeling bigrams; that is, considering the probability of each word, given the previous word:

$$P(W_{1:n}) = \prod_{k=1:n}P(W_k \; | \; W_{k-1})$$

Of course the complete bigram table won’t fit into memory. If we keep only bigrams that appear 100,000 or more times, that works out to a little over 250,000 entries, which does fit. We can then estimate $P(down | sit)$ as Count(sit down)/Count(sit). If a bigram does not appear in the table, then we just fall back on the unigram value. We can define cPw, the conditional probability of a word given the previous word, as:

In [13]:
def cPw(word, prev):
  "The conditional probability P(word | previous-word)."
  try: 
    return P2w[prev + ' ' + word] / float(Pw[prev])
  except KeyError:
    return Pw(word)
  
P2w = Pdist(datafile('count_2w.txt'), N)

(You may note cPw is not a probability distribution, because the sum over all words for a given previous word can be greater than 1. This approach has the technical name stupid backoff, but it works well in practice, so we won’t worry about it.) We can now compare “sitdown” to “sit down” with a preceding “to”:

In [14]:
cPw('sit', 'to') * cPw('down', 'sit') / cPw('sitdown', 'to')
Out[14]:
1698.0002330199263

Bigram Model

We see that “sit down” is 1,698 times more likely than “sitdown”, because “sit down” is a popular bigram, and because “to sit” is popular but “to sitdown” is not.

This looks promising! Let's implement a new version of segment using a bigram model. While we're at it, we'll fix two other issues.

  1. When segment added one new word to a sequence of n words segmented in the remainder, it called Pwords to multiply together all n+1 probabilities. But segment had already multiplied all the probabilities in the remainder. It would be more efficient to remember the probability of the remainder and then just do one more multiplication.

  2. There is a potential problem with arithmetic underflow. If we apply Pwords to a sequence consisting of the word “blah” repeated 61 times, we get $5.2 \times 10^{-321}$, but if we add one more “blah,” we get 0.0. The smallest positive floating-point number that can be represented is about $4.9 \times 10^{-324}$; anything smaller than that rounds to 0.0. To avoid underflow, the simplest solution is to add logarithms of numbers rather than multiplying the numbers themselves (a short sketch of this underflow follows this list).
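Here is a small sketch of that underflow; the per-word probability 5.5e-6 is just a stand-in chosen so that the 61st and 62nd powers straddle the representable range (the exact figures in the text come from the real unigram data):

```
from math import log10

p = 5.5e-6             # stand-in for a word probability like P("blah")
print(p ** 61)         # a tiny but still nonzero float (roughly 1e-321)
print(p ** 62)         # underflows all the way to 0.0
print(62 * log10(p))   # in log space the same quantity stays perfectly usable
```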

We will define segment2, which differs from segment in three ways:

1) Bigram Language Model: First, it uses a conditional bigram language model, cPw, rather than the unigram model Pw.
2) Different Function Signature: Second, the function signature is different. Instead of being passed a single argument (the text), segment2 is also passed the previous word. At the start of the sentence, the previous word is the special beginning-of-sentence marker, <S>. The return value is not just a list of words, but rather a pair of values: the probability of the segmentation, followed by the list of words. We return the probability so that it can be stored (by memo) and need not be recomputed; this fixes problem (1), the inefficiency. The function combine takes four inputs—the first word and the remaining words, plus their probabilities—and combines them by appending the first word to the remaining words, and by multiplying the probabilities—except that in order to solve problem (2), we introduce the third difference...
3) Add Logarithms: We add logarithms of probabilities instead of multiplying the raw probabilities.

In [15]:
@memo
def segment2(text, prev='<S>'):
  "Return (log P(words), words), where words is the best segmentation."
  if not text: return 0.0, []
  candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first))
               for first, rem in splits(text)]
  return max(candidates)

def combine(Pfirst, first, Prem_and_rem):
  "Combine first and rem results into one (probability, words) pair."
  Prem, rem = Prem_and_rem
  return Pfirst + Prem, [first] + rem

segment2 makes O(nL) recursive calls, and each one considers O(L) splits, so the whole algorithm is O(nL$^2$). In effect this is the Viterbi algorithm, with memo implicitly creating the Viterbi tables.

segment2 correctly segments the “sit down” example, and gets right all the examples that the first version got right. Neither version gets the “unregarded” example right.

In [16]:
print(segment2('inaholeinthegroundtherelivedahobbitnotanastydirtywetholefilledwiththeendsofwormsandanoozysmellnoryetadrybaresandyholewithnothinginittositdownonortoeatitwasahobbitholeandthatmeanscomfort'))
(-164.42574174168834, ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit', 'not', 'a', 'nasty', 'dirty', 'wet', 'hole', 'filled', 'with', 'the', 'ends', 'of', 'worms', 'and', 'an', 'oozy', 'smell', 'nor', 'yet', 'a', 'dry', 'bare', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sit', 'down', 'on', 'or', 'to', 'eat', 'it', 'was', 'a', 'hobbit', 'hole', 'and', 'that', 'means', 'comfort'])

Could we improve on this performance? Probably. We could create a more accurate model of unknown words. We could incorporate more data, and either keep more entries from the unigram or bigram data, or perhaps add trigram data.
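For instance, a trigram extension might look like the sketch below: the same stupid-backoff idea, conditioned on the two previous words, falling back to the bigram model. The names cP3w and P3w and the file count_3w.txt are hypothetical (the published data set does include 3-gram counts, but this post never loads them), so treat this as the shape of the change rather than code tested against real data:

```
P3w = {}   # would be: Pdist(datafile('count_3w.txt'), N)

def cP3w(word, prev2, prev1):
  "P(word | prev2 prev1), with stupid backoff to the bigram estimate cPw."
  try:
    return P3w[prev2 + ' ' + prev1 + ' ' + word] / float(P2w[prev2 + ' ' + prev1])
  except KeyError:
    return cPw(word, prev1)

print(cP3w('down', 'to', 'sit'))   # with an empty P3w this falls back to cPw('down', 'sit')
```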


© 2018 Nathaniel Dake