Natural Language Processing (NLP) Fundamentals: Hidden Markov Models (HMMs)
This blog post is part of a series, titled Natural Language Processing (NLP) Fundamentals.
Disclaimer: I am a developer at Amazon.com for their Alexa technologies, on devices such as Amazon Echo. As a consequence, there may be hints of Amazon sprinkled throughout this blog series. However, I should note that this blog series is not sponsored by Amazon.com, and represent my opinions only, and not those of the company.
In my last post, I asked if you like graphs. Well, I have more questions for you this time! Do you enjoy:
- Taking calculated risks?
- Playing with your odds?
- Looking at probability equations?
If you answered “yes” to any of those questions, then fantastic! You’ll love learning about Hidden Markov Models (HMMs). After all, HMMs provide a computer with the means to take calculated risks.
Wikipedia describes a Hidden Markov Model (HMM) as “a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states”. If you felt that this definition was in any way intuitive, then you are either a genius or a masochist. I’ll play the optimist and say you are a genius! I am unfortunately not, and so here is an explanation that is more intuitive, at least for myself.
Let’s play a quick game to help get us in the right mindset. Assume I gave you the following sentence:
“Alexa, please pass the ____”
And, the following audio snippet (an “observation”):
Using this data, can you fill in the blank?
Just to be clear, Amazon Echo probably can’t physically pass salt/stuff to you. You can of course ask Alexa to order things on Amazon (like Alec Baldwin apparently does) or get pizza delivered to your doorstep. However, the Alexa Voice Service makes it possible to deploy Amazon’s voice AI on any device with a speaker and a microphone. Alexa-enabled robotic butler, anybody?
Hidden Markov Models are probability models that help programs come to the most likely decision, based on both previous decisions (like previously recognized words in a sentence) and current data (like the audio snippet). The game above is similar to the problem that a computer might try to solve when doing automatic speech recognition.
The words “Alexa, please pass the” could be considered hidden states that had been determined earlier in the recognition process. The audio snippet is considered an observation that is used, in conjunction with the previously recognized words, to determine the next word (i.e., next hidden state).
As I did with Finite State Transducers (FSTs), allow me to demonstrate this intuition with an image.
Like FSTs, HMMs are similar to directed graphs, but with several key differences:
- There are “start probabilities” associated with entering a state (red arrows). For visual clarity, I’m omitting the probabilities (from 0 to 1) next to these arrows, but trust me, they should be there! If you added up all the red arrows, you should get 1.
- There are “transition probabilities” associated with going from one state to another, or back to itself (blue arrows). Again, these would be numbers from 0 to 1, next to the blue arrows. If you add up all the blue arrows exiting a given state, you should get 1.
- To move from one state to another, we are given information about the previous state and the current “observation”. In other words, if I am deciding if I should move to “there”, I may know that my previous state was “Hey”, and that I am given a particular sound byte (like #1)
- An observation is associated with an “emission probability” for each state. In other words, with a fully-defined HMM, I should be able to get (or calculate) the probability of a particular sound byte occurring for the word “Nade”. This sound byte could sound close to “Nade”, like “Nate”, “Made”, etc., or something far off, like “Hello”, “John”, “Alexa”. You might correctly guess that the far off sound bytes should have lower probabilities than the closer ones.
As you probably gathered by this point, Hidden Markov Models are used in modern speech recognition systems. HMMs are also used in a bunch of other applications as well though:
- Speech synthesis (i.e., given a sentence, produce a series of sound bytes that likely correspond with that sentence)
- Part-of-speech tagging (e.g., in “She finished the race”, is “race” most likely a verb or a noun?)
- Machine translation
- Gene prediction
- Time Series Analysis (e.g., stock predictions! Someone even wrote a thesis on this.)
Training an HMM
Perhaps you are now thinking, “Whoa HMMs are cool! But how are they created?” It turns out this is an incredibly nuanced question, but let’s at least talk about the basics (of which I’m more comfortable with anyway). And, for those of you still with me, there’s a distinct value-add if you continue reading: what we discuss next, training, is a cornerstone of machine learning — specifically, supervised learning.
A Sample Problem
Imagine that you are interested in figuring out the weather in SomeWonderland, for each day from January 1, 2016 to January 10, 2016. While Google or other modern and savvy tech tools do not provide the weather tracking data for SomeWonderland, you have Nade’s diary, which (strangely) details the number of ice cream cones that he consumed each day from January 1, 2016 to January 10, 2016. Can you use Nade’s ice cream cone consumption to determine if a particular day was hot or cold?
This problem could be approached using an HMM, with the weather as hidden states, and Nade’s ice cream consumption as observations.
You also happen to have some data that can be used for training. Let’s get started, shall we?
A possible training set might include:
- Many sets of 10-day weather patterns at SomeWonderland.
- Nade’s ice cream consumption for many days, with the weather for each day.
Set of 10-day weather patterns at SomeWonderland
One possible set of 10-day Weather patterns could look as simple as the following:
- [H, H, C, C, H, H, C, C, H, H]
- [H, H, H, H, H, H, H, H, H, H]
- [C, C, C, C, C, C, C, C, C, C]
- [C, C, H, H, H, H, H, H, H, H]
Nade’s ice cream consumption vs. weather patterns
We also need to have some notion of how Nade’s ice cream consumption might correlate with the weather patterns. Here’s one possible set:
- 0 cones, H (x 1 time)
- 1 cone, H (x 2 times)
- 2 cones, H (x 4 times)
- 0 cones, C (x 5 times)
- 1 cone, C (x 3 times)
- 2 cones, C (x 1 time)
Training the start probabilities
This is the likelihood of starting at a particular weather pattern. With the above training set, we could simply count the number of times a pattern started on a hot day, and then divide this by the number of total patterns.
Two sets started with a hot day, and two sets started with a cold day, so:
- P(Start on H) = 0.5
- P(Start on C) = 0.5
Training the transition probabilities
This is the likelihood of going from a hot day to another hot day, a hot day to a cold day, a cold day to a cold day, and a cold day to a hot day. That was a bit of a mouthful, so let’s instead do some counting.
- Number of transitions total, over all 4 weather patterns: 9 x 4 = 36
- Number of transitions from H to H: 19
- Number of transitions from H to C: 2
- Number of transitions from C to C: 12
- Number of transitions from C to H: 3
Now that we have the counts, we can easily compute the transition probabilities:
- P(H to H) = 19 / 36 = 0.53
- P(H to C) = 2 / 36 = 0.06
- P(C to C) = 12 / 36 = 0.33
- P(C to H) = 3 / 36 = 0.08
Training the emission probabilities
This is the likelihood of Nade consuming a particular number of ice cream cones, given that it is a Hot/Cold day; in other words, we would like to compute probabilities like P(3 cones | Hot).
- Number of Hot days: 1 + 2 + 4 = 7
- Number of Cold days: 5 + 3 + 1 = 9
We can use the number of Hot/Cold days as denominators for the conditional probabilities:
- P(0 cones | Hot day) = 1 / 7 = 0.14
- P(1 cone | Hot day) = 2 / 7 = 0.29
- P(2 cones | Hot day) = 4 / 7 = 0.57
- P(0 cones | Cold day) = 5 / 9 = 0.56
- P(1 cone | Cold day) = 3 / 9 = 0.33
- P(2 cones | Cold day) = 1 / 9 = 0.11
That’s it! We’ve trained our HMM using the training data above. Wasn’t that bad, right? 🙂
Major (pesky?) nuances in training
There are some practical complications that I did not explore but are probably worth calling out.
Counting naively, like we did above, typically leads to overly restrictive probabilities. For example, if none of the weather sequences started with a Cold day, then the start probability would be 0. The same can be said of the transition probabilities — imagine if there are no days transitioning from Hot to Cold. The issue of having 0 occurrences of an event in training data is so mind-boggingly common in natural language data, that there is actually a family of more advanced techniques to compute these probabilities. Techniques that massage the probabilities to deal with this issue technically fall under smoothing. I won’t explore smoothing in this post, but there are some pretty neat things out there. If you’re interested, I recommend checking it out!
Also, in many cases, calculating the emission probabilities is not as trivial as counting, like we did above. Counting works most easily when the events are clearly discrete. But when considering an audio signal, like the sound byte for “salt”, virtually no two samples are exactly identical, even when coming from the same person. In these cases, more advanced techniques are required.
The Natural Language Toolkit (NLTK) is “a leading platform for building Python programs to work with human language data”. The creators of NLTK also happen to have written Natural Language Processing With Python, which you can also get for free on their website (and here’s their original version for Python 2).
NLTK comes equipped with an HMM-based sequence tagger. Here’s an example of its usage, identifying “parts-of-speech” (such as nouns, adjectives, verbs, etc.) as hidden states for each observed word in a body of literature, taken literally from here:
def demo_pos(): # demonstrates POS tagging using supervised training print() print('HMM POS tagging demo') print() print('Training HMM...') labelled_sequences, tag_set, symbols = load_pos(20000) trainer = HiddenMarkovModelTrainer(tag_set, symbols) hmm = trainer.train_supervised(labelled_sequences[10:], estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)) print('Testing...') hmm.test(labelled_sequences[:10], verbose=True) def load_pos(num_sents): from nltk.corpus import brown sentences = brown.tagged_sents(categories='news')[:num_sents] tag_re = re.compile(r'[*]|--|[^+*-]+') tag_set = set() symbols = set() cleaned_sentences =  for sentence in sentences: for i in range(len(sentence)): word, tag = sentence[i] word = word.lower() # normalize symbols.add(word) # log this word # Clean up the tag. tag = tag_re.match(tag).group() tag_set.add(tag) sentence[i] = (word, tag) # store cleaned-up tagged token cleaned_sentences += [sentence] return cleaned_sentences, list(tag_set), list(symbols)
(Software) Natural Language Toolkit (NLTK)
(Video Lecture) Hidden Markov Models by mathematicalmonk, on Youtube
Please feel free to leave a question or a comment below. Also, I’d be happy to correct any errata here (I am still a newbie, after all!). Thanks everyone for reading!