Natural Language Processing (NLP) Fundamentals: Three Fundamental Tasks (of Hidden Markov Models [HMMs])
This blog post is part of a series, titled Natural Language Processing (NLP) Fundamentals.
Disclaimer: I am a developer at Amazon.com for their Alexa technologies, on devices such as the Amazon Echo. As a consequence, there may be hints of Amazon sprinkled throughout this blog series. However, I should note that this blog series is not sponsored by Amazon.com, and it represents my opinions only, not those of the company.
In my last post, I talked about Hidden Markov Models (HMMs), describing what they are, what they are used for, and how to train one. That’s all well and merry, but I left out a few things that I think some readers would benefit from seeing.
In this post, I will describe three fundamental tasks that can be approached using HMMs.
Three Fundamental Tasks of Hidden Markov Models (HMMs)
Hidden Markov Models are traditionally associated with three fundamental tasks: learning, likelihood, and decoding. I’ll describe them a bit formally first:
- Learning: Given a set of training data, determine the parameters (specifically: the start, transition, and emission probabilities) of the HMM.
- Likelihood: Given a fully trained HMM, determine the likelihood of a sequence of observed variables.
- Decoding: Given a fully trained HMM and a sequence of observed variables, determine the most likely sequence of hidden states.
In my humble opinion, these problems are much easier to understand as examples. Recall my previous HMM example about using ice cream consumption to predict the weather. Remember that? Just in case, let me reproduce it below.
A Silly Ice Cream HMM Example
Imagine that you are interested in figuring out the weather in SomeWonderland, for each day from January 1, 2016 to January 10, 2016. While Google and other modern, savvy tech tools do not provide weather tracking data for SomeWonderland, you do have Nade’s diary, which (strangely) details the number of ice cream cones that he consumed each day from January 1, 2016 to January 10, 2016. Can you use Nade’s ice cream cone consumption to determine whether a particular day was hot or cold?
This problem could be approached using an HMM, with the weather as hidden states, and Nade’s ice cream consumption as observations.
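If it helps to see all of this written down, here is one way the pieces of such an HMM could be sketched in Python. Everything below, from the variable names to the probability values, is a toy sketch I made up for illustration; these are not real numbers from SomeWonderland.

```python
# A toy sketch of the ice cream HMM as plain Python dictionaries.
# All probability values below are made up for illustration only.

states = ["Hot", "Cold"]      # hidden states: the weather on a given day
observations = [1, 2, 3]      # observed values: ice cream cones eaten that day

# Start probabilities: P(weather on day 1)
start_prob = {"Hot": 0.6, "Cold": 0.4}

# Transition probabilities: P(today's weather | yesterday's weather)
trans_prob = {
    "Hot":  {"Hot": 0.7, "Cold": 0.3},
    "Cold": {"Hot": 0.4, "Cold": 0.6},
}

# Emission probabilities: P(cones eaten | today's weather)
emit_prob = {
    "Hot":  {1: 0.2, 2: 0.4, 3: 0.4},
    "Cold": {1: 0.5, 2: 0.4, 3: 0.1},
}
```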
Learning
Given the following untrained HMM:
And the following training data:
- Many sets of 10-day weather patterns at SomeWonderland.
- Nade’s ice cream consumption for many days, with the weather for each day.
Find:
- The start probabilities (of entering the Hot and Cold states; red arrows above),
- The transition probabilities (of moving to/from the Hot and Cold states; blue arrows above), and
- The emission probabilities (of having a particular number of ice cream cones on either a Hot or Cold day; the 🍦s above)
We talked about how to solve learning tasks in my last post on HMMs, but it might be worth revisiting it for a refresher!
Note that in my last post, we discussed a brute force approach (essentially simple counting) to solving learning problems. In practice, it is more efficient to employ the forward-backward algorithm.
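To make the counting idea a bit more concrete, here is a rough Python sketch of the fully supervised case, where every training day comes labeled with its weather. The function name and data layout are my own inventions for illustration; when the weather labels are missing, simple counting no longer works, and that is where the forward-backward machinery comes in.

```python
from collections import defaultdict

def learn_by_counting(labeled_sequences):
    """Estimate start, transition, and emission probabilities by counting.

    `labeled_sequences` is a list of day-by-day sequences, where each day is a
    (weather, cones) pair, e.g. [("Hot", 3), ("Hot", 2), ("Cold", 1)].
    """
    start_counts = defaultdict(float)
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))

    for seq in labeled_sequences:
        start_counts[seq[0][0]] += 1                      # weather on day 1
        for (prev_weather, _), (weather, _) in zip(seq, seq[1:]):
            trans_counts[prev_weather][weather] += 1      # yesterday -> today
        for weather, cones in seq:
            emit_counts[weather][cones] += 1              # weather -> cones eaten

    def normalize(counts):
        total = sum(counts.values())
        return {key: count / total for key, count in counts.items()}

    start_prob = normalize(start_counts)
    trans_prob = {w: normalize(c) for w, c in trans_counts.items()}
    emit_prob = {w: normalize(c) for w, c in emit_counts.items()}
    return start_prob, trans_prob, emit_prob
```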
Likelihood
Given the following trained HMM:
What is the probability of observing Nade consuming [1🍦, 2🍦, 1🍦] ice cream cones?
Here’s one brute force way to solve the problem above:
- Consider all possible 3-day weather sequences: [H, H, H], [H, H, C], [H, C, H], etc.
- For each 3-day weather sequence, calculate the joint probability of that weather sequence and the ice cream consumption sequence, [1🍦, 2🍦, 1🍦].
- Add up all probabilities from #2.
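Here is what those three steps could look like as a small Python sketch, reusing the toy dictionaries from earlier (again, the function name and numbers are mine, just for illustration):

```python
from itertools import product

def brute_force_likelihood(obs, states, start_prob, trans_prob, emit_prob):
    """Sum the joint probability of (weather sequence, observations) over
    every possible weather sequence."""
    total = 0.0
    for weather in product(states, repeat=len(obs)):   # e.g. ("Hot", "Cold", "Hot")
        p = start_prob[weather[0]] * emit_prob[weather[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= trans_prob[weather[i - 1]][weather[i]] * emit_prob[weather[i]][obs[i]]
        total += p
    return total

# e.g. brute_force_likelihood([1, 2, 1], states, start_prob, trans_prob, emit_prob)
```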
The brute force approach above is impractical in some cases, and in many others altogether impossible. One efficient way of approaching likelihood problems is the forward algorithm.
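For the curious, here is a compact sketch of the forward algorithm in the same toy Python setup. It computes the same quantity as the brute force enumeration above, but in time proportional to (number of days) × (number of states)², rather than exponential in the number of days.

```python
def forward_likelihood(obs, states, start_prob, trans_prob, emit_prob):
    """Forward algorithm: likelihood of `obs` without enumerating every sequence."""
    # alpha[s] = probability of the observations so far, ending in hidden state s
    alpha = {s: start_prob[s] * emit_prob[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_prob[prev][s] for prev in states) * emit_prob[s][o]
            for s in states
        }
    return sum(alpha.values())
```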
Decoding
Given the following trained HMM:
What was the most likely weather pattern for January 1, 2016 to January 3, 2016 if Nade consumed [1🍦, 2🍦, 1🍦] ice cream cones?
Notice:
- We are looking for the most likely weather pattern, but we are not looking for the probability of that weather pattern occurring.
- To use the trained HMM, we need a method to calculate the probability of an entire sequence, using the following information:
- The start probabilities (of entering the Hot and Cold states; red arrows above),
- The transition probabilities (of moving to/from the Hot and Cold states; blue arrows above), and
- The emission probabilities (of having a particular number of ice cream cones on either a Hot or Cold day; the 🍦s above)
In mathematical notation, we can express the problem like this:
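$$\widehat{\text{WeatherPattern}} \;=\; \operatorname*{argmax}_{\text{WeatherPattern}} \; P(\text{WeatherPattern} \mid \text{ConsumptionPattern})$$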
I promise this is not as scary as it looks! The argmax simply says “pick out the WeatherPattern that maximizes the argument to the right”. The argument to the right is simply “the probability of a particular weather sequence, considering that we know how much ice cream Nade ate for the corresponding days”.
We have the following probability expression:
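$$P(\text{WeatherPattern} \mid \text{ConsumptionPattern})$$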
Which we can expand using Bayes’ theorem:
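$$P(\text{WeatherPattern} \mid \text{ConsumptionPattern}) \;=\; \frac{P(\text{ConsumptionPattern} \mid \text{WeatherPattern})\; P(\text{WeatherPattern})}{P(\text{ConsumptionPattern})}$$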
Remember that we are looking to find a WeatherPattern that maximizes this expression. We know the ConsumptionPattern, because it is observed. In other words, we should treat WeatherPattern as a variable, and ConsumptionPattern as a known. Since the denominator P(ConsumptionPattern) is the same for every candidate WeatherPattern, it cannot affect which one wins the argmax. That means maximizing the above expression is just like maximizing the following, simpler expression:
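$$P(\text{ConsumptionPattern} \mid \text{WeatherPattern})\; P(\text{WeatherPattern})$$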

Some quick food for thought before I move on: the probability on the left is sometimes known as the likelihood probability, and the probability on the right is sometimes known as the prior probability.
Now, this looks closer to values we have from our HMM, but we are not all the way there yet! Notice that the above expression is a statement about an entire sequence of states. We need an expression that just talks about states and their transitions.
In order to use our HMM, we need to make some simplifying assumptions:
- Observation Independence Assumption: The probability of an observation (the number of ice cream cones Nade ate that day) depends only on its associated hidden state (the weather that day).
- Markov Assumption: The probability of a hidden state (“Was it Hot or Cold for this day?”) depends only on its previous hidden state (“Was it Hot or Cold the previous day?”).
If we further expand and then simplify “Equation 1 – Generative probability expression” using the above assumptions (I’ll leave this as an exercise for any super motivated readers; feel free to leave a comment below with your findings! I bet other readers will thank you), you may find the following, usable expression:
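$$P(\text{ConsumptionPattern} \mid \text{WeatherPattern})\; P(\text{WeatherPattern}) \;\approx\; \prod_{i=1}^{n} P(c_i \mid w_i)\; P(w_i \mid w_{i-1})$$

(Here $c_i$ is Nade’s cone count on day $i$, $w_i$ is the weather on day $i$, and the $i = 1$ factor $P(w_1 \mid w_0)$ should be read as the start probability of $w_1$.)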
Now, this is a much more manageable expression! Notice:
- The probability expression on the left corresponds to the emission probabilities
- The probability expression on the right corresponds to transition probabilities (when considering days 2 and on) and start probabilities (when considering day 1)
Overall, this looks like:
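$$\widehat{\text{WeatherPattern}} \;=\; \operatorname*{argmax}_{w_1 \ldots w_n} \; \prod_{i=1}^{n} P(c_i \mid w_i)\; P(w_i \mid w_{i-1})$$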
If you have made it this far, congratulations 🙂 We have an equation that we can use to solve HMM decoding tasks.
Solving the decoding task by hand is similar to solving the likelihood task by hand. We can use the following brute force approach:
- Consider all possible 3-day weather sequences: [H, H, H], [H, H, C], [H, C, H], etc.
- For each weather sequence, calculate the probability of that sequence occurring, given that Nade consumed [1🍦, 2🍦, 1🍦] ice cream cones.
- Pick out the sequence that has the highest probability from step #2.
Again, please note that the brute force approach above is impractical in some cases, and in many others altogether impossible. One efficient way of approaching decoding problems is the Viterbi algorithm (a dynamic programming algorithm, for those who know DP :D).
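And for completeness, here is a rough sketch of Viterbi in the same toy Python setup (the function name and structure are mine; a real implementation would typically work in log space to avoid underflow):

```python
def viterbi_decode(obs, states, start_prob, trans_prob, emit_prob):
    """Viterbi algorithm: most likely hidden-state sequence for `obs`."""
    # best[s] = (probability, path) of the best sequence so far ending in state s
    best = {s: (start_prob[s] * emit_prob[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                (best[prev][0] * trans_prob[prev][s], best[prev][1]) for prev in states
            )
            new_best[s] = (prob * emit_prob[s][o], path + [s])
        best = new_best
    prob, path = max(best.values())
    return path, prob

# e.g. viterbi_decode([1, 2, 1], states, start_prob, trans_prob, emit_prob)
# returns the most likely 3-day weather sequence and its joint probability
# with the observations
```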
Conclusion
That’s it for this time, folks! We covered a bunch of important takeaways in this post:
- Hidden Markov Models (and many other machine learning models, as it turns out) can be characterized by three fundamental tasks:
- Learning: Given a set of training data, determine the parameters (specifically: the start, transition, and emission probabilities) of the HMM.
- Likelihood: Given a fully trained HMM, determine the likelihood of a sequence of observed variables.
- Decoding: Given a fully trained HMM and a sequence of observed variables, determine the most likely sequence of hidden states.
- Hidden Markov Models can be used to solve decoding problems by making two important assumptions:
- Observation Independence Assumption: The probability of an observation depends only on its associated hidden state.
- Markov Assumption: The probability of a hidden state depends only on its previous hidden state.
- There are three algorithms that can solve each task more efficiently:
- Learning tasks can be solved using the forward-backward algorithm,
- Likelihood tasks can be solved using the forward algorithm, and
- Decoding tasks can be solved using the Viterbi algorithm.
More Reading
(Book) Jurafsky and Martin. Chapter 8, Hidden Markov Models. Speech and Language Processing, 2nd Ed.
(Video) Jurafsky. Lecture 14, Introduction to N-Grams. Natural Language Processing on Coursera.
Questions? Comments?
Please feel free to leave a question or a comment below. Also, I’d be happy to correct any errata here (I am still a newbie, after all!). Thanks everyone for reading!