Natural Language Processing (NLP) Fundamentals: Three Fundamental Tasks (of Hidden Markov Models [HMMs])

This blog post is part of a series, titled Natural Language Processing (NLP) Fundamentals.

Disclaimer: I am a developer at Amazon.com for their Alexa technologies, on devices such as Amazon Echo. As a consequence, there may be hints of Amazon sprinkled throughout this blog series. However, I should note that this blog series is not sponsored by Amazon.com, and represent my opinions only, and not those of the company.

In my last post, I talked about Hidden Markov Models (HMMs), describing what they are, what they are used for, and how to train one. That’s all well and merry, but I left out some aspects that I think some readers might benefit from.

WhatDidYouHide

In this next post, I will describe fundamental tasks that can be approached using HMMs.

Three Fundamental Tasks of Hidden Markov Models (HMMs)

Hidden Markov Models are traditionally associated with three fundamental tasks: learning, likelihood, and decoding. I’ll describe them a bit formally first:

  1. Learning: Given a set of training data, determine the parameters (specifically: the start, transition, and emission probabilities) of the HMM.
  2. Likelihood: Given a fully trained HMM, determine the likelihood of a sequence of observed variables.
  3. Decoding: Given a fully trained HMM and a sequence of observed variables, determine the most likely sequence of hidden states.

In my humble opinion, these problems are much easier to understand as examples. Recall my previous HMM example about using ice cream consumption to predict the weather. Remember that? Just in case, let me reproduce it below.

A Silly Ice Cream HMM Example

Imagine that you are interested in figuring out the weather in SomeWonderland, for each day from January 1, 2016 to January 10, 2016. While Google or other modern and savvy tech tools do not provide the weather tracking data for SomeWonderland, you have Nade’s diary, which (strangely) details the number of ice cream cones that he consumed each day from January 1, 2016 to January 10, 2016. Can you use Nade’s ice cream cone consumption to determine if a particular day was hot or cold?

IceCream-Intro.jpg

This problem could be approached using an HMM, with the weather as hidden states, and Nade’s ice cream consumption as observations.

HMMs-IceCream1

Learning

Given the following untrained HMM:

HMMs-IceCream1

And the following training data:

  • Many sets of 10-day weather patterns at SomeWonderland.
  • Nade’s ice cream consumption for many days, with the weather for each day.

Find:

  • The start probabilities (of entering the Hot and Cold states; red arrows above),
  • The transition probabilities (of moving to/from the Hot and Cold states; blue arrows above), and
  • The emission[🍦🍦] probabilities (of having a particular number of ice cream cones on either a Hot or Cold day; 🍦🍦s above)

We talked about how to solve learning tasks in my last post on HMMs, but it might be worth revisiting it for a refresher!

In my last post, note that we discussed a brute force approach (essentially simple counting) to solving learning problems. In practice, it is more efficient to employ the forward-backward algorithm.

Likelihood

Given the following trained HMM:

HMMs-IceCream_Trained.jpg

What is the probability of observing Nade consuming [1🍦, 2🍦, 1🍦] ice cream cones?

Here’s one brute force way to solve the problem above:

  1. Consider all possible 3-day weather sequences: [H, H, H], [H, H, C], [H, C, H], etc.
  2. For each 3-day weather sequence, consider the probability of the ice cream consumption sequence, [1🍦, 2🍦, 1🍦].
  3. Add up all probabilities from #2.

The brute force approach above is unpractical in some cases, and in many others altogether impossible. One efficient way at approaching likelihood problems is via the forward algorithm.

Decoding

Given the following trained HMM:

HMMs-IceCream_Trained.jpg

What was the most likely weather pattern for January 1, 2016 to January 3, 2016 if Nade consumed [1🍦, 2🍦, 1🍦] ice cream cones?

Notice:

  • We are looking for the most likely weather pattern, but we are not looking for the probability of that weather pattern occurring.
  • To use the trained HMM, we need a method to calculate the probability of an entire sequence, using the following information:
    • The start probabilities (of entering the Hot and Cold states; red arrows above),
    • The transition probabilities (of moving to/from the Hot and Cold states; blue arrows above), and
    • The emission[🍦🍦] probabilities (of having a particular number of ice cream cones on either a Hot or Cold day; 🍦🍦s above)

In mathematical notation, we can express the problem like this:

IceCream_LikelihoodEquation

I promise this is not as scary as it looks! The argmax simply says “pick out the WeatherPattern that maximizes the argument to the right”. The argument to the right is simply “the probability of a particular weather sequence, considering that we know how much ice cream Nade ate for the corresponding days”.

We have the following probability expression:

IceCream_ProbabilityExpression

Which we can expand using Baye’s theorem,

IceCream_ProbabilityExpressionBayes

Remember that we are looking to find a WeatherPattern that maximizes this expression. We know the ConsumptionPattern, because it is observed. In other words, we should treat WeatherPattern as a variable, and ConsumptionPattern as a known. In this case, maximizing the above expression would be just like maximizing the following, simpler expression:

IceCream_ProbabilityExpressionGenerative
Equation 1 – Generative probability expression.

Some quick food for thought before I move on: the probability to the left is sometimes known as the likelihood probability, an the probability to the right is sometimes known as the prior probability.

Now, this looks closer to values we have from our HMM, but we are not all the way there yet! Notice that the above expression is a statement about an entire sequence of states. We need an expression that just talks about states and their transitions.

In order to use our HMM, we need to make some simplifying assumptions:

  1. Observation Independence Assumption: The probability of an observation (the number of ice cream cones Nade ate that day) depends only on its associated hidden state (the weather that day).
  2. Markov Assumption: The probability of a hidden state (“Was it Hot or Cold for this day?”) depends only on its previous hidden state (“Was it Hot or Cold the previous day?”).

If we further expand and then simplify “Equation 1 – Generative probability expression” using the above assumptions (I’ll leave this as an exercise for any super motivated readers; feel free to leave a comment below with your findings! I bet other readers will thank you), you may find the following, usable expression:

IceCream_HmmExpression.jpg

Now, this is a much more manageable expression! Notice:

  • The probability expression on the left corresponds to emission[🍦🍦] probabilities
  • The probability expression on the right corresponds to transition probabilities (when considering days 2 and on) and start probabilities (when considering day 1)

Overall, this looks like:

IceCream_HmmFinal_Equation

If you have made it this far, congratulations 🙂 We have an equation that we can use to solve HMM decoding tasks.

Solving the decoding task by hand is similar to solving the likelihood task by hand. We can use the following brute force approach:

  1. Consider all possible 3-day weather sequences: [H, H, H], [H, H, C], [H, C, H], etc.
  2. For each weather sequence, calculate the probability of that sequence occurring, given that Nade consumed [1🍦, 2🍦, 1🍦] ice cream cones.
  3. Pick out the sequence that has the highest probability from step #2.

Again, please note that the brute force approach above is unpractical in some cases, and in many others altogether impossible. One efficient way at approaching decoding problems is via the Viterbi algorithm (a dynamic programming algorithm, for those that know DP :D).

Conclusion

That’s it for this time, folks! We covered a bunch of important takeaways in this post:

  • Hidden Markov Models (and many other machine learning models, as it turns out) can be characterized by three fundamental tasks:
    • Learning: Given a set of training data, determine the parameters (specifically: the start, transition, and emission probabilities) of the HMM.
    • Likelihood: Given a fully trained HMM, determine the likelihood of a sequence of observed variables.
    • Decoding: Given a fully trained HMM and a sequence of observed variables, determine the most likely sequence of hidden states.
  • Hidden Markov Models can be used to solve decoding problems by making two important assumptions:
    • Observation Independence Assumption: The probability of an observation depends only on its associated hidden state.
    • Markov Assumption: The probability of a hidden state depends only on its previous hidden state.
  • There are three algorithms that can solve each task more efficiently:

More Reading

 

(Book) Jurafsky and Martin. Chapter 8, Hidden Markov Models. Speech and Language Processing, 2nd Ed.

(Video) Jurafsky. Lecture 14, Introduction to N-Grams. Natural Language Processing on Coursera.

 

Questions? Comments?

Please feel free to leave a question or a comment below. Also, I’d be happy to correct any errata here (I am still a newbie, after all!). Thanks everyone for reading!

1 Comment »

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s