Boston Meetup: Anticipating the Future by Watching Unlabeled Video

Event Synopsis by

In many computer vision applications, machines will need to reason beyond the present, and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down.

Carl Vondrick, Ph.D. candidate at MIT, believes that a promising resource for efficiently obtaining this knowledge is the massive amount of readily available unlabeled video. In his talk, he will present a large-scale framework he recently developed that capitalizes on temporal structure in unlabeled video to anticipate both actions and objects in the future. He experimentally validates this idea on two challenging video datasets, and the results suggest that learning with unlabeled video helps forecast actions and anticipate objects.


  • Humans can use complex understanding (beliefs, affordances, common sense, physics) to anticipate actions. Can computers also anticipate actions from video?
  • Where can we extract this information automatically? From written text and unlabeled videos (YouTube).
  • Here’s the problem Carl tried to solve: Given an unlabeled video, can a computer understand and label what is happening? e.g., “She was __ a __ in a ___ in order to ___” (“She was sitting in a chair in a hospital in order to visit a doctor” is a hypothesis that can be given a score)
  • Text corpus sourced from the web: 1.81 billion webpages, 145 TB of data, freely available
  • Language Models — look at probabilities of n-grams
  • KenLM Language Model Toolkit: it helps answer the question “how likely is this sentence?”, i.e. what probability the language model assigns to it
  • Combining Text and Vision: We can use a factor graph over four variables: object, scene, action, motivation
    • P(“__ a ___ in a __”)
    • perform inference on the factor graph by optimizing an objective function
    • “max marginal” approach: the objective is learned with an SVM and takes into account a weighted set of features (from CV) together with terms, drawn from the web-sourced corpus, that capture the meanings of objects and the relationships between them (up to 3 objects)
  • Prediction space: low level (pixels) to semantics (tasks)
  • Two papers: Ali Sharif Razavian, “CNN Features off-the-shelf: An astounding baseline for recognition” and Jeff Donahue et al., “DeCAF: A Deep Convolutional …”
  • “Will she make the train?”: the future is not deterministic, so considering multiple possible futures can improve the anticipation model.
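
To make the language-model bullets above more concrete, here is a minimal sketch of how a fill-in-the-blank hypothesis such as “She was __ a __ in a ___ in order to ___” could be scored. It is not the method from the talk: a tiny add-one-smoothed bigram model stands in for KenLM, a weighted sum with exhaustive search stands in for the SVM objective and max-marginal inference, and the corpus, the vision probabilities, and all function names are made up for illustration.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy corpus standing in for web-scale text.
CORPUS = [
    "she was sitting in a chair in a hospital in order to visit a doctor",
    "he was sitting in a chair in a hospital in order to visit a doctor",
    "she was reading a book in a library in order to study",
    "she was sitting in a chair in a library in order to study",
]

# Hypothetical classifier outputs for one video frame (made-up numbers).
VISION = {
    "action": {"sitting in": 0.6, "reading": 0.4},
    "object": {"chair": 0.7, "book": 0.3},
    "scene": {"hospital": 0.5, "library": 0.5},
}
# Motivations cannot be seen directly, so they get no vision score here.
MOTIVATIONS = ["visit a doctor", "study"]

TEMPLATE = "she was {action} a {obj} in a {scene} in order to {goal}"


def tokenize(sentence):
    """Split on whitespace and add sentence boundary markers."""
    return ["<s>"] + sentence.split() + ["</s>"]


def train_bigram(corpus):
    """Count context words and bigrams; return them with the vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = tokenize(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def all_hypotheses(model, text_weight=1.0):
    """Yield (score, sentence) for every way of filling the four slots."""
    for action, obj, scene, goal in product(
        VISION["action"], VISION["object"], VISION["scene"], MOTIVATIONS
    ):
        sentence = TEMPLATE.format(action=action, obj=obj, scene=scene, goal=goal)
        vision_score = (
            math.log(VISION["action"][action])
            + math.log(VISION["object"][obj])
            + math.log(VISION["scene"][scene])
        )
        yield text_weight * log_prob(sentence, model) + vision_score, sentence


def best_hypothesis(model, text_weight=1.0):
    """Return the highest-scoring (score, sentence) pair."""
    return max(all_hypotheses(model, text_weight))


if __name__ == "__main__":
    model = train_bigram(CORPUS)
    score, sentence = best_hypothesis(model)
    print(f"{score:.2f}  {sentence}")
```

Because the text score is an unnormalized log-probability, shorter sentences tend to score higher; a real system would need to account for that, and would learn the relative weights of the text and vision terms rather than fix `text_weight` by hand.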
