Aug 30

Dense Trajectories for Action Recognition

Here, I take a qualitative look at one of the most successful video representations in use today: dense trajectories. The original CVPR 2011 paper by Wang, Kläser, Schmid, and Liu is here: Action Recognition by Dense Trajectories.

High Jump with action trajectories

I’m mostly interested in evaluating how well the provided code finds good trajectories for videos from the Olympic Sports dataset, which have lots of camera motion, both deliberate (pan, zoom, etc.) and accidental (shake). The example videos in this post show that the approach is quite vulnerable to this kind of noise, so further preprocessing is probably needed to remove errant trajectories that aren’t firing on the motion of interest. Their “Motion Boundaries” descriptor might help distinguish camera motion from “signal”, but I doubt it will solve the entire problem.
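To make that doubt a bit more concrete, here is a toy sketch of the idea behind a motion-boundary style measurement: take the spatial derivative of the optical flow, so that flow which is constant across the image cancels out. Everything in it is an assumption for illustration (the synthetic flow fields, the helper names `make_flow`, `flow_gradient_x`, and `max_abs`); it is not the descriptor code from the paper.

```python
def make_flow(height, width, pan=0.0, zoom=0.0, actor=None, actor_speed=0.0):
    """Build a synthetic horizontal flow field (hypothetical test data).

    pan:   uniform sideways camera motion added to every pixel
    zoom:  flow growing linearly with distance from the image centre
    actor: optional (x0, x1) column range that moves by actor_speed
    """
    centre_x = (width - 1) / 2.0
    flow = []
    for _ in range(height):
        row = []
        for x in range(width):
            u = pan + zoom * (x - centre_x)
            if actor is not None and actor[0] <= x < actor[1]:
                u += actor_speed
            row.append(u)
        flow.append(row)
    return flow


def flow_gradient_x(flow):
    """Forward difference of the flow along x (one column fewer than the input)."""
    return [[row[c + 1] - row[c] for c in range(len(row) - 1)] for row in flow]


def max_abs(grid):
    """Largest absolute value in a 2D grid."""
    return max(abs(v) for row in grid for v in row)


pan_only = make_flow(4, 8, pan=3.0)
pan_with_actor = make_flow(4, 8, pan=3.0, actor=(2, 5), actor_speed=2.0)
zoom_only = make_flow(4, 8, zoom=0.5)

for name, flow in [("pan only", pan_only),
                   ("pan + actor", pan_with_actor),
                   ("zoom only", zoom_only)]:
    print(name, max_abs(flow_gradient_x(flow)))
```

In this toy setup a pure pan leaves no gradient at all, the moving region shows up at its edges, and a zoom leaves a nonzero gradient everywhere. That last case is one reason to suspect that differentiating the flow will not remove every kind of camera motion.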

Experimental Setup

I use the OpenCV-based C++ implementation of dense trajectories provided by the first author: http://lear.inrialpes.fr/people/wang/dense_trajectories

I ran the code on uncompressed AVI footage of Olympic Sports, using all the default settings. This corresponds to extracting trajectories that are exactly 15 frames long (which is less than a second on the clips I study here). Each trajectory is formed by tracking an interest point. Once a track is lost, a new point is added to take its place, and the algorithm tries to ensure that at least one point is tracked in every 5×5 pixel cell of the image (if I understand the README correctly). Trajectories are discarded if they have moved too little or too much (as measured by variance in the position over time).
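The pruning step in the last sentence can be sketched roughly as follows. This is my own minimal illustration, not the released C++ code: the function names and the `min_std` and `max_std` thresholds are made-up values, and the real tool may apply additional checks.

```python
import math

TRACK_LENGTH = 15  # points per toy trajectory, matching the 15-frame default


def position_std(points):
    """Standard deviation of the x and y positions along one trajectory."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / n
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / n
    return math.sqrt(var_x), math.sqrt(var_y)


def keep_trajectory(points, min_std=1.0, max_std=50.0):
    """Keep a track only if it moved enough, but not erratically.

    min_std and max_std are illustrative assumptions, not the tool's defaults.
    """
    std_x, std_y = position_std(points)
    if std_x < min_std and std_y < min_std:
        return False  # essentially static: moved too little
    if std_x > max_std or std_y > max_std:
        return False  # jumps around: moved too much
    return True


# Three hypothetical tracks, each a list of (x, y) positions.
static = [(100.0, 100.0)] * TRACK_LENGTH
moving = [(100.0 + 2.0 * t, 100.0) for t in range(TRACK_LENGTH)]
erratic = [(100.0 + 200.0 * (t % 2), 100.0) for t in range(TRACK_LENGTH)]

for name, track in [("static", static), ("moving", moving), ("erratic", erratic)]:
    print(name, keep_trajectory(track))
```

Note that a filter like this only looks at how much a point moved, not why it moved, so a steady pan produces tracks that pass it just as easily as a moving limb does.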

I then post-processed the results using a custom Matlab script that overlays the recovered trajectories on the original footage. Because each clip can have thousands of trajectories active at any particular frame, I chose to randomly subsample the trajectories for visualization, so in the results below you’ll see at most 300 trajectories at any time. Otherwise, in some cases it is hard to see the original pixels underneath the traces of the individual trajectories.
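For readers who want to reproduce this kind of subsampling, here is one way it could be done, written in Python rather than Matlab. It is a sketch under my own assumptions: each trajectory is reduced to a hypothetical (start_frame, end_frame) pair, and every trajectory gets a fixed random priority so that the same tracks stay visible from frame to frame instead of flickering.

```python
import random

MAX_VISIBLE = 300  # cap on trajectories drawn per frame


def assign_priorities(num_trajectories, seed=0):
    """Give every trajectory one fixed random priority (lower is drawn first)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_trajectories)]


def visible_at(frame, trajectories, priorities, max_visible=MAX_VISIBLE):
    """Return indices of at most max_visible trajectories active at `frame`.

    trajectories: list of (start_frame, end_frame) pairs, both inclusive.
    """
    active = [i for i, (start, end) in enumerate(trajectories)
              if start <= frame <= end]
    active.sort(key=lambda i: priorities[i])
    return sorted(active[:max_visible])


# Hypothetical clip: 5000 trajectories of 15 frames with random start frames.
rng = random.Random(1)
trajectories = []
for _ in range(5000):
    start = rng.randrange(0, 100)
    trajectories.append((start, start + 14))

priorities = assign_priorities(len(trajectories))
shown = visible_at(50, trajectories, priorities)
print(len(shown), "trajectories drawn at frame 50")
```

Drawing a fresh random sample on every frame would also respect the cap, but a trajectory could then vanish and reappear between frames, which makes the traces harder to follow.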

Again, I emphasize that I’m not attempting to study the descriptive power of these trajectories here, only to review qualitatively how well the algorithm holds up to camera motion. Essentially, I want to know whether the recovered trajectories nicely follow the “signal” in the image stream (e.g. moving limbs, thrown objects), or whether they track noise (panning, zooming, etc.).


Snatch: For this simple clip, the trajectories look great.


Basketball Layup: another good result, though the ball gets little coverage while it is in the air.


Unfortunately, the clips above were taken with high-res cameras that stay still, and the story does not hold up in other scenarios. In particular, when significant camera motion occurs, the results don’t look so great.

For example, here are the trajectories found for another Basketball Layup clip. Small movements in camera position cause many artifacts, though the player’s main movements are (more or less) covered.


Shot Put: shaky zoom causes scattered, noisy trajectories, with hardly any coverage of the thrown object.


Discus Throw: zoom triggers lots of trajectories


Long Jump: significant sideways panning makes for terrible trajectories


High Jump: again, panning trajectories dominate the real signal


Platform Dive: awful lighting and shaky, zoomed-in camera yield horrendous noise artifacts

