In this tutorial, we dive into the fundamentals of Optical Flow, look at some of its applications and implement its two main variants (sparse and dense). We also briefly discuss more recent approaches using deep learning and promising future directions.

Recent breakthroughs in computer vision research have allowed machines to perceive their surroundings through techniques such as object detection, which finds instances of objects belonging to a certain class, and semantic segmentation, which classifies every pixel.

However, when processing real-time video input, most implementations of these techniques only consider relationships between objects within the same frame $(x, y)$ and disregard time information $(t)$. In other words, they evaluate each frame independently, as if the frames were completely unrelated images. But what if we do need the relationships between consecutive frames? For example, we may want to track the motion of vehicles across frames to estimate their current velocity and predict their position in the next frame.

Figure: Sparse optical flow of traffic (each arrow points in the direction of the predicted flow of the corresponding pixel).

Or, alternatively, what if we require information on human pose relationships between consecutive frames to recognize human actions such as archery, baseball, and basketball?

In this tutorial, we will learn what Optical Flow is, how to implement its two main variants (sparse and dense), and also get a big picture of more recent approaches involving deep learning and promising future directions.

## What is optical flow?

Let us begin with a high-level understanding of optical flow. Optical flow is the motion of objects between consecutive frames of a sequence, caused by the relative movement between the object and the camera. The problem of optical flow may be expressed as:

$$I(x, y, t) = I(x + dx, y + dy, t + dt)$$

Between consecutive frames, the image intensity $(I)$ is expressed as a function of space $(x, y)$ and time $(t)$. In other words, if we take the first image $I(x, y, t)$ and move its pixels by $(dx, dy)$ over $dt$ time, we obtain the new image $I(x + dx, y + dy, t + dt)$.

First, we assume that pixel intensities of an object are constant between consecutive frames.

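Second, we take the first-order Taylor series approximation of the RHS, substitute it into the brightness constancy assumption, and cancel the common term $I(x, y, t)$:

$$I(x + dx, y + dy, t + dt) \approx I(x, y, t) + \frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt$$

$$\frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt = 0$$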

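Third, we divide by $dt$ to derive the optical flow equation:

$$\frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t} = 0$$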

where $u = dx/dt$ and $v = dy/dt$.

$\partial I/\partial x$, $\partial I/\partial y$, and $\partial I/\partial t$ are the image gradients along the horizontal axis, the vertical axis, and time. Hence, the problem of optical flow is to solve for $u$ ($dx/dt$) and $v$ ($dy/dt$) to determine movement over time. Note that we cannot solve the optical flow equation for $u$ and $v$ directly, since there is only one equation for two unknowns. We will implement methods such as the Lucas-Kanade method to address this issue.
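We can check the derivation numerically. The NumPy sketch below is our own illustration, not code from the tutorial repository. It moves a synthetic Gaussian blob by a known sub-pixel motion $(u, v)$ and estimates $I_x$, $I_y$, and $I_t$ with finite differences. It then measures how close $I_x u + I_y v + I_t$ comes to zero. The helper names `blob` and `constraint_residual`, the blob width, and the default motion are arbitrary choices.

```python
import numpy as np


def blob(shape, center, sigma):
    """Smooth synthetic frame: a Gaussian blob centred at (cx, cy)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(np.float64)
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def constraint_residual(u=0.5, v=0.3, sigma=5.0, shape=(64, 64)):
    """Relative size of I_x*u + I_y*v + I_t for a blob moving by (u, v) per frame."""
    frame0 = blob(shape, (32.0, 32.0), sigma)          # I(x, y, t)
    frame1 = blob(shape, (32.0 + u, 32.0 + v), sigma)  # I(x, y, t + 1)

    # np.gradient returns (d/drow, d/dcol) = (d/dy, d/dx); average the two frames.
    gy0, gx0 = np.gradient(frame0)
    gy1, gx1 = np.gradient(frame1)
    ix = 0.5 * (gx0 + gx1)
    iy = 0.5 * (gy0 + gy1)
    it = frame1 - frame0  # temporal derivative with dt = 1 frame

    residual = ix * u + iy * v + it
    return np.linalg.norm(residual) / np.linalg.norm(it)


print(f"relative residual: {constraint_residual():.4f}")
```

A small relative residual shows that the linearized constraint holds for small motions. At any single pixel, however, it remains one equation in two unknowns. Every $(u, v)$ on the line $I_x u + I_y v = -I_t$ satisfies it, which is why extra assumptions are needed.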

### Sparse vs Dense Optical Flow

Sparse optical flow gives the flow vectors of some "interesting features" (say, a few pixels depicting the edges or corners of an object) within the frame. Dense optical flow gives the flow vectors of the entire frame (all pixels), up to one flow vector per pixel. As you might guess, dense optical flow has higher accuracy at the cost of being slower and more computationally expensive.

Figure: Left: sparse optical flow tracks a few "feature" pixels. Right: dense optical flow estimates the flow of all pixels in the image.

## Implementing Sparse Optical Flow

Sparse optical flow selects a sparse set of feature pixels (e.g. interesting features such as edges and corners) and tracks their velocity vectors (motion). The extracted features are passed to the optical flow function from frame to frame, so the same points are tracked throughout. There are various methods for estimating optical flow, including the Lucas–Kanade method, the Horn–Schunck method, the Buxton–Buxton method, and more. For sparse optical flow we will use the Lucas–Kanade method with OpenCV, an open source library of computer vision algorithms.

### 1. Setting up your environment

If you do not already have OpenCV installed, open a terminal and run:

```shell
pip install opencv-python
```

Now, clone the tutorial repository by running:

```shell
git clone https://github.com/chuanenlin/optical-flow.git
```

Next, open sparse-starter.py in your text editor. We will be writing all of the code in this Python file.

### 4. Shi-Tomasi Corner Detector - selecting the pixels to track

For sparse optical flow, we only track the motion of a set of feature pixels. Features in images are points of interest that carry rich image content information. Good features to track are points that remain distinctive under translation, rotation, and intensity changes, such as corners.

The Shi-Tomasi Corner Detector is very similar to the popular Harris Corner Detector, which works in three steps:

1. Determine windows (small image patches) with large gradients (variations in image intensity) when translated in both $x$ and $y$ directions.
2. For each window, compute a score $R$.
3. Depending on the value of $R$, classify each window as flat, edge, or corner.

For a step-by-step mathematical explanation of the Harris Corner Detector, feel free to go through these slides.

Shi and Tomasi later made a small but effective modification to the Harris Corner Detector in their paper Good Features to Track.

The modification is to the equation used to calculate the score $R$. In the Harris Corner Detector, the scoring function is given by:

$$\begin{aligned} R &= \det M - k(\operatorname{trace} M)^{2} \\ \det M &= \lambda_{1} \lambda_{2} \\ \operatorname{trace} M &= \lambda_{1} + \lambda_{2} \end{aligned}$$

Here $M$ is the 2×2 matrix of summed products of image gradients over the window, and $\lambda_1$ and $\lambda_2$ are its eigenvalues. $k$ is an empirical constant, commonly 0.04 to 0.06.

Shi and Tomasi instead proposed the scoring function:

$$R=\min \left(\lambda_{1}, \lambda_{2}\right)$$

A window is classified as a corner if $R$ is greater than a threshold. The figure below compares the scoring functions of Harris (left) and Shi-Tomasi (right) in $\lambda_1$-$\lambda_2$ space.

Figure: Comparison of Harris and Shi-Tomasi scoring functions in $\lambda_1$-$\lambda_2$ space.

For Shi-Tomasi, a window is classified as a corner only when both $\lambda_1$ and $\lambda_2$ are above a minimum threshold $\lambda_{\min}$.
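The NumPy sketch below shows how the two scores behave. It assumes a plain box window and `np.gradient` finite differences; OpenCV uses Sobel derivatives, so its absolute values will differ. The sketch builds the 2×2 matrix $M$ of summed gradient products for a corner, an edge, and a flat window of a synthetic square. It then evaluates the Harris score with $k = 0.04$ and the Shi-Tomasi score $\min(\lambda_1, \lambda_2)$. The helper names are hypothetical.

```python
import numpy as np


def structure_tensor(image, x, y, half=2):
    """2x2 matrix M of gradient products summed over a (2*half+1)^2 box window at (x, y)."""
    gy, gx = np.gradient(image.astype(np.float64))
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    ixx = np.sum(gx[win] ** 2)
    iyy = np.sum(gy[win] ** 2)
    ixy = np.sum(gx[win] * gy[win])
    return np.array([[ixx, ixy], [ixy, iyy]])


def harris_score(m, k=0.04):
    """R = det M - k (trace M)^2."""
    return np.linalg.det(m) - k * np.trace(m) ** 2


def shi_tomasi_score(m):
    """R = min(lambda1, lambda2); eigvalsh returns eigenvalues in ascending order."""
    return np.linalg.eigvalsh(m)[0]


# Synthetic image: a bright square (rows/cols 20..39) on a dark background.
img = np.zeros((60, 60))
img[20:40, 20:40] = 1.0

points = {"corner": (20, 20), "edge": (20, 30), "flat": (30, 30)}  # (x, y)
for name, (x, y) in points.items():
    m = structure_tensor(img, x, y)
    print(f"{name:6s} harris={harris_score(m):8.3f} shi_tomasi={shi_tomasi_score(m):6.3f}")
```

At the edge window one eigenvalue is zero, so the Shi-Tomasi score is zero and the Harris score is negative. Only the corner window scores high under both functions.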

OpenCV implements Shi-Tomasi as `goodFeaturesToTrack()`; see the OpenCV documentation for its full parameter list.

### Tracking Specific Objects

There may be scenarios where you only want to track a specific object of interest (say, a certain person) or one category of objects (such as all two-wheelers in traffic). You can modify the code to track only the pixels of the object(s) you want by changing the `prev` variable.

You can also combine Object Detection with this method to only estimate the flow of pixels within the detected bounding boxes. This way you can track all objects of a particular type/category in the video.

### 5. Lucas-Kanade: Sparse Optical Flow

Lucas and Kanade proposed an effective technique to estimate the motion of interesting features by comparing two consecutive frames in their paper An Iterative Image Registration Technique with an Application to Stereo Vision. The Lucas-Kanade method works under the following assumptions:

1. Two consecutive frames are separated by a small time increment ($dt$) such that objects are not displaced significantly (in other words, the method works best with slow-moving objects).
2. A frame portrays a “natural” scene with textured objects exhibiting shades of gray that change smoothly.

First, under these assumptions, we can take a small 3x3 window (neighborhood) around the features detected by Shi-Tomasi and assume that all nine points have the same motion.

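This may be represented as

$$\begin{aligned} I_x(q_1)V_x + I_y(q_1)V_y &= -I_t(q_1) \\ I_x(q_2)V_x + I_y(q_2)V_y &= -I_t(q_2) \\ &\;\;\vdots \\ I_x(q_n)V_x + I_y(q_n)V_y &= -I_t(q_n) \end{aligned}$$

where $q_1, q_2, …, q_n$ denote the pixels inside the window (e.g. $n = 9$ for a 3x3 window) and $I_x(q_i)$, $I_y(q_i)$, and $I_t(q_i)$ denote the partial derivatives of image $I$ with respect to position $(x, y)$ and time $t$, for pixel $q_i$ at the current time.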

This is just the optical flow equation (described earlier) for each of the $n$ pixels.

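The set of equations may be represented in the matrix form $Av = b$, where:

$$A = \begin{bmatrix} I_x(q_1) & I_y(q_1) \\ I_x(q_2) & I_y(q_2) \\ \vdots & \vdots \\ I_x(q_n) & I_y(q_n) \end{bmatrix}, \quad v = \begin{bmatrix} V_x \\ V_y \end{bmatrix}, \quad b = \begin{bmatrix} -I_t(q_1) \\ -I_t(q_2) \\ \vdots \\ -I_t(q_n) \end{bmatrix}$$

Note that previously (see the "What is optical flow?" section), we had to solve for two unknowns with only one equation. We now have nine equations for two unknowns ($V_x$ and $V_y$), so the system is over-determined.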

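Second, to resolve the over-determined system, we apply least squares fitting, which yields the following two-equation, two-unknown problem:

$$\begin{bmatrix} V_x \\ V_y \end{bmatrix} = \begin{bmatrix} \sum_i I_x(q_i)^2 & \sum_i I_x(q_i) I_y(q_i) \\ \sum_i I_x(q_i) I_y(q_i) & \sum_i I_y(q_i)^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_i I_x(q_i) I_t(q_i) \\ -\sum_i I_y(q_i) I_t(q_i) \end{bmatrix}$$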

where $V_x = u = dx/dt$ denotes the movement of $x$ over time and $V_y = v = dy/dt$ denotes the movement of $y$ over time. Solving for the two variables completes the optical flow problem.
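The least-squares step takes only a few lines of NumPy. This is a sketch of a single-window Lucas-Kanade solve with no pyramids and no iteration. It assumes the gradients $I_x$, $I_y$, $I_t$ have already been sampled over the window. It then solves the normal equations $A^T A v = A^T b$ with $A = [I_x \; I_y]$ and $b = -I_t$. The function name and the eigenvalue threshold `min_eig` are our own choices. The check uses a synthetic Gaussian blob moved by a known $(u, v)$.

```python
import numpy as np


def lucas_kanade_window(ix, iy, it, min_eig=1e-6):
    """Least-squares (V_x, V_y) for one window, given samples of I_x, I_y, I_t."""
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)  # A: n x 2
    b = -it.ravel()                                  # b: length n
    ata = a.T @ a                                    # 2 x 2 normal matrix A^T A
    if np.linalg.eigvalsh(ata)[0] < min_eig:
        # Texture in fewer than two directions: the system is (near) singular.
        raise ValueError("window is not well-conditioned for Lucas-Kanade")
    return np.linalg.solve(ata, a.T @ b)


# Synthetic check: a Gaussian blob moved by a known sub-pixel motion.
ys, xs = np.mgrid[0:64, 0:64].astype(np.float64)


def frame(cx, cy, sigma=5.0):
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


true_u, true_v = 0.4, -0.2
f0, f1 = frame(32.0, 32.0), frame(32.0 + true_u, 32.0 + true_v)
gy0, gx0 = np.gradient(f0)
gy1, gx1 = np.gradient(f1)
ix = 0.5 * (gx0 + gx1)  # spatial gradients averaged over both frames
iy = 0.5 * (gy0 + gy1)
it = f1 - f0            # temporal gradient, dt = 1 frame

win = (slice(25, 40), slice(25, 40))  # 15x15 window centred on the blob
u, v = lucas_kanade_window(ix[win], iy[win], it[win])
print(f"estimated (u, v) = ({u:.3f}, {v:.3f})")
```

$A^T A$ is the same kind of 2×2 gradient matrix that the Shi-Tomasi score thresholds. That is why Shi-Tomasi corners make good points to track: when both eigenvalues are large, this system is well-conditioned.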

In a nutshell, we identify some interesting features to track and iteratively compute the optical flow vectors of these points. However, the basic Lucas-Kanade method only works for small movements (from our initial assumption) and fails when there is large motion. The OpenCV implementation of the Lucas-Kanade method therefore uses image pyramids.

At a high level, each level up the pyramid is a lower-resolution copy of the frame. Large motions become small motions that Lucas-Kanade can handle, and small motions become negligible. The flow estimated at a coarse level is then refined at each finer level, so optical flow is computed across scales. Bouguet's notes give a comprehensive mathematical explanation of OpenCV's implementation. OpenCV exposes pyramidal Lucas-Kanade as `calcOpticalFlowPyrLK()`; see its documentation for the full parameter list.

### 6. Visualizing

And that’s it! Open a terminal and run

```shell
python sparse-starter.py
```

to test your sparse optical flow implementation. 👏

In case you have missed any code, the full code can be found in sparse-solution.py.

## Implementing Dense Optical Flow

We’ve previously computed the optical flow for a sparse set of feature pixels. Dense optical flow attempts to compute the optical flow vector for every pixel of each frame. Such computation may be slower, but it gives a more accurate and denser result. That makes it suitable for applications such as learning structure from motion and video segmentation. There are various implementations of dense optical flow. We will use the Farneback method, one of the most popular, with OpenCV.

### 1. Setting up your environment

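If you have not already done so, install OpenCV and clone the tutorial repository as described in the sparse optical flow section. Then open dense-starter.py in your text editor. We will be writing all of the code in this Python file.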

### 4. Farneback Optical Flow

Gunnar Farneback proposed an effective technique to estimate dense motion between two consecutive frames in his paper Two-Frame Motion Estimation Based on Polynomial Expansion.

First, the method approximates the neighborhood (window) of each pixel in both frames by a quadratic polynomial, using the polynomial expansion transform. See the Lucas-Kanade section of the sparse optical flow implementation for more on windows. Second, it observes how a polynomial transforms under translation (motion) and uses this to estimate displacement fields from the polynomial expansion coefficients. After a series of refinements, dense optical flow is computed. Farneback's paper is fairly concise and straightforward to follow. I highly recommend reading it if you would like a deeper understanding of its mathematical derivation.

Figure: Dense optical flow of three pedestrians walking in different directions.

OpenCV's implementation, `calcOpticalFlowFarneback()`, returns a 2-channel array of flow vectors $(dx/dt, dy/dt)$, one per pixel, which solves the optical flow problem. For visualization, the magnitude and direction of each flow vector are computed and mapped to the HSV color representation. The angle (direction) of flow is encoded as hue and the distance (magnitude) of flow as value. Saturation is always set to its maximum of 255 for optimal visibility. See the OpenCV documentation of `calcOpticalFlowFarneback()` for its full parameter list.

### 5. Visualizing

And that’s it! Open a terminal and run

```shell
python dense-starter.py
```

to test your dense optical flow implementation. 👏

In case you have missed any code, the full code can be found in dense-solution.py.

## Optical Flow using Deep Learning

While optical flow has historically been treated as an optimization problem, recent deep learning approaches have shown impressive results. Such approaches generally take two video frames as input and output the optical flow as a colour-coded image.

Figure: Output of a deep learning model: a colour-coded image in which colour encodes the direction of each pixel's motion and intensity indicates its speed.

This may be expressed as:

$$(u, v) = f(I_{t-1}, I_t)$$

where $u$ is the motion in the $x$ direction and $v$ is the motion in the $y$ direction. $f$ is a neural network that takes two consecutive frames as input: $I_{t-1}$ (the frame at time $t-1$) and $I_t$ (the frame at time $t$).

Figure: Architecture of FlowNetCorr, a convolutional neural network for end-to-end learning of optical flow.

Computing optical flow with deep neural networks requires large amounts of training data, which is particularly hard to obtain. Labeling video footage for optical flow requires determining the exact motion of every point of an image to subpixel accuracy. To solve this labeling problem, researchers used computer graphics to simulate massive realistic worlds. Because these worlds are generated programmatically, the motion of every point of an image in a video sequence is known. Examples include MPI-Sintel, an open-source CGI movie with optical flow labels rendered for various sequences. Another is Flying Chairs, a dataset of many chairs flying across random backgrounds, also with optical flow labels.

Figure: Synthetically generated data for training optical flow models: MPI-Sintel dataset.

Figure: Synthetically generated data for training optical flow models: Flying Chairs dataset.

Solving optical flow with deep learning is a very active research topic, with variants of FlowNet, SPyNet, PWC-Net, and more each outperforming one another on various benchmarks.

### Optical Flow application: Semantic Segmentation

The optical flow field is a rich source of information about the observed scene. As techniques for accurately determining optical flow improve, it is interesting to see optical flow applied in conjunction with other fundamental computer vision tasks. For example, semantic segmentation divides an image into a series of regions corresponding to unique object classes. Yet objects placed close together with identical textures are often difficult for single-frame segmentation techniques to separate. If the objects move independently, however, their distinct motions can be highly helpful: discontinuities in the dense optical flow field correspond to boundaries between objects.
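As a rough illustration of that last point, the NumPy sketch below flags pixels where a dense flow field changes abruptly. This is a simple heuristic for motion boundaries, not a segmentation method from the literature. The function name `motion_boundaries` and the threshold value of 0.5 are arbitrary assumptions.

```python
import numpy as np


def motion_boundaries(flow, threshold=0.5):
    """Boolean mask of pixels where the flow field changes abruptly.

    flow: (H, W, 2) array of per-pixel (u, v). A large spatial gradient of
    either flow component is treated as a candidate object boundary.
    """
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    strength = np.sqrt(du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2)
    return strength > threshold


# Two regions: left half static, right half moving 2 px/frame to the right.
flow = np.zeros((6, 10, 2))
flow[:, 5:, 0] = 2.0
mask = motion_boundaries(flow)
print(np.where(mask.any(axis=0))[0])  # columns flagged as a boundary
```

The boundary columns are the ones where the two motions meet. A real pipeline would combine such cues with appearance-based segmentation.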