The idea is to account for incidental variances to achieve a smooth, yet
extremely fast-forwarded view of the subject. Specifically, the variances are:

- Face position: The position, scale, and rotation of the face within the frame.
- Lighting: Differences in illumination colour and/or white balance.
- Pose: Differences in facial pose, and in lighting direction.
As usual, I’m attacking this problem in Python. The code is relatively short
(~250 lines), although I’m using dlib,
OpenCV and numpy to do the heavy
lifting. Source code is available.
As a first step, let’s account for facial position by rotating, translating and
scaling images to match the first. Here’s the code to do that:
Here, the get_landmarks() function uses dlib to extract facial landmark positions:
…and orthogonal_procrustes() generates a transformation matrix which maps
one set of landmark features onto another. This transformation is used by
warp_im() to translate, rotate and scale images to line up with the first
image. For more details refer to my Switching Eds post which uses an identical approach in steps 1 and 2
to align the images.
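For illustration, here’s a minimal sketch of how such an orthogonal Procrustes alignment can be computed with numpy. The function name mirrors the one above, but this is a reconstruction under my own conventions, not the post’s actual code:

```python
import numpy as np

def orthogonal_procrustes(points1, points2):
    """Return a 2x3 affine matrix [s*R | t] mapping points2 onto points1.

    Solves the orthogonal Procrustes problem: find the scale, rotation
    and translation minimising the squared distance between the two
    landmark sets.
    """
    points1 = points1.astype(np.float64)
    points2 = points2.astype(np.float64)

    # Centre both point sets on their means.
    c1, c2 = points1.mean(axis=0), points2.mean(axis=0)
    points1 -= c1
    points2 -= c2

    # Normalise overall scale.
    s1, s2 = points1.std(), points2.std()
    points1 /= s1
    points2 /= s2

    # SVD of the cross-covariance gives the optimal rotation.
    u, _, vt = np.linalg.svd(points1.T @ points2)
    r = u @ vt

    scale = s1 / s2
    t = c1 - scale * (r @ c2)
    return np.hstack([scale * r, t[:, None]])
```

The resulting 2×3 matrix is in the form cv2.warpAffine() expects, so a warp_im() built on top of it can be little more than a call to that function.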
After correcting for face position, you get a video that looks like this:
There are still a few obvious discontinuities. One variance we can easily iron
out is the overall change in colour on the face due to different lighting
and/or white balance settings.
The correction works by computing a mask for each image:
…based on the convex hull of the
landmark points. This is then multiplied by the image itself:
The sum of the pixels in the masked image is then divided by the sum of the
values in the mask, to give an average colour for the face:
Images’ RGB values are then scaled such that each image’s average face colour
matches that of the first image:
…where ref_color is the colour of the first face, saved from the first frame.
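As an illustrative sketch of this colour-correction step (using scipy’s Delaunay triangulation for the point-in-hull test rather than OpenCV’s convex hull routines; the function names are mine, not the post’s):

```python
import numpy as np
from scipy.spatial import Delaunay

def face_mask(shape, landmarks):
    # 0/1 mask covering the convex hull of the (x, y) landmark points.
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    pts = np.column_stack([xs.ravel(), ys.ravel()])
    inside = Delaunay(landmarks).find_simplex(pts) >= 0
    return inside.reshape(shape).astype(np.float64)

def colour_correct(im, mask, ref_colour):
    # Average face colour: sum of masked pixels / sum of mask values.
    face_colour = (im * mask[..., None]).sum(axis=(0, 1)) / mask.sum()
    # Scale each channel so the average matches the reference colour.
    return im * (ref_colour / face_colour)
```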
Here’s the first 5 seconds with colour correction applied:
The above is looking pretty good, but there are still some issues causing a
lack of smoothness:

- Minor variations in facial pose.
- Changes in lighting direction.
Given that these perturbations are more or less random for each frame, the best
we can do is select a subset of frames that is in some sense smooth.
To solve this I went the graph theory route:
Here I’ve split the video into layers of 10 frames each, with full connections
from each layer to the next. The weight of each edge measures how different the
two frames are, with the goal being to find the shortest path from Start to
End; frames on the selected path are used in the output video. By doing this
the total “frame difference” is minimized. Because the path length is fixed by
the graph structure, this is equivalent to minimizing the average frame
difference.
The metric used for frame difference is the Euclidean norm of the difference
between the pair of images, after being masked to the face area. Here’s the
code to calculate weights, a dict of dicts where weights[n1][n2] gives the
weight of the edge between nodes n1 and n2:
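The post’s actual code isn’t reproduced here, but as a sketch of the overall approach, the layered shortest-path search can be written as a simple dynamic program rather than via an explicit graph library. The function and parameter names below (select_frames, layer_size) are hypothetical, and frames/masks are assumed to be lists of numpy arrays:

```python
import numpy as np

def select_frames(frames, masks, layer_size=10):
    # Edge weight: Euclidean norm between two face-masked images.
    def weight(i, j):
        return np.linalg.norm(frames[i] * masks[i] - frames[j] * masks[j])

    # Split frame indices into consecutive layers of `layer_size`.
    layers = [list(range(i, min(i + layer_size, len(frames))))
              for i in range(0, len(frames), layer_size)]

    # Dynamic-programming shortest path through the layered graph:
    # cost[n] is the cheapest path cost from the start to frame n.
    cost = {n: 0.0 for n in layers[0]}
    back = {}
    for prev, cur in zip(layers, layers[1:]):
        for n2 in cur:
            best = min(prev, key=lambda n1: cost[n1] + weight(n1, n2))
            cost[n2] = cost[best] + weight(best, n2)
            back[n2] = best

    # Trace back from the cheapest end node to recover the frame path.
    end = min(layers[-1], key=cost.get)
    path = [end]
    while path[-1] in back:
        path.append(back[path[-1]])
    return path[::-1]
```

Because every layer contributes exactly one frame to the path, minimizing the total path cost is the same as minimizing the average per-step frame difference, as described above.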