Color-based Skin Tracking Under Varying Illumination:
Approach

The following is a brief overview of our approach to color-based skin tracking under varying illumination. Readers are referred to the technical report for a detailed description of the approach.

Introduction

Locating and tracking patches of skin-colored pixels through an image sequence is a tool used in many face recognition and gesture tracking systems. An important challenge of any skin-color tracking system is to accommodate varying illumination conditions that may occur within an image sequence. Some robustness may be achieved via the use of luminance invariant color-spaces [1,2]; however, this method can withstand only changes that skin-color distributions undergo within a narrow set of conditions.

The conditions that we are concerned with in this paper are broader than those assumed in many previous systems. In particular, we are concerned with three conditions: (1) time-varying illumination, (2) multiple sources with time-varying illumination, and (3) single or multiple colored sources. Most previous skin segmentation and tracking systems address only condition 1, defined over a narrow range (white light). Nevertheless, conditions 2 and 3 are also important, and have to be addressed in order to build a general-purpose skin-color tracker. We now list a few common scenarios in which one or more of these conditions arise.

Consider a person driving a car at night. Illumination from street lights and traffic lights will be at least in part responsible for the color appearance of his/her skin. Hence if we want to build a skin color tracking system that would be used in surveilling the driver, we need to account for varying illuminant intensity and color.

Skin-color person tracking is also useful in indexing multimedia content such as movies. In this case, multiple colored lights with varying intensity play a direct role, since many movies are filmed with theatrical lighting to dramatize the effects of the screenplay.

Still another example of time-varying color illuminant is apparent in observing a person walking down a corridor with windows or lights that are significantly spaced apart. The color appearance of the person's skin will smoothly change as they move towards and then away from various light sources along the corridor.

Finally, it should be noted that it is not necessary to have colored lights to achieve effects equivalent to those that occur with colored lighting. Equivalent effects commonly arise due to surface inter-reflectance. For instance, consider a person walking down a corridor that has colored walls and/or carpet, or a person wearing colorful clothing. These surfaces reflect a color tinge onto the person's skin.

These are a few examples of applications that motivate our approach. Even though we agree that the majority of everyday lighting effects are due to white light attenuation, we hold that it is important to consider alternatives as well, in order to have a robust skin-color tracker that can handle a wider variety of environmental conditions.

To address these issues, we have developed a new technique that allows for a more general representation of skin-color. An explicit second order Markov model is used to predict evolution of the skin color distribution over time. Histograms are dynamically updated based on feedback from the current segmentation and based on predictions of the Markov model. The parameters of the discrete-time dynamic Markov model are estimated using Maximum Likelihood Estimation, and also evolve over time. Quantitative evaluation of the method was conducted on labeled ground-truth video sequences taken from popular movies, and the results are encouraging.

Overview of Approach

The goal is to track a moving skin-color distribution as defined by an adaptive color histogram in color space. Tracking is done by predicting the future parameters of the distribution and applying a warping on the distribution based on those predictions. The algorithm has three stages: initialization, learning, and then steady-state prediction/tracking.

The initialization stage segments the first frame of the image sequence to give an initial estimate for the skin-color distribution to be tracked. This is done by using a two-class Bayes classifier. The prior histograms used for classification are precomputed off-line using the database provided by Jones and Rehg. The resulting crude estimate is then refined with binary image processing. The final result of the initialization phase is the binary mask for the skin-color regions to be tracked.
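
The classification step can be sketched as follows. This is only a minimal illustration, assuming RGB histograms with 32 bins per channel and an equal class prior; the function names, bin count, and prior value are hypothetical and are not taken from the technical report.

```python
import numpy as np

BINS = 32  # assumed histogram resolution per RGB channel


def rgb_histogram(pixels, bins=BINS):
    """Normalized 3-D color histogram of an (N, 3) uint8 pixel array."""
    idx = (pixels.astype(np.int64) * bins) // 256
    hist = np.zeros((bins, bins, bins), dtype=np.float64)
    # Accumulate one count per pixel, including repeated bin indices.
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return hist / max(hist.sum(), 1.0)


def classify_skin(image, skin_hist, nonskin_hist, prior_skin=0.5, bins=BINS):
    """Mark a pixel as skin when P(c|skin) P(skin) > P(c|non-skin) P(non-skin)."""
    idx = (image.reshape(-1, 3).astype(np.int64) * bins) // 256
    r, g, b = idx[:, 0], idx[:, 1], idx[:, 2]
    p_skin = skin_hist[r, g, b] * prior_skin
    p_non = nonskin_hist[r, g, b] * (1.0 - prior_skin)
    return (p_skin > p_non).reshape(image.shape[:2])
```

The boolean mask returned by `classify_skin` corresponds to the crude estimate described above; the binary image processing used to refine it is not shown.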

The learning stage uses an EM process over the first few frames in the video sequence. At each frame, the expectation step is histogram-based segmentation and the maximization step is histogram adaptation. This process defines the evolution of the distribution in discrete time. The evolution of the distribution is implicitly defined in terms of translation, rotation, and scaling of the samples in color space. The transformation parameters are easily estimated via standard statistical methods. Given the evolution of the parameters, we can estimate the motion model for the distribution, and hence predict further deformations. The motion model that we use for the predictions is a second-order discrete-time Markov model. The Markov model parameters are estimated by maximum likelihood estimation.
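
As a rough illustration of the motion-model step, the sketch below fits a second-order model to a sequence of per-frame parameter vectors by least squares. Least squares coincides with the maximum likelihood estimate only under an assumed Gaussian noise model. The function names, the constant offset term, and the stacking of the transformation parameters into one vector per frame are assumptions made for illustration, not the formulation used in the technical report.

```python
import numpy as np


def fit_second_order_model(params):
    """Least-squares fit of x_t = A1 x_{t-1} + A2 x_{t-2} + c.

    params: (T, d) array with one parameter vector per frame, T >= 3.
    Returns A1 (d, d), A2 (d, d), and the offset c (d,).
    """
    x = np.asarray(params, dtype=np.float64)
    T, d = x.shape
    # Each design row holds [x_{t-1}, x_{t-2}, 1] for t = 2 .. T-1.
    design = np.hstack([x[1:-1], x[:-2], np.ones((T - 2, 1))])
    target = x[2:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    A1 = coeffs[:d].T
    A2 = coeffs[d:2 * d].T
    c = coeffs[2 * d]
    return A1, A2, c


def predict_next(A1, A2, c, x_prev, x_prev2):
    """One-step prediction from the two most recent parameter vectors."""
    return A1 @ x_prev + A2 @ x_prev2 + c
```

Re-estimating the model over time could be approximated by calling `fit_second_order_model` again on a sliding window of recent frames, although that windowing scheme is also an assumption.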

Once a motion model is learned, we proceed to the prediction/tracking stage. At this stage, in addition to segmentation and distribution estimation, changes in translation, scaling, and rotation of the distribution are predicted given the Markov model estimated in the learning stage. The parameters of the Markov model are re-estimated over time as well. By predicting parametric changes, we can get a better estimate of the true distribution at the next time step. Even though adaptive histograms are used for segmentation, we cannot apply the predictions to the histograms directly due to problems with resolution and sampling. Instead, the predictions are propagated via a transformation applied to the samples directly. The newly transformed samples are used to estimate the histogram at the next frame.
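
One way to picture the sample-level warping is sketched below. It assumes that translation is taken from the sample mean, scaling from the square roots of the covariance eigenvalues, and rotation from the covariance eigenvectors; this is one common choice of "standard statistical methods" and may differ from the parameterization in the technical report. All function names and the 32-bin histogram are hypothetical.

```python
import numpy as np


def distribution_params(samples):
    """Mean, per-axis scales, and principal axes of an (N, 3) sample cloud."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    scales = np.sqrt(np.maximum(eigvals, 1e-12))
    return mean, scales, eigvecs


def warp_samples(samples, new_mean, new_scales, new_axes):
    """Move samples so their mean, scales, and axes match the predicted ones."""
    mean, scales, axes = distribution_params(samples)
    local = (samples - mean) @ axes        # coordinates in the current axes
    local = local * (new_scales / scales)  # rescale along each axis
    return local @ new_axes.T + new_mean   # rotate into new axes, translate


def histogram_from_samples(samples, bins=32):
    """Normalized color histogram re-estimated from the warped samples."""
    hist, _ = np.histogramdd(samples, bins=bins, range=[(0, 256)] * 3)
    return hist / max(hist.sum(), 1.0)
```

Because eigenvectors are defined only up to sign, a practical version would need to keep axis orientation consistent from frame to frame; the sketch ignores that detail.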

Experiments


© 2000 Image and Video Computing Group - Boston University
Last Modified: March 14, 2000