Title: Specialized Mappings Architecture with Applications to Vision-Based Estimation of Articulated Body Pose
Author: Romer Rosales
Abstract:
A fundamental task of vision systems is to infer the state of the
world given some form of visual observations. From a computational
perspective, this often involves facing an ill-posed problem; e.g.,
information is lost via projection of the 3D world into a 2D image.
Solving an ill-posed problem requires additional information,
usually provided as a model of the underlying process. It is
important that the model be both computationally feasible and
theoretically well-founded. In this thesis, a probabilistic, nonlinear
supervised computational learning model is proposed: the Specialized
Mappings Architecture (SMA). The SMA framework is demonstrated in a
computer vision system that can estimate the articulated pose
parameters of a human body or human hands, given images obtained via
one or more uncalibrated cameras.
The SMA consists of several specialized forward mapping functions that
are estimated automatically from training data, and a possibly known
feedback function. Each specialized function maps certain domains of
the input space (e.g., image features) onto the output space (e.g.,
articulated body parameters). A probabilistic model for the
architecture is first formalized. Solutions to key algorithmic
problems are then derived: simultaneous learning of the specialized
domains and the mapping functions, and inference given inputs and a
feedback function. The SMA employs a
variant of the Expectation-Maximization algorithm and approximate
inference. The approach allows the use of alternative conditional
independence assumptions for learning and inference, which are derived
from a forward model and a feedback model.
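
The following is a minimal toy sketch of the idea described above, not the
algorithm developed in the thesis. It assumes a one-dimensional problem in
which the feedback function is known and equal to abs(), so the sign of the
output is lost in the input, much as depth is lost in projection. Two linear
specialized mappings are learned by alternating hard assignment and
least-squares refitting, which stands in here for the
Expectation-Maximization variant. Inference evaluates every mapping and scores
each hypothesis through the feedback function. All names (feedback, fit_line,
learn, infer) and the hand-picked initial parameters are hypothetical.

```python
# Toy sketch only: hard-assignment learning replaces the thesis's EM variant.

def feedback(h):
    """Assumed known feedback function: maps an output h back to an input x.
    abs() discards the sign, so two outputs explain the same input."""
    return abs(h)


def fit_line(xs, hs):
    """Ordinary least-squares fit of h = a * x + b."""
    n = len(xs)
    mx = sum(xs) / n
    mh = sum(hs) / n
    var = sum((x - mx) * (x - mx) for x in xs)
    cov = sum((x - mx) * (h - mh) for x, h in zip(xs, hs))
    a = cov / var
    return a, mh - a * mx


def learn(xs, hs, params, iterations=10):
    """Jointly estimate the sample-to-mapping assignment and the mappings."""
    for _ in range(iterations):
        # Assignment step: each sample goes to the mapping that predicts it best.
        groups = [([], []) for _ in params]
        for x, h in zip(xs, hs):
            errors = [(h - (a * x + b)) ** 2 for a, b in params]
            k = errors.index(min(errors))
            groups[k][0].append(x)
            groups[k][1].append(h)
        # Refit step: keep the old parameters if a group is too small to fit.
        params = [fit_line(gx, gh) if len(set(gx)) >= 2 else old
                  for (gx, gh), old in zip(groups, params)]
    return params


def infer(x, params):
    """Return (feedback_error, hypothesis) pairs, lowest error first."""
    hypotheses = [a * x + b for a, b in params]
    return sorted((abs(feedback(h) - x), h) for h in hypotheses)


# Training data: outputs h in [-2, 2], inputs generated by the feedback function.
hs = [-2 + 4 * i / 40 for i in range(41)]
xs = [feedback(h) for h in hs]

# Hand-picked initial (slope, intercept) pairs, chosen for this toy example.
params = learn(xs, hs, params=[(0.5, 0.1), (-0.5, -0.1)])
ranked = infer(1.5, params)
```

In this symmetric toy problem the two learned mappings recover the two
branches of the inverse (slopes close to +1 and -1), and both hypotheses for
an input reproduce it through the feedback function, which reflects the
ambiguity of the problem itself. The feedback error is the quantity that would
rank hypotheses when a mapping is applied outside the domain it was fit on.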
Experimental validation of the proposed approach is conducted on the
task of estimating articulated body pose from image
silhouettes. The accuracy and stability of the SMA framework are tested
using artificial data sets, as well as synthetic and real video
sequences of human bodies and hands.