OptiPose: Attention-Based Pose Optimizer

Abstract

Accurate tracking of the 3D pose of animals from video recordings is critical for many behavioral studies, yet there is a dearth of publicly available datasets that the computer vision community could use for model development. We here introduce the Rodent3D dataset that records animals exploring their environment and/or interacting with each other with multiple cameras and modalities (RGB, depth, thermal infrared). Rodent3D consists of 200 minutes of multimodal video recordings from up to three thermal and three RGB-D synchronized cameras (approximately 4 million frames). For the task of optimizing estimates of pose sequences provided by existing pose estimation methods, we provide a baseline model called OptiPose. While deep-learned attention mechanisms have been used for pose estimation in the past, with OptiPose, we propose a different way by representing 3D poses as tokens for which deep-learned context models pay attention to both spatial and temporal keypoint patterns. Our experiments show how OptiPose is highly robust to noise and occlusion and can be used to optimize pose sequences provided by state-of-the-art models for animal pose estimation.

Open Access Article - IJCV

OptiPose

MuSeq Pose Kit

alpha

MuSeq Pose Kit

Multi-view Sequential Pose Kit is a 3D pose estimation and optimization software that includes a feature-rich annotation tool complete with interpolation and 3D reprojection. It includes UI wrappers for most of the OptiPose functionalities.
The software supports I/O plugins to add support for other toolkits. As of now, it supports DeepLabCut and OptiPose data formats. Whereas, for video I/O, it comes with multi-threaded OpenCV (inaccurate, fast seeking) and DeffCode (accurate, slow seeking) plugins.
The software supports configurable 3D/2D plots powered by VisPy and Matplotlib. It supports abstract 3D scene plots (demonstrated in the GIF), real-time line plots, 3D occupancy plots, and gaze tracking.

Task and Motivation

Occlusion aware systems have been one of the primary goals of computer vision scientists. We want our systems to make rational decisions under uncertainty and perform at least as well as humans.
If we ask this question to anyone, they can probably pinpoint the location of the snout based on the locations of the other keypoints. This is because we have a general understanding of a rodent's posture. Therefore, it is essential to encode the spatio-temporal relations among the keypoints in the model. A standard method for encoding such relations is the self-attention mechanism.
Recent advancements in natural language processing have shown impressive levels of understanding of the grammar of the language. The methods can also be used in computer vision by rephrasing the pose optimization task as a language modeling task. Just as different permutations of words form a sentence, different permutations of the poses form an activity.

Qualitative Comparison

Using filtering is not enough to optimize poses under long-term occlusions. Interpolation and Kalman filtering will make the pose seem unnatural.
Since OptiPose has an understanding of the spatio-temporal relations among sequence of poses, it can reasonably predict the locations of occluded keypoints.
OptiPose is also robust when used on noisy data that is not essentially caused by occlusion. The figure on the bottom is generated by adding random noise to the AcinoSet dataset and using OptiPose to filter the poses.

Automatically Extracting Relevant Events

OptiPose automatically extracts interesting events by analyzing the optimized 3D poses
Left: By analyzing and filtering the azimuth of the HeadDirection and BodyDirection vectors, OptiPose extracts frame accurate Head Oscillation events from the session
Right: By analyzing and filtering the elevation of the HeadDirection and BodyDirection vectors, OptiPose extracts frame accurate Rearing events from the session

Head Oscillation

Rearing

Reconstruction and Tracking by OptiPose on Rodent3D

Optimizes the reconstruction from fairly noisy inputs (Right) by encoding the postural information of the subject.
Realtime plots of the spherical coordinates of HeadDirection, BodyDirection, and MovementDirection vectors.
Realtime tracking of an approximate location of where the subject is looking. This information can be useful in experiments with audio-visual stimuli.

3D Approximation of the Rodent's View

Conclusion and Future Work

Self-Attention-Based Models can capture the postural information of the animals and can be used for pose estimation/optimization.
The current version of the architecture uses 3D poses as inputs but we can use CNNs to provide the input pose embeddings for our model.
Using OptiPose as the backbone, we can train a supervised sequence-to-sequence model for classifying video frames in to animal activities.