2024 | FICTION: 4D Future Interaction Prediction from Video
Task
Given an observation of a person performing an activity, the model anticipates all subsequent interactions up to a future time horizon (3 minutes), including their 3D locations and the person's body poses.
Input:
- observation video up to time $t$
- 3D locations (of objects in the scene)
- body poses
Output:
- a set of points $\{(p_i, \tau_i)\}$ such that the person interacts with an object at the 3D point $p_i$ at timestamp $\tau_i$
- the distribution of likely body poses at each interaction (an interface sketch follows below)
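A minimal interface sketch for the task I/O above; the class names and array shapes are hypothetical, not from the paper.

```python
# Hypothetical I/O containers for the task; shapes are placeholder assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    video: np.ndarray             # (T, H, W, 3) frames observed up to time t
    object_locations: np.ndarray  # (K, 3) 3D locations of scene objects
    body_poses: np.ndarray        # (T, D) observed body-pose parameters


@dataclass
class PredictedInteraction:
    point: np.ndarray             # (3,) 3D location p_i of the interaction
    timestamp: float              # tau_i, seconds into the future (up to ~3 min)
    pose_samples: np.ndarray      # (S, D) samples from the predicted pose distribution
```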
Method
Video representation
- video encoder: EgoVLPv2
- projection into the shared token space: linear layer
Pose Representation
- projection of the body-pose parameters into the shared token space: linear layer
Object Bounding Boxes
Assign an object index to each voxel whose location contains that object; the actor is also treated as an object.
- projection of the per-voxel object indices into the shared token space: linear layer (see the tokenization sketch below)
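A rough sketch of the three per-modality embeddings, assuming precomputed EgoVLPv2 clip features; all dimensions (D_VIDEO, D_POSE, D_MODEL, the voxel and object counts) are placeholder assumptions, and an embedding table plays the role of a linear layer over the one-hot object index.

```python
import torch
import torch.nn as nn

D_VIDEO, D_POSE, D_MODEL = 768, 72, 512
N_OBJECTS, N_VOXELS = 20, 64   # object vocabulary (incl. the actor), voxel grid size

video_proj = nn.Linear(D_VIDEO, D_MODEL)   # embeds EgoVLPv2 features
pose_proj = nn.Linear(D_POSE, D_MODEL)     # embeds body-pose parameters
# One object index per voxel; the actor counts as an object.
object_embed = nn.Embedding(N_OBJECTS + 1, D_MODEL)

video_feats = torch.randn(1, 16, D_VIDEO)                        # (B, T, D_VIDEO)
pose_params = torch.randn(1, 16, D_POSE)                         # (B, T, D_POSE)
voxel_object_ids = torch.randint(0, N_OBJECTS + 1, (1, N_VOXELS))

tokens = torch.cat(
    [video_proj(video_feats), pose_proj(pose_params), object_embed(voxel_object_ids)],
    dim=1,
)  # (B, 2T + N_VOXELS, D_MODEL) multimodal token sequence
```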
Multimodal Transformer Encoder
The first output token of the transformer is taken as the aggregate representation.
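A minimal encoder sketch, assuming a standard PyTorch transformer and a learned summary token prepended to the sequence; whether the paper uses a dedicated token or simply the first fused token is an assumption here.

```python
import torch
import torch.nn as nn

D_MODEL = 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)
summary_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))

tokens = torch.randn(1, 96, D_MODEL)   # fused multimodal tokens
x = torch.cat([summary_token.expand(tokens.size(0), -1, -1), tokens], dim=1)
representation = encoder(x)[:, 0]      # first output token as the summary
```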
Decoding future interaction location
Use a simple location decoder, a linear layer that maps the representation to a vector with one entry per voxel.
A voxel is marked 1 when the corresponding location has a future interaction and 0 otherwise.
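A sketch of this head: one linear layer producing a logit per voxel, trained against the binary interaction map. The BCE loss choice and voxel count are assumptions.

```python
import torch
import torch.nn as nn

D_MODEL, N_VOXELS = 512, 64
location_decoder = nn.Linear(D_MODEL, N_VOXELS)

representation = torch.randn(1, D_MODEL)   # first transformer output token
logits = location_decoder(representation)  # (1, N_VOXELS)

target = torch.zeros(1, N_VOXELS)          # 1 where a future interaction occurs
target[0, 12] = 1.0
loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
```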
CVAE for pose distribution
The location query $q$ (the 3D interaction location) is mapped into an embedding that conditions the CVAE.
Training
An encoder maps the ground-truth pose and the condition to a latent $z \sim q(z \mid y, c)$, where $y$ is the desired ground-truth body pose; the decoder reconstructs $y$ from $z$ and $c$.
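A minimal CVAE training sketch under standard assumptions (Gaussian posterior, reparameterization trick); the single-linear-layer encoder and decoder and all dimensions are illustrative only.

```python
import torch
import torch.nn as nn

D_POSE, D_COND, D_LATENT = 72, 512, 32
enc = nn.Linear(D_POSE + D_COND, 2 * D_LATENT)   # (y, c) -> (mu, log_var)
dec = nn.Linear(D_LATENT + D_COND, D_POSE)       # (z, c) -> reconstructed pose

y = torch.randn(1, D_POSE)   # ground-truth body pose (SMPL parameters)
c = torch.randn(1, D_COND)   # condition: representation + location query

mu, log_var = enc(torch.cat([y, c], dim=-1)).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
y_hat = dec(torch.cat([z, c], dim=-1))                  # predicted pose
```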
Loss Function
- MSE loss between the predicted SMPL parameters and the ground-truth parameters
- convert the SMPL parameters to 3D body joints and compute the joint error
- KL divergence between the predicted latent distribution and the prior $\mathcal{N}(0, I)$ (combined in the sketch below)
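The three terms could be combined as below, with unit weights as a placeholder assumption; `smpl_to_joints` stands in for an SMPL forward pass (e.g. via the smplx package) and is hypothetical here.

```python
import torch
import torch.nn.functional as F

def pose_loss(y_hat, y, mu, log_var, smpl_to_joints):
    param_loss = F.mse_loss(y_hat, y)               # SMPL parameter MSE
    joint_loss = F.mse_loss(smpl_to_joints(y_hat),  # 3D joint error
                            smpl_to_joints(y))
    # KL(N(mu, sigma^2) || N(0, I)) in closed form
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return param_loss + joint_loss + kl
```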
Inference
Sample multiple latents $z \sim \mathcal{N}(0, I)$ and decode each into a body pose.
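A sketch of this sampling step; `dec` mirrors the decoder shape assumed in the CVAE sketch above.

```python
import torch
import torch.nn as nn

D_POSE, D_COND, D_LATENT, S = 72, 512, 32, 10
dec = nn.Linear(D_LATENT + D_COND, D_POSE)

c = torch.randn(1, D_COND)   # condition for one interaction point
z = torch.randn(S, D_LATENT) # S samples from the N(0, I) prior
poses = dec(torch.cat([z, c.expand(S, -1)], dim=-1))   # (S, D_POSE) pose samples
```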
Dataset Construction
3D object bounding boxes
- object segmentation in 2D video
- use the mapping from video pixels to 3D locations
- obtain 3D bounding boxes (sketched below)
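A minimal sketch of the lifting step, assuming the pixel-to-3D mapping is available as a per-pixel point map and that an axis-aligned min/max box is sufficient.

```python
import numpy as np

def mask_to_3d_bbox(mask: np.ndarray, point_map: np.ndarray):
    """mask: (H, W) bool segmentation; point_map: (H, W, 3) 3D point per pixel."""
    pts = point_map[mask]                    # 3D points on the object
    return pts.min(axis=0), pts.max(axis=0)  # opposite corners of the 3D box
```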
body pose
Extract the body pose from the exocentric video.
interaction
Use the annotated narrations:
- use Llama-3.1-8B to classify each narration as either a touch-based or a non-touch interaction (a classification sketch follows below)
- match the object mentioned in the narration to the object-detection vocabulary
- use the narration timestamps as the interaction times
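A sketch of the touch/non-touch narration classification, assuming a Hugging Face text-generation pipeline over Llama-3.1-8B-Instruct; the prompt wording is an assumption, not the paper's.

```python
from transformers import pipeline

clf = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")

def classify_narration(narration: str) -> str:
    prompt = (
        "Classify the following activity narration as 'touch' if the person "
        "physically touches an object, otherwise 'non-touch'. Answer with one "
        f"word.\nNarration: {narration}\nAnswer:"
    )
    out = clf(prompt, max_new_tokens=4)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    return "non-touch" if answer.startswith("non") else "touch"
```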