2024 | FICTION: 4D Future Interaction Prediction from Video
Task
Given an observation of a person performing an activity, the model anticipates all subsequent interactions up to a future time horizon (3 minutes), including their 3D locations and the person's body poses.
Input:
- observation video up to time $t$
- 3D locations (of objects in the scene)
- body poses
Output:
- a set of points $\{(p_i, \tau_i)\}$ such that the person interacts with an object at the 3D point $p_i$ at timestamp $\tau_i$
- the distribution of likely body poses at each interaction (an interface sketch follows below)
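A minimal interface sketch for the task I/O above; the class names and array shapes are hypothetical, not from the paper.

```python
# Hypothetical I/O containers for the task; shapes are placeholder assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    video: np.ndarray             # (T, H, W, 3) frames observed up to time t
    object_locations: np.ndarray  # (K, 3) 3D locations of scene objects
    body_poses: np.ndarray        # (T, D) observed body-pose parameters


@dataclass
class PredictedInteraction:
    point: np.ndarray             # (3,) 3D location p_i of the interaction
    timestamp: float              # tau_i, seconds into the future (up to ~3 min)
    pose_samples: np.ndarray      # (S, D) samples from the predicted pose distribution
```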
Method
Video representation
- video encoder: EgoVLPv2
- projection into the shared token space: linear layer
Pose Representation
- projection of the body-pose parameters into the shared token space: linear layer
Object Bounding Boxes
Assign an object index to each voxel whose location contains that object; the actor is also treated as an object.
- projection of the per-voxel object indices into the shared token space: linear layer (see the tokenization sketch below)
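A rough sketch of the three per-modality embeddings, assuming precomputed EgoVLPv2 clip features; all dimensions (D_VIDEO, D_POSE, D_MODEL, the voxel and object counts) are placeholder assumptions, and an embedding table plays the role of a linear layer over the one-hot object index.

```python
import torch
import torch.nn as nn

D_VIDEO, D_POSE, D_MODEL = 768, 72, 512
N_OBJECTS, N_VOXELS = 20, 64   # object vocabulary (incl. the actor), voxel grid size

video_proj = nn.Linear(D_VIDEO, D_MODEL)   # embeds EgoVLPv2 features
pose_proj = nn.Linear(D_POSE, D_MODEL)     # embeds body-pose parameters
# One object index per voxel; the actor counts as an object.
object_embed = nn.Embedding(N_OBJECTS + 1, D_MODEL)

video_feats = torch.randn(1, 16, D_VIDEO)                        # (B, T, D_VIDEO)
pose_params = torch.randn(1, 16, D_POSE)                         # (B, T, D_POSE)
voxel_object_ids = torch.randint(0, N_OBJECTS + 1, (1, N_VOXELS))

tokens = torch.cat(
    [video_proj(video_feats), pose_proj(pose_params), object_embed(voxel_object_ids)],
    dim=1,
)  # (B, 2T + N_VOXELS, D_MODEL) multimodal token sequence
```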
Multimodal Transformer Encoder
The first output token of the transformer is taken as the aggregate representation.
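A minimal encoder sketch, assuming a standard PyTorch transformer and a learned summary token prepended to the sequence; whether the paper uses a dedicated token or simply the first fused token is an assumption here.

```python
import torch
import torch.nn as nn

D_MODEL = 512
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)
summary_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))

tokens = torch.randn(1, 96, D_MODEL)   # fused multimodal tokens
x = torch.cat([summary_token.expand(tokens.size(0), -1, -1), tokens], dim=1)
representation = encoder(x)[:, 0]      # first output token as the summary
```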
Decoding future interaction location
Use a simple location decoder, a linear layer that maps the representation to a vector with one entry per voxel.
A voxel is marked 1 when the corresponding location has a future interaction and 0 otherwise.
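A sketch of this head: one linear layer producing a logit per voxel, trained against the binary interaction map. The BCE loss choice and voxel count are assumptions.

```python
import torch
import torch.nn as nn

D_MODEL, N_VOXELS = 512, 64
location_decoder = nn.Linear(D_MODEL, N_VOXELS)

representation = torch.randn(1, D_MODEL)   # first transformer output token
logits = location_decoder(representation)  # (1, N_VOXELS)

target = torch.zeros(1, N_VOXELS)          # 1 where a future interaction occurs
target[0, 12] = 1.0
loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
```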
CVAE for pose distribution
The location query $q$ (the 3D interaction location) is mapped into an embedding that conditions the CVAE.
Training
An encoder maps the ground-truth pose and the condition to a latent $z \sim q(z \mid y, c)$, where $y$ is the desired ground-truth body pose; the decoder reconstructs $y$ from $z$ and $c$.
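A minimal CVAE training sketch under standard assumptions (Gaussian posterior, reparameterization trick); the single-linear-layer encoder and decoder and all dimensions are illustrative only.

```python
import torch
import torch.nn as nn

D_POSE, D_COND, D_LATENT = 72, 512, 32
enc = nn.Linear(D_POSE + D_COND, 2 * D_LATENT)   # (y, c) -> (mu, log_var)
dec = nn.Linear(D_LATENT + D_COND, D_POSE)       # (z, c) -> reconstructed pose

y = torch.randn(1, D_POSE)   # ground-truth body pose (SMPL parameters)
c = torch.randn(1, D_COND)   # condition: representation + location query

mu, log_var = enc(torch.cat([y, c], dim=-1)).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
y_hat = dec(torch.cat([z, c], dim=-1))                  # predicted pose
```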
Loss Function
- MSE loss between the predicted SMPL parameters and the ground-truth parameters
- convert the SMPL parameters to 3D body joints and compute the joint error
- KL divergence between the predicted latent distribution and the prior $\mathcal{N}(0, I)$ (combined in the sketch below)
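The three terms could be combined as below, with unit weights as a placeholder assumption; `smpl_to_joints` stands in for an SMPL forward pass (e.g. via the smplx package) and is hypothetical here.

```python
import torch
import torch.nn.functional as F

def pose_loss(y_hat, y, mu, log_var, smpl_to_joints):
    param_loss = F.mse_loss(y_hat, y)               # SMPL parameter MSE
    joint_loss = F.mse_loss(smpl_to_joints(y_hat),  # 3D joint error
                            smpl_to_joints(y))
    # KL(N(mu, sigma^2) || N(0, I)) in closed form
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return param_loss + joint_loss + kl
```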
Inference
Sample multiple latents $z \sim \mathcal{N}(0, I)$ and decode each into a body pose.
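A sketch of this sampling step; `dec` mirrors the decoder shape assumed in the CVAE sketch above.

```python
import torch
import torch.nn as nn

D_POSE, D_COND, D_LATENT, S = 72, 512, 32, 10
dec = nn.Linear(D_LATENT + D_COND, D_POSE)

c = torch.randn(1, D_COND)   # condition for one interaction point
z = torch.randn(S, D_LATENT) # S samples from the N(0, I) prior
poses = dec(torch.cat([z, c.expand(S, -1)], dim=-1))   # (S, D_POSE) pose samples
```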
Dataset Construction
3D object bounding boxes
- object segmentation in 2D video
- use the mapping from video pixels to 3D locations
- obtain 3D bounding boxes (sketched below)
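A minimal sketch of the lifting step, assuming the pixel-to-3D mapping is available as a per-pixel point map and that an axis-aligned min/max box is sufficient.

```python
import numpy as np

def mask_to_3d_bbox(mask: np.ndarray, point_map: np.ndarray):
    """mask: (H, W) bool segmentation; point_map: (H, W, 3) 3D point per pixel."""
    pts = point_map[mask]                    # 3D points on the object
    return pts.min(axis=0), pts.max(axis=0)  # opposite corners of the 3D box
```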
body pose
Extract the body pose from the exocentric video.
interaction
Use the annotated narrations:
- use Llama-3.1-8B to classify each narration as either a touch-based or a non-touch interaction (a classification sketch follows below)
- match the object mentioned in the narration to the object-detection vocabulary
- use the narration timestamps as the interaction times
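A sketch of the touch/non-touch narration classification, assuming a Hugging Face text-generation pipeline over Llama-3.1-8B-Instruct; the prompt wording is an assumption, not the paper's.

```python
from transformers import pipeline

clf = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")

def classify_narration(narration: str) -> str:
    prompt = (
        "Classify the following activity narration as 'touch' if the person "
        "physically touches an object, otherwise 'non-touch'. Answer with one "
        f"word.\nNarration: {narration}\nAnswer:"
    )
    out = clf(prompt, max_new_tokens=4)[0]["generated_text"]
    answer = out[len(prompt):].strip().lower()
    return "non-touch" if answer.startswith("non") else "touch"
```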