SlotSSMs: Slot State Space Models

Rutgers University, KAIST
[Video: Video Decomposition on the TikTok Dataset]
[Video: Video Decomposition on the UT Egocentric Dataset]

Emergent Scene Decomposition from Depth Estimation Tasks. Colors indicate which slot is used to predict each position. SlotSSM exploits the inherent modular structure of real-world videos for efficient inference, without explicit segmentation supervision.


Abstract

Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular, and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintains the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot, with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model on object-centric video understanding, 3D visual reasoning, and video prediction tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.


Method

[Figure: Method diagram comparing SlotSSMs with existing models]

SlotSSMs vs existing models. (a) SlotSSMs incorporate modularity through independent state transitions and sparse interactions via self-attention. (b) Traditional SSMs utilize a monolithic state vector for all past information. (c) Multi-slot Transformer-based models offer modularity but with high computational complexity. (d) Multi-slot RNN-based models have modular states but can't parallelize training (red lock). SlotSSMs combine parallelizable training, memory efficiency, and modularity for efficient temporal modeling.
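
To make the contrast concrete, here is a minimal sketch of the linear-SSM case; the per-slot notation (the superscripts (k)) is ours, not the paper's. A conventional SSM maintains a single monolithic recurrence,

\[ h_t = A h_{t-1} + B x_t, \]

whereas a SlotSSM maintains K slot states that evolve independently,

\[ h_t^{(k)} = A^{(k)} h_{t-1}^{(k)} + B^{(k)} x_t^{(k)}, \qquad k = 1, \dots, K, \]

with information exchanged across slots only through a subsequent self-attention step over \( \{ h_t^{(1)}, \dots, h_t^{(K)} \} \).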


Architecture

[Figure: SlotSSM layer architecture]

SlotSSMs are fully parallelizable sequence models that combine SSMs and Transformers. Each layer comprises three stages (a minimal sketch follows the list):

  1. Slot Encoder: Utilizes a Transformer to extract compact slot representations from inputs of any size.
  2. SlotSSM: Independently updates these slots over time using separate state transitions.
  3. Slot Mixer: Introduces inter-slot interactions through self-attention mechanisms.
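
As an illustration of this three-stage layer, here is a minimal PyTorch sketch. It reflects our simplified reading, not the authors' implementation: the per-slot SSM is reduced to a diagonal linear recurrence computed with an explicit loop (real backbones such as S4, S5, or Mamba use structured parameterizations and parallel scans), and all names (SlotSSMLayer, num_slots, slot_dim) are illustrative.

import torch
import torch.nn as nn

class SlotSSMLayer(nn.Module):
    """One SlotSSM layer: Slot Encoder -> per-slot SSM -> Slot Mixer."""

    def __init__(self, num_slots, slot_dim, num_heads=4):
        super().__init__()
        # Learned slot queries for the Slot Encoder.
        self.slots = nn.Parameter(torch.randn(num_slots, slot_dim))
        # 1. Slot Encoder: cross-attention from slot queries to input tokens.
        self.encoder = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        # 2. Per-slot SSM: diagonal linear recurrence h_t = a * h_{t-1} + u_t,
        #    applied to each slot's state independently (no cross-slot terms).
        self.decay_logit = nn.Parameter(torch.zeros(slot_dim))
        self.in_proj = nn.Linear(slot_dim, slot_dim)
        # 3. Slot Mixer: self-attention across slots at each timestep.
        self.mixer = nn.TransformerEncoderLayer(slot_dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, time, num_tokens, dim)
        B, T, N, D = tokens.shape
        K = self.slots.shape[0]
        kv = tokens.reshape(B * T, N, D)
        queries = self.slots.expand(B * T, K, D)
        # Slot Encoder: compress each frame's tokens into K slot vectors.
        slots, _ = self.encoder(queries, kv, kv)
        u = self.in_proj(slots).reshape(B, T, K, D)
        a = torch.sigmoid(self.decay_logit)   # per-channel decay in (0, 1)
        h = torch.zeros(B, K, D, device=tokens.device)
        states = []
        for t in range(T):                    # a real SSM would use a parallel scan
            h = a * h + u[:, t]               # independent transition per slot
            states.append(h)
        slots = torch.stack(states, dim=1)    # (B, T, K, D)
        # Slot Mixer: sparse cross-slot interaction via self-attention.
        return self.mixer(slots.reshape(B * T, K, D)).reshape(B, T, K, D)

For example, a batch of 2 clips with 16 frames and 10 tokens per frame:

layer = SlotSSMLayer(num_slots=6, slot_dim=64)
out = layer(torch.randn(2, 16, 10, 64))   # -> shape (2, 16, 6, 64)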

Multi-Object Video Prediction

[Figure: Multi-object video prediction results]

Performance of SlotSSMs on the multi-object video prediction task, showing significant gains over single-state baselines and performance comparable to multi-slot Transformer models. This highlights the necessity of modular state representations in video modeling.


Long-Context Reasoning

[Figure: Blinking Color Balls data samples]
We introduce the Blinking Color Balls Benchmark, specifically designed to assess a model's ability to capture multi-object interactions and long-range dependencies, with sequence lengths extending up to 2560.
[Figure: Long-context reasoning performance]

SlotSSMs demonstrate their strength in long-context reasoning, outperforming existing models in both prediction accuracy and computational efficiency.


Unsupervised Object-Centric Learning

[Figure: Object-centric learning performance]

We propose OC-SlotSSMs, a variant tailored for unsupervised object-centric representation learning. OC-SlotSSMs outperform existing methods in both unsupervised object segmentation and downstream property prediction.


3D Visual Reasoning

[Figure: 3D visual reasoning results on CATER]

We evaluate SlotSSMs on the CATER dataset, a challenging 3D visual reasoning benchmark. OC-SlotSSMs achieve superior performance in both the direct-training and the pre-training + fine-tuning settings.


Emergent Modularity in Real-World Videos

[Videos: Slot decomposition and depth estimation on real-world videos]

We apply OC-SlotSSMs to a depth estimation task on real-world datasets. Without explicit segmentation supervision, SlotSSMs exploit modular representations to understand scene structure in real-world videos. The videos show the slot decomposition alongside the estimated depth.

Note that our goal in this task is not to surpass existing depth estimation models, but to use a task manageable with our lab resources to showcase the emergent modularity of SlotSSMs in real-world video processing. For the TikTok dataset, we manually recolored two background slots to grey for a more aesthetically pleasing visualization. Details of this experiment will be added to the arXiv manuscript soon.


BibTeX

@article{jiang2024slot,
  title   = {Slot State Space Models},
  author  = {Jiang, Jindong and Deng, Fei and Singh, Gautam and Lee, Minseung and Ahn, Sungjin},
  journal = {arXiv preprint arXiv:2406.12272},
  year    = {2024}
}