SlotSSMs: Slot State Space Models

Rutgers University, KAIST
[Video: Video Decomposition on the TikTok Dataset]
[Video: Video Decomposition on the UT Egocentric Dataset]

Emergent Scene Decomposition from Depth Estimation Tasks. Colors indicate which slot is used to predict each position. SlotSSM exploits the inherent modular structure of real-world videos for efficient inference, without explicit segmentation supervision.


Abstract

Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular, and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintains the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot, with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model on object-centric video understanding, 3D visual reasoning, and video prediction tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.


Method

[Figure: Method diagram comparing SlotSSMs with existing models]

SlotSSMs vs existing models. (a) SlotSSMs incorporate modularity through independent state transitions and sparse interactions via self-attention. (b) Traditional SSMs utilize a monolithic state vector for all past information. (c) Multi-slot Transformer-based models offer modularity but with high computational complexity. (d) Multi-slot RNN-based models have modular states but can't parallelize training (red lock). SlotSSMs combine parallelizable training, memory efficiency, and modularity for efficient temporal modeling.
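
To make the contrast concrete, here is a minimal sketch of the linear-SSM case; the per-slot notation (the superscripts (k)) is ours, not the paper's. A conventional SSM maintains a single monolithic recurrence,

\[ h_t = A h_{t-1} + B x_t, \]

whereas a SlotSSM maintains K slot states that evolve independently,

\[ h_t^{(k)} = A^{(k)} h_{t-1}^{(k)} + B^{(k)} x_t^{(k)}, \qquad k = 1, \dots, K, \]

with information exchanged across slots only through a subsequent self-attention step over \( \{ h_t^{(1)}, \dots, h_t^{(K)} \} \).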


Architecture

[Figure: SlotSSM layer architecture]

SlotSSMs are fully parallelizable sequence models that combine SSMs and Transformers. Each layer comprises three stages (a minimal sketch follows the list):

  1. Slot Encoder: Utilizes a Transformer to extract compact slot representations from inputs of any size.
  2. SlotSSM: Independently updates these slots over time using separate state transitions.
  3. Slot Mixer: Introduces inter-slot interactions through self-attention mechanisms.
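
As an illustration of this three-stage layer, here is a minimal PyTorch sketch. It reflects our simplified reading, not the authors' implementation: the per-slot SSM is reduced to a diagonal linear recurrence computed with an explicit loop (real backbones such as S4, S5, or Mamba use structured parameterizations and parallel scans), and all names (SlotSSMLayer, num_slots, slot_dim) are illustrative.

import torch
import torch.nn as nn

class SlotSSMLayer(nn.Module):
    """One SlotSSM layer: Slot Encoder -> per-slot SSM -> Slot Mixer."""

    def __init__(self, num_slots, slot_dim, num_heads=4):
        super().__init__()
        # Learned slot queries for the Slot Encoder.
        self.slots = nn.Parameter(torch.randn(num_slots, slot_dim))
        # 1. Slot Encoder: cross-attention from slot queries to input tokens.
        self.encoder = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        # 2. Per-slot SSM: diagonal linear recurrence h_t = a * h_{t-1} + u_t,
        #    applied to each slot's state independently (no cross-slot terms).
        self.decay_logit = nn.Parameter(torch.zeros(slot_dim))
        self.in_proj = nn.Linear(slot_dim, slot_dim)
        # 3. Slot Mixer: self-attention across slots at each timestep.
        self.mixer = nn.TransformerEncoderLayer(slot_dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, time, num_tokens, dim)
        B, T, N, D = tokens.shape
        K = self.slots.shape[0]
        kv = tokens.reshape(B * T, N, D)
        queries = self.slots.expand(B * T, K, D)
        # Slot Encoder: compress each frame's tokens into K slot vectors.
        slots, _ = self.encoder(queries, kv, kv)
        u = self.in_proj(slots).reshape(B, T, K, D)
        a = torch.sigmoid(self.decay_logit)   # per-channel decay in (0, 1)
        h = torch.zeros(B, K, D, device=tokens.device)
        states = []
        for t in range(T):                    # a real SSM would use a parallel scan
            h = a * h + u[:, t]               # independent transition per slot
            states.append(h)
        slots = torch.stack(states, dim=1)    # (B, T, K, D)
        # Slot Mixer: sparse cross-slot interaction via self-attention.
        return self.mixer(slots.reshape(B * T, K, D)).reshape(B, T, K, D)

For example, a batch of 2 clips with 16 frames and 10 tokens per frame:

layer = SlotSSMLayer(num_slots=6, slot_dim=64)
out = layer(torch.randn(2, 16, 10, 64))   # -> shape (2, 16, 6, 64)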

Multi-Object Video Prediction

[Figure: Multi-object video prediction results]

Performance of SlotSSMs on the multi-object video prediction task, showing significant gains over single-state baselines and performance comparable to multi-slot Transformer models. This highlights the necessity of modular state representations in video modeling.


Long-Context Reasoning

[Figure: Blinking Color Balls data samples]
We introduce the Blinking Color Balls Benchmark, specifically designed to assess a model's ability to capture multi-object interactions and long-range dependencies, with sequence lengths extending up to 2560.
[Figure: Long-context reasoning performance]

SlotSSMs demonstrate their strength in long-context reasoning, outperforming existing models in both prediction accuracy and computational efficiency.


Unsupervised Object-Centric Learning

[Figure: Object-centric learning performance]

We propose OC-SlotSSMs, a variant tailored for unsupervised object-centric representation learning. OC-SlotSSMs outperform existing methods in both unsupervised object segmentation and downstream property prediction.


3D Visual Reasoning

[Figure: 3D visual reasoning results on CATER]

We evaluate SlotSSMs on the CATER dataset, a challenging 3D visual reasoning benchmark. OC-SlotSSMs achieve superior performance in both the direct-training and the pre-training + fine-tuning settings.


Emergent Modularity in Real-World Videos

[Videos: Slot decomposition and depth estimation on real-world videos]

We apply OC-SlotSSMs to a depth estimation task on real-world datasets. Without explicit segmentation supervision, SlotSSMs exploit modular representations to understand scene structure in real-world videos. The videos show the slot decomposition alongside the estimated depth.

Note that our goal in this task is not to surpass existing depth estimation models, but to use a task manageable with our lab resources to showcase the emergent modularity of SlotSSMs in real-world video processing. For the TikTok dataset, we manually recolored two background slots to grey for a more aesthetically pleasing visualization. Details of this experiment will be added to the arXiv manuscript soon.


BibTeX

@article{jiang2024slot,
  title   = {Slot State Space Models},
  author  = {Jiang, Jindong and Deng, Fei and Singh, Gautam and Lee, Minseung and Ahn, Sungjin},
  journal = {arXiv preprint arXiv:2406.12272},
  year    = {2024}
}