Structured State-Space Models (SSMs) have emerged as a promising alternative to transformer models, particularly for processing long sequences. While SSMs are often praised for their ability to capture long-range dependencies, recent research indicates that they are subject to inherent limitations that affect their scalability and performance.
SSMs suffer from a strong recency bias: they disproportionately weight the most recent information in a sequence. Information further back receives correspondingly less weight and, in extreme cases, is ignored entirely. Empirical studies show that this bias significantly limits the models' ability to retrieve distant information and also makes them less robust. In practice, an SSM may struggle to extract crucial information from the beginning of a long sequence and use it for later processing.
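The effect can be seen in the recurrence itself. Below is a minimal sketch, assuming a one-dimensional diagonal linear recurrence h_t = a·h_{t-1} + b·x_t with |a| < 1 (an illustration with assumed coefficients, not any specific SSM from the literature): the contribution of a token at position s to the state at position t scales with a^(t-s), so its influence decays geometrically with distance.

```python
import numpy as np

# Minimal sketch of recency bias in a diagonal linear recurrence
# h_t = a * h_{t-1} + b * x_t with |a| < 1 (illustrative, not a specific SSM).
a, b = 0.9, 1.0          # per-step decay factor and input gain (assumed values)
T = 50                   # sequence length

x = np.zeros(T)
x[0] = 1.0               # a single "important" token at the very start

h, trace = 0.0, []
for t in range(T):
    h = a * h + b * x[t]
    trace.append(h)      # since only x[0] is nonzero, h at step t equals a**t

print(f"step 0: {trace[0]:.3f}, step 10: {trace[10]:.3f}, step 49: {trace[49]:.4f}")
# step 0: 1.000, step 10: ~0.349, step 49: ~0.0057 -> the early token is all but forgotten
```

The decay factor below one is exactly what keeps the recurrence stable, but it is also what makes distant tokens fade.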
While deeper SSM architectures can in principle capture longer contexts, they also carry the risk of over-smoothing. As model depth increases, the representations of individual tokens become increasingly similar and lose their distinctiveness, an effect resembling the over-smoothing phenomenon observed in Graph Neural Networks (GNNs). In the extreme, all token representations converge to a single point in the vector space, severely impairing the model's ability to distinguish between tokens and thus to capture the meaning of the sequence.
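A rough way to see this effect is to apply a smoothing layer repeatedly and measure how far apart the token representations remain. The sketch below is illustrative only: each "layer" is a causal exponential-moving-average mix with the same recurrence shape as a diagonal SSM (assumed coefficients a = b = 0.5), not the exact layer studied in the literature.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))            # 16 tokens with 8-dimensional representations

def ssm_like_layer(x, a=0.5, b=0.5):
    """Causal mixing h_t = a * h_{t-1} + b * x_t applied along the sequence."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def spread(x):
    """Mean distance of the token representations from their centroid."""
    return np.linalg.norm(x - x.mean(axis=0), axis=1).mean()

print(f"depth  0: spread = {spread(x):.4f}")
for depth in range(1, 9):
    x = ssm_like_layer(x)
    if depth in (2, 4, 8):
        print(f"depth {depth:2d}: spread = {spread(x):.4f}")
# The spread shrinks rapidly with depth: the token representations collapse
# toward a single point and become hard to tell apart.
```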
Taken together, recency bias and over-smoothing pose a fundamental dilemma for scaling SSMs. Deeper models can in theory process longer contexts, but added depth also intensifies over-smoothing and thus destroys information. This conflict limits how far SSM performance can be improved simply by enlarging the architecture.
To address this dilemma, new approaches such as the polarization technique have been developed. Polarization mitigates both recency bias and over-smoothing by reserving two channels of the state-transition matrices: one channel is fixed to zero and the other to one, which controls how information flows through the model. Experiments show that this technique improves the accuracy of associative recall of distant tokens and allows SSMs to benefit from deeper architectures without succumbing to the negative effects of over-smoothing.
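The following sketch illustrates the idea of pinning two channels in a diagonal SSM. The recurrence (h_t = a_c·h_{t-1} + x_t per channel) and the way the channels are fixed are simplified assumptions for illustration, not the published implementation: the channel with coefficient 1 never forgets (counteracting recency bias), while the channel with coefficient 0 carries only the current token (counteracting over-smoothing).

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, T = 8, 32
a = rng.uniform(0.5, 0.99, size=n_channels)   # per-channel decay coefficients

# Polarization (sketch): pin two channels of the state-transition coefficients.
a[0] = 0.0   # "zero" channel: keeps only the current token, no smoothing
a[1] = 1.0   # "one" channel: accumulates all history without decay, no recency bias

def scan(a, x):
    """Diagonal linear recurrence h_t = a * h_{t-1} + x_t over the sequence."""
    h = np.zeros_like(a)
    states = []
    for t in range(x.shape[0]):
        h = a * h + x[t]
        states.append(h.copy())
    return np.stack(states)

x = rng.normal(size=(T, n_channels))
states = scan(a, x)

# Channel 0 holds exactly the current input; channel 1 holds the full running sum,
# so even a token from the distant past is never completely forgotten.
print(np.allclose(states[:, 0], x[:, 0]))        # True
print(np.allclose(states[-1, 1], x[:, 1].sum())) # True
```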
Research on state-space models is dynamic and promising. Understanding inherent limitations such as recency bias and over-smoothing is crucial for developing more robust and scalable models. Innovative approaches like the polarization technique open up new avenues to overcome these challenges and unlock the full potential of SSMs for processing long sequences. Further development of initialization strategies and architectural variants promises additional improvements and solidifies the position of SSMs as a serious alternative to transformer models in Natural Language Processing and beyond.