Linear sequence modeling methods such as linear attention, state-space models, and linear RNNs offer substantial efficiency advantages: training scales linearly with sequence length and inference proceeds at constant cost per token. However, these methods typically compress the entire input sequence into a single, fixed-size memory state. This can degrade performance on so-called "recall-intensive" tasks, which require retrieving specific pieces of information from the context.
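To make this limitation concrete, the sketch below shows the single-state recurrence that many linear attention variants share: the entire sequence is folded into one fixed-size matrix, no matter how long the input is. The function and tensor names are illustrative and not taken from the MoM paper.

```python
import torch

def linear_attention_single_state(q, k, v):
    """Sketch of the single fixed-size memory state used by linear attention.

    q, k, v: tensors of shape (seq_len, d). The whole sequence is folded into
    one (d, d) state S, regardless of sequence length.
    """
    seq_len, d = q.shape
    S = torch.zeros(d, d)                 # the single fixed-size memory state
    outputs = []
    for t in range(seq_len):
        S = S + torch.outer(k[t], v[t])   # fold token t into the shared state
        outputs.append(S.T @ q[t])        # read out with the current query
    return torch.stack(outputs)
```

Because every token is written into the same matrix S, information from unrelated parts of the sequence competes for the same capacity, which is exactly the interference problem MoM targets.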
A new approach called Mixture-of-Memories (MoM) addresses this limitation. Inspired by neuroscientific findings, in particular the brain's ability to maintain robust long-term memory while keeping "memory interference" low, MoM maintains multiple independent memory states. A router network directs each input token to specific memory states. This design substantially increases the overall memory capacity while minimizing interference between the stored pieces of information.
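The following PyTorch sketch illustrates the idea under simplifying assumptions: a linear router scores each token, only the top-k of several independent (d, d) memory matrices are updated, and the output mixes the read-outs of the selected memories. The class and parameter names (`MoMSketch`, `num_memories`, `top_k`) are illustrative and do not reproduce the official implementation.

```python
import torch
import torch.nn as nn

class MoMSketch(nn.Module):
    """Illustrative sketch of a Mixture-of-Memories step (not the official code).

    Each of the `num_memories` states is a (d, d) matrix updated with a
    linear-attention-style recurrence; a router picks the top-k memories
    per token, so unrelated tokens can land in different states.
    """
    def __init__(self, d, num_memories=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d, num_memories)   # token features -> memory scores
        self.num_memories, self.top_k = num_memories, top_k

    def forward(self, q, k, v):
        seq_len, d = q.shape
        states = torch.zeros(self.num_memories, d, d)   # independent memory states
        outputs = []
        for t in range(seq_len):
            scores = self.router(k[t])                                   # (num_memories,)
            weights, idx = torch.topk(torch.softmax(scores, dim=-1), self.top_k)
            weights = weights / weights.sum()                            # renormalize over top-k
            o_t = torch.zeros(d)
            for w, m in zip(weights, idx):
                states[m] = states[m] + torch.outer(k[t], v[t])          # write to chosen memory
                o_t = o_t + w * (states[m].T @ q[t])                     # weighted read-out
            outputs.append(o_t)
        return torch.stack(outputs)
```

Since each token touches only a small, fixed number of memories, the per-token work stays constant even as the number of memory states grows.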
As a result, MoM clearly outperforms existing linear sequence modeling techniques, especially on recall-intensive tasks. Remarkably, despite the use of multiple memory states, each individual state is still updated with a linear-complexity recurrence, so MoM retains linear complexity during training and constant complexity during inference.
Experiments on a range of downstream language tasks confirm this picture: MoM consistently beats current linear sequence models, with the largest gains on recall-intensive benchmarks. In some cases it even reaches performance comparable to Transformer models, which are considerably more computationally expensive due to their quadratic attention complexity.
The router network plays a central role in the MoM model: it decides which memory state is best suited to process a given input token. This dynamic assignment of tokens to memory states makes more efficient use of the available memory capacity and reduces the risk of interference. How the router is designed and which criteria it uses to assign tokens are therefore decisive for the performance of the overall model.
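As a small illustration of this dynamic assignment, one can feed random token features through the router of the `MoMSketch` defined above and inspect which memory each token would be written to; the dimensions and the random seed here are arbitrary.

```python
import torch

# Hypothetical inspection of routing decisions on random token features,
# reusing the MoMSketch class from the sketch above.
torch.manual_seed(0)
mom = MoMSketch(d=16, num_memories=4, top_k=1)
tokens = torch.randn(8, 16)                     # 8 tokens with 16-dim features
scores = torch.softmax(mom.router(tokens), dim=-1)
print(scores.argmax(dim=-1))                    # which memory each token would write to
```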
The development of MoM opens up promising possibilities for the efficient processing of sequence data. Further research could focus on optimizing the router network, scaling the model to larger datasets, and transferring the approach to other domains. The combination of linear complexity and high performance makes MoM an attractive approach for applications where both efficiency and accuracy are critical.
Bibliography:
- https://www.arxiv.org/abs/2502.13685
- https://huggingface.co/papers/2502.13685
- https://arxiv.org/html/2502.13685v1
- http://paperreading.club/page?id=285633
- https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List
- https://llm-random.github.io/posts/moe_mamba/
- https://nips.cc/virtual/2024/poster/96794
- https://www.researchgate.net/publication/388919803_LASP-2_Rethinking_Sequence_Parallelism_for_Linear_Attention_and_Its_Hybrid
- https://openreview.net/pdf?id=25Ioxw576r