The trend toward ever larger language models (LLMs) has driven impressive progress in natural language processing in recent years. However, these models demand rapidly growing amounts of compute and memory, which makes them hard to deploy in resource-constrained settings such as real-time applications.
Approaches such as Mixture of Experts (MoE) decouple the parameter count from the computational cost by activating only a small part of the model for each input. This improves training efficiency, but it also increases memory access during inference, which drives up latency. Product Key Memory (PKM), on the other hand, uses an extremely large memory layer of which only a handful of slots are activated per token, keeping the number of active parameters low; however, PKM falls noticeably short of MoE in model quality.
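To make the sparse-activation idea behind MoE concrete, the following is a minimal PyTorch sketch of top-k expert routing; the class name, layer sizes, and routing loop are illustrative assumptions rather than code from any particular MoE implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Minimal sketch of top-k expert routing (illustrative; not UltraMem or PKM code)."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network scores each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                              # x: (batch, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)              # mixing weights for the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # only k of n_experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(4, 64)
print(TinyMoE()(x).shape)   # torch.Size([4, 64])
```

Only two of the eight expert MLPs are evaluated per token, which is exactly why the parameter count can grow without the per-token compute growing with it.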
This is where UltraMem comes in. The architecture builds on PKM and extends it with several new components designed to overcome the weaknesses of both MoE and PKM. UltraMem integrates extremely large, ultra-sparse memory layers that significantly improve the computational efficiency and scalability of LLMs without sacrificing model quality.
UltraMem is based on the principle of product keys, which make it possible to address the memory table efficiently. Instead of scoring a query against a single flat list of N keys, as a conventional memory layer would, a product-key layer splits the query in half and scores each half against one of two much smaller sub-key tables; the two-dimensional address of a memory slot is then formed from the Cartesian product of the top-scoring sub-keys. This reduces the cost of key retrieval from roughly N comparisons to on the order of 2·√N.
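Under simplifying assumptions (a single query vector, two sub-key tables of size n, an n × n value table), a product-key lookup can be sketched as follows; all names and dimensions are illustrative rather than taken from the PKM or UltraMem code.

```python
import torch

def product_key_lookup(query, sub_keys_1, sub_keys_2, values, topk=4):
    """Sketch of product-key addressing (illustrative names and shapes)."""
    d = query.shape[-1] // 2
    q1, q2 = query[:d], query[d:]

    s1 = sub_keys_1 @ q1                          # (n,) scores for the first query half
    s2 = sub_keys_2 @ q2                          # (n,) scores for the second query half

    # Combine only the top candidates of each half over their Cartesian product.
    v1, i1 = s1.topk(topk)
    v2, i2 = s2.topk(topk)
    grid = v1[:, None] + v2[None, :]              # (topk, topk) combined scores
    best = grid.flatten().topk(topk)
    rows = i1[best.indices // topk]
    cols = i2[best.indices % topk]
    slot_ids = rows * sub_keys_2.shape[0] + cols  # addresses into the n*n value table

    weights = best.values.softmax(dim=-1)
    return (weights[:, None] * values[slot_ids]).sum(dim=0)

n, d_key, d_value = 128, 32, 64
out = product_key_lookup(torch.randn(2 * d_key),
                         torch.randn(n, d_key), torch.randn(n, d_key),
                         torch.randn(n * n, d_value))
print(out.shape)   # torch.Size([64])
```

The query is compared against 2·n sub-keys instead of n² full keys, yet it can still address any of the n² value slots.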
UltraMem goes a step further and tackles the scaling problems of PKM. Instead of a single huge memory table, it uses several smaller memory modules distributed across the network, which lets memory lookups overlap with the transformer layers and reduces latency. In addition, UltraMem refines retrieval with a Tucker decomposition of the query-key scores and uses a virtual-memory scheme so that the logical memory can grow without a matching increase in physical memory.
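To give an intuition for the Tucker-based scoring, here is a loose sketch: instead of simply adding a row score and a column score, several row- and column-score vectors are mixed through a small core matrix. The rank, the shapes, and the function name are assumptions for illustration and do not reproduce the exact formulation from the UltraMem paper.

```python
import torch

def tucker_combined_scores(query, row_keys, col_keys, core, rank=2):
    """Loose sketch of Tucker-style query-key scoring (illustrative shapes and names).

    Plain product keys combine scores additively: S[i, j] = s_row[i] + s_col[j].
    Here, `rank` row-score vectors and `rank` column-score vectors are mixed
    through a small core matrix, giving a richer low-rank approximation of the
    full n x n score grid.
    """
    d = query.shape[-1] // 2
    q_row = query[:d].reshape(rank, -1)            # rank query slices for the rows
    q_col = query[d:].reshape(rank, -1)            # rank query slices for the columns

    s_row = q_row @ row_keys.T                     # (rank, n) row scores
    s_col = q_col @ col_keys.T                     # (rank, n) column scores

    # S[i, j] = sum over r, r' of core[r, r'] * s_row[r, i] * s_col[r', j]
    return torch.einsum("ri,rc,cj->ij", s_row, core, s_col)

n, rank, d_sub = 64, 2, 16
scores = tucker_combined_scores(torch.randn(2 * rank * d_sub),
                                torch.randn(n, d_sub), torch.randn(n, d_sub),
                                torch.randn(rank, rank), rank=rank)
print(scores.shape)   # torch.Size([64, 64])
```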
Compared to MoE, UltraMem offers significant advantages in terms of inference speed and memory access. Experiments show that UltraMem can be up to six times faster than MoE with the same number of parameters and computational complexity. The inference speed of UltraMem is almost identical to that of a dense model with equivalent computational resources.
Furthermore, UltraMem scales about as well as MoE and in some settings even better. Its distributed architecture allows networks with millions of memory slots to be trained and paves the way for even larger and more capable language models.
UltraMem represents a promising approach for the development of ultra-efficient language models. The combination of distributed memory modules, Tucker decomposition, and virtual memory management allows for a significant reduction in inference latency and memory requirements without sacrificing model performance. Future research could focus on further optimizing the architecture and applying UltraMem to various NLP tasks.