Artificial intelligence (AI) has made rapid progress in recent years, particularly in machine learning. One promising approach is reinforcement learning (RL), in which models learn to take good actions by interacting with an environment and receiving rewards. Large language models (LLMs) have shown that they can acquire complex behaviors such as multi-step reasoning and self-reflection through RL with simple rule-based rewards. However, these approaches, often called "zero" RL because RL is applied directly to the base model, hit a ceiling: training relies solely on the model's own on-policy outputs, so learning remains largely confined to the base model's initial capabilities.
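To make the idea of rule-based rewards concrete, the sketch below shows a minimal verifier-style reward of the kind commonly used in such setups: it checks only whether the final answer is correct and properly formatted. The function name, the \boxed{...} answer convention, and the 0/1 scoring are illustrative assumptions, not details taken from the LUFFY paper.

```python
# A minimal sketch of a rule-based reward for reasoning tasks (illustrative,
# not taken from the LUFFY paper): the reward only checks whether the final
# answer is present in the expected format and matches the reference.
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 for a correctly formatted, correct final answer, else 0.0."""
    # Assumption: the model is prompted to wrap its final answer in \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parsable final answer -> no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example usage
print(rule_based_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
print(rule_based_reward("The answer is 42.", "42"))                        # 0.0
```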
A new research approach called LUFFY (Learning to reason Under oFF-policY guidance) promises to overcome this hurdle. LUFFY augments zero RL with off-policy reasoning traces: demonstrations generated outside the model's own policy, for instance by a stronger external model, so the learner is no longer limited to its own generations. By mixing these off-policy demonstrations with on-policy rollouts during training, LUFFY dynamically balances imitation and exploration.
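As a rough illustration of what mixing the two data sources could look like in practice, the following sketch assembles a training batch containing a few on-policy rollouts plus a pre-recorded off-policy demonstration per prompt. The helper names (build_mixed_batch, sample_on_policy) and the mixing ratio are hypothetical and chosen for clarity; they are not LUFFY's exact training recipe.

```python
# A sketch of assembling a mixed-policy training batch: several on-policy
# rollouts from the current model plus one pre-recorded off-policy
# demonstration per prompt. Names and the mixing ratio are hypothetical.
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Trace:
    prompt: str
    completion: str
    off_policy: bool  # True if the trace comes from an external source

def build_mixed_batch(prompts: List[str],
                      sample_on_policy: Callable[[str], str],
                      off_policy_traces: Dict[str, str],
                      n_rollouts: int = 8,
                      n_off_policy: int = 1) -> List[Trace]:
    """Combine on-policy rollouts with off-policy demonstrations per prompt."""
    batch: List[Trace] = []
    for prompt in prompts:
        # On-policy rollouts, generated by the current model.
        for _ in range(n_rollouts - n_off_policy):
            batch.append(Trace(prompt, sample_on_policy(prompt), off_policy=False))
        # Off-policy demonstration, e.g. a reasoning trace from a stronger model.
        demo = off_policy_traces.get(prompt)
        if demo is not None:
            for _ in range(n_off_policy):
                batch.append(Trace(prompt, demo, off_policy=True))
    random.shuffle(batch)
    return batch

# Example with a stub "policy" that returns a fixed completion:
batch = build_mixed_batch(["2+2=?"],
                          sample_on_policy=lambda p: "Let me think... the answer is 4.",
                          off_policy_traces={"2+2=?": "2+2 equals 4, so the answer is 4."},
                          n_rollouts=4)
print(len(batch), sum(t.off_policy for t in batch))  # 4 1
```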
A central element of LUFFY is policy shaping via regularized importance sampling. This technique discourages superficial, rigid imitation during mixed-policy training and instead emphasizes important but low-probability actions. The results are impressive: LUFFY achieves an average gain of more than 7.0 points across six mathematical benchmarks and an advantage of more than 6.2 points on out-of-distribution tasks. Compared to imitation-based supervised fine-tuning (SFT), LUFFY performs significantly better, particularly in terms of generalization.
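One plausible way to realize such a regularized importance weight is to replace the raw importance ratio with a saturating transform of the current policy's token probability, e.g. f(x) = x / (x + γ). The sketch below implements this reading; γ, the function name, and the exact form of the transform are assumptions for illustration and may differ from the paper's formulation.

```python
# A sketch of regularized importance weights for off-policy tokens: instead of
# the raw ratio pi_theta / pi_behavior, use a saturating transform
# f(x) = x / (x + gamma) of the current policy's token probability.
# gamma and the exact functional form are assumptions for illustration.
import torch

def shaped_importance_weights(logp_theta: torch.Tensor,
                              gamma: float = 0.1) -> torch.Tensor:
    """Per-token weights f(p) = p / (p + gamma) from current-policy log-probs."""
    p_theta = logp_theta.exp()          # token probabilities under the current policy
    return p_theta / (p_theta + gamma)  # saturates toward 1 for confident tokens

# Confident tokens get weights near 1; rare tokens keep a small but
# non-vanishing weight, and their gradient f'(p) = gamma / (p + gamma)^2 is large.
logp = torch.log(torch.tensor([0.9, 0.05, 0.001]))
print(shaped_importance_weights(logp))  # ~tensor([0.9000, 0.3333, 0.0099])
```

Because the derivative of f grows as the token probability shrinks, the gradient signal on rare but important tokens is amplified rather than suppressed, which matches the motivation described above.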
The analysis shows that LUFFY does not merely imitate effectively; it also explores beyond the provided demonstrations. This opens a scalable path to training generalizable reasoning models with off-policy guidance. The ability to learn from external data while developing its own strategies is a crucial step toward more robust and adaptable AI systems, and it demonstrates the potential of off-policy learning to push the boundaries of machine reasoning and enable more complex reasoning abilities.
The implications of this research are far-reaching. From improving mathematical skills to solving complex problems in various fields, LUFFY could pave the way for a new generation of AI models capable of independent learning and adaptation to new situations.
For companies like Mindverse, which specialize in the development of AI solutions, these advancements open up new opportunities. Integrating off-policy learning methods into customized solutions such as chatbots, voicebots, AI search engines, and knowledge systems could significantly enhance their performance and flexibility.
Bibliography:
Yan, J., Li, Y., Hu, Z., Wang, Z., Cui, G., Qu, X., Cheng, Y., & Zhang, Y. (2025). Learning to Reason under Off-Policy Guidance. *arXiv preprint arXiv:2504.14945*.
Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction*. MIT Press.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. In *International Conference on Machine Learning* (pp. 1889-1897). PMLR.
Pacchiano, A., Ball, P., Parker-Holder, J., Choromanski, K., & Roberts, S. (2020). On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems. *arXiv preprint arXiv:2006.11807*.
Hausknecht, M., & Stone, P. (2016). Deep Reinforcement Learning in Parameterized Action Space. *arXiv preprint arXiv:1511.04143*.