Autonomous agents have shown significant potential in automating complex, multi-step decision-making tasks. However, even state-of-the-art Vision-Language Models (VLMs) like GPT-4o fall short of human performance, particularly in complex web environments and tasks involving long-term planning.
The main challenges for autonomous AI agents typically lie in long-horizon planning, reliably evaluating intermediate states, and learning from past interactions in complex environments.
To address these limitations, Reflective Monte Carlo Tree Search (R-MCTS) has been developed: a novel test-time algorithm designed to enhance the ability of AI agents, such as those based on GPT-4o, to explore the decision space on the fly. R-MCTS extends traditional MCTS with two key aspects: contrastive reflection, which allows the agent to learn from past interactions and dynamically improve its search efficiency, and multi-agent debate, which provides more reliable state evaluation.
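The classic MCTS loop that R-MCTS builds on can be sketched as follows. This is a minimal toy version: the integer environment, the action set, and the `evaluate` function (standing in here for VLM-based state evaluation, which the paper implements via multi-agent debate) are all illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

random.seed(0)

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # UCT score: exploit the average value, explore rarely-visited children
        if self.visits == 0:
            return float("inf")
        return self.value_sum / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def evaluate(state):
    # Stand-in for learned state evaluation; here: closeness to goal state 10
    return 1.0 - abs(10 - state) / 10.0

def actions(state):
    return [+1, -1]

def step(state, action):
    return state + action

def mcts(root_state, iterations=200, horizon=12):
    root = Node(root_state)
    for _ in range(iterations):
        node, depth = root, 0
        # 1) Selection: descend via UCT while the node is fully expanded
        while len(node.children) == len(actions(node.state)) and depth < horizon:
            node = max(node.children.values(), key=lambda n: n.ucb())
            depth += 1
        # 2) Expansion: add one untried action
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried and depth < horizon:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3) Evaluation: score the new state (replacing a random rollout)
        value = evaluate(node.state)
        # 4) Backpropagation: update statistics up to the root
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Recommend the most-visited action at the root
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts(0))  # prints 1: the search favors stepping toward the goal
```

R-MCTS replaces the hand-written `evaluate` with a VLM-based judgment and injects reflections from earlier episodes into the agent's prompts, but the selection/expansion/evaluation/backpropagation skeleton is the same.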
Furthermore, the agent's performance can be improved through self-learning: GPT-4o is fine-tuned on the tree traversals generated by R-MCTS, without requiring any human-provided labels.
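One simple way such search traversals could be turned into label-free training data is sketched below: follow the search tree's preferred branch and treat each (observation, chosen action) pair as a supervised example. The tree layout and the prompt/completion format are illustrative assumptions, not the paper's actual data pipeline.

```python
def best_path(node):
    """Follow the most-visited child at each step (the search's preferred trajectory)."""
    path = []
    while node["children"]:
        best = max(node["children"], key=lambda c: c["visits"])
        path.append((node["observation"], best["action"]))
        node = best
    return path

def to_finetune_examples(root):
    """Each (observation, chosen action) pair becomes one prompt/completion example."""
    return [
        {"prompt": f"Observation: {obs}\nNext action:", "completion": f" {act}"}
        for obs, act in best_path(root)
    ]

# Hypothetical traversal of a small shopping task, recorded by the search
tree = {
    "observation": "search page",
    "children": [
        {"action": "click result 2", "visits": 3,
         "observation": "wrong page", "children": []},
        {"action": "click result 1", "visits": 17,
         "observation": "item page",
         "children": [{"action": "add to cart", "visits": 9,
                       "observation": "cart", "children": []}]},
    ],
}

for ex in to_finetune_examples(tree):
    print(ex)
```

Because the labels come from the search statistics themselves (visit counts), no human annotation is needed; the fine-tuned model then imitates the decisions the search converged on.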
On the challenging VisualWebArena benchmark, the R-MCTS agent based on GPT-4o achieved a relative improvement of 6% to 30% on various tasks compared to the previous state-of-the-art.
It is shown that the knowledge gained through the test-time search can be effectively transferred back to GPT-4o through fine-tuning. The fine-tuned GPT-4o achieves 97% of R-MCTS's performance while using only a quarter of the computation at test time.
Qualitative results demonstrate that the fine-tuned GPT-4o model is able to explore the environment, assess a state, and revert to viable states when it recognizes that the current state cannot lead to success. R-MCTS and self-learning prove to be promising approaches for enhancing the reasoning and planning capabilities of VLMs for agent-based applications.
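The backtracking behavior described above can be illustrated with a minimal sketch: advance while the estimated value of the next state improves, and otherwise revert to the most promising state seen so far. The toy environment, value function, and policy are illustrative assumptions, not the paper's actual setup.

```python
import random

random.seed(3)

def evaluate(state):
    # Toy value estimate: closeness to the goal state 5
    return 1.0 - abs(5 - state) / 5.0

def propose(state):
    # Toy policy: usually steps toward the goal, occasionally blunders
    return state + random.choice([1, 1, -4])

def act_with_backtracking(start, steps=8):
    """Advance while the value improves; otherwise revert to the best state seen."""
    history = [start]
    state = start
    for _ in range(steps):
        nxt = propose(state)
        if evaluate(nxt) < evaluate(state):
            # Proposed step looks worse: backtrack to the most viable known state
            state = max(history, key=evaluate)
        else:
            state = nxt
            history.append(state)
    return state

print(act_with_backtracking(0))
```

The key property is that the agent never commits to a state it judges worse than one it has already visited, mirroring the fine-tuned model's ability to recognize dead ends and revert.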