April 22, 2025

InfiGUI-R1: A New Approach to Multimodal GUI Agents

From Reactive Actors to Deliberate Thinkers: InfiGUI-R1 Revolutionizes Multimodal GUI Agents

Multimodal large language models (MLLMs) have driven the development of graphical user interface agents (GUI agents), showing promising results in automating tasks on computing devices. The ability to process both visual and linguistic information allows these agents to perform complex tasks from user instructions. While initial successes have been achieved in integrating reasoning mechanisms into GUI agents, challenges remain in robustness and adaptability, especially in complex GUI environments.

Many current approaches rely on manually designed reasoning templates, but such templates reach their limits in dynamic and unpredictable GUI environments. Another weakness lies in the reactive nature of many existing agents: they act as "Reactive Actors," relying primarily on implicit reasoning that is often too shallow for complex GUI tasks requiring planning and error recovery.

To address these challenges, the researchers present InfiGUI-R1, an MLLM-based GUI agent developed with the Actor2Reasoner framework. The framework follows a two-stage training approach that gradually evolves agents from Reactive Actors into Deliberate Thinkers.

The first stage, "Reasoning Injection," establishes a basic reasoning mechanism via Spatial Reasoning Distillation, which transfers spatial reasoning abilities from teacher models to the MLLM. Trained on trajectories with explicit reasoning steps, the model learns to integrate visual-spatial GUI information with logical reasoning before generating actions.
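To make the idea of distillation on reasoning-annotated trajectories concrete, here is a minimal Python sketch of how such a supervised fine-tuning example might be assembled. The field names, the <think>/<action> tags, and the example values are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class DistillationExample:
    """One fine-tuning example built from a teacher-generated trajectory step."""
    screenshot_path: str   # GUI screenshot the student model sees
    instruction: str       # user task, e.g. "Turn on Wi-Fi"
    reasoning: str         # teacher's explicit spatial reasoning trace
    action: str            # target action string, e.g. "click(540, 1210)"


def build_target(example: DistillationExample) -> str:
    """Compose the training target: reasoning first, then the action.

    Requiring the model to emit an explicit reasoning span before the action
    is the core of the "reasoning injection" idea sketched here.
    """
    return (
        f"<think>{example.reasoning}</think>\n"
        f"<action>{example.action}</action>"
    )


# Illustrative usage with made-up values
ex = DistillationExample(
    screenshot_path="screens/settings_home.png",
    instruction="Turn on Wi-Fi",
    reasoning="The Wi-Fi row is the second list entry, around y=1210 px; "
              "its toggle sits at the right edge near x=540 px.",
    action="click(540, 1210)",
)
print(build_target(ex))
```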

In the second stage, "Deliberation Enhancement," reinforcement learning refines this basic reasoner into a deliberate thinker. Two approaches characterize this stage: "Sub-goal Guidance" rewards the model for generating accurate intermediate goals, while "Error Recovery Scenario Construction" builds training scenarios from identified error-prone steps, teaching the agent to recognize mistakes and react appropriately.
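The following sketch illustrates how a reward that combines action correctness with sub-goal guidance could be structured. The function name, weights, exact matching on actions, and the similarity score are placeholder assumptions for illustration; they are not the paper's reward definition.

```python
from typing import Optional


def step_reward(
    predicted_action: str,
    reference_action: str,
    predicted_subgoal: Optional[str] = None,
    reference_subgoal: Optional[str] = None,
    subgoal_similarity: float = 0.0,
    w_action: float = 1.0,
    w_subgoal: float = 0.5,
) -> float:
    """Composite reward for one trajectory step (illustrative only).

    - Action term: 1.0 if the predicted action exactly matches the
      reference step, else 0.0.
    - Sub-goal term: credits the model for stating an accurate intermediate
      goal; `subgoal_similarity` is assumed to be a precomputed score in
      [0, 1] comparing the predicted and reference sub-goals.
    """
    action_term = 1.0 if predicted_action == reference_action else 0.0
    subgoal_term = 0.0
    if predicted_subgoal is not None and reference_subgoal is not None:
        subgoal_term = max(0.0, min(1.0, subgoal_similarity))
    return w_action * action_term + w_subgoal * subgoal_term


# Example: correct action, partially matching sub-goal
print(step_reward(
    predicted_action="click(540, 1210)",
    reference_action="click(540, 1210)",
    predicted_subgoal="open the Wi-Fi settings page",
    reference_subgoal="navigate to Wi-Fi settings",
    subgoal_similarity=0.8,
))  # -> 1.4
```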

Initial test results show that InfiGUI-R1 achieves impressive performance on GUI grounding and trajectory tasks. Its ability to handle complex tasks in dynamic GUI environments underscores the potential of this approach. The development of InfiGUI-R1 represents an important step towards robust and adaptable GUI agents and opens up new possibilities for automating complex tasks on computing devices.

The research results and source code are publicly available, offering developers and researchers the opportunity to build upon and further develop this promising technology.

Bibliography:
- https://huggingface.co/papers/2504.14239
- https://twitter.com/_akhaliq/status/1914609694460580207
- https://x.com/_akhaliq/status/1914609793433509909
- https://arxiv.org/list/cs.CL/new
- https://arxiv.org/list/cs/new
- http://lonepatient.top/2025/04/22/arxiv_papers_2025-04-22.html