Automating tasks on mobile devices with GUI (Graphical User Interface) agents holds enormous potential. These intelligent systems are meant to carry out complex workflows on smartphones and tablets autonomously, from booking a flight to managing calendar entries. The challenge, however, lies in the enormous variety of apps and the individual needs of users. Conventional approaches that rely on large-scale datasets for training or fine-tuning reach their limits here. A new approach based on human demonstrations now promises a solution.
Instead of trying to achieve universal generalization through ever-larger datasets, the new approach focuses on learning from concrete examples: human users demonstrate the desired task, and the agent learns from these demonstrations how to carry out similar tasks on its own in the future. This promises greater adaptability to individual needs and specific app environments.
To advance research in this area, LearnGUI was developed: the first comprehensive dataset specifically designed for demonstration-based learning of mobile GUI agents. LearnGUI comprises 2,252 offline tasks and 101 online tasks, each paired with high-quality human demonstrations. The dataset gives researchers a standardized environment in which to develop and evaluate new methods.
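For intuition, a single entry in such a benchmark could pair a task instruction with the recorded demonstration trace roughly as sketched below. The field names and structure are an illustrative guess for explanatory purposes, not the actual LearnGUI schema.

```python
# Illustrative only (not the official LearnGUI format): one offline task
# paired with a step-by-step human demonstration.
example_entry = {
    "task_id": "offline_0001",
    "app": "calendar",
    "instruction": "Create an event called 'Team sync' tomorrow at 10 am",
    "demonstration": [  # recorded human actions, one UI step per entry
        {"step": 1, "action": "tap",  "element": "new_event_button"},
        {"step": 2, "action": "type", "element": "title_field", "text": "Team sync"},
        {"step": 3, "action": "tap",  "element": "save_button"},
    ],
    "screens": ["screen_001.png", "screen_002.png", "screen_003.png"],  # UI snapshots
}
```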
LearnAct is an advanced multi-agent framework that automatically extracts knowledge from demonstrations and uses this knowledge to improve task completion. The framework consists of three specialized agents:
- DemoParser: analyzes the human demonstrations and extracts relevant knowledge.
- KnowSeeker: retrieves the knowledge most relevant to the task at hand.
- ActExecutor: executes the task using the retrieved knowledge.

By combining these three agents, LearnAct enables efficient, targeted learning from demonstrations; a sketch of how such a pipeline could fit together follows below.
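The following minimal Python sketch illustrates the division of labor among the three roles. The class and method names, the keyword-overlap retrieval, and the prompt assembly are assumptions made for illustration; they are not the LearnAct implementation or API, in which knowledge extraction and execution are driven by (vision-)language models.

```python
from dataclasses import dataclass


@dataclass
class DemoKnowledge:
    """Knowledge extracted from one human demonstration (illustrative structure)."""
    task: str         # natural-language instruction the demo solved
    steps: list[str]  # generalized action descriptions, e.g. "tap on 'search field'"


class DemoParser:
    """Turns raw demonstration traces into reusable knowledge entries."""
    def parse(self, task: str, trace: list[dict]) -> DemoKnowledge:
        # Here: a trivial textual summary of each recorded UI action.
        steps = [f"{a['action']} on '{a['target']}'" for a in trace]
        return DemoKnowledge(task=task, steps=steps)


class KnowSeeker:
    """Retrieves the demonstration knowledge most relevant to a new task."""
    def __init__(self, knowledge_base: list[DemoKnowledge]):
        self.knowledge_base = knowledge_base

    def retrieve(self, query: str, k: int = 1) -> list[DemoKnowledge]:
        # Keyword-overlap scoring as a stand-in for embedding-based retrieval.
        def score(entry: DemoKnowledge) -> int:
            return len(set(query.lower().split()) & set(entry.task.lower().split()))
        return sorted(self.knowledge_base, key=score, reverse=True)[:k]


class ActExecutor:
    """Executes the new task, conditioning the agent on retrieved knowledge."""
    def execute(self, task: str, knowledge: list[DemoKnowledge]) -> str:
        prompt = [f"Task: {task}", "Relevant demonstrations:"]
        for entry in knowledge:
            prompt.append(f"- {entry.task}: " + "; ".join(entry.steps))
        # In a real system, this prompt would go to a model that emits UI actions.
        return "\n".join(prompt)


# Wiring the three agents together on a single demonstration.
parser = DemoParser()
demo = parser.parse(
    task="Book a flight from Berlin to Paris",
    trace=[{"action": "tap", "target": "search field"},
           {"action": "type", "target": "Berlin to Paris"},
           {"action": "tap", "target": "book button"}],
)
seeker = KnowSeeker([demo])
executor = ActExecutor()
print(executor.execute("Book a flight to Rome", seeker.retrieve("Book a flight to Rome")))
```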
Initial experiments with LearnAct show promising results. In offline tests, a single demonstration was able to increase the accuracy of the Gemini-1.5-Pro model from 19.3% to 51.7%. In online tests, the framework improved the success rate of UI-TARS-7B-SFT from 18.1% to 32.8%. These results underscore the potential of demonstration-based learning for the development of more adaptable, personalized, and deployable mobile GUI agents. LearnAct and LearnGUI lay the foundation for further research in this promising area and could fundamentally change the way we interact with our mobile devices.
For companies like Mindverse, which specialize in the development of AI-powered solutions, these advances open up new possibilities. The development of customized chatbots, voicebots, AI search engines, and knowledge systems could be made significantly more efficient and targeted through the use of demonstration-based learning. Personalized adaptation to individual customer needs and integration into specific application scenarios are thus within reach.
Bibliography:
- Liu, G., Zhao, P., Liu, L., Chen, Z., Chai, Y., Ren, S., Wang, H., He, S., & Meng, W. (2025). LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark. arXiv preprint arXiv:2504.13805. https://arxiv.org/abs/2504.13805
- https://arxiv.org/html/2504.13805v1
- https://www.chatpaper.ai/zh/dashboard/paper/e97b17e7-7e81-4c76-ab40-796af87b1ca0
- https://huggingface.co/papers
- https://github.com/showlab/Awesome-GUI-Agent
- https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List
- https://openreview.net/forum?id=QarKTT5brZ
- https://www.preprints.org/manuscript/202501.0413/v1
- https://aclanthology.org/2022.emnlp-main.449.pdf
- https://www.researchgate.net/publication/386191538_Android_in_the_Zoo_Chain-of-Action-Thought_for_GUI_Agents