October 11, 2024

A Comprehensive Benchmark for Evaluating LLM Workflow Generation

## Large Language Models for Agentic Workflow Generation: A Benchmark and Evaluation

Large language models (LLMs) have driven significant progress on reasoning and planning tasks thanks to their ability to handle a wide variety of problems. Decomposing a complex problem into an executable workflow is a crucial step in this process. Existing frameworks for evaluating workflows either focus only on holistic performance or suffer from limited scenario coverage, simplistic workflow structures, and lax evaluation standards.

## Evaluating Workflow Generation by LLMs

To address these shortcomings, the authors developed WorFBench, a unified workflow generation benchmark with diverse scenarios and complex graph-based workflow structures. They also present WorFEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to quantify how well LLM agents generate workflows.

Comprehensive evaluations across different types of LLMs revealed significant differences between the sequence planning and graph planning capabilities of LLM agents, with even GPT-4 showing a gap of about 15%.

The generated workflows were also observed to benefit downstream tasks, which reach higher performance in less inference time.

## Background and Motivation

The motivation for developing WorFBench and WorFEval stems from the observation that many LLM-based agents share similar workflows and components, despite a variety of technical and conceptual challenges. Previous approaches often focused on specific aspects such as search algorithms, tree structures, or reinforcement learning components, but were not integrated with common agent workflows.

The goal of the project is therefore to clarify the role of LLMs within agent workflows and to investigate the reusability of LLM profiles.

## Key Findings

The research results show that:

* LLM agents are capable of generating complex workflows.
* There are significant differences between the sequence planning and graph planning capabilities of LLMs.
* The generated workflows can improve performance on downstream tasks.

## Outlook

The development of WorFBench and WorFEval is an important step towards better understanding and evaluating how well LLMs generate workflows. Future work could focus on LLMs that can generate more complex workflows, and on applying these workflows in real-world scenarios.
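
To make the matching-based evaluation idea mentioned under "Evaluating Workflow Generation by LLMs" more concrete, here is a minimal Python sketch. It is not the WorFEval implementation. It assumes that workflow nodes are plain strings compared by exact equality, scores a predicted node chain against a gold chain with a longest common subsequence, and uses simple edge-set overlap as a rough stand-in for graph-level comparison. All function names and the example workflow are hypothetical.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def lcs_length(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence of two node chains."""
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, g in enumerate(gold, start=1):
            if p == g:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(gold)]


def f1(matched: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matched == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision = matched / n_pred
    recall = matched / n_gold
    return 2 * precision * recall / (precision + recall)


def sequence_score(pred_chain: Sequence[str], gold_chain: Sequence[str]) -> float:
    """Order-aware score: how much of the gold chain is reproduced in order."""
    matched = lcs_length(pred_chain, gold_chain)
    return f1(matched, len(pred_chain), len(gold_chain))


def edge_score(
    pred_edges: Iterable[Tuple[str, str]],
    gold_edges: Iterable[Tuple[str, str]],
) -> float:
    """Structure-aware score based on shared directed edges (an assumption,
    much simpler than true subgraph matching)."""
    pred_set, gold_set = set(pred_edges), set(gold_edges)
    return f1(len(pred_set & gold_set), len(pred_set), len(gold_set))


# Hypothetical gold workflow: "filter" and "compare" can run in parallel.
gold_chain = ["search", "filter", "compare", "book"]
gold_edges = [
    ("search", "filter"),
    ("search", "compare"),
    ("filter", "book"),
    ("compare", "book"),
]

# Hypothetical prediction that linearizes the two parallel steps.
pred_chain = ["search", "compare", "filter", "book"]
pred_edges = [("search", "compare"), ("compare", "filter"), ("filter", "book")]

seq = sequence_score(pred_chain, gold_chain)
graph = edge_score(pred_edges, gold_edges)
```

In this toy case the prediction recovers most of the step order but misses the parallel structure, so the edge-based score comes out lower than the sequence score. That is the kind of gap between sequence planning and graph planning that the article reports.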