State of GUI Agents
Understanding and navigating GUIs autonomously would be a significant productivity gain for users. In the past few weeks, several big announcements have been made in this space.
Microsoft has published a new paper, "UFO: A UI-Focused Agent for Windows OS Interaction". In the paper and the accompanying source code repository, they present an agent that can carry out a user's requests across Windows applications. I covered a similar work, "AppAgent: Multimodal Agents as Smartphone Users", in my previous post - https://shchegrikovich.substack.com/p/how-do-ai-agents-think. AppAgent focuses on smartphone usage, whereas UFO targets any application installed on Windows.
Some interesting points from the UFO paper:
UFO takes a screenshot of an application, labels the interactive elements on it, and sends the annotated screenshot to a GPT vision model; this mirrors what AppAgent does (see the first sketch after this list).
UFO uses a dual-agent framework: the first agent selects the application that should fulfil the user's request, and the second agent executes actions inside the selected application (see the second sketch below).
Usage of safeguards - the UFO paper proposes a list of actions that are considered sensitive, such as file deletion or application installation. These sensitive actions require explicit user approval (a minimal approval gate is sketched below).
The reasoning is based on the ReAct and Chain-of-Thought prompting paradigms.
UFO produces two plans, global and local. The local plan is a detailed description of the next steps inside the current application; the global plan orchestrates the entire task. The last sketch after this list shows what such a structured step response might look like.
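To make the perception step more concrete, here is a minimal sketch of an annotate-and-ask loop. It is not UFO's actual code: the element format, the prompt, and the model name are my assumptions; it simply draws numbered boxes on a screenshot and sends it to a vision-capable model via the OpenAI chat completions API.

```python
import base64
import io

from PIL import Image, ImageDraw
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def annotate(screenshot: Image.Image, elements: list[dict]) -> Image.Image:
    """Draw a numbered red box around every detected UI element (hypothetical element format)."""
    img = screenshot.copy()
    draw = ImageDraw.Draw(img)
    for i, el in enumerate(elements):  # el = {"name": ..., "box": (x1, y1, x2, y2)}
        draw.rectangle(el["box"], outline="red", width=2)
        draw.text((el["box"][0], el["box"][1]), str(i), fill="red")
    return img


def ask_model(annotated: Image.Image, request: str) -> str:
    """Send the annotated screenshot plus the user request to a vision model."""
    buf = io.BytesIO()
    annotated.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"User request: {request}. "
                         "Which labelled element should be clicked next? Answer with its number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```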
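The dual-agent split boils down to a simple control loop: one component picks the target application, the other repeatedly decides on the next action inside it. The class and method names below are my own stand-ins, not UFO's; the real agents replace the stubbed logic with LLM calls.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str         # e.g. "click", "type", "finish"
    target: str = ""  # labelled element the action applies to


class AppSelectionAgent:
    """First agent: chooses which open application should handle the request."""

    def select(self, request: str, open_windows: list[str]) -> str:
        # In UFO this is an LLM call over the window list and a desktop screenshot;
        # here we just pick the first window as a placeholder.
        return open_windows[0]


class ActionAgent:
    """Second agent: decides and executes actions inside the chosen application."""

    def __init__(self) -> None:
        self.steps = 0

    def next_action(self, request: str, app: str) -> Action:
        # In UFO this annotates a screenshot of `app` and asks the model
        # (see the previous sketch); here we stop after one dummy step.
        self.steps += 1
        return Action("finish") if self.steps > 1 else Action("click", "button_3")


def run(request: str, open_windows: list[str]) -> None:
    app = AppSelectionAgent().select(request, open_windows)
    actor = ActionAgent()
    while True:
        action = actor.next_action(request, app)
        if action.name == "finish":
            break
        print(f"Would execute {action.name} on {action.target} in {app}")


run("Attach report.pdf to a new email", ["Outlook", "Explorer"])
```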
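A safeguard of this kind is essentially an approval gate placed in front of the execution step. The list of sensitive action names below is illustrative, not the one from the paper.

```python
# Hypothetical allow/deny gate: sensitive actions are paused until the user confirms.
SENSITIVE_ACTIONS = {"delete_file", "install_application", "send_email"}  # illustrative list


def confirm_if_sensitive(action_name: str, description: str) -> bool:
    """Return True if the action may proceed."""
    if action_name not in SENSITIVE_ACTIONS:
        return True
    answer = input(f"The agent wants to perform '{description}'. Allow? [y/N] ")
    return answer.strip().lower() == "y"


if confirm_if_sensitive("delete_file", "delete C:\\Users\\me\\old_report.docx"):
    print("executing action")
else:
    print("action cancelled by user")
```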
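Finally, the global/local planning and the ReAct-style reasoning typically surface as a structured response the agent is asked to emit at every step. The exact field names below are assumptions, loosely modelled on what the UFO paper describes (observation, thought, plans, action, status).

```python
import json

# One step of the agent's structured output (field names are assumptions):
step_response = {
    "observation": "Outlook is open; the 'New Email' window shows an empty draft.",
    "thought": "To attach the report I first need to open the attachment dialog.",
    "global_plan": [   # orchestrates the whole task across applications and steps
        "Open a new email draft",
        "Attach report.pdf",
        "Fill in recipient and subject",
        "Send the email",
    ],
    "local_plan": [    # detailed next actions inside the current window
        "Click the 'Attach File' button (label 7)",
        "Select report.pdf in the file picker",
    ],
    "action": {"type": "click", "element": 7},
    "status": "CONTINUE",  # or "FINISH" when the task is done
}
print(json.dumps(step_response, indent=2))
```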
Both papers, UFO and AppAgent, are built on top of the multimodal capabilities of LLMs. They rely on few-shot prompting for grounding, but the underlying model still needs a solid understanding of GUIs. It sounds like there is room for specialised, or at least fine-tuned, LLMs.
Llama2D - https://github.com/Llama2D/llama2d. This GitHub repo contains code to create a dataset for fine-tuning an LLM to understand GUIs better. The dataset is built roughly as follows: take a screenshot of a web page and label every piece of text and every element on the page with its coordinates. It's a work in progress, but promising. A sketch of this kind of extraction follows below.
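As a rough illustration of that labelling step, here is a minimal sketch that collects interactive elements and their bounding boxes from a page with Playwright. This is my own approximation, not the Llama2D pipeline (which also uses OCR); the selector list and the output format are assumptions.

```python
import json

from playwright.sync_api import sync_playwright


def extract_labels(url: str) -> dict:
    """Collect visible interactive elements with coordinates for one page (illustrative format)."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path="page.png", full_page=True)
        records = []
        for el in page.query_selector_all("a, button, input, [role=button]"):
            box = el.bounding_box()  # None if the element is not rendered
            if box is None:
                continue
            records.append({
                "text": el.inner_text().strip() or el.get_attribute("aria-label") or "",
                "x": box["x"], "y": box["y"],
                "width": box["width"], "height": box["height"],
            })
        browser.close()
    return {"screenshot": "page.png", "elements": records}


if __name__ == "__main__":
    print(json.dumps(extract_labels("https://example.com"), indent=2))
```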
Adept.ai has raised $350M in funding and created the Adept Fuyu-Heavy model, a multimodal model built explicitly for GUI agents. In my previous post, I mentioned their foundational model Fuyu-8B - https://shchegrikovich.substack.com/p/multimodal-capabilities-of-llms.
From the world of open-source models, there is the paper "CogVLM: Visual Expert for Pretrained Language Models" and a follow-up work built on it, "CogAgent: A Visual Language Model for GUI Agents". CogVLM-17B is built on Vicuna-7B and adds a trainable visual expert module to the attention and FFN layers of the Transformer architecture; the idea is sketched below.
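Here is a simplified sketch of the visual-expert idea: image tokens get their own trainable QKV projections (and, in the full model, their own FFN) while text tokens keep the pretrained language-model weights. This is a toy approximation for illustration, not CogVLM's implementation.

```python
import torch
import torch.nn as nn


class VisualExpertAttention(nn.Module):
    """Toy attention layer where image tokens use separate, trainable QKV projections."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.qkv_text = nn.Linear(d_model, 3 * d_model)   # stands in for frozen LLM weights
        self.qkv_image = nn.Linear(d_model, 3 * d_model)  # trainable visual expert
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # image_mask: (batch, seq_len) bool, True where the token is an image patch.
        qkv = torch.where(image_mask.unsqueeze(-1), self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        attended, _ = self.attn(q, k, v)
        return self.out(attended)


layer = VisualExpertAttention(d_model=64, n_heads=4)
x = torch.randn(1, 10, 64)                        # 10 tokens of width 64
image_mask = torch.zeros(1, 10, dtype=torch.bool)
image_mask[:, :4] = True                          # first 4 tokens are image patches
print(layer(x, image_mask).shape)                 # torch.Size([1, 10, 64])
```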
Resources:
https://shchegrikovich.substack.com/p/how-do-ai-agents-think - How do AI Agents think?
https://shchegrikovich.substack.com/p/multimodal-capabilities-of-llms - Multimodal capabilities of LLMs
https://arxiv.org/abs/2402.07939 - UFO: A UI-Focused Agent for Windows OS Interaction
https://learnprompting.org/docs/advanced_applications/react - LLMs that Reason and Act
https://github.com/Llama2D/llama2d - 2D Positional Embeddings for Webpage Structural Understanding
https://www.adept.ai/blog/adept-fuyu-heavy - Adept Fuyu-Heavy: A new multimodal model
https://arxiv.org/abs/2312.08914 - CogAgent: A Visual Language Model for GUI Agents
https://arxiv.org/abs/2311.03079 - CogVLM: Visual Expert for Pretrained Language Models