How do AI Agents think?
There are three strategies for adding thinking capabilities to AI Agents in GenAI apps.
The first strategy relies on the reasoning capabilities of the LLM itself. In this scenario, the AI Agent communicates with the user via text responses. To implement it, one only needs to define a system prompt like this - "You are a friendly and helpful instructional coach helping teachers plan a lesson. …". This is the most basic kind of AI Agent today.
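Here is a minimal sketch of such an agent, assuming the OpenAI Python SDK; the model name and prompt wording are placeholders:

```python
# Minimal sketch of a "pure reasoning" agent: the only thinking machinery is the LLM itself.
# Assumes the official OpenAI Python SDK (pip install openai); the model name is illustrative.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a friendly and helpful instructional coach helping teachers plan a lesson."
)

def ask_agent(user_message: str) -> str:
    # A single chat completion: the system prompt sets the role, the user message carries the request.
    response = client.chat.completions.create(
        model="gpt-4",  # any chat-capable model works here
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(ask_agent("Help me plan a 45-minute lesson on photosynthesis."))
```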
The second strategy is to add tooling (function calling) into the equation. A few days ago, a paper was released - "AppAgent: Multimodal Agents as Smartphone Users". It describes an AI Agent that can operate a smartphone: you can ask the agent to set an alarm for 06:30 tomorrow, and the agent will execute this for you.
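Before diving into AppAgent, here is what plain function calling looks like with the OpenAI API; the set_alarm tool is invented for illustration:

```python
# Minimal function-calling sketch with the OpenAI Python SDK.
# The set_alarm tool is made up for this example; the model only *asks* for the call,
# your own code has to execute it.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "set_alarm",
        "description": "Set an alarm on the user's phone.",
        "parameters": {
            "type": "object",
            "properties": {"time": {"type": "string", "description": "HH:MM, 24-hour clock"}},
            "required": ["time"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Set an alarm for 06:30 tomorrow."}],
    tools=tools,
)

# If the model decided to use the tool, tool_calls holds the requested call.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, tool_call.function.arguments)  # e.g. set_alarm {"time": "06:30"}
```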
Inside the agent, there is a prompt for self-exploration of a mobile application - https://github.com/mnotgod96/AppAgent/blob/6d608cca862d1544f1de91e789fc63274c00a74f/scripts/prompts.py#L90. The structure of this prompt includes (a simplified sketch follows the list):
- Definition of the role
- Available functions to call - tap, text, long_press and swipe.
- User's request
- Previously executed actions
- A request to choose the next function to call
- Output format definition: Observation, Thought, Action and Summary
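As a rough paraphrase of that structure (not the exact AppAgent wording, which lives in the linked prompts.py), the template might look like this; placeholders in braces are filled at runtime:

```python
# Simplified paraphrase of an AppAgent-style action prompt; the real text is in scripts/prompts.py.
ACTION_PROMPT_TEMPLATE = """You are an agent trained to operate a smartphone. (role definition)

You can call the following functions on the labelled screenshot:
- tap(element: int)
- text(message: str)
- long_press(element: int)
- swipe(element: int, direction: str, distance: str)

The user's request is: {user_request}

Actions you have already performed: {action_history}

Choose ONE function to call next and respond in the following format:
Observation: <what you see on the screen>
Thought: <why you pick the next action>
Action: <the function call>
Summary: <what has been done so far>"""
```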
The app works like this: take a screenshot, label each UI element with a number, send the labelled screenshot and the prompt to the LLM, and execute the function the LLM returned. This loop lets the agent think about the next step, execute it, and reflect on the result.
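In pseudo-Python the loop could look like the sketch below, reusing the ACTION_PROMPT_TEMPLATE from the previous snippet; the helpers (take_screenshot, label_ui_elements, call_llm, parse_action, execute_action) are hypothetical and not part of the real AppAgent code:

```python
# Hypothetical sketch of the screenshot -> label -> LLM -> execute loop described above.
def run_agent(user_request: str, max_steps: int = 10) -> None:
    action_history: list[str] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                    # 1. take a screenshot
        labelled = label_ui_elements(screenshot)          # 2. number every UI element
        prompt = ACTION_PROMPT_TEMPLATE.format(           # 3. build the prompt
            user_request=user_request,
            action_history="\n".join(action_history) or "None",
        )
        reply = call_llm(prompt, image=labelled)          #    send prompt + labelled screenshot
        action = parse_action(reply)                      #    extract the Action line
        if action == "FINISH":
            break
        execute_action(action)                            # 4. run tap / text / long_press / swipe
        action_history.append(action)
```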
A swarm of agents is the third strategy. Instead of relying on one agent, we can build several of them. Suppose you have a research task. You can hire a director (accepts the user's request and delegates it), a manager (sets tasks for the researcher and reviews the results) and a researcher (does the job). See the video "Research agent 3.0 - Build a group of AI researchers - Here is how" for more details. We can use the AutoGen framework from Microsoft to orchestrate a swarm of agents. AutoGen provides the GroupChat and GroupChatManager abstractions, which select the next agent to speak by creating a prompt for the LLM. In that prompt, AutoGen describes the roles of the agents in the chat, includes the previous messages, and asks the LLM to choose the next speaker.
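A minimal sketch of such a swarm, assuming the AutoGen 0.2-style Python API (pyautogen); the agent names and system messages are illustrative, and newer AutoGen versions expose a different API:

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_KEY"}]}

# Director: passes the user's request into the chat and delegates the work.
director = autogen.UserProxyAgent(
    name="Director",
    human_input_mode="TERMINATE",
    code_execution_config=False,
)
# Manager: sets tasks for the Researcher and reviews the results.
manager_agent = autogen.AssistantAgent(
    name="Manager",
    system_message="You set tasks for the Researcher and review the results.",
    llm_config=llm_config,
)
# Researcher: does the actual job.
researcher = autogen.AssistantAgent(
    name="Researcher",
    system_message="You do the research and report back to the Manager.",
    llm_config=llm_config,
)

# GroupChatManager asks the LLM who should speak next on every round.
group_chat = autogen.GroupChat(agents=[director, manager_agent, researcher], messages=[], max_round=12)
chat_manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

director.initiate_chat(chat_manager, message="Research the latest papers on AI agents.")
```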
The core components of any AI Agent are Planning, Memory and Tool Use. These components can be delegated to the LLM or implemented separately. LLM Compiler, ReAct, Tree of Thoughts and other papers show ways to improve the thinking of AI Agents. It looks like graphs and domain-specific languages are promising directions for improving agents' quality.
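To make the decomposition concrete, here is a hypothetical skeleton that keeps the three components separate; none of these class or function names come from a real framework:

```python
# Hypothetical skeleton: Planning is delegated to the LLM, Memory and Tool Use live in plain code.
from typing import Callable

class Memory:
    def __init__(self) -> None:
        self.events: list[str] = []

    def remember(self, event: str) -> None:
        self.events.append(event)

    def recall(self, last_n: int = 10) -> str:
        # Keep only the last few steps in the prompt to stay within the context window.
        return "\n".join(self.events[-last_n:])

class Tools:
    def __init__(self) -> None:
        # Registry of callable tools; a real agent would register search, code execution, etc.
        self.registry: dict[str, Callable[[str], str]] = {}

    def call(self, name: str, argument: str) -> str:
        return self.registry[name](argument)

def plan_next_action(llm: Callable[[str], str], goal: str, memory: Memory) -> str:
    # Planning step: the LLM sees the goal plus recalled history and proposes the next tool call,
    # e.g. "search: recent papers on LLM agents".
    return llm(f"Goal: {goal}\nHistory:\n{memory.recall()}\nNext action:")
```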
Here is a list of additional resources:
https://openai.com/blog/teaching-with-ai - Teaching with AI
https://appagent-official.github.io/ - AppAgent: Multimodal Agents as Smartphone Users
YouTube - "Research agent 3.0 - Build a group of AI researchers - Here is how"
https://microsoft.github.io/autogen/ - AutoGen, Enable Next-Gen Large Language Model Applications
https://lilianweng.github.io/posts/2023-06-23-agent/ - LLM Powered Autonomous Agents
https://arxiv.org/abs/2312.04511 - An LLM Compiler for Parallel Function Calling (paper)
https://platform.openai.com/docs/assistants/ - OpenAI Assistants Docs