How do you choose an AI Agent as a business user?
AI Agents are changing how we use computers and how we work: they help us write code and conduct research, and Microsoft's CEO Satya Nadella has suggested that AI agents will replace traditional SaaS business applications. As end users of AI agents, we need to understand which one is best for us, and these tools cost more than many expect: Devin, an end-to-end software-building agent, starts at 500 USD per month, and OpenAI Pro costs 200 USD per month.
In an ideal world, we would give an AI agent a task and it would solve it autonomously. That is our expectation of such systems, given the promise that AGI is just around the corner. The reality is different, for several reasons:
- Lack of generalization and robustness: many agents are designed for specific tasks and fail to transfer to other domains, or break when the context changes or an unexpected situation arises.
- Scalability: even a simple task such as writing an article decomposes into a set of sub-goals (do research, prepare an outline, select a tone, write each paragraph, select images, rewrite for consistency), and each sub-goal may involve many model calls.
- Coordination and communication: agents talk to other agents, to external systems, and to users, and they must handle all of these channels well.
- Ethical concerns and safety: biased decision-making or harmful behaviour leads to unintended consequences.
What we really need from AI agents is value generation; adaptable personalization, or flexibility (the agent must adapt to the task context and to unique user requirements); trustworthiness (high accuracy and transparency of execution); social acceptability; and standardization (simplified deployment and easy connection of agents to each other and to third-party systems). While the last two can be addressed with sound engineering practices and time, the first three are harder.
If we go deeper, we realize that the claimed quality of modern AI agents rests on benchmarks: to claim a SOTA-level agent, you show its score on popular benchmarks, and that is a problem. By optimizing only for accuracy, agents can miss real-world scenarios where multiple answers are acceptable. An agent under benchmark may also run with a different configuration than in production; for example, increasing the number of retries can raise measured accuracy. Another trick is benchmark-specific hacks, such as rule-based steps that extract information from a web page based on its URL, which will not transfer to real tasks.
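The retry effect is easy to quantify. A minimal sketch, assuming each attempt succeeds independently with probability p and that an oracle verifier stops at the first correct answer (an assumption benchmarks often grant but production rarely does): accuracy after k retries is 1 - (1 - p)^k, while cost grows with the expected number of attempts.

```python
def retry_accuracy(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

def expected_attempts(p: float, k: int) -> float:
    """Expected attempts when stopping at the first success, capped at k
    (truncated geometric distribution)."""
    return sum((1 - p) ** (i - 1) for i in range(1, k + 1))

# With a 60%-accurate agent, retries buy accuracy at a cost:
for k in (1, 3, 5):
    print(k, round(retry_accuracy(0.6, k), 3), round(expected_attempts(0.6, k), 2))
# k=1: accuracy 0.6 at 1.0x cost; k=3: 0.936 at 1.56x; k=5: 0.99 at 1.65x
```

This is why a leaderboard score without the retry configuration (and its cost) tells you little about the agent you will actually pay for.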
To prepare agents for real-world scenarios, developers need to evaluate more than accuracy. One option is to report costs: by jointly optimizing two metrics, cost and accuracy, developers will make agents both cheaper and better, since restrictions drive innovation. In short, seeing costs lets us judge the value of an agent. And if developers adopt benchmarking best practices such as reproducibility and held-out test samples, trust in agents will grow.
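Jointly optimizing cost and accuracy means there is no single "best" agent, only a Pareto frontier: agents that no other agent beats on both metrics at once. A minimal sketch, with hypothetical agent names and made-up (cost, accuracy) numbers for illustration:

```python
def pareto_frontier(agents: list[tuple[str, float, float]]) -> list[str]:
    """Return names of agents not dominated on (cost, accuracy).

    An agent is dominated if some other agent is cheaper or equal AND
    at least as accurate, and strictly better on at least one metric.
    """
    frontier = []
    for name, cost, acc in agents:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for n2, c2, a2 in agents
            if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical candidates: (name, USD per task, benchmark accuracy)
agents = [
    ("cheap-agent", 1.0, 0.70),
    ("premium-agent", 5.0, 0.90),
    ("overpriced-agent", 6.0, 0.85),  # dominated: costs more, scores less
]
print(pareto_frontier(agents))  # → ['cheap-agent', 'premium-agent']
```

A business user then picks a point on the frontier that matches their budget, instead of paying for an agent that a cheaper one strictly beats.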
References:
AI Agents That Matter - https://arxiv.org/abs/2407.01502
Agents Are Not Enough - https://arxiv.org/abs/2412.16241v1
Satya Nadella | BG2 w/ Bill Gurley & Brad Gerstner -