How to Do Data Analysis with an LLM?
I'm going to compare the data analysis capabilities of two LLMs. Let's start with a simple example: 2 + 2 = ? All LLMs these days answer 4, which is correct. Now let's submit this query: 13,423 * 3,413,432 = ? Well, it's not that simple. As of this writing, Claude 3.5 Sonnet returns 45,772,470,269, which is correct again. Or is it? We need to check with another tool. ChatGPT 4o returns 45,818,497,736. The difference is around 46M. OK, the calculator says 45,818,497,736. So the winner is Chat… calculator. Yes, the winner is a calculator.
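Token-by-token text generation is a poor fit for exact big-number arithmetic, but one line of ordinary code settles the dispute:

```python
# Exact integer arithmetic: no token-by-token guessing involved.
product = 13_423 * 3_413_432
print(product)  # 45818497736 -- matches ChatGPT 4o and the calculator
```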
If you ask me 5 * 5 = ?, I'll tell you 25. But do I remember the answer, or am I just good at math? It's hard to tell. What we can say for sure is that LLMs need special prompting to do data analysis correctly.
In the recent paper "Hybrid LLM/Rule-based Approaches to Business Insights Generation from Structured Data," the Narrative BI team showed that hybrid methods (rule-based + LLM) work better than either method alone. Interestingly, in the example above, Sonnet tried a variant of chain-of-thought (CoT) prompting, whereas ChatGPT used a Python interpreter to write some code. This might be a sign that a rule inside ChatGPT triggers the code interpreter for such requests.
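One way to picture such a rule (a guess at the mechanism, not ChatGPT's actual implementation) is a router that sends arithmetic-looking queries to exact computation and everything else to the model. All names here are illustrative:

```python
import re

def looks_like_arithmetic(query: str) -> bool:
    """Heuristic rule: the query is only digits, operators, and separators."""
    stripped = query.replace(",", "").replace("=", "").replace("?", "").strip()
    return bool(re.fullmatch(r"[\d\s\.\+\-\*/\(\)]+", stripped))

def call_llm(query: str) -> str:
    """Placeholder for an LLM call; a real system would query the model here."""
    return "(answer generated by the LLM)"

def route(query: str) -> str:
    if looks_like_arithmetic(query):
        expr = query.replace(",", "").replace("=", "").replace("?", "").strip()
        # eval is acceptable here only because the regex above restricts the
        # input to pure arithmetic characters.
        return str(eval(expr, {"__builtins__": {}}, {}))
    return call_llm(query)

print(route("13,423 * 3,413,432 = ?"))  # exact arithmetic branch
print(route("Why did revenue drop in Q3?"))  # LLM branch
```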
"Data Interpreter: An LLM Agent For Data Science" shows how to build your own data interpreter. The paper proposes a data science pipeline to handle user questions. The key elements of the pipeline are dynamic planning with hierarchical structure, tool utilization and generation, and enhancing reasoning with verification and experience.
A data science pipeline consists of these steps: data exploration, feature engineering, model training, evaluation, and visualization. This pipeline can be represented as a directed acyclic graph (DAG). The LLM's first task is to generate the DAG: split the user's request into steps and identify the dependencies between them. As a result, we get a graph in which we know the order in which to execute each node.
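The pipeline above can be sketched as a toy DAG, with the standard library's `graphlib` producing a valid execution order (the step names are from the text; the dependency structure is an assumption for illustration):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the set of steps it depends on.
pipeline = {
    "data_exploration": set(),
    "feature_engineering": {"data_exploration"},
    "model_training": {"feature_engineering"},
    "evaluation": {"model_training"},
    "visualization": {"evaluation"},
}

# Topological order = an order in which every node's dependencies run first.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```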
Once the DAG is ready, we can generate Python code to execute it. For each node in the graph, a tool is selected: a Python class that implements the step. Once all the tools are chosen, we get the resulting solution in Python along with its unit tests.
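A minimal sketch of tool selection (the class and registry names are illustrative, not from the paper): each node name maps to a Python class with a common interface, and executing the DAG means running the tools in topological order:

```python
class Tool:
    """Common interface for pipeline tools (illustrative)."""
    def run(self, data):
        raise NotImplementedError

class DataExploration(Tool):
    def run(self, data):
        # A real tool would compute summary statistics; here it passes data through.
        return data

class FeatureEngineering(Tool):
    def run(self, data):
        # A real tool would derive new features; here it adds a squared column.
        return [(x, x * x) for x in data]

TOOL_REGISTRY = {
    "data_exploration": DataExploration,
    "feature_engineering": FeatureEngineering,
}

def execute(node_order, data):
    """Run the tool for each node, feeding each output into the next step."""
    for node in node_order:
        data = TOOL_REGISTRY[node]().run(data)
    return data

print(execute(["data_exploration", "feature_engineering"], [1, 2, 3]))
# [(1, 1), (2, 4), (3, 9)]
```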
Because the generated code might contain bugs, we need dynamic planning. When execution raises an error or the code fails a unit test, we re-plan to improve the solution.
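The re-planning loop can be sketched as follows. Here `generate_code` and `run_tests` are placeholders for the LLM-backed steps; the control flow (run, test, feed failures back, retry) is the point:

```python
def run_with_replanning(generate_code, run_tests, max_attempts=3):
    """Generate code, test it, and feed failures back into re-planning.

    `generate_code(feedback)` returns a candidate solution as source code;
    `run_tests(namespace)` raises on failure. Both stand in for LLM calls.
    """
    feedback = None
    for attempt in range(max_attempts):
        code = generate_code(feedback)
        try:
            namespace = {}
            exec(code, namespace)   # run the candidate solution
            run_tests(namespace)    # unit tests raise AssertionError on failure
            return namespace        # success: return the working solution
        except Exception as exc:
            # The error message becomes feedback for the next planning round.
            feedback = f"attempt {attempt + 1} failed: {exc}"
    raise RuntimeError(f"no working solution after {max_attempts} attempts")
```

A quick simulation: a first buggy candidate fails its unit test, the failure drives re-planning, and the second candidate passes.

```python
attempts = iter([
    "def add(a, b):\n    return a - b\n",  # buggy first attempt
    "def add(a, b):\n    return a + b\n",  # fixed after re-planning
])
ns = run_with_replanning(
    generate_code=lambda feedback: next(attempts),
    run_tests=lambda ns: (lambda: None)() if ns["add"](2, 2) == 4 else (_ for _ in ()).throw(AssertionError("add is broken")),
)
print(ns["add"](1, 2))  # 3
```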
The user's request, the Python code, and the execution result are used to perform Automated Confidence-based Verification. This final step tries to catch logical flaws before returning the result to the user.
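A rough sketch of the idea (the paper's actual verifier is LLM-based; here, majority voting over independently produced candidate answers stands in for confidence scoring):

```python
from collections import Counter

def verify_by_confidence(candidate_results):
    """Pick the most frequent candidate answer and report its share as confidence.

    Illustrative stand-in for the paper's Automated Confidence-based Verification.
    """
    counts = Counter(candidate_results)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(candidate_results)
    return answer, confidence

# Two candidates agree with the calculator; one (the hallucinated product) does not.
answer, confidence = verify_by_confidence([45818497736, 45818497736, 45772470269])
print(answer, round(confidence, 2))
```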
References:
https://arxiv.org/abs/2404.15604 - Hybrid LLM/Rule-based Approaches to Business Insights Generation from Structured Data
https://arxiv.org/abs/2402.18679 - Data Interpreter: An LLM Agent For Data Science