On one hand, LLMs show impressive reasoning capabilities; on the other, their reasoning is far from ideal. The problem may lie in the way LLMs work: they generate an answer sequentially in a single pass and cannot use loops or conditions. The limiting factor here is the autoregressive architecture of transformers. In addition, they struggle to reason about topics that fall outside the training set.
There are several ways to improve the reasoning capabilities of LLMs. The first is to sample many candidate answers and choose the best one. The second is to use special prompting techniques such as Chain of Thought (CoT). The third is to use programming languages. Suppose we ask a model to evaluate an expression such as 5+5. The model might write Python code to find the answer; in this case, Python is used as an intermediate language that helps the model "think". Recently, developers have started to use Prolog as such an intermediate language to improve reasoning.
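A minimal sketch of this "code as an intermediate language" pattern, assuming a hypothetical `ask_llm` helper standing in for a real model call:

```python
# Sketch of using Python as an intermediate language for reasoning.
# ask_llm() is a hypothetical stand-in for a real LLM API call.

def ask_llm(question: str) -> str:
    # A real implementation would prompt the model to answer with Python
    # code that stores its result in a variable named `answer`.
    return "answer = 5 + 5"

def solve_with_python(question: str):
    code = ask_llm(question)
    namespace = {}
    exec(code, namespace)           # run the model-generated program
    return namespace.get("answer")  # read back the computed result

print(solve_with_python("What is 5 + 5?"))  # -> 10
```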
Why Prolog? Prolog is short for Programming in Logic. It is a declarative programming language well suited to symbolic reasoning tasks: it is used where rule-based reasoning is required, in NLP tasks, and for theorem proving. Thanks to its declarative nature, it may be a bit easier for LLMs to generate Prolog code, because the model only has to state facts and rules rather than precise control flow.
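To make the "no explicit control flow" point concrete, here is a small sketch using the pyswip bindings to SWI-Prolog (both assumed to be installed): we only state facts and a rule, and the Prolog engine performs the search itself.

```python
# Declarative style: facts and a rule, no loops or branches spelled out.
# Requires SWI-Prolog and the pyswip package.
from pyswip import Prolog

prolog = Prolog()
# Facts: who is directly taller than whom.
prolog.assertz("taller(alice, bob)")
prolog.assertz("taller(bob, carol)")
# Rule: "taller than" is transitive; recursion replaces explicit iteration.
prolog.assertz("taller_than(X, Y) :- taller(X, Y)")
prolog.assertz("taller_than(X, Z) :- taller(X, Y), taller_than(Y, Z)")

# Ask whether alice is taller than carol; a non-empty result means "yes".
print(bool(list(prolog.query("taller_than(alice, carol)"))))  # True
```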
The "Reliable Reasoning Beyond Natural Language" paper proposes a neurosymbolic approach to improve reasoning. The idea is to convert the user's request to Prolog code, execute it and return the answer to the user. The paper suggests using Chain of thought (CoT) to generate Prolog code and using Multiple Try inference. If the model fails to generate a working Prolog code, then try one more time. This technique is similar to the Program of Thought (PoT) approach.
In the approach above, Prolog code is generated at request time; no additional knowledge is used. Using Prolog's primitives, facts and rules, we can also build a knowledge base. Such a knowledge base can be used for explainable context gathering and explainable fact validation, and the ProSLM paper explains how: the user's request is converted into a Prolog query (a goal), and backward chaining is used to find the facts that lead to that goal.
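A toy illustration of that idea, assuming a small hand-written knowledge base: the user's question becomes a goal, and the facts touched by the backward-chaining proof are exactly the "context" that can be shown back to the user.

```python
# Sketch: a tiny Prolog knowledge base queried via backward chaining.
# Requires SWI-Prolog and the pyswip package.
from pyswip import Prolog

prolog = Prolog()
# Facts.
prolog.assertz("parent(tom, bob)")
prolog.assertz("parent(bob, ann)")
# Rule: a grandparent is a parent of a parent.
prolog.assertz("grandparent(X, Z) :- parent(X, Y), parent(Y, Z)")

# "Who is Ann's grandparent?" becomes the goal grandparent(G, ann).
# Backward chaining unfolds it into parent(G, Y), parent(Y, ann) and
# finds the supporting facts parent(tom, bob) and parent(bob, ann).
print(list(prolog.query("grandparent(G, ann)")))  # -> G = tom
```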

A novel dataset was developed to measure the improvement from using Prolog. The Non-Linear Reasoning (NLR) dataset contains constraint problems, math word problems, and tasks that require following algorithmic instructions. On the math word problems, GPT-4 with CoT solved 12.5% of the problems, while GPT-4 with Prolog solved 100%. The new method also improved results on the widely used GSM8K dataset, where Prolog raised GPT-4's score by 3%.
References:
https://arxiv.org/abs/2407.11373 - Reliable Reasoning Beyond Natural Language
https://arxiv.org/abs/2409.11589 - ProSLM: A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering
https://arxiv.org/abs/2405.17893 - Arithmetic Reasoning with LLM: Prolog Generation & Permutation
https://arxiv.org/abs/2407.14562v1 - Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought
How does it compare to other symbolic reasoning approaches? Could the LLM just write SymPy (Python) code instead? What limitations did you notice when reading the papers? FYI, I haven't read the papers yet.
Can RAG help with this problem?