Trust is a key factor in the adoption of any new technology. In the LLM world, trust means factually and logically correct responses. In the real world, fact-checking is done by human experts, but that is costly and time-consuming. Fortunately, there are several ways to fact-check LLM responses at scale.
The simplest option is to use an LLM itself for fact-checking. The idea rests on the fact that LLMs are trained on vast amounts of data and demonstrate expert-level knowledge in many areas. We only need to make sure that we consider not just text but other modalities, too.
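As a minimal sketch of this idea, an LLM can be asked directly to judge a claim. The model name, the prompt wording, and the fact_check helper below are illustrative assumptions, not taken from any of the papers discussed later.

from openai import OpenAI

client = OpenAI()

def fact_check(claim: str) -> str:
    # Ask a general-purpose LLM to act as a fact-checker for a single claim.
    prompt = (
        "You are a careful fact-checker. Label the claim as 'supported', "
        "'refuted' or 'not enough information' and give a short explanation.\n\n"
        f"Claim: {claim}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(fact_check("The Eiffel Tower is located in Berlin."))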
The 'LRQ-Fact' paper proposes a fact-checking method for the text and image modalities. The process consists of four steps: image description, QA generation for the image, QA generation for the text, and a rule-based checker. A specialized vision-language model such as PaliGemma can be used to describe the image. The verdict is produced by the rule-based checker, which accepts as input the article to check, the image description, the generated QAs, and instructions and guidelines (as part of a prompt). Here, the rule-based checker plays the role of an LLM-as-a-judge and returns the final judgment, real or fake, with an explanation.
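A rough sketch of this pipeline is shown below. The helpers vlm_generate and llm_generate stand in for calls to a vision-language model and a chat LLM; they, and the prompts, are illustrative assumptions rather than the paper's actual implementation.

def describe_image(image) -> str:
    # Step 1: a vision-language model (e.g. PaliGemma) describes the image
    return vlm_generate(image, "Describe this image in detail.")

def generate_qas(article: str, image_description: str) -> str:
    # Steps 2-3: LLM-generated relevant questions and answers about the image and the text
    return llm_generate(
        "Write question-answer pairs that probe the factual claims in the "
        f"article and the image description.\n\nArticle: {article}\n\n"
        f"Image description: {image_description}"
    )

def rule_based_checker(article: str, image_description: str, qas: str, guidelines: str) -> str:
    # Step 4: LLM-as-a-judge returns 'real' or 'fake' plus an explanation
    prompt = (
        f"{guidelines}\n\nArticle: {article}\n\nImage description: "
        f"{image_description}\n\nQA pairs: {qas}\n\nVerdict (real or fake) and why:"
    )
    return llm_generate(prompt)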
As a side note, QA generation is a fundamental step in LLM app development and is used for many tasks. For instance, UI agents create screen descriptions and QA pairs for navigation, and RAG systems use QA generation to improve quality.
The previous method looked at a whole article; instead, we can focus on a single question-answer pair. The 'Provenance' paper does this, and it uses cross-encoder models, which are faster, more accessible, and easier to interpret. The factuality score is produced by a hallucination detection model from Vectara, and the final score is compared against a threshold to make the decision.
Because Provenance is built for RAG systems, it first computes a score for the query against each piece of relevant context retrieved by the RAG pipeline. The cross-encoder used here is a RoBERTa-based model. The idea of this step is to find the most relevant and focused context (called a source in the paper). A second cross-encoder then computes a score that indicates how well the answer is supported by each source. Aggregating these scores produces the final score.
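A rough sketch of this two-stage scoring is given below. The model chosen for the first cross-encoder, the way the Vectara hallucination model is loaded, the aggregation, and the threshold are all assumptions for illustration; the paper's exact models and formula differ in detail.

from sentence_transformers import CrossEncoder

# Stage 1 model: a stand-in for the paper's RoBERTa-based relevance cross-encoder
relevance_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Stage 2 model: Vectara's hallucination detection model (assumption: it loads as a
# cross-encoder; newer releases may require a different loader)
support_model = CrossEncoder("vectara/hallucination_evaluation_model")

def provenance_score(query: str, answer: str, contexts: list[str], threshold: float = 0.5):
    # Score how relevant each retrieved context ("source") is to the query
    relevance = relevance_model.predict([(query, ctx) for ctx in contexts])
    # Score how well the answer is supported by each source
    support = support_model.predict([(ctx, answer) for ctx in contexts])
    # Simplified aggregation: take the support score of the most relevant source
    # (the paper aggregates over all sources)
    best = max(range(len(contexts)), key=lambda i: relevance[i])
    score = float(support[best])
    return score, score >= threshold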
Prolog is an ideal programming language for logical inference, so it can be used to improve reasoning in LLMs and provide logically correct answers. The 'FLARE' paper proposes the following algorithm to improve chain-of-thought (CoT) reasoning. The first step is to extend the user's question: add context in the form of explanations, analysis, and a draft plan of the answer. The idea is to generate enough facts and relations for Prolog. The second step is to convert the draft plan into Prolog code.
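To make these two steps concrete, here is a sketch with plausible prompt templates; the llm helper and the prompt wording are assumptions, and the Prolog fragment in the comment is only a toy illustration of the target format.

def extend_question(question: str) -> str:
    # Step 1: enrich the user's question with explanations, analysis and a draft plan
    return llm(
        "Explain the question, analyse what is needed to answer it, and write "
        f"a step-by-step draft plan of the answer.\n\nQuestion: {question}"
    )

def plan_to_prolog(draft_plan: str) -> str:
    # Step 2: formalize the draft plan as Prolog facts, relations and a goal
    return llm(
        "Convert this plan into Prolog: list the facts, define the rules "
        f"(relations), and state the goal to prove.\n\nPlan: {draft_plan}"
    )

# Toy example of the kind of Prolog step 2 aims for:
#   parent(alice, bob).
#   parent(bob, carol).
#   grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
#   ?- grandparent(alice, carol).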
The trick of the paper is not to execute this code. Prolog is used here as a structured, declarative way to formalize reasoning tasks; it serves as an intermediate language that helps the LLM reason and defines the problem space. The actual reasoning algorithm uses the formalized problem space to run a simulated search, which produces traces that are then used to score faithfulness. The pseudocode below sketches the whole flow:
def FLARE_Process(query, knowledge_base):
    # Step 1: Generate a plan from the query
    plan = generate_plan(query)
    print("Generated Plan:", plan)

    # Step 2: Formalize the query into a logic representation (facts and relations)
    logic_code = generate_logic_code(plan, knowledge_base)
    print("Logic Code:", logic_code)

    # Step 3: Simulate the search process for reasoning
    result, trace = simulate_search(logic_code)
    print("Search Trace:", trace)

    # Step 4: Measure faithfulness of reasoning
    faithfulness_score = compare_trace_with_code(trace, logic_code)
    print("Faithfulness Score:", faithfulness_score)

    # Step 5: Generate the final answer based on search results
    answer = generate_final_answer(result, trace)
    print("Final Answer:", answer)
    return answer

def generate_plan(query):
    # Analyze and decompose the query into reasoning steps
    plan = analyze_query(query)
    plan.steps = define_reasoning_steps(query)
    return plan

def generate_logic_code(plan, knowledge_base):
    # Extract facts, relations and a goal that together define the problem space
    facts = extract_facts(knowledge_base, plan)
    relations = define_relations(plan, facts)
    goal = define_goal(plan)
    return {"facts": facts, "relations": relations, "goal": goal}

def simulate_search(logic_code):
    trace = []  # Keep track of reasoning steps
    result = resolve_goal(logic_code["goal"], logic_code["facts"],
                          logic_code["relations"], trace)
    return result, trace

def resolve_goal(goal, facts, relations, trace):
    # The goal is directly supported by a known fact
    if goal in facts:
        trace.append(goal)
        return True
    # Otherwise, try every rule whose head matches the goal (Prolog-style resolution)
    for relation in relations:
        if relation.matches(goal):
            subgoals = relation.get_subgoals(goal)
            if all(resolve_goal(subgoal, facts, relations, trace) for subgoal in subgoals):
                trace.append(goal)
                return True
            # This rule failed: record the backtrack and try the next matching rule
            trace.append("Backtracking...")
    return False

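To see what the simulated search produces, here is a toy run against the resolve_goal sketch above; the Relation class and the tiny knowledge base are hypothetical and only illustrate the shape of the trace.

from dataclasses import dataclass

@dataclass
class Relation:
    head: str
    body: list  # subgoals that must all hold for the head to hold

    def matches(self, goal):
        return goal == self.head

    def get_subgoals(self, goal):
        return self.body

facts = {"parent(alice, bob)", "parent(bob, carol)"}
relations = [
    Relation("grandparent(alice, carol)", ["parent(alice, bob)", "parent(bob, carol)"])
]

trace = []
proved = resolve_goal("grandparent(alice, carol)", facts, relations, trace)
print(proved)  # True
print(trace)   # ['parent(alice, bob)', 'parent(bob, carol)', 'grandparent(alice, carol)']
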
References:
https://arxiv.org/abs/2410.04616 - LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking
https://arxiv.org/abs/2410.11900 - FLARE: Faithful Logic-Aided Reasoning and Exploration
https://arxiv.org/abs/2411.01022 - Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output