Three options for information extraction with LLMs
LLMs are not only good at information retrieval tasks; they also show impressive results on information extraction. The simplest way to extract data from any document is prompting: we send the document as part of the prompt and ask the LLM to return specific information. This works, but you still need to parse the free-text response from the LLM. OpenAI addressed this with JSON mode, which guarantees a syntactically valid JSON response, so all that is left is to parse a JSON document.
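For illustration, here is a minimal sketch of the prompting route using OpenAI's JSON mode; the model name and the fields named in the prompt are just examples:

```python
# A minimal sketch of prompt-based extraction with OpenAI's JSON mode.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

document = "Invoice #1042, issued 2024-03-01 to ACME Corp, total $1,250.00."

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},  # forces syntactically valid JSON
    messages=[
        {
            "role": "system",
            "content": "Extract invoice_number, date, customer and total "
                       "from the user's document. Respond in JSON.",
        },
        {"role": "user", "content": document},
    ],
)

data = json.loads(response.choices[0].message.content)
print(data)
```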
If prompt magic is not your cup of tea, you can use function calls. Function calling (also known as tools) was originally developed to connect application logic with the reasoning abilities of the LLM. We describe the application's external functionality to the model as an interface definition passed alongside the prompt: "If you find an email address in a document, please call the function email_found(email: string)".
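Here is how that email_found example might look as an OpenAI tool definition; the document text and model name are illustrative:

```python
# A sketch of function calling: declare email_found as a tool and let
# the model decide when to call it.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "email_found",
            "description": "Report an email address found in the document.",
            "parameters": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "The email address."},
                },
                "required": ["email"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Contact support at help@example.com."}],
    tools=tools,
)

# The model replies with tool calls instead of free text; the arguments
# arrive as a JSON string matching the declared schema.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```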
There are several implementations, such as Fructose, Instructor, or the extraction tooling in LangChain. All of them simplify function calling, even for nested structures. However, this only works when the schema is known in advance, which is not always the case.
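As a taste of how these libraries feel, here is a hedged sketch with Instructor, which derives the function-calling schema from a Pydantic model (the entry point has changed between versions, so check the project README):

```python
# Structured extraction with Instructor: the Pydantic model doubles as
# the schema, including nested structures.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class Person(BaseModel):
    name: str
    email: str


class Contacts(BaseModel):
    people: list[Person]  # nested structure: a list of Person objects


client = instructor.from_openai(OpenAI())

contacts = client.chat.completions.create(
    model="gpt-4-turbo",
    response_model=Contacts,  # Instructor parses and validates the reply
    messages=[
        {
            "role": "user",
            "content": "Ann (ann@example.com) and Bob (bob@example.com) attended.",
        }
    ],
)
print(contacts.people)
```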
When you do information extraction at scale, you might not have a schema for every possible document type. This is where TnT-LLM comes in handy. TnT-LLM has two phases: taxonomy generation (the schema) and LLM-augmented text classification. In the first phase, we summarize all documents and use the summaries to generate a taxonomy; the taxonomy generation algorithm runs in initialization, update, and review steps. In the paper, the second phase handles classification: the taxonomy from the first phase is used to label a dataset, which is then used to train an efficient classification model.
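A rough sketch of the first phase's control flow, with hypothetical callables standing in for the paper's prompts (summarize, propose, update, and review are placeholders, not real library functions):

```python
from typing import Callable

def generate_taxonomy(
    documents: list[str],
    summarize: Callable[[str], str],
    propose: Callable[[list[str]], str],
    update: Callable[[str, list[str]], str],
    review: Callable[[str], str],
    batch_size: int = 200,
) -> str:
    """Phase 1 of TnT-LLM as described above: initialization, update and
    review steps over batches of LLM-generated summaries. Phase 2 would
    use the resulting taxonomy to label a sample of documents and train
    a lightweight classifier."""
    summaries = [summarize(d) for d in documents]
    batches = [summaries[i:i + batch_size]
               for i in range(0, len(summaries), batch_size)]
    taxonomy = propose(batches[0])          # initialization step
    for batch in batches[1:]:
        taxonomy = update(taxonomy, batch)  # update step
    return review(taxonomy)                 # review step
```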
Model development didn't stop with the arrival of LLMs. GLiNER (Generalist Model for Named Entity Recognition using Bidirectional Transformer) is a good example: the paper reports that it outperforms both ChatGPT and fine-tuned LLMs on zero-shot NER benchmarks.
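Zero-shot NER with the gliner package looks roughly like this (the model id and API follow the project README; verify against the repo):

```python
# GLiNER matches arbitrary label strings against spans in the text --
# no retraining needed for new entity types.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

text = "Ada Lovelace wrote the first program for the Analytical Engine in London."
labels = ["person", "location", "machine"]

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```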
LLM-based information extraction methods sometimes require new evaluation metrics. Earlier Optical Character Recognition methods drew bounding boxes on the image, and quality was measured with an F1 score over those boxes. Multimodal LLMs don't produce bounding boxes; they return the extracted text directly. This is one of the motivations behind the ANLS* metric.
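To give the intuition, here is a sketch of the classic string-level ANLS that ANLS* generalizes; the paper extends this idea to lists and nested dictionaries:

```python
# Classic ANLS: similarity = 1 - normalized edit distance, zeroed out
# below a 0.5 threshold so near-misses count but wrong answers don't.

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, truth: str, threshold: float = 0.5) -> float:
    if not prediction and not truth:
        return 1.0
    nld = edit_distance(prediction.lower(), truth.lower()) / max(len(prediction), len(truth))
    similarity = 1.0 - nld
    return similarity if similarity >= threshold else 0.0

print(anls("ACME Corp", "ACME Corp."))  # near-match scores 0.9
```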
References:
https://towardsai.net/p/machine-learning/demystifying-information-extraction-using-llm - Demystifying Information Extraction using LLM
https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/document-intelligence-preview-adds-more-prebuilts-support-for/ba-p/4084608 - Document Intelligence preview adds more prebuilts, support for image and figures, and more!
https://arxiv.org/abs/2402.14652 - Cleaner Pretraining Corpus Curation with Neural Web Scraping
https://arxiv.org/abs/2311.08526 - GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
https://arxiv.org/abs/2403.09029 - Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
https://arxiv.org/abs/2403.12173 - TnT-LLM: Text Mining at Scale with Large Language Models
https://readmedium.com/claude-3-the-king-of-data-extraction-f06ad161aabf - Claude 3: The king of data extraction
https://towardsdatascience.com/the-definitive-guide-to-structured-data-parsing-with-openai-gpt3-5-0e5ea0e52637 - The Definitive Guide to Structured Data Parsing with OpenAI GPT3.5
https://arxiv.org/abs/2402.03848 - ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
https://github.com/bananaml/fructose - LLM calls as strongly-typed functions
https://github.com/jxnl/instructor - Structured outputs for LLMs
https://python.langchain.com/docs/use_cases/extraction/ - Extracting structured output