Generating high-quality content with LLMs
LLMs were born to be writing assistants, yet creating long-form content such as Wikipedia-style articles or science-fiction books is still a challenge for modern LLMs.
The simplest approach to writing an article with an LLM is to ask it to generate the whole article from the knowledge embedded in the model during training: we prepare a single prompt and submit it. The next step up is pre-writing research, such as retrieving external content to ground the LLM on the specific topic. Even better results can be achieved by preparing an outline ("a list of multi-level sections") and collecting a set of reference documents before writing the content itself. Many SaaS apps available today include an outline review as one of the preparation steps before content writing. The better the outline, the better the content we will receive.
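As a rough illustration, here is how "one big prompt" differs from "outline first, then write section by section". The `generate()` function is a hypothetical stand-in for any LLM client, and the prompt wording is my own assumption, not taken from any of the papers referenced below.

```python
def generate(prompt: str) -> str:
    # Stub: replace with a call to your LLM client of choice.
    return f"(LLM output for: {prompt[:60]}...)"

def write_article_naive(topic: str) -> str:
    # Simplest approach: one prompt, whole article from parametric knowledge only.
    return generate(f"Write a detailed article about: {topic}")

def write_article_with_outline(topic: str, references: list[str]) -> str:
    # Ground the model on retrieved references and draft an outline first.
    context = "\n\n".join(references)
    outline = generate(
        f"Using these reference documents:\n{context}\n\n"
        f"Draft a multi-level outline (sections and subsections) for an article about: {topic}"
    )
    # A human outline review could happen here before writing each section.
    sections = [line.strip() for line in outline.splitlines() if line.strip()]
    drafts = [
        generate(
            f"Write the section '{section}' of an article about {topic}, "
            f"grounded in these references:\n{context}"
        )
        for section in sections
    ]
    return "\n\n".join(drafts)
```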
Creating an outline is the focus of the STORM paper, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. Two main hypotheses underlie STORM: "diverse perspectives lead to varied questions" and "formulating in-depth questions requires iterative research". First, STORM finds content related to the topic. The next step is to identify the perspectives required to cover it. Using these perspectives, STORM starts conversations; the topic is refined during each conversation, and further research is conducted as new questions arise.
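The pre-writing loop can be sketched roughly like this. This is my reading of the paper, not the authors' code: `search()` and `generate()` are hypothetical retrieval and LLM stubs, and the prompts are assumptions.

```python
def search(query: str) -> str:
    return f"(retrieved sources for: {query})"  # stub: plug in a real retriever

def generate(prompt: str) -> str:
    return f"(LLM output for: {prompt[:60]}...)"  # stub: plug in a real LLM client

def storm_outline(topic: str, n_perspectives: int = 3, n_turns: int = 4) -> str:
    seed_docs = search(topic)  # initial research on the topic
    perspectives = generate(
        f"Given these documents about '{topic}':\n{seed_docs}\n"
        f"List {n_perspectives} distinct perspectives a writer should cover."
    ).splitlines()

    transcripts = []
    for perspective in perspectives:
        history: list[tuple[str, str]] = []
        for _ in range(n_turns):  # iterative research: each answer can trigger new questions
            question = generate(
                f"You are a writer with the perspective '{perspective}' on '{topic}'. "
                f"Conversation so far: {history}. Ask one in-depth question."
            )
            answer = generate(
                f"Answer using only these sources:\n{search(question)}\n\nQuestion: {question}"
            )
            history.append((question, answer))
        transcripts.append(history)

    return generate(
        f"Topic: {topic}. Using these question-answer transcripts:\n{transcripts}\n"
        "Produce a multi-level outline for a Wikipedia-like article."
    )
```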
Once the article is created with STORM, the next problem arises: can the target audience understand it? Writing content for professionals in the field is fine, but if we target a general audience, reducing the professionalism of the language might be crucial. "The professionalism of a sentence is gauged by analyzing domain-specific terminology and logical structure, necessitating the quantification of terms and reasoning sequences". The ProSwitch paper solves this problem by fine-tuning an LLM to switch between professional and non-professional styles "without compromising its text generation skills". The paper shows how to gather and prepare data, perform instruction tuning, and evaluate the result.
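To make the idea concrete, instruction-tuning pairs for style switching could look roughly like the records below. The field names and wording are my assumptions for illustration, not the exact ProSwitch data schema.

```python
# Hypothetical instruction-tuning records for professional <-> non-professional switching.
style_switch_examples = [
    {
        "instruction": "Rewrite the text in a professional style for domain experts.",
        "input": "Sometimes the heart beats out of rhythm; doctors call this an arrhythmia.",
        "output": "Cardiac arrhythmia refers to an irregularity in the heart's electrical conduction.",
    },
    {
        "instruction": "Rewrite the text in a plain style for a general audience.",
        "input": "Cardiac arrhythmia refers to an irregularity in the heart's electrical conduction.",
        "output": "Sometimes the heart beats out of rhythm; doctors call this an arrhythmia.",
    },
]
```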
Hugging Face shared their approach, Cosmopedia, with the world. They replicated Microsoft's training approach for the Phi-1.5 model. The proposed approach differs from that of the STORM paper: they used outlines of publicly available courses as seeds to generate synthetic textbooks, which were then used to train the Cosmo-1B model.
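A seed-based prompt in the spirit of Cosmopedia might look like the sketch below; the wording and the audience list are assumptions, not the released Cosmopedia prompt templates.

```python
AUDIENCES = ["young children", "high school students", "college students", "professionals"]

def textbook_prompt(seed_outline: str, audience: str) -> str:
    # Turn a course-outline seed into a prompt for a synthetic textbook chapter.
    return (
        f"Here is an extract from a publicly available course outline:\n{seed_outline}\n\n"
        f"Write a detailed, self-contained textbook chapter based on this outline, "
        f"targeted at {audience}."
    )
```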
However, the Cosmopedia dataset is meant for training models, not for human readers. Humans have other requirements for content: we like well-written stories. There is a paper on storytelling with LLMs, SWAG: Storytelling With Action Guidance. To produce engaging content, SWAG uses one LLM to choose the next action for the story (possible actions include "add humour", "add suspense", etc.) and another LLM to generate the next part of the story; the first LLM is fine-tuned for action selection. Also worth mentioning is "Weaver: Foundation Models for Creative Writing". Weaver is a family of LLMs built specifically to generate human-like creative content.
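The SWAG loop can be sketched as two alternating calls, one model picking an action and the other writing the continuation. The stubs below are hypothetical stand-ins; in the paper the action-selection model is a fine-tuned LLM.

```python
ACTIONS = ["add humour", "add suspense", "introduce a new character", "resolve the conflict"]

def choose_action(prompt: str) -> str:
    return ACTIONS[0]  # stub: replace with the fine-tuned action-selection LLM

def continue_story(prompt: str) -> str:
    return f"(next passage for: {prompt[:60]}...)"  # stub: replace with the generator LLM

def swag_story(premise: str, n_steps: int = 5) -> str:
    story = premise
    for _ in range(n_steps):
        action = choose_action(
            f"Story so far:\n{story}\n\nChoose the best next action from: {ACTIONS}"
        )
        story += "\n\n" + continue_story(
            f"Story so far:\n{story}\n\nContinue the story, applying this guidance: {action}"
        )
    return story
```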
References:
https://arxiv.org/abs/2402.14207v1 - Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
https://huggingface.co/blog/cosmopedia - Cosmopedia: how to create large-scale synthetic data for pre-training
https://arxiv.org/abs/2403.09131 - ProSwitch: Knowledge-Guided Language Model Fine-Tuning to Generate Professional and Non-Professional Styled Text
https://eugeneyan.com/writing/synthetic/ - How to Generate and Use Synthetic Data for Finetuning
https://arxiv.org/abs/2402.03483 - SWAG: Storytelling With Action Guidance
https://arxiv.org/abs/2401.17268 - Weaver: Foundation Models for Creative Writing
https://github.com/mshumer/ai-journalist - Claude-Journalist | An experimental AI-powered journalist