Reducing the cost of software development with Code LLMs
What LLMs do especially well is generate content in response to a user's request. In software engineering, this capability can be used for code generation and decision-making.
Just this week StarCoder2 was released, an open-source LLM trained specifically for code generation. Compared to the first version, four times more data was used to train this model. The paper describes how preprocessing was done: deduplication, redaction of personal information, malware removal, etc. It also gives an overview of sentinel tokens such as <pr_comment>, which mark specific elements in the training dataset. The result is a model that shows state-of-the-art results across 600+ programming languages. OpenCodeInterpreter-SC2-3B is a fine-tuned version of StarCoder2 that lets us write prompts and receive meaningful results; a dataset of 68K interactions was used for the fine-tuning. By the way, StarCoder2 provides a tool to check whether your code was used in the dataset: https://huggingface.co/spaces/bigcode/in-the-stack.
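Sentinel tokens are easiest to understand with fill-in-the-middle (FIM) prompting, where the model completes code between a known prefix and suffix. Here is a minimal sketch assuming the FIM token names used by the StarCoder family (<fim_prefix>, <fim_suffix>, <fim_middle>); check the StarCoder2 tokenizer config for the exact names before relying on them.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the model is expected to
    generate the code that belongs between prefix and suffix."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The model would be asked to fill in the body of the function.
prompt = build_fim_prompt(
    prefix="def fibonacci(n):\n    ",
    suffix="\n    return a\n",
)
```

The same idea extends to the other sentinel tokens from the paper: they let the training data encode structure (PR comments, file boundaries, etc.) that plain text would lose.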
The developer's role in code generation is to review the results and accept or reject the changes. The workflow can be almost fully automated if we shift the focus to unit test generation: the LLM writes tests, the compiler makes sure the new code is valid, the test runner checks that the new tests pass, and code coverage shows whether these new test cases add value. This is what Meta's TestGen-LLM tool does. "In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage." - from the "Automated Unit Test Improvement using Large Language Models at Meta" paper.
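The filtering chain above can be sketched as a simple pipeline. This is a hypothetical illustration of the idea, not Meta's implementation: every helper predicate here is a stand-in for a real build, test-run, and coverage check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateTest:
    """An LLM-generated test case awaiting automated review."""
    name: str
    source: str


def filter_candidates(
    candidates: List[CandidateTest],
    builds: Callable[[CandidateTest], bool],        # compiler gate
    passes: Callable[[CandidateTest], bool],        # test-runner gate
    adds_coverage: Callable[[CandidateTest], bool], # coverage gate
) -> List[CandidateTest]:
    """Keep only generated tests that survive every automated gate,
    mirroring the build / pass / coverage funnel described above."""
    return [
        test for test in candidates
        if builds(test) and passes(test) and adds_coverage(test)
    ]
```

Because each gate is an objective, machine-checkable signal, a human only ever reviews tests that already compile, pass, and add coverage.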
The next area is decision-making. Engineers consume tons of information to make better decisions during the development and operation of software products. Amazon proposed "Panda: Performance Debugging for Databases using LLM Agents". The idea is to help developers solve database issues faster. Amazon proposes a RAG-based architecture with verification and impact estimation. During production incidents, when there is pressure to fix problems quickly, verification and impact assessment can be very handy. Besides these two blocks, the paper gives an interesting overview of how to use telemetry to ground the LLM.
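The overall flow can be sketched like this. This is a hypothetical outline of a Panda-style pipeline under my own naming, not the paper's actual API: retrieve grounding context, draft a recommendation, then verify it and estimate its impact before showing it to the on-call engineer.

```python
from typing import Callable, Optional


def troubleshoot(
    incident: str,
    retrieve: Callable[[str], str],             # telemetry + docs (grounding)
    generate: Callable[[str, str], str],        # LLM drafts a recommendation
    verify: Callable[[str, str], bool],         # is it supported by context?
    estimate_impact: Callable[[str, str], str], # what would applying it cost?
) -> Optional[dict]:
    """Return a verified recommendation with an impact estimate,
    or None if the draft cannot be backed by the retrieved context."""
    context = retrieve(incident)
    recommendation = generate(incident, context)
    if not verify(recommendation, context):
        return None  # drop unsupported advice instead of guessing
    return {
        "recommendation": recommendation,
        "impact": estimate_impact(recommendation, context),
    }
```

The verification and impact-estimation steps are what make this usable under incident pressure: an engineer sees only advice that is grounded in the system's own telemetry, together with an estimate of what applying it would do.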
These models and tools reduce the cost of development by improving Lead Time for Changes, Change Failure Rate, and Time to Restore Service - three of the four DORA metrics.
Resources:
https://arxiv.org/abs/2402.19173 - StarCoder 2 and The Stack v2: The Next Generation
https://arxiv.org/abs/2402.01391 - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
https://arxiv.org/abs/2401.08500 - Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
https://www.amazon.science/publications/panda-performance-debugging-for-databases-using-llm-agents - Panda: Performance Debugging for Databases using LLM Agents
https://arxiv.org/abs/2402.09171 - Automated Unit Test Improvement using Large Language Models at Meta
YouTube - BigCode: Building Large Language Models for Code
https://arxiv.org/abs/2402.14658 - OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
https://huggingface.co/m-a-p/OpenCodeInterpreter-SC2-3B - OpenCodeInterpreter-SC2-3B
https://github.com/huggingface/llm-vscode - LLM powered development for VSCode
https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/technologys-generational-moment-with-generative-ai-a-cio-and-cto-guide - Technology’s generational moment with generative AI: A CIO and CTO guide