RNNs vs. Transformers, or how scalability made Generative AI possible
LLMs are built on top of the Transformer architecture, but before Transformers the leading architectures for building NLP apps were Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks.
The idea behind an RNN is straightforward: to predict the next word in a sentence, we need to know the previous words. If we feed the RNN the sentence "London is the capital …", it should continue with "… of the United Kingdom". A few specifics of RNN implementations (a minimal sketch follows this list):
The model defines the position - if you look at the picture, there is a specific block for each element of the input vector X. So, in the example above, "London" will always be fed to the first block. In other words, the importance of the words in the sentence is defined by the model architecture, not by the meaning of the sentence.
Hidden state - an RNN predicts the next word based on the previous words, which in code means we have to store state. That also means we have to fix the size of this hidden state, and once the hidden state has a fixed size, we are compressing the history, and compression means losing information.
Parallel computation is impossible - the gates in each block take the previous block's hidden state as input, so we can't compute the whole output sequence in parallel and we don't leverage the parallel-computing power of modern GPUs.
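To make the last two points concrete, here is a minimal sketch of a vanilla RNN step in NumPy (toy sizes and random weights are my own assumptions, and the LSTM/GRU gating is left out): the whole history is squeezed into a fixed-size hidden state, and each step has to wait for the previous one.

```python
import numpy as np

hidden_size, vocab_size = 8, 16                      # toy sizes, chosen arbitrarily
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, vocab_size))    # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))   # hidden -> hidden (the recurrence)
W_hy = rng.normal(size=(vocab_size, hidden_size))    # hidden -> next-token logits

def rnn_forward(token_ids):
    """Process the sequence one token at a time; step t depends on step t-1."""
    h = np.zeros(hidden_size)                        # fixed-size memory = compression
    for t in token_ids:                              # inherently sequential loop
        x = np.zeros(vocab_size)
        x[t] = 1.0                                   # one-hot encoding of the token
        h = np.tanh(W_xh @ x + W_hh @ h)             # new state depends on the old state
    return W_hy @ h                                  # logits for the next token

# "London is the capital ..." as made-up token ids
logits = rnn_forward([3, 7, 1, 5])
print(logits.argmax())                               # index of the predicted next token
```

The for-loop cannot be vectorized away: the state at step t cannot be computed before the state at step t-1, which is exactly the parallelism bottleneck described above.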
These three points, among others, make RNNs very hard to scale.
That was the case until the "Attention Is All You Need" paper introduced the Transformer architecture:
Attention mechanism - in contrast to the predefined positions of the input sequence in RNNs, the attention mechanism allows the model to learn the connections between words in the sentence. In the example above, "London is the capital of the United Kingdom", it is now the model that decides which words are important (see the sketch after this list).
No hidden state - and therefore no compression. The Decoder (the original Transformer consists of two parts, an Encoder and a Decoder) dynamically focuses on different words of the sentence at different steps in the pipeline.
GPU parallelism - as a result of the previous two points, the Transformer parallelizes well on modern GPUs, where "well" means a Model FLOPs Utilization (MFU) of around 50%.
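For contrast with the RNN loop above, here is a minimal sketch of scaled dot-product attention, the core operation of the Transformer (again with toy sizes and random weights that are my own assumptions): every position attends to every allowed position in one batched matrix multiplication, which is exactly the kind of workload GPUs parallelize well.

```python
import numpy as np

def attention(Q, K, V, causal=True):
    """Scaled dot-product attention over a whole sequence at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq) pairwise affinities
    if causal:                                       # a decoder is not allowed to look ahead
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V                               # each output mixes all allowed positions

seq_len, d_model = 8, 16                             # 8 toy tokens: "London is the capital of the United Kingdom"
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))              # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                     # (8, 16): all positions computed in parallel
```

There is no loop over time steps and no carried state: the whole sequence is processed as a pair of matrix products, and the attention weights decide which words matter for each position.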
The next step in the development of LLMs is the Mixture of Experts (MoE), but that's a bit of a different story.
Resources:
https://d2l.ai/ - Dive into Deep Learning
https://deeprevision.github.io/posts/001-transformer/ - The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture
https://blog.openthreatresearch.com/demystifying-generative-ai-a-security-researchers-notes/ - Demystifying Generative AI 🤖 A Security Researcher's Notes
https://arxiv.org/abs/1706.03762 - Attention Is All You Need