How to Make Large Language Models Run Faster

Strategies and technical analysis for improving the running speed of large language models
From hardware optimization to algorithmic innovation: speeding up LLM inference and training

With the development of Large Language Models (LLMs), expectations for what Natural Language Processing (NLP) can do have risen accordingly.

However, as the number of model parameters grows rapidly, the speed of training and inference has become a pressing problem. This article surveys techniques for accelerating large language models, from hardware utilization to software-level optimization strategies.


Why is naive inference so slow?

Before we can understand why naive inference becomes slow, we need to understand the basic workings of a large language model.

A typical autoregressive generation loop processes more tokens with each iteration, because every step appends a new token to the sequence and then runs the entire, ever-longer sequence through the model again. As the sequence grows, so does the time required to process it, and when the model has many parameters this token-by-token reprocessing becomes very inefficient.
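To make this concrete, here is a minimal sketch of such a naive greedy-decoding loop (the "gpt2" checkpoint and the Hugging Face transformers API are assumptions chosen for illustration). Every iteration re-runs the full, ever-growing sequence through the model, which is exactly what makes this approach slow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any causal LM checkpoint would do; "gpt2" is just a small example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def naive_generate(prompt: str, max_new_tokens: int = 32) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            # The whole sequence is re-processed on every step, so each
            # iteration costs more than the last one.
            logits = model(ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(naive_generate("Large language models are"))
```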


1. Hardware and Compiler

Hardware is one of the most important factors affecting inference speed.

Although modern GPUs and TPUs provide powerful parallel processing capability, model implementations often fail to take full advantage of it.

To make better use of hardware resources, tools such as torch.compile can be used to optimize model code, so that you can get performance improvements even without going deep into CUDA kernel-level programming.

Developers who are familiar with CUDA programming can write custom kernels to push performance even further.
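As a rough illustration, here is a minimal sketch of wrapping a model with torch.compile (assuming PyTorch 2.x; the "gpt2" checkpoint stands in for any model). The first call pays a one-time compilation cost, and subsequent calls reuse the optimized code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

# torch.compile traces the model and fuses operations where it can,
# without requiring hand-written CUDA kernels.
compiled_model = torch.compile(model)

inputs = tokenizer("Hello, world", return_tensors="pt").to(device)
with torch.no_grad():
    logits = compiled_model(**inputs).logits  # first call triggers compilation
print(logits.shape)
```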

2. Batch processing (Batching)

The traditional generation approach processes only one sequence at a time, which means each sequence requires its own forward passes.

Batching processes multiple sequences at the same time, generating the next part of each sequence's completion in a single forward pass. This not only reduces repeated loading of model weights but also makes full use of the hardware's parallel processing capability. To achieve this, sequences usually have to be padded to the same length, and the padded positions are masked out using special tokens (such as an end-of-sequence token) so that they do not affect the final result.
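A minimal sketch of batched generation with padding and an attention mask might look like the following (the checkpoint and prompts are placeholders; GPT-2 has no pad token, so the end-of-sequence token is reused for padding).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no dedicated pad token
tokenizer.padding_side = "left"             # left-pad decoder-only models for generation
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["The capital of France is", "Attention is"]
# padding=True pads every prompt to the length of the longest one; the returned
# attention_mask marks the padded positions so they are ignored by the model.
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=16,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```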

3. Continuous Batching

In standard batching, when one sequence finishes early, its slot in the batch stays occupied and throwaway tokens keep being generated for it until the entire batch is done. Continuous batching solves this problem by inserting new sequences into the batch as soon as a sequence completes, instead of generating useless tokens, which improves resource utilization.
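The scheduling idea can be sketched without any real model: keep the batch full by admitting a waiting request as soon as an active sequence finishes. The step_batch function below is only a stand-in for a real forward pass.

```python
import random
from collections import deque

MAX_BATCH = 4

def step_batch(active):
    """Placeholder for one forward pass: pretend some sequences finish."""
    return [seq for seq in active if random.random() < 0.2]

waiting = deque(f"request-{i}" for i in range(10))
active, completed = [], []

# Keep the batch full: whenever a sequence finishes, immediately replace it
# with a waiting request instead of padding its slot with useless tokens.
while active or waiting:
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    for seq in step_batch(active):
        active.remove(seq)
        completed.append(seq)

print(completed)
```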

4. Reduce model weight size

By using smaller data types to store model weights, storage overhead and computational costs can be effectively reduced.

For example, half-precision floating point (fp16) and the brain floating-point format (bfloat16) are two common options. fp16 strikes a balance between numerical range and precision, while bfloat16 keeps the numerical range of fp32 at the cost of some precision. For inference, either format is usually sufficient; the choice depends mainly on hardware support.
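A minimal sketch of loading weights in either format with Hugging Face transformers (the "gpt2" checkpoint is a placeholder, and fp16/bf16 hardware support is assumed):

```python
import torch
from transformers import AutoModelForCausalLM

# Loading in half precision roughly halves memory use compared with fp32 weights.
model_fp16 = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)

# bfloat16 keeps the exponent range of fp32 but uses fewer mantissa bits.
model_bf16 = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

print(next(model_fp16.parameters()).dtype)  # torch.float16
print(next(model_bf16.parameters()).dtype)  # torch.bfloat16
```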

5. Smaller data types

Beyond these two formats, weights can also be stored in data types smaller than fp16, for example 8-bit or 4-bit integers obtained through quantization. This introduces additional challenges, but it can bring significant performance gains in some scenarios.
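To illustrate the idea, here is a toy sketch of symmetric int8 quantization of a single weight tensor, storing one floating-point scale per tensor; real quantization schemes (per-channel scales, outlier handling, and so on) are considerably more involved.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map the largest absolute value to 127 and round the rest to int8."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                                      # a fake weight matrix
q, scale = quantize_int8(w)

print(w.element_size() * w.nelement() // 2**20, "MiB in fp32")   # 64 MiB
print(q.element_size() * q.nelement() // 2**20, "MiB in int8")   # 16 MiB
print((w - dequantize_int8(q, scale)).abs().max())               # quantization error
```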

6. KV cache and multi-query attention mechanism

KV caching speeds up the attention mechanism by storing the keys and values of already-processed tokens so that they are not recomputed at every step. Multi-query attention shares a single key/value head across all query heads, which shrinks the KV cache and further improves efficiency.

PagedAttention manages the KV cache in fixed-size blocks, much like pages of virtual memory, which reduces memory waste and makes it practical to serve many long sequences at once.
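A minimal sketch of KV-cached greedy decoding with transformers (placeholder "gpt2" checkpoint): after the first pass, only the newest token is fed to the model, and the cached keys and values cover everything that came before.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("Large language models are", return_tensors="pt").input_ids
generated = ids
past = None  # no cache yet; the first pass processes the full prompt

with torch.no_grad():
    for _ in range(32):
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past = out.past_key_values                               # reuse cached keys/values
        ids = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # feed only the new token
        generated = torch.cat([generated, ids], dim=-1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```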

7. Speculative Decoding

Speculative decoding tries to predict likely future tokens in advance with a cheap draft mechanism and then verify them with the large model, reducing the number of expensive decoding steps. Related variants include threshold decoding, staged speculative decoding, guided generation, and lookahead decoding. All of them aim to improve overall efficiency by cutting unnecessary computation.
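As one concrete form of this idea, recent versions of Hugging Face transformers expose assisted generation, where a small draft model proposes tokens and the larger target model verifies them. The sketch below assumes "gpt2-large" as the target and "distilgpt2" as the draft; the two must share a tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: a transformers version that supports the assistant_model argument.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()  # cheap draft model

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")
with torch.no_grad():
    out = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```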

8. Optimization during training

Besides inference-time optimizations, measures can also be taken during training to improve model efficiency, such as sparse attention mechanisms or exploring non-Transformer architectures. These approaches reduce the computational load and may also improve model quality.
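As a small illustration of the sparse-attention idea, the sketch below builds a sliding-window (local) attention mask in PyTorch; the window size is arbitrary, and a real model would plug such a mask into its attention layers so each token attends only to a fixed number of recent tokens.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: token i may attend to token j only if j <= i and i - j < window."""
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(0) - pos.unsqueeze(1)   # dist[i, j] = j - i
    return (dist <= 0) & (dist > -window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Attention cost under this mask grows linearly with sequence length
# (window * seq_len entries) instead of quadratically (seq_len * seq_len).
```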


Conclusion

As the discussion above shows, speeding up large language models is a systems problem that spans many layers: from hardware selection and compiler optimization to algorithm-level improvements, every step matters. As the technology advances, there is good reason to believe that even consumer-grade hardware will be able to run language models larger than today's GPT-4. I hope this article offers a useful reference for anyone who wants to understand and practice LLM acceleration in depth.