
Bite-sized brilliance in every update

How Microsoft’s next-generation BitNet architecture turbocharges LLM efficiency


One-bit large language models (LLMs) have emerged as a promising approach to make generative AI more affordable and accessible. By representing model weights with a very limited number of bits, 1-bit LLMs dramatically reduce the memory and computational resources required to run them.

Microsoft Research has pushed the limits of 1-bit LLMs with its BitNet architecture. In a new paper, researchers introduce BitNet a4.8, a technique that further improves the efficiency of 1-bit LLMs without sacrificing their performance.

1-bit LLMs

Traditional LLMs use 16-bit floating-point numbers (FP16) to represent their parameters. This requires a lot of memory and compute, which limits the accessibility and deployment options for LLMs. One-bit LLMs address this challenge by drastically reducing the precision of model weights while matching the performance of full-precision models.
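To put rough numbers on that (using a hypothetical 7-billion-parameter model): at 16 bits per weight, the weights alone take about 7B × 2 bytes ≈ 14 GB, whereas ternary weights packed at roughly 1.58 bits per parameter need only about 7B × 1.58 / 8 ≈ 1.4 GB.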

Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. This approach significantly reduced memory and I/O costs, but the computational cost of matrix multiplication remained a bottleneck, and optimizing neural networks with extremely low-bit parameters is a challenge.
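As a rough illustration of how ternary weights can be derived from full-precision ones, the minimal sketch below scales a weight matrix by its mean absolute value and rounds each entry to -1, 0, or 1. The scaling choice is illustrative, not necessarily BitNet's exact recipe.

```python
import torch

def ternarize_weights(w: torch.Tensor):
    # Scale by the mean absolute value, then round each weight to -1, 0, or 1.
    # Absmean scaling is an illustrative choice, not necessarily BitNet's exact recipe.
    scale = w.abs().mean().clamp(min=1e-5)
    w_ternary = (w / scale).round().clamp(-1, 1)
    return w_ternary, scale

w = torch.randn(4096, 4096)      # a full-precision weight matrix
w_q, s = ternarize_weights(w)
print(torch.unique(w_q))         # tensor([-1., 0., 1.])
```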

Two techniques help solve this problem. Sparsification reduces the number of computations by pruning activations with smaller magnitudes. This is particularly useful in LLMs because activation values tend to follow a long-tailed distribution, with a few very large values and many small ones.
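A minimal sketch of this idea, keeping only the largest-magnitude activations in each row and zeroing the rest (the keep ratio here is illustrative):

```python
import torch

def sparsify_activations(x: torch.Tensor, keep_ratio: float = 0.55):
    # Keep only the largest-magnitude activations in each row; zero out the rest.
    k = max(1, int(x.shape[-1] * keep_ratio))
    _, idx = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, idx, True)
    return x * mask

x = torch.randn(2, 4096)            # a batch of activation vectors
x_sparse = sparsify_activations(x)  # ~45% of entries are now exactly zero
```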

Quantization, on the other hand, uses a smaller number of bits to represent activations, reducing the computational and memory cost of processing them. However, simply lowering the precision of activations can lead to significant quantization errors and performance degradation.
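For intuition, here is a minimal sketch of symmetric absmax quantization of activations to the signed 4-bit range; the scaling scheme is an illustrative assumption, not the paper's exact method.

```python
import torch

def quantize_activations_int4(x: torch.Tensor):
    # Symmetric absmax quantization to the signed 4-bit range [-8, 7] (illustrative).
    scale = x.abs().max().clamp(min=1e-5) / 7.0
    x_q = (x / scale).round().clamp(-8, 7)
    return x_q, scale

x = torch.randn(2, 4096)
x_q, s = quantize_activations_int4(x)
x_dequant = x_q * s   # approximate reconstruction; the gap is the quantization error
```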

Furthermore, combining sparsification and quantization is challenging and presents special problems when training 1-bit LLMs.

“Both quantization and sparsification introduce non-differentiable operations, which makes computing the gradient during training particularly challenging,” Furu Wei, Partner Research Manager at Microsoft Research, told VentureBeat.

Gradient computation is essential for calculating errors and updating parameters when training neural networks. The researchers also had to ensure that their techniques could be efficiently implemented on existing hardware while maintaining the benefits of both sparsification and quantization.
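A common workaround for non-differentiable operations like rounding is the straight-through estimator, which quantizes in the forward pass but lets gradients flow through as if the operation were the identity. The sketch below is illustrative and may not match the paper's exact training recipe.

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    # Straight-through estimator: quantize in the forward pass,
    # pass the gradient through unchanged in the backward pass.
    @staticmethod
    def forward(ctx, x):
        return x.round().clamp(-8, 7)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output   # treat the non-differentiable rounding as identity

x = torch.randn(4, requires_grad=True)
y = QuantizeSTE.apply(x).sum()
y.backward()
print(x.grad)  # all ones: gradients flow despite the non-differentiable round()
```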

BitNet a4.8

BitNet a4.8 addresses the challenges of optimizing 1-bit LLMs through what the researchers describe as “hybrid quantization and sparsification.” They designed an architecture that selectively applies quantization or sparsification to different components of the model based on the specific distribution patterns of their activations. The architecture uses 4-bit activations for the inputs to the attention and feed-forward network (FFN) layers. It uses 8-bit sparsification for intermediate states, keeping only the top 55% of the parameters. The architecture is also optimized to take advantage of existing hardware.
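A simplified sketch of how such a hybrid scheme might look inside a single feed-forward block: 4-bit quantization at the layer input, sparsification of the intermediate state. The layer structure, scaling choices and thresholds are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def quant4(x):
    # 4-bit absmax quantize-dequantize (illustrative scaling).
    s = x.abs().max().clamp(min=1e-5) / 7.0
    return (x / s).round().clamp(-8, 7) * s

def topk_sparsify(x, keep=0.55):
    # Keep the largest 55% of activations by magnitude; zero out the rest.
    k = max(1, int(x.shape[-1] * keep))
    _, idx = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, idx, True)
    return x * mask

class HybridFFN(nn.Module):
    # A toy FFN block: 4-bit activations at the input, sparsified intermediate states.
    def __init__(self, d_model=1024, d_hidden=4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = self.up(quant4(x))            # layer input quantized to 4 bits
        h = topk_sparsify(torch.relu(h))  # intermediate state sparsified (top 55%)
        return self.down(h)

block = HybridFFN()
y = block(torch.randn(2, 1024))
```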

“With BitNet b1.58, the inference bottleneck of 1-bit LLMs moves from memory/IO to computation, which is constrained by the activation bits (i.e., 8 bits in BitNet b1.58),” Wei said. “In BitNet a4.8, we push the activation bits to 4 bits so we can use 4-bit kernels (e.g., INT4/FP4) to get a 2x speedup for LLM inference on GPU devices. The combination of 1-bit model weights from BitNet b1.58 and 4-bit activations from BitNet a4.8 effectively addresses both memory/IO and computational constraints in LLM inference.”

BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states in the attention mechanism. The KV cache is a crucial component of transformer models; it stores the representations of previous tokens in the sequence. By lowering the precision of KV cache values, BitNet a4.8 further reduces memory requirements, especially when dealing with long sequences.
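As a rough illustration of low-bit KV-cache storage, the sketch below quantizes key/value tensors to a signed 3-bit range before caching them; the absmax scaling and per-tensor granularity are assumptions for illustration.

```python
import torch

def quantize_kv_3bit(t: torch.Tensor):
    # Symmetric absmax quantization to the signed 3-bit range [-4, 3] (illustrative).
    scale = t.abs().max().clamp(min=1e-5) / 3.0
    t_q = (t / scale).round().clamp(-4, 3).to(torch.int8)  # stored compactly in the cache
    return t_q, scale

def dequantize_kv(t_q: torch.Tensor, scale: torch.Tensor):
    return t_q.float() * scale

k = torch.randn(1, 8, 128, 64)    # (batch, heads, seq_len, head_dim)
k_q, s = quantize_kv_3bit(k)
k_approx = dequantize_kv(k_q, s)  # used in attention in place of full-precision keys
```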

The promise of BitNet a4.8

Experimental results show that BitNet a4.8 delivers performance comparable to its predecessor BitNet b1.58 while using less compute and memory.

Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup through 4-bit activation kernels. But the design can offer much more.

“The estimated computation improvement is based on existing hardware (GPUs),” Wei said. “With hardware specifically optimized for 1-bit LLMs, the computation improvements can be significantly enhanced. BitNet introduces a new computation paradigm that minimizes the need for matrix multiplication, a primary focus in current hardware design optimization.”

The efficiency of BitNet a4.8 makes it particularly suitable for deploying LLMs at the edge and on resource-constrained devices. This can have important privacy and security implications. By running LLMs on-device, users can benefit from the power of these models without having to send their data to the cloud.

Wei and his team continue their work on 1-bit LLMs.

“We continue to advance our research and vision for the era of 1-bit LLMs,” said Wei. “While our current focus is on model architecture and software support (i.e., bitnet.cpp), we aim to explore the co-design and co-evolution of model architecture and hardware to fully unlock the potential of 1-bit LLMs.”