GGUF is a binary format that is optimized for quick loading and saving of models.
Models initially developed in frameworks like PyTorch can be converted to GGUF for use with GGML-based inference engines such as llama.cpp.
Unlike tensor-only file formats like safetensors – which is also a recommended model format for the Hub – GGUF encodes both the tensors and a standardized set of metadata.
It achieves this by combining the model parameters (weights and biases) with the additional metadata needed for efficient execution. GGUF is unambiguous, extensible and versatile, capable of incorporating new information without breaking compatibility with older models. It is a more recent development that builds upon the foundations laid by its predecessor file format, GGML.
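As a rough sketch of what this looks like on disk, the metadata and tensor index of a GGUF file can be inspected with the `gguf` Python package that ships alongside llama.cpp (the file name below is a placeholder):

```python
# Minimal sketch: inspect the metadata and tensor index of a GGUF file.
# Assumes `pip install gguf` and a local file named "model.gguf" (placeholder).
from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# Standardized key-value metadata (architecture, context length, tokenizer, ...)
for key in reader.fields:
    print(key)

# Tensor index: names, shapes and quantization types stored in the same file
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```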
GGUF supports fine-tuning, so users can adapt LLMs to specialized applications, and it stores prompt templates for model deployments across applications.
Quantization, the process of representing values with lower-precision data types that have fewer possible values, plays a crucial role in GGUF. Quantization enhances efficiency and performance, particularly on hardware with limited resources. By reducing model size and improving inference speed, quantized models require less computational power, leading to reduced energy consumption. This makes GGUF highly suitable for deployment on edge devices and mobile platforms where power resources are constrained.
For example, one specific quantization technique is GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers). GPTQ reduces the size and computational needs of an LLM by converting its high-precision weights into lower-precision representations. This allows LLMs to be deployed on devices with less memory and processing power.
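As a sketch of how this looks with the Hugging Face transformers integration (the model id is a placeholder, and the optimum/auto-gptq backends must be installed), a model can be GPTQ-quantized at load time:

```python
# Sketch: post-training GPTQ quantization via transformers' GPTQConfig.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ: weights are calibrated on a small dataset and stored in low precision.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```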
GGUF is also designed to incorporate new features without compromising compatibility with earlier versions. New data types and metadata can be added, making GGUF future-proof: as machine learning models evolve, GGUF can accommodate these changes, preserving long-term relevance and adaptability.
GGUF's binary format design significantly improves the speed of loading and saving models, which is particularly vital for applications that require quick deployment and inference. Real-time language conversion services and interactive AI systems, for instance, benefit from GGUF's efficient model file handling. The quicker a model can be loaded and used, the better the user experience in these time-sensitive applications.
GGUF stands out due to its compatibility with advanced tuning techniques like low-rank adaptation (LoRA), quantized low-rank adaptation (QLoRA) and activation-aware weight quantization (AWQ). These techniques further optimize model performance and resource utilization.
Moreover, GGUF supports various quant levels, providing flexibility in balancing model accuracy and efficiency. Common quantization schemes that are supported by GGUF include:
- 2-bit quantization: Offers the highest compression, significantly reducing model size and memory footprint, though with the largest potential impact on accuracy.
- 4-bit quantization: Balances compression and accuracy, making it suitable for many practical applications.
- 8-bit quantization: Provides good accuracy with moderate compression, widely used in various applications.
Quants refer to the various quantization levels applied to model weights, such as 2-bit, 4-bit or 8-bit quantization.
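For example, a 4-bit GGUF file can be run locally with llama-cpp-python; the model path, quantization suffix (Q4_K_M) and prompt below are placeholders:

```python
# Sketch: running a 4-bit quantized GGUF model with llama-cpp-python.
# Requires `pip install llama-cpp-python`; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # 4-bit quantized GGUF (placeholder path)
    n_ctx=2048,  # context window
)

output = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```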
QLoRA
- Innovations:
- 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights. It is used only for storage: the computation data type is still bf16, so for the forward and backward pass the storage data type is de-quantized.
- Double quantization, i.e. quantizing the quantization constants: when quantizing, values are rescaled by a constant C to make them fit into a certain range. Double quantization quantizes C itself, saving an average of 0.37 bits per parameter, which is quite significant!
- Paged optimizers to manage memory spikes, using NVIDIA unified memory (automatic transfers between GPU and CPU) to avoid the gradient-checkpointing memory spikes that occur when processing a mini-batch with a long sequence length.
- Effect:
- On compute:
- the memory cost is greatly reduced, at the cost of a small computational overhead
- On model accuracy: no degradation of performance.
- About bf16: this data type is brain float16, introduced by Google Brain, which allocates mantissa and exponent bits differently (fewer mantissa bits, the same exponent range as fp32) to get fp32-level dynamic range with the size of fp16.
- Hyperparameters used for finetuning experiments (see the configuration sketch after this list):
- “We find LoRA r is unrelated to final performance if LoRA is used on all layers”
- LR: 1e-4 or 2e-4, constant schedule.
- Batch size: 16 for models under 13B, 16 or 32 for 33B, 16-64 for 65B
- NF4 with double quantization and bf16 computation datatype.
- LoRA r = 64, α = 16
- We also use Adam beta2 of 0.999, max grad norm of 0.3 and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models.
- Target modules: “all linear layers of the base model”
- “use group-by-length to group examples of similar lengths in the same batch (note this will produce an oscillating loss curve)”
- Question: the paper says “We find that LoRA dropout 0.05 is useful for small models (7B, 13B), but not for larger models (33B, 65B).” Then why use the opposite in the finetuning experiments?
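Putting the pieces above together, here is a configuration sketch of a QLoRA-style setup with bitsandbytes and PEFT that mirrors the listed hyperparameters (NF4 storage, double quantization, bf16 compute, r = 64, α = 16). The base model id is a placeholder and the target module names depend on the architecture:

```python
# Sketch: QLoRA-style setup with bitsandbytes 4-bit quantization and PEFT LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 storage data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 for the forward/backward passes
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,                       # 0.05 for the larger models in the paper
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # "all linear layers"
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="qlora-out",
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    per_device_train_batch_size=16,
    max_grad_norm=0.3,
    optim="paged_adamw_32bit",              # paged optimizer to absorb memory spikes
    group_by_length=True,                   # expect an oscillating loss curve
)
```

The `paged_adamw_32bit` optimizer and `group_by_length` settings correspond to the paper's memory-spike handling and batching notes above.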
Completion-Only Supervised Fine-Tuning (SFT)
Advantages:
- Simplicity:
- Easier to implement compared to more complex fine-tuning strategies that require understanding and following instructions.
- Efficiency:
- Requires fewer computational resources, since the model only learns to generate appropriate completions and does not need to handle instruction parsing.
- Focused Learning:
- The model can specialize in generating high-quality responses for specific types of prompts, enhancing performance in targeted applications.
Disadvantages:
- Limited Flexibility:
- Unlike instruction-tuned models, completion-only fine-tuned models may not handle varied or complex instructions as effectively.
- Potential for Overfitting:
- If the dataset isn't sufficiently diverse, the model might overfit to specific prompt-response patterns, reducing its ability to generalize.
- Less Robustness:
- May struggle with prompts that slightly deviate from the training data format or contain unforeseen variations.
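As a minimal sketch of what "completion-only" means in practice, the loss can be masked so that only completion tokens are supervised; the prompt/response pair and tokenizer below are placeholders:

```python
# Sketch: completion-only label masking for supervised fine-tuning.
# Tokens belonging to the prompt are set to -100 so the cross-entropy loss
# is computed only on the completion.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

prompt = "### Question: What is GGUF?\n### Answer: "
completion = "A binary file format for storing quantized LLMs."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([prompt_ids + completion_ids])
labels = torch.tensor([[-100] * len(prompt_ids) + completion_ids])  # mask the prompt

# input_ids and labels can now be fed to a causal LM; only the completion is supervised.
```

Libraries such as TRL provide data collators that apply this kind of masking automatically over whole datasets.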
An “outlier” is a hidden-state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, -6] or [6, 60]). 8-bit quantization works well for values of magnitude around 5, but beyond that there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
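In the bitsandbytes integration this threshold is exposed directly; a minimal sketch (the model id is a placeholder):

```python
# Sketch: adjusting the LLM.int8() outlier threshold for a less stable model.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # default; try a lower value for small or fine-tuned models
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",  # placeholder model
    quantization_config=quant_config,
    device_map="auto",
)
```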
This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
Compute data type
To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in `BitsAndBytesConfig`.
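For example (a minimal sketch; the variable name is illustrative):

```python
# Sketch: 4-bit quantization with bf16 as the compute data type.
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```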
Normal Float 4 (NF4)
NF4 is a 4-bit data type from the QLoRA paper, adapted for weights initialized from a normal distribution. You should use NF4 when training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in `BitsAndBytesConfig`.
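For example (again a minimal sketch):

```python
# Sketch: selecting the NF4 quantization type for 4-bit loading.
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
```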
For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
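For instance, a sketch with a placeholder model id, passing the same dtype in both places:

```python
# Sketch: keeping the compute dtype and torch_dtype consistent at inference time.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # placeholder model
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,       # matches the compute dtype above
    device_map="auto",
)
```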