12 Oct 2025 23 min read book review

[Notes] LLM Engineer's Handbook

https://www.oreilly.com/library/view/llm-engineers-handbook/9781836200079/

Chapter 4: RAG

RAG system composed of 3 main modules: Ingestion (populate vector DB), Retrieval (retrieve relevant entries), Generation (use retrieved data to augment prompt).

Pre-retrieval optimizations: Better data indexing, query optimization

Data indexing techniques: Sliding window to preserve context near chunk borders, ensure data granularity by removing irrelevant/outdated/inaccurate stuff, add metadata tags (e.g. Dates, external IDs) to filter results effectively during retrieval, Small-to-Big retrieval using a small bit of text to compute the embedding to introduce less noise, while still preserving the full, longer text for LLM’s final answer
Query optimization: Query routing to decide what action to take based on user input, rewrite the query by paraphrasing, replacing less common words, break down longer queries into multiple shorter, focused sub-queries, query expansion by adding additional terms/concepts, hypothetical document embeddings (HyDE) by generating a hypothetical answer and using its embedding to retrieve, self-query by using LLM to extract key entities/relationships in the query & use as filtering params to reduce search space
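
A minimal sketch of the sliding-window idea from the data indexing notes above. It assumes whitespace-separated words as the unit; `window` and `overlap` are hypothetical parameter names (real pipelines usually count tokens and tune both sizes to the embedding model):

```python
def sliding_window_chunks(words, window=5, overlap=2):
    """Split a list of words into chunks that share `overlap` words with their neighbor."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "a b c d e f g h i j".split()
chunks = sliding_window_chunks(text, window=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence cut at a chunk border still appears whole in at least one chunk.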

Retrieval optimizations: Improving embedding model, use DB’s filter/search features

Improve embedding model by: Fine-tuning the embedding model to tailor it to specific jargon/nuance, use instructor models for task-aware embeddings
Filter/Search features: Hybrid search blends vector (similarity) & keyword-based (BM25 relevance) search - an alpha param controls the weight between the two methods. Filtered vector search uses metadata constraints either to limit the search space before the search (pre-filtering) or to filter the search results afterwards (post-filtering)
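
A rough sketch of how an alpha-weighted blend could work. Assumptions: both score sets are already normalized to [0, 1] (raw BM25 scores are not), and alpha=1 means pure vector search while alpha=0 means pure keyword search. Conventions differ between vector DBs, and the function names and document IDs here are made up for illustration:

```python
def hybrid_scores(vector_scores, keyword_scores, alpha=0.5):
    """Blend two {doc_id: score} dicts; alpha weights the vector side (assumed convention)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    doc_ids = set(vector_scores) | set(keyword_scores)
    return {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


def rank(scores):
    """Sort doc ids by descending score (ties broken by id for determinism)."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_scores = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.1}
keyword_scores = {"doc_b": 1.0, "doc_c": 0.6}

for alpha in (1.0, 0.5, 0.0):
    print(alpha, rank(hybrid_scores(vector_scores, keyword_scores, alpha)))
```

Sliding alpha from 1 to 0 moves `doc_a` (strong semantic match, no keyword hit) from first place to last.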

Post-Retrieval optimization:

Result compression by removing unnecessary details, re-ranking models (cross-encoder) to evaluate most relevant documents

RAG Design Choices

Batch pipelines: Useful when dealing with large volumes of data that don’t need immediate processing. Data is collected, processed on a schedule, and then saved for future use
- Advantages: Optimized resource allocation & parallel processing as large volumes of data can be handled more efficiently, simpler than a streaming pipeline
- Disadvantages: Features are not fresh (they can go stale between runs), and redundant predictions may be computed
Streaming pipelines: Core elements are distributed event streaming platforms (e.g. kafka) to store events from multiple clients and a streaming engine (e.g. Apache Flink) to process the events. Used in cases where change is unpredictable & frequent (e.g. social media recommendation).

Chapter 5: Supervised Fine-Tuning

Instruction Dataset

SFT uses curated pairs of instructions/answers to allow the model to adapt its broad knowledge base to excel in targeted tasks or specialized domains

Task-specific models: Designed to excel at particular functions (e.g. translation, summarization). Data required is manageable (100-100k samples), much less than domain-specific models
Domain-specific models: Aim to tweak LLM with specialized knowledge & familiarity with vocab of a particular field. Key factors to determine data needs include size of domain (how much specialized knowledge/vocab) and representation of that domain in the model’s pre-training data

Creating instruction dataset

Pairs of instructions/answers follow certain templates (alpaca, openai, etc.)
Alpaca for single-turn instructions/answers, ShareGPT/OpenAI for conversations (multi-turn)
“Raw text” data format doesn’t have question/answer pairs. It is used for continual pre-training, where we simply continue the self-supervised objective the base model was trained with
High quality data is accurate (factually correct & relevant), diverse (encompass a wide range of use cases, span topics, contexts, text lengths), has complexity (include complex, multistep reasoning problems & tasks)

Refine quality of dataset: Rule-based filtering, data deduping, decontamination, quality evaluation

Rule-based filtering: Use explicit, pre-defined rules to evaluate & filter samples. Length filtering (remove extremely short/long responses), keyword exclusion (remove samples with low-quality/inappropriate content), format checking (ensure samples adhere to expected format; e.g. code samples have correct syntax)
Data deduplication: Exact deduplication by removing samples with the same hash. Fuzzy deduplication via semantic similarity (remove samples with high similarity scores), MinHash deduplication (generate compact representations for each sample & remove similar ones), clustering techniques to group similar vectors & remove samples in the same cluster
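
A small sketch of the two deduplication flavors above. Note the swap: instead of MinHash (which approximates Jaccard similarity with compact signatures), this computes exact Jaccard similarity over word sets, which is only practical for small datasets. The normalization (strip + lowercase), the 0.7 threshold, and the function names are all assumptions for illustration:

```python
import hashlib


def exact_dedup(samples):
    """Keep the first sample for each hash of the normalized text."""
    seen, kept = set(), []
    for sample in samples:
        digest = hashlib.sha256(sample.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def fuzzy_dedup(samples, threshold=0.7):
    """Drop a sample if it is too similar to any sample already kept (O(n^2))."""
    kept = []
    for sample in samples:
        if all(jaccard(sample, other) < threshold for other in kept):
            kept.append(sample)
    return kept


samples = [
    "Explain what RAG is in simple terms",
    "explain what rag is in simple terms ",
    "Explain what RAG is in simple words",
    "Write a haiku about GPUs",
]
unique = exact_dedup(samples)
distinct = fuzzy_dedup(unique, threshold=0.7)
print(len(samples), len(unique), len(distinct))
```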

Data decontamination

Ensure training dataset doesn’t contain samples that are highly similar to those in the evaluation set. Ensure quality of model evaluation & prevent overfitting

Data quality evaluation

Some aspects like mathematical accuracy can be easily verified using tools like python interpreters. However, evaluating subjective/open-ended content is challenging
LLM-as-a-judge: Use LLMs to evaluate the quality of samples - comparative assessment methods (“is A better than B”) outperform absolute scoring approaches (“rate A between 1-4”). Known biases: position bias (the LLM judge favors the first answer), length bias (favors longer answers), intra-model favoritism (prefers models from the same family)
Use multiple LLMs as jury to reduce bias
Reward models take in instruction/answer pairs and return a score as output. Created by adding a linear head on top of a decoder-only LLM & training it for this purpose. Alternatively, add a classification head to an embedding model (encoder-only model)

Data Exploration

Manual exploration reveals errors/inconsistencies that automated processes may miss. Statistical analysis to reveal vocabulary diversity, potential biases, concept representation

Insufficient data: Data generation/augmentation

If available datasets are insufficient, custom data can be created via manual crowdsourcing, synthetic data generation using LLMs. Synthetic data generation begins by preparing set of carefully designed prompts used to generate new, diverse examples
Data augmentation uses pre-existing instructions to increase quality & quantity of sample. Evol-Instruct uses LLMs to generate more complex variants of instructions. In-depth (enhance complexity) & in-breadth (generate new instructions inspired by existing ones) evolving

Creating instruction dataset

Main challenges are the unstructured nature of the data (need to convert raw text to pairs of instruction/answer), and the limited no. of articles that can be crawled
Use LLMs to transform unstructured data. Backtranslation by giving the expected answer and generating its corresponding instruction. Rephrase raw text to ensure the answers are high-quality and properly formatted
To address limited no. of samples, we divide articles into chunks and generate 3 instruction-answer pairs for each chunk.

SFT Technique

Chat templates

Fine-tuning (using alpaca, OpenAI, etc.) vs continual pre-training (using raw text)
Chat templates to present instructions & answers to the model. Base models don’t have chat templates so we can fine-tune with any template. If we fine-tune an instruct model (not recommended), we need to use the same template to avoid performance degradation
Every single whitespace, line break is important. Adding/removing any char will result in wrong tokenization

Params

Fp32 means every learnable param is stored in 32 bit precision (4 bytes)
1e9 (1 billion) bytes = 1 GB. A 1-billion param fp32 model takes 4 bytes * 1B params = 4B bytes = 4 GB

Rule of thumb for memory usage

Inference: Param count (in billions) * 4 * 1.3 (memory for intermediate activations, KV cache, etc.)
- E.g. Llama 70b in fp32 takes 70 * 4 * 1.3 = 364GB
Training: Param count (in billions) * 16
- E.g. Llama 70b in fp32 takes 70 * 16 = 1120GB
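
The two rules of thumb above as a tiny calculator. The function and parameter names are made up; the constants (4 bytes/param for fp32, a 1.3 overhead factor for inference, 16 bytes/param for training) come straight from the notes:

```python
def inference_memory_gb(params_billions, bytes_per_param=4, overhead=1.3):
    """Rule of thumb: weights * overhead for activations, KV cache, etc."""
    return params_billions * bytes_per_param * overhead


def training_memory_gb(params_billions, bytes_per_param_total=16):
    """Rule of thumb: about 16 bytes per parameter for fp32 training."""
    return params_billions * bytes_per_param_total


for size in (7, 13, 70):
    print(
        f"{size}B params: "
        f"inference ~{inference_memory_gb(size):.0f} GB, "
        f"training ~{training_memory_gb(size):.0f} GB"
    )
```

Passing `bytes_per_param=2` gives a rough estimate for fp16/bf16 inference under the same overhead assumption.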

PEFT: Full finetuning, LoRA, QLoRA

Full-finetuning: Re-train every param. Like pre-training, full fine-tuning uses next-token prediction as the training objective. Similar to continual pre-training; continual pre-training uses raw text, full-finetuning uses q-a pairs
LoRA: Significantly reduces computational resources by training small low-rank matrices while the base weights stay frozen. Hyperparameters are Rank (r), which determines the size of the LoRA matrices - larger rank=capture more diverse tasks, but can overfit; and Alpha (α), a scaling factor which controls how strongly the LoRA update is applied on top of the frozen weights. Can use multi-LoRA serving frameworks (LoRAX dynamically attaches the right adapter based on task), and merge multiple LoRA adapters with the base model
QLoRA: Quantize base model to NF4 datatype. More memory savings than LoRA, but increased training time. Minor difference in performance compared to LoRA
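
To make rank and alpha concrete, here is a pure-Python sketch of the usual LoRA formulation, W_eff = W + (alpha / r) * B @ A, where W is the frozen weight (d_out x d_in), B is d_out x r and A is r x d_in. The alpha / r scaling is the common convention from the LoRA paper, not something stated in these notes; the helper names and the tiny matrices are illustrative only:

```python
def lora_param_counts(d_out, d_in, r):
    """Trainable params: full fine-tuning of one matrix vs. its LoRA adapter."""
    full = d_out * d_in
    lora = r * (d_out + d_in)  # B has d_out*r entries, A has r*d_in
    return full, lora


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Merge the adapter into the frozen weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


full, lora = lora_param_counts(d_out=4096, d_in=4096, r=16)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")

w = [[1, 0], [0, 1]]   # frozen 2x2 weight
b = [[1.0], [2.0]]     # 2x1, rank r=1
a = [[0.5, -0.5]]      # 1x2
w_eff = lora_effective_weight(w, a, b, alpha=2, r=1)
print(w_eff)
```

Merging an adapter into the base model amounts to computing `w_eff` once and storing it in place of `w`.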

Memory requirements can be estimated via

Memory = Params + Gradients + Optimizer states + Activations

Params (4 bytes/param): Learnable weights & biases (mostly weights in attention mechanism, feed-forward, embedding layers)
Gradients (4 bytes/param): Partial derivatives of loss fn
Optimizer states (8 bytes/param): Values maintained by optimization algorithms (e.g. running avg of past gradients)
Activations (negligible): Intermediate outputs of each layer
So, we’ve around 16 bytes per parameter required for training

Reduce memory usage via model parallelism (spread workload across multiple GPUs), gradient accumulation (accumulate gradients over several small batches before updating the weights; only the activations of the current small batch are held in memory instead of those of the full effective batch), memory-efficient optimizers, activation checkpointing (recalculate some activations on-the-fly instead of storing in memory)

Fine-tuning hyperparameters

Learning rate: Control how much model’s params are updated. Start around 1e-5. Learning rate scheduler (linear, cosine scheduler) adjusts the LR throughout training, gradually decreasing it in later stages
Batch Sizes: No. of samples processed before weights are updated. Larger batch size = more stable gradient estimates, but requires more memory. Use gradient accumulation to handle memory constraints - process samples in batches & accumulate gradients across batches & use it to update the weights
Max length & packing: Longer max len = higher memory usage. Packing merges shorter samples into one single sequence to avoid computation being wasted on PAD tokens. Use attention masks to prevent model from attending to tokens from different samples within the same packed sequence
Epochs: Number of complete passes through the entire training set. Too few can lead to underfitting; too many can cause overfitting. Monitor validation performance during training & stop early if performance plateaus/degrades
Optimizers: AdamW 8-bit, AdaFactor for memory efficiency. Paged optimizers reduce memory consumption by offloading optimizer states from GPU to CPU RAM, paging in as needed
Weight decay: Penalize large weights as those often mean overfitting to training data
Gradient checkpointing: Reduce memory consumption during training by selectively saving activations at specific layers, recomputing the rest during backward pass as needed
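
The gradient accumulation idea under Batch Sizes can be checked on a toy problem. This sketch assumes a 1-parameter model `y_hat = w * x` with mean squared error, and made-up data; it shows that accumulating (size-weighted) micro-batch gradients reproduces the full-batch gradient, so the weight update is the same while only one micro-batch is processed at a time:

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error w.r.t. w for y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients, weighted by their share of the batch."""
    total = 0.0
    for i in range(0, len(batch), micro_batch_size):
        micro_batch = batch[i:i + micro_batch_size]
        total += grad_mse(w, micro_batch) * len(micro_batch) / len(batch)
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # (x, y) pairs
w = 0.5
learning_rate = 0.01

full = grad_mse(w, data)
accum = accumulated_grad(w, data, micro_batch_size=2)
w_new = w - learning_rate * accum  # one optimizer step after accumulation
print(full, accum, w_new)
```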

Fine-tuning considerations: Model license - some only allow non-commercial work. Budget - Models with small param size are cheaper to fine-tune. Performance - evaluate base model on domain- or task-specific benchmarks relevant to the use case

Fine-tune with TRL, Axolotl, Unsloth
Monitor training with Comet ML - Training/validation loss should continuously decrease on average. Gradient norm should be stable/decreasing to show convergence

Chapter 6: Fine-Tuning with Preference Alignment

Preference data comprises a collection of responses to a given instruction, ranked by humans/language models.

For Direct Preference Optimization (DPO), each instruction is paired with one preferred answer and one rejected answer.

Unlike instruction datasets, there's no standardized storage formats like Alpaca for preference datasets.

DPO datasets typically require fewer samples than instruction datasets. As with instruction datasets, the required sample count depends on model size and task complexity

Larger models are more sample-efficient and thus require less data
Complex tasks demand more examples to capture the desired behavior

DPO datasets can be created in 4 main ways

Human-generated, human-evaluated: Ideal for complex tasks, but extremely resource-intensive and difficult to scale
Human-generated, LLM-evaluated: Can be useful if you have a lot of existing human-generated content. Rarely used in practice due to inefficiency; still requires significant human input for response generation
LLM-generated, human-evaluated: Good balance between quality and efficiency. This approach is often preferred because humans are generally better at judging answers than writing them from scratch.
LLM-generated, LLM-evaluated: Scalable and cost-effective. This method can produce massive datasets quickly. However, it requires careful prompt engineering to ensure quality and diversity, and may perpetuate biases or limitations of the generating LLM.

Preferences can emerge naturally from the generation process. E.g. use a high-quality model to generate preferred outputs and a lower-quality or intentionally flawed model to produce less preferred alternatives.

Evaluating Preferences

LLM evaluation for preference datasets can be done via

Absolute scoring: Straightforward, but may suffer from inconsistency across different prompts or evaluation sessions
Pairwise ranking: Involves presenting the LLM with two responses and asking it to choose the better one or rank them. Can lead to more consistent results

We can further improve the accuracy of pairwise ranking by providing a ground-truth answer and using chain-of-thought reasoning. If no ground-truth answer is available, we can prompt the LLM to create a grading note - a description of the expected answer.

Preference Alignment

Techniques to fine-tune models on preference data include

RLHF: A reward model is learned from human feedback. RL algorithms used to optimize a policy (the LLM weights) that maximizes the rewards from the reward model.
- Proximal Policy Optimization (PPO) is one of the most popular RLHF algorithms. The reward is regularized by an additional KL divergence penalty to ensure the outputs stay similar to the original model
Direct Preference Optimization: RLHF objective (maximize expected reward, penalize KL divergence) is reformulated in closed form. No need to learn separate reward model, directly express how the policy should shift its probability distribution using preference pairs
- A simple binary cross-entropy loss function operating directly on the LLM’s output probabilities. Encourage model to assign higher probs to preferred responses
- Simple - can optimize using standard gradient descent
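
A sketch of the DPO loss for a single preference pair, assuming the standard formulation: a binary cross-entropy on the difference of log-probability ratios between the policy and a frozen reference model, scaled by a temperature `beta`. The inputs are summed log-probabilities of each full response; the argument names and the numbers below are invented for illustration:

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference expressed yet
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy raised the chosen answer's probability and lowered the rejected one's
aligned = dpo_loss(-8.0, -12.0, -10.0, -10.0)
# Policy moved in the wrong direction
misaligned = dpo_loss(-12.0, -8.0, -10.0, -10.0)
print(neutral, aligned, misaligned)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference does, which is the "shift the distribution using preference pairs" behavior described above.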

Chapter 7: Evaluating LLMs

General-purpose evaluations are the most popular ones. Domain and task-specific models benefit from more narrow approaches.

General-Purpose evaluations

General-Purpose evaluations cover a breadth of capabilities that are correlated with knowledge and usefulness without focusing on specific tasks or domains. They can be categorized in three phases: during pre-training, after pre-training, and after fine-tuning

During Pre-training: We closely monitor how the model learns

Training loss: Based on cross-entropy loss, measures the difference between the model’s predicted probability distribution and the true distribution of the next token
Validation loss: Calculates the same loss as training loss, but on a held-out validation set to assess generalization
Perplexity: Exponential of the cross-entropy loss, representing how “surprised” the model is by the data (lower is better)
Gradient norm: Monitors the magnitude of gradients during training to detect potential instabilities or vanishing/exploding gradients
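
The relationship between cross-entropy loss and perplexity, as a small sketch. It assumes we already have the probability the model assigned to each true next token (the lists below are invented):

```python
import math


def cross_entropy(true_token_probs):
    """Mean negative log-likelihood of the true next tokens."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(true_token_probs):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(cross_entropy(true_token_probs))


confident = [0.9, 0.8, 0.95, 0.85]   # model usually expects the right token
uncertain = [0.25, 0.25, 0.25, 0.25]  # like guessing uniformly among 4 tokens

print(perplexity(confident))
print(perplexity(uncertain))
```

A perplexity of k can be read as the model being about as unsure as if it were choosing uniformly among k tokens at each step.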

After Pre-training: Common to use a suite of evaluations to evaluate the base model

MMLU (knowledge): Tests models on multiple-choice questions across 57 subjects, from elementary to professional levels
HellaSwag (reasoning): Challenges models to complete a given situation with the most plausible ending from multiple choices
ARC-C (reasoning): Evaluates models on grade-school-level multiple-choice science questions requiring causal reasoning
Winogrande (reasoning): Assesses common sense reasoning through pronoun resolution in carefully crafted sentences
PIQA (reasoning): Measures physical common sense understanding through questions about everyday physical interactions

Fine-tuned models also have their own benchmarks. These benchmarks target capabilities connected to the ability of fine-tuned models to understand and answer questions. They test instruction-following, multi-turn conversation, and agentic skills

IFEval (instruction following): Assesses a model’s ability to follow instructions with particular constraints, like not outputting any commas in your answer
Chatbot Arena (conversation): A framework where humans vote for the best answer to an instruction, comparing two models in head-to-head conversations
AlpacaEval (instruction following): Automatic evaluation for fine-tuned models that is highly correlated with Chatbot Arena
MT-Bench (conversation): Evaluates models on multi-turn conversations, testing their ability to maintain context and provide coherent responses
GAIA (agentic): Tests a wide range of abilities like tool use and web browsing, in a multistep fashion

Domain-Specific LLM evaluations

Here is a list of domain-specific evaluations with leaderboards on the Hugging Face Hub

Open Medical-LLM Leaderboard: Evaluates the performance of LLMs in medical question-answering tasks. It groups 9 metrics, with 1,273 questions from the US medical license exams (MedQA), 500 questions from PubMed articles (PubMedQA), etc.
BigCodeBench Leaderboard: Evaluates the performance of code LLMs, featuring two main categories: BigCodeBench-Complete for code completion based on structured docstrings, and BigCodeBench-Instruct for code generation from natural language instructions.
Hallucinations Leaderboard: Evaluates LLMs’ tendency to produce false or unsupported information across 16 diverse tasks spanning 5 categories.
Enterprise Scenarios Leaderboard: Evaluates the performance of LLMs on six real-world enterprise use cases, covering diverse tasks relevant to business applications. Benchmarks include FinanceBench (100 financial questions with retrieved context), Legal Confidentiality (100 prompts from LegalBench for legal reasoning), etc.

Task-Specific LLM evaluations

General-purpose and domain-specific evaluations don’t provide insights into how well these models work for a given task. This requires benchmarks specifically designed for this purpose, measuring downstream performance.

Summarization tasks can leverage the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, which measures the overlap between the generated text and reference text using n-grams.
Classification tasks can use classic metrics like accuracy, precision, recall, F1 score
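
A simplified sketch of ROUGE-N to show what "n-gram overlap" means in practice. It assumes lowercased whitespace tokenization and no stemming, so scores will differ from established ROUGE packages; the example sentences are made up:

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between candidate and reference text."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
print(r1)
print(r2)
```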

We can create a custom benchmark to evaluate our own tasks. There are 2 main ways to evaluate models with this scheme

Text generation: Model generates text responses and we compare those to predefined answer choices. For example, the model generates a letter (A/B/C/D) as its answer, which is then checked against the correct answer
Evaluation using probabilities: Look at the model’s predicted probabilities for different answer options without requiring text generation. E.g. compare the probabilities for the full text of each answer choice. This approach can capture the relative confidence the model has in different options

If the task is too open-ended, traditional ML metrics and multiple-choice question answering might not be relevant. LLM-as-a-judge can be used.

RAG Evaluation

Evaluation of RAG system requires examining the entire system’s performance

Retrieval accuracy: How well does the system fetch relevant information?
Integration quality: How effectively is the retrieved information incorporated into the generated response?
Factuality and relevance: Does the final output address the query appropriately while seamlessly blending retrieved and generated content?

RAGAS: Open-sourced toolkit for RAG evaluation. Can synthetically generate diverse and complex test datasets. LLM-assisted evaluation metrics include

Faithfulness: Measures the factual consistency of the generated answer against the given context. Break down the answer into individual claims and verify if each claim can be inferred from the provided context. The faithfulness score=verifiable claims/total number of claims in the answer
Answer relevancy: Evaluates how pertinent the generated answer is to the given prompt. An LLM is prompted to generate multiple questions based on the answer and then calculates the mean cosine similarity between these generated questions and the original question. This method helps identify answers that may be factually correct but off-topic or incomplete
Context precision: This metric evaluates whether all the ground-truth relevant items present in the contexts are ranked appropriately. It considers the position of relevant information within the retrieved context, rewarding systems that place the most pertinent information at the top
Context recall: This metric measures the extent to which the retrieved context aligns with the annotated answer (ground truth). It analyzes each claim in the ground truth answer to determine whether it can be attributed to the retrieved context, providing insights into the completeness of the retrieved information.
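
The arithmetic behind the first two metrics, sketched without any LLM calls. This is not the RAGAS API: in the real toolkit an LLM extracts and verifies the claims and an embedding model produces the vectors, whereas here the claim verdicts and the 2-D "embeddings" are hand-written placeholders:

```python
import math


def faithfulness(claim_supported):
    """Share of answer claims that can be inferred from the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original and the generated questions."""
    sims = [cosine(question_vec, vec) for vec in generated_question_vecs]
    return sum(sims) / len(sims)


# Placeholder verdicts: 3 of the 4 claims in the answer are supported by the context
claims = [True, True, False, True]
print(faithfulness(claims))

# Placeholder embeddings: one generated question matches, one is orthogonal
original_question = [1.0, 0.0]
generated_questions = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(original_question, generated_questions))
```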

ARES: Operates in 3 main stages: synthetic data generation, classifier training, and RAG evaluation.

Synthetic data generation: ARES creates datasets that closely mimic real-world scenarios for robust RAG testing. Users can configure this process by specifying document file paths, few-shot prompt files, and output locations for the synthetic queries.
Classifier training: Create high-precision classifiers to determine the relevance and faithfulness of RAG outputs. Users can specify the classification dataset (typically generated from the previous stage), test set for evaluation, label columns, and model choice.
RAG evaluation: Leverage the trained classifiers and synthetic data to assess the RAG model’s performance. Users provide evaluation datasets, few-shot examples for guiding the evaluation, classifier checkpoints, and gold label paths.

Chapter 8: Inference Optimization

Optimizing the inference process is critical for many practical applications. This includes

Reducing the time it takes to generate the first token (latency)
Increasing the number of tokens generated per second (throughput)
Minimizing the memory footprint of LLMs.

The basic inference for a decoder-only model (used by most LLMs now) involves

1. Tokenizing the input prompt and passing it through an embedding layer and positional encoding
2. Computing key and value pairs for each input token using the multi-head attention mechanism
3. Generating output tokens sequentially, one at a time, using the computed keys and values

Steps 1 and 2 are computationally expensive, but consist of highly parallelizable matrix multiplication that can achieve high hardware utilization on GPUs/TPUs.

The real challenge is that token generation in Step 3 is inherently sequential - to generate the next token, you need to have generated all previous tokens. The output sequence grows one token at a time, failing to leverage the parallel computing capabilities of the hardware. Addressing this bottleneck is one of the core focuses of inference optimization.

Model Optimization Strategies

KV Cache: Instead of recalculating key-value pairs for each new token, the model retrieves them from KV cache. When a new token is generated, only the key and value for that single token need to be computed and added to the cache.
- Can take advantage of torch.compile by having a static KV cache. Pre-allocate the KV cache size to a maximum value
Continuous batching: Aka in-flight batching. Prevent idle time by immediately feeding a new request into the batch as soon as one completes. The accelerator is always processing a full batch.
- System must occasionally pause or interleave these two types of work: Prefill new waiting requests (build KV tensors for all input tokens), and continue generation for existing requests
- Waiting-served ratio hyperparameter determines how often the system pauses generation to admit new requests & run their prefill
- Prefill too often=generation slows down, less progress per request
- Prefill too rarely=waiting requests queue up, higher latency
Speculative Decoding: Predict multiple token completions in parallel, using a smaller proxy model (draft model). Full model validates speculative completions & keeps draft tokens that are consistent with its own probabilities.
- Both full & draft models must use the same tokenizer. Often, draft model is a distilled/pruned version of the main model
- Prompt lookup decoding is a variant of speculative decoding, used in input-grounded tasks like summarization where there is often overlap between the prompt and output.
- Another approach to creating a small proxy model consists of jointly fine-tuning a small model alongside a large model for maximum fidelity. E.g. Medusa, inserts dedicated speculation heads into the main model
Optimized Attention Mechanisms: Attention mechanism scales quadratically with number of input tokens, as attention operation compares all tokens pairwise
- PagedAttention addresses these memory challenges by treating the KV cache like paged virtual memory, partitioning the KV cache into fixed-size memory pages (blocks) & eliminating the need for contiguous memory allocation.
  - Models can reference these blocks indirectly, using pointers. This allows memory sharing across multiple output sequences generated from the same prompt (e.g. parallel sampling, beam search)
  - GPU memory is managed more efficiently. No memory fragmentation & reallocation needed… if a sequence grows in size, just grab a free new page and link it to the sequence’s page table
  - As blocks are fixed-sized, they’re parallelizable. Traditionally, when we want to parallelize sequences of differing lengths, we need to pad the shorter sequences, wasting computation
- FlashAttention-2: Split input and output matrices into smaller blocks, ensuring they can fit into the GPU’s on-chip SRAM, which is much faster than high-bandwidth memory (HBM)
  - Entire computation for one block happens on-chip; intermediate results don’t need to be written out to slower HBM
- Online softmax: computes the softmax function independently for each block of the attention scores matrix, rather than for the entire matrix at once. Can calculate attention probabilities without needing to store large intermediate matrices.
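
A sketch of how a block-by-block softmax can still produce the exact attention output. One way to make per-block results combine correctly is to keep a running maximum and a running denominator and rescale what has been accumulated whenever a new block raises the maximum. Simplifications: a single query, scalar values instead of value vectors, plain Python instead of GPU kernels, and invented numbers:

```python
import math


def attention_reference(scores, values):
    """Standard softmax-weighted sum over the whole score row at once."""
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores, values, block_size=2):
    """Same result, visiting the scores one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # softmax denominator relative to running_max
    acc = 0.0          # unnormalized weighted sum of values
    for start in range(0, len(scores), block_size):
        score_block = scores[start:start + block_size]
        value_block = values[start:start + block_size]
        new_max = max(running_max, max(score_block))
        # Rescale previous partial results to the new maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc *= correction
        for s, v in zip(score_block, value_block):
            weight = math.exp(s - new_max)
            running_sum += weight
            acc += weight * v
        running_max = new_max
    return acc / running_sum


scores = [1.0, 3.0, 0.5, 2.0, 4.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
print(attention_reference(scores, values))
print(attention_online(scores, values, block_size=2))
```

Only the current block plus three running numbers are kept, which is why the full attention probability matrix never has to be stored.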

Model Parallelism

Data Parallelism: Make copies of the model across GPUs. Used mainly for training; each GPU processes a subset of the data
- For training: Gradients are averaged across GPUs, and model params are updated synchronously. Useful when batch size is too large to fit into a single machine, or aiming to speed up training process using multiple GPUs
- For inference: Help process concurrent requests. Distribute workload across multiple GPUs, reducing latency & increasing throughput as multiple requests can be handled simultaneously
- Effectiveness of DP is limited by model size. Each GPU holds a full copy of the model params, so model must be small enough to fit into single GPU
- Also limited by communication overhead between GPUs: Gradients must be shared and averaged across GPUs during training - this synchronization step can become a communication bottleneck
Pipeline Parallelism: Partitions the model’s layers across different GPUs. Each GPU handles a specific portion of the model
- E.g. First 25% of the model's layers are processed by GPU1, next 25% by GPU2, etc… During forward pass, activations are computed and passed to the next layer
- Number of GPUs used is known as the degree of parallelism
- Significantly reduce memory requirements per GPU
- However, due to the sequential nature of the pipeline, “pipeline bubbles” are an issue - some GPUs sit idle, waiting for activations from the preceding stage. This idle time reduces the overall efficiency of the process
- Micro-batching mitigates the impact of pipeline bubbles. Split the input batch into smaller sub-batches. Once the GPU finishes one sub-batch, it can immediately start processing the next sub-batch before the previous one is fully completed
- Using too many micro-batches increases synchronization overhead & memory usage for activations; need to tune no. of micro-batches
Tensor Parallelism: Split weight matrices found in individual layers. Each GPU holds all layers, but only a subset of the tensors that make up those layers - it performs computations on its respective slice
- Inputs are broadcasted to all GPUs, which independently compute their respective outputs. Partial results are aggregated through an all-reduce operation, combining them to form the final output
- TP is efficient for self-attention layers due to the inherent parallelism of attention heads
- TP is not universally applicable to all layers - layers like Dropout or LayerNorm, which have dependencies spanning the entire input, can't be efficiently partitioned & are replicated across GPUs instead
- TP requires high-speed interconnects between devices to minimize communication overhead

Data, Tensor, and Pipeline Parallelism can be combined

4 pipeline stages + 4-way tensor parallelism (16 GPUs in total): Model’s layers are divided into 4 stages - each stage holds a chunk of consecutive layers. Each stage is handled by 4 GPUs working together via TP.
Pipeline Parallelism: Splits layers across GPUs (Depth-wise). Memory efficient as each GPU only stores certain layers, but not compute efficient due to pipeline bubbles
Tensor Parallelism: Splits each layer’s tensors across GPUs. Compute efficient as all GPUs are kept busy at every step, but not very memory efficient since each GPU needs to store all layers (even if only partial tensors of each layer)
PP provides the greatest memory reduction but sacrifices efficiency
DP provides fastest latency reduction, but has a larger memory footprint

Model Quantization: Represent weights, activations using lower-precision data types

Two main approaches to weight quantization: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT)
- PTQ: Straightforward - Weights are directly converted to a lower precision format w/o any retraining. May result in performance degradation
- QAT: Simulates quantization during training/fine-tuning stage. During training, forward pass simulates quantization (add quantization noise, e.g. round values to INT8) while backward pass still uses higher precision for gradient updates. The model learns to adjust its weights to make output more robust even with quantization. Once QAT is finished, we convert the model to its actual quantized form
- QAT often yields better performance than PTQ, but needs additional computational resources
Naive quantization techniques (e.g. absolute maximum quantization, zero-point quantization) involve simply rescaling floating-point values into a limited integer range
- Naive quantization has limitations, particularly with outlier features in LLMs. These extreme weight values significantly impact the quantization process, leading to reduced precision for other values
  - I.e. If we have one extremely large value, most other values get compressed into a very narrow integer range
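
A toy illustration of absolute maximum (absmax) quantization and the outlier problem described above. It assumes symmetric INT8 with one scale for the whole tensor; the weights and the 50.0 outlier are invented:

```python
def absmax_quantize(values, bits=8):
    """Map floats to signed integers using a single scale = max|v| / qmax."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return [0] * len(values), 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.1, -0.2, 0.3, -0.4]
with_outlier = weights + [50.0]

q_small, scale_small = absmax_quantize(weights)
q_out, scale_out = absmax_quantize(with_outlier)

# Reconstruction error on the four ordinary weights only
# (zip stops at the shorter list, so the outlier itself is skipped)
err_small = max(abs(w - r) for w, r in zip(weights, dequantize(q_small, scale_small)))
err_out = max(abs(w - r) for w, r in zip(weights, dequantize(q_out, scale_out)))

print(q_small, err_small)
print(q_out, err_out)
```

With the outlier present, the ordinary weights collapse onto just three integer levels (-1, 0, 1), so their reconstruction error grows by roughly two orders of magnitude.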
To address outlier problem, more advanced quantization techniques have been proposed
- E.g. LLM.int8() employs a mixed-precision quantization scheme - Outlier features are processed using FP16, while remaining (non-outlier) values are quantized to INT8.
  - During computation, INT8 weights & activations are multiplied (result is stored in INT32 register), dequantized to FP16, and combined with FP16 outlier results before being passed to next layer
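The mixed-precision flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual LLM.int8() implementation: Python floats stand in for FP16, the outlier threshold of 6.0 is an assumed parameter, and all function names are hypothetical.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    peak = max((abs(v) for v in values), default=0.0)
    return peak / qmax if peak > 0 else 1.0


def mixed_precision_matvec(x, W, threshold=6.0):
    """Compute y = x @ W, keeping outlier input features in floating point.

    x: list of d_in floats. W: d_in rows, each a list of d_out floats.
    """
    d_out = len(W[0])
    outlier = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i, v in enumerate(x) if abs(v) < threshold]

    # Quantize the non-outlier activations once.
    s_x = absmax_scale([x[i] for i in regular])
    x_q = {i: round(x[i] / s_x) for i in regular}

    y = []
    for j in range(d_out):
        # Quantize column j of the non-outlier weights with its own scale.
        s_w = absmax_scale([W[i][j] for i in regular])
        # Integer multiply-accumulate (an INT32 accumulator on real hardware).
        acc = sum(x_q[i] * round(W[i][j] / s_w) for i in regular)
        y_int8 = acc * s_x * s_w                           # dequantize
        y_outlier = sum(x[i] * W[i][j] for i in outlier)   # high-precision path
        y.append(y_int8 + y_outlier)
    return y


# Hypothetical input: feature 2 is an outlier.
x = [0.5, -1.0, 40.0, 0.25]
W = [[0.2, -0.1], [0.4, 0.3], [0.05, -0.02], [-0.3, 0.1]]

y = mixed_precision_matvec(x, W)
exact = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(2)]
print(y, exact)
```

Because the outlier feature never passes through the INT8 path, the scale for the remaining values stays small and the result stays close to the full-precision product.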
Llama.cpp is a quantization library that can run on a broader range of hardware. Can operate on CPUs and Android devices
- Features its own quantization format, GGUF, which stores tensors and metadata
- Follows a naming convention based on no. of bits used and specific variants. E.g. IQ1_M (1-bit precision), Q2_K (2-bit precision)
- Instead of quantizing the entire weight tensor once, llama.cpp splits weights into small blocks (blockwise quantization). Each block has its own scaling parameters, and is quantized using its local scale. This reduces quantization error because each block’s numeric range is smaller and more locally representative of its values
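A rough sketch of blockwise quantization compared with a single tensor-wide scale. The block size, bit width, and weights here are illustrative assumptions and do not reproduce any specific GGUF format.

```python
def quantize_blockwise(values, block_size=4, bits=4):
    """Quantize each block of `block_size` values with its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        scale = peak / qmax if peak > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in block]))
    return blocks


def dequantize_blockwise(blocks):
    return [q * scale for scale, qs in blocks for q in qs]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical weights: one block of small values, one block of large values.
weights = [0.01, -0.02, 0.03, -0.015, 2.0, -1.5, 1.0, 0.5]

# A block size equal to the tensor length is the same as one global scale.
whole = dequantize_blockwise(quantize_blockwise(weights, block_size=len(weights)))
blocked = dequantize_blockwise(quantize_blockwise(weights, block_size=4))

err_whole = mean_abs_error(weights, whole)
err_blocked = mean_abs_error(weights, blocked)
print(err_whole, err_blocked)
```

With one global scale the small weights all round to zero; with per-block scales they keep their own resolution, at the cost of storing one extra scale per block.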
GPTQ and EXL2 are two quantization formats dedicated to GPUs, both based on the GPTQ algorithm. The algorithm optimizes weight quantization for LLMs by refining the Optimal Brain Quantization (OBQ) approach to handle very large matrices efficiently.
- GPTQ is limited to 4-bit precision
- EXL2 is an advanced mixed-precision quantization method that can mix different quantization levels - it applies multiple quantization levels to each linear layer, prioritizing more important weights with higher bit quantization
  - The algorithm analyzes the trained model to determine which weights are more important for performance. Weights are grouped by importance, measured either by magnitude or by significance estimated from the Hessian matrix (2nd derivative of the loss function w.r.t. the weights). Each group is assigned a different number of bits (Critical: 6-8 bits, Less critical: 2-3 bits)
For mixed-precision quantization, we may have fractional bitrates (e.g. 2.3 bits). They describe an average bit usage per weight, not an exact/literal value for each parameter. Some weights might use fewer bits (e.g. 2 bit), while others use more (e.g. 3 or 4-bit)
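As a small worked example of how a fractional bitrate arises, here is the weighted average for a hypothetical split of weights. The group sizes and the 7B parameter count are made-up numbers, and the size estimate ignores scales and metadata.

```python
def average_bits(groups):
    """groups: list of (number_of_weights, bits_per_weight) pairs."""
    total_weights = sum(count for count, _ in groups)
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / total_weights


# Hypothetical split: 70% of weights at 2 bits, 30% at 3 bits.
groups = [(700, 2), (300, 3)]
avg = average_bits(groups)

# Rough weight-only size for an assumed 7B-parameter model at this bitrate.
params = 7_000_000_000
size_gb = params * avg / 8 / 1e9

print(avg, round(size_gb, 2))
```

No individual weight is stored with 2.3 bits; the figure only summarizes the mix of 2-bit and 3-bit groups.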
Other quantization techniques include Activation-aware Weight Quantization (AWQ). AWQ determines the most important weights based on their activation magnitude instead of weight magnitude. Weights which cause strong activations during inference are considered important
Recent quantization algorithms like QuIP# (Quantization with Incoherence Processing) and HQQ (Half-Quadratic Quantization) offer quantization of models into 1- or 2-bit precision

Chapter 10: Inference Pipeline Deployment

When deploying ML models, we need to understand the four requirements present in every ML application: throughput, latency, data, and infrastructure. There is always a trade-off between the four that will directly impact the user’s experience (e.g. low latency vs high throughput)

Throughput: Number of requests the system processes and serves per unit of time, typically measured in requests per second.
- High throughput requires highly scalable & robust infra
Latency: Time it takes for a system to process an inference request & return the result
- Critical in real-time applications where quick response times are essential, e.g. in live user interactions
- Latency is the sum of network I/O, serialization & deserialization, and the LLM’s inference time.
Data: Characteristics of data, e.g. size & type, determine how the system needs to be configured and optimized for efficient processing
Infrastructure: Underlying hardware, software, networking, system architecture for deployment & operation of ML models
- High throughput: system requires scalable infra to manage large data volumes and high request rates - via parallel computing, distributed systems, high-end GPUs

Throughput & Latency relationship

We mostly care about optimizing throughput for offline training, while we generally care about latency for online inference.
Lower latency translates to higher throughput when the service successfully processes multiple queries in parallel
Most ML applications have a batching strategy. Here, a higher latency can translate into higher throughput. When we increase batch size, we have higher latency, but this can increase our throughput
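A toy cost model can illustrate the batching trade-off. The linear latency formula and the 50 ms / 10 ms constants are assumptions chosen for illustration only; real systems need to be measured.

```python
def batch_latency_ms(batch_size, overhead_ms=50.0, per_item_ms=10.0):
    """Assumed model: fixed overhead per batch plus a cost per item."""
    return overhead_ms + per_item_ms * batch_size


def throughput_rps(batch_size, overhead_ms=50.0, per_item_ms=10.0):
    """Requests completed per second when batches run back to back."""
    latency_s = batch_latency_ms(batch_size, overhead_ms, per_item_ms) / 1000.0
    return batch_size / latency_s


for batch_size in (1, 8, 32):
    print(batch_size, batch_latency_ms(batch_size), round(throughput_rps(batch_size), 1))
```

Under this model every request in a larger batch waits longer, but the fixed overhead is shared across more requests, so requests per second go up.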

3 inference deployment types - trade-off between latency, throughput, and costs

Online real-time inference: ML service immediately processes the request and returns the result in the same response.
- Synchronous interaction: client waits for the result before moving on
- Can implement a REST API or gRPC server
- REST API is more accessible but slower - can be used when serving models to the broader public
- gRPC is faster but has reduced flexibility as we need to implement protobuf schemas in our client application. As protobuf objects can be compiled into bytes, network transfers are much faster. Often adopted for internal services within the same ML system.
- Load balancing is crucial to distribute incoming traffic evenly, while autoscaling ensures the system can handle varying loads.
- Can be challenging to scale and may lead to underutilized resources during low-traffic periods.
Asynchronous inference: Client sends a request to the ML service, which is placed in a queue for processing
- The client doesn’t wait for an immediate response. Instead, the ML service processes the request asynchronously.
- Multiple techniques to send result back to client - put in a different queue or an object storage dedicated to storing results
- Client can either adopt a polling mechanism (check on schedule if there are new results) or a push strategy (notification system to inform client when result is ready; Webhooks/SSE/WebSocket/pubsub)
  - In push, the server initiates the notification when results are ready; in polling (pull), the client initiates the check
- Can define a max no. of machines that run in parallel to process messages
- Can handle spikes in requests without timeouts
- Trade-off: If a job takes significant time to complete, the client isn't blocked by waiting. However, this means higher end-to-end latency, making it less suitable for time-sensitive apps
Offline batch transform: Processing large volumes of data simultaneously, either via a schedule or manual trigger
- ML service pulls data from a storage system, processes it in a single operation, and writes the results back to storage, e.g. an object store like S3 or a data warehouse like BigQuery
- Batch transform design is optimized for high throughput with permissive latency requirements. Significantly reduces costs when real-time predictions are unnecessary
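Of the three deployment types, the asynchronous one has the most moving parts, so here is a minimal in-process sketch of it using a request queue, a worker, and client-side polling. Everything here is a stand-in: `fake_model`, the in-memory `results` dict, and the function names are hypothetical, and a real setup would use a message broker and a results store instead.

```python
import queue
import threading
import time
import uuid

request_queue = queue.Queue()   # stands in for a message queue
results = {}                    # stands in for a results store
results_lock = threading.Lock()


def fake_model(prompt):
    """Placeholder for the actual LLM call."""
    return prompt.upper()


def worker():
    """Consumes jobs from the queue until it receives the None sentinel."""
    while True:
        job = request_queue.get()
        if job is None:
            break
        job_id, prompt = job
        output = fake_model(prompt)
        with results_lock:
            results[job_id] = output


def submit(prompt):
    """Client side: enqueue the request and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    request_queue.put((job_id, prompt))
    return job_id


def poll(job_id, interval=0.01, timeout=2.0):
    """Client side: check periodically until the result is available."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with results_lock:
            if job_id in results:
                return results.pop(job_id)
        time.sleep(interval)
    raise TimeoutError(job_id)


def run_demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    job_id = submit("hello")      # returns before the job is processed
    output = poll(job_id)         # client decides when to check
    request_queue.put(None)       # stop the worker
    thread.join()
    return output


print(run_demo())
```

A push strategy would replace `poll` with a callback or notification sent by the worker once the result is stored.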

Architecture in Model Serving

The ML service itself can be implemented as a monolithic server or as multiple microservices
Monolithic: LLM and business logic are bundled into a single service
- Simple & easy to maintain
- Difficult to scale components independently. Infra must be optimized for both GPU (for LLM) and CPU (for business logic). Can lead to inefficient resource use, with GPU being idle when business logic is executed & vice versa
- Limits flexibility as all components must share the same tech stack and runtime environment
Microservices: Breaks down inference pipeline into separate independent services, splitting LLM service & business logic into distinct components
- The main advantage is the ability to scale each component independently. If LLM service needs more GPU resources, it can be scaled horizontally without impacting the other components (i.e. business logic)
- Optimize resource usage & reduce cost, as different types of machines can be used according to each service’s needs
- Each microservice can adopt the most suitable tech stack for its service
- Microservices introduce complexity in deployment & maintenance, since each service must be deployed, monitored & maintained separately
- Increased network communication between services can introduce latency & potential points of failure

A common strategy is to start with a monolithic design and further decouple it into multiple services as the project grows. When starting out, we can completely decouple the modules of the application at the software level, so it’s easier to move the modules to different microservices when the time comes. Design your software with modularity in mind.
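One way to read "decouple at the software level" is to have the business logic depend on an interface rather than on the model directly. The sketch below is an assumption about how that could look; the class names are hypothetical and not from the book.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Interface the business logic depends on."""

    def generate(self, prompt: str) -> str: ...


class InProcessLLM:
    """Monolith phase: the model lives in the same process (stubbed here)."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class BusinessLogic:
    """Knows nothing about where the LLM runs, only about the interface."""

    def __init__(self, llm: LLMClient) -> None:
        self.llm = llm

    def answer(self, question: str) -> str:
        prompt = f"Answer concisely: {question.strip()}"
        return self.llm.generate(prompt)


# Later, a class that calls a separate LLM microservice over the network
# could implement the same `generate` method and be passed in here instead.
service = BusinessLogic(InProcessLLM())
print(service.answer("  What is RAG?  "))
```

Moving the LLM into its own service then only requires a new `LLMClient` implementation, while `BusinessLogic` stays unchanged.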

HuggingFace’s specialized inference container - HuggingFace LLM DLC - can be used to deploy our LLM. DLC is powered by HuggingFace’s Text Generation Inference (TGI) engine

Chapter 11: MLOps and LLMOps

The end goal of MLOps is to automate as much as possible - data collection, training, testing, deployment.

DevOps automates the process of shipping software at scale

Deployment Environments: Dev, staging & production environment to test code before shipping to prod
Version control: Track, manage & version all changes made
Continuous Integration: Automatically build application and run tests on each change
Continuous Delivery: Works with CI and automates the infra provisioning and application deployment steps

MLOps involves

Model registry: A centralized repository for storing trained ML models (tools: Comet ML, W&B, MLflow, ZenML)
Feature store: Preprocessing and storing input data as features for both model training and inference pipelines (tools: Hopsworks, Tecton, Featureform)
ML metadata store: This store tracks information related to model training, such as model configurations, training data, testing data, and performance metrics. It is mainly used to compare multiple models and look at the model lineages to understand how they were created (tools: Comet ML, W&B, MLflow)
ML pipeline orchestrator: Automating the sequence of steps in ML projects (tools: ZenML, Airflow, Prefect, Dagster)

LLMOps can improve prompt engineering, fine-tuning, RAG

Human Feedback: Align LLM with your audience’s preferences. Introduce a feedback loop within your application and gather a human feedback dataset to further fine-tune the LLM with techniques such as RLHF. E.g. thumbs-up/thumbs-down button present in most chatbot interfaces.
Guardrails: Protect ML systems against harmful, sensitive, or invalid input and output by adding guardrails
Prompt Monitoring: Trace each step from the user’s input until the generated answer. If something fails or behaves unexpectedly, you can point exactly to the faulty step.
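A minimal sketch of step-level tracing, to show how a trace can point at the faulty step. The `Trace` helper and the three pipeline functions are hypothetical stand-ins and do not represent any specific monitoring tool.

```python
import time


class Trace:
    """Records one entry per pipeline step (illustrative helper)."""

    def __init__(self):
        self.steps = []

    def run(self, name, fn, value):
        start = time.perf_counter()
        try:
            result = fn(value)
        except Exception as exc:
            self.steps.append({"step": name, "ok": False, "error": repr(exc),
                               "seconds": time.perf_counter() - start})
            raise
        self.steps.append({"step": name, "ok": True, "input": value,
                           "output": result,
                           "seconds": time.perf_counter() - start})
        return result


def retrieve(query):
    return f"{query} | context: <retrieved docs>"


def build_prompt(text):
    return f"Answer using the context. {text}"


def generate(prompt):
    # Simulated failure so the trace has something to point at.
    raise RuntimeError("model timeout")


def answer(query, trace):
    value = query
    for name, fn in [("retrieve", retrieve),
                     ("build_prompt", build_prompt),
                     ("generate", generate)]:
        value = trace.run(name, fn, value)
    return value


trace = Trace()
try:
    answer("What is RAG?", trace)
except RuntimeError:
    pass

faulty = [step["step"] for step in trace.steps if not step["ok"]]
print(faulty)
```

The trace keeps the inputs and outputs of the steps that succeeded, so the prompt that reached the failing step can be inspected afterwards.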