Optimizing Retrieval Augmented Generation

A RAG system is composed of 3 main modules: Ingestion (populate the vector DB), Retrieval (fetch the most relevant entries), and Generation (use the retrieved data to augment the prompt).

Pre-retrieval optimizations: Better data indexing, query optimization

  • Data indexing techniques:
    • Sliding window chunking to preserve context near chunk borders
    • Ensure data granularity by removing irrelevant, outdated, or inaccurate content
    • Add metadata tags (e.g. dates, external IDs) to filter results effectively during retrieval
    • Small-to-Big retrieval: compute the embedding from a small piece of text to introduce less noise, while still passing the full, longer text to the LLM for the final answer
  • Query optimization:
    • Query routing to decide what action to take based on user input
    • Query rewriting by paraphrasing or replacing less common words
    • Breaking down long queries into multiple shorter, focused sub-queries
    • Query expansion by adding related terms/concepts
    • Hypothetical document embeddings (HyDE): generate a hypothetical answer and use its embedding to retrieve
    • Self-query: use an LLM to extract key entities/relationships from the query and use them as filtering parameters to reduce the search space
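The sliding-window chunking idea above can be sketched in a few lines; the `chunk_size` and `overlap` values are illustrative, not prescribed anywhere in these notes:

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so context near chunk borders is preserved."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts `step` chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # last window already covers the tail
            break
    return chunks
```

Because consecutive chunks share `overlap` characters, a sentence cut at one chunk's border is still intact in the next chunk.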

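As a toy illustration of query routing, a rule-based router could dispatch queries to different retrieval actions; in practice an LLM classifier usually makes this decision, and the keywords and route names below are invented for the example:

```python
def route_query(query: str) -> str:
    """Decide what action to take based on the user input.

    Toy rule-based router: real systems typically ask an LLM to classify
    the query; the keyword triggers and route names here are illustrative.
    """
    q = query.lower()
    if any(word in q for word in ("compare", "versus", " vs ")):
        return "decompose_into_subqueries"  # comparative question -> split into sub-queries
    if any(word in q for word in ("when", "date", "year")):
        return "filtered_vector_search"     # temporal question -> use metadata filters
    return "plain_vector_search"            # default retrieval path
```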
Retrieval optimizations: Improving embedding model, use DB’s filter/search features

  • Improve the embedding model by: fine-tuning it to capture domain-specific jargon and nuance, or using instructor models for task-aware embeddings
  • Filter/search features: Hybrid search combines vector (similarity) and keyword-based (BM25) search into one blend; an alpha parameter controls the weight between the two methods. Filtered vector search uses metadata constraints to limit the search space before the search (pre-filtering), or filters the search results by metadata afterwards (post-filtering)
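A minimal sketch of the alpha blend in hybrid search, assuming vector and keyword (BM25-style) scores have already been computed per document; scores are min-max normalized first so the two scales are comparable:

```python
def normalize(scores: list[float]) -> list[float]:
    """Min-max normalize so vector and keyword scores live on the same 0..1 scale."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores: list[float], keyword_scores: list[float],
                  alpha: float = 0.5) -> list[float]:
    """Blend the two methods: alpha=1 -> pure vector search, alpha=0 -> pure keyword search."""
    v, k = normalize(vector_scores), normalize(keyword_scores)
    return [alpha * vs + (1 - alpha) * ks for vs, ks in zip(v, k)]
```

Vector DBs that support hybrid search (e.g. Weaviate) expose an alpha parameter with this meaning, though their internal score fusion may differ from this simple normalized blend.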

Post-Retrieval optimization:

  • Result compression to strip unnecessary details from retrieved chunks, and re-ranking models (cross-encoders) that re-score query–document pairs to surface the most relevant documents
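Re-ranking boils down to scoring each (query, document) pair and sorting. The token-overlap scorer below is only a stand-in for a real cross-encoder (e.g. a sentence-transformers CrossEncoder model), which would encode the pair jointly:

```python
def overlap_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query tokens appearing in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    """Re-score the retrieved docs against the query and keep the most relevant ones."""
    ranked = sorted(docs, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:top_k]
```

The pattern stays the same when swapping in a real model: replace `overlap_score` with the cross-encoder's predicted relevance score.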

RAG Design Choices

  • Batch pipelines: Useful when dealing with large volumes of data that don’t need immediate processing. Data is collected, processed on a schedule, and then saved for future use
    • Advantages: Optimized resource allocation and parallel processing, so large volumes of data are handled more efficiently; simpler than a streaming pipeline
    • Disadvantages: Features are never fresh between runs; may recompute redundant predictions on unchanged data
  • Streaming pipelines: Core elements are a distributed event streaming platform (e.g. Kafka) to store events from multiple clients and a streaming engine (e.g. Apache Flink) to process them. Used in cases where change is unpredictable and frequent (e.g. social media recommendations).
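As a self-contained stand-in for the Kafka + Flink setup described above, the toy pipeline below consumes an in-memory event stream one event at a time and updates state incrementally instead of waiting for a batch; the event shape is invented for the example:

```python
from collections import Counter
from typing import Iterable, Iterator

def stream_counts(events: Iterable[dict]) -> Iterator[dict]:
    """Process events as they arrive, yielding updated per-user counts after each one.

    In a real streaming pipeline, `events` would come from a Kafka topic and
    this loop would run inside an engine like Apache Flink; here it is a plain
    Python iterable so the sketch stays runnable on its own.
    """
    counts: Counter = Counter()
    for event in events:
        counts[event["user"]] += 1  # incremental state update per event
        yield dict(counts)          # downstream consumers always see fresh state
```

This is the property batch pipelines lack: after every event, downstream features are immediately up to date.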