Optimizing Retrieval Augmented Generation
A RAG system is composed of 3 main modules: Ingestion (populate the vector DB), Retrieval (fetch the most relevant entries for a query), and Generation (use the retrieved data to augment the LLM's prompt).
Pre-retrieval optimizations: Better data indexing, query optimization
- Data indexing techniques:
  - Sliding window (overlapping chunks) to preserve context near chunk borders
  - Ensure data granularity by removing irrelevant, outdated, or inaccurate content
  - Add metadata tags (e.g. dates, external IDs) to filter results effectively during retrieval
  - Small-to-big retrieval: compute the embedding from a small piece of text to introduce less noise, while still preserving the full, longer text for the LLM's final answer
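A minimal sketch of two of the indexing techniques above: sliding-window chunking with overlap, and small-to-big entries that index a short snippet while keeping the full chunk for generation. Tokenization, window sizes, and the metadata field are illustrative assumptions.

```python
# Sliding-window chunking: overlapping word windows preserve context at chunk borders.
def chunk_sliding_window(text, window=100, overlap=20):
    words = text.split()  # naive whitespace tokenization for illustration
    step = window - overlap
    return [" ".join(words[i:i + window])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Small-to-big: embed a small snippet (less noise), keep the full chunk for the LLM.
def small_to_big(chunks, small_size=30):
    return [
        {
            "embed_text": " ".join(chunk.split()[:small_size]),  # small text -> cleaner embedding
            "full_text": chunk,                                  # full text -> richer context at generation
            "metadata": {"chunk_id": i},                         # tag for filtering at retrieval time
        }
        for i, chunk in enumerate(chunks)
    ]
```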
- Query optimization:
  - Query routing to decide what action to take based on the user input
  - Query rewriting: paraphrase, replace less common words, or break longer queries into multiple shorter, focused sub-queries
  - Query expansion by adding related terms/concepts
  - Hypothetical document embeddings (HyDE): generate a hypothetical answer and use its embedding for retrieval
  - Self-query: use an LLM to extract key entities/relationships from the query and use them as filtering params to reduce the search space
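A toy sketch of HyDE: instead of embedding the raw query, embed an LLM-drafted hypothetical answer, which tends to sit closer to real answer documents in embedding space. The `embed` (bag-of-words) function here is a placeholder for a real embedding model, and the answer generator is passed in as a function so a real LLM call can be substituted.

```python
import math

def embed(text):
    """Toy bag-of-words embedding; a real system would call an embedding model."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query, documents, generate_answer):
    """HyDE: retrieve by similarity to a hypothetical answer, not the raw query."""
    hypothetical = generate_answer(query)  # LLM drafts a plausible (possibly wrong) answer
    q_vec = embed(hypothetical)
    return max(documents, key=lambda d: cosine(q_vec, embed(d)))
```

In practice `generate_answer` would wrap a chat-completion call; even a partially wrong hypothetical answer usually shares vocabulary and structure with the correct documents.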
Retrieval optimizations: Improving embedding model, use DB’s filter/search features
- Improve the embedding model by: fine-tuning it to capture domain-specific jargon and nuance; using instructor models for task-aware embeddings
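A sketch of the task-aware idea behind instructor-style models, which condition each embedding on a natural-language task instruction. The instruction strings and `embed_fn` below are illustrative placeholders, not a specific model's API.

```python
# Asymmetric instructions let the model map questions and supporting documents
# closer together than a single generic embedding would.
DOC_INSTRUCTION = "Represent the document for retrieval:"
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents:"

def embed_documents(docs, embed_fn):
    """Embed documents conditioned on a document-side task instruction."""
    return [embed_fn(f"{DOC_INSTRUCTION} {d}") for d in docs]

def embed_query(query, embed_fn):
    """Embed the query conditioned on a query-side task instruction."""
    return embed_fn(f"{QUERY_INSTRUCTION} {query}")
```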
- Filter/Search features:
  - Hybrid search: blend vector (similarity) and keyword-based (BM25 relevance) search; an alpha param controls the weight between the two methods
  - Filtered vector search: use metadata constraints to limit the search space before the search (pre-filtering), or filter the search results by metadata (post-filtering)
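A minimal sketch of the alpha blend in hybrid search, assuming per-document vector and BM25 scores are already computed: normalize each score list, then combine with `alpha` (alpha=1 → pure vector, alpha=0 → pure keyword).

```python
def min_max_normalize(scores):
    """Scale scores to [0, 1] so vector and BM25 scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(vector_scores, bm25_scores, alpha=0.5):
    """Return document indices ranked by the alpha-weighted blended score."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(bm25_scores)
    blended = [alpha * vs + (1 - alpha) * ks for vs, ks in zip(v, k)]
    return sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
```

Vector DBs with built-in hybrid search apply the same idea internally; the normalization step matters because raw BM25 scores and cosine similarities live on different scales.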
Post-retrieval optimizations:
- Result compression: strip unnecessary details from retrieved entries before they reach the prompt
- Re-ranking: use a re-ranking model (e.g. a cross-encoder) to score query-document pairs and keep only the most relevant documents
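A sketch of the re-ranking step: re-score each retrieved candidate jointly with the query and keep the top few. A real system would use a cross-encoder model for `score_fn`; the term-overlap scorer here is a toy stand-in so the example runs anywhere.

```python
def overlap_score(query, doc):
    """Toy relevance score: fraction of query terms present in the document.
    Placeholder for a cross-encoder that scores the (query, doc) pair jointly."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def rerank(query, docs, top_k=3, score_fn=overlap_score):
    """Re-score retrieved candidates against the query and keep the best top_k."""
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)[:top_k]
```

Cross-encoders are too slow to score the whole corpus, so they are applied only to the small candidate set the retriever returns.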
RAG Design Choices
- Batch pipelines: useful when dealing with large volumes of data that don’t need immediate processing. Data is collected, processed on a schedule, and then saved for future use
- Advantages: optimized resource allocation and parallel processing, since large volumes of data can be handled more efficiently; simpler than a streaming pipeline
- Disadvantages: poor feature freshness; may make redundant predictions on data that hasn’t changed between runs
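The batch flow above (collect → process on a schedule → save) can be sketched as follows; the transform and the storage list are illustrative placeholders for an embedding step and a vector-DB write.

```python
def run_batch_pipeline(collected_records, store, batch_size=2):
    """Process all accumulated records at once, in fixed-size batches, then persist."""
    for i in range(0, len(collected_records), batch_size):
        batch = collected_records[i:i + batch_size]
        processed = [r.strip().lower() for r in batch]  # placeholder for cleaning/embedding
        store.extend(processed)                         # stand-in for writing to a vector DB
    return store
```

A scheduler (e.g. a nightly cron job) would invoke this over everything collected since the last run, which is where the feature-freshness lag comes from.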
- Streaming pipelines: core elements are a distributed event streaming platform (e.g. Kafka) to store events from multiple clients and a streaming engine (e.g. Apache Flink) to process them. Used in cases where change is unpredictable & frequent (e.g. social media recommendation).
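A toy sketch of the streaming shape: an in-memory queue stands in for an event platform like Kafka, and the consumer loop stands in for a streaming engine like Flink; each event is processed as it arrives rather than in a scheduled batch.

```python
from queue import Queue

def stream_process(events, handler):
    """Consume events one by one and apply the processing step immediately."""
    topic = Queue()
    for e in events:          # producers (clients) publish events to the topic
        topic.put(e)
    results = []
    while not topic.empty():  # the streaming engine consumes continuously
        event = topic.get()
        results.append(handler(event))  # e.g. update embeddings/features in near-real time
    return results
```

In a real deployment the producer and consumer run as separate long-lived processes; this keeps features fresh at the cost of more operational complexity than the batch version.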