Retrieval Augmented Generation based Co-Pilots

Manu Suryavansh
6 min read · Sep 18, 2023


RAG-based Assistant (system diagram)

What is Retrieval Augmented Generation (RAG)?

Retrieval augmented generation combines an information retrieval system with a generative model. The retrieval system provides relevant context to the generative model so that it can produce better answers. In this Sep 2020 blog post from Meta, the authors combine the Dense Passage Retrieval model with the BART model (a Seq2Seq/Transformer model) to build a RAG system.

Why Retrieval Augmented Generation?

One of the biggest limitations of current LLMs like OpenAI ChatGPT, Anthropic Claude, etc. is that they can only answer questions based on the data they were trained on, so questions about recent events lead to hallucinations and wrong answers. RAG solves this problem by providing relevant context to the LLM. For example, if we ask ChatGPT-3.5 about the popular LLM library LangChain, it says that it only has information up to September 2021 and that there is no such library:

I’m sorry, but as of my last knowledge update in September 2021, there is no widely known or established concept or technology called “langchain.” It’s possible that it may have emerged after my last update or is a very specialized term used in a specific context that is not widely documented.

If you could provide more context or details about what you mean by “langchain” or its usage, I would be happy to try to help you further.

RAG can solve the above problem by providing the relevant context. For example, we can do a Google search and then ask ChatGPT to summarize the results. Perplexity.ai does something similar using Bing Search and ChatGPT.

Simple RAG in 5 lines

Below is the code from the LlamaIndex examples to build a simple RAG pipeline (source) without needing OpenAI API access. In the latest release (0.8.31), the code below defaults to the Llama-2-13b chat model via the llama-cpp-python library.

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load the documents from the local 'data' folder and build an in-memory vector index
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

# Query the index (the query string here is just a placeholder)
query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about <topic>?")
print(response)

Below I cover the different components needed to build a RAG-based copilot, as shown in the first system diagram.

Data Processing Pipeline

Enterprise DB (ER DB)

This is the existing database (e.g. Salesforce, MySQL, S3) that contains all the documents.

Pre-Processing and Data Extraction

We need to extract the useful information from each document. This step might also involve PII removal (if necessary), chunking, etc. Tools for extracting information from PDFs and images —

Tesseract — Open-source OCR engine that can extract text from various image formats (PNG, JPEG, etc.); a minimal usage sketch follows this list

Kraken

PaddleOCR

We can also use SOTA models like Donut and Nougat for extracting information from PDFs.
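As a rough illustration of the Tesseract route, here is a minimal sketch using the pytesseract wrapper and Pillow; the directory name and file filter are placeholders, not part of any real pipeline.

from pathlib import Path
from PIL import Image
import pytesseract

def ocr_directory(image_dir: str) -> dict[str, str]:
    # Run OCR on every PNG/JPEG page image and return {file name: extracted text}
    texts = {}
    for path in Path(image_dir).glob("*"):
        if path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
            texts[path.name] = pytesseract.image_to_string(Image.open(path))
    return texts

pages = ocr_directory("data/scanned_pages")  # hypothetical folder of scanned documents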

Embedding Model/Service

This model converts input text into an embedding. We can use OpenAI embedding models like Ada or open-source embeddings. The input to the model is text and the output is an N-dimensional embedding (e.g. N = 768, 1024, etc.). These models typically also have a maximum sentence/token length (e.g. 1024, 8K, etc.), so very long text has to be broken into chunks. We can also use the SentenceTransformers library, which offers a large number of embedding models of different sizes and performance levels.
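For example, a minimal sketch with the SentenceTransformers library (the model name is one common choice, not a specific recommendation):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings
chunks = ["How do I reset my password?", "Steps to configure SSO for a new user."]
embeddings = model.encode(chunks)  # numpy array of shape (2, 384)
print(embeddings.shape)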

Vector DB

The vector DB stores all the embeddings and the corresponding text and builds an index. Given a query embedding, it can find the top-N most similar embeddings using ANN (approximate nearest neighbor) search and then return the top-K most similar documents.

As new information is added, the vector DB will have to be updated as well, so this pipeline will have to be run at a fixed interval, for example every night. Below are some open-source and commercial vector databases (a minimal indexing sketch follows the list) —

Milvus — Open source (Zilliz — Enterprise DB powered by Milvus)

Weaviate — Open Source

Chroma — Open Source

Pinecone

Qdrant — Open Source
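Here is a minimal indexing and retrieval sketch using Chroma; the collection name, documents and metadata are illustrative only, and a production setup would use a persistent or hosted deployment.

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "How to reset a password ...",
    "SSO configuration guide ...",
    "VPN troubleshooting steps ...",
]

client = chromadb.Client()  # in-memory client; use a persistent/hosted client in production
collection = client.create_collection("enterprise_docs")
collection.add(
    ids=["doc-1", "doc-2", "doc-3"],
    documents=docs,
    embeddings=model.encode(docs).tolist(),
    metadatas=[{"source": "kb"}] * 3,
)

# Top-3 most similar documents for a query (ANN search under the hood)
results = collection.query(
    query_embeddings=model.encode(["I forgot my password"]).tolist(),
    n_results=3,
)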

Inference Process

Query Embedding and Cache

Once the user enters a new query, the system first checks whether the query already exists in the cache (using an exact query match after lower-casing). If the query is already in the cache, we can save time and cost by fetching the embedding from the cache; if not, we call the embedding service.
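A minimal sketch of this exact-match cache, where embed_fn stands in for whatever embedding service is used:

query_cache: dict[str, list[float]] = {}

def get_query_embedding(query: str, embed_fn) -> list[float]:
    key = query.strip().lower()      # exact match after lower-casing
    if key in query_cache:           # cache hit: skip the embedding call
        return query_cache[key]
    embedding = embed_fn(key)        # cache miss: call the embedding service
    query_cache[key] = embedding
    return embedding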

ANN Search

Here we use ANN search to find the top-k (e.g. k = 3) documents in the vector DB whose embeddings are most similar (using a metric like cosine similarity) to the input query embedding.
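To make the similarity metric concrete, here is a brute-force cosine-similarity search in NumPy; ANN indexes approximate this at scale, and the arrays are placeholders for real embeddings.

import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar documents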

Contextual Features

These are features like time, location, and user attributes such as experience and role, if available.

Contextual Prompt

We combine the contextual features and the top-3 most similar records to create a contextual system prompt for the LLM. Below is an example of the system prompt —

f"You are an assistant bot and your job is to help your users with any issues. You will be given inputs like time of day and context regarding the query. Based on only this information, respond with a concise answer to the user. If you are unsure, ask follow-up questions to get more information.

{contextual_features}, {context}"
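A sketch of assembling this prompt and calling a chat model (openai>=1.0 style; the model name is just an example, not a requirement of the design):

from openai import OpenAI

client = OpenAI()

def answer(query: str, contextual_features: str, context: str) -> str:
    # Build the contextual system prompt from the retrieved records and features
    system_prompt = (
        "You are an assistant bot and your job is to help your users with any issues. "
        "You will be given inputs like time of day and context regarding the query. "
        "Based on only this information, respond with a concise answer to the user. "
        "If you are unsure, ask follow-up questions to get more information.\n"
        f"{contextual_features}\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content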

Final Output

The final output consists of the LLM response along with links to the top-3 records, so that the user can read them in detail if needed and also contact the person who created those records.

Inference Pipeline Scalability

Enterprise vector databases like Zilliz and Weaviate can be scaled horizontally to efficiently search millions of records. The internal chat system also needs to handle varying levels of user traffic, ensuring scalability during peak usage.

Graceful Fallback

As we may be using external services (e.g. OpenAI/Azure APIs), we need fallback options in case there is an issue with those APIs. We will need to implement error-handling mechanisms to gracefully handle failures and unexpected scenarios.
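A sketch of one possible fallback path; call_primary_llm and call_local_llm are hypothetical helpers standing in for the external API and a self-hosted backup model.

import logging

def generate_answer(prompt: str) -> str:
    try:
        return call_primary_llm(prompt)        # hypothetical helper: e.g. OpenAI/Azure endpoint
    except Exception as err:                   # timeouts, rate limits, outages
        logging.warning("Primary LLM failed, using fallback: %s", err)
        try:
            return call_local_llm(prompt)      # hypothetical helper: e.g. a self-hosted Llama-2 model
        except Exception:
            return ("Sorry, the assistant is temporarily unavailable. "
                    "Please try again in a few minutes.")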

Training Pipeline

Feedback loop

After every query, the user will have the option to indicate whether the response was correct using a thumbs up/down vote, and this feedback will be used to create an instruction dataset.
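One simple way to collect this is to append each vote to a JSONL file that later becomes the instruction dataset; the file path and field names here are illustrative.

import json
import time

def log_feedback(query: str, response: str, thumbs_up: bool,
                 path: str = "feedback.jsonl") -> None:
    # Append one feedback record per query/response pair
    record = {
        "timestamp": time.time(),
        "prompt": query,
        "completion": response,
        "label": "good" if thumbs_up else "bad",
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")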

Fine Tuning

The LLM will be fine-tuned on the instruction dataset using RLHF (reinforcement learning from human feedback) or a vendor-provided fine-tuning API (e.g. OpenAI's).

Metrics and Continuous Monitoring

System Metrics

Total Response Time — Measure the time it takes for the copilot to respond to user queries.

Response accuracy — Percentage of responses up-voted by the users

Vector search (ANN) time — time to return the top-k results

Business Metrics

Average session time (AST) — The average length of a chat session. Very long sessions can indicate that the copilot is not able to answer the questions.

Average number of queries needed to resolve an issue — Lower indicates the copilot is good at answering users' questions.

User churn rate — Measure how many users stop using the copilot over time. A high churn rate might indicate dissatisfaction or inadequate performance.

User Satisfaction (CSAT) — Gather user feedback through surveys to gauge user satisfaction

Feedback Loop Engagement — Track the percentage of users who provide feedback on the copilot’s recommendations

Alerts

Trigger alerts any time a system or business metric changes significantly (e.g. more than a 3-sigma deviation). The alerts will be sent to the engineering team so that they can investigate the issue.
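A minimal sketch of the 3-sigma check against a window of recent metric values; send_alert is a hypothetical notification hook (e.g. paging or chat integration).

import statistics

def check_metric(history: list[float], latest: float, n_sigma: float = 3.0) -> None:
    # history holds recent values of a system or business metric
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    if sigma > 0 and abs(latest - mean) > n_sigma * sigma:
        # send_alert is a hypothetical hook into the team's alerting system
        send_alert(f"Metric deviated {abs(latest - mean) / sigma:.1f} sigma from mean {mean:.2f}")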

Dashboard

We will need a dashboard where all the metrics can be monitored over time.

Data Quality

The quality of historical records matters, as inaccurate records can impact the responses of the copilot.

Privacy

As we might be sending the information in records to an external LLM, we need to take steps to protect any sensitive information (PII) present in historical records and also make sure we don't add any PII to the generated records.

Security

As the assistant will have access to the company's proprietary data, we need to implement proper access controls and keep a record of who uses the copilot.

Model and Input Drift

Some recent papers have shown that LLM performance can degrade over time, so we need proper metrics to detect drift. User queries can also change over time.

Re-Ranking

In many cases, similarity-based retrieval alone may work; however, we can further improve performance by re-ranking the initial set of retrieved documents and sending only the top-k ranked documents to the LLM.
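For example, a cross-encoder from the SentenceTransformers library can re-rank the retrieved documents; the model name is one common choice, not a specific recommendation.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Score each (query, document) pair and keep the k highest-scoring documents
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]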

