Metadata

Highlights

  • The two key takeaways arising from our work are: 1) validation of a RAG system is only feasible during operation, and 2) the robustness of a RAG system evolves rather than being designed in at the start.
  • In a RAG system, the retrieval system works using embeddings that provide a compressed semantic representation of the document. An embedding is expressed as a vector of numbers. During the Index process each document is split into smaller chunks that are converted into embeddings using an embedding model. The original chunk and its embedding are then indexed in a database. Software engineers face design decisions around how best to chunk the document and how large a chunk should be. If chunks are too small, certain questions cannot be answered; if chunks are too long, the answers include generated noise. (See the indexing sketch below.)
  • The Query process takes place at run time. A question expressed in natural language is first converted into a general query. To generalise the query, a large language model is used, which enables additional context such as previous chat history to be included in the new query. An embedding is then calculated from the new query and used to locate relevant documents in the database. The top-k similar documents are retrieved using a similarity measure such as cosine similarity (vector databases have techniques such as inverted indexes to speed up retrieval time). The intuition is that chunks that are semantically close to the query are likely to contain the answer. (See the retrieval sketch below.)
  • Retrieved documents are then re-ranked to maximise the likelihood that the chunk with the answer is located near the top. The next stage is the Consolidator, which is responsible for processing the chunks. This stage is needed to overcome two limitations of large language models: 1) the token limit and 2) the rate limit. Services such as OpenAI place hard limits on the amount of text that can be included in a prompt. This restricts the number of chunks that can be included in a prompt to extract an answer, so a reduction strategy is needed to chain prompts to obtain an answer. These online services also restrict the number of tokens that can be used within a time frame, which constrains the latency of the system. Software engineers need to consider these tradeoffs when designing a RAG system. (See the consolidation sketch below.)
  • The final stage of a RAG pipeline is when the answer is extracted from the generated text. Readers are responsible for filtering the noise from the prompt, adhering to formatting instructions (e.g. answer the question as a list of options), and producing the output to return for the query. Implementation of a RAG system requires customising multiple prompts to process questions and answers. This process ensures that questions relevant to the domain are answered. (See the reader-prompt sketch below.)
  • Chunking documents sounds trivial. However, the quality of chunking affects the retrieval process in many ways, and in particular the embeddings of the chunks, which in turn affect the similarity matching of chunks to user queries. There are two ways of chunking: heuristics-based (using punctuation, end of paragraph, etc.) and semantic chunking (using the semantics in the text to inform the start and end of a chunk). Further research should explore the tradeoffs between these methods and their effects on critical downstream processes like embedding and similarity matching. (See the chunking sketch below.)
  • From the case studies we identified a set of failure points, presented below. The following addresses the research question: What are the failure points that occur when engineering a RAG system?
    FP1 Missing Content. The first failure case is when a question is asked that cannot be answered from the available documents. In the happy case the RAG system will respond with something like “Sorry, I don’t know”. However, for questions that are related to the content but have no answer, the system could be fooled into giving a response.
    FP2 Missed the Top Ranked Documents. The answer to the question is in the document but did not rank highly enough to be returned to the user. In theory, all documents are ranked and used in the next steps. In practice, however, only the top K documents are returned, where K is a value selected based on performance.
    FP3 Not in Context - Consolidation Strategy Limitations. Documents with the answer were retrieved from the database but did not make it into the context for generating an answer. This occurs when many documents are returned from the database and a consolidation process takes place to retrieve the answer.
    FP4 Not Extracted. Here the answer is present in the context, but the large language model failed to extract the correct answer. Typically, this occurs when there is too much noise or contradicting information in the context.
    FP5 Wrong Format. The question involved extracting information in a certain format, such as a table or list, and the large language model ignored the instruction.
    FP6 Incorrect Specificity. The answer is returned in the response but is not specific enough, or is too specific, to address the user’s need. This occurs when the RAG system designers have a desired outcome for a given question, such as teachers for students; in this case, specific educational content should be provided with answers, not just the answer. Incorrect specificity also occurs when users are not sure how to ask a question and are too general.
    FP7 Incomplete. Incomplete answers are not incorrect but miss some of the information, even though that information was in the context and available for extraction. An example is a question such as “What are the key points covered in documents A, B and C?”; a better approach is to ask these questions separately.
  • Software engineering best practices are still emerging for RAG systems. Software testing and test case generation are among the areas needing refinement. RAG systems require questions and answers that are application specific and often unavailable when indexing unstructured documents. Emerging work has considered using LLMs for generating questions from multiple documents [4]. How to generate realistic, domain-relevant questions and answers remains an open problem. (See the test-generation sketch below.)
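
Sketches

A minimal sketch of the Index process described in the highlights above: split each document into chunks, embed every chunk, and store the chunk together with its embedding. The `embed` function here is a hash-seeded stand-in that only makes the example runnable; a real system would call an embedding model, and the chunking strategy itself is explored in a later sketch. This is an illustration under those assumptions, not the paper's implementation.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a deterministic pseudo-random unit vector per text.
    NOT a real semantic embedding; swap in an embedding model in practice."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def chunk_document(doc: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size sliding-window chunking; chunk size is a key design decision."""
    step = max_chars - overlap
    return [doc[i:i + max_chars] for i in range(0, len(doc), step)]

def build_index(docs: list[str]) -> list[dict]:
    """The index stores the original chunk alongside its embedding."""
    index = []
    for doc in docs:
        for chunk in chunk_document(doc):
            index.append({"chunk": chunk, "embedding": embed(chunk)})
    return index
```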
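
A sketch of the Query process: the user question is rewritten into a standalone query so that chat history is folded in, the query is embedded, and the top-k most similar chunks are retrieved by cosine similarity. The `llm` and `embed` helpers are assumed stand-ins for a real language model call and embedding model, and a real vector database would use specialised indexes rather than sorting every record.

```python
import numpy as np

def generalise_query(question: str, chat_history: list[str], llm) -> str:
    """Use an LLM to fold prior conversation into a standalone search query."""
    history = "\n".join(chat_history)
    prompt = (
        "Rewrite the user's question as a standalone search query, "
        "resolving any references to the prior conversation.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Question: {question}\nStandalone query:"
    )
    return llm(prompt)

def retrieve_top_k(query_embedding: np.ndarray, index: list[dict], k: int = 5) -> list[dict]:
    """Rank every indexed chunk by cosine similarity and keep the top k."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index, key=lambda r: cosine(query_embedding, r["embedding"]), reverse=True)
    return ranked[:k]
```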
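
A sketch of a possible Consolidator: pack the highest-ranked chunks into prompt-sized batches under a token budget, and fall back to chaining prompts (answer per batch, then combine) when the retrieved text does not fit. The four-characters-per-token estimate and the `llm` helper are assumptions for illustration, not taken from the paper.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)

def consolidate(chunks: list[str], token_budget: int) -> list[list[str]]:
    """Group ranked chunks into batches that each fit within the prompt budget."""
    batches, current, used = [], [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += cost
    if current:
        batches.append(current)
    return batches

def answer_with_reduction(question: str, chunks: list[str], token_budget: int, llm) -> str:
    """Single prompt when everything fits; otherwise chain prompts and combine."""
    batches = consolidate(chunks, token_budget)
    partials = []
    for batch in batches:
        context = "\n\n".join(batch)
        partials.append(llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"))
    if len(partials) == 1:
        return partials[0]
    combined = "\n".join(partials)
    return llm(
        "Combine the partial answers below into a single answer.\n\n"
        f"Partial answers:\n{combined}\n\nQuestion: {question}\nAnswer:"
    )
```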
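
A sketch of a Reader prompt that filters noise, enforces the requested output format, and admits when the answer is missing. The exact wording is an assumption for illustration; in practice such prompts are customised per domain and iterated on.

```python
# Template for the Reader stage: answer only from the retrieved context and
# follow the requested output format.
READER_PROMPT = """You are answering a question using only the context below.
- If the answer is not in the context, reply exactly: "Sorry, I don't know."
- Follow the requested output format (e.g. a bulleted list) if one is given.

Context:
{context}

Question: {question}
Requested format: {output_format}
Answer:"""

def read_answer(question: str, context: str, output_format: str, llm) -> str:
    """Fill the template and hand it to the (assumed) LLM helper."""
    return llm(READER_PROMPT.format(
        context=context, question=question, output_format=output_format))
```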
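
Rough sketches of the two chunking styles mentioned above: heuristic chunking splits on surface markers (here, blank lines between paragraphs), while semantic chunking starts a new chunk when the next sentence drifts away from the current chunk in embedding space. The 0.6 similarity threshold is an arbitrary assumption, and `embed` can be any embedding function (for example, the placeholder from the indexing sketch).

```python
import re
import numpy as np

def heuristic_chunks(text: str) -> list[str]:
    """Heuristic chunking: split on paragraph boundaries (blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def semantic_chunks(text: str, embed, threshold: float = 0.6) -> list[str]:
    """Semantic chunking: start a new chunk when the next sentence is far from
    the running chunk in embedding space."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sent in sentences[1:]:
        a, b = embed(" ".join(current)), embed(sent)
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```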
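
A sketch of LLM-assisted test-case generation in the spirit of the highlight above: sample indexed chunks and ask a model for question/answer pairs grounded in each one, producing a seed evaluation set for the application domain. The prompt wording and the `llm` helper are assumptions; tools such as OpenAI Evals could consume the resulting pairs.

```python
import json
import random

QGEN_PROMPT = """From the passage below, write {n} question/answer pairs that a
real user of this domain might plausibly ask. Return a JSON list of objects,
each with "question" and "answer" keys.

Passage:
{passage}"""

def generate_eval_cases(index: list[dict], llm, per_chunk: int = 2, sample_size: int = 20) -> list[dict]:
    """Sample indexed chunks and ask the LLM for Q/A pairs grounded in each one."""
    cases = []
    for record in random.sample(index, k=min(sample_size, len(index))):
        raw = llm(QGEN_PROMPT.format(n=per_chunk, passage=record["chunk"]))
        try:
            pairs = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed generations rather than failing the whole run
        for pair in pairs:
            cases.append({**pair, "source_chunk": record["chunk"]})
    return cases
```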

author: arxiv.org
title: “Seven Failure Points When Engineering a Retrieval Augmented Generation System”
date: 2024-03-06
tags:

  • articles
  • literature-note
