Nov 25, 2024
(This whitepaper was published in the 2024 edition of ILTA KM & MT Whitepapers.)
Retrieval-augmented generation (RAG)-based question-answering systems hold great promise for automating information extraction from large document repositories, helping users access and derive insights from institutional knowledge. However, these systems face significant challenges when applied to the legal domain. Legal queries often require precise, exhaustive, and contextually nuanced answers, complicating retrieval and generation tasks. Challenges such as extracting exhaustive information, addressing multi-hop questions, aggregating information from various documents, and navigating overlapping content across documents necessitate tailored approaches for the legal sector.
Moreover, systems must account for document sub-setting and cross-references while balancing accuracy and efficiency. This white paper examines these challenges, highlights the limitations of current RAG systems in legal question answering, and proposes enhancements to better address the unique demands of the legal sector.
Introduction
Retrieval Augmented Generation (RAG)-based systems, which combine information retrieval with Large Language Model (LLM) generation, have shown great potential in addressing complex information extraction needs across various domains. RAG systems can be advantageous in the legal domain, where vast amounts of text, including contracts, policies, and other documents, are constantly processed. However, answering legal questions presents unique challenges that make the direct application of existing RAG frameworks less effective.
Legal queries often require not only the retrieval of specific facts but also a deep contextual understanding and aggregation of scattered information across document repositories. The intricacies of legal language, frequent cross-references within and across documents, and multiple versions or amendments to contracts further complicate the retrieval process.
Furthermore, the nature of legal repositories, which often contain thousands of highly similar documents with only minor changes in text, poses a significant challenge for traditional retrieval models. The issue of content overlap—where the source text is highly homogeneous and lacks specificity—can significantly impact retrieval accuracy, especially when the distinguishing factors between documents are as subtle as entity names, dates, metadata, or minor modifications in clause language. Additionally, the need to handle multi-hop questions, where the answer to one part of a question informs the answer to another, further complicates legal question-answering tasks.
The existing design of RAG systems allows for retrieving a small set of relevant passages, but legal queries often require exhaustive information from many documents, increasing latency and cost. Legal queries frequently involve complex question types, such as aggregating information across multiple documents, which current systems are not well-equipped to handle.
By addressing issues such as exhaustive information extraction, aggregate questions, multi-hop questions, and content overlap, we aim to provide insights into developing more robust RAG-based question-answering systems for legal professionals seeking precise and context-aware answers.
This whitepaper describes four significant challenges in developing a robust Retrieval-Augmented Generation (RAG) based legal question-answering system and proposes potential mitigation approaches. The primary goal of this work is to shift the focus from the limitations and enhancements of RAG implementations to the unique challenges posed by the legal domain and how systems within the RAG architecture can be adapted to build a more accurate legal question-answering system. The four challenges we discuss are Exhaustive Information, Aggregate Questions, Multi-hop Questions, and Overlapping Content. In each subsection, we examine the challenge in greater depth and propose a more robust approach.
Exhaustive Information
Standard RAG systems are not designed for exhaustive retrieval, often returning a limited number of passages, which can lead to critical omissions in legal contexts where thoroughness is essential. To address this, we propose a clarification mechanism that refines broad queries and narrows the search space, combined with retrieving information from each document independently. Together, these mechanisms ensure exhaustive coverage and improve precision and recall when handling complex legal queries.
A substantial challenge in applying Retrieval Augmented Generation (RAG) to legal question answering is the need for exhaustive information retrieval across large document sets. Legal queries often require thorough searches to identify and retain all critical details. RAG systems, optimized for efficiency, return a limited number of passages (e.g., the top 5 or 10), potentially omitting essential information. Increasing the number of retrieved documents may improve recall but can add computational overhead and latency, especially when generating answers from multiple passages.
For instance, consider the query: “What are all the indemnification clauses in contracts between Company A and its vendors?” The system must retrieve every relevant contract and identify each indemnification clause. This process requires potentially hundreds of contracts to be retrieved. Standard RAG systems, constrained by passage limits, may miss critical documents if they rank lower in retrieval.
A more effective approach includes a clarification mechanism that prompts users for specific details when the initial query is too broad. Legal questions often require narrowing to avoid overwhelming or irrelevant document retrieval. By asking for clarifications—such as specifying the contract, party, or time frame—the system can reduce unnecessary retrieval and focus on the most relevant documents, improving precision and recall.
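To make this concrete, the following is a minimal sketch of how a clarification step might be wired in, assuming a generic llm(prompt) completion function; the prompt wording and the JSON response format are illustrative, not a prescribed interface.

```python
# Minimal sketch of a query-clarification check. The llm(prompt) -> str
# completion function is a hypothetical placeholder for any LLM client.
import json

CLARIFY_PROMPT = """You are assisting with legal document search.
Decide whether the query below is specific enough to search a contract
repository (e.g., names a party, agreement type, or time frame).
Return JSON: {{"needs_clarification": true/false, "follow_up_question": "..."}}

Query: {query}"""


def clarify_query(query: str, llm) -> dict:
    """Ask the LLM whether the query should be narrowed before retrieval."""
    raw = llm(CLARIFY_PROMPT.format(query=query))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # If the model response is not valid JSON, fall back to searching as-is.
        return {"needs_clarification": False, "follow_up_question": ""}
```

If the model indicates clarification is needed, the follow-up question is surfaced to the user, and the refined query is what ultimately reaches the retriever.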
Additionally, retrieving information from each document independently, rather than generating an answer based on a combined set of top-ranked passages, ensures exhaustive coverage. In standard RAG systems, answers are often generated from the highest-ranked passages across multiple documents, risking the omission of essential details from lower-ranked but relevant documents. By extracting information from each document individually, the system ensures no relevant content is missed, enabling more accurate responses to complex legal queries.
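A rough sketch of this per-document extraction loop is shown below, under the same assumption of a generic llm(prompt) helper; the document fields and prompt wording are illustrative.

```python
# Minimal sketch of per-document extraction: each candidate document is
# queried on its own so lower-ranked documents are not silently dropped.
EXTRACT_PROMPT = """From the contract excerpt below, quote every clause
relevant to the question, or reply NONE if nothing applies.

Question: {question}

Excerpt:
{text}"""


def extract_per_document(question: str, documents: list[dict], llm) -> dict[str, str]:
    """Return a mapping of document id -> passages extracted for the question."""
    findings = {}
    for doc in documents:
        answer = llm(EXTRACT_PROMPT.format(question=question, text=doc["text"]))
        if answer.strip().upper() != "NONE":
            findings[doc["id"]] = answer
    return findings
```

The per-document findings can then be merged into a single answer or, as discussed in the next section, loaded into a structured table for aggregation.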
Aggregate Questions
Legal queries often require exhaustive retrieval and tasks like counting and listing across documents, where LLMs struggle. Our proposed hybrid approach combines document retrieval with SQL-like structured data processing, transforming queries and results into structured formats for precise counting, listing, and filtering tailored to legal analysis.
In the legal domain, many queries are not limited to retrieving facts or passages from a single or small set of documents but instead require aggregation of information across many documents. Questions such as “Which of the contracts contain an indemnity clause?” or “How many contracts were amended in the last year?” introduce an additional layer of complexity for Retrieval Augmented Generation (RAG) systems. These types of aggregate questions demand exhaustive coverage of relevant documents and involve aggregation operations such as counting, which LLMs perform unreliably.
To address aggregate questions in the legal domain, we propose a two-step solution that combines exhaustive retrieval with structured data processing through Structured Query Language (SQL)-like queries. Structured Query Language is a domain-specific language used to manage data, especially in a database management system. This approach leverages RAG systems and LLMs, addressing their limitations in tasks like counting or listing.
First, the system exhaustively retrieves relevant information from each document, ensuring no data is missed. For example, in response to the query “Which contracts contain an arbitration clause?”, the system retrieves relevant passages from all applicable contracts, ensuring comprehensive coverage.
Next, the retrieved data is transformed into a structured, SQL-compatible format. The LLM translates the original query into an SQL-like command. Finally, the SQL query is executed on the structured data, allowing for precise aggregation, such as counting or filtering. This approach delivers accurate and reliable answers to complex legal queries by combining exhaustive retrieval and structured querying.
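As a rough illustration of these last two steps, the sketch below loads hypothetical per-document extraction results into an in-memory SQLite table and executes a query of the kind the LLM might generate; the table schema and example rows are invented for illustration.

```python
# Minimal sketch of the structured-aggregation step using SQLite.
import sqlite3


def aggregate(rows: list[tuple[str, int]], sql: str) -> list[tuple]:
    """Load per-document extraction results into a table and run the
    LLM-generated SQL for counting, listing, or filtering."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contracts (name TEXT, has_arbitration INTEGER)")
    conn.executemany("INSERT INTO contracts VALUES (?, ?)", rows)
    return conn.execute(sql).fetchall()


# Example: one row per contract, with a flag set during per-document extraction.
rows = [("MSA_Acme.pdf", 1), ("NDA_Beta.pdf", 0), ("MSA_Gamma.pdf", 1)]
sql = "SELECT COUNT(*) FROM contracts WHERE has_arbitration = 1"
print(aggregate(rows, sql))  # -> [(2,)]
```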
Multi-hop Questions
Multi-hop legal questions require sequential reasoning across multiple documents, which standard RAG systems handle poorly. We synthesize accurate responses by breaking complex queries into atomic steps and solving them sequentially. This structured process ensures logical progression, overcoming the limitations of standard retrieval methods.
While multi-hop, or multi-part, questions are a general challenge for information retrieval systems, they become particularly pronounced in the legal domain due to the complexity and structure of legal documents. Legal queries often require sequential reasoning across multiple pieces of information within or across documents, where answering one part of a question depends on retrieving specific information that informs the next part.
Consider the example: "What is the limitation of liability in the MSAs with ABC Corporation governed by the laws of Delaware?" This multi-hop question requires several steps: First, the system locates the MSAs between the company and ABC Corporation. Next, it identifies which of those MSAs are governed by Delaware law. Finally, it extracts the limitation of liability clauses and synthesizes them into a complete answer.
The above is a classic case of multi-hop reasoning, where each step builds on the previous one. The system must not only retrieve information but also logically combine and process it across multiple steps to reach the final answer. Standard RAG systems struggle because they focus on retrieving relevant context, not performing complex reasoning. Multi-hop questions require the system to follow a logical chain—identifying, cross-referencing, and synthesizing information to reach a conclusion.
Multi-hop questions differ from simple queries that collect information from different contexts. In multi-hop queries, the answer to one part informs and narrows the scope of the next.
One approach to solving multi-hop legal QA involves breaking the query into smaller components. These atomic questions tackle individual parts of the query in sequence. The key steps are:
Step 1: Break Down the Question – The system decomposes the complex question into atomic parts. For example, one atomic question might be: “Which MSAs with ABC Corporation are governed by Delaware law?” Another might be: “What is the limitation of liability in those MSAs?”
Step 2: Recognize the Order of Resolution – The system must follow the logical sequence after identifying the atomic parts. First, determine the MSAs governed by Delaware law, then extract the limitation of liability.
Step 3: Solve Atomic Questions Serially – The system solves the atomic questions individually. Solving the first part provides context for the next, ultimately leading to the final answer.
The challenge lies in identifying atomic questions, which requires understanding the query structure and where relevant information resides, often scattered across documents. This structured reasoning approach overcomes the limitations of standard RAG systems, which struggle with multi-step inference.
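One possible shape for this decompose-then-solve loop is sketched below; it assumes a generic llm(prompt) completion function and a search(query) retrieval helper, both hypothetical placeholders, and the prompt wording is illustrative rather than prescriptive.

```python
# Minimal sketch of sequential multi-hop answering.
DECOMPOSE_PROMPT = """Break the legal question below into the ordered
atomic questions needed to answer it, one per line.

Question: {question}"""

ANSWER_PROMPT = """Answer the question using the retrieved passages and
the answers to earlier steps.

Earlier steps:{context}

Retrieved passages:
{passages}

Question: {question}"""


def answer_multi_hop(question: str, llm, search) -> str:
    """Solve atomic sub-questions in order, feeding each intermediate
    answer into the next hop so it can narrow the search."""
    atomic = [q.strip() for q in
              llm(DECOMPOSE_PROMPT.format(question=question)).splitlines() if q.strip()]
    context, answer = "", ""
    for sub_question in atomic:
        passages = search(sub_question + context)  # earlier answers refine retrieval
        answer = llm(ANSWER_PROMPT.format(context=context, passages=passages,
                                          question=sub_question))
        context += f"\nQ: {sub_question}\nA: {answer}"
    return answer  # the final hop synthesizes the complete response
```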
Overlapping Content
Legal documents typically contain large amounts of overlapping text, making conventional retrieval difficult in large repositories: standard RAG systems, which rank documents by similarity, struggle to surface the truly relevant context. To improve accuracy, implementing document subsetting narrows the search space while ensuring relevant context from similar documents is not excluded.
Legal documents, such as contracts, frequently contain substantial portions of identical text, with only a few key variables, such as entity names or dates, differing between documents. This results in a higher probability of retrieving similar yet incorrect documents, lowering retrieval accuracy. Moreover, legal repositories often contain tens of thousands of documents that differ in type, involve various counterparties, and span multiple jurisdictions. Standard RAG systems typically fail to account for this complexity, as they do not leverage metadata to filter source documents, leading to an unnecessarily broad search space and further reducing accuracy.
Document subsetting can be applied before the RAG stage to address considerable text overlap. Each document is tagged with standardized metadata, such as agreement type (e.g., MSA, NDA), party names, and governing jurisdiction. Upon receiving a query, the system uses this metadata to narrow the document set, ensuring only the most relevant documents are searched.
For instance, in a query like “What is the limitation of liability in the MSAs with ABC Corporation governed by Delaware law?”, the system identifies key entities—"ABC Corporation," "MSA," and "Delaware"—and filters the repository to include only relevant documents. This step reduces overlap and improves retrieval accuracy.
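A minimal sketch of the subsetting step follows; the metadata fields (agreement_type, party, governing_law) and the example documents are hypothetical, and in practice the filter values would be derived from the query by an entity-extraction step.

```python
# Minimal sketch of metadata-based document subsetting applied before retrieval.
def subset_documents(documents: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep only documents whose metadata matches every filter value,
    shrinking the search space before dense retrieval runs."""
    return [
        doc for doc in documents
        if all(doc.get(field, "").lower() == value.lower()
               for field, value in filters.items())
    ]


# Example: entities identified in the query become metadata filters.
docs = [
    {"id": "1", "agreement_type": "MSA", "party": "ABC Corporation", "governing_law": "Delaware"},
    {"id": "2", "agreement_type": "NDA", "party": "ABC Corporation", "governing_law": "New York"},
]
filters = {"agreement_type": "MSA", "party": "ABC Corporation", "governing_law": "Delaware"}
print(subset_documents(docs, filters))  # -> only document "1"
```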
Conclusion
The application of Retrieval-Augmented Generation (RAG) systems in the legal domain shows excellent promise but faces challenges due to the complexity of legal language and document structures. Legal queries often require more than retrieving a few passages—they demand exhaustive information, aggregation, multi-hop reasoning, and handling overlapping content. Standard RAG systems lack the tailored mechanisms to meet these needs, resulting in reduced accuracy in answering legal questions.
We propose enhancements such as exhaustive retrieval, structured data processing for aggregation, breaking down multi-hop queries, and document subsetting. These improvements aim to make RAG systems more effective for legal queries, delivering precise, context-aware answers.
By addressing current limitations and tailoring RAG systems for legal needs, this work contributes to more efficient and accurate legal question answering, enabling professionals to make better decisions. Legal Knowledge Management system developers should expand on these challenges, addressing legal document comparison, summarization, and document versioning and amendments. Additionally, a unified system incorporating all the proposed approaches should be evaluated across benchmarks to validate its impact further.