Check it out, n8n has an update! Evaluating RAG, aka optimizing the optimization
Source: https://blog.n8n.io/evaluating-rag-aka-optimizing-the-optimization/
Summary:
Although retrieval-augmented generation (RAG) is widely seen as a key technique for optimizing large language models (LLMs), in practice it can still produce content that contradicts or lacks grounding in the retrieved information. For example, when an analyst at a logistics company asks an internal AI assistant about Q2 performance, the system correctly retrieves the 15% revenue decline but attributes it to the Suez Canal blockage, an explanation that does not appear in the original report.
Research breaks RAG hallucinations into four types: evident conflict (direct contradiction of the retrieved content), subtle conflict (semantic divergence from it), evident introduction of baseless information (adding unsupported details), and subtle introduction of baseless information (injecting subjective inference). Addressing them means evaluating a RAG system along two dimensions: document relevance (the retrieved information is accurate and complete) and answer groundedness (the response stays consistent with the retrieved content).
The Ragas evaluation framework is the common industry choice: context recall and context precision measure retrieval quality, while faithfulness and response relevancy validate the generated content. Notably, even a factually correct answer still counts as a hallucination if it relies on the model's prior knowledge rather than the retrieved content.
The workflow automation platform n8n now ships RAG evaluation built in, with no external libraries required: the "Evaluate RAG Response Accuracy with OpenAI" workflow template checks how well generated content aligns with the retrieved documents, and a context recall workflow quantifies retrieval effectiveness. Users can run evaluations against a test dataset and compare results across runs to pinpoint where to optimize.
(Note: the Suez Canal blockage case is a technical illustration only and is unrelated to the actual event.)
Translation:
Retrieval-augmented generation (RAG) is often positioned as the go-to approach for optimizing large language models (LLMs). Yet even with RAG integrated into an agentic system, the LLM may still make claims that contradict the retrieved content or lack support in it.
Take a business analyst at a logistics company who uses a RAG-powered internal AI assistant to query financial reports. Asked "What is our Q2 performance?", the assistant replies: "Our Q2 revenue decreased by 15% compared to Q1 due to supply chain disruptions following the Suez Canal blockage." The system correctly retrieved the 15% revenue drop from the financial report, yet it attributed the decline to the Suez Canal blockage, an explanation that appears nowhere in the source material.
Retrieving documents does not guarantee accuracy, so RAG itself needs to be optimized. That means tuning the search to return more precise results (with less noise) and keeping the LLM response aligned with the retrieved context.
This article looks at why RAG systems still hallucinate, introduces an evaluation approach based on the Ragas framework, and finally shows how to run RAG evaluations in n8n.
Four types of RAG hallucinations
Hallucination has a slightly different meaning in the context of RAG: a response counts as a hallucination when the LLM generates content from its pre-trained knowledge rather than from the text retrieved for it. Vectara, the team behind the HHEM evaluation models, gives this example: if the retrieved context states "The capital of France is Berlin" and the LLM outputs "The capital of France is Paris", the response is a hallucination even though it is factually correct.
According to the RAGTruth paper, RAG-specific hallucinations fall into four categories:
- Evident conflict: the generated content directly contradicts or opposes the provided information. Such conflicts can be verified without deep context and often involve clear factual errors, misspelled names, incorrect numbers, and the like.
- Subtle conflict: the generated content diverges from the provided information and distorts its intended contextual meaning, often by substituting terms with different implications or severity; spotting it requires a deeper understanding of the context.
- Evident introduction of baseless information: the generated content includes information not substantiated by the source material, such as hypothetical, fabricated, or hallucinated details lacking evidence.
- Subtle introduction of baseless information: the generated content extends beyond the source material by adding inferred details, insights, or sentiments. Such additions lack verifiability and may reflect subjective assumptions or common beliefs rather than explicit facts.
The two-pillar RAG evaluation framework
A good RAG implementation should validate two things:
- Making sure the retrieved information is the right information → RAG Document Relevance
- Ensuring the LLM answer is consistent with the retrieved context → RAG Groundedness
Most tools available today build on the Ragas library, which provides a set of RAG-specific evaluation functions; the descriptions below follow that library.
RAG Document Relevance: retrieving the right context
Context Recall measures how many of the relevant documents were successfully retrieved; higher recall means fewer relevant documents were missed. Context Precision measures the proportion of relevant chunks among the retrieved contexts. Computing context recall always requires a reference to compare against.
Both recall and precision can be computed with a judge LLM or with deterministic methods:
- LLM-based Context Recall is computed from three inputs: the user input, the reference answer, and the retrieved contexts. The reference answer is broken down into claims, and each claim is checked for whether it can be attributed to the retrieved context.
- LLM-based Context Precision has an LLM judge the relevance of each retrieved context or chunk by comparing it with the response.
- Non-LLM-based metrics compare the retrieved contexts directly with reference contexts, using measures such as semantic similarity, Levenshtein similarity ratio, and string comparison to decide whether a retrieved context is relevant (a toy sketch of this deterministic view follows this list).
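To make the two retrieval metrics concrete, here is a minimal, library-free sketch (not from the original article) that treats relevance as membership in a hand-labeled reference set; the chunk IDs are hypothetical, and a real setup would use the Ragas metrics or similarity measures described above.

```python
# Toy illustration of context recall and context precision.
# `retrieved` is what the RAG search returned; `reference` is the
# hand-labeled set of chunks that are actually relevant (hypothetical IDs).
retrieved = {"chunk_q2_revenue", "chunk_marketing_plan", "chunk_q1_revenue"}
reference = {"chunk_q2_revenue", "chunk_q1_revenue", "chunk_supply_chain_costs"}

true_positives = retrieved & reference

# Recall: share of relevant chunks that were actually retrieved.
context_recall = len(true_positives) / len(reference)      # 2/3 ≈ 0.67
# Precision: share of retrieved chunks that are relevant.
context_precision = len(true_positives) / len(retrieved)   # 2/3 ≈ 0.67

print(f"context recall:    {context_recall:.2f}")
print(f"context precision: {context_precision:.2f}")
```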
RAG Groundedness: keeping responses consistent with the retrieved context
Faithfulness measures how factually consistent a response is with the retrieved context; a response is considered faithful if all of its claims are supported by that context. Vectara's HHEM-2.1-Open, an open-source classifier model for detecting hallucinations in LLM-generated text, cross-checks claims against the given context to judge whether they can be inferred from it.
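As one way to run the faithfulness check described above, the sketch below scores (context, response) pairs with Vectara's HHEM-2.1-Open through Hugging Face Transformers. It assumes the usage pattern published on the model card, where a predict() helper returns a consistency score close to 1 when the response is supported; that interface may differ across model revisions, and the example texts are made up.

```python
from transformers import AutoModelForSequenceClassification

# Load Vectara's open hallucination detector. trust_remote_code is needed
# because the model ships its own prediction head and predict() helper.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (retrieved context, generated response).
pairs = [
    ("Q2 revenue decreased by 15% compared to Q1.",
     "Q2 revenue decreased by 15% compared to Q1."),
    ("Q2 revenue decreased by 15% compared to Q1.",
     "Q2 revenue fell 15% due to the Suez Canal blockage."),
]

scores = model.predict(pairs)  # higher score = more consistent with the context
print(scores)  # the second pair should score lower: the stated cause is unsupported
```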
Response Relevancy measures how relevant a response is to the user's question. An answer is considered relevant if it directly and appropriately addresses the original question. The metric focuses on how well the answer matches the intent of the question, without judging factual accuracy, and it penalizes answers that are incomplete or padded with unnecessary detail.
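Ragas operationalizes response relevancy by generating candidate questions from the answer and averaging their embedding similarity to the original question. The sketch below illustrates only that embedding-similarity step, using sentence-transformers with hand-written stand-in questions; the model choice and the example questions are illustrative assumptions, not part of the original article.

```python
from sentence_transformers import SentenceTransformer, util

# Compare the original question with questions that could plausibly be
# regenerated from the answer (hand-written here for illustration) and
# average the cosine similarity of their embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

original_question = "What is our Q2 performance?"
regenerated_questions = [
    "How did revenue change in Q2 compared to Q1?",
    "What happened to our Q2 revenue?",
]

q_emb = model.encode(original_question, convert_to_tensor=True)
r_embs = model.encode(regenerated_questions, convert_to_tensor=True)

relevancy = util.cos_sim(q_emb, r_embs).mean().item()
print(f"response relevancy ≈ {relevancy:.2f}")  # closer to 1 = more on-topic
```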
RAG evaluations in n8n
n8n supports RAG performance evaluation natively, without external libraries or API calls:
- The built-in evaluations cover both document relevance and answer groundedness, scoring how relevant the retrieved documents are to the question and whether the answer is grounded in those documents.
- Evaluations run against a test dataset, and results can be compared across runs to see how the metrics change and to trace the reasons behind those changes.
The "Evaluate RAG Response Accuracy with OpenAI" workflow template uses LLM-based response relevancy to check whether the response is based on the retrieved documents. A high score indicates the LLM is sticking to the context; a low score may signal an inadequate prompt or model hallucination.
The RAG document relevance workflow uses LLM-based context recall to compute a retrieval score and judge how well the workflow is performing.
To learn more about evaluations in n8n, see the "Introducing Evaluations for AI Workflows" blog post and the technical documentation for the evaluation node.
English source:
Retrieval augmented generation is often positioned as the go-to solution for optimizing LLMs. But despite the integration of RAG in agentic systems, LLMs may still present unsupported or contradictory claims to the retrieved contents.
Imagine a business analyst at a logistics company using an internal AI assistant powered by RAG to interact with financial reports. When the analyst asks, “What is our Q2 performance?”, the assistant responds: “Our Q2 revenue decreased by 15% compared to Q1 due to supply chain disruptions following the Suez Canal blockage.”
While the system correctly retrieved the financial report, noting a 15% revenue drop, it fabricated a justification by attributing the decline to the Suez Canal blockage—an explanation not present in the source material.
Retrieving documents doesn’t guarantee accuracy, so RAG itself must be optimized. This means tuning the search to return the right results, including less noise, and aligning the LLM response with the context retrieved.
That’s why in this article, we’ll discuss how RAG systems can still hallucinate, and provide a framework for evaluating RAG applications using the Ragas framework. Lastly, we’ll present how to implement RAG evaluations in n8n.
Four types of RAG hallucinations
Hallucinations have a slightly different definition in the context of RAG. We use the term to indicate a response is not supported by or aligned with the retrieved context. It is considered a hallucination when the LLM does not generate content based on the textual data provided to it as part of the RAG retrieval process, but rather generates content based on its pre-trained knowledge.
Vectara, the creators of the HHEM evaluation models, give the following example: if the retrieved context states "The capital of France is Berlin", and the LLM outputs "The capital of France is Paris", then the LLM response is hallucinated, despite it being correct.
We can categorize RAG-specific hallucinations into four categories, as described in the paper titled RAGTruth:
- Evident Conflict: for when generative content presents a direct contradiction or opposition to the provided information. These conflicts are easily verifiable without extensive context, often involving clear factual errors, misspelled names, incorrect numbers, etc.
- Subtle Conflict: for when generative content presents a departure or divergence from the provided information, altering the intended contextual meaning. These conflicts often involve substitution of terms that carry different implications or severity, requiring a deeper understanding of their contextual applications.
- Evident Introduction of Baseless Information: for when generated content includes information not substantiated in the provided information. It involves the creation of hypothetical, fabricated, or hallucinatory details lacking evidence or support.
- Subtle Introduction of Baseless Information: is when generated content extends beyond the provided information by incorporating inferred details, insights, or sentiments. This additional information lacks verifiability and might include subjective assumptions or commonly observed norms rather than explicit facts.
The two-pillar RAG evaluation framework
A good RAG implementation can validate two things:
- Making sure RAG retrieves the right information. This is RAG Document Relevance.
- Ensuring LLM answers are consistent with the context retrieved via RAG. This is RAG Groundedness.
Most tools available today use the Ragas library, which provides a set of RAG-specific evaluation functions. We use the evaluations available in the Ragas library in the descriptions below.
RAG Document Relevance: Retrieving the right context
The Context Recall evaluation measures how many of the relevant documents were successfully retrieved. Higher recall means fewer relevant documents were left out. Context Precision is a metric that measures the proportion of relevant chunks in the retrieved contexts. Calculating context recall always requires a reference to compare against.
Both recall and precision can be calculated using a judge LLM or by using deterministic calculations.
- LLM-based Context Recall is computed using three variables: the user input, the reference and the retrieved contexts. To estimate context recall from the reference, the reference is broken down into claims, and each claim in the reference answer is analyzed to determine whether it can be attributed to the retrieved context or not.
- LLM-based Context Precision is used to estimate whether a retrieved context is relevant or not by having an LLM compare each of the retrieved contexts, or chunks within them, with the response (a minimal Ragas sketch of both LLM-based metrics follows this list).
- Non-LLM-based Context Recall and Precision compare retrieved contexts or chunks with the reference contexts. These metrics use measures such as semantic similarity, Levenshtein Similarity Ratio and string comparison metrics to determine if a retrieved context is relevant or not.
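For reference, a minimal sketch of the two LLM-based retrieval metrics with the Ragas library might look like the following. The class names, wrappers and the OpenAI judge model assume a recent Ragas release (roughly the 0.2 API) together with langchain-openai, so imports may need adjusting for your installed version; the sample data is the article's Q2 example.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextPrecisionWithReference, LLMContextRecall

# Judge LLM that decomposes the reference into claims and checks attribution.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# One evaluation sample: question, generated answer, retrieved chunks, reference answer.
sample = SingleTurnSample(
    user_input="What is our Q2 performance?",
    response="Q2 revenue decreased by 15% compared to Q1.",
    retrieved_contexts=["Financial report: Q2 revenue decreased by 15% compared to Q1."],
    reference="Q2 revenue was 15% lower than in Q1.",
)

async def main() -> None:
    recall = LLMContextRecall(llm=evaluator_llm)                     # were the needed facts retrieved?
    precision = LLMContextPrecisionWithReference(llm=evaluator_llm)  # are the retrieved chunks relevant?
    print("context recall:   ", await recall.single_turn_ascore(sample))
    print("context precision:", await precision.single_turn_ascore(sample))

asyncio.run(main())
```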
RAG Groundedness: Evaluating responses against retrieved context
Faithfulness determines how factually consistent a response is with the retrieved context. A response is considered faithful if all its claims can be supported by the retrieved context. Vectara's HHEM-2.1-Open is an open source classifier model that is trained to detect hallucinations in LLM-generated text. It can be used to cross-check claims against the given context to determine whether they can be inferred from it.
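A minimal Ragas sketch of the faithfulness metric, under the same API assumptions as the earlier Ragas example, could look like this; it reuses the article's Suez Canal scenario, where the 15% figure is supported by the context but the causal explanation is not, so the score should land well below 1.0.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

# Judge LLM that extracts claims from the response and checks them against the context.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

sample = SingleTurnSample(
    user_input="What is our Q2 performance?",
    response="Q2 revenue decreased by 15% compared to Q1 due to the Suez Canal blockage.",
    retrieved_contexts=["Financial report: Q2 revenue decreased by 15% compared to Q1."],
)

faithfulness = Faithfulness(llm=evaluator_llm)
print(asyncio.run(faithfulness.single_turn_ascore(sample)))  # fraction of supported claims
```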
Response Relevancy measures how relevant a response is to the user input. An answer is considered relevant if it directly and appropriately addresses the original question. This metric focuses on how well the answer matches the intent of the question, without evaluating factual accuracy. It penalizes answers that are incomplete or include unnecessary details.
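Sketching Response Relevancy in the same style: in Ragas this metric needs an embedding model in addition to the judge LLM, since it compares the original question with questions regenerated from the answer. The class name ResponseRelevancy and the wrapper types below assume a recent Ragas release and are worth double-checking against your installed version.

```python
import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ResponseRelevancy

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# A direct, complete answer should score close to 1; an evasive or padded
# answer drifts lower even if nothing in it is factually wrong.
sample = SingleTurnSample(
    user_input="What is our Q2 performance?",
    response="Q2 revenue decreased by 15% compared to Q1.",
    retrieved_contexts=["Financial report: Q2 revenue decreased by 15% compared to Q1."],
)

relevancy = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
print(asyncio.run(relevancy.single_turn_ascore(sample)))
```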
RAG Evaluations in n8n
You can evaluate RAG performance in n8n without external libraries or calls. The evaluations natively available include both RAG document relevance and answer groundedness. These calculate whether the documents retrieved are relevant to the question and if the answer is grounded in the documents retrieved. RAG evaluations are run against a test dataset, and results can be compared across runs to see how the metrics change and drill down into the reasons for those changes.
The Evaluate RAG Response Accuracy with OpenAI workflow template uses LLM-based response relevancy to assess whether the response is based on the retrieved documents. A high score indicates LLM adherence and alignment, whereas a low score could signal an inadequate prompt or model hallucination.
The RAG document relevance workflow uses an LLM-based context recall to calculate a retrieval score for each input and determine whether the workflow is performing well or not.
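For readers curious what such an LLM-driven check reduces to outside of n8n, here is a rough, hypothetical sketch of the judge pattern using the OpenAI Python SDK. It is not the actual template logic; the prompt, the model name and the 0-1 scoring scale are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def groundedness_score(question: str, context: str, answer: str) -> float:
    """Ask a judge model how well the answer is supported by the retrieved context (0-1)."""
    judge_prompt = (
        "You are evaluating a RAG answer.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n"
        "Reply with only a number between 0 and 1, where 1 means every claim "
        "in the answer is supported by the retrieved context."
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    # Assumes the judge complies with the "number only" instruction.
    return float(completion.choices[0].message.content.strip())

score = groundedness_score(
    question="What is our Q2 performance?",
    context="Financial report: Q2 revenue decreased by 15% compared to Q1.",
    answer="Q2 revenue fell 15% because of the Suez Canal blockage.",
)
print(score)  # the unsupported causal claim should pull this below 1.0
```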
To understand more about evaluation in n8n, check out the Introducing Evaluations for AI Workflows blog post, and our technical documentation on the evaluation node.