## The Limits of Basic RAG

Chunk the documents, embed and store the chunks, retrieve similar chunks, feed them to the LLM: this basic RAG pipeline performs well enough at the prototype stage, but in production it tends to expose a series of problems:

- Insufficient retrieval precision: semantic similarity is not the same as relevance, and retrieval often returns passages that look like the answer but miss its substance
- Poor handling of long documents: chunking breaks a document's semantic integrity
- Failed multi-hop reasoning: questions that require combining evidence across several documents do not get answered correctly
- Weak answer traceability: users cannot tell where an answer came from

This article is a systematic tour of advanced techniques that go beyond basic RAG. They have been validated in production and can markedly improve the quality of a retrieval-augmented system.

---

## Technique 1: Hypothetical Document Embeddings (HyDE)

The core idea of HyDE: first have the LLM generate a hypothetical answer, then retrieve with that hypothetical answer instead of the original question. A plausible answer tends to sit closer in embedding space to real answer passages than a short question does, which is what improves recall.

```python
from anthropic import Anthropic
from langchain_community.vectorstores import Chroma


class HyDERetriever:
    """Hypothetical Document Embeddings retriever."""

    def __init__(self, vectorstore: Chroma):
        # The vectorstore carries its own embedding function,
        # so no separate embeddings client is needed here.
        self.client = Anthropic()
        self.vectorstore = vectorstore

    def retrieve(self, question: str, k: int = 5) -> list:
        """Retrieve relevant documents using the HyDE strategy."""
        # Step 1: generate a hypothetical answer (no retrieval, pure LLM generation)
        hypothetical_answer = self._generate_hypothetical_answer(question)
        # Step 2: run the vector search with the hypothetical answer, not the raw question
        docs = self.vectorstore.similarity_search(hypothetical_answer, k=k)
        return docs

    def _generate_hypothetical_answer(self, question: str) -> str:
        """Generate a hypothetical answer."""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",  # a fast model is enough for this step
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": (
                    f"Generate a plausible answer to the question below. Even if you are "
                    f"unsure, produce an answer that looks reasonable. It will be used for "
                    f"document retrieval, so it should contain the domain terminology and "
                    f"keywords of the topic.\n\n"
                    f"Question: {question}\n\n"
                    f"Give the answer directly, without subjective phrases like 'I think'."
                )
            }]
        )
        return response.content[0].text

    def retrieve_with_comparison(self, question: str) -> dict:
        """Compare plain retrieval with HyDE retrieval."""
        # Plain retrieval with the raw question
        normal_docs = self.vectorstore.similarity_search(question, k=3)
        # HyDE retrieval with the hypothetical answer
        hyde_answer = self._generate_hypothetical_answer(question)
        hyde_docs = self.vectorstore.similarity_search(hyde_answer, k=3)
        return {
            "question": question,
            "hypothetical_answer": hyde_answer,
            "normal_retrieval": [d.page_content[:100] for d in normal_docs],
            "hyde_retrieval": [d.page_content[:100] for d in hyde_docs],
        }
```
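A minimal usage sketch for `HyDERetriever`; the embedding model, sample texts, and question below are illustrative stand-ins (any Chroma collection with an embedding function works), and an `ANTHROPIC_API_KEY` is assumed to be set in the environment:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Hypothetical mini-corpus; in practice this is your document index.
texts = [
    "Raft elects one leader per term; followers grant at most one vote each.",
    "Paxos reaches consensus through separate prepare and accept phases.",
]
vectorstore = Chroma.from_texts(texts, embedding=HuggingFaceEmbeddings())

retriever = HyDERetriever(vectorstore)
report = retriever.retrieve_with_comparison("How does Raft avoid split votes?")
print(report["hypothetical_answer"])  # the generated pseudo-answer used as the query
print(report["hyde_retrieval"])       # chunks retrieved via that pseudo-answer
```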
---

## Technique 2: Multi-Query Retrieval

A single phrasing of a question reaches only one neighborhood of the index. Multi-query retrieval has the LLM restate the question from several angles, retrieves with every version, and merges the deduplicated results.

```python
class MultiQueryRetriever:
    """Restate the question from multiple angles to widen retrieval coverage."""

    def __init__(self, vectorstore: Chroma, num_queries: int = 3):
        self.client = Anthropic()
        self.vectorstore = vectorstore
        self.num_queries = num_queries

    def generate_queries(self, question: str) -> list[str]:
        """Restate the same question from several different angles."""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=400,
            messages=[{
                "role": "user",
                "content": (
                    f"To retrieve information from different angles, restate the question "
                    f"below as {self.num_queries} different versions. Each version should "
                    f"use different keywords or a different perspective so that together "
                    f"they cover a wider retrieval range.\n\n"
                    f"Original question: {question}\n\n"
                    f"Output the {self.num_queries} versions directly, one per line, "
                    f"with no numbering or explanation."
                )
            }]
        )
        queries = [q.strip() for q in response.content[0].text.strip().split("\n") if q.strip()]
        return [question] + queries[:self.num_queries]  # keep the original question too

    def retrieve(self, question: str, k_per_query: int = 3) -> list:
        """Multi-query retrieval with deduplication of the merged results."""
        queries = self.generate_queries(question)
        all_docs = []
        seen_content = set()
        for query in queries:
            docs = self.vectorstore.similarity_search(query, k=k_per_query)
            for doc in docs:
                content_hash = hash(doc.page_content[:100])
                if content_hash not in seen_content:
                    seen_content.add(content_hash)
                    all_docs.append(doc)
        return all_docs
```

---

## Technique 3: Contextual Compression

Retrieved chunks are usually long and only partly on topic. Contextual compression uses a fast model to keep just the passages that bear on the question, which shortens the final prompt and filters out irrelevant documents entirely.

```python
class ContextualCompressor:
    """Compress retrieval results, keeping only content relevant to the question."""

    def __init__(self):
        self.client = Anthropic()

    def compress(self, question: str, document: str, max_length: int = 500) -> str | None:
        """Extract the question-relevant parts of a long document.

        Returns None if the document is unrelated to the question.
        """
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=max_length,
            messages=[{
                "role": "user",
                "content": (
                    f"Task: extract the content relevant to the question from the document.\n\n"
                    f"Question: {question}\n\n"
                    f"Document:\n{document}\n\n"
                    f"Requirements:\n"
                    f"1. Extract only sentences and paragraphs directly relevant to the question\n"
                    f"2. Keep the original wording; do not paraphrase\n"
                    f"3. If the document is completely unrelated to the question, output only [IRRELEVANT]\n"
                    f"4. Keep the output under {max_length} characters\n\n"
                    f"Extracted content:"
                )
            }]
        )
        result = response.content[0].text.strip()
        if result == "[IRRELEVANT]" or len(result) < 10:
            return None
        return result

    def compress_batch(self, question: str, documents: list, max_docs: int = 5) -> list:
        """Compress several documents in parallel, dropping irrelevant ones."""
        from concurrent.futures import ThreadPoolExecutor

        def compress_doc(doc):
            compressed = self.compress(question, doc.page_content)
            if compressed:
                doc.page_content = compressed
                return doc
            return None

        with ThreadPoolExecutor(max_workers=3) as executor:
            results = list(executor.map(compress_doc, documents))
        # Drop the Nones and return at most max_docs documents
        return [r for r in results if r is not None][:max_docs]
```

---

## Technique 4: Self-RAG (Self-Verifying Retrieval)

Self-RAG wraps the pipeline in self-checks: the LLM first decides whether retrieval is needed at all, and after generation it verifies that the answer is actually supported by the retrieved context.

```python
import json
import re


class SelfRAG:
    """Self-verifying RAG: the LLM decides whether to retrieve and validates the result."""

    def __init__(self, vectorstore: Chroma):
        self.client = Anthropic()
        self.vectorstore = vectorstore
        self.compressor = ContextualCompressor()

    def answer(self, question: str) -> dict:
        """Question answering with self-verification."""
        # Step 1: decide whether retrieval is needed
        retrieval_decision = self._decide_retrieval(question)

        if not retrieval_decision["need_retrieval"]:
            # No retrieval needed: answer directly
            response = self.client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                messages=[{"role": "user", "content": question}]
            )
            return {
                "answer": response.content[0].text,
                "retrieval_used": False,
                "confidence": "high",
            }

        # Step 2: retrieve and compress documents
        docs = self.vectorstore.similarity_search(question, k=5)
        compressed = self.compressor.compress_batch(question, docs)

        if not compressed:
            # Nothing relevant was retrieved: fall back to a direct answer
            return self._fallback_answer(question, "retrieved documents were irrelevant to the question")

        context = "\n\n---\n\n".join([d.page_content for d in compressed])

        # Step 3: generate an answer grounded in the retrieved context
        answer_response = self._generate_with_context(question, context)

        # Step 4: verify that the answer is sufficiently supported
        verification = self._verify_answer(question, answer_response, context)

        if not verification["is_supported"]:
            # The answer lacks support: flag low confidence
            return {
                "answer": answer_response + "\n\n*Note: this answer may be incomplete; further verification is recommended.*",
                "retrieval_used": True,
                "confidence": "low",
                "issues": verification.get("issues", []),
            }

        return {
            "answer": answer_response,
            "retrieval_used": True,
            "confidence": "medium",
            "sources": [d.metadata.get("source", "unknown") for d in compressed],
        }

    def _decide_retrieval(self, question: str) -> dict:
        """Decide whether the question requires external knowledge."""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": (
                    f"Decide whether answering the question below accurately requires "
                    f"consulting external documents. Common-sense questions need no "
                    f"retrieval; specialist knowledge, recent information, or specific "
                    f"data does.\n"
                    f'Output only JSON: {{"need_retrieval": true/false, "reason": "brief explanation"}}\n\n'
                    f"Question: {question}"
                )
            }]
        )
        raw = response.content[0].text
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        return json.loads(match.group()) if match else {"need_retrieval": True}

    def _generate_with_context(self, question: str, context: str) -> str:
        """Generate an answer grounded in the given context."""
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system=("You are a knowledge-base QA assistant. Answer strictly from the "
                    "provided context and do not add information that does not appear in it."),
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text

    def _verify_answer(self, question: str, answer: str, context: str) -> dict:
        """Verify that the answer is fully supported by the context."""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": (
                    f"Verify that the answer is based entirely on the provided context, "
                    f"with no fabricated information.\n"
                    f'Output JSON: {{"is_supported": true/false, "issues": ["list of problems"]}}\n\n'
                    f"Context excerpt: {context[:500]}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer: {answer[:300]}"
                )
            }]
        )
        raw = response.content[0].text
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        return json.loads(match.group()) if match else {"is_supported": True}

    def _fallback_answer(self, question: str, reason: str) -> dict:
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": question}]
        )
        return {
            "answer": response.content[0].text,
            "retrieval_used": False,
            "confidence": "medium",
            "note": reason,
        }
```
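A short call sketch for `SelfRAG`, reusing a `vectorstore` like the one built earlier; the question and printed fields are illustrative:

```python
rag = SelfRAG(vectorstore)
result = rag.answer("What consistency guarantees does Raft give during leader changes?")

print(result["confidence"])      # "high" / "medium" / "low"
print(result["retrieval_used"])  # False when the model judged retrieval unnecessary
# "sources" appears only on the verified retrieval path; it is what gives
# this pipeline the answer traceability that basic RAG lacks
print(result.get("sources", []))
```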
---

## Technique 5: Hybrid Search

Dense vectors capture meaning but can miss exact keywords such as product names, IDs, or error codes; BM25 matches keywords but misses paraphrases. Hybrid search runs both and fuses the normalized scores.

```python
from rank_bm25 import BM25Okapi
import numpy as np


class HybridRetriever:
    """Combine dense vector retrieval with sparse BM25 retrieval."""

    def __init__(self, documents: list, embeddings_model):
        self.documents = documents
        self.embeddings_model = embeddings_model
        # Build the BM25 index (keyword matching)
        tokenized = [doc.page_content.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        # Build the vector index (semantic matching)
        self.embeddings = embeddings_model.embed_documents(
            [doc.page_content for doc in documents]
        )

    def retrieve(self, query: str, k: int = 5, alpha: float = 0.5) -> list:
        """Hybrid retrieval.

        alpha: weight of the vector score; (1 - alpha) is the BM25 weight.
        """
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_normalized = self._normalize(bm25_scores)
        # Vector similarity scores
        query_embedding = self.embeddings_model.embed_query(query)
        vector_scores = np.dot(self.embeddings, query_embedding)
        vector_normalized = self._normalize(vector_scores)
        # Fuse with a weighted sum of the normalized scores
        # (Reciprocal Rank Fusion, which combines ranks rather than
        # raw scores, is a common alternative)
        hybrid_scores = alpha * vector_normalized + (1 - alpha) * bm25_normalized
        # Return the top-k documents
        top_k_indices = np.argsort(hybrid_scores)[::-1][:k]
        return [self.documents[i] for i in top_k_indices]

    def _normalize(self, scores: np.ndarray) -> np.ndarray:
        """Min-max normalization."""
        min_s, max_s = scores.min(), scores.max()
        if max_s == min_s:
            return np.zeros_like(scores)
        return (scores - min_s) / (max_s - min_s)
```

---

## Performance Comparison and Selection Advice

| Technique | Best for | Latency overhead | Quality gain |
|------|---------|---------|---------|
| HyDE | Domain-specific QA | +200 ms | High |
| Multi-Query | Vague or ambiguous questions | +300 ms | Medium-high |
| Contextual Compression | Long documents | +400 ms | High |
| Self-RAG | High-accuracy requirements | +600 ms | Very high |
| Hybrid Search | Mixed keyword + semantic queries | +50 ms | Medium-high |

Selection advice:

- Latency-sensitive: prefer Hybrid Search
- High accuracy required: Self-RAG + Contextual Compression
- Highly varied user questions: Multi-Query
- Domain-specific knowledge bases: HyDE

In real production systems these techniques are usually combined. Multi-Query + Contextual Compression + Hybrid Search, for example, is a common combination that balances quality and latency; a sketch of that pipeline follows.
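To make the combination concrete, here is a minimal sketch that chains the three classes defined above. It is one possible wiring, not a prescribed architecture; the deduplication key mirrors the one used in `MultiQueryRetriever`:

```python
class CombinedRetriever:
    """Multi-Query expansion -> Hybrid Search -> Contextual Compression."""

    def __init__(self, documents: list, embeddings_model, vectorstore):
        self.multi_query = MultiQueryRetriever(vectorstore)
        self.hybrid = HybridRetriever(documents, embeddings_model)
        self.compressor = ContextualCompressor()

    def retrieve(self, question: str, k: int = 5) -> list:
        # 1. Expand the question into several phrasings (includes the original)
        queries = self.multi_query.generate_queries(question)
        # 2. Run hybrid retrieval for each phrasing and deduplicate the union
        candidates, seen = [], set()
        for query in queries:
            for doc in self.hybrid.retrieve(query, k=k):
                key = hash(doc.page_content[:100])
                if key not in seen:
                    seen.add(key)
                    candidates.append(doc)
        # 3. Compress the merged candidates down to question-relevant passages
        return self.compressor.compress_batch(question, candidates, max_docs=k)
```

Hybrid retrieval itself is cheap (+50 ms in the table above), so the query-expansion and compression calls dominate the added latency; both can run on the fast Haiku model, as the earlier snippets do.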