Building a RAG system is not only about connecting documents to an LLM. A useful RAG system is a pipeline. Each stage decides whether the final answer has enough context, enough accuracy, and enough traceability.
This first log is only a walkthrough. It introduces why RAG exists, then goes through the full path: indexing, retrieval, reranking, LLM design, and debugging system design. Each part can be expanded into its own deeper log later.
Short Answer
RAG means Retrieval-Augmented Generation.
The simple idea is this: before the LLM answers, the system first retrieves relevant information from external knowledge sources. Then the retrieved context is passed into the LLM so the model can generate an answer based on that evidence.
A RAG system usually has five major stages:
1. Indexing
Convert raw source documents into searchable knowledge units.
2. Retrieval
Search the indexed knowledge and return candidate chunks.
3. Reranking
Reorder retrieved chunks so the most relevant ones move to the top.
4. LLM Design
Build the prompt, context format, answer rules, and citation behavior.
5. Debugging System
Track where failure happens: source data, chunking, retrieval, reranking, or generation.
- 1. Indexing: Prepare searchable knowledge units
- 2. Retrieval: Return candidate chunks
- 3. Reranking: Move useful chunks to the top
- 4. LLM Design: Generate a grounded answer
- 5. Debugging: Trace failures across the pipeline
The main idea is simple: RAG quality is not decided by one model. It is decided by the whole path from document preparation to final answer generation.
Why RAG Is Needed
Modern LLMs are powerful because they are trained on large volumes of data using sophisticated training techniques. They can reason, summarize, rewrite, classify, and generate useful answers across many domains.
But an LLM still has four important limits.
| Limit | What It Means |
|---|---|
| Static knowledge | The model only knows what was available during training |
| Missing private data | The model does not automatically know your company docs, database records, tickets, or internal knowledge |
| Knowledge cutoff | The model may not know the latest events, latest product changes, or recent documentation updates |
| No built-in source trace | The model may answer fluently without showing where the answer came from |
This is why a powerful model alone is not always enough for production applications.
For example, if the user asks about a new company policy, a recent release note, a private support ticket, or a document that was created after the model was trained, the LLM may not have that information. It may answer with outdated knowledge, incomplete knowledge, or a confident guess.
RAG addresses this problem by retrieving relevant context from external knowledge sources before generation. These sources can be documents, databases, websites, tickets, logs, manuals, or internal notes.
The LLM is still important, but its job changes. Instead of answering only from memory, it answers with retrieved evidence.
| Without RAG | With RAG |
|---|---|
| The model answers from static training knowledge | The model answers with retrieved external context |
| Latest or private information may be missing | Latest or private information can be provided at runtime |
| Harder to verify the answer | Easier to attach citations and source references |
| Wrong answers may look fluent | Answers can be grounded against real documents |
RAG is essential when an LLM-based application needs to be accurate, current, and traceable. Without RAG, the model may still sound good, but the answer can be incomplete, outdated, or unsupported.
Step 1: Indexing Fundamentals
Indexing means preparing your source data so the system can search it later.
It usually includes five parts:
| Part | Purpose |
|---|---|
| Parsing | Extract readable text from PDFs, HTML, Markdown, database records, or documents |
| Chunking | Split the parsed text into smaller units |
| Embedding | Convert each chunk into a vector representation |
| Storing | Save chunks, embeddings, and document references into a database |
| Metadata Design | Attach useful fields for filtering, tracing, and debugging |
Indexing is the foundation of RAG. If the source text is missing, badly parsed, badly chunked, or stored without useful metadata, retrieval cannot fix it later.
Parsing
Parsing turns raw files into clean text.
For example, a PDF may contain headers, page numbers, tables, footnotes, and broken line breaks. The parser needs to extract the useful content while reducing noise.
Bad parsing creates fake retrieval problems. The retriever may fail not because the search algorithm is weak, but because the useful content was never correctly extracted.
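To make the noise-reduction step concrete, here is a minimal sketch of post-extraction cleanup. The function name `clean_extracted_text` and the specific rules (page-number lines, hyphenated line breaks) are assumptions chosen for illustration; a real parser would need rules tuned to its own documents.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Hypothetical cleanup for text extracted from a PDF page."""
    lines = raw.splitlines()
    # Drop lines that contain only a page number, e.g. "12" or "Page 12".
    page_number = re.compile(r"\s*(Page\s+)?\d+\s*", flags=re.IGNORECASE)
    lines = [ln for ln in lines if not page_number.fullmatch(ln)]
    text = "\n".join(lines)
    # Rejoin words split across a line break: "proc-\nessed" -> "processed".
    # Caveat: this also merges genuine hyphenated compounds broken at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, but keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

raw_page = "Refund policy\n\nRefunds are proc-\nessed within\n14 days.\nPage 3\n"
print(clean_extracted_text(raw_page))
```

Comparing the cleaned text against the original file for a few sample pages is a cheap way to check parsing coverage before blaming retrieval.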
Chunking
Chunking decides how the document is sliced.
A chunk that is too small may lose context. A chunk that is too large may contain too many unrelated ideas. Both can reduce retrieval quality.
Common chunking strategies include:
- fixed-size chunking
- paragraph-based chunking
- heading-aware chunking
- sliding-window chunking
- parent-child chunking
The best strategy depends on the document type. API docs, legal documents, product manuals, and support articles usually need different chunking rules.
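As one illustration of the strategies listed above, here is a minimal sketch of sliding-window chunking over whitespace-separated words. The function name and the default `size` and `overlap` values are assumptions; real systems often count tokens instead of words and pick sizes per document type.

```python
def sliding_window_chunks(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once this window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunks(sample, size=4, overlap=1):
    print(chunk)
```

The overlap is what keeps a sentence that straddles a boundary visible in at least one chunk, at the cost of storing some text twice.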
Embedding
Embedding converts text into vector form.
The goal is to make semantic search possible. When the user asks a question, the system embeds the question and compares it with stored document vectors.
Embedding quality affects whether similar meaning can be found even when the exact keyword is different.
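The comparison step can be sketched with cosine similarity, which is one common choice of similarity measure. The three-dimensional vectors and chunk IDs below are made-up values standing in for the output of an embedding model; no particular model or library is assumed.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical stored chunk vectors (a real embedding has hundreds of dimensions).
stored_vectors = {
    "refund-policy": [0.9, 0.1, 0.0],
    "api-rate-limits": [0.1, 0.8, 0.3],
    "office-hours": [0.0, 0.2, 0.9],
}
# Hypothetical vector for the user question, produced by the same model.
question_vector = [0.8, 0.2, 0.1]

ranked_ids = sorted(
    stored_vectors,
    key=lambda chunk_id: cosine_similarity(question_vector, stored_vectors[chunk_id]),
    reverse=True,
)
print(ranked_ids)
```

The question and the chunks must be embedded with the same model, otherwise the vectors live in different spaces and the scores are meaningless.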
Storing
Storing means saving the prepared knowledge into a searchable backend.
Common storage choices include:
- vector database
- PostgreSQL with pgvector
- Elasticsearch or OpenSearch
- hybrid storage with both vector and keyword indexes
A practical RAG system usually stores both the chunk content and the source reference. Without source reference, the system may answer correctly but cannot explain where the answer came from.
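A minimal sketch of storing the chunk together with its source reference, using Python's built-in `sqlite3` as a stand-in backend. The table layout, column names, and helper functions are assumptions for illustration; a production system would more likely use one of the backends listed above, with a native vector column instead of JSON text.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE chunks (
        chunk_id    INTEGER PRIMARY KEY,
        document_id TEXT NOT NULL,
        source_url  TEXT NOT NULL,
        chunk_index INTEGER NOT NULL,
        content     TEXT NOT NULL,
        embedding   TEXT NOT NULL  -- JSON-encoded vector, only for this sketch
    )
    """
)

def store_chunk(document_id: str, source_url: str, chunk_index: int,
                content: str, embedding: list[float]) -> int:
    """Insert one chunk with its source reference and return its row ID."""
    cursor = conn.execute(
        "INSERT INTO chunks (document_id, source_url, chunk_index, content, embedding) "
        "VALUES (?, ?, ?, ?, ?)",
        (document_id, source_url, chunk_index, content, json.dumps(embedding)),
    )
    conn.commit()
    return cursor.lastrowid

def source_of(chunk_id: int):
    """Return (source_url, chunk_index) for a chunk, or None if it does not exist."""
    return conn.execute(
        "SELECT source_url, chunk_index FROM chunks WHERE chunk_id = ?", (chunk_id,)
    ).fetchone()

new_id = store_chunk("doc-1", "https://example.com/refunds", 0,
                     "Refunds are processed within 14 days.", [0.9, 0.1, 0.0])
print(source_of(new_id))
```

Because the source reference sits in the same row as the content, any chunk that reaches the LLM can be traced back to its document without a second lookup system.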
Metadata Design
Metadata makes retrieval more controllable.
Useful metadata can include:
| Metadata | Why It Matters |
|---|---|
| document_id | Trace the chunk back to the original document |
| source_url | Show citation or source location |
| title | Give the LLM better context |
| section | Help retrieval target the correct area |
| domain | Filter by product, department, or topic |
| created_at | Support freshness filtering |
| chunk_index | Rebuild document order when needed |
Metadata is not decoration. It is part of retrieval control and debugging.
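The fields in the table above could be modeled roughly as follows. The `ChunkMeta` class and the `filter_chunks` helper are hypothetical names; the sketch only shows how `domain` and `created_at` can narrow the candidate set before any similarity search runs.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ChunkMeta:
    document_id: str
    source_url: str
    title: str
    section: str
    domain: str
    created_at: date
    chunk_index: int

def filter_chunks(chunks: list[ChunkMeta], domain: Optional[str] = None,
                  created_after: Optional[date] = None) -> list[ChunkMeta]:
    """Keep only chunks that match the given domain and freshness constraints."""
    kept = []
    for chunk in chunks:
        if domain is not None and chunk.domain != domain:
            continue
        if created_after is not None and chunk.created_at <= created_after:
            continue
        kept.append(chunk)
    return kept

catalog = [
    ChunkMeta("doc-1", "https://example.com/refunds", "Refund policy", "Timing",
              "billing", date(2024, 5, 1), 0),
    ChunkMeta("doc-2", "https://example.com/old-refunds", "Refund policy (old)", "Timing",
              "billing", date(2021, 1, 15), 0),
    ChunkMeta("doc-3", "https://example.com/api", "API guide", "Rate limits",
              "platform", date(2024, 6, 10), 2),
]
fresh_billing = filter_chunks(catalog, domain="billing", created_after=date(2023, 1, 1))
print([c.document_id for c in fresh_billing])
```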
Step 2: Retrieval Overview
Retrieval means finding candidate chunks that may answer the user question.
At this stage, the system does not need to produce the final answer yet. It only needs to return useful evidence.
Common retrieval strategies include:
Dense Retrieval
Uses embeddings to find semantically similar chunks. Good when the user question uses different wording from the source document.
Keyword Retrieval
Uses exact or near-exact terms. Good for IDs, names, codes, error messages, and specific technical terms.
Hybrid Retrieval
Combines dense retrieval and keyword retrieval. Often stronger because it handles both meaning and exact terms.
Metadata Filtering
Limits search space by fields like domain, product, document type, date, or permission scope.
Multi-Query Retrieval
Generates multiple query variations and searches with them. Useful when the user question is vague or underspecified.
Parent-Child Retrieval
Retrieves small chunks but returns larger parent context. Useful when small chunks are searchable but not enough for final reasoning.
Retrieval is where many RAG systems start to fail. The answer may be wrong simply because the correct chunk never reached the LLM.
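Hybrid retrieval needs some way to merge the dense and keyword result lists. One commonly used option is reciprocal rank fusion (RRF), sketched below. The chunk IDs are made up, and `k = 60` is a conventional default rather than a value taken from this log; RRF is only one of several possible fusion methods.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) in every list it appears in, and the
    scores are summed, so chunks ranked well by several retrievers rise.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by chunk ID for stable output.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

dense_results = ["c1", "c2", "c3"]      # hypothetical output of vector search
keyword_results = ["c3", "c1", "c4"]    # hypothetical output of keyword search
print(reciprocal_rank_fusion([dense_results, keyword_results]))
```

RRF uses only rank positions, so it avoids having to normalize vector-similarity scores and keyword scores onto a shared scale.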
Step 3: Reranking Overview
Reranking happens after retrieval.
The retriever may return 20, 50, or 100 candidate chunks. Reranking reorders them based on how useful they are for the actual question.
The purpose is not to search the whole database again. The purpose is to improve the quality of the candidate list before sending context into the LLM.
| Stage | Role |
|---|---|
| Retrieval | Find possible chunks quickly |
| Reranking | Sort possible chunks more carefully |
| LLM Context Selection | Keep only the best chunks for generation |
Reranking is useful because the top result from vector search is not always the most answerable chunk. A chunk can be semantically similar but still not contain the exact answer.
Common reranking approaches include:
- cross-encoder reranker
- LLM-based reranking
- score fusion
- rule-based boosting with metadata
- source priority ranking
Reranking becomes more important when the knowledge base is large, the question is specific, or the retrieved chunks are noisy.
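Of the approaches listed above, rule-based boosting with metadata is the simplest to sketch. The `Candidate` class, the `SOURCE_BOOST` values, and the source type names are all assumptions for illustration; a cross-encoder or LLM-based reranker would replace the scoring function with a model call.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    chunk_id: str
    retrieval_score: float  # score from the retriever, assumed higher is better
    source_type: str        # hypothetical metadata field

# Hypothetical boosts: prefer official documentation over forum posts.
SOURCE_BOOST = {"official_docs": 0.2, "forum": 0.0}

def rerank(candidates: list[Candidate], top_n: int = 3) -> list[Candidate]:
    """Reorder candidates by retrieval score plus a metadata boost, then keep top_n."""
    def final_score(candidate: Candidate) -> float:
        return candidate.retrieval_score + SOURCE_BOOST.get(candidate.source_type, 0.0)
    return sorted(candidates, key=final_score, reverse=True)[:top_n]

retrieved = [
    Candidate("a", 0.80, "forum"),
    Candidate("b", 0.70, "official_docs"),
    Candidate("c", 0.75, "forum"),
]
print([c.chunk_id for c in rerank(retrieved, top_n=2)])
```

Note that the function only reorders the list it is given. It never searches the store again, which matches the role of reranking described above.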
Step 4: LLM Design
The LLM stage is where the system converts retrieved evidence into a final answer.
This part is not only about choosing a model. It also includes prompt design, context layout, output format, refusal rules, and citation behavior.
A basic RAG prompt usually needs:
| Design Area | Purpose |
|---|---|
| Role instruction | Tell the model how to behave |
| Context block | Provide retrieved chunks |
| Answer rules | Control whether the model can infer or must stay grounded |
| Citation rules | Force the model to reference sources |
| Unknown handling | Tell the model what to do when evidence is insufficient |
| Output format | Make the answer stable for the user or application |
The LLM should not be treated as a magic fixer. If the retrieved context is wrong, missing, or noisy, the LLM will usually produce a weak answer.
A better model may hide the problem by writing more fluently, but fluency is not the same as correctness.
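The design areas in the table can be assembled into a prompt roughly like this. The wording of each instruction, the `build_rag_prompt` name, and the chunk dictionary keys are assumptions; the sketch only shows where each design area lands in the final prompt text.

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded-answer prompt from a question and retrieved chunks.

    Each chunk is a dict with hypothetical keys "source_url" and "content".
    """
    # Context block: number each chunk so the model can cite it as [n].
    context = "\n\n".join(
        f"[{i}] source={chunk['source_url']}\n{chunk['content']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    parts = [
        # Role instruction
        "You are a support assistant for internal documentation.",
        # Answer rules
        "Answer using only the numbered context below. Do not use outside knowledge.",
        # Citation rules
        "After each claim, cite the supporting chunk number in square brackets, e.g. [1].",
        # Unknown handling
        "If the context does not contain the answer, reply: I don't know based on the provided sources.",
        # Output format
        "Respond in at most three sentences.",
        f"Context:\n{context}",
        f"Question: {question}",
    ]
    return "\n\n".join(parts)

prompt = build_rag_prompt(
    "How long do refunds take?",
    [{"source_url": "https://example.com/refunds",
      "content": "Refunds are processed within 14 days."}],
)
print(prompt)
```

Keeping the prompt in one function also makes it easy to log the exact text that was sent to the model.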
Step 5: Debugging System Design
A RAG system needs debugging from the beginning.
When the answer is wrong, the team needs to know which stage failed. Without logs and evaluation checkpoints, every failure looks like an LLM problem.
A practical debugging system should track these questions:
Source Coverage
Does the correct answer exist in the original source documents?
Parsing Coverage
Was the correct text extracted from the original source?
Chunk Coverage
Did the correct information appear inside at least one chunk?
Retrieval Hit
Did the retriever return the correct chunk in top-k?
Reranking Quality
Did reranking move the useful chunk higher or lower?
LLM Grounding
Did the LLM use the retrieved evidence correctly?
The system should record the user question, generated search query, retrieved chunks, reranked order, selected context, final prompt, model answer, and source references.
This makes debugging concrete. Instead of saying "the model is bad", you can say "the correct chunk exists, but retrieval did not return it" or "retrieval worked, but the LLM ignored the evidence."
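A minimal sketch of such a trace record, with a helper that names the first stage where a known-correct chunk went missing. The `RagTrace` fields mirror the items listed above; the helper, its stage labels, and the idea of a labeled `gold_chunk_id` per test question are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class RagTrace:
    question: str
    search_query: str
    retrieved_ids: list[str]   # retriever output, in order
    reranked_ids: list[str]    # order after reranking
    selected_ids: list[str]    # chunks actually placed in the prompt
    final_prompt: str
    answer: str
    cited_ids: list[str]       # sources the answer referenced

def first_failing_stage(trace: RagTrace, gold_chunk_id: str, indexed_ids: set[str]) -> str:
    """Return the first stage at which the known-correct chunk was lost."""
    if gold_chunk_id not in indexed_ids:
        return "indexing"      # source, parsing, or chunking coverage problem
    if gold_chunk_id not in trace.retrieved_ids:
        return "retrieval"
    if gold_chunk_id not in trace.selected_ids:
        return "reranking"     # retrieved, but ranked too low to be selected
    if gold_chunk_id not in trace.cited_ids:
        return "generation"    # evidence was in the prompt but not used
    return "ok"

trace = RagTrace(
    question="How long do refunds take?",
    search_query="refund processing time",
    retrieved_ids=["c7", "c2", "c9"],
    reranked_ids=["c2", "c7", "c9"],
    selected_ids=["c2", "c7"],
    final_prompt="(prompt text)",
    answer="Refunds take 14 days [1].",
    cited_ids=["c2"],
)
print(first_failing_stage(trace, gold_chunk_id="c9", indexed_ids={"c2", "c7", "c9"}))
```

Running this over a small set of questions with known answers turns "the model is bad" into a per-stage failure count.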
Practical Build Order
A simple build order looks like this:
| Order | Build Area | Goal |
|---|---|---|
| 1 | Source ingestion | Collect and normalize documents |
| 2 | Parsing | Extract clean text |
| 3 | Chunking | Create searchable knowledge units |
| 4 | Embedding and storing | Save chunks into the retrieval backend |
| 5 | Metadata design | Add filters, tracing fields, and source references |
| 6 | Retrieval | Return candidate chunks |
| 7 | Reranking | Improve candidate order |
| 8 | LLM prompt design | Generate grounded answers |
| 9 | Debugging system | Identify which stage failed |
This order matters because each stage depends on the previous one.
If indexing is broken, retrieval will be weak. If retrieval is weak, reranking has bad candidates. If context is wrong, the LLM has no reliable evidence to answer from.
The Main Principle
RAG should be built as a traceable pipeline, not as one large LLM call.
The reason is practical: LLMs are strong, but their internal knowledge is static and incomplete for many real applications. RAG gives the model external, runtime knowledge so the answer can be more accurate, current, and verifiable.
Indexing decides what knowledge exists. Retrieval decides what evidence is found. Reranking decides what evidence is prioritized. LLM design decides how evidence becomes an answer. Debugging decides whether the team can locate failure precisely.
The practical rule is simple: every stage should produce something inspectable. If you cannot inspect it, you cannot debug it.