Step By Step Build Your RAG: System Overview

June 6, 2026 · 8 min read
Building a RAG system is not only about connecting documents to an LLM. A useful RAG system is a pipeline. Each stage decides whether the final answer has enough context, enough accuracy, and enough traceability.

This first log is an overview only. It explains why RAG exists, then walks through the full path: indexing, retrieval, reranking, LLM design, and debugging system design. Each part can be expanded into its own deeper log later.

Short Answer

RAG means Retrieval-Augmented Generation.

The simple idea is this: before the LLM answers, the system first retrieves relevant information from external knowledge sources. Then the retrieved context is passed into the LLM so the model can generate an answer based on that evidence.

A RAG system usually has five major stages:

1. Indexing

Convert raw source documents into searchable knowledge units.

2. Retrieval

Search the indexed knowledge and return candidate chunks.

3. Reranking

Reorder retrieved chunks so the most relevant ones move to the top.

4. LLM Design

Build the prompt, context format, answer rules, and citation behavior.

5. Debugging System

Track where failure happens: source data, chunking, retrieval, reranking, or generation.

RAG Pipeline Overview
  1. Indexing: Prepare searchable knowledge units
  2. Retrieval: Return candidate chunks
  3. Reranking: Move useful chunks to the top
  4. LLM Design: Generate a grounded answer
  5. Debugging: Trace failures across the pipeline

The main idea is simple: RAG quality is not decided by one model. It is decided by the whole path from document preparation to final answer generation.

Why RAG Is Needed

Modern LLMs are powerful because they are trained on large volumes of data using sophisticated training techniques. They can reason, summarize, rewrite, classify, and generate useful answers across many domains.

But an LLM still has several important limits.

| Limit | What It Means |
| --- | --- |
| Static knowledge | The model only knows what was available during training |
| Missing private data | The model does not automatically know your company docs, database records, tickets, or internal knowledge |
| Knowledge cutoff | The model may not know the latest events, latest product changes, or recent documentation updates |
| No built-in source trace | The model may answer fluently without showing where the answer came from |

This is why a powerful model alone is not always enough for production applications.

For example, if the user asks about a new company policy, a recent release note, a private support ticket, or a document that was created after the model was trained, the LLM may not have that information. It may answer with outdated knowledge, incomplete knowledge, or a confident guess.

RAG addresses this problem by retrieving relevant context from external knowledge sources before generation. These sources can be documents, databases, websites, tickets, logs, manuals, or internal notes.

The LLM is still important, but its job changes. Instead of answering only from memory, it answers with retrieved evidence.

| Without RAG | With RAG |
| --- | --- |
| The model answers from static training knowledge | The model answers with retrieved external context |
| Latest or private information may be missing | Latest or private information can be provided at runtime |
| Harder to verify the answer | Easier to attach citations and source references |
| Wrong answers may look fluent | Answers can be grounded against real documents |

RAG is essential when an LLM-based application needs to be accurate, current, and traceable. Without RAG, the model may still sound good, but the answer can be incomplete, outdated, or unsupported.

Step 1: Indexing Fundamentals

Indexing means preparing your source data so the system can search it later.

It usually includes five parts:

| Part | Purpose |
| --- | --- |
| Parsing | Extract readable text from PDFs, HTML, Markdown, database records, or documents |
| Chunking | Split the parsed text into smaller units |
| Embedding | Convert each chunk into a vector representation |
| Storing | Save chunks, embeddings, and document references into a database |
| Metadata Design | Attach useful fields for filtering, tracing, and debugging |

Indexing is the foundation of RAG. If the source text is missing, badly parsed, badly chunked, or stored without useful metadata, retrieval cannot fix it later.

Parsing

Parsing turns raw files into clean text.

For example, a PDF may contain headers, page numbers, tables, footnotes, and broken line breaks. The parser needs to extract the useful content while reducing noise.

Bad parsing creates fake retrieval problems. The retriever may fail not because the search algorithm is weak, but because the useful content was never correctly extracted.
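To make the noise-reduction step concrete, here is a minimal sketch of post-extraction cleanup. The function name `clean_extracted_text` and its rules are assumptions for illustration, not part of any parser library; real documents usually need rules tuned per source.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Reduce common PDF extraction noise. Hypothetical rules; tune per source."""
    lines = raw.splitlines()
    # Drop lines that hold only a page number, e.g. "12" or "Page 12".
    # Assumption: a bare number on its own line is never real content.
    page_number = re.compile(r"\s*(page\s+)?\d+\s*", flags=re.IGNORECASE)
    kept = [ln for ln in lines if not page_number.fullmatch(ln)]
    text = "\n".join(kept)
    # Rejoin words hyphenated across a line break: "retrie-\nval" -> "retrieval".
    # This also joins genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Running the cleaned text and the raw text side by side on a few sample pages is a cheap way to check that the rules remove noise without removing content.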

Chunking

Chunking decides how the document is sliced.

A chunk that is too small may lose context. A chunk that is too large may contain too many unrelated ideas. Both can reduce retrieval quality.

Common chunking strategies include:

  • fixed-size chunking
  • paragraph-based chunking
  • heading-aware chunking
  • sliding-window chunking
  • parent-child chunking

The best strategy depends on the document type. API docs, legal documents, product manuals, and support articles usually need different chunking rules.
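As an illustration of two strategies from the list, the sketch below implements sliding-window chunking over whitespace-separated words; setting `overlap=0` reduces it to fixed-size chunking. The function name and the default sizes are assumptions chosen for the example, and production systems often count tokens rather than words.

```python
def sliding_window_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk consists only of overlap.
        if start + size >= len(words):
            break
    return chunks
```

The overlap is the trade-off knob: more overlap keeps sentences that straddle a boundary retrievable, at the cost of more stored chunks.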

Embedding

Embedding converts text into vector form.

The goal is to make semantic search possible. When the user asks a question, the system embeds the question and compares it with stored document vectors.

Embedding quality affects whether similar meaning can be found even when the exact keyword is different.
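The comparison step can be sketched with cosine similarity, a common choice for comparing embedding vectors. The vectors in the usage note are made-up three-dimensional values for illustration only; a real system would get them from an embedding model, and the helper names `cosine_similarity` and `top_k` are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with toy vectors `{"refund-policy": [0.9, 0.1, 0.0], "api-limits": [0.0, 0.2, 0.9], "returns-faq": [0.8, 0.3, 0.1]}` and the query `[1.0, 0.2, 0.0]`, the two policy-related chunks rank above `api-limits`.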

Storing

Storing means saving the prepared knowledge into a searchable backend.

Common storage choices include:

  • vector database
  • PostgreSQL with pgvector
  • Elasticsearch or OpenSearch
  • hybrid storage with both vector and keyword indexes

A practical RAG system usually stores both the chunk content and the source reference. Without source reference, the system may answer correctly but cannot explain where the answer came from.
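A minimal sketch of storing content together with its source reference follows. It uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; it is not one of the backends listed above and has no vector index. The table layout and function names are assumptions.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the chunk table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               chunk_id    TEXT PRIMARY KEY,
               document_id TEXT NOT NULL,
               source_url  TEXT NOT NULL,
               chunk_index INTEGER NOT NULL,
               content     TEXT NOT NULL,
               embedding   TEXT NOT NULL  -- JSON-encoded list of floats
           )"""
    )
    return conn


def save_chunk(conn, chunk_id, document_id, source_url, chunk_index, content, embedding):
    """Insert or overwrite one chunk along with its source reference."""
    conn.execute(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
        (chunk_id, document_id, source_url, chunk_index, content, json.dumps(embedding)),
    )
    conn.commit()


def load_source(conn, chunk_id):
    """Return (document_id, source_url) for a chunk, or None if unknown."""
    return conn.execute(
        "SELECT document_id, source_url FROM chunks WHERE chunk_id = ?",
        (chunk_id,),
    ).fetchone()
```

The point of the sketch is the schema: every row carries `document_id` and `source_url`, so any chunk that reaches the LLM can be traced back and cited.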

Metadata Design

Metadata makes retrieval more controllable.

Useful metadata can include:

| Metadata | Why It Matters |
| --- | --- |
| document_id | Trace the chunk back to the original document |
| source_url | Show citation or source location |
| title | Give the LLM better context |
| section | Help retrieval target the correct area |
| domain | Filter by product, department, or topic |
| created_at | Support freshness filtering |
| chunk_index | Rebuild document order when needed |

Metadata is not decoration. It is part of retrieval control and debugging.
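The fields in the table above could be carried as a small record type, sketched below. The class name `ChunkMeta` and the helper `filter_chunks` are hypothetical; the field names come from the table, while the chosen types are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkMeta:
    document_id: str   # trace back to the original document
    source_url: str    # citation target
    title: str
    section: str
    domain: str        # product, department, or topic
    created_at: date   # used for freshness filtering
    chunk_index: int   # position within the document


def filter_chunks(
    chunks: list[ChunkMeta],
    domain: Optional[str] = None,
    created_after: Optional[date] = None,
) -> list[ChunkMeta]:
    """Keep chunks that match every filter that was supplied."""
    kept = []
    for meta in chunks:
        if domain is not None and meta.domain != domain:
            continue
        if created_after is not None and meta.created_at < created_after:
            continue
        kept.append(meta)
    return kept
```

Filtering like this narrows the search space before any similarity scoring runs, which is why the fields need to be designed at indexing time.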

Step 2: Retrieval Overview

Retrieval means finding candidate chunks that may answer the user question.

At this stage, the system does not need to produce the final answer yet. It only needs to return useful evidence.

Common retrieval strategies include:

Dense Retrieval

Uses embeddings to find semantically similar chunks. Good when the user question uses different wording from the source document.

Keyword Retrieval

Uses exact or near-exact terms. Good for IDs, names, codes, error messages, and specific technical terms.

Hybrid Retrieval

Combines dense retrieval and keyword retrieval. Often stronger because it handles both meaning and exact terms.

Metadata Filtering

Limits search space by fields like domain, product, document type, date, or permission scope.

Multi-Query Retrieval

Generates multiple query variations and searches with them. Useful when the user question is vague or underspecified.

Parent-Child Retrieval

Retrieves small chunks but returns larger parent context. Useful when small chunks are searchable but not enough for final reasoning.

Retrieval is where many RAG systems start to fail. The answer may be wrong simply because the correct chunk never reached the LLM.
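One way to combine a dense result list with a keyword result list in hybrid retrieval is reciprocal rank fusion (RRF), which needs only the ranks, not comparable scores. This is one common option and not necessarily the method this series will use. The function name is an assumption, and `k=60` is a frequently used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranked list.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)
```

Chunks that appear near the top of both lists rise above chunks that only one retriever found, which matches the reason for using hybrid retrieval: handling both meaning and exact terms.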

Step 3: Reranking Overview

Reranking happens after retrieval.

The retriever may return 20, 50, or 100 candidate chunks. Reranking reorders them based on how useful they are for the actual question.

The purpose is not to search the whole database again. The purpose is to improve the quality of the candidate list before sending context into the LLM.

| Stage | Role |
| --- | --- |
| Retrieval | Find possible chunks quickly |
| Reranking | Sort possible chunks more carefully |
| LLM Context Selection | Keep only the best chunks for generation |

Reranking is useful because the top result from vector search is not always the chunk that best answers the question. A chunk can be semantically similar to the query and still not contain the answer.

Common reranking approaches include:

  • cross-encoder reranker
  • LLM-based reranking
  • score fusion
  • rule-based boosting with metadata
  • source priority ranking

Reranking becomes more important when the knowledge base is large, the question is specific, or the retrieved chunks are noisy.
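Of the approaches listed, rule-based boosting with metadata is the simplest to sketch without a model. The example below assumes each candidate is a dict with hypothetical keys `chunk_id`, `score`, and `domain`, and that a fixed additive boost is acceptable; both are illustrative assumptions.

```python
def rerank_with_boost(
    candidates: list[dict],
    preferred_domain: str,
    boost: float = 0.2,
    top_n: int = 3,
) -> list[dict]:
    """Reorder retrieved candidates, favoring chunks from the preferred domain.

    The retrieval score is kept as the base signal; the boost only
    nudges chunks whose metadata matches the question's domain.
    """
    def adjusted(candidate: dict) -> float:
        bonus = boost if candidate["domain"] == preferred_domain else 0.0
        return candidate["score"] + bonus

    # Sort the existing candidate list only; nothing is searched again.
    return sorted(candidates, key=adjusted, reverse=True)[:top_n]
```

A cross-encoder or LLM-based reranker would replace `adjusted` with a model call that scores the question and chunk together; the surrounding sort-and-truncate shape stays the same.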

Step 4: LLM Design

The LLM stage is where the system converts retrieved evidence into a final answer.

This part is not only about choosing a model. It also includes prompt design, context layout, output format, refusal rules, and citation behavior.

A basic RAG prompt usually needs:

| Design Area | Purpose |
| --- | --- |
| Role instruction | Tell the model how to behave |
| Context block | Provide retrieved chunks |
| Answer rules | Control whether the model can infer or must stay grounded |
| Citation rules | Force the model to reference sources |
| Unknown handling | Tell the model what to do when evidence is insufficient |
| Output format | Make the answer stable for the user or application |

The LLM should not be treated as a magic fixer. If the retrieved context is wrong, missing, or noisy, the LLM will usually produce a weak answer.

A better model may hide the problem by writing more fluently, but fluency is not the same as correctness.
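The design areas in the table above can be sketched as a small prompt builder. The wording of the instructions, the function name `build_rag_prompt`, and the chunk keys `source_url` and `content` are all assumptions for illustration; real prompts need testing against the chosen model.

```python
def build_rag_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble role, rules, numbered context, and the question into one prompt."""
    context_lines = []
    for number, chunk in enumerate(chunks, start=1):
        # Number each chunk so the model can cite it as [1], [2], ...
        context_lines.append(f"[{number}] ({chunk['source_url']}) {chunk['content']}")
    context_block = "\n".join(context_lines) if context_lines else "(no context retrieved)"
    return (
        # Role instruction and answer rules
        "You are a support assistant. Answer using only the context below.\n"
        # Citation rules
        "Cite the bracketed number of every chunk you rely on, for example [1].\n"
        # Unknown handling
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Keeping the builder as a plain function means the exact prompt string can be logged alongside the answer, which the debugging stage depends on.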

Step 5: Debugging System Design

A RAG system needs debugging from the beginning.

When the answer is wrong, the team needs to know which stage failed. Without logs and evaluation checkpoints, every failure looks like an LLM problem.

A practical debugging system should track these questions:

Source Coverage

Does the correct answer exist in the original source documents?

Parsing Coverage

Was the correct text extracted from the original source?

Chunk Coverage

Did the correct information appear inside at least one chunk?

Retrieval Hit

Did the retriever return the correct chunk in top-k?

Reranking Quality

Did reranking move the useful chunk higher or lower?

LLM Grounding

Did the LLM use the retrieved evidence correctly?

The system should record the user question, generated search query, retrieved chunks, reranked order, selected context, final prompt, model answer, and source references.

This makes debugging concrete. Instead of saying "the model is bad", you can say "the correct chunk exists, but retrieval did not return it" or "retrieval worked, but the LLM ignored the evidence."
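A rough sketch of how such a trace could be checked stage by stage is shown below. It assumes an evaluation example where the correct chunk is already known (`gold_chunk_id`), and it folds source, parsing, and chunk coverage into a single "indexing" check for brevity. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RagTrace:
    question: str
    gold_chunk_id: str           # chunk known to contain the answer (labeled by hand)
    indexed_chunk_ids: set[str]  # every chunk id present in the index
    retrieved_ids: list[str]     # retriever output, top-k
    context_ids: list[str]       # chunks kept after reranking and sent to the LLM
    answer_cites_gold: bool      # did the final answer cite the gold chunk?


def first_failing_stage(trace: RagTrace) -> str:
    """Walk the pipeline in order and name the first stage that lost the evidence."""
    if trace.gold_chunk_id not in trace.indexed_chunk_ids:
        return "indexing"    # missing at the source, parsing, or chunking step
    if trace.gold_chunk_id not in trace.retrieved_ids:
        return "retrieval"   # the chunk exists but was not returned in top-k
    if trace.gold_chunk_id not in trace.context_ids:
        return "reranking"   # returned, then dropped before generation
    if not trace.answer_cites_gold:
        return "generation"  # the evidence was present but not used
    return "ok"
```

Counting the returned labels over a set of evaluation questions shows which stage deserves attention first.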

Practical Build Order

A simple build order looks like this:

| Order | Build Area | Goal |
| --- | --- | --- |
| 1 | Source ingestion | Collect and normalize documents |
| 2 | Parsing | Extract clean text |
| 3 | Chunking | Create searchable knowledge units |
| 4 | Embedding and storing | Save chunks into the retrieval backend |
| 5 | Metadata design | Add filters, tracing fields, and source references |
| 6 | Retrieval | Return candidate chunks |
| 7 | Reranking | Improve candidate order |
| 8 | LLM prompt design | Generate grounded answers |
| 9 | Debugging system | Identify which stage failed |

This order matters because each stage depends on the previous one.

If indexing is broken, retrieval will be weak. If retrieval is weak, reranking has bad candidates. If context is wrong, the LLM has no reliable evidence to answer from.

The Main Principle

RAG should be built as a traceable pipeline, not as one large LLM call.

The reason is practical: LLMs are strong, but their internal knowledge is static and incomplete for many real applications. RAG gives the model external, runtime knowledge so the answer can be more accurate, current, and verifiable.

Indexing decides what knowledge exists. Retrieval decides what evidence is found. Reranking decides what evidence is prioritized. LLM design decides how evidence becomes an answer. Debugging decides whether the team can locate failure precisely.

The practical rule is simple: every stage should produce something inspectable. If you cannot inspect it, you cannot debug it.

Part 1 of 9 in the Step By Step Build Your RAG series.