RAG Debugging System Design

June 6, 2026 (9 min read)
A RAG debugging system is the observability layer for the whole RAG pipeline. It helps us identify where an answer failed: source data, parsing, chunking, metadata, vector storage, retrieval, reranking, prompt design, or LLM generation. Without debugging design, every bad answer looks like "the model is wrong", which is too vague to fix.

Short Answer

RAG debugging means making every stage of the pipeline inspectable.

A practical RAG system should be able to answer these questions:

1. Source Exists

Does the correct answer exist in the original source data?

2. Parsed Correctly

Did parsing extract the correct text, table, heading, or metadata?

3. Chunked Correctly

Did the correct information appear inside a useful chunk?

4. Retrieved Correctly

Did retrieval return the correct chunk in top-k?

5. Reranked Correctly

Did reranking move the correct chunk up or push it down?

6. Generated Correctly

Did the LLM use the evidence correctly and follow the prompt rules?

The goal is not only to collect logs. The goal is to make failure location obvious.

Why Debugging Is Important

RAG has many failure points.

When the final answer is wrong, the problem may come from several different stages:

| Failure Location | Example |
| --- | --- |
| Source data | The answer does not exist in the source document |
| Parsing | The correct paragraph was skipped or table was flattened badly |
| Chunking | The answer and its condition were split apart |
| Metadata | The correct chunk was filtered out by wrong product or version |
| Embedding | The query and chunk were not semantically close |
| Vector database | The record was not stored or indexed correctly |
| Retrieval | The correct chunk did not appear in top-k |
| Reranking | The correct chunk was retrieved but ranked too low |
| Prompt | The model was not told how to handle missing evidence |
| LLM generation | The model ignored evidence or over-inferred |

Without a debugging system, the team may guess. Guessing creates random fixes.

A team may change the prompt when the real issue is chunking, swap the embedding model when the real issue is metadata filtering, or blame the LLM when the correct chunk never reached the context.

Debugging design prevents this waste.

The Core Debugging Principle

The core principle is simple:

Every RAG stage should produce an inspectable artifact.

That means every stage should have output that can be saved, reviewed, compared, and evaluated.

| Stage | Inspectable Artifact |
| --- | --- |
| Source ingestion | Source document ID and raw content reference |
| Parsing | Parsed document object |
| Chunking | Chunk list with structure and metadata |
| Embedding | Embedding model, vector ID, dimension, timestamp |
| Storage | Stored record with payload and metadata |
| Retrieval | Query, filters, top-k chunks, scores |
| Reranking | Candidate list before and after reranking |
| Prompting | Final prompt sent to the model |
| Generation | Final answer, citations, confidence, missing-info signal |

If a stage has no artifact, it becomes hard to debug.

Good debugging is not added at the end. It should be part of pipeline design from the beginning.
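As a minimal sketch of the "one artifact per stage" idea, the snippet below collects stage outputs for a single question and reports which stages left nothing to inspect. The `PipelineTrace` class, the stage names in `STAGES`, and the `q_001` ID are illustrative assumptions, not part of any specific framework.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PipelineTrace:
    """Collects one inspectable artifact per pipeline stage for one question."""
    question_id: str
    artifacts: dict[str, Any] = field(default_factory=dict)

    def record(self, stage: str, artifact: Any) -> None:
        # Keep the stage output so it can be saved, reviewed, and compared later.
        self.artifacts[stage] = artifact

    def missing_stages(self, expected: list[str]) -> list[str]:
        # A stage with no artifact is a stage that will be hard to debug.
        return [stage for stage in expected if stage not in self.artifacts]


# Hypothetical stage names, mirroring the table above.
STAGES = ["source", "parsing", "chunking", "embedding", "storage",
          "retrieval", "reranking", "prompting", "generation"]

trace = PipelineTrace(question_id="q_001")
trace.record("source", {"document_id": "doc_refund_policy_learnpro_2026_04"})
trace.record("retrieval", {"top_k": 5, "results": []})

print(trace.missing_stages(STAGES))
```

In a real pipeline each stage would call `record` itself, so a gap in `missing_stages` points directly at an uninstrumented step.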

Debugging Layer 1: Source Coverage

Source coverage checks whether the answer exists in the original data.

This should be the first question:

Does the correct answer exist in the source documents?

If the source does not contain the answer, retrieval and prompt design cannot fix it.

Example failure:

User asks:
"Can enterprise customers get refunds after 30 days?"

Source document only says:
"Enterprise customers with custom contracts should contact the account manager."

In this case, the source does not contain the exact refund rule. The correct behavior is to say the available context is insufficient.

Source coverage should track:

  • document ID
  • source URI
  • document version
  • ingestion time
  • owner
  • domain
  • whether the document is active, archived, or draft

This helps distinguish between a model failure and a data coverage failure.

Debugging Layer 2: Parsing Coverage

Parsing coverage checks whether the useful source content was extracted correctly.

A document may contain the right answer, but the parser may fail to extract it.

Common parsing failures:

| Source Format | Parsing Risk |
| --- | --- |
| PDF | Broken line order, missing tables, repeated headers |
| HTML | Navigation text mixed with content |
| Markdown | Code blocks or headings parsed incorrectly |
| Tables | Row-column relationship lost |
| Scanned documents | Text extraction incomplete |
| Database records | Fields mapped incorrectly |

Parsing debug logs should store the parsed object.

For example:

{
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "parser_version": "parser_policy_v1",
  "parse_status": "success",
  "sections_detected": 6,
  "tables_detected": 1,
  "warnings": []
}

If parsing drops the table or merges unrelated sections, retrieval may fail later.

The failure looks like retrieval, but the root cause is parsing.
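One way to catch this early is to compare the parse record against what a manual look at the source says should be there. The sketch below assumes the record shape shown above; the `parse_issues` function and the expected counts are hypothetical and would come from a reviewer or a small fixture file.

```python
def parse_issues(record: dict, expected_sections: int, expected_tables: int) -> list[str]:
    """Return human-readable problems found in a parsing debug record."""
    issues = []
    if record["parse_status"] != "success":
        issues.append(f"parse_status is {record['parse_status']!r}")
    if record["sections_detected"] < expected_sections:
        issues.append(
            f"sections: expected {expected_sections}, got {record['sections_detected']}"
        )
    if record["tables_detected"] < expected_tables:
        issues.append(
            f"tables: expected {expected_tables}, got {record['tables_detected']}"
        )
    # Surface parser warnings next to the structural checks.
    issues.extend(f"parser warning: {w}" for w in record["warnings"])
    return issues


good = {
    "document_id": "doc_refund_policy_learnpro_2026_04",
    "parser_version": "parser_policy_v1",
    "parse_status": "success",
    "sections_detected": 6,
    "tables_detected": 1,
    "warnings": [],
}
# Same document, but the parser silently dropped the table.
dropped_table = {**good, "tables_detected": 0}

print(parse_issues(good, expected_sections=6, expected_tables=1))
print(parse_issues(dropped_table, expected_sections=6, expected_tables=1))
```

A record with `parse_status: "success"` can still fail this check, which is the point: status alone does not prove the table survived.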

Debugging Layer 3: Chunk Coverage

Chunk coverage checks whether the correct information exists inside the generated chunks.

The key question:

Did the correct answer appear in at least one chunk?

A bad chunk may be:

  • too small
  • too large
  • missing the heading
  • missing the condition
  • mixed with unrelated rules
  • separated from its table header
  • disconnected from source metadata

Example chunking failure:

Chunk A:
Customers can request a refund within 14 days after purchase.

Chunk B:
This only applies if they have completed less than 20% of the course content.

Neither chunk is fully safe alone. The rule and condition are separated.

A useful chunk debug record should include:

{
  "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "chunk_strategy": "heading_aware_table_aware_v1",
  "section": "General Refund Rule",
  "text": "Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content.",
  "metadata": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "version": "2026.04"
  }
}

If the correct information is not inside any chunk, retrieval cannot find it.
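A rough chunk coverage check can be sketched as a phrase search: list the phrases that must appear together (the rule and its condition) and find the chunks that contain all of them. The function names and the phrase list are assumptions for illustration, and substring matching is a deliberately simple stand-in for a proper answer-span check.

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so formatting differences do not hide a match.
    return " ".join(text.lower().split())


def chunks_covering(required_phrases: list[str], chunks: list[dict]) -> list[str]:
    """Return IDs of chunks whose text contains every required phrase."""
    wanted = [normalize(p) for p in required_phrases]
    covering = []
    for chunk in chunks:
        text = normalize(chunk["text"])
        if all(phrase in text for phrase in wanted):
            covering.append(chunk["chunk_id"])
    return covering


# The rule and its condition must live in the same chunk.
required = ["within 14 days", "less than 20%"]

split_chunks = [
    {"chunk_id": "chunk_a",
     "text": "Customers can request a refund within 14 days after purchase."},
    {"chunk_id": "chunk_b",
     "text": "This only applies if they have completed less than 20% of the course content."},
]
merged_chunks = [
    {"chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
     "text": "Customers can request a refund within 14 days after purchase "
             "if they have completed less than 20% of the course content."},
]

print(chunks_covering(required, split_chunks))   # no single chunk holds both parts
print(chunks_covering(required, merged_chunks))
```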

Debugging Layer 4: Retrieval Evaluation

Retrieval evaluation checks whether the correct chunk is returned for the user query.

The main question:

Did the correct chunk appear in top-k?

Useful metrics include:

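| Metric | Meaning |
| --- | --- |
| recall@k | What fraction of the relevant chunks appeared within top-k? |
| precision@k | What fraction of the returned chunks were useful? |
| MRR | How high was the first correct chunk ranked (mean of 1/rank across queries)? |
| hit rate | Did retrieval find at least one correct chunk in top-k? |
| filtered recall | Recall measured after metadata filters are applied, which shows whether filtering removed the correct chunk |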

A useful retrieval debug record should store:

{
  "question": "Can I cancel my monthly subscription and get a refund for this month?",
  "retrieval_strategy": "metadata_filtered_hybrid",
  "filters": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "language": "en"
  },
  "top_k": 5,
  "results": [
    {
      "rank": 1,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
      "score": 0.82,
      "section": "General Refund Rule"
    },
    {
      "rank": 2,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
      "score": 0.79,
      "section": "Subscription Cancellation"
    }
  ]
}

This tells us whether retrieval found the right evidence and how high it ranked.

If the correct chunk is not in top-k, fix retrieval, metadata, chunking, embedding, or indexing before blaming the LLM.
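The metrics above are small enough to compute by hand over a retrieval debug record. The following is a minimal sketch for a single query; the function names are illustrative, and `relevant` is assumed to come from a labeled evaluation set. Averaging `reciprocal_rank` over many queries gives MRR.

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> bool:
    # True if at least one relevant chunk appears in the first k results.
    return any(chunk_id in relevant for chunk_id in retrieved[:k])


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant chunks that appear in the first k results.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the returned chunks (at most k) that are relevant.
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant chunk, or 0.0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Chunk IDs in ranked order, taken from the retrieval record above.
retrieved = [
    "doc_refund_policy_learnpro_2026_04__general_refund_rule",
    "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
]
relevant = {"doc_refund_policy_learnpro_2026_04__subscription_cancellation"}

print(hit_at_k(retrieved, relevant, k=5))        # True
print(recall_at_k(retrieved, relevant, k=5))     # 1.0
print(precision_at_k(retrieved, relevant, k=5))  # 0.5
print(reciprocal_rank(retrieved, relevant))      # 0.5
```

Running the same functions once with filters on and once with filters off is a simple way to approximate filtered recall.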

Debugging Layer 5: Reranking Evaluation

Reranking evaluation checks whether reranking improves evidence order.

The key question:

Did reranking move the most answerable chunk upward?

A useful reranking debug record stores before and after lists.

{
  "question": "Can I cancel my monthly subscription and get a refund for this month?",
  "before_rerank": [
    {
      "rank": 1,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
      "section": "General Refund Rule"
    },
    {
      "rank": 2,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
      "section": "Subscription Cancellation"
    }
  ],
  "after_rerank": [
    {
      "rank": 1,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
      "section": "Subscription Cancellation"
    },
    {
      "rank": 2,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
      "section": "General Refund Rule"
    }
  ],
  "reranker": "cross_encoder_v1"
}

This makes the reranking effect visible.

If reranking pushes the correct chunk down, the reranker is harming the system. If the correct chunk was never retrieved, reranking cannot help.
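Given a before/after record in the shape shown above, the reranker's effect on one expected chunk can be labeled mechanically. This is a sketch; `rerank_effect` and its label strings are hypothetical names chosen for the example.

```python
from typing import Optional


def rank_of(chunk_id: str, ranked: list[dict]) -> Optional[int]:
    # Return the recorded rank of a chunk, or None if it is not in the list.
    for item in ranked:
        if item["chunk_id"] == chunk_id:
            return item["rank"]
    return None


def rerank_effect(record: dict, expected_chunk_id: str) -> str:
    """Label what reranking did to the expected chunk."""
    before = rank_of(expected_chunk_id, record["before_rerank"])
    after = rank_of(expected_chunk_id, record["after_rerank"])
    if before is None:
        return "not_retrieved"  # reranking cannot help here
    if after is None:
        return "dropped"        # reranker cut the chunk out entirely
    if after < before:
        return "improved"
    if after > before:
        return "harmed"
    return "unchanged"


GENERAL = "doc_refund_policy_learnpro_2026_04__general_refund_rule"
SUBSCRIPTION = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"

record = {
    "before_rerank": [{"rank": 1, "chunk_id": GENERAL},
                      {"rank": 2, "chunk_id": SUBSCRIPTION}],
    "after_rerank": [{"rank": 1, "chunk_id": SUBSCRIPTION},
                     {"rank": 2, "chunk_id": GENERAL}],
    "reranker": "cross_encoder_v1",
}

print(rerank_effect(record, SUBSCRIPTION))  # improved
```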

Debugging Layer 6: Prompt and Generation Trace

The final stage checks what the LLM actually saw and how it answered.

The key questions:

What context did the model receive?
What prompt rules were active?
Did the model follow them?

A useful generation trace should store:

  • user question
  • selected chunks
  • source metadata
  • final prompt
  • model name
  • model parameters
  • final answer
  • cited sources
  • unsupported claims
  • refusal or insufficient-context signal

Example trace:

{
  "question": "Can I cancel my monthly subscription and get a refund for this month?",
  "model": "answer_model_v1",
  "selected_chunks": [
    "doc_refund_policy_learnpro_2026_04__subscription_cancellation"
  ],
  "prompt_template": "support_strict_grounding_v1",
  "answer": "You can cancel your monthly subscription at any time, but the current active month is not refunded.",
  "sources": [
    {
      "title": "Refund and Cancellation Policy",
      "section": "Subscription Cancellation",
      "version": "2026.04"
    }
  ],
  "grounding_status": "supported"
}

This allows us to distinguish between two different failures:

| Case | Meaning |
| --- | --- |
| Correct context was not provided | Retrieval or context selection failure |
| Correct context was provided but ignored | Prompt or LLM generation failure |

That distinction is critical.
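The two cases can be separated with a single membership check on the generation trace. The sketch below assumes the trace shape shown above and that someone (a reviewer or an evaluation set) supplies the expected chunk ID and whether the answer was correct; `classify_failure` is a hypothetical helper.

```python
def classify_failure(trace: dict, expected_chunk_id: str, answer_is_correct: bool) -> str:
    """Decide which side of the pipeline a wrong answer should be blamed on."""
    if answer_is_correct:
        return "no_failure"
    if expected_chunk_id not in trace["selected_chunks"]:
        # The model never saw the evidence, so the prompt and LLM are not the first suspects.
        return "retrieval_or_context_selection_failure"
    # The evidence was in the context and the answer is still wrong.
    return "prompt_or_generation_failure"


EXPECTED = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"

trace_with_evidence = {"selected_chunks": [EXPECTED]}
trace_without_evidence = {
    "selected_chunks": ["doc_refund_policy_learnpro_2026_04__general_refund_rule"]
}

print(classify_failure(trace_with_evidence, EXPECTED, answer_is_correct=False))
print(classify_failure(trace_without_evidence, EXPECTED, answer_is_correct=False))
```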

A Practical Failure Diagnosis Workflow

When a RAG answer is wrong, debug it in order.

1. Check Source

Confirm whether the answer exists in the original source document.

2. Check Parsing

Confirm whether the correct text, table, and headings were extracted.

3. Check Chunking

Confirm whether the answer exists inside a complete and useful chunk.

4. Check Metadata

Confirm whether product, version, language, permission, and domain fields are correct.

5. Check Retrieval

Confirm whether the correct chunk appears in top-k before reranking.

6. Check Reranking

Confirm whether reranking moves the correct chunk up or down.

7. Check Prompt

Confirm whether the final prompt contains the right evidence and rules.

8. Check Answer

Confirm whether the final answer is grounded in the selected evidence.

This order matters because downstream stages depend on upstream stages.

Do not debug the prompt first if the correct chunk was never retrieved.
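Because the checks are ordered, the diagnosis reduces to "find the first stage that did not pass". A minimal sketch, assuming each check has already been reduced to a boolean; the stage keys in `CHECK_ORDER` are illustrative names for the eight steps above.

```python
from typing import Optional

# Upstream stages first: a downstream check is only meaningful if these pass.
CHECK_ORDER = ["source", "parsing", "chunking", "metadata",
               "retrieval", "reranking", "prompt", "answer"]


def first_failure(results: dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that failed or was never checked, else None."""
    for stage in CHECK_ORDER:
        # A missing entry is treated as a failure: an unchecked stage is not a passing one.
        if not results.get(stage, False):
            return stage
    return None


results = {
    "source": True,
    "parsing": True,
    "chunking": True,
    "metadata": True,
    "retrieval": False,  # correct chunk was not in top-k
    "reranking": False,
    "prompt": True,
    "answer": False,
}

print(first_failure(results))  # retrieval
```

Here the answer check also failed, but the function reports `retrieval`, since fixing the answer stage first would be wasted effort.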

Debugging the Reusable Refund Policy Example

Use the same question from previous logs:

Can I cancel my monthly subscription and get a refund for this month?

Expected answer:

The user can cancel the monthly subscription at any time, but the current active month is not refunded.

A debugging table can look like this:

| Stage | Check | Expected Result |
| --- | --- | --- |
| Source | Does the subscription rule exist? | Yes |
| Parsing | Was the subscription section extracted? | Yes |
| Chunking | Is the full rule in one chunk or parent context? | Yes |
| Metadata | Is product = LearnPro and domain = billing? | Yes |
| Retrieval | Is subscription cancellation in top-k? | Yes |
| Reranking | Is it ranked above general refund rule? | Yes |
| Prompt | Is strict grounding active? | Yes |
| Generation | Does answer cite correct section? | Yes |

If any row fails, that row becomes the next engineering task.

For example:

| Failed Row | Likely Fix |
| --- | --- |
| Parsing failed | Improve parser or table extraction |
| Chunking failed | Use heading-aware or parent-child chunking |
| Metadata failed | Fix ingestion metadata mapping |
| Retrieval failed | Add hybrid search or adjust filters |
| Reranking failed | Tune reranker or add rule-based boost |
| Prompt failed | Add stricter grounding and unknown handling |
| Generation failed | Improve prompt or model selection |

This turns one vague bad answer into a concrete pipeline diagnosis.

What to Log in Production

A production RAG system should log enough to debug without exposing sensitive data unnecessarily.

Important logs include:

| Log Type | What to Store |
| --- | --- |
| Query log | user question, timestamp, user scope |
| Retrieval log | strategy, filters, top-k results, scores |
| Reranking log | before and after ranking |
| Context log | selected chunks and source metadata |
| Prompt log | prompt template version and final prompt reference |
| Answer log | final answer, citations, grounding status |
| Feedback log | user rating, correction, expected answer |
| Pipeline version log | parser, chunker, embedding model, retriever, reranker, prompt version |

Be careful with privacy and permissions.

For sensitive systems, store references and hashes where possible. Do not casually log private user content into broad-access observability tools.
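As one possible way to store a reference instead of raw text, the sketch below logs a keyed hash of the user question using Python's standard `hmac` and `hashlib` modules. The function names, the log fields, and the inline key are assumptions for illustration; a real system would load the key from a secrets manager and keep the raw text in a restricted store.

```python
import hashlib
import hmac


def content_reference(text: str, secret_key: bytes) -> str:
    # Keyed hash (HMAC-SHA256): the same text always maps to the same reference,
    # but short or common questions cannot be guessed without the key.
    return hmac.new(secret_key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def redacted_query_log(question: str, user_scope: str, secret_key: bytes) -> dict:
    """Build a query log entry that carries no raw user text."""
    return {
        "question_ref": content_reference(question, secret_key),
        "question_length": len(question),
        "user_scope": user_scope,
    }


# Placeholder key for the example only.
key = b"example-key-not-for-production"
question = "Can I cancel my monthly subscription and get a refund for this month?"

entry = redacted_query_log(question, user_scope="billing_support", secret_key=key)
print(entry["question_ref"][:12], entry["question_length"])
```

The reference still lets engineers group repeated questions and join logs across stages without exposing the content in a broad-access tool.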

Common Debugging Mistakes

Many RAG teams fail because they only inspect the final answer.

| Mistake | Result |
| --- | --- |
| Blaming the LLM first | Real indexing or retrieval issue remains unfixed |
| No source coverage check | Missing data is mistaken as model failure |
| No chunk inspection | Bad chunking hides behind retrieval metrics |
| No metadata audit | Wrong filters silently remove correct chunks |
| Only checking top 5 | Correct chunk may exist in top 50 |
| No before/after reranking record | Reranker impact is unknown |
| No prompt versioning | Prompt changes cannot be compared |
| No evaluation set | Improvements cannot be proven |

The most dangerous mistake is changing multiple pipeline stages at once.

If chunking, embedding, retrieval, and prompt all change together, the team cannot know what actually improved or broke the system.
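A pipeline version log makes this easy to guard against: diff the component versions of two runs and flag any comparison where more than one thing changed. The helper below is a sketch, and version strings such as `embedding_model_v1` and `parent_child_v1` are made up for the example.

```python
def changed_components(baseline: dict, candidate: dict) -> list[str]:
    """Return the pipeline components whose version differs between two runs."""
    keys = sorted(set(baseline) | set(candidate))
    return [key for key in keys if baseline.get(key) != candidate.get(key)]


baseline = {
    "parser": "parser_policy_v1",
    "chunker": "heading_aware_table_aware_v1",
    "embedding_model": "embedding_model_v1",  # hypothetical version label
    "retriever": "metadata_filtered_hybrid",
    "reranker": "cross_encoder_v1",
    "prompt": "support_strict_grounding_v1",
}
# A run where two components were changed at the same time.
candidate = {
    **baseline,
    "chunker": "parent_child_v1",             # hypothetical version label
    "prompt": "support_strict_grounding_v2",  # hypothetical version label
}

changed = changed_components(baseline, candidate)
print(changed)
if len(changed) > 1:
    print("More than one stage changed; the metric difference cannot be attributed.")
```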

Minimum Debugging System

A minimum useful debugging system does not need to be complicated.

Start with this:

question_id
user_question
expected_answer
source_document_id
expected_chunk_id
retrieval_strategy
metadata_filters
retrieved_top_k
reranked_top_k
selected_context
prompt_template_version
model_answer
source_citations
grounding_status

This is enough to run basic failure analysis.

It can answer:

  • Did the correct chunk exist?
  • Did retrieval find it?
  • Did reranking prioritize it?
  • Did the prompt include it?
  • Did the model use it?

Once this exists, every improvement can be measured more clearly.
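The field list above maps naturally onto a small record type. The sketch below is one possible shape, assuming chunk IDs are stored as strings in ranked order and that `expected_chunk_id` is left as `None` when no chunk contains the answer; the `answer_checks` helper and its key names are illustrative, and "prioritized" is interpreted here as "ranked first after reranking".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugRecord:
    question_id: str
    user_question: str
    expected_answer: str
    source_document_id: str
    expected_chunk_id: Optional[str]  # None when no chunk contains the answer
    retrieval_strategy: str
    metadata_filters: dict
    retrieved_top_k: list   # chunk IDs in retrieval order
    reranked_top_k: list    # chunk IDs in reranked order
    selected_context: list  # chunk IDs placed in the prompt
    prompt_template_version: str
    model_answer: str
    source_citations: list  # chunk IDs cited by the answer
    grounding_status: str


def answer_checks(r: DebugRecord) -> dict:
    """Answer the five basic failure-analysis questions for one record."""
    cid = r.expected_chunk_id
    return {
        "chunk_exists": cid is not None,
        "retrieved": cid in r.retrieved_top_k,
        "reranked_first": bool(r.reranked_top_k) and r.reranked_top_k[0] == cid,
        "in_prompt": cid in r.selected_context,
        "cited": cid in r.source_citations,
    }


GENERAL = "doc_refund_policy_learnpro_2026_04__general_refund_rule"
SUBSCRIPTION = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"

rec = DebugRecord(
    question_id="q_001",
    user_question="Can I cancel my monthly subscription and get a refund for this month?",
    expected_answer="The user can cancel at any time, but the current active month is not refunded.",
    source_document_id="doc_refund_policy_learnpro_2026_04",
    expected_chunk_id=SUBSCRIPTION,
    retrieval_strategy="metadata_filtered_hybrid",
    metadata_filters={"product": "LearnPro Online Course Platform", "domain": "billing"},
    retrieved_top_k=[GENERAL, SUBSCRIPTION],
    reranked_top_k=[SUBSCRIPTION, GENERAL],
    selected_context=[SUBSCRIPTION],
    prompt_template_version="support_strict_grounding_v1",
    model_answer="You can cancel at any time, but the current active month is not refunded.",
    source_citations=[SUBSCRIPTION],
    grounding_status="supported",
)

print(answer_checks(rec))
```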

The Main Principle

RAG debugging is pipeline debugging.

A wrong answer is not one problem. It is a symptom. The failure may happen in source coverage, parsing, chunking, metadata, storage, retrieval, reranking, prompt design, or generation.

The practical rule is simple: make every stage inspectable. If the correct evidence disappears, the debugging system should show exactly where it disappeared.

Part 9 of 9 in the series "Step By Step Build Your RAG".