RAG Debugging System Design

A RAG debugging system is the observability layer for the whole RAG pipeline. It helps us identify where an answer failed: source data, parsing, chunking, metadata, vector storage, retrieval, reranking, prompt design, or LLM generation. Without a debugging design, every bad answer looks like "the model is wrong", a diagnosis too vague to act on.

Short Answer

RAG debugging means making every stage of the pipeline inspectable.

A practical RAG system should be able to answer these questions:

1. Source Exists

Does the correct answer exist in the original source data?

2. Parsed Correctly

Did parsing extract the correct text, table, heading, or metadata?

3. Chunked Correctly

Did the correct information appear inside a useful chunk?

4. Retrieved Correctly

Did retrieval return the correct chunk in top-k?

5. Reranked Correctly

Did reranking move the correct chunk up or push it down?

6. Generated Correctly

Did the LLM use the evidence correctly and follow the prompt rules?

The goal is not only to collect logs. The goal is to make failure location obvious.

Why Debugging Is Important

RAG has many failure points.

When the final answer is wrong, the problem may come from several different stages:

Failure Location	Example
Source data	The answer does not exist in the source document
Parsing	The correct paragraph was skipped or table was flattened badly
Chunking	The answer and its condition were split apart
Metadata	The correct chunk was filtered out by wrong product or version
Embedding	The query and chunk were not semantically close
Vector database	The record was not stored or indexed correctly
Retrieval	The correct chunk did not appear in top-k
Reranking	The correct chunk was retrieved but ranked too low
Prompt	The model was not told how to handle missing evidence
LLM generation	The model ignored evidence or over-inferred

Without a debugging system, the team may guess. Guessing creates random fixes.

A team may change the prompt when the real issue is chunking, swap the embedding model when the real issue is metadata filtering, or blame the LLM when the correct chunk never reached the context.

Debugging design prevents this waste.

The Core Debugging Principle

The core principle is simple:

Every RAG stage should produce an inspectable artifact.

That means every stage should have output that can be saved, reviewed, compared, and evaluated.

Stage	Inspectable Artifact
Source ingestion	Source document ID and raw content reference
Parsing	Parsed document object
Chunking	Chunk list with structure and metadata
Embedding	Embedding model, vector ID, dimension, timestamp
Storage	Stored record with payload and metadata
Retrieval	Query, filters, top-k chunks, scores
Reranking	Candidate list before and after reranking
Prompting	Final prompt sent to the model
Generation	Final answer, citations, confidence, missing-info signal

If a stage has no artifact, it becomes hard to debug.

Good debugging is not added at the end. It should be part of pipeline design from the beginning.
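One way to make this principle concrete is a small trace object that every stage writes into. The following is a minimal sketch, not a definitive implementation: the `PipelineTrace` class, its method names, and the `q_001` question ID are hypothetical, and the stage names are simply lowercased from the table above.

```python
from dataclasses import dataclass, field
from typing import Any

# Stage names mirror the artifact table; the identifiers are assumptions.
STAGES = [
    "source_ingestion", "parsing", "chunking", "embedding", "storage",
    "retrieval", "reranking", "prompting", "generation",
]


@dataclass
class PipelineTrace:
    """Collects one inspectable artifact per pipeline stage."""
    question_id: str
    artifacts: dict[str, Any] = field(default_factory=dict)

    def record(self, stage: str, artifact: Any) -> None:
        # Reject unknown stage names so typos do not create silent gaps.
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.artifacts[stage] = artifact

    def missing_stages(self) -> list[str]:
        """Stages that produced no artifact and are therefore hard to debug."""
        return [s for s in STAGES if s not in self.artifacts]


trace = PipelineTrace(question_id="q_001")
trace.record("parsing", {"parse_status": "success", "sections_detected": 6})
trace.record("retrieval", {"top_k": 5, "results": []})
print(trace.missing_stages())
```

Calling `missing_stages()` after a request gives a quick list of the stages that still run blind.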

Debugging Layer 1: Source Coverage

Source coverage checks whether the answer exists in the original data.

This should be the first question:

Does the correct answer exist in the source documents?

If the source does not contain the answer, retrieval and prompt design cannot fix it.

Example failure:

User asks:
"Can enterprise customers get refunds after 30 days?"

Source document only says:
"Enterprise customers with custom contracts should contact the account manager."

In this case, the source does not contain the exact refund rule. The correct behavior is to say the available context is insufficient.

Source coverage should track:

document ID
source URI
document version
ingestion time
owner
domain
whether the document is active, archived, or draft

This helps distinguish between a model failure and a data coverage failure.
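The tracked fields can be held in a small record type. This is a sketch under stated assumptions: the `SourceRecord` class, the `answerable_sources` helper, the example URIs, the owner name, and the archived 2025.10 document are all hypothetical and only illustrate how the status field separates usable sources from stale ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    document_id: str
    source_uri: str
    version: str
    ingested_at: str  # ISO 8601 timestamp
    owner: str
    domain: str
    status: str       # "active", "archived", or "draft"


def answerable_sources(records, domain):
    """Return only active documents in the requested domain."""
    return [r for r in records if r.status == "active" and r.domain == domain]


records = [
    SourceRecord("doc_refund_policy_learnpro_2026_04",
                 "https://example.com/policies/refund-2026-04",
                 "2026.04", "2026-04-01T00:00:00Z",
                 "billing-team", "billing", "active"),
    # Hypothetical older version, kept only to show the status filter.
    SourceRecord("doc_refund_policy_learnpro_2025_10",
                 "https://example.com/policies/refund-2025-10",
                 "2025.10", "2025-10-01T00:00:00Z",
                 "billing-team", "billing", "archived"),
]

active_ids = [r.document_id for r in answerable_sources(records, "billing")]
print(active_ids)
```

If this list is empty for the domain a question belongs to, the failure is a coverage problem, not a model problem.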

Debugging Layer 2: Parsing Coverage

Parsing coverage checks whether the useful source content was extracted correctly.

A document may contain the right answer, but the parser may fail to extract it.

Common parsing failures:

Source Format	Parsing Risk
PDF	Broken line order, missing tables, repeated headers
HTML	Navigation text mixed with content
Markdown	Code blocks or headings parsed incorrectly
Tables	Row-column relationship lost
Scanned documents	Text extraction incomplete
Database records	Fields mapped incorrectly

Parsing debug logs should store the parsed object.

For example:

{
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "parser_version": "parser_policy_v1",
  "parse_status": "success",
  "sections_detected": 6,
  "tables_detected": 1,
  "warnings": []
}

If parsing drops the table or merges unrelated sections, retrieval may fail later.

The failure looks like retrieval, but the root cause is parsing.
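A parse record becomes more useful when it is compared against what a reviewer expects from the document. The sketch below assumes the record shape shown above; the `check_parse_record` function and its `expected_sections` and `expected_tables` parameters are hypothetical.

```python
def check_parse_record(record, expected_sections, expected_tables):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    status = record.get("parse_status")
    if status != "success":
        problems.append(f"parse_status is {status!r}")
    sections = record.get("sections_detected", 0)
    if sections < expected_sections:
        problems.append(f"expected {expected_sections} sections, found {sections}")
    tables = record.get("tables_detected", 0)
    if tables < expected_tables:
        problems.append(f"expected {expected_tables} tables, found {tables}")
    # Parser warnings are passed through so they are not lost.
    problems.extend(record.get("warnings", []))
    return problems


good = {
    "document_id": "doc_refund_policy_learnpro_2026_04",
    "parser_version": "parser_policy_v1",
    "parse_status": "success",
    "sections_detected": 6,
    "tables_detected": 1,
    "warnings": [],
}
# Same record, but simulating a parser that dropped the table.
bad = dict(good, tables_detected=0)

print(check_parse_record(good, expected_sections=6, expected_tables=1))
print(check_parse_record(bad, expected_sections=6, expected_tables=1))
```

The expected counts have to come from a human look at the source document, so this check fits best on a small set of high-value documents.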

Debugging Layer 3: Chunk Coverage

Chunk coverage checks whether the correct information exists inside the generated chunks.

The key question:

Did the correct answer appear in at least one chunk?

A bad chunk may be:

too small
too large
missing the heading
missing the condition
mixed with unrelated rules
separated from its table header
disconnected from source metadata

Example chunking failure:

Chunk A:
Customers can request a refund within 14 days after purchase.

Chunk B:
This only applies if they have completed less than 20% of the course content.

Neither chunk is safe to answer from on its own. Chunk A states the rule without its condition, and Chunk B states the condition without the rule it qualifies.

A useful chunk debug record should include:

{
  "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "chunk_strategy": "heading_aware_table_aware_v1",
  "section": "General Refund Rule",
  "text": "Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content.",
  "metadata": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "version": "2026.04"
  }
}

If the correct information is not inside any chunk, retrieval cannot find it.
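Chunk coverage can be checked mechanically by asking whether any single chunk contains every phrase the answer depends on. This is a minimal sketch using plain substring matching; the `chunks_containing` helper, the short chunk IDs, and the choice of required phrases are assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def chunks_containing(chunks, required_phrases):
    """Return IDs of chunks whose text contains every required phrase."""
    needles = [normalize(p) for p in required_phrases]
    hits = []
    for chunk in chunks:
        haystack = normalize(chunk["text"])
        if all(n in haystack for n in needles):
            hits.append(chunk["chunk_id"])
    return hits


# The rule and its condition, as they appear in the refund policy example.
required = ["within 14 days", "less than 20%"]

split_chunks = [
    {"chunk_id": "chunk_a",
     "text": "Customers can request a refund within 14 days after purchase."},
    {"chunk_id": "chunk_b",
     "text": "This only applies if they have completed less than 20% of the course content."},
]
merged_chunks = [
    {"chunk_id": "general_refund_rule",
     "text": "Customers can request a refund within 14 days after purchase "
             "if they have completed less than 20% of the course content."},
]

print(chunks_containing(split_chunks, required))
print(chunks_containing(merged_chunks, required))
```

An empty result for the split chunks shows that the rule and its condition never share a chunk, which is the failure described above.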

Debugging Layer 4: Retrieval Evaluation

Retrieval evaluation checks whether the correct chunk is returned for the user query.

The main question:

Did the correct chunk appear in top-k?

Useful metrics include:

Metric	Meaning
recall@k	What fraction of the relevant chunks appeared within top-k?
precision@k	What fraction of the returned top-k chunks were relevant?
MRR	Mean reciprocal rank: averaged over queries, how high was the first correct chunk ranked?
hit rate	For what fraction of queries did top-k contain at least one correct chunk?
filtered recall	Recall with metadata filters applied: did filtering remove the correct chunk?

A useful retrieval debug record should store:

{
  "question": "Can I cancel my monthly subscription and get a refund for this month?",
  "retrieval_strategy": "metadata_filtered_hybrid",
  "filters": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "language": "en"
  },
  "top_k": 5,
  "results": [
    {
      "rank": 1,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
      "score": 0.82,
      "section": "General Refund Rule"
    },
    {
      "rank": 2,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
      "score": 0.79,
      "section": "Subscription Cancellation"
    }
  ]
}

This tells us whether retrieval found the right evidence and how high it ranked.

If the correct chunk is not in top-k, fix retrieval, metadata, chunking, embedding, or indexing before blaming the LLM.
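The metrics above are short to compute once each query has a list of retrieved chunk IDs and a set of known relevant IDs. The sketch below is illustrative: the function names are hypothetical, and precision is divided by the number of chunks actually returned, which is one common convention when fewer than k results come back.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunks that appear in the first k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the first k results (or fewer, if fewer returned) that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant) / len(top)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


GENERAL = "doc_refund_policy_learnpro_2026_04__general_refund_rule"
SUBSCRIPTION = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"

retrieved = [GENERAL, SUBSCRIPTION]  # order from the debug record above
relevant = {SUBSCRIPTION}

print(recall_at_k(retrieved, relevant, k=5))
print(precision_at_k(retrieved, relevant, k=5))
print(reciprocal_rank(retrieved, relevant))
```

For the record above, recall is perfect but the reciprocal rank is only 0.5, which is the signal that ordering, not recall, is the next thing to inspect.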

Debugging Layer 5: Reranking Evaluation

Reranking evaluation checks whether reranking improves evidence order.

The key question:

Did reranking move the most answerable chunk upward?

A useful reranking debug record stores before and after lists.

{
  "question": "Can I cancel my monthly subscription and get a refund for this month?",
  "before_rerank": [
    {
      "rank": 1,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
      "section": "General Refund Rule"
    },
    {
      "rank": 2,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
      "section": "Subscription Cancellation"
    }
  ],
  "after_rerank": [
    {
      "rank": 1,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
      "section": "Subscription Cancellation"
    },
    {
      "rank": 2,
      "chunk_id": "doc_refund_policy_learnpro_2026_04__general_refund_rule",
      "section": "General Refund Rule"
    }
  ],
  "reranker": "cross_encoder_v1"
}

This makes the reranking effect visible.

If reranking pushes the correct chunk down, the reranker is harming the system. If the correct chunk was never retrieved, reranking cannot help.
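Both outcomes can be read directly from the before and after lists. This is a small sketch that assumes the record shape shown above; the `rerank_effect` function and its return labels are hypothetical.

```python
def rank_of(results, chunk_id):
    """Return the recorded rank of chunk_id, or None if it is absent."""
    for item in results:
        if item["chunk_id"] == chunk_id:
            return item["rank"]
    return None


def rerank_effect(record, expected_chunk_id):
    """Label what reranking did to the expected chunk."""
    before = rank_of(record["before_rerank"], expected_chunk_id)
    after = rank_of(record["after_rerank"], expected_chunk_id)
    if before is None:
        return "not_retrieved"  # reranking cannot help here
    if after is None:
        return "dropped"        # reranker cut the chunk entirely
    if after < before:
        return "improved"
    if after > before:
        return "harmed"
    return "unchanged"


GENERAL = "doc_refund_policy_learnpro_2026_04__general_refund_rule"
SUBSCRIPTION = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"

record = {
    "before_rerank": [
        {"rank": 1, "chunk_id": GENERAL},
        {"rank": 2, "chunk_id": SUBSCRIPTION},
    ],
    "after_rerank": [
        {"rank": 1, "chunk_id": SUBSCRIPTION},
        {"rank": 2, "chunk_id": GENERAL},
    ],
    "reranker": "cross_encoder_v1",
}

print(rerank_effect(record, SUBSCRIPTION))
```

Counting these labels over an evaluation set shows whether the reranker helps more often than it hurts.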

Debugging Layer 6: Prompt and Generation Trace

The final stage checks what the LLM actually saw and how it answered.

The key questions:

What context did the model receive?
What prompt rules were active?
Did the model follow them?

A useful generation trace should store:

user question
selected chunks
source metadata
final prompt
model name
model parameters
final answer
cited sources
unsupported claims
refusal or insufficient-context signal

Example trace:

{
  "question": "Can I cancel my monthly subscription and get a refund for this month?",
  "model": "answer_model_v1",
  "selected_chunks": [
    "doc_refund_policy_learnpro_2026_04__subscription_cancellation"
  ],
  "prompt_template": "support_strict_grounding_v1",
  "answer": "You can cancel your monthly subscription at any time, but the current active month is not refunded.",
  "sources": [
    {
      "title": "Refund and Cancellation Policy",
      "section": "Subscription Cancellation",
      "version": "2026.04"
    }
  ],
  "grounding_status": "supported"
}

This allows us to distinguish between two different failures:

Case	Meaning
Correct context was not provided	Retrieval or context selection failure
Correct context was provided but ignored	Prompt or LLM generation failure

That distinction is critical.
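The two cases can be separated with a single membership check on the trace. The sketch below assumes the trace shape from the example; the `classify_failure` function, its labels, and the `answer_is_correct` flag (which would come from a human reviewer or an evaluation set) are hypothetical.

```python
def classify_failure(trace, expected_chunk_id, answer_is_correct):
    """Decide which side of the pipeline a wrong answer points to."""
    if answer_is_correct:
        return "no_failure"
    if expected_chunk_id not in trace["selected_chunks"]:
        # The model never saw the evidence.
        return "retrieval_or_context_selection_failure"
    # The evidence was in the context but the answer is still wrong.
    return "prompt_or_generation_failure"


SUBSCRIPTION = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"
GENERAL = "doc_refund_policy_learnpro_2026_04__general_refund_rule"

trace_with_evidence = {"selected_chunks": [SUBSCRIPTION]}
trace_without_evidence = {"selected_chunks": [GENERAL]}

print(classify_failure(trace_with_evidence, SUBSCRIPTION, answer_is_correct=False))
print(classify_failure(trace_without_evidence, SUBSCRIPTION, answer_is_correct=False))
```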

A Practical Failure Diagnosis Workflow

When a RAG answer is wrong, debug it in order.

1. Check Source

Confirm whether the answer exists in the original source document.

2. Check Parsing

Confirm whether the correct text, table, and headings were extracted.

3. Check Chunking

Confirm whether the answer exists inside a complete and useful chunk.

4. Check Metadata

Confirm whether product, version, language, permission, and domain fields are correct.

5. Check Retrieval

Confirm whether the correct chunk appears in top-k before reranking.

6. Check Reranking

Confirm whether reranking moves the correct chunk up or down.

7. Check Prompt

Confirm whether the final prompt contains the right evidence and rules.

8. Check Answer

Confirm whether the final answer is grounded in the selected evidence.

This order matters because downstream stages depend on upstream stages.

Do not debug the prompt first if the correct chunk was never retrieved.
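The ordered workflow can be expressed as a function that stops at the first stage that fails or was never checked. This is a sketch, assuming each check has already been reduced to True, False, or missing; the `first_failure` function and the stage keys are hypothetical names for the eight steps above.

```python
# Upstream to downstream, matching the eight checks above.
CHECK_ORDER = [
    "source", "parsing", "chunking", "metadata",
    "retrieval", "reranking", "prompt", "answer",
]


def first_failure(check_results):
    """Return (stage, reason) for the earliest problem, walking upstream first."""
    for stage in CHECK_ORDER:
        result = check_results.get(stage)
        if result is None:
            return stage, "not_checked"
        if not result:
            return stage, "failed"
    return None, "all_passed"


# Retrieval and prompt both look bad, but retrieval is upstream.
results = {
    "source": True, "parsing": True, "chunking": True, "metadata": True,
    "retrieval": False, "reranking": True, "prompt": False, "answer": False,
}
print(first_failure(results))
```

Reporting only the earliest failure keeps attention on the stage whose fix may also resolve the downstream symptoms.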

Debugging the Reusable Refund Policy Example

Use the same question from previous logs:

Can I cancel my monthly subscription and get a refund for this month?

Expected answer:

The user can cancel the monthly subscription at any time, but the current active month is not refunded.

A debugging table can look like this:

Stage	Check	Expected Result
Source	Does the subscription rule exist?	Yes
Parsing	Was the subscription section extracted?	Yes
Chunking	Is the full rule in one chunk or parent context?	Yes
Metadata	Is product = LearnPro and domain = billing?	Yes
Retrieval	Is subscription cancellation in top-k?	Yes
Reranking	Is it ranked above general refund rule?	Yes
Prompt	Is strict grounding active?	Yes
Generation	Does answer cite correct section?	Yes

If any row fails, that row becomes the next engineering task.

For example:

Failed Row	Likely Fix
Parsing failed	Improve parser or table extraction
Chunking failed	Use heading-aware or parent-child chunking
Metadata failed	Fix ingestion metadata mapping
Retrieval failed	Add hybrid search or adjust filters
Reranking failed	Tune reranker or add rule-based boost
Prompt failed	Add stricter grounding and unknown handling
Generation failed	Improve prompt or model selection

This turns one vague bad answer into a concrete pipeline diagnosis.

What to Log in Production

A production RAG system should log enough to debug without exposing sensitive data unnecessarily.

Important logs include:

Log Type	What to Store
Query log	user question, timestamp, user scope
Retrieval log	strategy, filters, top-k results, scores
Reranking log	before and after ranking
Context log	selected chunks and source metadata
Prompt log	prompt template version and final prompt reference
Answer log	final answer, citations, grounding status
Feedback log	user rating, correction, expected answer
Pipeline version log	parser, chunker, embedding model, retriever, reranker, prompt version

Be careful with privacy and permissions.

For sensitive systems, store references and hashes where possible. Do not casually log private user content into broad-access observability tools.
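One way to store a reference instead of raw text is a keyed hash of the question. The sketch below uses Python's standard `hmac` and `hashlib` modules; the `build_query_log` function, the log field names, the `customer:basic` scope value, and the demo key are assumptions, and a real key would come from a secret store.

```python
import hashlib
import hmac


def pseudonymize(text, secret_key):
    """Keyed SHA-256 digest, so identical questions match without storing the text."""
    return hmac.new(secret_key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def build_query_log(question, user_scope, secret_key, timestamp):
    """Build a query log entry that carries no raw user content."""
    return {
        "question_hash": pseudonymize(question, secret_key),
        "question_length": len(question),
        "user_scope": user_scope,
        "timestamp": timestamp,
    }


question = "Can I cancel my monthly subscription and get a refund for this month?"
demo_key = b"demo-key-not-for-production"  # placeholder key for the sketch

log_entry = build_query_log(question, "customer:basic", demo_key,
                            "2026-04-01T12:00:00Z")
print(log_entry["question_hash"][:12], log_entry["question_length"])
```

The raw question can then live in a restricted store keyed by the same hash, so only authorized reviewers can join the two.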

Common Debugging Mistakes

Many RAG teams fail because they only inspect the final answer.

Mistake	Result
Blaming the LLM first	Real indexing or retrieval issue remains unfixed
No source coverage check	Missing data is mistaken as model failure
No chunk inspection	Bad chunking hides behind retrieval metrics
No metadata audit	Wrong filters silently remove correct chunks
Only checking top 5	Correct chunk may sit just outside the cutoff, for example within top 50, and go unnoticed
No before/after reranking record	Reranker impact is unknown
No prompt versioning	Prompt changes cannot be compared
No evaluation set	Improvements cannot be proven

The most dangerous mistake is changing multiple pipeline stages at once.

If chunking, embedding, retrieval, and prompt all change together, the team cannot know what actually improved or broke the system.
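A pipeline version log makes this rule checkable before an experiment runs. The following sketch compares two version records and flags runs that change more than one component; the helper names and the `embedding_model_v1` identifier are hypothetical, while the other version strings reuse names from the earlier examples.

```python
def changed_components(baseline, candidate):
    """Map each changed component to its (old, new) version pair."""
    keys = sorted(set(baseline) | set(candidate))
    return {
        k: (baseline.get(k), candidate.get(k))
        for k in keys
        if baseline.get(k) != candidate.get(k)
    }


def is_isolated_change(baseline, candidate):
    """True when at most one component differs, so results stay attributable."""
    return len(changed_components(baseline, candidate)) <= 1


baseline = {
    "parser": "parser_policy_v1",
    "chunker": "heading_aware_table_aware_v1",
    "embedding_model": "embedding_model_v1",  # hypothetical name
    "reranker": "cross_encoder_v1",
    "prompt": "support_strict_grounding_v1",
}
# Hypothetical experiment that changes two things at once.
candidate = dict(baseline,
                 chunker="heading_aware_table_aware_v2",
                 prompt="support_strict_grounding_v2")

print(changed_components(baseline, candidate))
print(is_isolated_change(baseline, candidate))
```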

Minimum Debugging System

A minimum useful debugging system does not need to be complicated.

Start with this:

question_id
user_question
expected_answer
source_document_id
expected_chunk_id
retrieval_strategy
metadata_filters
retrieved_top_k
reranked_top_k
selected_context
prompt_template_version
model_answer
source_citations
grounding_status

This is enough to run basic failure analysis.

It can answer:

Did the correct chunk exist?
Did retrieval find it?
Did reranking prioritize it?
Did the prompt include it?
Did the model use it?

Once this exists, every improvement can be measured more clearly.
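With those fields in one record, the five questions reduce to a few lookups. This is a minimal sketch under several assumptions: `retrieved_top_k`, `reranked_top_k`, and `selected_context` are lists of chunk IDs, "prioritized" means ranked first, and a `grounding_status` of "supported" is taken as evidence that the model used the chunk. The `analyze` function and the `q_001` ID are hypothetical.

```python
def analyze(record):
    """Answer the five basic failure-analysis questions for one record."""
    expected = record.get("expected_chunk_id")
    reranked = record["reranked_top_k"]
    return {
        "correct_chunk_exists": expected is not None,
        "retrieval_found_it": expected in record["retrieved_top_k"],
        "reranking_prioritized_it": bool(reranked) and reranked[0] == expected,
        "prompt_included_it": expected in record["selected_context"],
        "model_used_it": record["grounding_status"] == "supported",
    }


GENERAL = "doc_refund_policy_learnpro_2026_04__general_refund_rule"
SUBSCRIPTION = "doc_refund_policy_learnpro_2026_04__subscription_cancellation"

record = {
    "question_id": "q_001",
    "user_question": "Can I cancel my monthly subscription and get a refund for this month?",
    "expected_chunk_id": SUBSCRIPTION,
    "retrieved_top_k": [GENERAL, SUBSCRIPTION],
    "reranked_top_k": [SUBSCRIPTION, GENERAL],
    "selected_context": [SUBSCRIPTION],
    "prompt_template_version": "support_strict_grounding_v1",
    "grounding_status": "supported",
}

for name, passed in analyze(record).items():
    print(f"{name}: {passed}")
```

The first key that comes back False points at the stage to inspect next.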

The Main Principle

RAG debugging is pipeline debugging.

A wrong answer is not one problem. It is a symptom. The failure may happen in source coverage, parsing, chunking, metadata, storage, retrieval, reranking, prompt design, or generation.

The practical rule is simple: make every stage inspectable. If the correct evidence disappears, the debugging system should show exactly where it disappeared.

