RAG Parsing and Structure Design

Parsing is the first real transformation step in a RAG system. It turns raw source material into text that can be chunked, embedded, stored, retrieved, and inspected. If parsing produces only one large plain-text blob, the system may still work, but the model has less context to work with and failures become harder to debug.

Short Answer

Parsing is not only text extraction. It is also structure recovery.

A useful parser should try to preserve:

document title
section headings
paragraph order
tables
lists
source location
metadata
relationships between content blocks

The goal is not to make the data look pretty. The goal is to store the source knowledge in a shape that helps both the retrieval system and the LLM.

For the Model

Good structure gives the model clearer context. It can see which title, section, table, or paragraph a piece of text belongs to.

For Us

Good structure makes the pipeline easier to inspect, evaluate, and debug when the answer is wrong.

What Parsing Means

Parsing means converting raw input into a usable internal representation.

The input may come from:

PDF files
Markdown files
HTML pages
API documentation
database records
support tickets
product manuals
meeting notes

The output should be more than a single string. A better output is a structured document object.

For example, this is weak parsing:

Refund Policy Refunds are allowed within 14 days. Digital products are not refundable after activation. Enterprise customers should contact support.

The text is readable, but the structure is gone. The system cannot clearly tell which part is the title, which part is the rule, and which part is an exception.

A better parser keeps the document shape:

{
  "title": "Refund Policy",
  "sections": [
    {
      "heading": "General Rule",
      "content": "Refunds are allowed within 14 days."
    },
    {
      "heading": "Digital Products",
      "content": "Digital products are not refundable after activation."
    },
    {
      "heading": "Enterprise Customers",
      "content": "Enterprise customers should contact support."
    }
  ]
}

This structure gives later stages more control.

Why Structure Benefits the Model

LLMs do not only consume text. They consume context.

If the context is a flat blob, the model needs to guess the relationship between sentences. If the context keeps structure, the model can understand the document more reliably.

Structure	Benefit to the Model
Title	Helps identify the document topic
Section heading	Explains the local context
Paragraph order	Preserves reasoning sequence
Table structure	Keeps rows and columns meaningful
Source reference	Supports citation and grounding
Metadata	Helps the model understand scope

A model can answer better when it knows that a sentence belongs to a specific section.

For example, the phrase "not refundable after activation" is much less ambiguous when the model also sees that it belongs under "Digital Products". Without the heading, the model may apply the rule too broadly.
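As a minimal sketch of that idea, the snippet below prefixes a passage with its title and section heading before it is handed to the model. The function name `render_context` and the "Document:" / "Section:" labels are assumptions made for illustration, not a required prompt format.

```python
def render_context(title: str, heading: str, text: str) -> str:
    """Prefix a passage with its document title and section heading."""
    return f"Document: {title}\nSection: {heading}\n\n{text}"


# The same sentence, first flat and then with its structural context attached.
flat = "Digital products are not refundable after activation."
structured = render_context("Refund Policy", "Digital Products", flat)
print(structured)
```

The model now sees the scope of the rule next to the rule itself, instead of having to infer it.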

Why Structure Benefits Us

A proper structure also benefits the engineer building the RAG system.

When the answer is wrong, we need to inspect the pipeline. A flat text blob makes that harder. A structured object makes it easier to ask precise questions.

Coverage Check

We can check whether the expected answer exists in the parsed output at all.

Chunk Debugging

We can inspect which section produced each chunk.

Retrieval Filtering

We can filter by document type, product, section, date, or permission scope.

Citation Trace

We can trace an answer back to the source document and section.

Without structure, every failure becomes vague. We may know the answer is wrong, but we cannot easily tell whether the problem came from parsing, chunking, retrieval, reranking, or the LLM.
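A coverage check of this kind can be sketched in a few lines. The example assumes the small title-and-sections shape shown earlier; `find_sections_containing` is a hypothetical helper, and plain substring matching stands in for whatever matching a real evaluation would use.

```python
from typing import Any


def find_sections_containing(parsed: dict[str, Any], needle: str) -> list[str]:
    """Return the headings of sections whose content mentions `needle`."""
    needle = needle.lower()
    return [
        section["heading"]
        for section in parsed["sections"]
        if needle in section["content"].lower()
    ]


parsed = {
    "title": "Refund Policy",
    "sections": [
        {"heading": "General Rule", "content": "Refunds are allowed within 14 days."},
        {"heading": "Digital Products", "content": "Digital products are not refundable after activation."},
        {"heading": "Enterprise Customers", "content": "Enterprise customers should contact support."},
    ],
}

hits = find_sections_containing(parsed, "14 days")
print(hits)
```

If the list is empty, the expected text never made it through parsing, so there is little point in tuning retrieval yet. If it is not empty, the headings tell us which section the later stages should have surfaced.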

What a Good Parsed Object Should Store

A useful parsed object should separate content from metadata.

Content is what the model may read. Metadata is what the system uses to filter, trace, rank, and debug.

Field	Purpose
document_id	Stable ID for tracing
source_type	PDF, HTML, Markdown, plain text, database, or manual input
source_uri	Original file path, URL, or database reference
title	Human-readable document name
language	Useful for multilingual retrieval
sections	Preserved document structure
blocks	Ordered content units inside each section
metadata	Product, domain, owner, date, permission, version
parse_warnings	Records extraction issues
created_at	When the parsed object was created

The important idea is that parsing should produce something inspectable.

If a table was dropped, record a warning. If a page had broken text extraction, record it. If the parser guessed a heading, record the confidence or the method.
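One possible way to hold these fields in code is a pair of dataclasses, sketched below. The class names, the `warn` helper, and the warning code `"table_dropped"` are illustrative assumptions; only the field names come from the table above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ParseWarning:
    code: str                   # hypothetical short code, e.g. "table_dropped"
    message: str                # human-readable description of the issue
    page: Optional[int] = None  # source location, when it is known


@dataclass
class ParsedDocument:
    document_id: str
    source_type: str
    source_uri: str
    title: str
    language: str = "en"
    metadata: dict[str, str] = field(default_factory=dict)
    sections: list[dict] = field(default_factory=list)
    parse_warnings: list[ParseWarning] = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def warn(self, code: str, message: str, page: Optional[int] = None) -> None:
        """Record an extraction issue instead of failing silently."""
        self.parse_warnings.append(ParseWarning(code, message, page))


doc = ParsedDocument(
    document_id="doc_example",
    source_type="pdf",
    source_uri="file:///tmp/example.pdf",  # placeholder path for the sketch
    title="Example",
)
doc.warn("table_dropped", "Could not recover table structure", page=3)
print(doc.parse_warnings)
```

Keeping warnings on the object means they travel with the document, so a later debugging session can see what the parser was unsure about.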

Common Parsing Mistakes

Parsing mistakes often look like retrieval or LLM mistakes later.

Mistake	Later Symptom
Removing headings	Retrieved chunk lacks context
Flattening tables	Model misreads row-column relationships
Keeping repeated page headers	Retrieval returns noisy chunks
Losing source references	Answer cannot cite the origin
Mixing unrelated sections	Chunk contains conflicting rules
Ignoring permissions	User may retrieve content they should not see

The parser should not try to be too clever too early. It should preserve enough structure so later stages can make better decisions.
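To illustrate one of these mistakes, here is a rough sketch of removing repeated page headers while still reporting what was removed. It assumes the extractor already yields a list of lines per page; the function name, the 60% threshold, and the sample header "LearnPro Billing Handbook" are all invented for the example.

```python
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Remove lines that recur on at least `min_share` of the pages.

    Returns (cleaned_pages, removed_lines) so the caller can log what was dropped.
    """
    if len(pages) < 2:
        # With a single page there is no way to tell a header from content.
        return pages, set()
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page if line.strip()})
    threshold = min_share * len(pages)
    removed = {line for line, n in counts.items() if n >= threshold}
    cleaned = [[line for line in page if line.strip() not in removed] for page in pages]
    return cleaned, removed


pages = [
    ["LearnPro Billing Handbook", "Refunds are allowed within 14 days."],
    ["LearnPro Billing Handbook", "Monthly subscriptions can be cancelled at any time."],
    ["LearnPro Billing Handbook", "Enterprise customers should contact the account manager."],
]
cleaned, removed = strip_repeated_lines(pages)
print(removed)
```

Returning the removed lines keeps the step inspectable: they can be written to parse_warnings rather than disappearing without a trace.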

Reusable Example: Raw Text

We will use the following simple policy text as a reusable example in future logs.

This example is intentionally small. It contains a title, sections, rules, exceptions, metadata-like details, and a table-like structure.

Document Title: Refund and Cancellation Policy
Product: LearnPro Online Course Platform
Version: 2026.04
Owner: Billing Team

1. General Refund Rule
Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content.

2. Digital Course Activation
Once a customer downloads course materials or receives a completion certificate, the purchase is no longer refundable.

3. Subscription Cancellation
Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month.

4. Enterprise Customers
Enterprise customers with custom contracts should contact the account manager. Their refund terms follow the signed contract instead of the standard policy.

5. Support Contact
For billing issues, customers should contact billing-support@learnpro.example.

Refund Summary:
Purchase Type | Refund Window | Important Condition
Single Course | 14 days | Less than 20% completed
Monthly Subscription | Before next billing cycle | No refund for current active month
Enterprise Contract | Based on contract | Contact account manager

As plain text, this is readable for humans. But for a RAG pipeline, it is still not ideal. The system has to guess where sections start, what product the document belongs to, and how the summary table should be understood.
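A small rule-based parser is enough to remove most of that guessing for this particular layout. The sketch below assumes the conventions visible in the example ("Key: value" header lines, numbered section headings, a trailing colon on the summary heading, and pipe-separated table rows) and runs on a shortened copy of the text. The regular expressions and the `parse_policy` name are illustrative, and real documents would need sturdier rules.

```python
import re

HEADER_RE = re.compile(r"^(Document Title|Product|Version|Owner):\s*(.+)$")
SECTION_RE = re.compile(r"^(\d+)\.\s+(.+)$")

RAW = """\
Document Title: Refund and Cancellation Policy
Product: LearnPro Online Course Platform
Version: 2026.04
Owner: Billing Team

1. General Refund Rule
Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content.

2. Digital Course Activation
Once a customer downloads course materials or receives a completion certificate, the purchase is no longer refundable.

Refund Summary:
Purchase Type | Refund Window | Important Condition
Single Course | 14 days | Less than 20% completed
Monthly Subscription | Before next billing cycle | No refund for current active month
Enterprise Contract | Based on contract | Contact account manager
"""


def parse_policy(raw: str) -> dict:
    doc = {"title": None, "metadata": {}, "sections": [], "parse_warnings": []}
    current = None  # the section currently being filled
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        section = SECTION_RE.match(line)
        if header and current is None:
            # "Key: value" lines before the first section are document metadata.
            key, value = header.groups()
            if key == "Document Title":
                doc["title"] = value
            else:
                doc["metadata"][key.lower()] = value
        elif section:
            current = {"heading": section.group(2), "order": int(section.group(1)), "blocks": []}
            doc["sections"].append(current)
        elif line.endswith(":") and "|" not in line:
            # Unnumbered heading such as "Refund Summary:".
            current = {"heading": line[:-1], "order": len(doc["sections"]) + 1, "blocks": []}
            doc["sections"].append(current)
        elif "|" in line and current is not None:
            cells = [cell.strip() for cell in line.split("|")]
            last = current["blocks"][-1] if current["blocks"] else None
            if last is None or last["type"] != "table":
                # The first pipe-separated line is treated as the column header.
                current["blocks"].append({"type": "table", "columns": cells, "rows": []})
            elif len(cells) == len(last["columns"]):
                last["rows"].append(cells)
            else:
                doc["parse_warnings"].append(f"skipped malformed table row: {line!r}")
        elif current is not None:
            current["blocks"].append({"type": "paragraph", "text": line})
        else:
            doc["parse_warnings"].append(f"unassigned line: {line!r}")
    return doc


doc = parse_policy(RAW)
print([s["heading"] for s in doc["sections"]])
```

Anything the rules cannot place ends up in parse_warnings instead of being silently dropped or glued onto a neighboring section.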

Reusable Example: Structured Parsed Output

A better parsed output keeps the document structure explicit.

{
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "source_type": "text",
  "source_uri": "manual://learnpro/refund-policy/2026-04",
  "title": "Refund and Cancellation Policy",
  "language": "en",
  "metadata": {
    "product": "LearnPro Online Course Platform",
    "version": "2026.04",
    "owner": "Billing Team",
    "domain": "billing",
    "document_type": "policy"
  },
  "sections": [
    {
      "section_id": "general-refund-rule",
      "heading": "General Refund Rule",
      "order": 1,
      "blocks": [
        {
          "type": "paragraph",
          "text": "Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content."
        }
      ]
    },
    {
      "section_id": "digital-course-activation",
      "heading": "Digital Course Activation",
      "order": 2,
      "blocks": [
        {
          "type": "paragraph",
          "text": "Once a customer downloads course materials or receives a completion certificate, the purchase is no longer refundable."
        }
      ]
    },
    {
      "section_id": "subscription-cancellation",
      "heading": "Subscription Cancellation",
      "order": 3,
      "blocks": [
        {
          "type": "paragraph",
          "text": "Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month."
        }
      ]
    },
    {
      "section_id": "enterprise-customers",
      "heading": "Enterprise Customers",
      "order": 4,
      "blocks": [
        {
          "type": "paragraph",
          "text": "Enterprise customers with custom contracts should contact the account manager. Their refund terms follow the signed contract instead of the standard policy."
        }
      ]
    },
    {
      "section_id": "support-contact",
      "heading": "Support Contact",
      "order": 5,
      "blocks": [
        {
          "type": "paragraph",
          "text": "For billing issues, customers should contact billing-support@learnpro.example."
        }
      ]
    },
    {
      "section_id": "refund-summary",
      "heading": "Refund Summary",
      "order": 6,
      "blocks": [
        {
          "type": "table",
          "columns": ["Purchase Type", "Refund Window", "Important Condition"],
          "rows": [
            ["Single Course", "14 days", "Less than 20% completed"],
            ["Monthly Subscription", "Before next billing cycle", "No refund for current active month"],
            ["Enterprise Contract", "Based on contract", "Contact account manager"]
          ]
        }
      ]
    }
  ],
  "parse_warnings": [],
  "created_at": "2026-06-06"
}

This structure is useful because each later RAG stage can make better decisions.

Chunking can preserve section boundaries. Retrieval can filter by product, domain, or document_type. Reranking can boost exact section matches. The LLM can receive context with a title, section heading, and source reference. Debugging can inspect whether the correct answer existed in the parsed output.
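The first two of those stages can be sketched as follows, using an abbreviated copy of the parsed object above. `chunk_by_section` and `filter_chunks` are hypothetical helpers, and the chunk ID format and the "Title > Heading" prefix are assumptions chosen for readability, not a standard.

```python
def chunk_by_section(parsed: dict) -> list[dict]:
    """Emit one chunk per block, never crossing a section boundary."""
    chunks = []
    for section in parsed["sections"]:
        for i, block in enumerate(section["blocks"]):
            if block["type"] == "table":
                # Pair every cell with its column name so rows stay meaningful.
                text = "\n".join(
                    "; ".join(f"{col}: {cell}" for col, cell in zip(block["columns"], row))
                    for row in block["rows"]
                )
            else:
                text = block["text"]
            chunks.append({
                "chunk_id": f'{parsed["document_id"]}#{section["section_id"]}-{i}',
                "text": f'{parsed["title"]} > {section["heading"]}\n{text}',
                "metadata": {**parsed["metadata"], "section_id": section["section_id"]},
            })
    return chunks


def filter_chunks(chunks: list[dict], **required: str) -> list[dict]:
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in required.items())
    ]


parsed = {
    "document_id": "doc_refund_policy_learnpro_2026_04",
    "title": "Refund and Cancellation Policy",
    "metadata": {"product": "LearnPro Online Course Platform", "domain": "billing", "document_type": "policy"},
    "sections": [
        {
            "section_id": "general-refund-rule",
            "heading": "General Refund Rule",
            "blocks": [{"type": "paragraph", "text": "Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content."}],
        },
        {
            "section_id": "refund-summary",
            "heading": "Refund Summary",
            "blocks": [{
                "type": "table",
                "columns": ["Purchase Type", "Refund Window", "Important Condition"],
                "rows": [["Single Course", "14 days", "Less than 20% completed"]],
            }],
        },
    ],
}

chunks = chunk_by_section(parsed)
billing_chunks = filter_chunks(chunks, domain="billing")
print(chunks[1]["text"])
```

Because every chunk carries its document ID and section ID, a wrong answer can be traced back to the exact section that produced the retrieved text.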

The Main Principle

Parsing should not only extract text. It should recover structure.

Good structure helps the model because it receives clearer context. It helps us because every piece of text becomes traceable, filterable, and debuggable.

The practical rule is simple: before chunking starts, the document should already have a stable shape. If the parsed output is just one large string, the rest of the RAG pipeline has to guess too much.
