Chunking is the step that turns parsed documents into searchable units. A RAG system does not usually retrieve a whole document. It retrieves chunks. That means chunking directly affects what the retriever can find, what the reranker can compare, and what the LLM can use as context.
Short Answer
There is no single best chunking strategy.
A good chunking strategy depends on:
- document structure
- document length
- question type
- retrieval method
- embedding model
- metadata quality
- answer format
- context window budget
The practical goal is not to find the perfect chunk size. The goal is for each chunk to carry enough meaning to be retrieved and enough context to support a correct answer.
Too Small
The chunk may match the query, but it may not contain enough surrounding context to answer correctly.
Too Large
The chunk may contain the answer, but the embedding may become noisy because too many unrelated ideas are mixed together.
Why Chunking Matters
Chunking controls the unit of retrieval.
If the correct answer is split across multiple chunks, retrieval may only return half of the evidence. If unrelated rules are mixed into the same chunk, the LLM may read conflicting context.
Chunking is not only a storage problem. It affects the whole RAG pipeline.
| Stage | How Chunking Affects It |
|---|---|
| Embedding | The vector represents the chunk, not the whole document |
| Retrieval | Search returns chunks based on chunk-level similarity |
| Reranking | Candidate chunks are compared against the user question |
| LLM Context | The model only sees the selected chunks |
| Debugging | Engineers inspect chunk-level evidence |
A weak chunking strategy can make a good retriever look bad. It can also make the LLM look unreliable when the real problem is that the answer was cut away from its context.
Strategy 1: Fixed-Size Chunking
Fixed-size chunking splits text by a fixed number of characters, words, or tokens.
Example:
| Setting | Meaning |
|---|---|
| Chunk size | 500 tokens |
| Overlap | 50 tokens |
| Split rule | Keep cutting until the document ends |
This is the simplest strategy. It is easy to implement and easy to test.
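A minimal sketch of the non-overlapping case follows. It assumes whitespace-separated words stand in for tokens; a real system would count tokens with the tokenizer that matches its embedding model. The name `fixed_size_chunks` is illustrative and does not come from any library.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive blocks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: whitespace-separated words approximate tokens.
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


sample = " ".join(f"w{i}" for i in range(1200))
chunks = fixed_size_chunks(sample, chunk_size=500)
print([len(chunk.split()) for chunk in chunks])  # [500, 500, 200]
```

The last chunk is simply whatever remains, so it can be much shorter than the others.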
When It Works
Fixed-size chunking works reasonably well when documents are plain text and do not have strong structure.
Use it for:
- notes
- transcripts
- loose articles
- raw text dumps
- early baseline experiments
Main Weakness
The splitter does not understand meaning.
It may cut through a paragraph, a table, a list, or a section. That can separate a rule from its condition.
For example, it may keep this in one chunk:
Customers can request a refund within 14 days after purchase.
But the next chunk may contain the condition:
This only applies if they have completed less than 20% of the course content.
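Each chunk is now incomplete: the first states the rule without its condition, and the second states a condition without saying what it applies to.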
Strategy 2: Fixed-Size With Overlap
Fixed-size with overlap is an improvement over pure fixed-size chunking.
Instead of cutting the text into isolated blocks, each chunk shares some text with the previous chunk.
Example:
| Chunk | Content Range |
|---|---|
| Chunk 1 | Tokens 1-500 |
| Chunk 2 | Tokens 451-950 |
| Chunk 3 | Tokens 901-1400 |
The overlap reduces the risk of cutting important context at the boundary.
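The ranges in the table follow from a stride equal to chunk size minus overlap (500 - 50 = 450). The sketch below reproduces that arithmetic with 1-based, inclusive token positions. The function name `chunk_ranges` is hypothetical.

```python
def chunk_ranges(total_tokens: int, chunk_size: int = 500, overlap: int = 50) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) token ranges for overlapping chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    ranges = []
    start = 1
    while start <= total_tokens:
        end = min(start + chunk_size - 1, total_tokens)
        ranges.append((start, end))
        if end == total_tokens:
            break  # the document is fully covered
        start += stride
    return ranges


print(chunk_ranges(1400))  # [(1, 500), (451, 950), (901, 1400)]
```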
When It Works
Use fixed-size with overlap when:
- you need a fast baseline
- the source text is not well structured
- the documents are long
- answers may appear near chunk boundaries
- you want predictable chunk length
This is often the first practical strategy because it is simple, stable, and easy to compare.
Main Weakness
Overlap increases storage and retrieval noise.
The same sentence may appear in multiple chunks. This can cause duplicate retrieval results and waste context window budget.
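One possible mitigation, offered here as an assumption rather than something this strategy prescribes, is to merge retrieved chunks from the same document when their token ranges overlap, so the shared text is sent to the LLM only once. The sketch assumes each retrieved chunk carries its inclusive `(start, end)` token range.

```python
def merge_overlapping_ranges(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge inclusive token ranges (from one document) that share tokens."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a duplicate.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


retrieved = [(451, 950), (1, 500), (1801, 2300)]
print(merge_overlapping_ranges(retrieved))  # [(1, 950), (1801, 2300)]
```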
Strategy 3: Paragraph-Based Chunking
Paragraph-based chunking splits text by natural paragraph boundaries.
This strategy respects the author's original writing structure better than fixed-size splitting.
When It Works
Use paragraph-based chunking for:
- blog posts
- essays
- documentation pages
- policy text
- explanation-heavy content
Paragraphs usually contain one local idea. That makes them good candidates for embedding and retrieval.
Main Weakness
Paragraph length is not consistent.
Some paragraphs are too short to be useful. Some are too long and contain multiple ideas. A production system often needs extra rules, such as merging short paragraphs or splitting very long paragraphs.
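The merge and split rules can be sketched as follows. The sketch assumes paragraphs are separated by blank lines and measures length in whitespace-separated words. The thresholds `min_words` and `max_words` are illustrative values, not recommendations.

```python
def paragraph_chunks(text: str, min_words: int = 20, max_words: int = 200) -> list[str]:
    """Split on blank lines, merge short paragraphs, and cut very long ones."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    pending: list[str] = []  # short paragraphs waiting to be merged

    def flush() -> None:
        if pending:
            chunks.append("\n\n".join(pending))
            pending.clear()

    for paragraph in paragraphs:
        words = paragraph.split()
        if len(words) > max_words:
            flush()
            # Too long: fall back to fixed-size pieces of max_words words.
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
            continue
        pending.append(paragraph)
        if sum(len(p.split()) for p in pending) >= min_words:
            flush()
    flush()
    return chunks
```

A merged chunk can slightly exceed `max_words`, because a paragraph just under the limit may be appended to a few short ones. A stricter version would check the combined length before merging.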
Strategy 4: Heading-Aware Chunking
Heading-aware chunking uses document titles and section headings to guide the split.
Instead of treating text as one flat stream, it keeps the section structure.
Example chunk format:
Document: Refund and Cancellation Policy
Section: Subscription Cancellation
Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month.
This is usually stronger than plain paragraph chunking because the heading becomes part of the context.
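A minimal sketch that produces chunks in this format is shown below. It assumes section headings are numbered lines such as "3. Subscription Cancellation"; Markdown or HTML sources would need a different heading detector. The function name is hypothetical.

```python
import re

# Assumption: section headings look like "3. Subscription Cancellation".
HEADING = re.compile(r"^\d+\.\s+(.+)$")


def heading_aware_chunks(document_title: str, text: str) -> list[str]:
    """Build one chunk per section, prefixed with the document and section titles."""
    sections: list[tuple[str, list[str]]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            sections.append((match.group(1), []))
        elif sections:
            sections[-1][1].append(line)
        # Lines before the first heading are skipped in this sketch.
    return [
        f"Document: {document_title}\nSection: {heading}\n{' '.join(body)}"
        for heading, body in sections
        if body
    ]


policy = """3. Subscription Cancellation
Monthly subscriptions can be cancelled at any time.
4. Enterprise Customers
Enterprise customers with custom contracts should contact the account manager."""
for chunk in heading_aware_chunks("Refund and Cancellation Policy", policy):
    print(chunk, end="\n\n")
```

A body line that happens to start with a number and a period would be misread as a heading, which is one concrete way the parsing dependency shows up.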
When It Works
Use heading-aware chunking for:
- API documentation
- product manuals
- policy documents
- technical guides
- knowledge base articles
- structured Markdown or HTML pages
This strategy works well when the document already has meaningful headings.
Main Weakness
It depends on good parsing.
If the parser fails to detect headings, the chunker will build bad chunks. Heading-aware chunking is only as good as the parsed structure it receives.
Strategy 5: Recursive Chunking
Recursive chunking tries to split text using a hierarchy of separators.
A common order looks like this:
- section
- paragraph
- sentence
- token limit
The splitter first tries to preserve larger meaningful units. If a unit is too large, it recursively splits it into smaller units.
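The recursion can be sketched as below. This version measures size in characters and uses blank lines, newlines, sentence ends, and spaces as the separator hierarchy; both choices are assumptions made for brevity. It illustrates the idea and does not reproduce any particular library's splitter.

```python
def recursive_split(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized parts."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    separator, finer = separators[0], separators[1:]
    pieces = text.split(separator)
    # Re-attach the separator so no characters are lost when parts are re-joined.
    parts = [piece + separator for piece in pieces[:-1]] + [pieces[-1]]
    parts = [part for part in parts if part.strip()]

    chunks: list[str] = []
    current = ""
    for part in parts:
        if len(current) + len(part) <= max_chars:
            current += part  # keep packing parts into the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        current = ""
        if len(part) <= max_chars:
            current = part
        else:
            chunks.extend(recursive_split(part, max_chars, finer))
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
print(recursive_split(text, max_chars=30))
# ['First paragraph.', 'It has two sentences.', 'Second paragraph is here.']
```

The second paragraph fits within the limit and stays whole, while the first is too long and is split at the sentence level.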
When It Works
Use recursive chunking when you want a strong general-purpose default.
It works well for many document types because it tries to respect structure while still enforcing a size limit.
Use it for:
- mixed Markdown documents
- documentation pages
- support articles
- semi-structured text
- early production RAG systems
Main Weakness
It is still rule-based.
It does not truly understand meaning. It only follows split rules. If the document structure is messy, recursive chunking may still produce weak chunks.
Strategy 6: Semantic Chunking
Semantic chunking splits text based on meaning instead of only size or separators.
The idea is to group sentences or paragraphs that are semantically close, then start a new chunk when the topic shifts.
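The sketch below shows the boundary rule: start a new chunk when the similarity between adjacent sentences drops below a threshold. It assumes the text is already split into sentences, and it uses bag-of-words cosine similarity as a stand-in for a real embedding model. The threshold of 0.2 is an arbitrary illustrative value.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    # Stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Group adjacent sentences; open a new chunk when similarity falls below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(previous), bag_of_words(current)) < threshold:
            groups.append([current])  # topic shift: start a new chunk
        else:
            groups[-1].append(current)
    return [" ".join(group) for group in groups]


sentences = [
    "Refunds are available within 14 days of purchase.",
    "Refunds are not available after 14 days.",
    "Enterprise customers should contact the account manager.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Here the two refund sentences share most of their words and stay together, while the enterprise sentence shares none with its predecessor and opens a second chunk.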
When It Works
Use semantic chunking when:
- documents contain long sections with multiple topics
- paragraph boundaries are weak
- the same section mixes several different concepts
- retrieval quality is poor with rule-based splitting
Semantic chunking can produce chunks that embed more cleanly, because each chunk is more topically focused.
Main Weakness
It is more expensive and less predictable.
It may require embeddings or model calls during indexing. It can also be harder to debug because chunk boundaries are determined by similarity scores and thresholds, not by simple rules.
Strategy 7: Parent-Child Chunking
Parent-child chunking separates the retrieval unit from the context unit.
The child chunk is small and searchable. The parent chunk is larger and used for final context.
Example:
| Unit | Purpose |
|---|---|
| Child chunk | Used for embedding and retrieval |
| Parent chunk | Returned to the LLM after child match |
This solves a common problem: small chunks retrieve well, but large chunks answer better.
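A minimal sketch of the linking step is shown below. The record layout (`child_id`, `parent_id`, `text`) and the word-overlap retriever are assumptions for illustration; a real system would search child embeddings in a vector store.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_children(query: str, children: list[dict], top_k: int = 2) -> list[dict]:
    """Toy retriever: rank child chunks by the number of words shared with the query."""
    query_words = words(query)
    ranked = sorted(children, key=lambda child: len(query_words & words(child["text"])), reverse=True)
    return ranked[:top_k]


def expand_to_parents(child_hits: list[dict], parents: dict[str, str]) -> list[str]:
    """Replace matched children with their parent text, keeping each parent once."""
    seen: set[str] = set()
    context: list[str] = []
    for child in child_hits:
        parent_id = child["parent_id"]
        if parent_id not in parents:
            raise KeyError(f"child {child['child_id']} points to unknown parent {parent_id}")
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


parents = {
    "sec_3": "Subscription Cancellation: Monthly subscriptions can be cancelled at any time. "
             "The cancellation will stop the next billing cycle, but it does not refund the current active month.",
    "sec_4": "Enterprise Customers: Enterprise customers with custom contracts should contact the account manager.",
}
children = [
    {"child_id": "c1", "parent_id": "sec_3", "text": "Monthly subscriptions can be cancelled at any time."},
    {"child_id": "c2", "parent_id": "sec_3", "text": "The cancellation will stop the next billing cycle, but it does not refund the current active month."},
    {"child_id": "c3", "parent_id": "sec_4", "text": "Enterprise customers with custom contracts should contact the account manager."},
]
hits = search_children("refund current active month", children, top_k=2)
print(expand_to_parents(hits, parents))
```

Both hits in this example belong to the same section, so the parent text is returned once rather than twice.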
When It Works
Use parent-child chunking when:
- small chunks improve search accuracy
- answers need surrounding context
- sections are too large for direct embedding
- the LLM needs a full policy rule, not one isolated sentence
For example, the retriever may match the child sentence about "no refund for current active month", but the system can return the full "Subscription Cancellation" section as parent context.
Main Weakness
It needs more metadata and careful linking.
Every child chunk must know its parent. If the relationship is wrong, retrieval may find the right sentence but return the wrong surrounding context.
Strategy 8: Table-Aware Chunking
Table-aware chunking preserves table rows, columns, and labels.
This matters because flattening a table into plain text can destroy the meaning.
Bad table chunk:
Single Course 14 days Less than 20% completed Monthly Subscription Before next billing cycle No refund for current active month
Better table chunk:
Refund Summary
Purchase Type: Single Course
Refund Window: 14 days
Important Condition: Less than 20% completed
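Row-level chunks like this one can be generated by pairing each cell with its column header. The sketch assumes a pipe-delimited table whose first line is the header row, with no Markdown separator row. The function name `table_row_chunks` is hypothetical.

```python
def table_row_chunks(table_title: str, table_text: str) -> list[str]:
    """Turn each table row into a self-describing 'Header: value' chunk."""
    lines = [line.strip() for line in table_text.strip().splitlines() if line.strip()]
    if not lines:
        return []
    headers = [cell.strip() for cell in lines[0].split("|")]
    chunks = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != len(headers):
            raise ValueError(f"row has {len(cells)} cells, expected {len(headers)}: {line!r}")
        fields = ". ".join(f"{header}: {value}" for header, value in zip(headers, cells))
        chunks.append(f"{table_title}. {fields}.")
    return chunks


table = """Purchase Type | Refund Window | Important Condition
Single Course | 14 days | Less than 20% completed
Monthly Subscription | Before next billing cycle | No refund for current active month"""
for chunk in table_row_chunks("Refund Summary", table):
    print(chunk)
```

Repeating the table title and column labels in every row costs a few tokens, but each chunk then makes sense on its own.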
When It Works
Use table-aware chunking for:
- pricing tables
- policy summaries
- comparison tables
- product spec sheets
- financial reports
- configuration matrices
Main Weakness
Table chunking needs custom logic.
Some tables should be chunked by row. Some should be chunked as a full table. Some need both: one chunk for each row and one chunk for the whole table summary.
How to Select the Right Strategy
The correct chunking strategy depends on the shape of the source document and the type of question you expect.
| Situation | Better Strategy | Reason |
|---|---|---|
| Plain long text | Fixed-size with overlap | Fast baseline and predictable size |
| Markdown or HTML docs | Heading-aware or recursive | Preserves document structure |
| Policy documents | Heading-aware with parent-child | Rules need section context |
| Tables and specs | Table-aware | Preserves row-column meaning |
| Mixed-topic long sections | Semantic chunking | Splits by topic shift |
| FAQ pages | Question-answer pair chunking | Each pair is already a retrieval unit |
| Code documentation | Heading-aware with code block preservation | Avoids separating code from explanation |
| Large knowledge base | Recursive plus metadata filtering | Balances structure and scale |
A good selection process usually starts simple, then becomes more specific after evaluation.
Start With a Baseline
Use recursive or fixed-size with overlap first. Measure whether the correct answer appears in top-k retrieval.
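That measurement can be sketched as a hit rate at k. The sketch assumes a small labeled set in which each test question maps to the ID of the chunk that contains its answer. The data shapes and the chunk IDs in the example are hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of questions whose expected chunk ID appears in the top-k results.

    results:  question -> ranked list of retrieved chunk IDs
    expected: question -> ID of the chunk that contains the answer
    """
    if not expected:
        raise ValueError("expected must contain at least one labeled question")
    hits = sum(
        1 for question, gold_id in expected.items()
        if gold_id in results.get(question, [])[:k]
    )
    return hits / len(expected)


expected = {
    "Can I get a refund for the current month?": "subscription_cancellation",
    "How long is the refund window?": "general_refund_rule",
}
results = {
    "Can I get a refund for the current month?": ["subscription_cancellation", "support_contact"],
    "How long is the refund window?": ["support_contact", "general_refund_rule"],
}
print(hit_rate_at_k(results, expected, k=1))  # 0.5
print(hit_rate_at_k(results, expected, k=2))  # 1.0
```

Running the same labeled questions against each chunking strategy gives a like-for-like comparison.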
Inspect Failures
If the correct content exists but is not retrieved, inspect whether the chunk is too large, too small, or missing context.
Use Structure When Available
If the source has headings, tables, or sections, preserve them instead of flattening everything.
Add Complexity Only When Needed
Semantic or parent-child chunking is useful, but it adds indexing cost and debugging complexity.
Why There Is No Best Strategy
There is no best chunking strategy because chunking is a trade-off.
Each strategy optimizes for a different kind of retrieval behavior.
| Trade-Off | Small Chunk | Large Chunk |
|---|---|---|
| Retrieval precision | Usually higher | Usually lower |
| Context completeness | Usually lower | Usually higher |
| Embedding noise | Lower | Higher |
| Risk of missing condition | Higher | Lower |
| Context window cost | Lower | Higher |
Small chunks are easier to match but may lose the surrounding condition. Large chunks preserve context but may dilute the embedding.
The best strategy for one document type can be bad for another. A support FAQ, an API reference, a legal policy, and a product table should not be chunked the same way.
The practical question is not "Which chunking strategy is best?" The better question is "Which failure mode am I trying to reduce?"
Reusable Example: Chunking the Policy Document
The previous log used this document:
Document Title: Refund and Cancellation Policy
Product: LearnPro Online Course Platform
Version: 2026.04
Owner: Billing Team
1. General Refund Rule
Customers can request a refund within 14 days after purchase if they have completed less than 20% of the course content.
2. Digital Course Activation
Once a customer downloads course materials or receives a completion certificate, the purchase is no longer refundable.
3. Subscription Cancellation
Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month.
4. Enterprise Customers
Enterprise customers with custom contracts should contact the account manager. Their refund terms follow the signed contract instead of the standard policy.
5. Support Contact
For billing issues, customers should contact billing-support@learnpro.example.
Refund Summary:
Purchase Type | Refund Window | Important Condition
Single Course | 14 days | Less than 20% completed
Monthly Subscription | Before next billing cycle | No refund for current active month
Enterprise Contract | Based on contract | Contact account manager
For this document, a good first strategy is heading-aware chunking with table-aware handling.
Example chunk:
{
  "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "title": "Refund and Cancellation Policy",
  "section": "Subscription Cancellation",
  "text": "Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month.",
  "metadata": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "document_type": "policy",
    "version": "2026.04"
  }
}
For the table, row-level chunks can make retrieval more precise:
{
  "chunk_id": "doc_refund_policy_learnpro_2026_04__refund_summary__monthly_subscription",
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "title": "Refund and Cancellation Policy",
  "section": "Refund Summary",
  "text": "Refund Summary. Purchase Type: Monthly Subscription. Refund Window: Before next billing cycle. Important Condition: No refund for current active month.",
  "metadata": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "document_type": "policy",
    "version": "2026.04",
    "block_type": "table_row"
  }
}
This structure keeps the chunk focused, traceable, and useful for retrieval.
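Records with this shape can be built by a small helper. The sketch assumes the ID convention visible in the examples: document ID, section slug, and an optional row slug, joined by double underscores. The helper names `slugify` and `make_chunk` are hypothetical.

```python
import re


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def make_chunk(document_id, title, section, text, metadata, row_label=None):
    """Build a chunk record whose ID encodes the document, section, and optional table row."""
    id_parts = [document_id, slugify(section)]
    chunk_metadata = dict(metadata)  # copy so the shared metadata is not mutated
    if row_label is not None:
        id_parts.append(slugify(row_label))
        chunk_metadata["block_type"] = "table_row"
    return {
        "chunk_id": "__".join(id_parts),
        "document_id": document_id,
        "title": title,
        "section": section,
        "text": text,
        "metadata": chunk_metadata,
    }


base_metadata = {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "document_type": "policy",
    "version": "2026.04",
}
row = make_chunk(
    "doc_refund_policy_learnpro_2026_04",
    "Refund and Cancellation Policy",
    "Refund Summary",
    "Refund Summary. Purchase Type: Monthly Subscription. Refund Window: Before next billing cycle. "
    "Important Condition: No refund for current active month.",
    base_metadata,
    row_label="Monthly Subscription",
)
print(row["chunk_id"])
# doc_refund_policy_learnpro_2026_04__refund_summary__monthly_subscription
```

Deriving the ID from the document, section, and row keeps it stable across re-indexing as long as the headings do not change.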
What Is Commonly Used Now
In many practical RAG systems, the most common starting point is still recursive chunking or fixed-size chunking with overlap.
The reason is not that these strategies are always the best. The reason is that they are easy to implement, easy to compare, and good enough for many early systems.
A more mature setup often moves toward:
- recursive chunking for general documents
- heading-aware chunking for structured documentation
- table-aware chunking for tabular content
- parent-child chunking when small retrieval units need larger answer context
So the common path is: start with a simple baseline, evaluate retrieval failures, then add structure-aware chunking where the data clearly needs it.
The Main Principle
Chunking is not about cutting text into equal pieces. It is about choosing the right retrieval unit.
There is no universal best strategy because different documents fail in different ways. Some need smaller chunks for precision. Some need larger chunks for context. Some need headings. Some need table preservation. Some need parent-child relationships.
The practical rule is simple: choose the chunking strategy based on the failure you want to reduce, then prove it with retrieval evaluation.