After parsing, chunking, embedding, and metadata design, the next question is where to store the retrieval data. This is where vector database selection matters. A vector database is not only a place to store vectors. It also affects filtering, retrieval speed, hybrid search, scaling, debugging, and operational complexity.
Short Answer
A vector database stores embeddings and makes similarity search possible.
In a RAG system, it usually stores:
- vector embeddings
- chunk text
- chunk ID
- document ID
- metadata
- source reference
- sometimes sparse vectors or keyword index data
The correct vector database depends on the system you are building.
| Situation | Suitable Direction |
|---|---|
| Small app already using PostgreSQL | pgvector |
| Managed production RAG with low ops burden | Pinecone, Weaviate Cloud, Qdrant Cloud |
| Open-source self-hosted vector search | Qdrant, Weaviate, Milvus |
| Very large-scale vector workload | Milvus or managed vector DB |
| Strong keyword plus vector search | Elasticsearch, OpenSearch, Weaviate |
| Fast cache-like vector retrieval | Redis |
| Local prototype | Chroma, FAISS, SQLite with a vector extension |
There is no universal best vector database. The best choice depends on scale, filtering needs, deployment model, team skill, cost, and how much operational work you want to own.
What a Vector Database Actually Stores
A vector database stores the retrieval unit produced by your indexing pipeline.
It should store more than the vector. A practical RAG record usually looks like this:
```json
{
  "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
  "document_id": "doc_refund_policy_learnpro_2026_04",
  "embedding": [0.012, -0.044, 0.031],
  "text": "Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month.",
  "metadata": {
    "product": "LearnPro Online Course Platform",
    "domain": "billing",
    "document_type": "policy",
    "version": "2026.04",
    "language": "en",
    "permission_scope": "public_support"
  },
  "source": {
    "title": "Refund and Cancellation Policy",
    "section": "Subscription Cancellation",
    "source_uri": "manual://learnpro/refund-policy/2026-04"
  }
}
```
The embedding is used for similarity search. The text is used by the LLM. The metadata is used for filtering and debugging. The source fields are used for citation and traceability.
A weak vector store design only stores vectors. A better design stores vectors, text, metadata, and source references together or keeps stable links between them.
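To make the role of each field concrete, here is a minimal in-memory sketch in Python. `ChunkRecord` and `search` are hypothetical names for illustration, not any database's API. The brute-force cosine scan stands in for the approximate index a real vector database would use, and the second record and the query vector are invented sample data.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    chunk_id: str
    document_id: str
    embedding: list[float]                        # used for similarity search
    text: str                                     # passed to the LLM
    metadata: dict = field(default_factory=dict)  # filtering and debugging
    source: dict = field(default_factory=dict)    # citation and traceability


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(records: list[ChunkRecord], query_embedding: list[float], top_k: int = 3):
    """Return the top_k (score, record) pairs, highest similarity first."""
    scored = [(cosine_similarity(r.embedding, query_embedding), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


records = [
    ChunkRecord(
        chunk_id="doc_refund_policy_learnpro_2026_04__subscription_cancellation",
        document_id="doc_refund_policy_learnpro_2026_04",
        embedding=[0.012, -0.044, 0.031],
        text="Monthly subscriptions can be cancelled at any time.",
        metadata={"domain": "billing", "permission_scope": "public_support"},
        source={"source_uri": "manual://learnpro/refund-policy/2026-04"},
    ),
    ChunkRecord(  # invented second chunk so the ranking has something to compare
        chunk_id="hypothetical_doc__unrelated_section",
        document_id="hypothetical_doc",
        embedding=[-0.030, 0.010, 0.002],
        text="An unrelated passage.",
    ),
]

best_score, best = search(records, [0.010, -0.040, 0.030], top_k=1)[0]
print(best.text, best.source.get("source_uri"))
```

The search only touches `embedding`, but the caller immediately needs `text` and `source` from the same record. That is why a store holding only vectors forces an extra lookup somewhere else.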
Why Vector Databases Are Different
Vector databases differ because they optimize for different trade-offs.
Some are built for managed simplicity. Some are built for open-source control. Some are extensions on existing databases. Some are search engines that added vector search. Some are better for hybrid search. Some are better for large-scale distributed workloads.
| Difference | Why It Matters |
|---|---|
| Deployment model | Managed service or self-hosted |
| Filtering capability | Whether metadata filters work efficiently |
| Hybrid search | Whether vector and keyword search can be combined |
| Scaling model | Single node, distributed cluster, or serverless |
| Index types | Affects speed, recall, memory, and write behavior |
| Update behavior | Important when documents change often |
| Cost model | Storage, query count, pods, nodes, memory, or cloud usage |
| Ecosystem | SDKs, framework support, monitoring, backup, and operations |
The vector database is not a neutral storage choice. It shapes how the RAG system can retrieve evidence.
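The filtering row is easier to appreciate with a toy example. The Python sketch below compares two strategies: ranking first and filtering afterwards, versus filtering first and then ranking. The data, function names, and the `product` field are made up for illustration. Real databases implement filtered approximate search in more sophisticated ways, so treat this only as a picture of the failure mode.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


CHUNKS = [
    {"id": "a", "vector": [1.0, 0.0], "product": "other"},
    {"id": "b", "vector": [0.9, 0.1], "product": "other"},
    {"id": "c", "vector": [0.8, 0.2], "product": "learnpro"},
    {"id": "d", "vector": [0.1, 0.9], "product": "learnpro"},
]


def rank(chunks, query):
    """Order chunks by dot-product similarity to the query, best first."""
    return sorted(chunks, key=lambda c: dot(c["vector"], query), reverse=True)


def post_filter(chunks, query, product, k):
    """Take the global top-k first, then drop chunks that fail the filter."""
    top = rank(chunks, query)[:k]
    return [c["id"] for c in top if c["product"] == product]


def pre_filter(chunks, query, product, k):
    """Apply the metadata filter first, then take the top-k of what remains."""
    allowed = [c for c in chunks if c["product"] == product]
    return [c["id"] for c in rank(allowed, query)[:k]]


query = [1.0, 0.0]
post = post_filter(CHUNKS, query, "learnpro", k=2)
pre = pre_filter(CHUNKS, query, "learnpro", k=2)
print("post-filter:", post)
print("pre-filter:", pre)
```

With post-filtering, the two nearest chunks belong to another product, so the filter discards both and the query returns nothing. Filtering first still returns two usable chunks. This is the behavior to test early when permission or product filters are strict.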
Common Vector Database Options
This is not a complete market list. It is a practical engineering map of common choices.
| Option | What It Is Good For | Main Trade-Off |
|---|---|---|
| pgvector | Simple RAG inside PostgreSQL | Less specialized for very large vector workloads |
| Pinecone | Managed vector search with low ops burden | Vendor dependency and cloud cost |
| Weaviate | Open-source or managed vector DB with hybrid search features | More concepts to learn |
| Qdrant | Open-source vector search with strong filtering model | Need to operate it if self-hosted |
| Milvus | Large-scale open-source vector database | More infrastructure complexity |
| Elasticsearch / OpenSearch | Search-heavy systems needing keyword plus vector retrieval | Heavier search stack |
| Redis | Low-latency vector search near cache/session workloads | Memory and persistence trade-offs |
| Chroma | Local development and prototyping | Not usually the first choice for large production systems |
| FAISS | Local vector index library | Not a full database by itself |
The important point: these tools differ in more than syntax. They represent different operational models.
pgvector
pgvector is a PostgreSQL extension that lets PostgreSQL store and search vector embeddings.
It is a strong choice when your application already uses PostgreSQL and the RAG dataset is small enough to be served comfortably from that same database.
Good Fit
Small to medium RAG apps, internal tools, SaaS apps already using PostgreSQL, and teams that want one database first.
Be Careful
If vector traffic becomes very large, PostgreSQL may become harder to tune than a purpose-built vector database.
Use pgvector when simplicity is more important than specialized vector infrastructure.
A common early-stage architecture is:
- Application DB: PostgreSQL
- Vector Storage: pgvector table
- Metadata: normal PostgreSQL columns
- Chunk Text: normal PostgreSQL text column
This keeps the system easy to inspect. You can join chunks with documents, users, products, permissions, and evaluation records.
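As a rough sketch of that layout, the Python constants below hold SQL that a PostgreSQL driver could execute. The table and column names are hypothetical, the `%(name)s` placeholders assume a psycopg-style driver, and the 1536 dimension assumes an embedding model such as `text-embedding-3-small`. Adjust all three to your setup. `<=>` is pgvector's cosine distance operator.

```python
# One-time setup: enable the extension, then create the chunk table.
CREATE_EXTENSION_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id    text PRIMARY KEY,
    document_id text NOT NULL,
    chunk_text  text NOT NULL,
    product     text,
    domain      text,
    version     text,
    source_uri  text,
    embedding   vector(1536)
);
"""

# Filter on an ordinary column, then order by cosine distance to the query.
SEARCH_SQL = """
SELECT chunk_id, chunk_text, source_uri
FROM chunks
WHERE product = %(product)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""


def to_vector_literal(values):
    """Format a Python sequence as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


params = {
    "product": "LearnPro Online Course Platform",
    "query": to_vector_literal([0.012, -0.044, 0.031]),  # truncated sample vector
    "limit": 5,
}
print(params["query"])
```

Because metadata lives in ordinary columns, the `WHERE` clause can be extended with joins against documents, users, or permissions using plain SQL.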
Pinecone
Pinecone is a managed vector database.
Its main value is reducing operational work. The team does not need to manage vector database servers, indexes, replication, or scaling details directly.
Good Fit
Production RAG systems where the team wants managed vector search and does not want to operate vector infrastructure.
Be Careful
Managed convenience usually means cloud cost, vendor dependency, and less low-level control.
Use Pinecone when your main priority is shipping a production vector search feature without owning too much database operation.
It is often suitable when:
- the team is small
- production reliability matters
- vector search is core to the product
- you want managed scaling
- you do not want to tune a self-hosted cluster
Weaviate
Weaviate is an open-source vector database that can also be used as a managed cloud service.
It is often attractive when you want vector search, metadata filtering, schema, hybrid search, and AI-related features in one system.
Good Fit
RAG systems that need hybrid search, object-like data modeling, and a database designed around AI retrieval.
Be Careful
It introduces its own schema and concepts, so the team needs to learn how to model data correctly.
Use Weaviate when you want a purpose-built vector database but still care about structured object storage and hybrid retrieval.
It is useful when:
- documents have rich metadata
- hybrid search matters
- you want managed or self-hosted options
- your team wants a vector database with more application-level retrieval features
Qdrant
Qdrant is an open-source vector search engine designed around vectors plus payload metadata.
It is a strong option when metadata filtering is important and the team wants a clean vector search service.
Good Fit
Systems that need semantic search with strong metadata filtering, clear APIs, and self-hosted or cloud deployment.
Be Careful
If self-hosted, your team still owns deployment, monitoring, backup, scaling, and upgrades.
Use Qdrant when you want a dedicated vector search engine with practical filtering and good control over retrieval behavior.
It is often a good middle ground between simple pgvector and heavier distributed systems.
Milvus
Milvus is an open-source vector database designed for large-scale vector similarity search.
It is usually more relevant when the dataset is large, query volume is high, or the vector search workload needs distributed architecture.
Good Fit
Large vector collections, high-throughput retrieval, and teams that can operate more complex data infrastructure.
Be Careful
It can be heavier to operate than smaller or managed options.
Use Milvus when vector search scale is a serious requirement, not just a future possibility.
If your dataset is still small, Milvus may be more infrastructure than you need.
Elasticsearch and OpenSearch
Elasticsearch and OpenSearch are search engines that support keyword search and vector search.
They are useful when your system needs strong text search, filters, analytics, logs, or search ranking features together with vector search.
Good Fit
Search-heavy systems where BM25, filters, facets, text analyzers, and vector search all matter.
Be Careful
They are heavier systems. If you only need simple vector search, they may be too much.
Use this direction when hybrid search is central.
For example, if users often search exact product names, error codes, IDs, legal terms, and semantic questions, a search engine can be a strong fit.
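One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The Python sketch below shows the idea on invented document IDs. It is not a description of how any particular search engine implements hybrid search, and the constant `k=60` is only a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for a query that mixes an error code with a question.
keyword_hits = ["err_4012_guide", "refund_policy", "billing_faq"]
vector_hits = ["refund_policy", "cancellation_howto", "err_4012_guide"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still stays in the candidate set. RRF works on ranks rather than raw scores, so it avoids having to normalize BM25 scores against cosine similarities.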
Redis
Redis supports vector search through its search and query engine (the Redis Query Engine, previously the RediSearch module).
It is useful when low latency matters and the data fits the Redis operating model.
Good Fit
Fast retrieval near cache-heavy applications, session-aware personalization, and low-latency semantic lookup.
Be Careful
Redis keeps its working data primarily in memory, so memory cost and persistence behavior need careful design.
Use Redis when vector search is close to real-time application state or cache-like workloads.
Do not select Redis only because it is fast. Select it when its data model and memory cost fit your system.
Chroma and FAISS
Chroma and FAISS are common in local experiments and prototypes.
FAISS is a vector index library, not a full database. Chroma is easier to use as a local vector store for application experiments.
Good Fit
Local RAG prototypes, notebooks, demos, and experiments before production database selection.
Be Careful
Prototype convenience does not automatically mean production suitability.
Use these tools when you are still validating chunking, embeddings, retrieval quality, and prompt design.
For production, evaluate persistence, backup, access control, monitoring, scaling, and deployment model before committing.
How to Select Based on Situation
The best vector database is the one that matches your current and near-future workload.
| Case | Recommended Direction | Why |
|---|---|---|
| Personal project or small internal tool | pgvector or Chroma | Simple and cheap to start |
| Existing PostgreSQL app | pgvector | Keeps data and metadata in one system |
| Production SaaS with small team | Pinecone or managed Qdrant / Weaviate | Reduces operations burden |
| Heavy metadata filtering | Qdrant, Weaviate, pgvector | Filtering is part of retrieval control |
| Strong hybrid search requirement | Elasticsearch, OpenSearch, Weaviate | Keyword and vector search both matter |
| Very large vector dataset | Milvus or managed vector DB | Built for scale |
| Low-latency cache-like lookup | Redis | Fast access pattern |
| Research prototype | FAISS or Chroma | Fast iteration |
Selection should start from constraints, not from popularity.
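The table above can be read as a small decision function. The Python sketch below encodes it with hypothetical field names. The order in which constraints are checked is an assumption made for illustration, not a rule from the table, so reorder it to match your own priorities.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    prototype_only: bool = False
    very_large_dataset: bool = False
    needs_hybrid_search: bool = False
    cache_like_latency: bool = False
    wants_managed: bool = False
    uses_postgres: bool = False


def recommend(c: Constraints) -> list[str]:
    """Map workload constraints to the directions listed in the table."""
    if c.prototype_only:
        return ["FAISS", "Chroma"]
    if c.very_large_dataset:
        return ["Milvus", "managed vector DB"]
    if c.needs_hybrid_search:
        return ["Elasticsearch", "OpenSearch", "Weaviate"]
    if c.cache_like_latency:
        return ["Redis"]
    if c.wants_managed:
        return ["Pinecone", "managed Qdrant", "managed Weaviate"]
    if c.uses_postgres:
        return ["pgvector"]
    # Fallback: the "personal project or small internal tool" row.
    return ["pgvector", "Chroma"]


print(recommend(Constraints(uses_postgres=True)))
print(recommend(Constraints(uses_postgres=True, needs_hybrid_search=True)))
```

The second call shows why the order matters: an existing PostgreSQL app with a strong hybrid search requirement is steered towards a search engine here, which may or may not be the trade-off your team would choose.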
Selection Checklist
Before choosing a vector database, answer these questions.
1. How Large Is the Dataset?
Ten thousand chunks and one billion vectors require very different infrastructure.
2. How Important Is Metadata Filtering?
If filtering by product, permission, version, or region is critical, test filtered retrieval early.
3. Do You Need Hybrid Search?
If users search IDs, names, and exact terms, pure vector search may not be enough.
4. Who Operates the Database?
A self-hosted database is not free if your team must monitor, scale, patch, and recover it.
5. How Often Does Data Change?
Frequent updates need good upsert, delete, reindex, and versioning behavior.
6. What Is the Cost Model?
Check storage cost, query cost, memory cost, node cost, and managed service pricing.
A database that works well in a demo may fail when filtering, updates, permissions, and evaluation are added.
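Question 5 is often the one that gets skipped in a demo. The Python sketch below shows one way to sync a changed document: upsert by stable chunk ID, skip chunks whose text has not changed, and delete chunks that no longer exist. The dict-based `store`, the `sync_document` function, and `fake_embed` are all hypothetical stand-ins for a real database client and embedding model.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_document(store, document_id, new_chunks, embed):
    """Bring `store` in line with `new_chunks` (chunk_id -> text) for one document."""
    stats = {"embedded": 0, "skipped": 0, "deleted": 0}
    for chunk_id, text in new_chunks.items():
        digest = content_hash(text)
        existing = store.get(chunk_id)
        if existing is not None and existing["content_hash"] == digest:
            stats["skipped"] += 1  # unchanged text: no re-embedding needed
            continue
        store[chunk_id] = {
            "document_id": document_id,
            "text": text,
            "content_hash": digest,
            "vector": embed(text),
        }
        stats["embedded"] += 1
    # Remove chunks of this document that are absent from the new version.
    stale = [
        cid for cid, rec in store.items()
        if rec["document_id"] == document_id and cid not in new_chunks
    ]
    for cid in stale:
        del store[cid]
        stats["deleted"] += 1
    return stats


def fake_embed(text):
    # Placeholder embedding so the sketch runs without a model.
    return [float(len(text)), 0.0]


store = {}
first = sync_document(store, "doc_refund_policy",
                      {"c1": "Monthly plans.", "c2": "Annual plans."}, fake_embed)
second = sync_document(store, "doc_refund_policy",
                       {"c1": "Monthly plans.", "c3": "New section."}, fake_embed)
print(first, second)
```

If the database you are evaluating makes any of these three steps awkward, such as deleting by document ID, that is worth knowing before documents start changing in production.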
Common Mistakes
Vector database selection often fails because teams choose based on tool popularity instead of workload.
| Mistake | Why It Hurts |
|---|---|
| Choosing before defining metadata filters | The selected DB may not support efficient filtering |
| Ignoring keyword search | Exact terms, IDs, and error codes may retrieve poorly |
| Storing vectors without source references | Answers become hard to cite and debug |
| Using local prototype storage in production | Backup, scaling, and access control may be weak |
| Overengineering too early | Large distributed DB adds complexity before it is needed |
| Ignoring update behavior | Reindexing and document deletion become painful |
| Treating vector DB as the only source of truth | Source documents and chunk records become hard to rebuild |
The safest approach is to keep the indexing pipeline reproducible. If you can rebuild vectors from parsed documents and chunks, changing vector databases later is much easier.
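A reproducible pipeline can be sketched as a rebuild function that reads chunk records from your own source of truth and writes them into any store behind a small interface. In the Python below, the `VectorStore` protocol, `InMemoryStore`, and `fake_embed` are hypothetical. A real adapter would wrap the client library of the database you choose.

```python
from typing import Protocol


class VectorStore(Protocol):
    def upsert(self, record_id: str, vector: list[float], payload: dict) -> None: ...


class InMemoryStore:
    """Stand-in for a real database adapter."""

    def __init__(self):
        self.rows = {}

    def upsert(self, record_id, vector, payload):
        self.rows[record_id] = (vector, payload)


def rebuild_index(chunks, embed, store: VectorStore) -> int:
    """Re-embed every chunk record and write it to the target store."""
    count = 0
    for chunk in chunks:
        payload = {
            "document_id": chunk["document_id"],
            "text": chunk["text"],
            "source_uri": chunk["source_uri"],
        }
        store.upsert(chunk["chunk_id"], embed(chunk["text"]), payload)
        count += 1
    return count


def fake_embed(text):
    # Deterministic placeholder so repeated rebuilds give identical vectors.
    return [float(len(text)), float(text.count(" "))]


chunks = [
    {
        "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
        "document_id": "doc_refund_policy_learnpro_2026_04",
        "text": "Monthly subscriptions can be cancelled at any time.",
        "source_uri": "manual://learnpro/refund-policy/2026-04",
    },
]

old_store, new_store = InMemoryStore(), InMemoryStore()
written_old = rebuild_index(chunks, fake_embed, old_store)
written_new = rebuild_index(chunks, fake_embed, new_store)
print(written_old, written_new, old_store.rows == new_store.rows)
```

Migration then means writing one new adapter and rerunning the rebuild, because the chunk records, not the vector database, remain the source of truth.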
Reusable Example: Storage Record
Using the same refund policy example, a production-ready storage record should keep the vector, text, metadata, and source reference connected.
```json
{
  "id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
  "vector": [0.012, -0.044, 0.031],
  "payload": {
    "document_id": "doc_refund_policy_learnpro_2026_04",
    "title": "Refund and Cancellation Policy",
    "section": "Subscription Cancellation",
    "text": "Monthly subscriptions can be cancelled at any time. The cancellation will stop the next billing cycle, but it does not refund the current active month.",
    "metadata": {
      "product": "LearnPro Online Course Platform",
      "domain": "billing",
      "document_type": "policy",
      "version": "2026.04",
      "language": "en",
      "permission_scope": "public_support",
      "chunk_strategy": "heading_aware_table_aware_v1",
      "embedding_model": "text-embedding-3-small"
    },
    "source_uri": "manual://learnpro/refund-policy/2026-04"
  }
}
```
Different databases may name this differently.
Some call the extra fields metadata. Some call them payload. Some store them as table columns. The naming is less important than the design principle: the vector must stay connected to the chunk text, metadata, and source.
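The naming differences reduce to a mapping step. The Python sketch below converts one canonical record into a payload-style shape and into a flat table-row shape. Both target shapes and the function names are illustrative assumptions, so check the actual field names your database expects.

```python
def to_payload_record(record):
    """Shape used by stores that keep extra fields in a nested payload."""
    return {
        "id": record["chunk_id"],
        "vector": record["embedding"],
        "payload": {
            "document_id": record["document_id"],
            "title": record["source"]["title"],
            "section": record["source"]["section"],
            "text": record["text"],
            "metadata": dict(record["metadata"]),
            "source_uri": record["source"]["source_uri"],
        },
    }


def to_table_row(record):
    """Shape used when every field becomes a column in a relational table."""
    row = {
        "chunk_id": record["chunk_id"],
        "document_id": record["document_id"],
        "embedding": record["embedding"],
        "chunk_text": record["text"],
        "source_uri": record["source"]["source_uri"],
    }
    row.update(record["metadata"])  # each metadata key becomes its own column
    return row


canonical = {
    "chunk_id": "doc_refund_policy_learnpro_2026_04__subscription_cancellation",
    "document_id": "doc_refund_policy_learnpro_2026_04",
    "embedding": [0.012, -0.044, 0.031],
    "text": "Monthly subscriptions can be cancelled at any time.",
    "metadata": {"domain": "billing", "version": "2026.04"},
    "source": {
        "title": "Refund and Cancellation Policy",
        "section": "Subscription Cancellation",
        "source_uri": "manual://learnpro/refund-policy/2026-04",
    },
}

payload_style = to_payload_record(canonical)
row_style = to_table_row(canonical)
print(sorted(row_style))
```

Both outputs carry the same chunk ID, text, metadata, and source URI. As long as that link survives the mapping, the field names are a detail of the adapter.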
Practical Starting Recommendation
For most early RAG projects, start with the simplest option that can support the retrieval evaluation you plan to run.
A reasonable default path is:
| Stage | Practical Choice |
|---|---|
| Local prototype | Chroma or FAISS |
| Small real app with PostgreSQL | pgvector |
| Production with low ops | Managed vector DB |
| Search-heavy production | Elasticsearch, OpenSearch, or Weaviate |
| Large-scale vector infrastructure | Milvus or managed large-scale vector DB |
If you are not sure and you already use PostgreSQL, start with pgvector. Start with a managed vector database when you want to avoid infrastructure work. Start with Elasticsearch, OpenSearch, or Weaviate when keyword search is as important as semantic search.
The Main Principle
A vector database is part of retrieval design, not just storage.
The right choice depends on your workload: scale, metadata filtering, hybrid search, update frequency, cost, deployment model, and your team's operational capacity.
The practical rule is simple: choose the database that makes your retrieval behavior easy to build, easy to inspect, and easy to operate. Do not choose based only on popularity.