vector-databases

# Vector Databases (Deep Workflow) Vector search is **approximate nearest neighbor (ANN)** at scale—**not** magic semantic understanding. Success requires **embedding model alignment**, **index parameters**, **metadata filters**, and **evaluation** against real queries. ## When to Offer This Workflow **Trigger conditions:** - Building RAG, similarity search, dedup, recommendations, anomaly clustering - Comparing **managed vector DB** vs **pgvector** vs **search engine kNN** - **Recall** issues, **stale** vectors, **slow** queries, or **cost** explosions **Initial offer:** Use **six stages**: (1) problem & metrics, (2) embeddings & schema, (3) index & parameters, (4) hybrid & filtering, (5) operations & cost, (6) evaluation & iteration. Confirm **scale** (vectors, QPS, dimension) and **latency SLO**. --- ## Stage 1: Problem & Metrics **Goal:** Define **what “similar” means** for the product—not only cosine similarity. ### Questions 1. **Query types**: short keyword vs long paragraph? multilingual? 2. **Precision vs recall** preference: legal/medical may need **high precision** 3. **Freshness**: how often do vectors change? **Real-time** upserts? 4 **Ground truth**: any labeled **relevant** pairs for eval? ### Metrics - **Recall@k**, **MRR**, **nDCG** when judgments exist; otherwise **human** spot checks + **proxy** tasks **Exit condition:** **Success metric** and **minimum acceptable** recall/latency stated. --- ## Stage 2: Embeddings & Schema **Goal:** **Stable embedding pipeline** with **versioning** and **metadata** design. ### Embeddings - **Model choice**: domain fit (code vs general text); **dimension**; **distance metric** (cosine vs dot vs L2)—**match** DB defaults - **Chunking** strategy upstream—bad chunks → bad retrieval regardless of DB ### Schema - **Payload/metadata** per vector: `doc_id`, `tenant_id`, `acl`, `source`, **timestamps** - **Multi-vector** per doc (passages) vs single centroid—**trade-offs** ### Versioning - **Re-embed** all on model change—**plan** downtime or **dual-write** period **Exit condition:** **ID strategy** + **metadata filter** needs documented. --- ## Stage 3: Index & Parameters **Goal:** Pick **index type** and **build params** for data size and recall. ### Common families (vendor-specific names) - **HNSW**: strong latency/recall; **memory** hungry; **tunable** `efConstruction`, `M` - **IVF**: better memory; needs **training** **nlist**; **probe** tuning - **PQ/OPQ**: compression—**recall** hit; good for **huge** scale ### Tuning loop - Start **defaults**; **sweep** parameters with **benchmark** queries - Watch **insert** throughput during **index build** on large backfills **Exit condition:** **Benchmark** results: p95 latency vs recall at fixed **k**. --- ## Stage 4: Hybrid Search & Filtering **Goal:** Combine **vector** similarity with **structured** constraints—most production needs this. ### Patterns - **Pre-filter** metadata (tenant, date) **before** ANN when supported—**verify** filter selectivity - **Hybrid**: BM25 + vector with **weighted** fusion or **rerank** stage - **Reranking**: cross-encoder on **top-k** candidates—**quality** boost, **latency** cost ### Pitfalls - **Filtering** that leaves **too few** candidates—empty results despite “similar” existing in other tenants **Exit condition:** **Query plan** documented: ANN → filter → rerank (as applicable). --- ## Stage 5: Operations & Cost **Goal:** **Reliable** ingestion, **monitoring**, and **predictable** bills. ### Ops - **Upsert** idempotency; **delete** tombstones for compliance - **Backups**, **multi-region** if needed—**eventual consistency** semantics per vendor - **Capacity**: memory per node vs **sharding**; **replication** factor ### Cost - **Managed** per **dimension × count**; **egress**; **query** units—**estimate** from **peak QPS** **Exit condition:** **Runbook** for **reindex**, **scaling**, and **incident** “search degraded.” --- ## Stage 6: Evaluation & Iteration **Goal:** **Continuous** improvement with **labeled** or **proxy** eval. ### Loop - **Golden** query set updated when product changes - **A/B** embedding models or **rerankers** with **guardrails** on latency - **Monitor** **click-through**, **thumbs**, or **human** grading in RAG ### Debugging bad retrieval - **Chunk** inspection, **metadata** leaks, **wrong** tenant filter, **stale** index --- ## Final Review Checklist - [ ] Metrics and embedding/model versioning plan - [ ] Index family chosen with benchmark evidence - [ ] Hybrid/filter strategy matches product needs - [ ] Ops: upsert, delete, scaling, backup understood - [ ] Eval set and iteration process in place ## Tips for Effective Guidance - **Never** promise “semantic search understands intent”—**ground** with eval. - **pgvector** vs **specialized**: trade-offs on **scale**, **ops**, **features**—state honestly. - Warn: **high-cardinality** filters + ANN can be **slow**—**design** metadata carefully. ## Handling Deviations - **Tiny corpus**: **brute force** or **simple** index may suffice—avoid over-engineering. - **Multimodal**: **separate** embedding spaces or **unified** model—**fusion** strategy required.

vector-databases

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

vector-databases