Embeddings & Vector Search
What is an embedding?
An embedding is a dense vector representation of text (or any modality) in a continuous high-dimensional space, such that semantically similar inputs land near each other. Modern embedding models produce vectors of 384–3072 dimensions.
Nearest-neighbor search over embeddings is the foundation of semantic search, RAG retrieval, recommendation, and deduplication.
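To make the idea of "nearest neighbor" concrete, here is a minimal sketch of an exact (brute-force) search in plain Python. The three-dimensional vectors and document IDs are hypothetical toy values, not real model output, and a production system would use a vector store with an approximate index rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k vectors most similar to the query (exact scan)."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
index = {
    "dog_park": [0.9, 0.1, 0.0],
    "puppy_fetch": [0.8, 0.2, 0.1],
    "stock_crash": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
nearest = top_k(query, index, k=2)
print(nearest)
```

The two dog-related vectors point in nearly the same direction as the query, so they rank ahead of the unrelated one.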
Embedding model shortlist (2026)
| Provider | Model | Dim | Context (tokens) | Notes |
|---|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | 8191 | Fast, cheap default |
| OpenAI | text-embedding-3-large | 3072 | 8191 | Best quality |
| Voyage | voyage-3 | 1024 | 32000 | Strong on retrieval benchmarks |
| Cohere | | 1536 | 128000 | Multilingual, long context |
| Open source | | 1024 | 8192 | Free, multilingual, hybrid |
| Open source | all-MiniLM-L6-v2 | 384 | 512 | Tiny, fast, weaker recall |
| Jina AI | | Multi-vector | 8192 | Late interaction (ColBERT), 89 languages |
| | gemini-embedding-002 | 3072 | 2048 | Multimodal (text+image+video+audio) |
Rule of thumb: start with text-embedding-3-small. If quality is
not enough, switch to text-embedding-3-large or voyage-3. Only drop
to MiniLM when you need to run locally or at zero cost.
Modern embedding techniques
Matryoshka Representation Learning (MRL)
Models trained with MRL (e.g., text-embedding-3-large, gemini-embedding-002) produce embeddings where the first N dimensions are a valid lower-dimensional embedding. Truncate from 3072 → 256 dims to cut storage ~92% with <2% quality loss.
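# OpenAI: request fewer dimensions
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Hello world",
    dimensions=256,  # truncated MRL embedding
)
vector = response.data[0].embedding  # list of 256 floats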
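When full-length MRL vectors are already stored, the same truncation can be done client-side. The sketch below keeps the first N components and rescales the result to unit length, which matters if you later compare vectors with a dot product. The function names and the toy vector are illustrative assumptions; the storage arithmetic mirrors the 3072 → 256 example above.

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


def storage_saving(full_dims: int, kept_dims: int) -> float:
    """Fraction of per-vector storage saved by truncation."""
    return 1.0 - kept_dims / full_dims


# Toy unit-length vector standing in for a full MRL embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
saving = storage_saving(3072, 256)
print(short, f"{saving:.1%}")
```

This only makes sense for models trained with MRL; truncating an ordinary embedding this way would likely discard information unevenly.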
Late interaction (ColBERT)
Instead of a single vector per document, ColBERT produces one vector per token. Retrieval uses MaxSim: for each query token, take the maximum similarity across all document tokens, then sum those maxima over the query tokens. Retrieval quality is typically higher, at the cost of storing one vector per token rather than one per document.
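The MaxSim scoring rule can be sketched in a few lines. The token vectors below are hypothetical two-dimensional toy values (assumed unit length, so a dot product acts as the similarity); real ColBERT-style models emit one vector of a hundred or more dimensions per token.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token vectors (hypothetical values, assumed unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match for each query token
doc_b = [[0.6, 0.8]]                          # one token that partially matches both

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Here doc_a scores higher because every query token finds a strong match, which is the behavior a single pooled vector can blur.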
Multimodal embeddings
gemini-embedding-002 (March 2026) maps text, images, video, audio, and PDFs into a single vector space. No OCR or captioning pipeline needed for document understanding.
Similarity metrics
| Metric | Formula | When to use |
|---|---|---|
| Cosine | $\frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}$ | Default for text embeddings. Range: -1..1. |
| Dot product | $a \cdot b$ | When vectors are already normalized (faster). |
| Euclidean (L2) | $\lVert a - b \rVert_2$ | Image embeddings; rarely for text. |
For text, always prefer cosine unless your embedding model specifies otherwise. Most modern text embeddings are L2-normalized, making cosine and dot product equivalent.
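The equivalence is easy to check numerically. In this small sketch (toy vectors, helper names chosen for illustration), the raw dot product differs from the cosine, but after L2 normalization the dot product equals the cosine.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = cosine(a, b)                             # uses the unnormalized vectors
dot_raw = dot(a, b)                                # depends on vector length
dot_unit = dot(l2_normalize(a), l2_normalize(b))   # matches the cosine
print(cos_raw, dot_raw, dot_unit)
```

If a store is configured for dot product and the vectors are not unit length, longer vectors get a boost that has nothing to do with meaning, so it is worth confirming normalization before switching metrics.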
Chunking strategy
Embeddings operate on chunks, not whole documents. Poor chunking is one of the most common causes of poor RAG retrieval quality.
Baseline (works most of the time):
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # measured in characters by default
    chunk_overlap=120,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)  # docs: a list of LangChain Document objects
Advanced: semantic chunking (split on sentence-boundary similarity drops) or structure-aware chunking (split on Markdown headings, HTML sections). See Advanced Indexing.
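As an illustration of structure-aware chunking, the following sketch splits Markdown text into one chunk per heading section using only the standard library. The function name and the sample text are hypothetical, and the splitter is deliberately naive: it does not skip "#" lines inside code blocks and does not enforce a maximum chunk size.

```python
import re


def split_on_headings(markdown: str) -> list[dict[str, str]]:
    """Split Markdown text into one chunk per heading section."""
    chunks: list[dict[str, str]] = []
    heading = ""  # text before the first heading gets an empty heading
    body: list[str] = []

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if heading or any(line.strip() for line in body):
            chunks.append({"heading": heading, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Refunds\nRefunds take 5 days.\n\n## Exceptions\nGift cards are final sale.\n"
sections = split_on_headings(sample)
for section in sections:
    print(section)
```

Keeping the heading alongside the text lets you store it as metadata or prepend it to the chunk before embedding. For real projects, a maintained splitter such as MarkdownHeaderTextSplitter from langchain_text_splitters is probably a better starting point than this sketch.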
Vector stores
pgvector (PostgreSQL + langchain-postgres)
Best choice when you already have Postgres. Supports rich metadata filtering via SQL.
pip install -qU langchain-postgres
Important: the package requires psycopg3. Connection string format:
postgresql+psycopg://user:pass@host:5432/db (not psycopg2://).
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = PGVector(
    embeddings=embeddings,
    collection_name="my_docs",
    connection="postgresql+psycopg://user:pass@localhost:5432/mydb",
)
store.add_documents(docs, ids=doc_ids)
results = store.similarity_search("query", k=5)

# Metadata filter operators: $eq, $ne, $lt, $in, $and, $or, $like, $ilike
results = store.similarity_search(
    "query",
    k=5,
    filter={"category": {"$eq": "policy"}, "year": {"$lt": 2025}},
)
Qdrant (langchain-qdrant)
Best for dense+sparse hybrid retrieval and when you want a dedicated vector database.
pip install -qU langchain-qdrant
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")  # or url="http://localhost:6333"
client.create_collection(
    collection_name="docs",
    # size must match the embedding model's dimension (1536 for text-embedding-3-small)
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
store = QdrantVectorStore(
    client=client,
    collection_name="docs",
    embedding=embeddings,
)
store.add_documents(docs, ids=uuids)

# retrieval_mode (a constructor argument) options: "dense" (default), "sparse", "hybrid"
results = store.similarity_search("query", k=5)
Chroma (langchain-chroma)
Best for quick prototyping and local-first workflows.
pip install -qU langchain-chroma
from langchain_chroma import Chroma

store = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
results = store.similarity_search("query", k=5)
Hybrid search: dense + sparse
Dense retrieval (vector similarity) excels at semantic understanding. Sparse retrieval (BM25) excels at exact keyword matching. Combining both via Reciprocal Rank Fusion (RRF) typically outperforms either alone — especially on queries with proper names, SKUs, or error codes.
# Reciprocal Rank Fusion (RRF)
def rrf(rank_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranks in rank_lists:
        for r, doc_id in enumerate(ranks):  # r is the 0-based rank
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + r + 1)
    return sorted(scores, key=scores.get, reverse=True)
See Advanced Retrieval Strategies for the full walkthrough.
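A small worked example may help show why fusion rewards agreement between retrievers. The sketch below computes the RRF score 1 / (k + rank) with 1-based ranks and k=60 for two hypothetical ranked lists; the document IDs and orderings are made up for illustration.

```python
def rrf_scores(rank_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Return the summed RRF score for every document ID in any list."""
    scores: dict[str, float] = {}
    for ranks in rank_lists:
        for position, doc_id in enumerate(ranks, start=1):  # 1-based rank
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return scores


dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense (vector) ranking
sparse = ["doc_c", "doc_b", "doc_d"]  # hypothetical BM25 ranking

scores = rrf_scores([dense, sparse])
fused = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

for doc_id in fused:
    print(doc_id, round(scores[doc_id], 5))
```

In this toy case doc_c and doc_b appear in both lists and end up ahead of doc_a, even though doc_a was the top dense result: appearing in two lists contributes two terms to the score, while a single first place contributes only 1/61.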
Re-ranking
After first-stage retrieval, pass the top-50 results through a
cross-encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2 or
Cohere’s rerank-v3) to re-score them with deeper attention. This
typically lifts precision by 5–15 points.
See Re-ranking for details.
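The re-ranking step itself is a small amount of glue code. The sketch below keeps the scorer pluggable so it runs without downloading a model: `word_overlap_scorer` is a deliberately crude stand-in and NOT a cross-encoder, and the function names and sample passages are hypothetical.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> list[str]:
    """Re-score (query, candidate) pairs and return the top_k candidates."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]


def word_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer (not a cross-encoder): counts shared lowercase words."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


candidates = [
    "Shipping takes three days.",
    "The refund policy allows returns within 30 days.",
    "Our refund policy excludes gift cards.",
]
top = rerank("what is the refund policy", candidates, word_overlap_scorer, top_k=2)
print(top)
```

With sentence-transformers installed, the `predict` method of `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` should be usable as `score_pairs`, since it takes a list of (query, passage) pairs and returns one score per pair; treat that wiring as an assumption to verify against the library version you use.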
Practice
1. Embedding comparison
Embed the following three sentences with
text-embedding-3-small and compute pairwise cosine similarity:
1. “A dog is chasing a ball in the park.”
2. “A puppy plays fetch in the garden.”
3. “The stock market crashed yesterday.”
Expected: (1,2) ≳ 0.7; (1,3) and (2,3) ≲ 0.3. Verify your intuition.
2. Chunking sensitivity
Take a 20-page PDF and index it twice:
Run A: chunk_size=400, chunk_overlap=40
Run B: chunk_size=1200, chunk_overlap=180
Write 10 test questions and measure top-5 recall on both runs. Report which chunk size works better and why.
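One way to compute the metric for this exercise is sketched below. It assumes each test question has been labeled with the set of chunk IDs that contain its answer, and it defines recall@k as the fraction of those relevant IDs found in the top k results; the IDs and rankings shown are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each test question needs at least one relevant ID")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one per question."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical results for two test questions.
runs = [
    (["c1", "c7", "c3", "c9", "c4", "c2"], {"c3", "c2"}),  # c2 is ranked 6th, so it is missed
    (["c5", "c8"], {"c5"}),
]
score = mean_recall_at_k(runs, k=5)
print(score)
```

Because chunk IDs differ between Run A and Run B, the relevant sets need to be labeled separately for each run, or expressed in terms of something stable such as source page numbers.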
3. pgvector end-to-end
Spin up Postgres with the pgvector extension (Docker one-liner works).
Index 1,000 Wikipedia articles using text-embedding-3-small via
langchain-postgres. Implement a metadata filter so that a query can
be restricted to articles from a specific category. Verify both the
unrestricted and filtered searches return reasonable results.
4. Hybrid search with RRF
Build a hybrid search that combines:
Dense vector search (any of the three stores above).
BM25 keyword search via rank-bm25.
Fuse the two ranked lists with Reciprocal Rank Fusion (k=60). Compare the fused top-5 against the dense-only top-5 on 10 queries that include proper names or codes. Hybrid should outperform dense-only on those queries.
5. Cross-encoder re-ranking
Retrieve top-50 from your best retriever. Re-rank with
cross-encoder/ms-marco-MiniLM-L-6-v2 and keep the top-5. Measure
precision@5 before and after re-ranking on your test set.
Expected: precision@5 should improve by 5–15 percentage points.
Review Questions
What similarity metric is the default choice for text embeddings?
A. Cosine similarity
B. Manhattan distance
C. Jaccard index
D. Hamming distance
For most teams starting a RAG project, which embedding model is the recommended first choice?
A. all-MiniLM-L6-v2 (local, 384 dim)
B. text-embedding-3-small (OpenAI, 1536 dim)
C. A custom model trained from scratch
D. text-embedding-ada-002 (legacy)
What is the pgvector connection string format required by langchain-postgres?
A. postgres://user:pass@host/db
B. postgresql+psycopg2://user:pass@host/db
C. postgresql+psycopg://user:pass@host/db (psycopg3)
D. jdbc:postgresql://host/db
Why does hybrid search (dense + sparse) outperform dense-only on queries containing exact SKUs or error codes?
A. Sparse retrieval is faster
B. BM25 excels at exact keyword matching, which pure vector similarity can miss
C. Hybrid uses more memory
D. Dense vectors don’t work at all
Which Qdrant parameter enables hybrid dense+sparse retrieval?
A. search_type="both"
B. retrieval_mode="hybrid"
C. mode="mixed"
D. use_bm25=True
What is the primary purpose of chunk overlap when splitting documents?
A. To make chunks larger for free
B. To avoid cutting a relevant passage exactly at a chunk boundary, which would make it unretrievable
C. To confuse the embedding model
D. To save disk space
A cross-encoder re-ranker is used after which step in a typical RAG pipeline?
A. Before the first-stage retriever runs
B. After first-stage retrieval, to re-score the top-N candidates with deeper attention
C. Instead of the embedding model
D. On the final answer text
What is Reciprocal Rank Fusion (RRF) used for?
A. Training embedding models
B. Combining ranked lists from multiple retrievers into a single fused ranking
C. Compressing vectors
D. Chunking documents
When would you prefer Chroma over pgvector for a vector store?
A. Never — always use pgvector
B. For quick prototyping and local-first workflows without an existing Postgres instance
C. When you need SQL metadata filters
D. When you have >100M vectors
Most modern text embeddings are L2-normalized. What does this imply about cosine similarity vs dot product?
A. They produce different rankings
B. They produce equivalent rankings (cosine = dot product when vectors are unit length)
C. Dot product is always slower
D. Cosine is impossible to compute
Answer Key
1. A: Cosine is the default for text embeddings.
2. B: text-embedding-3-small is the recommended starting point; upgrade to -large or voyage-3 if quality demands.
3. C: langchain-postgres requires the psycopg3 driver.
4. B: BM25 catches exact matches that dense vectors often miss.
5. B: Qdrant’s retrieval_mode parameter with "hybrid" enables dense+sparse search.
6. B: Overlap ensures relevant passages aren’t lost at chunk boundaries.
7. B: Re-ranking operates on the top-N candidates from first-stage retrieval.
8. B: RRF fuses multiple ranked lists, typically combining dense and sparse results.
9. B: Chroma is great for local prototyping; pgvector wins when Postgres is already in your stack.
10. B: With unit-length vectors, cosine similarity and dot product produce identical rankings; dot product is just faster.