When working on embeddings, it is natural to focus first on the model itself: use a larger backbone, add more data, and push MTEB or internal evals a little higher. The output dimension tends to move up with the model. 384 and 768 dimensions are still common, but 2048- and 4096-dimensional text and multimodal embeddings are no longer unusual.
The table below is not meant to be a strict leaderboard timeline. MTEB later split into English, multilingual, and versioned leaderboards, and the rankings keep changing. Still, a few representative strong models show the direction of travel:
| Year | Representative model | Output dim | Note |
|---|---|---|---|
| 2022 | sentence-transformers/all-mpnet-base-v2 | 768 | A widely used early sentence-transformers baseline |
| 2023 | BAAI/bge-large-en-v1.5 | 1024 | The BGE series ranked highly on MTEB / C-MTEB at release |
| 2024 | intfloat/e5-mistral-7b-instruct | 4096 | LLM backbones became a major line of embedding models |
| 2024 | nvidia/NV-Embed-v2 | 4096 | The model card reports No. 1 on the MTEB leaderboard (56 tasks) as of 2024-08-30 |
| 2025 | Qwen/Qwen3-Embedding-8B | 4096 | Supports variable output dimensions from 32 to 4096, with 4096 as the default upper bound |
In offline experiments, the cost of higher dimensionality is often muted. Add some GPU memory, add disk, reduce the batch size, and the experiment can keep running. Indexing and serving feel the pressure more directly. A 4096-dimensional float32 vector is 16 KB. One billion items is about 16 TB before index structures, replicas, caches, or metadata. At query time, the system still has to find nearest neighbors over those vectors under a tight latency budget.
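The storage arithmetic is easy to reproduce. The sketch below counts raw vector bytes only; the function name and the 4-bytes-per-value default are illustrative choices, not part of any library.

```python
def raw_vector_storage_bytes(num_items: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw bytes for dense vectors only: no index, replicas, caches, or metadata."""
    return num_items * dim * bytes_per_value


per_vector = raw_vector_storage_bytes(1, 4096)           # one float32 vector
billion = raw_vector_storage_bytes(1_000_000_000, 4096)  # one billion of them

print(f"per vector: {per_vector} bytes")
for dim in (4096, 1024, 256, 64):
    terabytes = raw_vector_storage_bytes(1_000_000_000, dim) / 1e12
    print(f"dim={dim:>4}: {terabytes:.3f} TB for 1B items")
```

Halving the dimension halves this number, which is why the rest of the post treats dimension (or active count) as the main storage lever.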
The question is not just whether vectors can be made smaller. It is more specific: in a retrieval system that already has a quality bar, which representations reduce storage, bandwidth, and compute, and how much recall do they give up in exchange?
This post compares three common routes: projection to a lower dimension, Matryoshka Representation Learning (MRL), and Contrastive Sparse Representation (CSR). The first two shorten a dense vector. CSR can keep a large representation space, but each sample uses only a small number of positions.
Representation Forms
Start with what the retrieval system receives at inference time. The training details come later.
Projection is the most direct form. The original dense embedding is $x \in \mathbb{R}^d$; a linear layer or MLP maps it to $z \in \mathbb{R}^m$, where $m \ll d$. The retrieval system does not need to care how the mapping was produced. It still sees a dense vector, just with fewer dimensions.
MRL also keeps the dense-vector interface. The difference is that it does not simply chop off a prefix after training. During training, it makes the first 32, 64, and 128 dimensions work on their own. At inference time, a tight budget uses a short prefix; a looser budget uses a longer one. This design is compatible with existing dense KNN / ANN systems.
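At inference time, using an MRL embedding at a smaller budget amounts to slicing a prefix and re-normalizing it. The sketch below shows only that mechanic on made-up vectors; `truncate_prefix` is a hypothetical helper, and the scores are meaningful only if the embedding model was actually trained with nested prefixes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def truncate_prefix(vec, m):
    """Keep the first m dimensions and re-normalize so dot products stay cosine scores."""
    return l2_normalize(vec[:m])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional embeddings (illustrative values, not model output).
full_query = [0.9, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01]
full_doc = [0.8, 0.4, 0.1, 0.2, 0.05, 0.01, 0.03, 0.02]

scores = {
    m: dot(truncate_prefix(full_query, m), truncate_prefix(full_doc, m))
    for m in (2, 4, 8)
}
for m, score in scores.items():
    print(f"prefix length {m}: cosine = {score:.4f}")
```

Because the output is still a dense vector, the same index type and distance function can serve every budget tier; only the stored width changes.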
CSR takes a different route. It first maps the original embedding into a larger latent space, say h dimensions, then keeps only the k largest activations (TopK) and zeroes the rest. Storage does not keep a full dense vector. It keeps indices and values. If k = 16, each item stores 16 positions and 16 values. This representation only becomes useful when the retrieval system also uses sparse retrieval; otherwise sparsity stays in the complexity formula rather than the system.
MRL and CSR use active count in different ways. In MRL, m=64 means “use the first 64 dense dimensions.” In CSR, k=64 means “activate 64 positions in a potentially much larger latent space.” The two settings may have similar cost, but they organize information very differently.
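To make the two storage formats concrete, here is a minimal sketch that turns a latent vector into parallel index and value lists and compares per-item bytes with a dense prefix. Selecting by magnitude and using 4-byte indices are assumptions for illustration; a trained CSR encoder and a real index may choose differently.

```python
def topk_sparsify(latent, k):
    """Keep the k largest-magnitude entries as parallel (indices, values) lists."""
    top = sorted(range(len(latent)), key=lambda i: abs(latent[i]), reverse=True)[:k]
    indices = sorted(top)  # store positions in ascending order
    return indices, [latent[i] for i in indices]


def sparse_bytes_per_item(k, index_bytes=4, value_bytes=4):
    """Each active feature costs one index plus one value."""
    return k * (index_bytes + value_bytes)


def dense_bytes_per_item(m, value_bytes=4):
    """A dense prefix stores values only; positions are implicit."""
    return m * value_bytes


latent = [0.0, 0.7, -0.1, 0.0, -1.2, 0.05, 0.3, 0.0]  # toy latent, h = 8
indices, values = topk_sparsify(latent, k=2)
print(indices, values)  # [1, 4] [0.7, -1.2]

# With 4-byte indices, k = 16 sparse costs the same bytes as m = 32 dense.
print(sparse_bytes_per_item(16), dense_bytes_per_item(32))
```

Under these assumptions, a fair storage comparison puts CSR at k against a dense vector at roughly 2k dimensions, not at k dimensions.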
Training Objectives
Projection does not prescribe one fixed training recipe. You can train the low-dimensional vector directly with the original retrieval loss, such as InfoNCE. You can also distill from a high-dimensional teacher by matching its similarity distribution. Another option is to add a reconstruction loss so the low-dimensional vector preserves information from the original embedding. Regardless of the recipe, the information has to fit into m dimensions.
MRL has a clearer constraint. The original Matryoshka Representation Learning applies losses at multiple truncation lengths. It tells the model not to waste the early dimensions, because short vectors must also be usable.
For a retrieval task, MRL can be written as:

$$
\mathcal{L}_{\text{MRL}} = \sum_{m \in M} \mathcal{L}_{\text{ret}}\left(q_{1:m},\, d_{1:m}\right)
$$

Here $M$ is a set of truncation lengths, such as $\{32, 64, 128, 256\}$, and $q_{1:m}$, $d_{1:m}$ are the first $m$ dimensions of the query and document embeddings. Training computes a retrieval loss at each length. Inference chooses one length. Deployment stays simple because the system still runs dense vector search. The cost is also clear: the model must learn to place useful information early. The shorter the prefix, the less information it can hold.
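A minimal sketch of this multi-length objective, assuming in-batch InfoNCE as the retrieval loss and an unweighted sum over lengths (the original MRL paper also allows a weight per length). The function names, temperature, and toy data are illustrative; real training would use an autodiff framework rather than plain Python.

```python
import math
import random


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, docs, temperature=0.05):
    """In-batch InfoNCE: queries[i] should match docs[i]; other docs act as negatives."""
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


def mrl_loss(queries, docs, lengths):
    """Sum the retrieval loss over every truncation length in `lengths`."""
    per_length = {
        m: info_nce([q[:m] for q in queries], [d[:m] for d in docs])
        for m in lengths
    }
    return sum(per_length.values()), per_length


rng = random.Random(0)
docs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
queries = [[x + 0.1 * rng.gauss(0, 1) for x in doc] for doc in docs]

total, per_length = mrl_loss(queries, docs, lengths=(2, 4, 8))
for m, loss in per_length.items():
    print(f"m={m}: loss={loss:.4f}")
print(f"total={total:.4f}")
```

Each prefix is re-normalized before scoring, so every length is trained as a usable embedding in its own right rather than as a fragment of the full vector.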
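CSR training looks more like a sparse autoencoder plus a task constraint. Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation starts from a pretrained dense embedding and trains a sparse module that maps it into a TopK latent representation. In simplified form, it can be written as:

$$
z = \operatorname{TopK}_k\left(f_{\text{enc}}(x)\right), \qquad \hat{x} = f_{\text{dec}}(z), \qquad
\mathcal{L}_{\text{CSR}} = \mathcal{L}_{\text{rec}}(x, \hat{x}) + \lambda\, \mathcal{L}_{\text{con}}(z)
$$

where $x$ is the pretrained dense embedding, $z$ is the TopK latent, $\hat{x}$ is the reconstruction, and $\lambda$ weights the contrastive term.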
The reconstruction loss preserves information from the original embedding. The contrastive loss makes the sparse latent representation useful for retrieval or classification. CSR differs from low-dimensional dense methods because it does not force every sample into the same small coordinate set. The model can have a large latent dictionary, while each sample selects only a few positions.
CSRv2 fits into this picture. It does not introduce a different representation. It addresses CSR’s training problems. In the ultra-sparse region, especially k=2 and k=4, the original CSR can produce dead neurons: many latent dimensions are rarely selected, so the model appears to have a large latent space but has much less usable capacity.
CSRv2 mainly changes training. It uses k-annealing: start with a larger k, then gradually move down to the target sparsity so the model is not constrained by a tiny active set from the beginning. It also adds supervised contrastive signals so the few active features serve downstream tasks more directly. For cross-domain settings, the paper also discusses full finetuning. A cautious reading is that CSRv2 does not prove “sparse is always better.” It shows that some ultra-sparse failures come from training collapse and should not be blamed only on sparse representations.
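The idea of k-annealing can be sketched as a schedule that maps a training step to an active count. The linear decay below is an assumption for illustration; the actual schedule used by CSRv2 is not specified here, and `annealed_k` is a hypothetical helper.

```python
def annealed_k(step, total_steps, k_start, k_target):
    """Linearly decay the active count from k_start to k_target, then hold at k_target."""
    if total_steps <= 0 or step >= total_steps:
        return k_target
    fraction = step / total_steps
    return max(k_target, round(k_start + (k_target - k_start) * fraction))


# Example: anneal from 64 active features down to 2 over 1000 steps.
for step in (0, 250, 500, 750, 1000):
    print(step, annealed_k(step, total_steps=1000, k_start=64, k_target=2))
```

Early steps train with many active features, so most latent dimensions receive gradient before the active set shrinks to the target sparsity.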
Evaluation
Complexity analysis needs measurements. A low-cost representation has to answer at least two questions:
- How do quality metrics change?
- How much speed and storage does it save?
These two questions should be evaluated separately. Quality can be measured with retrieval Recall, nDCG, or MRR; it can also be approximated by overlap with the full dense top-k. Performance needs storage, index build time, query latency, QPS, batch size, hardware, and kernel details. Reporting only the dimension or k is incomplete.
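The quality side can be sketched with three small functions: recall against labeled relevant ids, reciprocal rank, and overlap with the full dense top-k. The names and toy rankings are illustrative; a production evaluation would normally use an established IR evaluation library.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def overlap_with_reference(retrieved, reference, k):
    """Share of the reference top-k that the cheaper method also returns in its top-k."""
    return len(set(retrieved[:k]) & set(reference[:k])) / k


dense_top = [7, 3, 9, 1, 4]  # ranking from the full dense embedding
cheap_top = [3, 7, 2, 8, 9]  # ranking from a cheaper representation
relevant = {3, 9}            # labeled relevant ids for this query

print(recall_at_k(cheap_top, relevant, k=3))              # 0.5
print(reciprocal_rank(cheap_top, relevant))               # 1.0
print(overlap_with_reference(cheap_top, dense_top, k=5))  # 0.6
```

Overlap with the dense top-k is only a proxy: a cheaper method can disagree with the dense ranking and still retrieve the labeled items, which is why both kinds of metric are worth reporting.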
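Dense retrieval has the basic form:

$$
S = Q C^{\top}, \qquad Q \in \mathbb{R}^{B \times d},\; C \in \mathbb{R}^{N \times d},\; S \in \mathbb{R}^{B \times N}
$$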
If the query matrix is B x d and the corpus matrix is N x d, the main cost of exact dense search is the B x N x d multiply-add work and the corresponding memory reads. MRL and projection still use this form, with d replaced by a smaller m.
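As a worked example of that count, the sketch below compares multiply-adds for a full 2048-dimensional search with a 64-dimensional one at the same batch and corpus size. The function name is illustrative.

```python
def dense_multiply_adds(batch_size, num_items, dim):
    """Multiply-add count for exact dense scoring of a B x d batch against an N x d corpus."""
    return batch_size * num_items * dim


full = dense_multiply_adds(batch_size=64, num_items=1_000_000, dim=2048)
short = dense_multiply_adds(batch_size=64, num_items=1_000_000, dim=64)

print(f"d=2048: {full:,} multiply-adds")
print(f"m=64:   {short:,} multiply-adds")
print(f"ratio:  {full // short}x")
```

The measured speedup is usually smaller than this ratio, because top-k selection over the B x N score matrix does not shrink when the dimension does.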
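CSR’s ideal path is different. Both query and corpus are TopK sparse. Similarity only needs to accumulate over shared active dimensions:

$$
s(q, c) = \sum_{j \in I_q \cap I_c} q_j\, c_j
$$

where $I_q$ and $I_c$ are the active index sets of the query and the corpus item.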
If the sparse index is implemented well, the cost is closer to the number of active features and posting-list accesses, rather than the full latent dimension. The engineering cost is there too. Position indices are not free, and neither are scatter / index_add. When k is very small, kernel launch overhead and non-contiguous memory access can consume part of the theoretical gain.
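The posting-list idea can be shown in a few lines of plain Python. This is a sketch of the data structure only, with hypothetical function names and toy data; it says nothing about GPU kernel behavior.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """corpus: list of (indices, values) per item -> {dim: [(item_id, value), ...]}."""
    postings = defaultdict(list)
    for item_id, (indices, values) in enumerate(corpus):
        for dim, value in zip(indices, values):
            postings[dim].append((item_id, value))
    return postings


def sparse_scores(query, postings):
    """Accumulate dot products only over dimensions the query activates."""
    query_indices, query_values = query
    scores = defaultdict(float)
    for dim, query_value in zip(query_indices, query_values):
        for item_id, value in postings.get(dim, ()):
            scores[item_id] += query_value * value
    return dict(scores)


corpus = [
    ([1, 4], [0.6, 0.8]),  # item 0
    ([2, 4], [1.0, 0.5]),  # item 1
    ([0, 3], [0.7, 0.7]),  # item 2 shares no dimension with the query below
]
postings = build_inverted_index(corpus)
scores = sparse_scores(([4, 7], [0.9, 0.1]), postings)

for item_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(item_id, round(score, 4))
```

Item 2 is never touched, and dimension 7 has an empty posting list, so the work scales with the lengths of the visited posting lists rather than with the latent dimension or the corpus size.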
The following is a reproducible benchmark script. It is not a substitute for a formal evaluation. It puts quality and performance numbers from the same vector set into one place, so the quality metrics and performance metrics do not come from different settings:
```bash
python scripts/benchmark_low_dim_sparse_retrieval.py \
  --device cuda \
  --num-items 1000000 \
  --num-queries 512 \
  --dim 2048 \
  --reduced-dims 32 64 128 256 \
  --sparse-topks 4 8 16 32 64 \
  --top-k 10 \
  --query-batch-size 64 \
  --warmup 10 \
  --repeats 50 \
  --output-json sparse_retrieval_benchmark.json
```
The script has two modes. Without real embeddings, it generates synthetic paired query/corpus embeddings, which is useful for checking the mechanics and performance paths of dense, projection, prefix, and sparse retrieval. This result is not a model-quality conclusion. To measure real quality, pass real embeddings or CSR sparse latents:
```bash
python scripts/benchmark_low_dim_sparse_retrieval.py \
  --device cuda \
  --corpus-file corpus_embeddings.pt \
  --query-file query_embeddings.pt \
  --target-file target_ids.pt \
  --sparse-corpus-file csr_corpus_sparse.pt \
  --sparse-query-file csr_query_sparse.pt
```
Code: low-dimensional dense / sparse retrieval benchmark
```python
from __future__ import annotations

import argparse
import json
import math
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any

import torch
import torch.nn.functional as F


@dataclass
class MethodResult:
    method: str
    dim: int
    active_dim: int
    recall_at_k_vs_dense: float | None
    paired_recall_at_k: float | None
    index_build_ms: float | None
    latency_ms: float
    qps: float
    storage_mb: float
    note: str
    latency_speedup_vs_dense: float | None = None
    qps_vs_dense: float | None = None
    storage_vs_dense: float | None = None


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=(
            "Benchmark dense, low-dimensional dense, prefix/MRL-proxy, and "
            "TopK sparse retrieval. Use real tensors when available; otherwise "
            "the script creates synthetic paired query/corpus embeddings."
        )
    )
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    parser.add_argument("--num-items", type=int, default=100_000)
    parser.add_argument("--num-queries", type=int, default=512)
    parser.add_argument("--dim", type=int, default=2048)
    parser.add_argument("--reduced-dims", type=int, nargs="+", default=[32, 64, 128, 256])
    parser.add_argument("--sparse-topks", type=int, nargs="+", default=[4, 8, 16, 32, 64])
    parser.add_argument("--top-k", type=int, default=10)
    parser.add_argument("--query-batch-size", type=int, default=64)
    parser.add_argument("--warmup", type=int, default=10)
    parser.add_argument("--repeats", type=int, default=50)
    parser.add_argument("--noise", type=float, default=0.05)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--dtype", choices=["float32", "float16", "bfloat16"], default="float32")
    parser.add_argument("--corpus-file", type=Path, default=None)
    parser.add_argument("--query-file", type=Path, default=None)
    parser.add_argument("--target-file", type=Path, default=None)
    parser.add_argument("--sparse-corpus-file", type=Path, default=None)
    parser.add_argument("--sparse-query-file", type=Path, default=None)
    parser.add_argument("--output-json", type=Path, default=None)
    return parser.parse_args()


def dtype_from_name(name: str) -> torch.dtype:
    return {
        "float32": torch.float32,
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
    }[name]


def load_tensor(path: Path) -> torch.Tensor:
    """Load a tensor, or the first tensor found under a known key in a dict."""
    value = torch.load(path, map_location="cpu")
    if isinstance(value, dict):
        for key in ("embeddings", "tensor", "data"):
            if key in value:
                value = value[key]
                break
    if not isinstance(value, torch.Tensor):
        raise TypeError(f"{path} must contain a tensor or a dict with tensor-like embeddings")
    return value


def load_sparse(path: Path) -> dict[str, torch.Tensor | int]:
    """Load precomputed sparse latents stored as {"indices", "values", "dim"}."""
    value = torch.load(path, map_location="cpu")
    if not isinstance(value, dict):
        raise TypeError(f"{path} must contain a dict with indices, values, and dim")
    required = {"indices", "values", "dim"}
    missing = required - set(value)
    if missing:
        raise KeyError(f"{path} is missing sparse keys: {sorted(missing)}")
    return {
        "indices": value["indices"].long(),
        "values": value["values"],
        "dim": int(value["dim"]),
    }


def make_synthetic(
    num_items: int,
    num_queries: int,
    dim: int,
    noise: float,
    seed: int,
    dtype: torch.dtype,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Random unit-norm corpus; each query is a noisy copy of one target item."""
    generator = torch.Generator(device="cpu").manual_seed(seed)
    corpus = torch.randn(num_items, dim, generator=generator, dtype=torch.float32)
    corpus = F.normalize(corpus, dim=1).to(dtype)
    target_ids = torch.randint(num_items, (num_queries,), generator=generator)
    query = corpus[target_ids].float()
    query = query + noise * torch.randn(query.shape, generator=generator)
    query = F.normalize(query, dim=1).to(dtype)
    return corpus, query, target_ids


def maybe_sync(device: torch.device) -> None:
    # CUDA kernels run asynchronously; synchronize before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def tensor_storage_mb(tensor: torch.Tensor) -> float:
    return tensor.numel() * tensor.element_size() / 1_000_000


def sparse_storage_mb(sparse: dict[str, torch.Tensor | int]) -> float:
    indices = sparse["indices"]
    values = sparse["values"]
    assert isinstance(indices, torch.Tensor)
    assert isinstance(values, torch.Tensor)
    return (indices.numel() * indices.element_size() + values.numel() * values.element_size()) / 1_000_000


@torch.no_grad()
def dense_topk(
    query: torch.Tensor,
    corpus: torch.Tensor,
    k: int,
    query_batch_size: int,
) -> torch.Tensor:
    """Exact dense search: batched matmul followed by top-k per query."""
    all_indices = []
    corpus_t = corpus.t().contiguous()
    for start in range(0, query.shape[0], query_batch_size):
        q = query[start : start + query_batch_size]
        scores = q @ corpus_t
        all_indices.append(torch.topk(scores, k=min(k, corpus.shape[0]), dim=1).indices)
    return torch.cat(all_indices, dim=0)


def time_call(fn: Any, warmup: int, repeats: int, device: torch.device) -> tuple[Any, float]:
    """Return the last result and the mean wall-clock time per call in milliseconds."""
    result = None
    for _ in range(warmup):
        result = fn()
    maybe_sync(device)
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn()
    maybe_sync(device)
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / repeats
    return result, elapsed_ms


def recall_vs_reference(found: torch.Tensor, reference: torch.Tensor) -> float:
    """Mean overlap between each method's top-k and the reference top-k."""
    hits = 0
    for row_found, row_ref in zip(found.cpu(), reference.cpu(), strict=True):
        hits += len(set(row_found.tolist()) & set(row_ref.tolist()))
    return hits / max(found.shape[0] * reference.shape[1], 1)


def paired_recall(found: torch.Tensor, target_ids: torch.Tensor | None) -> float | None:
    """Fraction of queries whose paired target id appears in the top-k."""
    if target_ids is None:
        return None
    target_ids = target_ids.cpu()
    hits = 0
    for row, target in zip(found.cpu(), target_ids, strict=True):
        hits += int(int(target) in set(row.tolist()))
    return hits / max(found.shape[0], 1)


def projected(
    corpus: torch.Tensor,
    query: torch.Tensor,
    out_dim: int,
    seed: int,
    dtype: torch.dtype,
    device: torch.device,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Untrained Gaussian random projection; a stand-in for a trained projection head."""
    generator = torch.Generator(device="cpu").manual_seed(seed + out_dim)
    scale = 1.0 / math.sqrt(out_dim)
    projection = torch.randn(corpus.shape[1], out_dim, generator=generator) * scale
    projection = projection.to(device=device, dtype=dtype)
    return (
        F.normalize(corpus @ projection, dim=1),
        F.normalize(query @ projection, dim=1),
    )


def prefix(corpus: torch.Tensor, query: torch.Tensor, out_dim: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Prefix truncation; exercises the MRL serving path without MRL training."""
    return F.normalize(corpus[:, :out_dim], dim=1), F.normalize(query[:, :out_dim], dim=1)


def to_topk_sparse(x: torch.Tensor, k: int) -> dict[str, torch.Tensor | int]:
    """Keep the k largest-magnitude entries per row as (indices, signed values)."""
    _, indices = torch.topk(x.abs(), k=min(k, x.shape[1]), dim=1)
    signed_values = torch.gather(x, 1, indices)
    signed_values = F.normalize(signed_values, dim=1)
    return {"indices": indices.long(), "values": signed_values, "dim": x.shape[1]}


def build_postings(
    sparse: dict[str, torch.Tensor | int],
    num_items: int,
    dim: int,
    device: torch.device,
) -> dict[str, torch.Tensor]:
    """Build a CSR-style inverted index: entries sorted by dimension plus per-dimension offsets."""
    indices = sparse["indices"]
    values = sparse["values"]
    assert isinstance(indices, torch.Tensor)
    assert isinstance(values, torch.Tensor)
    flat_dims = indices.reshape(-1).to(device=device)
    flat_values = values.reshape(-1).to(device=device)
    flat_items = (
        torch.arange(num_items, device=device)
        .repeat_interleave(indices.shape[1])
        .to(torch.long)
    )
    order = torch.argsort(flat_dims)
    flat_dims = flat_dims[order]
    flat_values = flat_values[order]
    flat_items = flat_items[order]
    counts = torch.bincount(flat_dims, minlength=dim)
    offsets = torch.zeros(dim + 1, dtype=torch.long, device=device)
    offsets[1:] = torch.cumsum(counts, dim=0)
    return {
        "items": flat_items,
        "values": flat_values,
        "offsets_cpu": offsets.cpu(),
    }


@torch.no_grad()
def sparse_topk_from_postings(
    query_sparse: dict[str, torch.Tensor | int],
    postings: dict[str, torch.Tensor],
    num_items: int,
    k: int,
) -> torch.Tensor:
    """Score one query at a time by accumulating over its active posting lists."""
    query_indices = query_sparse["indices"]
    query_values = query_sparse["values"]
    assert isinstance(query_indices, torch.Tensor)
    assert isinstance(query_values, torch.Tensor)
    device = query_values.device
    rows = []
    query_indices_cpu = query_indices.cpu()
    offsets_cpu = postings["offsets_cpu"]
    items = postings["items"]
    values = postings["values"]
    for row_indices, row_values in zip(query_indices_cpu, query_values, strict=True):
        # Dense score buffer of length N; only items on visited posting lists become nonzero.
        scores = torch.zeros(num_items, dtype=query_values.dtype, device=device)
        for dim_id, query_value in zip(row_indices, row_values, strict=True):
            dim_int = int(dim_id)
            start = int(offsets_cpu[dim_int])
            end = int(offsets_cpu[dim_int + 1])
            if end > start:
                scores.index_add_(0, items[start:end], values[start:end] * query_value)
        rows.append(torch.topk(scores, k=min(k, num_items), dim=0).indices)
    return torch.stack(rows, dim=0)


def format_markdown(results: list[MethodResult]) -> str:
    headers = [
        "method",
        "dim",
        "active",
        "recall@k vs dense",
        "paired recall@k",
        "build ms",
        "latency ms",
        "latency x dense",
        "qps",
        "qps x dense",
        "storage MB",
        "storage x dense",
        "note",
    ]
    lines = ["| " + " | ".join(headers) + " |", "| " + " | ".join(["---"] * len(headers)) + " |"]
    for result in results:
        lines.append(
            "| "
            + " | ".join(
                [
                    result.method,
                    str(result.dim),
                    str(result.active_dim),
                    "n/a" if result.recall_at_k_vs_dense is None else f"{result.recall_at_k_vs_dense:.4f}",
                    "n/a" if result.paired_recall_at_k is None else f"{result.paired_recall_at_k:.4f}",
                    "n/a" if result.index_build_ms is None else f"{result.index_build_ms:.2f}",
                    f"{result.latency_ms:.2f}",
                    "n/a" if result.latency_speedup_vs_dense is None else f"{result.latency_speedup_vs_dense:.2f}x",
                    f"{result.qps:.1f}",
                    "n/a" if result.qps_vs_dense is None else f"{result.qps_vs_dense:.2f}x",
                    f"{result.storage_mb:.2f}",
                    "n/a" if result.storage_vs_dense is None else f"{result.storage_vs_dense:.4f}x",
                    result.note,
                ]
            )
            + " |"
        )
    return "\n".join(lines)


def add_relative_metrics(results: list[MethodResult]) -> None:
    """Fill the 'x dense' columns relative to the full_dense row."""
    baseline = next(result for result in results if result.method == "full_dense")
    for result in results:
        result.latency_speedup_vs_dense = baseline.latency_ms / result.latency_ms
        result.qps_vs_dense = result.qps / baseline.qps
        result.storage_vs_dense = result.storage_mb / baseline.storage_mb


def main() -> None:
    args = parse_args()
    device = torch.device(args.device)
    dtype = dtype_from_name(args.dtype)

    if args.corpus_file and args.query_file:
        corpus = load_tensor(args.corpus_file)
        query = load_tensor(args.query_file)
        target_ids = load_tensor(args.target_file).long() if args.target_file else None
    else:
        corpus, query, target_ids = make_synthetic(
            args.num_items,
            args.num_queries,
            args.dim,
            args.noise,
            args.seed,
            dtype,
        )

    corpus = F.normalize(corpus.to(device=device, dtype=dtype), dim=1)
    query = F.normalize(query.to(device=device, dtype=dtype), dim=1)
    if target_ids is not None:
        target_ids = target_ids.cpu()
    if corpus.shape[1] != query.shape[1]:
        raise ValueError(f"corpus dim {corpus.shape[1]} != query dim {query.shape[1]}")

    results: list[MethodResult] = []

    # Exact dense baseline; its top-k is the reference for "recall@k vs dense".
    full_indices, full_latency_ms = time_call(
        lambda: dense_topk(query, corpus, args.top_k, args.query_batch_size),
        args.warmup,
        args.repeats,
        device,
    )
    results.append(
        MethodResult(
            method="full_dense",
            dim=corpus.shape[1],
            active_dim=corpus.shape[1],
            recall_at_k_vs_dense=1.0,
            paired_recall_at_k=paired_recall(full_indices, target_ids),
            index_build_ms=None,
            latency_ms=full_latency_ms,
            qps=query.shape[0] / (full_latency_ms / 1000.0),
            storage_mb=tensor_storage_mb(corpus),
            note="exact dense baseline",
        )
    )

    for reduced_dim in args.reduced_dims:
        if reduced_dim > corpus.shape[1]:
            continue

        p_corpus, p_query = projected(corpus, query, reduced_dim, args.seed, dtype, device)
        p_indices, p_latency_ms = time_call(
            lambda: dense_topk(p_query, p_corpus, args.top_k, args.query_batch_size),
            args.warmup,
            args.repeats,
            device,
        )
        results.append(
            MethodResult(
                method=f"projection_{reduced_dim}",
                dim=reduced_dim,
                active_dim=reduced_dim,
                recall_at_k_vs_dense=recall_vs_reference(p_indices, full_indices),
                paired_recall_at_k=paired_recall(p_indices, target_ids),
                index_build_ms=None,
                latency_ms=p_latency_ms,
                qps=query.shape[0] / (p_latency_ms / 1000.0),
                storage_mb=tensor_storage_mb(p_corpus),
                note="random projection; replace with trained projection for model quality",
            )
        )

        m_corpus, m_query = prefix(corpus, query, reduced_dim)
        m_indices, m_latency_ms = time_call(
            lambda: dense_topk(m_query, m_corpus, args.top_k, args.query_batch_size),
            args.warmup,
            args.repeats,
            device,
        )
        results.append(
            MethodResult(
                method=f"prefix_mrl_proxy_{reduced_dim}",
                dim=reduced_dim,
                active_dim=reduced_dim,
                recall_at_k_vs_dense=recall_vs_reference(m_indices, full_indices),
                paired_recall_at_k=paired_recall(m_indices, target_ids),
                index_build_ms=None,
                latency_ms=m_latency_ms,
                qps=query.shape[0] / (m_latency_ms / 1000.0),
                storage_mb=tensor_storage_mb(m_corpus),
                note="prefix mechanics only; real MRL requires trained nested embeddings",
            )
        )

    # Precomputed sparse latents have a fixed k, so benchmark them once
    # instead of repeating the same run for every --sparse-topks value.
    use_sparse_files = bool(args.sparse_corpus_file and args.sparse_query_file)
    sparse_topks = [None] if use_sparse_files else args.sparse_topks
    for sparse_topk in sparse_topks:
        if use_sparse_files:
            sparse_corpus = load_sparse(args.sparse_corpus_file)
            sparse_query = load_sparse(args.sparse_query_file)
        else:
            sparse_corpus = to_topk_sparse(corpus, sparse_topk)
            sparse_query = to_topk_sparse(query, sparse_topk)
        sparse_corpus = {
            "indices": sparse_corpus["indices"].to(device=device),
            "values": sparse_corpus["values"].to(device=device, dtype=dtype),
            "dim": int(sparse_corpus["dim"]),
        }
        sparse_query = {
            "indices": sparse_query["indices"].to(device=device),
            "values": sparse_query["values"].to(device=device, dtype=dtype),
            "dim": int(sparse_query["dim"]),
        }
        # Report the active count actually stored, not the requested one.
        active_dim = int(sparse_corpus["indices"].shape[1])

        maybe_sync(device)
        start = time.perf_counter()
        postings = build_postings(sparse_corpus, corpus.shape[0], int(sparse_corpus["dim"]), device)
        maybe_sync(device)
        build_ms = (time.perf_counter() - start) * 1000.0

        sparse_indices, sparse_latency_ms = time_call(
            lambda: sparse_topk_from_postings(sparse_query, postings, corpus.shape[0], args.top_k),
            args.warmup,
            args.repeats,
            device,
        )
        results.append(
            MethodResult(
                method=f"sparse_topk_{active_dim}",
                dim=int(sparse_corpus["dim"]),
                active_dim=active_dim,
                recall_at_k_vs_dense=recall_vs_reference(sparse_indices, full_indices),
                paired_recall_at_k=paired_recall(sparse_indices, target_ids),
                index_build_ms=build_ms,
                latency_ms=sparse_latency_ms,
                qps=query.shape[0] / (sparse_latency_ms / 1000.0),
                storage_mb=sparse_storage_mb(sparse_corpus),
                note="TopK sparse retrieval; use real CSR latents for CSR model quality",
            )
        )

    add_relative_metrics(results)
    print(format_markdown(results))
    if args.output_json:
        args.output_json.write_text(
            json.dumps([asdict(result) for result in results], indent=2),
            encoding="utf-8",
        )


if __name__ == "__main__":
    main()
```
The output has two groups of metrics. recall@k vs dense measures how much the method’s top-k overlaps with the full dense top-k. paired recall@k checks whether the target item appears in the top-k, using either synthetic or real target ids. The performance columns report latency, QPS, storage MB, and sparse posting-index build time.
Model quality should be read separately from the paper experiments. The table below comes from the CSRv2 paper’s e5-Mistral-7B comparison: the same backbone and training configuration are used to compare MRL, CSR, and CSRv2 across six MTEB task types. The table keeps only the average to show the trend across active dimensions.
| active dim / k | MRL avg | CSR avg | CSRv2-linear avg | CSRv2 avg | Note |
|---|---|---|---|---|---|
| 64 | 61.86 | 66.68 | 67.58 | 68.08 | At the same active count, CSR/CSRv2 have higher average scores |
| 16 | 51.93 | 62.83 | 64.26 | 65.76 | Sparse latents retain higher average scores at low active counts |
| 4 | 40.83 | 52.94 | 58.62 | 61.01 | CSRv2’s training changes matter more in the ultra-sparse regime |
| 2 | 33.81 | 44.33 | 53.35 | 58.38 | The very low-active-count regime emphasized by the CSRv2 paper |
This table cannot replace quality evaluation on your own workload. What it does show is that trained sparse latents behave very differently from an untrained TopK proxy, so CSR/CSRv2 model quality cannot be inferred from the recall@k vs dense numbers in the synthetic benchmark below. The numbers below are mainly a mechanics and performance check: dense baseline, random projection, prefix truncation, and TopK sparse retrieval measured on the same machine for latency, QPS, and storage.
All relative columns use full_dense as the baseline. latency x dense above 1.0x means faster than full dense; below 1.0x means slower. A smaller storage x dense means lower storage use. The full script is included in the collapsed code block above; add_relative_metrics computes these relative metrics.
| method | dim | active | recall@k vs dense | paired recall@k | build ms | latency ms | latency x dense | qps | qps x dense | storage MB | storage x dense | note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| full_dense | 2048 | 2048 | 1.0000 | 1.0000 | n/a | 71.44 | 1.00x | 7167.0 | 1.00x | 8192.00 | 1.0000x | exact dense baseline |
| projection_32 | 32 | 32 | 0.0014 | 0.0137 | n/a | 8.89 | 8.04x | 57577.7 | 8.03x | 128.00 | 0.0156x | random projection; replace with trained projection for model quality |
| prefix_mrl_proxy_32 | 32 | 32 | 0.0021 | 0.0195 | n/a | 8.89 | 8.04x | 57621.4 | 8.04x | 128.00 | 0.0156x | prefix mechanics only; real MRL requires trained nested embeddings |
| projection_64 | 64 | 64 | 0.0193 | 0.1914 | n/a | 10.79 | 6.62x | 47431.4 | 6.62x | 256.00 | 0.0313x | random projection; replace with trained projection for model quality |
| prefix_mrl_proxy_64 | 64 | 64 | 0.0139 | 0.1367 | n/a | 10.82 | 6.60x | 47308.5 | 6.60x | 256.00 | 0.0313x | prefix mechanics only; real MRL requires trained nested embeddings |
| projection_128 | 128 | 128 | 0.0631 | 0.6211 | n/a | 13.78 | 5.18x | 37156.0 | 5.18x | 512.00 | 0.0625x | random projection; replace with trained projection for model quality |
| prefix_mrl_proxy_128 | 128 | 128 | 0.0678 | 0.6738 | n/a | 13.74 | 5.20x | 37264.3 | 5.20x | 512.00 | 0.0625x | prefix mechanics only; real MRL requires trained nested embeddings |
| projection_256 | 256 | 256 | 0.0994 | 0.9805 | n/a | 19.62 | 3.64x | 26092.6 | 3.64x | 1024.00 | 0.1250x | random projection; replace with trained projection for model quality |
| prefix_mrl_proxy_256 | 256 | 256 | 0.1000 | 0.9922 | n/a | 19.71 | 3.62x | 25971.5 | 3.62x | 1024.00 | 0.1250x | prefix mechanics only; real MRL requires trained nested embeddings |
| sparse_topk_4 | 2048 | 4 | 0.0002 | 0.0000 | 61.53 | 128.40 | 0.56x | 3987.6 | 0.56x | 48.00 | 0.0059x | TopK sparse retrieval; use real CSR latents for CSR model quality |
| sparse_topk_8 | 2048 | 8 | 0.0014 | 0.0117 | 2.26 | 143.58 | 0.50x | 3565.9 | 0.50x | 96.00 | 0.0117x | TopK sparse retrieval; use real CSR latents for CSR model quality |
| sparse_topk_16 | 2048 | 16 | 0.0043 | 0.0430 | 4.12 | 203.11 | 0.35x | 2520.8 | 0.35x | 192.00 | 0.0234x | TopK sparse retrieval; use real CSR latents for CSR model quality |
| sparse_topk_32 | 2048 | 32 | 0.0150 | 0.1504 | 7.69 | 361.48 | 0.20x | 1416.4 | 0.20x | 384.00 | 0.0469x | TopK sparse retrieval; use real CSR latents for CSR model quality |
| sparse_topk_64 | 2048 | 64 | 0.0523 | 0.5234 | 14.90 | 676.08 | 0.11x | 757.3 | 0.11x | 768.00 | 0.0938x | TopK sparse retrieval; use real CSR latents for CSR model quality |
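These numbers are better read as an engineering sanity check, not a model-quality conclusion. On synthetic data, each query has only one structurally close item (its paired target), and the rest of the dense top-10 is decided by noise, so recall@k vs dense tops out near 0.1 even when paired recall@k is close to 1.0. The recall of random projection and the prefix proxy therefore does not represent the real performance of trained projection or MRL. The sparse TopK latency also shows that this sparse retrieval path has not realized the theoretical advantage: the script scores one query at a time in a Python loop, fills a dense score buffer of length N, and runs top-k over it, so it is slower than the batched dense matmul at every k tested. The build time for sparse_topk_4 is likely inflated by one-time warmup, since it is the first sparse run. A real CSR/CSRv2 evaluation needs trained sparse latents and should compare quality and latency together.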
Method Selection
Projection is a good baseline. It is simple to implement and simple to deploy. The tradeoff is that a small dense vector becomes the information bottleneck. If the task has enough redundancy, or if the teacher embedding has more dimensions than the task really needs, projection may be sufficient.
MRL is a good fit when one model needs to serve multiple budget tiers. A single embedding can be truncated to 32/64/128/256 dimensions, and the downstream interface remains dense search. Its training constraint is stronger: the model must put useful information early. At very short lengths, prefix capacity is a hard limit.
CSR is a fit when the retrieval path can change and the system can exploit sparsity. It shifts “low cost” from low dimensionality to low activation count: the latent space can be large, but each sample uses only a few active features in computation. CSRv2 improves training stability and makes lower active counts more plausible, but it still depends heavily on the sparse retrieval implementation. If the sparse index, kernels, and serving path do not turn sparsity into real latency or memory gains, it is not a direct replacement for dense embeddings.
A conservative evaluation order is: first measure the full dense upper bound, then trained projection and MRL prefixes, and finally CSR sparse latents. At every step, measure quality and performance together. Low-dimensional representation is not the goal by itself. The goal is acceptable retrieval quality under a latency, memory, and cost budget.
References
- Kusupati et al., Matryoshka Representation Learning, NeurIPS 2022.
- Wen et al., Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation, 2025.
- CSRv2 authors, CSRv2: Unlocking Ultra-Sparse Embeddings, 2026.
- Muennighoff et al., MTEB: Massive Text Embedding Benchmark, 2022.
- Hugging Face model cards: all-mpnet-base-v2, bge-large-en-v1.5, e5-mistral-7b-instruct, NV-Embed-v2, Qwen3-Embedding-8B.