Low-Dimensional Representations: Projection, MRL, and Sparse Representations

When working on embeddings, it is natural to focus first on the model itself: use a larger backbone, add more data, and push MTEB or internal evals a little higher. The output dimension tends to move up with the model. 384 and 768 dimensions are still common, but 2048- and 4096-dimensional text and multimodal embeddings are no longer unusual.

The table below is not meant to be a strict leaderboard timeline. MTEB later split into English, multilingual, and versioned leaderboards, and the rankings keep changing. Still, a few representative strong models show the direction of travel:

Year	Representative model	Output dim	Note
2022	sentence-transformers/all-mpnet-base-v2	768	A widely used early sentence-transformers baseline
2023	BAAI/bge-large-en-v1.5	1024	The BGE series ranked highly on MTEB / C-MTEB at release
2024	intfloat/e5-mistral-7b-instruct	4096	LLM backbones became a major line of embedding models
2024	nvidia/NV-Embed-v2	4096	The model card reports the No. 1 MTEB score across 56 tasks as of 2024-08-30
2025	Qwen/Qwen3-Embedding-8B	4096	Supports user-selectable output dimensions from 32 to 4096; 4096 is both the default and the maximum

In offline experiments, the cost of higher dimensions is often muted. Add some GPU memory, add disk, reduce the batch size, and the experiment can keep running. Indexing and serving feel the pressure more directly. A 4096-dimensional float32 vector is 16 KB. One billion of them take about 16 TB before index structures, replicas, caches, or metadata. At query time, the system still has to find nearest neighbors over those vectors under a tight latency budget.
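As a quick check on the arithmetic above, here is a minimal sketch of raw vector storage. The function name raw_vector_bytes is hypothetical, and it counts only the vectors themselves, not index structures, replicas, or metadata.

```python
def raw_vector_bytes(num_items: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the dense vectors alone (float32 by default)."""
    return num_items * dim * bytes_per_value


one_vector = raw_vector_bytes(1, 4096)
full_corpus = raw_vector_bytes(1_000_000_000, 4096)
short_corpus = raw_vector_bytes(1_000_000_000, 256)

print(one_vector / 1024)    # 16.0 (KiB per vector)
print(full_corpus / 1e12)   # 16.384 (TB for one billion vectors)
print(short_corpus / 1e12)  # 1.024 (TB at 256 dimensions)
```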

The question is not just whether vectors can be made smaller. It is more specific: in a retrieval system that already has a quality bar, which representations reduce storage, bandwidth, and compute, and how much does recall change in exchange?

This post compares three common routes: projection to a lower dimension, Matryoshka Representation Learning (MRL), and Contrastive Sparse Representation (CSR). The first two shorten a dense vector. CSR can keep a large representation space, but each sample uses only a small number of positions.

Representation Forms

Start with what the retrieval system receives at inference time. The training details come later.

Projection is the most direct form. The original dense embedding is x in R^d; a linear layer or MLP maps it to z in R^m, where m << d. The retrieval system does not need to care how the mapping was produced. It still sees a dense vector, just with fewer dimensions.
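The projection interface can be sketched in a few lines. This is a minimal illustration using NumPy, with a random matrix W standing in for a trained linear layer; the names project, W, d, and m are assumptions for the example, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1024, 64  # original and reduced dimensions, m << d

# Stand-in for a learned projection; a real system would load trained weights.
W = rng.standard_normal((d, m)) / np.sqrt(m)


def project(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Map (n, d) embeddings to unit-length (n, m) embeddings."""
    z = x @ weight
    return z / np.linalg.norm(z, axis=1, keepdims=True)


x = rng.standard_normal((5, d))
z = project(x, W)
print(z.shape)  # (5, 64)
```

The retrieval system only ever sees z, so the same dense index code works unchanged.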

MRL also keeps the dense-vector interface. The difference is that it does not simply chop off a prefix after training. During training, it makes the first 32, 64, and 128 dimensions work on their own. At inference time, a tight budget uses a short prefix; a looser budget uses a longer one. This design is compatible with existing dense KNN / ANN systems.
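At inference time, picking a budget tier amounts to slicing a prefix and re-normalizing. The sketch below assumes the embeddings come from an MRL-trained model; random vectors are used here only as a placeholder, and truncate_prefix is a hypothetical helper name.

```python
import numpy as np


def truncate_prefix(embeddings: np.ndarray, m: int) -> np.ndarray:
    """Keep the first m dimensions and re-normalize for cosine search."""
    prefix = embeddings[:, :m]
    return prefix / np.linalg.norm(prefix, axis=1, keepdims=True)


rng = np.random.default_rng(0)
full = rng.standard_normal((4, 256))  # placeholder for MRL-trained embeddings

tiers = {m: truncate_prefix(full, m) for m in (32, 64, 128)}
print({m: v.shape for m, v in tiers.items()})
# {32: (4, 32), 64: (4, 64), 128: (4, 128)}
```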

CSR takes a different route. It first maps the original embedding into a larger latent space, say h dimensions, then keeps only the TopK nonzero values. Storage does not keep a full dense vector. It keeps indices and values. If k = 16, each item stores 16 positions and 16 values. This representation only becomes useful when the retrieval system also uses sparse retrieval; otherwise sparsity stays in the complexity formula rather than the system.
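The storage format described above can be sketched as follows. This is an illustration of the indices-plus-values layout only; the latent here is random rather than produced by a trained CSR encoder, and the int32/float32 choice and the name to_sparse_topk are assumptions.

```python
import numpy as np


def to_sparse_topk(latent: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (indices, values) of the k largest activations in each row."""
    idx = np.argpartition(-latent, k - 1, axis=1)[:, :k]
    idx.sort(axis=1)  # sorted positions are easier to merge and compress
    vals = np.take_along_axis(latent, idx, axis=1)
    return idx.astype(np.int32), vals.astype(np.float32)


rng = np.random.default_rng(0)
h, k = 8192, 16
latent = rng.standard_normal((3, h)).astype(np.float32)

indices, values = to_sparse_topk(latent, k)
bytes_per_item = indices[0].nbytes + values[0].nbytes
print(indices.shape, values.shape, bytes_per_item)  # (3, 16) (3, 16) 128
```

With these dtypes, k = 16 costs 128 bytes per item, compared with 32,768 bytes for the full 8192-dimensional float32 latent.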

Figure: a visual comparison of the representation forms. Projection compresses the vector into a short dense representation; MRL uses prefixes; CSR/CSRv2 still represent samples in a high-dimensional space but keep only a few nonzero positions.

MRL and CSR use active count in different ways. In MRL, m=64 means “use the first 64 dense dimensions.” In CSR, k=64 means “activate 64 positions in a potentially much larger latent space.” The two settings may have similar cost, but they organize information very differently.

Training Objectives

Projection does not prescribe one fixed training recipe. You can train the low-dimensional vector directly with the original retrieval loss, such as InfoNCE. You can also distill from a high-dimensional teacher by matching its similarity distribution. Another option is to add a reconstruction loss so the low-dimensional vector preserves information from the original embedding. Regardless of the recipe, the information has to fit into m dimensions.

MRL has a clearer constraint. The original Matryoshka Representation Learning applies losses at multiple truncation lengths. It tells the model not to waste the early dimensions, because short vectors must also be usable.

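For a retrieval task, with z_{1:m} denoting the first m dimensions of an embedding z and L_ret the retrieval loss, MRL can be written as:

$$\mathcal{L}_{\text{MRL}} = \sum_{m \in M} \mathcal{L}_{\text{ret}}\left(z_{1:m}\right)$$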

Here M is a set of truncation lengths, such as {32, 64, 128, 256}. Training computes a retrieval loss at each length. Inference chooses one length. Deployment stays simple because the system still runs dense vector search. The cost is also clear: the model must learn to place useful information early. The shorter the prefix, the less information it can hold.
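The multi-length objective described above can be sketched with an in-batch InfoNCE loss applied to each prefix. This is a forward-pass illustration in NumPy, assuming equal weights across lengths and a temperature of 0.05; a real training loop would use an autograd framework, and the function names are hypothetical.

```python
import numpy as np


def info_nce(q, c, temperature=0.05):
    """In-batch InfoNCE: row i of q should match row i of c."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    logits = q @ c.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def mrl_loss(q, c, lengths=(32, 64, 128, 256)):
    """Sum the retrieval loss over every truncation length."""
    per_length = {m: info_nce(q[:, :m], c[:, :m]) for m in lengths}
    return sum(per_length.values()), per_length


rng = np.random.default_rng(0)
c = rng.standard_normal((8, 256))             # placeholder corpus embeddings
q = c + 0.1 * rng.standard_normal((8, 256))   # paired queries with noise
total, per_length = mrl_loss(q, c)
print(sorted(per_length))  # [32, 64, 128, 256]
```

Because every prefix contributes its own loss term, gradients push useful information toward the early dimensions.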

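CSR training looks more like a sparse autoencoder plus a task constraint. The paper Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation starts from a pretrained dense embedding and trains a sparse module that maps it into a TopK latent representation. In simplified form, with z the TopK sparse latent of the embedding x, x_hat the decoder's reconstruction of x from z, and lambda a weighting coefficient, the objective can be written as:

$$\mathcal{L}_{\text{CSR}} = \mathcal{L}_{\text{recon}}(x, \hat{x}) + \lambda \, \mathcal{L}_{\text{cl}}(z)$$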

The reconstruction loss preserves information from the original embedding. The contrastive loss makes the sparse latent representation useful for retrieval or classification. CSR differs from low-dimensional dense methods because it does not force every sample into the same small coordinate set. The model can have a large latent dictionary, while each sample selects only a few positions.

CSR training can be split into two constraints: the decoder should reconstruct the original embedding from the sparse latent representation, and the task loss should keep the sparse latent representation useful for retrieval.
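The two constraints can be sketched as a forward pass. This is a simplified illustration, not the paper's implementation: it assumes a single linear encoder with ReLU and TopK, a linear decoder, a mean-squared reconstruction term, and an in-batch InfoNCE task term, with random weights in place of trained ones. All names and the task_weight parameter are assumptions.

```python
import numpy as np


def topk_activation(h: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything except the k largest entries in each row."""
    idx = np.argpartition(-h, k - 1, axis=1)[:, :k]
    z = np.zeros_like(h)
    np.put_along_axis(z, idx, np.take_along_axis(h, idx, axis=1), axis=1)
    return z


def encode(x, w_enc, b_enc, k):
    """Dense embedding -> ReLU latent -> TopK sparse latent."""
    return topk_activation(np.maximum(x @ w_enc + b_enc, 0.0), k)


def info_nce(a, b, temperature=0.05, eps=1e-8):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def csr_style_loss(x_q, x_c, w_enc, b_enc, w_dec, k, task_weight=1.0):
    z_q, z_c = encode(x_q, w_enc, b_enc, k), encode(x_c, w_enc, b_enc, k)
    # Constraint 1: the decoder should reconstruct the original embedding.
    recon = np.mean((x_q - z_q @ w_dec) ** 2) + np.mean((x_c - z_c @ w_dec) ** 2)
    # Constraint 2: the sparse latents should still match query to item.
    task = info_nce(z_q, z_c)
    return float(recon + task_weight * task), z_q


rng = np.random.default_rng(0)
d, h, k = 64, 512, 8
w_enc = rng.standard_normal((d, h)) / np.sqrt(d)
b_enc = np.zeros(h)
w_dec = rng.standard_normal((h, d)) / np.sqrt(h)
x_c = rng.standard_normal((6, d))
x_q = x_c + 0.1 * rng.standard_normal((6, d))

loss, z_q = csr_style_loss(x_q, x_c, w_enc, b_enc, w_dec, k)
print((z_q != 0).sum(axis=1))  # at most k nonzeros per row
```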

CSRv2 fits into this picture. It does not introduce a different representation. It addresses CSR’s training problems. In the ultra-sparse region, especially k=2 and k=4, the original CSR can produce dead neurons: many latent dimensions are rarely selected, so the model appears to have a large latent space but has much less usable capacity.
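One simple way to check for this problem is to count how often each latent position is selected across a dataset. The helper below is a hypothetical diagnostic, not something taken from the CSR or CSRv2 code; it treats a position as dead if it is never active in the sample.

```python
import numpy as np


def dead_fraction(indices: np.ndarray, latent_dim: int) -> tuple[float, np.ndarray]:
    """Fraction of latent positions never selected, plus per-position counts."""
    counts = np.bincount(indices.reshape(-1), minlength=latent_dim)
    return float(np.mean(counts == 0)), counts


# Three items with k=2 active positions each, in an 8-dimensional latent space.
active = np.array([[0, 1], [0, 1], [0, 2]])
dead, counts = dead_fraction(active, latent_dim=8)
print(dead)             # 0.625
print(counts.tolist())  # [3, 2, 1, 0, 0, 0, 0, 0]
```

Here only three of eight positions are ever used, so the effective capacity is far smaller than the nominal latent dimension.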

CSRv2 mainly changes training. It uses k-annealing: start with a larger k, then gradually move down to the target sparsity so the model is not constrained by a tiny active set from the beginning. It also adds supervised contrastive signals so the few active features serve downstream tasks more directly. For cross-domain settings, the paper also discusses full finetuning. A cautious reading is that CSRv2 does not prove “sparse is always better.” It shows that some ultra-sparse failures come from training collapse and should not be blamed only on sparse representations.
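A k-annealing schedule can be sketched as a function from training step to active count. The text above only says that k starts larger and moves gradually down to the target; the geometric interpolation used here is an assumption for illustration, and the paper's actual schedule may differ.

```python
def annealed_k(step: int, total_steps: int, k_start: int, k_target: int) -> int:
    """Active count at a given step, decaying geometrically to k_target."""
    if total_steps <= 1:
        return k_target
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    k = k_start * (k_target / k_start) ** frac
    return max(k_target, round(k))


schedule = [annealed_k(s, 100, k_start=64, k_target=2) for s in range(100)]
print(schedule[0], schedule[-1])  # 64 2
```

During training, the TopK operator would use annealed_k(step, ...) in place of a fixed k, so early updates can spread across more latent positions.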

Evaluation

Complexity analysis needs measurements. A low-cost representation has to answer at least two questions:

How do quality metrics change?
How much speed and storage does it save?

These two questions should be evaluated separately. Quality can be measured with retrieval Recall, nDCG, or MRR; it can also be approximated by overlap with the full dense top-k. A performance report needs storage, index build time, query latency, and QPS, together with the batch size, hardware, and kernel details under which they were measured. Reporting only the dimension or k is incomplete.
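For reference, the three quality metrics named above can be computed per query as follows. This is a minimal sketch with binary relevance; the function names are hypothetical, and a real evaluation would average these over all queries.

```python
import math


def recall_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Share of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(ranked: list[int], relevant: set[int]) -> float:
    """Reciprocal rank of the first relevant item, or 0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0


ranked = [7, 3, 9, 1]   # retrieved ids, best first
relevant = {3, 1}       # ground-truth ids for this query

print(recall_at_k(ranked, relevant, k=3))  # 0.5
print(mrr(ranked, relevant))               # 0.5
print(round(ndcg_at_k(ranked, relevant, k=4), 4))
```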

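With Q the query matrix and C the corpus matrix, dense retrieval has the basic form:

$$S = Q C^{\top}$$

followed by a top-k selection over each row of S.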

If the query matrix is B x d and the corpus matrix is N x d, the main cost of exact dense search is the B x N x d multiply-add work and the corresponding memory reads. MRL and projection still use this form, with d replaced by a smaller m.
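The multiply-add count is easy to work out for concrete sizes. The sketch below uses the benchmark's settings of 512 queries and one million items as an example; dense_multiply_adds is a hypothetical helper, and the count ignores memory traffic and the top-k selection.

```python
def dense_multiply_adds(num_queries: int, num_items: int, dim: int) -> int:
    """Multiply-add operations for exact dense scoring (B x N x d)."""
    return num_queries * num_items * dim


full = dense_multiply_adds(512, 1_000_000, 2048)
short = dense_multiply_adds(512, 1_000_000, 64)
print(full, short, full // short)  # 1048576000000 32768000000 32
```

The arithmetic drops 32x from 2048 to 64 dimensions, but the B x N score matrix and the top-k selection over it do not shrink with the dimension. That is one likely reason the 64-dimensional rows in the synthetic table later in this post run about 6.6x faster than full dense rather than 32x.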

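CSR’s ideal path is different. Both query and corpus are TopK sparse. Writing A(q) and A(c) for the sets of active positions of a query and a corpus item, similarity only needs to accumulate over shared active dimensions:

$$s(q, c) = \sum_{j \in A(q) \cap A(c)} q_j \, c_j$$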

If the sparse index is implemented well, the cost is closer to the number of active features and posting-list accesses, rather than the full latent dimension. There is an engineering cost too. Position indices are not free, and neither are scatter / index_add operations. When k is very small, kernel launch overhead and non-contiguous memory access can consume part of the theoretical gain.
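The posting-list idea can be illustrated in plain Python before worrying about kernels. This is a toy sketch in which each vector is a dict from active position to value; the names sparse_dot, build_postings, and search are hypothetical, and a production index would use packed arrays rather than dicts.

```python
from collections import defaultdict


def sparse_dot(q: dict[int, float], c: dict[int, float]) -> float:
    """Dot product that touches only positions active in both vectors."""
    if len(q) > len(c):
        q, c = c, q
    return sum(value * c[pos] for pos, value in q.items() if pos in c)


def build_postings(corpus: list[dict[int, float]]) -> dict[int, list[tuple[int, float]]]:
    """Inverted index: latent position -> list of (item id, value)."""
    postings = defaultdict(list)
    for item_id, vec in enumerate(corpus):
        for pos, value in vec.items():
            postings[pos].append((item_id, value))
    return postings


def search(query: dict[int, float], postings, top_k: int) -> list[tuple[int, float]]:
    """Accumulate scores only for items that share an active position."""
    scores = defaultdict(float)
    for pos, q_value in query.items():
        for item_id, c_value in postings.get(pos, ()):
            scores[item_id] += q_value * c_value
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = [{1: 0.5, 7: 0.5}, {2: 1.0}, {7: 1.0, 9: 0.2}]
query = {7: 1.0, 9: 1.0}

postings = build_postings(corpus)
hits = search(query, postings, top_k=2)
print([item for item, _ in hits])  # [2, 0]
```

Item 1 shares no active position with the query, so it is never touched. The work scales with the lengths of the posting lists visited, not with the latent dimension.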

The following is a reproducible benchmark script. It is not a substitute for a formal evaluation. It puts quality and performance numbers from the same vector set into one place, so the quality metrics and performance metrics do not come from different settings:

python scripts/benchmark_low_dim_sparse_retrieval.py \
  --device cuda \
  --num-items 1000000 \
  --num-queries 512 \
  --dim 2048 \
  --reduced-dims 32 64 128 256 \
  --sparse-topks 4 8 16 32 64 \
  --top-k 10 \
  --query-batch-size 64 \
  --warmup 10 \
  --repeats 50 \
  --output-json sparse_retrieval_benchmark.json

The script has two modes. Without real embeddings, it generates synthetic paired query/corpus embeddings, which is useful for checking the mechanics and performance paths of dense, projection, prefix, and sparse retrieval. This result is not a model-quality conclusion. To measure real quality, pass real embeddings or CSR sparse latents:

python scripts/benchmark_low_dim_sparse_retrieval.py \
  --device cuda \
  --corpus-file corpus_embeddings.pt \
  --query-file query_embeddings.pt \
  --target-file target_ids.pt \
  --sparse-corpus-file csr_corpus_sparse.pt \
  --sparse-query-file csr_query_sparse.pt

Code: low-dimensional dense / sparse retrieval benchmark

from __future__ import annotations

import argparse
import json
import math
import time
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any

import torch
import torch.nn.functional as F


@dataclass
class MethodResult:
    method: str
    dim: int
    active_dim: int
    recall_at_k_vs_dense: float | None
    paired_recall_at_k: float | None
    index_build_ms: float | None
    latency_ms: float
    qps: float
    storage_mb: float
    note: str
    latency_speedup_vs_dense: float | None = None
    qps_vs_dense: float | None = None
    storage_vs_dense: float | None = None


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=(
            "Benchmark dense, low-dimensional dense, prefix/MRL-proxy, and "
            "TopK sparse retrieval. Use real tensors when available; otherwise "
            "the script creates synthetic paired query/corpus embeddings."
        )
    )
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    parser.add_argument("--num-items", type=int, default=100_000)
    parser.add_argument("--num-queries", type=int, default=512)
    parser.add_argument("--dim", type=int, default=2048)
    parser.add_argument("--reduced-dims", type=int, nargs="+", default=[32, 64, 128, 256])
    parser.add_argument("--sparse-topks", type=int, nargs="+", default=[4, 8, 16, 32, 64])
    parser.add_argument("--top-k", type=int, default=10)
    parser.add_argument("--query-batch-size", type=int, default=64)
    parser.add_argument("--warmup", type=int, default=10)
    parser.add_argument("--repeats", type=int, default=50)
    parser.add_argument("--noise", type=float, default=0.05)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--dtype", choices=["float32", "float16", "bfloat16"], default="float32")
    parser.add_argument("--corpus-file", type=Path, default=None)
    parser.add_argument("--query-file", type=Path, default=None)
    parser.add_argument("--target-file", type=Path, default=None)
    parser.add_argument("--sparse-corpus-file", type=Path, default=None)
    parser.add_argument("--sparse-query-file", type=Path, default=None)
    parser.add_argument("--output-json", type=Path, default=None)
    return parser.parse_args()


def dtype_from_name(name: str) -> torch.dtype:
    return {
        "float32": torch.float32,
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
    }[name]


def load_tensor(path: Path) -> torch.Tensor:
    value = torch.load(path, map_location="cpu")
    if isinstance(value, dict):
        for key in ("embeddings", "tensor", "data"):
            if key in value:
                value = value[key]
                break
    if not isinstance(value, torch.Tensor):
        raise TypeError(f"{path} must contain a tensor or a dict with tensor-like embeddings")
    return value


def load_sparse(path: Path) -> dict[str, torch.Tensor | int]:
    value = torch.load(path, map_location="cpu")
    if not isinstance(value, dict):
        raise TypeError(f"{path} must contain a dict with indices, values, and dim")
    required = {"indices", "values", "dim"}
    missing = required - set(value)
    if missing:
        raise KeyError(f"{path} is missing sparse keys: {sorted(missing)}")
    return {
        "indices": value["indices"].long(),
        "values": value["values"],
        "dim": int(value["dim"]),
    }


def make_synthetic(
    num_items: int,
    num_queries: int,
    dim: int,
    noise: float,
    seed: int,
    dtype: torch.dtype,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    generator = torch.Generator(device="cpu").manual_seed(seed)
    corpus = torch.randn(num_items, dim, generator=generator, dtype=torch.float32)
    corpus = F.normalize(corpus, dim=1).to(dtype)
    target_ids = torch.randint(num_items, (num_queries,), generator=generator)
    query = corpus[target_ids].float()
    query = query + noise * torch.randn(query.shape, generator=generator)
    query = F.normalize(query, dim=1).to(dtype)
    return corpus, query, target_ids


def maybe_sync(device: torch.device) -> None:
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def tensor_storage_mb(tensor: torch.Tensor) -> float:
    return tensor.numel() * tensor.element_size() / 1_000_000


def sparse_storage_mb(sparse: dict[str, torch.Tensor | int]) -> float:
    indices = sparse["indices"]
    values = sparse["values"]
    assert isinstance(indices, torch.Tensor)
    assert isinstance(values, torch.Tensor)
    return (indices.numel() * indices.element_size() + values.numel() * values.element_size()) / 1_000_000


@torch.no_grad()
def dense_topk(
    query: torch.Tensor,
    corpus: torch.Tensor,
    k: int,
    query_batch_size: int,
) -> torch.Tensor:
    all_indices = []
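    # Note: this transpose-and-copy runs inside every timed call, so its cost is part of the reported latency.
    corpus_t = corpus.t().contiguous()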
    for start in range(0, query.shape[0], query_batch_size):
        q = query[start : start + query_batch_size]
        scores = q @ corpus_t
        all_indices.append(torch.topk(scores, k=min(k, corpus.shape[0]), dim=1).indices)
    return torch.cat(all_indices, dim=0)


def time_call(fn: Any, warmup: int, repeats: int, device: torch.device) -> tuple[Any, float]:
    result = None
    for _ in range(warmup):
        result = fn()
    maybe_sync(device)
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn()
    maybe_sync(device)
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / repeats
    return result, elapsed_ms


def recall_vs_reference(found: torch.Tensor, reference: torch.Tensor) -> float:
    hits = 0
    for row_found, row_ref in zip(found.cpu(), reference.cpu(), strict=True):
        hits += len(set(row_found.tolist()) & set(row_ref.tolist()))
    return hits / max(found.shape[0] * reference.shape[1], 1)


def paired_recall(found: torch.Tensor, target_ids: torch.Tensor | None) -> float | None:
    if target_ids is None:
        return None
    target_ids = target_ids.cpu()
    hits = 0
    for row, target in zip(found.cpu(), target_ids, strict=True):
        hits += int(int(target) in set(row.tolist()))
    return hits / max(found.shape[0], 1)


def projected(
    corpus: torch.Tensor,
    query: torch.Tensor,
    out_dim: int,
    seed: int,
    dtype: torch.dtype,
    device: torch.device,
) -> tuple[torch.Tensor, torch.Tensor]:
    generator = torch.Generator(device="cpu").manual_seed(seed + out_dim)
    scale = 1.0 / math.sqrt(out_dim)
    projection = torch.randn(corpus.shape[1], out_dim, generator=generator) * scale
    projection = projection.to(device=device, dtype=dtype)
    return (
        F.normalize(corpus @ projection, dim=1),
        F.normalize(query @ projection, dim=1),
    )


def prefix(corpus: torch.Tensor, query: torch.Tensor, out_dim: int) -> tuple[torch.Tensor, torch.Tensor]:
    return F.normalize(corpus[:, :out_dim], dim=1), F.normalize(query[:, :out_dim], dim=1)


def to_topk_sparse(x: torch.Tensor, k: int) -> dict[str, torch.Tensor | int]:
    # Select positions by magnitude; only the indices are needed, the signed values are gathered below.
    _, indices = torch.topk(x.abs(), k=min(k, x.shape[1]), dim=1)
    signed_values = torch.gather(x, 1, indices)
    signed_values = F.normalize(signed_values, dim=1)
    return {"indices": indices.long(), "values": signed_values, "dim": x.shape[1]}


def build_postings(
    sparse: dict[str, torch.Tensor | int],
    num_items: int,
    dim: int,
    device: torch.device,
) -> dict[str, torch.Tensor]:
    indices = sparse["indices"]
    values = sparse["values"]
    assert isinstance(indices, torch.Tensor)
    assert isinstance(values, torch.Tensor)

    flat_dims = indices.reshape(-1).to(device=device)
    flat_values = values.reshape(-1).to(device=device)
    flat_items = (
        torch.arange(num_items, device=device)
        .repeat_interleave(indices.shape[1])
        .to(torch.long)
    )
    order = torch.argsort(flat_dims)
    flat_dims = flat_dims[order]
    flat_values = flat_values[order]
    flat_items = flat_items[order]
    counts = torch.bincount(flat_dims, minlength=dim)
    offsets = torch.zeros(dim + 1, dtype=torch.long, device=device)
    offsets[1:] = torch.cumsum(counts, dim=0)
    return {
        "items": flat_items,
        "values": flat_values,
        "offsets_cpu": offsets.cpu(),
    }


@torch.no_grad()
def sparse_topk_from_postings(
    query_sparse: dict[str, torch.Tensor | int],
    postings: dict[str, torch.Tensor],
    num_items: int,
    k: int,
) -> torch.Tensor:
    query_indices = query_sparse["indices"]
    query_values = query_sparse["values"]
    assert isinstance(query_indices, torch.Tensor)
    assert isinstance(query_values, torch.Tensor)

    device = query_values.device
    rows = []
    query_indices_cpu = query_indices.cpu()
    offsets_cpu = postings["offsets_cpu"]
    items = postings["items"]
    values = postings["values"]
    for row_indices, row_values in zip(query_indices_cpu, query_values, strict=True):
        scores = torch.zeros(num_items, dtype=query_values.dtype, device=device)
        for dim_id, query_value in zip(row_indices, row_values, strict=True):
            dim_int = int(dim_id)
            start = int(offsets_cpu[dim_int])
            end = int(offsets_cpu[dim_int + 1])
            if end > start:
                scores.index_add_(0, items[start:end], values[start:end] * query_value)
        rows.append(torch.topk(scores, k=min(k, num_items), dim=0).indices)
    return torch.stack(rows, dim=0)


def format_markdown(results: list[MethodResult]) -> str:
    headers = [
        "method",
        "dim",
        "active",
        "recall@k vs dense",
        "paired recall@k",
        "build ms",
        "latency ms",
        "latency x dense",
        "qps",
        "qps x dense",
        "storage MB",
        "storage x dense",
        "note",
    ]
    lines = ["| " + " | ".join(headers) + " |", "| " + " | ".join(["---"] * len(headers)) + " |"]
    for result in results:
        lines.append(
            "| "
            + " | ".join(
                [
                    result.method,
                    str(result.dim),
                    str(result.active_dim),
                    "n/a" if result.recall_at_k_vs_dense is None else f"{result.recall_at_k_vs_dense:.4f}",
                    "n/a" if result.paired_recall_at_k is None else f"{result.paired_recall_at_k:.4f}",
                    "n/a" if result.index_build_ms is None else f"{result.index_build_ms:.2f}",
                    f"{result.latency_ms:.2f}",
                    "n/a" if result.latency_speedup_vs_dense is None else f"{result.latency_speedup_vs_dense:.2f}x",
                    f"{result.qps:.1f}",
                    "n/a" if result.qps_vs_dense is None else f"{result.qps_vs_dense:.2f}x",
                    f"{result.storage_mb:.2f}",
                    "n/a" if result.storage_vs_dense is None else f"{result.storage_vs_dense:.4f}x",
                    result.note,
                ]
            )
            + " |"
        )
    return "\n".join(lines)


def add_relative_metrics(results: list[MethodResult]) -> None:
    baseline = next(result for result in results if result.method == "full_dense")
    for result in results:
        result.latency_speedup_vs_dense = baseline.latency_ms / result.latency_ms
        result.qps_vs_dense = result.qps / baseline.qps
        result.storage_vs_dense = result.storage_mb / baseline.storage_mb


def main() -> None:
    args = parse_args()
    device = torch.device(args.device)
    dtype = dtype_from_name(args.dtype)

    if args.corpus_file and args.query_file:
        corpus = load_tensor(args.corpus_file)
        query = load_tensor(args.query_file)
        target_ids = load_tensor(args.target_file).long() if args.target_file else None
    else:
        corpus, query, target_ids = make_synthetic(
            args.num_items,
            args.num_queries,
            args.dim,
            args.noise,
            args.seed,
            dtype,
        )

    corpus = F.normalize(corpus.to(device=device, dtype=dtype), dim=1)
    query = F.normalize(query.to(device=device, dtype=dtype), dim=1)
    if target_ids is not None:
        target_ids = target_ids.cpu()

    if corpus.shape[1] != query.shape[1]:
        raise ValueError(f"corpus dim {corpus.shape[1]} != query dim {query.shape[1]}")

    results: list[MethodResult] = []
    full_indices, full_latency_ms = time_call(
        lambda: dense_topk(query, corpus, args.top_k, args.query_batch_size),
        args.warmup,
        args.repeats,
        device,
    )
    results.append(
        MethodResult(
            method="full_dense",
            dim=corpus.shape[1],
            active_dim=corpus.shape[1],
            recall_at_k_vs_dense=1.0,
            paired_recall_at_k=paired_recall(full_indices, target_ids),
            index_build_ms=None,
            latency_ms=full_latency_ms,
            qps=query.shape[0] / (full_latency_ms / 1000.0),
            storage_mb=tensor_storage_mb(corpus),
            note="exact dense baseline",
        )
    )

    for reduced_dim in args.reduced_dims:
        if reduced_dim > corpus.shape[1]:
            continue
        p_corpus, p_query = projected(corpus, query, reduced_dim, args.seed, dtype, device)
        p_indices, p_latency_ms = time_call(
            lambda: dense_topk(p_query, p_corpus, args.top_k, args.query_batch_size),
            args.warmup,
            args.repeats,
            device,
        )
        results.append(
            MethodResult(
                method=f"projection_{reduced_dim}",
                dim=reduced_dim,
                active_dim=reduced_dim,
                recall_at_k_vs_dense=recall_vs_reference(p_indices, full_indices),
                paired_recall_at_k=paired_recall(p_indices, target_ids),
                index_build_ms=None,
                latency_ms=p_latency_ms,
                qps=query.shape[0] / (p_latency_ms / 1000.0),
                storage_mb=tensor_storage_mb(p_corpus),
                note="random projection; replace with trained projection for model quality",
            )
        )

        m_corpus, m_query = prefix(corpus, query, reduced_dim)
        m_indices, m_latency_ms = time_call(
            lambda: dense_topk(m_query, m_corpus, args.top_k, args.query_batch_size),
            args.warmup,
            args.repeats,
            device,
        )
        results.append(
            MethodResult(
                method=f"prefix_mrl_proxy_{reduced_dim}",
                dim=reduced_dim,
                active_dim=reduced_dim,
                recall_at_k_vs_dense=recall_vs_reference(m_indices, full_indices),
                paired_recall_at_k=paired_recall(m_indices, target_ids),
                index_build_ms=None,
                latency_ms=m_latency_ms,
                qps=query.shape[0] / (m_latency_ms / 1000.0),
                storage_mb=tensor_storage_mb(m_corpus),
                note="prefix mechanics only; real MRL requires trained nested embeddings",
            )
        )

    for sparse_topk in args.sparse_topks:
        if args.sparse_corpus_file and args.sparse_query_file:
            sparse_corpus = load_sparse(args.sparse_corpus_file)
            sparse_query = load_sparse(args.sparse_query_file)
        else:
            sparse_corpus = to_topk_sparse(corpus, sparse_topk)
            sparse_query = to_topk_sparse(query, sparse_topk)

        sparse_corpus = {
            "indices": sparse_corpus["indices"].to(device=device),
            "values": sparse_corpus["values"].to(device=device, dtype=dtype),
            "dim": int(sparse_corpus["dim"]),
        }
        sparse_query = {
            "indices": sparse_query["indices"].to(device=device),
            "values": sparse_query["values"].to(device=device, dtype=dtype),
            "dim": int(sparse_query["dim"]),
        }

        maybe_sync(device)
        start = time.perf_counter()
        postings = build_postings(sparse_corpus, corpus.shape[0], int(sparse_corpus["dim"]), device)
        maybe_sync(device)
        build_ms = (time.perf_counter() - start) * 1000.0
        sparse_indices, sparse_latency_ms = time_call(
            lambda: sparse_topk_from_postings(sparse_query, postings, corpus.shape[0], args.top_k),
            args.warmup,
            args.repeats,
            device,
        )
        results.append(
            MethodResult(
                method=f"sparse_topk_{sparse_topk}",
                dim=int(sparse_corpus["dim"]),
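                # Report the actual number of stored positions, which also holds for sparse latents loaded from file.
                active_dim=int(sparse_corpus["indices"].shape[1]),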
                recall_at_k_vs_dense=recall_vs_reference(sparse_indices, full_indices),
                paired_recall_at_k=paired_recall(sparse_indices, target_ids),
                index_build_ms=build_ms,
                latency_ms=sparse_latency_ms,
                qps=query.shape[0] / (sparse_latency_ms / 1000.0),
                storage_mb=sparse_storage_mb(sparse_corpus),
                note="TopK sparse retrieval; use real CSR latents for CSR model quality",
            )
        )

    add_relative_metrics(results)
    print(format_markdown(results))
    if args.output_json:
        args.output_json.write_text(
            json.dumps([asdict(result) for result in results], indent=2),
            encoding="utf-8",
        )


if __name__ == "__main__":
    main()

The output has two groups of metrics. recall@k vs dense measures how much the method’s top-k overlaps with the full dense top-k. paired recall@k checks whether the target item appears in the top-k, using either synthetic or real target ids. The performance columns report latency, QPS, storage MB, and sparse posting-index build time.

Model quality is a separate question and is better read from the paper experiments. The table below comes from the CSRv2 paper’s e5-Mistral-7B comparison: the same backbone and training configuration are used to compare MRL, CSR, and CSRv2 across six MTEB task types. The table keeps only the average to show the trend across active dimensions.

active dim / k	MRL avg	CSR avg	CSRv2-linear avg	CSRv2 avg	Note
64	61.86	66.68	67.58	68.08	At the same active count, CSR/CSRv2 have higher average scores
16	51.93	62.83	64.26	65.76	Sparse latents retain higher average scores at low active counts
4	40.83	52.94	58.62	61.01	CSRv2’s training changes matter more in the ultra-sparse regime
2	33.81	44.33	53.35	58.38	The very low-active-count regime emphasized by the CSRv2 paper

This table cannot replace quality evaluation on your own workload. It does show that the quality of trained CSR/CSRv2 sparse latents cannot be inferred from the recall@k vs dense numbers in the synthetic benchmark below. The numbers below are mainly a mechanics and performance check: dense baseline, random projection, prefix truncation, and TopK sparse retrieval measured on the same machine for latency, QPS, and storage.

All relative columns use full_dense as the baseline. latency x dense above 1.0x means faster than full dense; below 1.0x means slower. A smaller storage x dense means lower storage use. The full script is included in the collapsed code block above; add_relative_metrics computes these relative metrics.

method	dim	active	recall@k vs dense	paired recall@k	build ms	latency ms	latency x dense	qps	qps x dense	storage MB	storage x dense	note
full_dense	2048	2048	1.0000	1.0000	n/a	71.44	1.00x	7167.0	1.00x	8192.00	1.0000x	exact dense baseline
projection_32	32	32	0.0014	0.0137	n/a	8.89	8.04x	57577.7	8.03x	128.00	0.0156x	random projection; replace with trained projection for model quality
prefix_mrl_proxy_32	32	32	0.0021	0.0195	n/a	8.89	8.04x	57621.4	8.04x	128.00	0.0156x	prefix mechanics only; real MRL requires trained nested embeddings
projection_64	64	64	0.0193	0.1914	n/a	10.79	6.62x	47431.4	6.62x	256.00	0.0313x	random projection; replace with trained projection for model quality
prefix_mrl_proxy_64	64	64	0.0139	0.1367	n/a	10.82	6.60x	47308.5	6.60x	256.00	0.0313x	prefix mechanics only; real MRL requires trained nested embeddings
projection_128	128	128	0.0631	0.6211	n/a	13.78	5.18x	37156.0	5.18x	512.00	0.0625x	random projection; replace with trained projection for model quality
prefix_mrl_proxy_128	128	128	0.0678	0.6738	n/a	13.74	5.20x	37264.3	5.20x	512.00	0.0625x	prefix mechanics only; real MRL requires trained nested embeddings
projection_256	256	256	0.0994	0.9805	n/a	19.62	3.64x	26092.6	3.64x	1024.00	0.1250x	random projection; replace with trained projection for model quality
prefix_mrl_proxy_256	256	256	0.1000	0.9922	n/a	19.71	3.62x	25971.5	3.62x	1024.00	0.1250x	prefix mechanics only; real MRL requires trained nested embeddings
sparse_topk_4	2048	4	0.0002	0.0000	61.53	128.40	0.56x	3987.6	0.56x	48.00	0.0059x	TopK sparse retrieval; use real CSR latents for CSR model quality
sparse_topk_8	2048	8	0.0014	0.0117	2.26	143.58	0.50x	3565.9	0.50x	96.00	0.0117x	TopK sparse retrieval; use real CSR latents for CSR model quality
sparse_topk_16	2048	16	0.0043	0.0430	4.12	203.11	0.35x	2520.8	0.35x	192.00	0.0234x	TopK sparse retrieval; use real CSR latents for CSR model quality
sparse_topk_32	2048	32	0.0150	0.1504	7.69	361.48	0.20x	1416.4	0.20x	384.00	0.0469x	TopK sparse retrieval; use real CSR latents for CSR model quality
sparse_topk_64	2048	64	0.0523	0.5234	14.90	676.08	0.11x	757.3	0.11x	768.00	0.0938x	TopK sparse retrieval; use real CSR latents for CSR model quality

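These numbers are better read as an engineering sanity check, not a model-quality conclusion. The recall of random projection and the prefix proxy shows that untrained low-dimensional representations do not represent the real performance of trained projection or MRL. Note also that recall@k vs dense is effectively capped near 0.1 here: each synthetic query has one planted neighbor, and the other nine entries of the dense top-10 are chance neighbors among random vectors, so the column tracks roughly one tenth of paired recall@k (0.0994 vs 0.9805 for projection_256). The sparse TopK rows are slower than full dense, which shows that this sparse retrieval path, a per-query Python loop that scatters into a full N-length score vector, has not realized the theoretical advantage. The 61.53 ms build time for sparse_topk_4 is probably dominated by one-time warm-up, since the index build is timed once without warm-up runs. A real CSR/CSRv2 evaluation needs trained sparse latents and should compare quality and latency together.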

Method Selection

Projection is a good baseline. It is simple to implement and simple to deploy. The tradeoff is that a small dense vector becomes the information bottleneck. If the task has enough redundancy, or if the teacher embedding has more dimensions than the task really needs, projection may be sufficient.

MRL is a good fit when one model needs to serve multiple budget tiers. A single embedding can be truncated to 32/64/128/256 dimensions, and the downstream interface remains dense search. Its training constraint is stronger: the model must put useful information early. At very short lengths, prefix capacity is a hard limit.

CSR is a fit when the retrieval path can change and the system can exploit sparsity. It shifts “low cost” from low dimensionality to low activation count: the latent space can be large, but each sample uses only a few active features in computation. CSRv2 improves training stability and makes lower active counts more plausible, but it still depends heavily on the sparse retrieval implementation. If the sparse index, kernels, and serving path do not turn sparsity into real latency or memory gains, it is not a direct replacement for dense embeddings.

A conservative evaluation order is: first measure the full dense upper bound, then trained projection and MRL prefixes, and finally CSR sparse latents. At every step, measure quality and performance together. Low-dimensional representation is not the goal by itself. The goal is acceptable retrieval quality under a latency, memory, and cost budget.
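The final selection step can be written as a small filter over measured results. The sketch below is only an illustration of the mechanics: the Candidate fields and pick_cheapest are hypothetical names, the thresholds are arbitrary, and the rows reuse paired recall@k, latency, and storage from the synthetic table above, so the outcome says nothing about real model quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float      # e.g. paired recall@k on your own workload
    latency_ms: float
    storage_mb: float


def pick_cheapest(candidates, min_quality: float, max_latency_ms: float):
    """Smallest-storage candidate that meets the quality bar and latency budget."""
    ok = [
        c for c in candidates
        if c.quality >= min_quality and c.latency_ms <= max_latency_ms
    ]
    return min(ok, key=lambda c: (c.storage_mb, c.latency_ms), default=None)


rows = [
    Candidate("full_dense", 1.0000, 71.44, 8192.00),
    Candidate("projection_128", 0.6211, 13.78, 512.00),
    Candidate("prefix_mrl_proxy_256", 0.9922, 19.71, 1024.00),
    Candidate("sparse_topk_64", 0.5234, 676.08, 768.00),
]

choice = pick_cheapest(rows, min_quality=0.95, max_latency_ms=50.0)
print(choice.name)  # prefix_mrl_proxy_256
```

Filtering on quality first and then minimizing cost keeps the quality bar fixed, which matches the goal stated above.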

References

Kusupati et al., Matryoshka Representation Learning, NeurIPS 2022.
Wen et al., Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation, 2025.
CSRv2 authors, CSRv2: Unlocking Ultra-Sparse Embeddings, 2026.
Muennighoff et al., MTEB: Massive Text Embedding Benchmark, 2022.
Hugging Face model cards: all-mpnet-base-v2, bge-large-en-v1.5, e5-mistral-7b-instruct, NV-Embed-v2, Qwen3-Embedding-8B.