Think Before You Embed | ForeverYoung

From 2018 to 2022, the embedding field had a recognizable shape. BERT gave you bidirectional attention and a [CLS] token to pool. SBERT showed you could make sentence vectors with siamese training. DPR applied the recipe to dense retrieval. E5, GTE, and BGE each pushed MTEB higher while keeping the same basic setup: InfoNCE contrastive loss, (query, positive, hard-negative) triplets, encoder-only architecture.

Then two things happened in parallel. Decoder-only LLMs took over the embedding leaderboard, with scale and instruction-following doing what BERT-sized models couldn’t. And practitioners building search systems started routing every query through a separate rewriting LLM before it touched the embedding model, because raw user queries are too short and vague to retrieve well on their own.

Two ICLR 2026 papers, GRACE and Think Then Embed (TTE), connect these threads. What if the rewriting step moved inside the embedding model itself, with the same gradient running through both?

The BERT-era baseline

The standard setup is a bi-encoder: one model handles both queries and documents, with shared weights, run as two independent forward passes. You encode the query, encode the document, and compute similarity between the two vectors. Document vectors can be pre-computed and indexed offline — at query time you just run the query encoder and do a nearest-neighbor lookup, fast enough to search millions of documents in milliseconds. A cross-encoder (which processes query and document together) gives better rankings but can’t be pre-computed, so it’s only used for re-ranking a small candidate set, not first-stage retrieval.
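The offline/online split described above can be sketched in a few lines. This is a minimal illustration, not a real system: `toy_encode` is a hypothetical hashed bag-of-words stand-in for the shared-weight encoder, and the exhaustive dot-product scan stands in for the approximate nearest-neighbor index a production system would use.

```python
import hashlib
import math

DIM = 256  # toy embedding size, chosen arbitrarily for this sketch


def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in for the encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Offline: every document is encoded once and stored.
documents = [
    "dense retrieval with contrastive training",
    "recipe for sourdough bread",
    "hiking trails near the lake",
]
doc_index = [toy_encode(d) for d in documents]


def search(query: str, k: int = 1) -> list[str]:
    # Online: one encoder pass for the query, then a similarity lookup.
    q = toy_encode(query)
    ranked = sorted(
        range(len(documents)),
        key=lambda i: dot(q, doc_index[i]),
        reverse=True,
    )
    return [documents[i] for i in ranked[:k]]


print(search("contrastive training retrieval"))
```

The same function encodes both sides, which is the "shared weights, two independent forward passes" property. A cross-encoder would instead need a model call per (query, document) pair at query time.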

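Training uses InfoNCE contrastive loss: compute a similarity matrix over a batch, then apply softmax cross-entropy to each row so that the diagonal entry (the matched query-document pair) scores higher than every other entry in that row. The off-diagonal entries act as in-batch negatives.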

Bi-encoder contrastive training with a single shared-weight encoder. Queries and documents are encoded in separate forward passes, producing independent vectors. The similarity matrix drives InfoNCE loss: diagonal entries are positives, everything else is negatives.
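As a rough sketch of the loss itself: InfoNCE over a batch is cross-entropy on each row of the similarity matrix, with the diagonal as the target class. The temperature value and the two example matrices below are assumptions chosen for illustration, and only the query-to-document direction is shown.

```python
import math


def info_nce(sim: list[list[float]], temperature: float = 0.05) -> float:
    """Mean cross-entropy over rows; the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)


# Rows are queries, columns are documents, entries are cosine similarities.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.9, 0.1]]

print(info_nce(aligned), info_nce(shuffled))
```

The loss is low when each query is closest to its own document and high when some other document in the batch scores better, which is why larger batches and hard negatives make the task more demanding.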

The recipe held up for years. The costs only show up once the base model is a full LLM.

Contrastive fine-tuning degrades general reasoning and generation. The gradient targets only the pooled vector, so everything else the model knows how to do gets pushed aside. For BERT this barely registered. For a 7B model trained to reason and generate, it’s a real tradeoff.

There’s also no room to think before encoding. One forward pass, raw tokens to pooled output. BERT was never going to reason through a query anyway. A 7B LLM could, and the standard recipe never lets it.

Decoder-only takes the leaderboard

E5-mistral-7B (2023) applied the exact same contrastive recipe to a 7B decoder-only Mistral model, swapping CLS pooling for last-token pooling. That’s the only architectural change. It hit the top of MTEB.

Encoder-only (left) pools the [CLS] token under full bidirectional attention. Decoder-only (right) pools the last [EOS] token; via causal masking, it has attended to every preceding token and aggregates full-sequence context.
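The two pooling strategies differ only in which position they read. A minimal sketch, assuming right-padded sequences and per-token hidden states stored as plain lists (both are assumptions of this example, not details from any particular model):

```python
def cls_pool(hidden: list[list[float]], mask: list[int]) -> list[float]:
    # Encoder-style: the first position holds the [CLS] token.
    return hidden[0]


def last_token_pool(hidden: list[list[float]], mask: list[int]) -> list[float]:
    # Decoder-style: the last non-padding position (the [EOS] token).
    # With right padding it has to be located through the attention mask.
    last = max(i for i, m in enumerate(mask) if m == 1)
    return hidden[last]


# Four positions, the final one is padding.
hidden_states = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pool(hidden_states, attention_mask))         # first position
print(last_token_pool(hidden_states, attention_mask))  # last real token
```

With left padding the last real token is always at index -1, which is one reason decoder-based embedders are often run left-padded.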

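The leaderboard since 2022 tells the story in a table. The scores come from different MTEB variants (English, multilingual, multilingual v2), so read them as a trend, not as a head-to-head comparison: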

Year	Model	Architecture	MTEB Score
2022–23	E5-large, GTE-large, BGE-large	Encoder-only (330M)	~62–64
2023	E5-mistral-7B	Decoder-only, 7B	~66
2024	NV-Embed-v2	Decoder-only, 7B	72.31 (EN)
2025	Qwen3-Embedding-8B	Decoder-only, 8B	70.58 (multilingual)
2026	Harrier-OSS-v1-27B	Decoder-only, 27B	74.3 (multilingual v2)

Harrier is worth a note. Microsoft quietly dropped it on Hugging Face in March 2026 under MIT license, no announcement. It’s built on Gemma3 with last-token pooling and L2 normalization, supports 100+ languages and 32k context. The 27B variant now holds #1 on Multilingual MTEB v2. Even the 270M version beats most encoder-only baselines.

Decoder-only models won the leaderboard for a mix of reasons. Scale is the obvious one — 7B parameters is just bigger than 330M. But scale alone doesn’t explain it. Instruction following matters too: prefix the query with a task description and a large LLM adjusts its representation in ways BERT-sized models can’t. And last-token pooling works better than you’d expect: the final [EOS] token has attended to everything before it, so it ends up as a reasonable summary of the full sequence.
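Instruction prefixing is plain string formatting on the query side. The template below follows the "Instruct / Query" pattern documented for E5-mistral-7B; treat the exact wording as an assumption and check the model card of whichever embedder you use, since each model expects its own format. The function names are hypothetical.

```python
def format_query(task: str, query: str) -> str:
    # Assumed template; other models expect different prefixes.
    return f"Instruct: {task}\nQuery: {query}"


def format_document(document: str) -> str:
    # Documents are typically embedded without an instruction,
    # so the index does not depend on the task wording.
    return document


prompt = format_query(
    "Given a web search query, retrieve relevant passages",
    "best practices for retrieval",
)
print(prompt)
```

Because only the query carries the instruction, the same pre-computed document index can serve several tasks by changing the prefix at query time.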

All of these models still train with contrastive loss. The decoder is treated as a bigger, better BERT, with the generative capability present but never touched during training. Two ICLR 2026 papers ask what happens when you actually use it.

Elaboration before the vector

Production search systems rarely embed a raw user query directly. “best practices for retrieval” — short, vague, underspecified — is a poor retrieval target. The standard fix is a separate LLM that elaborates the query before it reaches the embedding model: rewrite it into paraphrases, generate a hypothetical answer that a matching document might contain, or decompose it into subquestions. An embedding model then vectorizes those elaborated forms.
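A minimal sketch of that two-model pipeline, with both models replaced by hypothetical stand-ins: `rewrite_query` returns canned elaborations where a real system would call an LLM, and `toy_embed` is a hashed bag-of-words encoder. Averaging the elaborated vectors is one common way to combine them and is an assumption here, not the only option.

```python
import hashlib
import math

DIM = 128  # arbitrary toy dimension


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedder: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def rewrite_query(query: str) -> list[str]:
    """Hypothetical rewriter: a real one would be a separate LLM call."""
    return [
        query,                                                # original
        f"recommended techniques and guidelines for {query}",  # paraphrase
        f"This guide covers {query} including indexing and evaluation",  # hypothetical answer
    ]


def embed_elaborated(query: str) -> list[float]:
    vectors = [toy_embed(text) for text in rewrite_query(query)]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]


query_vector = embed_elaborated("best practices for retrieval")
print(len(query_vector))
```

Nothing in `rewrite_query` sees the output of `toy_embed`, and nothing in `toy_embed` can change what the rewriter produces. The two functions only meet through a string.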

The pattern works. The problem is the seam. The rewriter has no signal about what makes a good embedding; the embedder has no influence over how queries get elaborated. Each is optimized independently, against its own objective, and the joint retrieval quality depends on how well the two happen to align.

Production query rewriting (left) splits the work across two models with no shared gradient: the rewriter doesn't know what makes a good embedding, and the embedder can't improve the rewriting. GRACE and TTE (right) collapse this into a single model: elaboration and embedding share weights, so the elaboration learns to optimize for retrieval quality.

GRACE and TTE close that seam. GRACE trains the LLM to generate rationales whose embeddings rank well against target documents — elaboration quality is rewarded directly by retrieval quality via RL. TTE makes the elaboration explicit at inference: the Reasoner generates a chain-of-thought trace, and the Embedder reads that trace before producing the vector. In both cases the same model handles both steps, with a gradient signal connecting them.

That’s what separates these approaches from production query rewriting: when elaboration and embedding share weights, the elaboration learns to optimize for retrieval quality, not just paraphrase fluency.

GRACE: contrastive signals as rewards

Standard CL (left) feeds the contrastive signal directly as a loss on encoder weights, bypassing the generative pathway entirely. GRACE (right) treats the same signal as a reward: the LLM generates a natural-language rationale, pools it into a vector, and policy gradient teaches the model to write better rationales.

GRACE changes one thing: the contrastive signal goes from a loss to a reward.

In standard CL, InfoNCE updates the encoder directly — the model produces a vector and learns to push it toward the positive document’s vector. The generative pathway is never touched. In GRACE, the LLM generates a natural-language rationale first: a brief explanation of what the query means and how it relates to matching documents. That rationale gets mean-pooled into an embedding. The reward is the cosine similarity between that embedding and the target document’s vector. Policy gradient (via the verl framework) then updates the model to write rationales that produce better embeddings.

The model never directly optimizes for a particular vector. It learns to write rationales that produce good vectors. Unlike standard CL, the generative pathway is actually used.
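The reward side of this can be sketched without any model at all. The code below is an illustration under stated assumptions, not GRACE's implementation: the hidden states are made-up numbers, the reward is the plain cosine similarity described above, and the group-mean baseline is a common variance-reduction choice that is assumed here, not taken from the paper.

```python
import math


def mean_pool(token_states: list[list[float]]) -> list[float]:
    """Average the hidden states of the generated rationale tokens."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]


def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def rationale_rewards(rationales, doc_vec):
    """One scalar reward per sampled rationale."""
    return [cosine(mean_pool(states), doc_vec) for states in rationales]


def advantages(rewards: list[float]) -> list[float]:
    # Assumed baseline: subtract the group mean so better-than-average
    # rationales get a positive weight in the policy-gradient update.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Two sampled rationales for the same query (toy 2-d hidden states).
sampled = [
    [[1.0, 0.0], [0.8, 0.2]],  # pools to a vector near the document
    [[0.0, 1.0], [0.1, 0.9]],  # pools to a vector far from it
]
target_doc = [1.0, 0.0]

rewards = rationale_rewards(sampled, target_doc)
print(rewards, advantages(rewards))
```

A policy-gradient step would then scale the log-probability of each rationale's tokens by its advantage, so the update flows through generation instead of straight into the pooled vector.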

On MTEB: +11.5% over standard CL in the supervised setting averaged across four backbones; +6.9% unsupervised. Those are solid numbers, but the capability retention result is what I find more interesting. Standard CL fine-tuning measurably hurts reasoning and generation benchmarks. GRACE-trained models mostly don’t show that degradation — because they’re exercising the generative pathway throughout training rather than bypassing it.

As a side effect, the rationales are human-readable. Each embedding is the pooled encoding of the model’s own interpretation of the input, so you can read what it “thought” before producing the vector.

Think Then Embed: reasoning before the vector

GRACE is a training-time change. Think Then Embed is an inference-time change. Same underlying idea, different lever.

The Reasoner generates a chain-of-thought trace explaining the query's semantics. The Embedder then produces a vector conditioned on both the original query and the trace. The reasoning context becomes part of the representation.

The paper targets multimodal retrieval, but the problem shows up anywhere queries need compositional reasoning. Current MLLM-based embedders are single-pass: raw input in, vector out, no opportunity to unpack the query’s structure first. When a query says “find images of someone standing near water at sunset with other people visible in the background,” all of that has to be resolved in one forward pass. There’s nowhere for the model to work through what it’s actually looking for.

TTE splits the step. A Reasoner — an MLLM fine-tuned to produce retrieval-useful reasoning traces — first generates a chain-of-thought description: what the query is asking, which features matter, how to recognize a match. An Embedder then produces the vector conditioned on both the original input and that trace.
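The control flow of the two-stage setup can be sketched as follows. Everything here is a placeholder: `stub_reasoner` returns a canned trace where the real Reasoner is a fine-tuned MLLM, `stub_embedder` is a trivial function standing in for the Embedder, and the prompt template in `build_embedder_input` is an assumption of this example.

```python
from typing import Callable


def build_embedder_input(query: str, trace: str) -> str:
    # Assumed template: the Embedder sees the original query and the trace.
    return f"Query: {query}\nReasoning: {trace}"


def think_then_embed(
    query: str,
    reasoner_fn: Callable[[str], str],
    embed_fn: Callable[[str], list[float]],
) -> list[float]:
    trace = reasoner_fn(query)  # stage 1: generate the reasoning trace
    return embed_fn(build_embedder_input(query, trace))  # stage 2: embed


def stub_reasoner(query: str) -> str:
    """Hypothetical Reasoner: a real one generates this text."""
    return "Key features: person near water, sunset lighting, other people behind."


def stub_embedder(text: str) -> list[float]:
    """Hypothetical Embedder: returns a 1-d 'vector' so the flow is visible."""
    return [float(len(text))]


vector = think_then_embed(
    "someone standing near water at sunset with people in the background",
    stub_reasoner,
    stub_embedder,
)
print(vector)
```

Passing the two stages in as functions keeps the sketch agnostic about whether they are separate models or one merged model called twice.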

The Reasoner is trained specifically for retrieval, not general summarization. It learns to surface the features that matter for matching. On MMEB-V2, TTE leads the benchmark — 7% above recent baselines, including proprietary models trained on larger in-house datasets. The paper also looks at merging Reasoner and Embedder into a single model to cut the inference cost.

The common thread

Both papers are running the same play from different ends. GRACE teaches the model to elaborate queries during training: policy gradient shapes how the model writes rationales, optimizing them toward what actually retrieves well. TTE makes the elaboration step explicit at inference — the reasoning trace is the elaboration, readable and separable from the encoding.

The production pipeline they’re both competing with uses two separate models with no shared optimization. In both cases, joint optimization wins. That fits a familiar pattern in deep learning: training a system end to end tends to beat stitching together stages that were each optimized on their own.