Beyond Basic RAG: How HyDE and Cohere Reranking Changed Everything About My Pipeline's Output Quality
Standard RAG pipelines often look like they're working — until you stress-test them with real, imprecise queries. HyDE and Cohere reranking address the two core retrieval weaknesses that cause pipelines to underperform in production, and together they make a measurable difference to output quality.
There's a particular kind of frustration that comes with a RAG pipeline that almost works.
The retrieval is running. The embeddings are indexed. The chunks are a reasonable size. You ask a question, the system finds something relevant, and the LLM generates a coherent response. On the surface, it looks like it's doing the job.
But then you start stress-testing it. You push the query language a little further from the source material. You ask something slightly abstract. You ask the kind of question a real user would ask — imprecise, conversational, not optimised for semantic similarity with a well-structured knowledge base.
And you start to see the cracks.
The retrieved chunks drift. The answer is plausible but built on the wrong foundations. The system is retrieving what it can find rather than what it should find. Good enough for a demo. Not good enough for production.
This is the problem that HyDE and Cohere reranking solve. And once you've used both together, it's genuinely difficult to go back.
First, the Core Problem With Standard RAG Retrieval
To understand why these techniques matter, it helps to be precise about what standard RAG retrieval is actually doing — and where it breaks down.
In a typical pipeline, the user's query is embedded into a vector, and that vector is compared against your indexed document chunks to find the closest semantic matches. Closest, in this context, means the chunks whose embeddings sit nearest to the query embedding in high-dimensional space.
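To make "closest" concrete, here's what that comparison looks like as cosine similarity on toy vectors. Real embeddings have hundreds or thousands of dimensions, but the arithmetic is the same:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    # 1.0 means the vectors point in the same direction; 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only.
query_vec = [0.9, 0.1, 0.2]
chunk_a = [0.85, 0.15, 0.25]  # similar direction -> high similarity
chunk_b = [0.1, 0.9, 0.3]     # different direction -> low similarity

print(cosine_similarity(query_vec, chunk_a))
print(cosine_similarity(query_vec, chunk_b))
```

Retrieval simply returns the chunks whose vectors score highest against the query vector under this measure.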
The problem is that query language and document language are often quite different — even when they're talking about the same thing.
A user might ask: "Why does my laptop keep overheating when I'm on a video call?"
The knowledge base article that answers that question might be titled: "Managing CPU thermal load under sustained GPU-accelerated workloads."
Both are about the same problem. But their embeddings may sit surprisingly far apart in vector space, because the surface language — the actual words — is so different. Standard cosine similarity retrieval may not surface that article at all. Or it surfaces it fifth, behind four chunks that match the query language better but answer the question less well.
This is the gap that HyDE is designed to close.
HyDE: Searching With the Answer You Haven't Got Yet
HyDE — Hypothetical Document Embeddings — is an elegantly counterintuitive idea.
Instead of embedding the user's query directly and searching for similar chunks, you first ask an LLM to generate a hypothetical answer to the query. A short, plausible response written in the kind of language a knowledge base document would use. You then embed that hypothetical answer and use it as your search vector.
The hypothesis doesn't need to be correct. It doesn't even need to be particularly good. Its only job is to exist in a part of vector space that's closer to your actual documents than the raw query is.
Think about what this does to the overheating example. The raw query — "Why does my laptop keep overheating when I'm on a video call?" — generates an embedding rooted in conversational, problem-description language. But the hypothetical answer generated by the LLM might read something like: "Sustained video conferencing places continuous load on both CPU and GPU, which can cause thermal throttling if cooling is insufficient." That language sits much closer to the knowledge base article. The retrieval improves — not because the index changed, but because the search vector is now a better representation of what you're actually looking for.
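The mechanics are simple enough to sketch in a few lines. Note that `llm_complete` and `embed` below are hypothetical stand-ins for whatever LLM and embedding provider your pipeline already uses; the point is the shape of the flow, not the provider:

```python
# Minimal HyDE sketch. llm_complete and embed are placeholder stand-ins
# for your actual LLM and embedding provider calls.

def llm_complete(prompt: str) -> str:
    # Placeholder: a real pipeline calls its LLM here.
    return ("Sustained video conferencing places continuous load on both "
            "CPU and GPU, which can cause thermal throttling if cooling "
            "is insufficient.")

def embed(text: str) -> list[float]:
    # Placeholder: a real pipeline calls its embedding model here.
    return [float(len(text) % 7), 1.0, 0.5]

def hyde_search_vector(query: str) -> list[float]:
    # Ask the LLM for a short, plausible answer written in
    # knowledge-base language, then embed THAT instead of the raw query.
    prompt = (
        "Write a short, plausible passage that answers the question below, "
        "in the style of a technical knowledge base article.\n\n"
        f"Question: {query}"
    )
    hypothetical_answer = llm_complete(prompt)
    return embed(hypothetical_answer)

vec = hyde_search_vector("Why does my laptop keep overheating on video calls?")
```

The rest of the retrieval is unchanged: `vec` goes into your vector search exactly where the query embedding used to go.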
In practice, HyDE tends to make the biggest difference on abstract, conversational, or vague queries — exactly the kind that real users generate. For precise, technical queries that already use the right terminology, the improvement is more marginal. But the queries that most need help are usually the imprecise ones, so the overall impact on pipeline quality is significant.
The implementation overhead is real — you're adding an LLM call before every retrieval, which has latency and cost implications worth factoring in. But for pipelines where output quality matters more than raw throughput, the trade-off is usually worth it.
Cohere Reranking: Sorting What You've Already Found
HyDE improves what gets retrieved. Cohere reranking improves how those retrieved chunks are prioritised before they reach the LLM.
Here's the distinction that matters: vector similarity search is fast, but it's a blunt instrument. It finds chunks that are semantically similar to your query. It doesn't deeply evaluate which of those chunks actually answers the query most usefully.
Cohere's reranker is a cross-encoder model. Rather than comparing a query embedding to a document embedding independently, it evaluates the query and each candidate chunk together — as a pair — and scores the relevance of that specific combination. This is computationally heavier than vector similarity, which is why you don't run it across your entire index. You run it as a second-pass filter across the top N candidates that your initial retrieval already surfaced.
The practical effect is that it re-orders your retrieved chunks by actual relevance to the specific question asked, not by geometric proximity in embedding space. Chunks that looked relevant from a distance but don't genuinely answer the question get pushed down. Chunks that were slightly lower in the initial ranking but are genuinely more useful get elevated.
It's the difference between a retrieval system that finds documents about the right topic and one that finds documents that answer the right question. Those aren't the same thing, and the gap between them is where a lot of RAG pipelines quietly underperform.
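The second-pass mechanics are easy to sketch. Below, `score_pair` is a deliberately crude stand-in — a word-overlap score — so the sort-and-truncate structure is visible without an API key; a real cross-encoder reads each (query, chunk) pair jointly and models far more than lexical overlap. With Cohere's Python SDK, the scoring step becomes (at the time of writing) roughly a single `co.rerank(model=..., query=..., documents=..., top_n=...)` call:

```python
# Sketch of second-pass reranking. score_pair is a toy stand-in for a
# cross-encoder such as Cohere's hosted reranker.

def score_pair(query: str, chunk: str) -> float:
    # Toy relevance score: fraction of query words present in the chunk.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Score every (query, candidate) pair, sort by score, keep the top K.
    scored = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return scored[:top_k]

candidates = [
    "Managing laptop thermal load during sustained video calls",
    "How to change your laptop wallpaper",
    "Resetting a forgotten password",
]
top = rerank("laptop overheating during video calls", candidates, top_k=2)
```

The structural point carries over to the real thing: the reranker never touches the full index, only the small candidate pool the first pass already surfaced.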
Running Them Together
HyDE and Cohere reranking are complementary, not competing. They address different parts of the same problem.
HyDE works at the front of the retrieval process — shaping the search vector so that the initial pool of retrieved candidates is better matched to what the user actually needs. Cohere works at the back — taking that pool and sorting it so that the most genuinely relevant content reaches the LLM first and most prominently.
A rough pipeline with both looks like this:
- Query arrives from the user
- HyDE generates a hypothetical answer using a lightweight LLM call
- Hypothetical answer is embedded and used as the search vector
- Vector search retrieves top N candidates from the index
- Cohere reranker scores each candidate against the original query
- Top K reranked results are passed to the LLM as context
- LLM generates the final response
Steps 2 and 5 are the additions, and step 3 changes what gets embedded. Everything else is standard RAG. But the quality difference in step 7 — the thing the user actually sees — is pronounced.
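The steps above can be compressed into one function. Every helper here (`llm_complete`, `embed`, `vector_search`, `rerank`) is a hypothetical placeholder for whatever LLM, embedding model, vector store, and reranker your pipeline actually uses:

```python
# End-to-end sketch of the pipeline. All helpers are placeholder stubs.

def llm_complete(prompt: str) -> str:
    return "stubbed LLM response"      # placeholder LLM call

def embed(text: str) -> list[float]:
    return [0.1, 0.2, 0.3]             # placeholder embedding call

def vector_search(index, vector, top_n):
    return index[:top_n]               # placeholder similarity search

def rerank(query, candidates, top_k):
    return candidates[:top_k]          # placeholder cross-encoder rerank

def answer(query: str, index: list[str], n: int = 20, k: int = 4) -> str:
    # Step 2: HyDE — generate a hypothetical answer in document-style language.
    hypothetical = llm_complete(f"Write a short, plausible answer to: {query}")
    # Step 3: embed the hypothetical answer, not the raw query.
    search_vector = embed(hypothetical)
    # Step 4: first-pass retrieval of a generous candidate pool.
    candidates = vector_search(index, search_vector, top_n=n)
    # Step 5: rerank against the ORIGINAL query, not the hypothetical answer.
    context = rerank(query, candidates, top_k=k)
    # Steps 6-7: pass the top K chunks to the LLM as context.
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return llm_complete(prompt)
```

One detail worth noting in the sketch: the reranker scores candidates against the user's original query, while the vector search runs on the hypothetical answer. Each stage uses the input it's best suited to.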
What Changes in Practice
The most noticeable change is how the pipeline handles difficult queries. The ones that are vague. The ones where the user doesn't know the right terminology. The ones where the answer exists in the knowledge base but the question and the answer are phrased very differently.
Those queries used to be where my pipelines showed their weakest results — retrieved chunks that were adjacent to the right answer but not quite it, leading to LLM responses that were confident but subtly off. With HyDE and reranking in place, those queries are now the ones that most clearly demonstrate the improvement. The system finds the right material even when the query doesn't give it much to work with.
The other thing that changes is the consistency of output quality across query types. Without these techniques, there's a noticeable variance — some queries get excellent results, others get mediocre ones, and it can be hard to predict which is which. With them, the floor rises. The best results don't necessarily get much better, but the worst results improve substantially, and the average quality across the pipeline becomes something you can actually rely on.
A Few Practical Notes
If you're considering adding these to an existing pipeline, a few things worth knowing.
HyDE latency is real. You're adding an LLM call to every retrieval. For synchronous, user-facing queries where response time matters, this is worth benchmarking carefully. In many cases the added latency is acceptable; in some it isn't. Caching hypothetical embeddings for repeated or similar queries can help.
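For the repeated-query case, even a simple in-process cache keyed on a normalised query string avoids the extra LLM call on a hit. A minimal sketch, again with `llm_complete` and `embed` as hypothetical stand-ins (catching genuinely *similar* but non-identical queries would need semantic caching, which is out of scope here):

```python
from functools import lru_cache

def llm_complete(prompt: str) -> str:
    return "hypothetical answer"       # placeholder LLM call

def embed(text: str) -> tuple[float, ...]:
    return (0.1, 0.2, 0.3)             # placeholder embedding call (tuple: hashable)

@lru_cache(maxsize=4096)
def cached_hyde_vector(query: str) -> tuple[float, ...]:
    # On a cache hit, the LLM call is skipped entirely.
    hypothetical = llm_complete(f"Write a short, plausible answer to: {query}")
    return embed(hypothetical)

def hyde_vector(query: str) -> tuple[float, ...]:
    # Light normalisation so trivially different phrasings share an entry.
    return cached_hyde_vector(query.strip().lower())
```

This doesn't help first-time queries, but for workloads with repeated questions it claws back a meaningful share of the added latency.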
Cohere reranking requires a decision on N and K. How many candidates you retrieve initially (N) and how many you pass to the LLM after reranking (K) both affect quality and cost. Too small an N and you might not catch the best chunks in the initial retrieval pool for the reranker to elevate. Too large a K and you're padding the LLM's context with diminishing-returns material. Finding the right values for your specific use case takes experimentation.
Evaluation matters more than intuition. The improvement from these techniques is real, but it's not uniform across all query types and all knowledge bases. Building even a lightweight evaluation set — a handful of representative queries with known good answers — is the only reliable way to measure whether the changes are actually helping and to tune the parameters sensibly.
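An evaluation set for retrieval can be as simple as a list of (query, expected chunk) pairs and a hit-rate check. In this sketch, `retrieve` is a hypothetical stand-in for your pipeline's retrieval step; in practice you'd run the harness once before and once after a change:

```python
# Lightweight retrieval evaluation: does the known-good chunk appear in
# the top K results? retrieve() is a placeholder for the real pipeline.

def retrieve(query: str, k: int) -> list[str]:
    return ["chunk-about-overheating", "chunk-about-wallpaper"][:k]  # stub

EVAL_SET = [
    # (query, identifier of the chunk that should be retrieved)
    ("why does my laptop overheat on video calls", "chunk-about-overheating"),
    ("how do I change my wallpaper", "chunk-about-wallpaper"),
]

def hit_rate_at_k(k: int = 5) -> float:
    # Fraction of eval queries whose expected chunk lands in the top K.
    hits = sum(1 for query, expected in EVAL_SET
               if expected in retrieve(query, k))
    return hits / len(EVAL_SET)
```

Even a dozen representative queries is enough to tell you whether HyDE is helping, whether the reranker is promoting the right chunks, and how N and K trade off for your corpus.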
The Bigger Point
Standard RAG is a good starting point. But "good enough for a demo" and "reliable enough for production" are separated by a meaningful gap, and that gap lives mostly in retrieval quality.
HyDE and Cohere reranking are two of the most impactful techniques I've found for closing it. They don't require a fundamental rearchitecture of your pipeline — they slot in at specific points and do their job quietly. But the difference in what reaches your LLM, and therefore in what your users actually get back, is hard to overstate once you've seen it working properly.
If your RAG pipeline almost works — if it's solid on easy queries but shows cracks when the language gets difficult — this is probably where to look next.