Summary: This post explains how using large language models (LLMs) as rerankers improves retrieval-augmented generation (RAG) by selecting the most relevant passages. The author describes engineering techniques like parallel pointwise reranking to speed up processing and reduce errors. Their approach boosts answer quality significantly, but added latency led them to train a custom reranker based on LLM guidance.
reranking, which reorders the results from vector search so the final answer is grounded in the most relevant passages.
There are three ways to prompt the LLM for reranking [arxiv]:
• Pointwise reranking: ask the LLM to rate how relevant each passage is on a scale from 1 to 10. Example output with passage ids: [("id0", 6), ("id1", 10), ("id2", 5), ("id3", 10), ("id4", 4)]
• Listwise reranking: ask the LLM to order the passages by relevance. Example output: "id1" > "id3" > "id0" > "id5" > "id2" > "id4"
• Pairwise reranking: build a ranking through pairwise comparisons, asking the model which of two passages (p_i or p_j) is more relevant to the query. If you implement the LLM as a comparator inside a sort, you typically need O(K log K) or O(K^2) comparisons, making it the most expensive. Studies often find pairwise prompting best on quality, though at higher cost [arxiv].
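The pointwise variant is the easiest to parallelize, since each score is independent. A minimal sketch of the parsing step, assuming the LLM returns exactly the List[Tuple] format shown above (the function name and threshold parameter are illustrative, not from the post):

```python
import ast

def rerank_pointwise(raw: str, threshold: int = 0) -> list[str]:
    """Parse a pointwise LLM response like '[("id0", 6), ("id1", 10), ...]'
    and return passage ids sorted by score, highest first."""
    scores = ast.literal_eval(raw)  # list of (passage_id, score) tuples
    kept = [(pid, s) for pid, s in scores if s >= threshold]
    kept.sort(key=lambda t: t[1], reverse=True)  # stable sort keeps ties in LLM order
    return [pid for pid, _ in kept]

print(rerank_pointwise('[("id0", 6), ("id1", 10), ("id2", 5), ("id3", 10), ("id4", 4)]'))
# → ['id1', 'id3', 'id0', 'id2', 'id4']
```

Because the LLM never sees other passages when scoring one, the K calls can be batched or fanned out to parallel workers without changing the result.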
When cutting latency, reducing output tokens is a good first step.
We switched to a Dict format instead of List[Tuple].
Note: I'm honestly a little surprised that there aren't efficiencies here, because `("` should tokenize to the same thing over and over again.
Thresholding: we instructed the LLM to omit passage ids if their score is below 5.
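Together, the Dict format and the score threshold shrink the output to only the passages worth keeping. A hedged sketch of parsing such a response (the function name is hypothetical; the defensive filter assumes the model occasionally ignores the omit-below-5 instruction):

```python
import json

def parse_reranker_dict(raw: str, threshold: int = 5) -> list[str]:
    """Parse a dict-formatted response like '{"id1": 10, "id3": 10, "id0": 6}'.
    Ids below the threshold should already be omitted by the prompt, but we
    filter defensively in case the model includes them anyway."""
    scores = json.loads(raw)
    kept = {pid: s for pid, s in scores.items() if s >= threshold}
    return sorted(kept, key=kept.get, reverse=True)  # highest score first

print(parse_reranker_dict('{"id1": 10, "id3": 10, "id0": 6, "id4": 4}'))
# → ['id1', 'id3', 'id0']
```

Compared with the List[Tuple] form, the dict drops the parentheses and tuple commas, and the threshold drops entire entries, so the model emits fewer tokens per call.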
Vector search already imposes an ordering bias (higher semantic similarity first). If you naively split passages into consecutive chunks, for example the first K/N for worker 1, the next K/N for worker 2, etc., you'll overweight the first shard with the "best" candidates. We instead assign passages to workers round-robin by index.
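The round-robin split can be sketched in a few lines, assuming passages arrive ordered best-first from vector search (the helper name is illustrative):

```python
def shard_round_robin(passages: list[str], n_workers: int) -> list[list[str]]:
    """Assign passage i (already ordered by vector-search similarity) to
    worker i % n_workers, so each shard gets an even mix of strong and
    weak candidates instead of one shard hoarding the top hits."""
    return [passages[w::n_workers] for w in range(n_workers)]

# 6 passages ranked best-first, split across 3 parallel reranking workers
print(shard_round_robin(["id0", "id1", "id2", "id3", "id4", "id5"], 3))
# → [['id0', 'id3'], ['id1', 'id4'], ['id2', 'id5']]
```

Each worker now sees both high- and low-similarity candidates, so per-shard scores stay comparable when the pointwise results are merged.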