Benchmarks

Building your contextual brain

Your AI needs to retrieve exactly the right knowledge from thousands of entries — every time, without flooding the context window. We benchmarked three approaches on a production workspace.

100.0%

Recall@10 — zero information loss

Why retrieval accuracy matters

When your AI misses a relevant skill, it silently produces non-compliant output — wrong API patterns, missing design tokens, ignored security protocols. The difference between 93% and 100% recall is the difference between a system that occasionally fails and one that reliably surfaces all relevant knowledge.

83

skills

Production knowledge entries across 15 semantic groups

36

queries

Curated benchmark queries across 10 difficulty categories

90%

reduction

Context tokens saved vs. dumping all skills into context

Three strategies, progressively smarter

We tested three search architectures — from basic vector similarity to our full hierarchical routing with task decomposition.

S3: Decomposed + Two-Layer

Decomposes a broad query into N focused sub-queries. Each sub-query gets its own group routing pass. Results are unioned by max-score per skill.

94.9%

Recall@5

100.0%

Recall@10

89.9%

MRR

89.7%

Hit@1
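
The four headline metrics above can be computed per query and then averaged over the benchmark. The sketch below uses the standard textbook definitions, which this page does not spell out, so treat the exact formulas as an assumption. The skill names and the ranked list are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the expected skills that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_1(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1.0 if the top-ranked result is an expected skill, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first expected skill, or 0.0 if none is retrieved."""
    for rank, skill in enumerate(ranked, start=1):
        if skill in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical query: two expected skills, five ranked results.
ranked = ["api-patterns", "design-tokens", "ci-setup", "security-protocols", "naming"]
relevant = {"design-tokens", "security-protocols"}

print(recall_at_k(ranked, relevant, 3))   # 0.5
print(recall_at_k(ranked, relevant, 5))   # 1.0
print(hit_at_1(ranked, relevant))         # 0.0
print(reciprocal_rank(ranked, relevant))  # 0.5
```

MRR is the mean of `reciprocal_rank` over all queries. Recall@k and Hit@1 are averaged the same way.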

Side-by-side comparison

Across every core metric, hierarchical routing with decomposition outperforms flat vector search — especially on complex queries.

Recall@5: S1 86.7%, S2 88.9%, S3 94.9%

Recall@10: S1 93.2%, S2 95.7%, S3 100.0%

MRR (Mean Reciprocal Rank): S1 78.9%, S2 84.1%, S3 89.9%

Hit@1: S1 71.8%, S2 78.6%, S3 89.7%


| Metric | S1: Flat Semantic | S2: Two-Layer Routing | S3: Decomposed + Two-Layer |
| --- | --- | --- | --- |
| Composite Score | 76.8% | 79.4% | 85.0% |
| Recall@5 | 86.7% | 88.9% | 94.9% |
| Recall@10 | 93.2% | 95.7% | 100.0% |
| Hit@1 | 71.8% | 78.6% | 89.7% |
| MRR | 78.9% | 84.1% | 89.9% |
| NDCG@10 | 77.8% | 80.2% | 84.0% |
| Precision@5 | 48.7% | 48.2% | 55.7% |
| Avg First Hit Rank | 2.0 | 1.8 | 1.4 |

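
For the remaining rows, a minimal sketch of Precision@k, NDCG@k, and first hit rank is shown below. It assumes binary relevance (a skill is either expected or not) and the common log2 discount for NDCG. The page does not state which NDCG variant was used or how the Composite Score is weighted, so the composite is left out. The skill names are hypothetical.

```python
import math
from typing import Optional, Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top k slots filled by expected skills."""
    return sum(1 for skill in ranked[:k] if skill in relevant) / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, skill in enumerate(ranked[:k], start=1)
        if skill in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def first_hit_rank(ranked: Sequence[str], relevant: Set[str]) -> Optional[int]:
    """1-based rank of the first expected skill, or None if none is retrieved."""
    for rank, skill in enumerate(ranked, start=1):
        if skill in relevant:
            return rank
    return None


# Hypothetical query: expected skills sit at ranks 2 and 4.
ranked = ["api-patterns", "design-tokens", "ci-setup", "security-protocols", "naming"]
relevant = {"design-tokens", "security-protocols"}

print(precision_at_k(ranked, relevant, 5))        # 0.4
print(round(ndcg_at_k(ranked, relevant, 10), 3))  # 0.651
print(first_hit_rank(ranked, relevant))           # 2
```

Precision@5 is capped by how many skills a query expects: a query with two expected skills can score at most 40%, which is one reason precision sits well below recall in the table.
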
Zero Information Loss

100% Recall@10

With S3, every expected knowledge entry was surfaced within the top 10 results across all 36 benchmark queries. No skill was missed.

Zero-recall queries

1 → 0

Avg first hit rank

2.0 → 1.4

Complex query R@5

78.8% → 98.4%

How it works

Our best-performing strategy combines query decomposition with hierarchical group routing for maximum recall with minimal noise.

How Decomposed Two-Layer Routing Works

1. Decompose the Query

A broad query is split into focused sub-tasks, each targeting a specific information need.

"Remotion pipeline", "Design system tokens", "Motion graphics", "Production workflow"

2. Route Through Groups (L0) → Skills (L1)

Each sub-task independently activates the most relevant semantic groups, then scores individual skills within those groups. Different sub-tasks can activate different clusters.

3. Max-Score Union

Results from all sub-tasks are merged. For each skill, we take the highest score across all sub-tasks — preserving signal without dilution. Final ranking by merged scores.
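
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not Kavela's implementation: the `similarity` function is a toy token-overlap score standing in for embedding similarity, the sub-queries are passed in already decomposed (the page does not say how decomposition is performed), and the groups, skill names, and `top_groups` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def similarity(query: str, text: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of lowercase tokens."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def route(sub_query: str, groups: List[dict], top_groups: int = 1) -> Dict[str, float]:
    """L0: pick the best-matching groups. L1: score the skills inside them."""
    ranked_groups = sorted(
        groups, key=lambda g: similarity(sub_query, g["description"]), reverse=True
    )
    scores: Dict[str, float] = {}
    for group in ranked_groups[:top_groups]:
        for name, text in group["skills"].items():
            scores[name] = similarity(sub_query, text)
    return scores


def decomposed_search(
    sub_queries: List[str], groups: List[dict], k: int = 10
) -> List[Tuple[str, float]]:
    """Route each sub-query separately, then merge by max score per skill."""
    merged: Dict[str, float] = {}
    for sub_query in sub_queries:
        for name, score in route(sub_query, groups).items():
            merged[name] = max(score, merged.get(name, 0.0))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical workspace with three semantic groups.
groups = [
    {"description": "remotion video pipeline motion graphics",
     "skills": {"remotion-render": "remotion pipeline render setup",
                "motion-easing": "motion graphics easing curves"}},
    {"description": "design system tokens",
     "skills": {"color-tokens": "design system tokens for color"}},
    {"description": "production workflow release",
     "skills": {"release-checklist": "production workflow release checklist"}},
]
sub_queries = ["Remotion pipeline", "Design system tokens",
               "Motion graphics", "Production workflow"]

results = decomposed_search(sub_queries, groups)
for name, score in results:
    print(f"{name}: {score:.2f}")
```

In this toy run, `remotion-render` scores 0.5 for "Remotion pipeline" and 0.0 for "Motion graphics". Taking the maximum keeps the 0.5, whereas averaging would halve it. That is the dilution the max-score union is meant to avoid.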

Performance by query category

Recall@5 across 10 categories — decomposition shines on cross-group and multi-topic queries where flat search struggles most.

| Category | Queries | S1 | S2 | S3 |
| --- | --- | --- | --- | --- |
| Decomposition Advantage | 4 | 71.3% | 71.3% | 100.0% |
| Cross-Group | 3 | 47.2% | 58.3% | 91.7% |
| Cross-Domain | 4 | 91.7% | 91.7% | 100.0% |
| Multi-Step | 4 | 91.7% | 100.0% | 100.0% |
| Keyword Exact | 4 | 100.0% | 100.0% | 100.0% |
| Semantic Inference | 5 | 100.0% | 100.0% | 80.0% |
| Negation | 2 | 100.0% | 100.0% | 100.0% |
| Temporal | 3 | 100.0% | 100.0% | 100.0% |
| Single Skill | 5 | 80.0% | 80.0% | 80.0% |
| Broad Recall | 2 | 100.0% | 100.0% | 100.0% |

Biggest Win

Cross-Group queries: +44.5pp

Queries needing skills from multiple organizational groups went from 47.2% to 91.7% Recall@5 — nearly doubling retrieval accuracy.

Decomposition Value

Decomposition-advantage queries: 71.3% → 100%

Wide queries that span 4+ topics — exactly where single-embedding search dilutes — now achieve perfect recall through focused sub-queries.

100% recall, minimal context

We don't dump your entire knowledge base into context. Kavela surfaces 5–10 precisely relevant skills per query, cutting context tokens by roughly 90% (about 41,500 down to about 3,500 per query) while keeping 100% Recall@10 on this benchmark.

Dump everything

~41,500

tokens consumed per query

Wastes attention, increases cost

Kavela retrieval

~3,500

tokens consumed per query

100% Recall@10, far fewer tokens
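
As a quick check, the reduction can be recomputed from the two approximate token counts above. Both counts are rounded, so the result is only indicative. The variable names are illustrative.

```python
dump_tokens = 41_500       # approximate tokens when every skill is loaded
retrieval_tokens = 3_500   # approximate tokens for the retrieved skills

reduction = 1 - retrieval_tokens / dump_tokens
print(f"{reduction:.1%}")  # 91.6%
```

With these rounded counts the saving comes out slightly above the quoted 90%.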

Benchmark methodology

Give your AI team a perfect memory

Store your team's knowledge once. Kavela surfaces exactly what's relevant — every time, with 100% recall.

Get early access