Technical Analysis · Capability Engineering

Quality Architecture for Agentic Skills and Tools

A three-layer framework for contract quality, execution reliability, and orchestration-time selection — and why optimizing only the third layer is insufficient without addressing the upstream layers.

March 2026 · Tool Quality · Agentic Systems · Production Architecture
Abstract
Tool and skill selection accuracy in agentic systems is lower than commonly assumed — recent benchmarks report below 60% accuracy for frontier models across realistic tool registries — and most current engineering effort targets only the retrieval layer. This paper argues that selection quality is a lagging indicator: it is determined upstream by contract quality (how well tools are specified) and execution quality (how reliably tools perform). These three layers compose multiplicatively, which means that local optimization of any one layer produces diminishing returns if the others are degraded. We present a three-layer quality framework, a decision model for when to use atomic tools versus composed skills, and four concrete prescriptions for production agentic systems.

Selection Accuracy Is a Downstream Symptom

Agentic platforms — whether consumer-facing assistants, enterprise productivity tools, or developer environments — share a common architectural pattern: an LLM-based orchestrator selects from a registry of available capabilities (tools and skills) to accomplish user tasks. The quality of this selection, and the reliability of the selected capability, determines system-level performance.

The engineering community has focused heavily on the selection layer: better embeddings, smarter routing, richer metadata. This focus is understandable but misplaced as a primary strategy. Selection quality is bounded by the upstream layers. A well-specified tool with a reliable execution record is easy to route to correctly. A poorly specified tool that overlaps with adjacent capabilities creates an ambiguity that no retrieval mechanism can cleanly resolve.

The empirical case: The MCP-Bench benchmark (ICLR 2026) evaluated frontier models across 28 MCP servers with 250 tools under realistic noisy conditions; models struggled with both tool selection and dependency-chain compliance. MCPVerse reported that the top-performing model (Claude Sonnet, early 2026) achieved only 57.77% accuracy across 550+ real-world tools. A University of Alberta study (February 2026) found that improving tool description quality yielded a 5.85 percentage-point improvement in task success — but also a 67.46% increase in execution steps, revealing a fundamental accuracy-cost tradeoff that better retrieval alone cannot resolve. These numbers would be unacceptable in most production workflows.

The underlying cause is not model capability. It is that tool registries accumulate quality debt across all three layers simultaneously, and most teams have no systematic way to measure or address the upstream debt.


Three Layers, One Metric

Tool and skill quality must be evaluated across three distinct but interconnected layers. The layers compose multiplicatively — weak performance at any layer propagates downstream and cannot be recovered by the others.

P(E2E success) = P(clean contract) × P(tool succeeds | called) × P(right tool selected)
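The multiplicative structure is easy to see in a few lines of code. This is a minimal sketch of the composition above (the probability values are illustrative, not measured):

```python
def e2e_success(p_contract: float, p_execution: float, p_selection: float) -> float:
    """P(E2E success) = P(clean contract) * P(tool succeeds | called) * P(right tool selected)."""
    return p_contract * p_execution * p_selection

# Strong execution cannot recover weak selection:
baseline    = e2e_success(0.95, 0.95, 0.60)  # ~0.54
better_exec = e2e_success(0.95, 0.99, 0.60)  # ~0.56, a marginal gain
better_sel  = e2e_success(0.95, 0.95, 0.85)  # ~0.77, the bottleneck moved
```

Pushing the strongest layer from 0.95 to 0.99 barely moves the product; fixing the weakest layer dominates.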
Layer 1 · Contract Quality (The Specification Layer)
A tool's contract is the specification that tells the orchestrator what the capability does, when to use it, and what inputs/outputs to expect. Contract quality fails in three ways: ambiguous descriptions that don't clearly bound the capability's scope; overlapping descriptions that create retrieval confusion with adjacent tools; and schema incompleteness that causes runtime errors on legitimate inputs.

GitHub's engineering team found that small edits to tool descriptions — tightening scope, separating overlapping tools, adding context about when not to use a tool — produced significant selection improvements. An independent analysis of a major AI platform's published tool registry found that several tools failed basic spec compliance criteria (schema completeness, non-overlapping descriptions) — suggesting contract debt is common even in carefully maintained registries.
Key metrics: description clarity score · schema completeness · overlap coefficient with adjacent tools · LLM-judge invocation correctness
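The overlap coefficient can be approximated very cheaply. This sketch uses lexical token overlap; a production registry would more likely use embedding similarity, but even crude Jaccard-style overlap flags contracts that obviously collide (the example descriptions are invented):

```python
def overlap_coefficient(desc_a: str, desc_b: str) -> float:
    """Fraction of the smaller description's tokens shared with the other."""
    a = set(desc_a.lower().split())
    b = set(desc_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

search = "search the web for recent news articles"
news   = "search the web for news"
print(overlap_coefficient(search, news))  # prints 1.0: a retrieval-confusing pair
```

A pair scoring near 1.0 is a candidate for merging, or for rewriting one description to state explicitly when not to use it.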
Layer 2 · Execution Quality (The Runtime Reliability Layer)
Execution quality measures whether a tool actually does what its contract claims, at the rate it claims, under realistic input distributions. A tool can have a perfect contract and still fail at execution — through silent failures, partial success (succeeds on a narrow subset of valid inputs), inconsistent latency, or poor error handling that leaves the orchestrator with no actionable signal.

Silent failures are the most dangerous: the orchestrator proceeds as if the tool succeeded, builds further actions on a false premise, and compounds error. Any production agentic system should treat silent failure detection as a first-class instrumentation requirement, not an afterthought. The orchestrator should always receive an explicit success/failure signal and a structured error payload it can reason over.
Key metrics: task success rate · silent failure rate · p50 / p99 latency · error payload quality · input distribution coverage
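One way to guarantee the explicit signal is a result envelope that every tool must return. The shape below is a sketch (field names and the `fetch_calendar` / `load_events` functions are invented for illustration); the point is that the orchestrator never receives a bare value it must guess about:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolResult:
    ok: bool
    data: Any = None
    error_code: Optional[str] = None    # machine-readable, e.g. "UPSTREAM_TIMEOUT"
    error_detail: Optional[str] = None  # human/LLM-readable context
    retryable: bool = False             # actionable signal for the orchestrator

def load_events(user_id: str) -> list:
    """Stub standing in for a real calendar backend."""
    raise TimeoutError

def fetch_calendar(user_id: str) -> ToolResult:
    try:
        events = load_events(user_id)
    except TimeoutError:
        return ToolResult(ok=False, error_code="UPSTREAM_TIMEOUT",
                          error_detail="calendar backend timed out",
                          retryable=True)
    return ToolResult(ok=True, data=events)
```

Because the failure case carries `retryable` and a structured code, the orchestrator can reason over it instead of proceeding on a false premise.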
Layer 3 · Orchestration-Time Selection Quality (The Routing Layer)
Selection quality measures whether the orchestrator routes a given task to the right tool, across the full distribution of task types in production. This is the layer that has received the most engineering attention — embedding-based retrieval, reranking, routing heuristics. It is also the layer most constrained by Layers 1 and 2: selection precision is bounded by contract clarity, and selection recall is bounded by the tool's actual execution coverage.

An important secondary effect: selection precision simultaneously functions as the primary per-query token cost control. A system that loads 40 tools into the prompt for every query — because it cannot confidently select the right 3–5 — is not just paying a quality cost. It is paying a direct inference cost on every request.
Key metrics: precision @ k · recall across task distribution · false positive rate · active tool count per query (token cost proxy)
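The cost effect shows up directly in threshold-gated tool loading. In this sketch (tool names, scores, and thresholds are all illustrative), the number of tools that clear the gate per query is the token-cost proxy: an overlapping registry produces many near-tied scores above threshold, so more schemas get loaded on every request:

```python
def select_tools(scores: dict[str, float], threshold: float = 0.55, k: int = 5) -> list[str]:
    """Load the top-k tools whose relevance score clears the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, s in ranked[:k] if s >= threshold]

# A clean registry separates well: one confident match.
clean = {"calendar.read": 0.91, "calendar.write": 0.42, "email.search": 0.31}
# An overlapping registry does not: four near-ties all clear the gate.
muddy = {"search_docs": 0.62, "find_files": 0.61, "lookup_notes": 0.60, "query_kb": 0.58}

print(len(select_tools(clean)), len(select_tools(muddy)))  # prints: 1 4
```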

Atomic Tools vs. Composed Skills

A related architectural question with significant quality implications: when should a capability be implemented as an atomic tool (a single function with a defined contract) versus a composed skill (a multi-step workflow encoding a task pattern)?

| Dimension | Atomic Tool | Composed Skill |
| --- | --- | --- |
| Contract clarity | High — bounded scope, single action | Lower — multi-step scope harder to specify without ambiguity |
| Execution quality measurement | Straightforward — success/failure is well-defined | Complex — partial success, step-level vs. task-level quality |
| Selection precision | Higher for narrow task types | Higher for complex task patterns with implicit ordering |
| Token cost at selection time | Lower per-tool schema size | Higher — skill description must encode workflow intent |
| Maintenance burden | Low — changes are isolated | Higher — workflow changes require full regression |
| Best for | Deterministic, well-scoped operations | Recurring task patterns with implicit ordering and user-preference encoding |

A useful heuristic: if the capability requires the orchestrator to make meaningful decisions about ordering, branching, or fallback within its execution, it is a skill. If the decision-making belongs entirely to the orchestrator (and the capability simply executes a defined action when called), it is a tool. Mis-categorizing — typically encoding orchestrator-level decisions inside a tool — is a common source of contract ambiguity at Layer 1.
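The heuristic can be written down as a checklist. The criteria names below are my paraphrase of the text, not an established taxonomy:

```python
def classify(capability: dict) -> str:
    """Skill if the capability makes ordering/branching/fallback decisions internally."""
    internal_decisions = (
        capability.get("orders_own_steps", False)
        or capability.get("branches_internally", False)
        or capability.get("has_internal_fallback", False)
    )
    return "skill" if internal_decisions else "tool"

print(classify({"branches_internally": True}))  # multi-step workflow -> "skill"
print(classify({}))                             # single bounded action -> "tool"
```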


What This Implies for System Builders

Treat contract quality as a pre-deployment gate, not a launch assumption

Every tool or skill added to a production registry should pass a contract quality check before deployment: description specificity audit, schema completeness validation, overlap analysis against existing registry entries. This is analogous to a type system or a lint pass — it catches a class of errors before they compound in production. Teams that skip this step consistently report high false-positive rates in their orchestration layer and cannot isolate whether the cause is retrieval quality or contract ambiguity.
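A minimal version of such a gate is just a lint pass over the contract. The field names below follow common JSON-Schema-style tool specs, and the specific checks (description length, negative guidance, per-parameter descriptions) are illustrative heuristics, not a standard:

```python
def lint_contract(contract: dict) -> list[str]:
    """Return a list of contract-quality problems; empty list means the gate passes."""
    problems = []
    desc = contract.get("description", "")
    if len(desc.split()) < 8:
        problems.append("description too short to bound scope")
    if "when not to use" not in desc.lower() and "not for" not in desc.lower():
        problems.append("no negative guidance (when NOT to use this tool)")
    for name, spec in contract.get("parameters", {}).get("properties", {}).items():
        if "description" not in spec:
            problems.append(f"parameter '{name}' missing description")
    return problems

contract = {
    "name": "search_flights",
    "description": "Search flights.",
    "parameters": {"properties": {"origin": {"type": "string"}}},
}
for p in lint_contract(contract):
    print("FAIL:", p)
```

Run in CI, this catches the same class of errors before deployment that a type system catches before runtime. A real gate would add overlap analysis against the existing registry.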

Instrument silent failures as a first-class production signal

Every tool call should produce an explicit success/failure signal, a structured error payload, and a confidence indicator where applicable. Silent failures — where the tool returns a result but the result does not reflect actual task completion — are the hardest to detect and the most damaging to compound tasks. Treat any tool that can fail silently as a Layer 2 debt item requiring immediate remediation, regardless of how well-specified its contract is.
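One way to instrument this (a sketch of one possible pattern, not a standard one) is to attach a cheap post-condition to each tool and verify it before the result reaches the orchestrator. A tool that claims success while violating its own post-condition is flagged instead of silently trusted:

```python
def with_postcondition(tool_fn, check, tool_name: str):
    """Wrap a tool so a violated post-condition raises instead of passing silently."""
    def wrapped(*args, **kwargs):
        result = tool_fn(*args, **kwargs)
        if not check(result):
            raise RuntimeError(f"{tool_name}: silent-failure postcondition violated")
        return result
    return wrapped

# Hypothetical tool that returns an empty payload while claiming success:
def broken_export(doc_id: str) -> dict:
    return {"status": "ok", "bytes_written": 0}

export = with_postcondition(broken_export,
                            check=lambda r: r.get("bytes_written", 0) > 0,
                            tool_name="broken_export")
# export("doc-123") now raises, so the orchestrator cannot build on a false premise.
```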

Measure active tool count per query as a cost and quality proxy

The number of tools loaded into the orchestrator's context window per query is a cheap, continuous proxy for both selection quality and token cost. A well-maintained registry with high contract quality should allow the orchestrator to confidently select 3–5 tools for most queries. If the median active tool count is 20+, this signals contract overlap or insufficient registry curation — and the token cost implication is direct and compounding across query volume.
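This makes for a cheap continuous health check. The thresholds below follow the ranges suggested above (3–5 healthy, 20+ signaling registry debt); the function and its labels are illustrative:

```python
from statistics import median

def registry_health(active_counts: list[int]) -> str:
    """Classify a window of per-query active tool counts."""
    m = median(active_counts)
    if m <= 5:
        return "healthy"
    if m < 20:
        return "watch: possible contract overlap"
    return "debt: curate registry / tighten contracts"

print(registry_health([3, 4, 5, 4, 3]))       # healthy
print(registry_health([24, 31, 19, 27, 22]))  # debt
```

Because the metric is observable on every request, it trends in dashboards without any labeled evaluation data.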

Evaluate at the E2E task level, not the per-tool level

Individual tool success rates are necessary but not sufficient as quality metrics. A tool that succeeds 95% of the time but is selected for the wrong task class 30% of the time has a compound E2E success rate of at most 0.95 × 0.70 ≈ 66.5%. The P(E2E success) formula above requires estimating all three factors. Teams that measure only Layer 3 (selection accuracy) or only Layer 2 (per-tool success rate) will consistently misattribute their E2E quality problems to the wrong layer.


The Registry Scale Challenge

The quality framework above is tractable for curated, organization-managed registries. The challenge compounds significantly at ecosystem scale. The open-source tool ecosystem has grown rapidly, with major registries hosting tens of thousands of community-built skills. Analysis of these registries suggests that a meaningful fraction are low-quality by the Layer 1 and Layer 2 criteria above — and that the registry discovery problem (how does an orchestrator find the right tool across a large heterogeneous registry?) is itself a Layer 3 challenge that current retrieval approaches do not fully solve.

A secondary concern at ecosystem scale: tool quality is also a security surface. Several documented supply-chain incidents in community registries have involved malicious tools using typosquatted names — exploiting the fact that the orchestrator selects by name similarity and description match, not by provenance verification. For enterprise deployments, registry provenance and signing should be treated as Layer 1 contract attributes, not as out-of-scope infrastructure concerns.
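A simple registry-side screen against typosquatting flags submissions whose names sit within a small edit distance of an existing tool. This is an illustrative check only (pure-Python Levenshtein; the tool names are invented), and per the text it complements rather than replaces provenance verification and signing:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def typosquat_candidates(new_name: str, registry: list[str], max_dist: int = 2) -> list[str]:
    """Existing names suspiciously close to (but not identical to) the new name."""
    return [n for n in registry if 0 < levenshtein(new_name, n) <= max_dist]

print(typosquat_candidates("web_serch", ["web_search", "file_read", "db_query"]))
# prints: ['web_search']
```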

Research notes, half-baked ideas. Probably overthought, definitely over-architected.