Quality Architecture for Agentic Skills and Tools
A three-layer framework for contract quality, execution reliability, and orchestration-time selection — and why optimizing only the third layer is insufficient without addressing the upstream layers.
Selection Accuracy Is a Downstream Symptom
Agentic platforms — whether consumer-facing assistants, enterprise productivity tools, or developer environments — share a common architectural pattern: an LLM-based orchestrator selects from a registry of available capabilities (tools and skills) to accomplish user tasks. The quality of this selection, and the reliability of the selected capability, determines system-level performance.
The engineering community has focused heavily on the selection layer: better embeddings, smarter routing, richer metadata. This focus is understandable but misplaced as a primary strategy. Selection quality is bounded by the upstream layers. A well-specified tool with a reliable execution record is easy to route to correctly. A poorly specified tool that overlaps with adjacent capabilities creates an ambiguity that no retrieval mechanism can cleanly resolve.
The underlying cause is not model capability. It is that tool registries accumulate quality debt across all three layers simultaneously, and most teams have no systematic way to measure or address the upstream debt.
Three Layers, One Metric
Tool and skill quality must be evaluated across three distinct but interconnected layers: contract quality (Layer 1), execution reliability (Layer 2), and orchestration-time selection (Layer 3). The layers compose multiplicatively, roughly P(E2E success) ≈ P(unambiguous contract) × P(successful execution) × P(correct selection), so weak performance at any layer propagates downstream and cannot be recovered by the others.
GitHub's engineering team found that small edits to tool descriptions — tightening scope, separating overlapping tools, adding context about when not to use a tool — produced significant selection improvements. An independent analysis of a major AI platform's published tool registry found that several tools failed basic spec compliance criteria (schema completeness, non-overlapping descriptions) — suggesting contract debt is common even in carefully maintained registries.
Silent failures are the most dangerous: the orchestrator proceeds as if the tool succeeded, builds further actions on a false premise, and compounds error. Any production agentic system should treat silent failure detection as a first-class instrumentation requirement, not an afterthought. The orchestrator should always receive an explicit success/failure signal and a structured error payload it can reason over.
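A minimal sketch of such a result envelope, assuming a Python orchestrator. All names here (`ToolResult`, `Status`, the helper constructors) are illustrative, not a standard API; the point is that success is an explicit field, never inferred from the payload's shape:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class ToolResult:
    """Envelope every tool call returns: the orchestrator never has to
    guess whether the call succeeded from the payload alone."""
    status: Status
    payload: Any = None
    # Structured, machine-readable error the orchestrator can reason over.
    error_code: Optional[str] = None
    error_detail: Optional[str] = None
    retryable: bool = False


def succeeded(payload: Any) -> ToolResult:
    return ToolResult(Status.SUCCESS, payload)


def failed(code: str, detail: str, retryable: bool = False) -> ToolResult:
    return ToolResult(Status.FAILURE, None, code, detail, retryable)
```

With this shape, a downstream planner can branch on `result.status` and `result.retryable` rather than pattern-matching on free-form payloads.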
An important secondary effect: selection precision simultaneously functions as the primary per-query token cost control. A system that loads 40 tools into the prompt for every query — because it cannot confidently select the right 3–5 — is not just paying a quality cost. It is paying a direct inference cost on every request.
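The cost effect is easy to make concrete with back-of-envelope arithmetic. The numbers below (average schema size, daily query volume) are assumptions for illustration, not measurements:

```python
# Prompt-token overhead of loading the whole registry versus a
# confidently selected subset. All constants are illustrative assumptions.
TOKENS_PER_TOOL_SCHEMA = 350   # assumed average tool schema size
QUERIES_PER_DAY = 100_000      # assumed traffic


def daily_tool_tokens(tools_loaded: int) -> int:
    """Prompt tokens spent on tool schemas alone, per day."""
    return tools_loaded * TOKENS_PER_TOOL_SCHEMA * QUERIES_PER_DAY


broad = daily_tool_tokens(40)   # cannot select: load everything
narrow = daily_tool_tokens(5)   # high selection precision
print(f"excess schema tokens/day: {broad - narrow:,}")  # 1,225,000,000
```

Over a billion excess prompt tokens per day, under these assumptions, paid before the model does any useful work.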
Atomic Tools vs. Composed Skills
A related architectural question with significant quality implications: when should a capability be implemented as an atomic tool (a single function with a defined contract) versus a composed skill (a multi-step workflow encoding a task pattern)?
| Dimension | Atomic Tool | Composed Skill |
|---|---|---|
| Contract clarity | High — bounded scope, single action | Lower — multi-step scope harder to specify without ambiguity |
| Execution quality measurement | Straightforward — success/failure is well-defined | Complex — partial success, step-level vs. task-level quality |
| Selection precision | Higher for narrow task types | Higher for complex task patterns with implicit ordering |
| Token cost at selection time | Lower per-tool schema size | Higher — skill description must encode workflow intent |
| Maintenance burden | Low — changes are isolated | Higher — workflow changes require full regression |
| Best for | Deterministic, well-scoped operations | Recurring task patterns with implicit ordering and user-preference encoding |
A useful heuristic: if the capability requires the orchestrator to make meaningful decisions about ordering, branching, or fallback within its execution, it is a skill. If the decision-making belongs entirely to the orchestrator (and the capability simply executes a defined action when called), it is a tool. Mis-categorizing — typically encoding orchestrator-level decisions inside a tool — is a common source of contract ambiguity at Layer 1.
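The heuristic can be illustrated with a sketch. Both capability names below (`create_calendar_event`, `schedule_meeting`) and their signatures are hypothetical:

```python
from typing import Callable, Optional


# Atomic tool: a single defined action with no internal decision-making.
# The orchestrator decides *when* to call it; the tool just executes.
def create_calendar_event(title: str, start_iso: str, end_iso: str) -> dict:
    """Deterministic and well-scoped: the contract is easy to specify."""
    return {"title": title, "start": start_iso, "end": end_iso}


# Composed skill: ordering, branching, and fallback live *inside* the
# capability, encoding a recurring task pattern.
def schedule_meeting(
    attendees: list,
    title: str,
    find_slot: Callable[[list], Optional[tuple]],
    notify: Callable[[list, dict], None],
) -> dict:
    slot = find_slot(attendees)                      # step 1: search
    if slot is None:                                 # branch: fallback
        return {"status": "failure", "error_code": "NO_COMMON_SLOT"}
    start, end = slot
    event = create_calendar_event(title, start, end)  # step 2: book
    notify(attendees, event)                          # step 3: notify
    return {"status": "success", "event": event}
```

Note how the skill's contract must describe a workflow (search, branch, book, notify) while the tool's contract describes one action, which is exactly the Layer 1 specification gap the table summarizes.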
What This Implies for System Builders
Treat contract quality as a pre-deployment gate, not a launch assumption
Every tool or skill added to a production registry should pass a contract quality check before deployment: description specificity audit, schema completeness validation, overlap analysis against existing registry entries. This is analogous to a type system or a lint pass — it catches a class of errors before they compound in production. Teams that skip this step consistently report high false-positive rates in their orchestration layer and cannot isolate whether the cause is retrieval quality or contract ambiguity.
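A minimal sketch of such a lint pass, assuming tools are represented as plain dicts with `name`, `description`, and `parameters` fields. The checks and thresholds are illustrative starting points, not standards:

```python
import re


def lint_tool_contract(tool: dict, registry: list,
                       overlap_threshold: float = 0.5) -> list:
    """Pre-deployment contract checks: description specificity,
    schema completeness, and overlap against existing registry entries."""
    problems = []
    desc = tool.get("description", "")

    # 1. Description specificity: too short to disambiguate at selection time.
    if len(desc.split()) < 10:
        problems.append("description too short to disambiguate")

    # 2. Schema completeness: every parameter needs a type and a description.
    for name, spec in tool.get("parameters", {}).items():
        if "type" not in spec or "description" not in spec:
            problems.append(f"parameter '{name}' missing type/description")

    # 3. Overlap analysis: token Jaccard similarity against existing entries.
    words = set(re.findall(r"\w+", desc.lower()))
    for other in registry:
        other_words = set(re.findall(r"\w+", other["description"].lower()))
        union = words | other_words
        if union and len(words & other_words) / len(union) > overlap_threshold:
            problems.append(f"description overlaps with '{other['name']}'")
    return problems
```

An empty return value is the deployment gate passing; any entry in the list blocks the tool from the registry until fixed, the same way a lint error blocks a merge.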
Instrument silent failures as a first-class production signal
Every tool call should produce an explicit success/failure signal, a structured error payload, and a confidence indicator where applicable. Silent failures — where the tool returns a result but the result does not reflect actual task completion — are the hardest to detect and the most damaging to compound tasks. Treat any tool that can fail silently as a Layer 2 debt item requiring immediate remediation, regardless of how well-specified its contract is.
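One way to instrument this is a post-hoc validation wrapper: run the tool, then verify its claimed success against an independent postcondition, and downgrade the result before the orchestrator can build on it. The names and the dict-based result shape below are illustrative:

```python
from typing import Any, Callable


def call_with_silent_failure_check(
    tool: Callable[..., dict],
    postcondition: Callable[[dict], bool],
    on_silent_failure: Callable[[str], None],
    **kwargs: Any,
) -> dict:
    """Run a tool, then check its claimed success against an independent
    postcondition. A result that claims success but fails the check is
    reported and converted into an explicit failure."""
    result = tool(**kwargs)
    if result.get("status") == "success" and not postcondition(result):
        on_silent_failure(f"{tool.__name__}: claimed success, failed check")
        return {"status": "failure",
                "error_code": "SILENT_FAILURE",
                "error_detail": "postcondition not satisfied"}
    return result
```

The postcondition is deliberately separate from the tool itself: a tool cannot be trusted to detect its own silent failures, since the failure mode is precisely that its self-reported status is wrong.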
Measure active tool count per query as a cost and quality proxy
The number of tools loaded into the orchestrator's context window per query is a cheap, continuous proxy for both selection quality and token cost. A well-maintained registry with high contract quality should allow the orchestrator to confidently select 3–5 tools for most queries. If the median active tool count is 20+, this signals contract overlap or insufficient registry curation — and the token cost implication is direct and compounding across query volume.
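Computing this proxy from query logs is a few lines. The log shape and the alert threshold below are assumptions for illustration:

```python
from statistics import median

# Assumed log shape: one record per query, recording how many tool
# schemas were loaded into the orchestrator's context window.
query_logs = [
    {"query_id": 1, "tools_loaded": 4},
    {"query_id": 2, "tools_loaded": 5},
    {"query_id": 3, "tools_loaded": 27},
    {"query_id": 4, "tools_loaded": 3},
    {"query_id": 5, "tools_loaded": 24},
]

ACTIVE_TOOL_BUDGET = 20  # illustrative threshold from the 20+ signal above

counts = [q["tools_loaded"] for q in query_logs]
med = median(counts)
if med >= ACTIVE_TOOL_BUDGET:
    print(f"registry curation alert: median active tools = {med}")
else:
    print(f"median active tools = {med} (within budget)")
```

The median matters more than the mean here: a few legitimately complex queries loading many tools should not mask a healthy typical case, and vice versa.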
Evaluate at the E2E task level, not the per-tool level
Individual tool success rates are necessary but not sufficient as quality metrics. A tool that succeeds 95% of the time but is selected for the wrong task class 30% of the time has a compound E2E success rate well below 95%. Estimating P(E2E success) requires all three factors: contract quality, execution reliability, and selection accuracy, multiplied together. Teams that measure only Layer 3 (selection accuracy) or only Layer 2 (per-tool success rate) will consistently misattribute their E2E quality problems to the wrong layer.
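The example above works out as follows. Contract quality is held at 1.0 here to isolate the selection effect:

```python
# Compound end-to-end success for the worked example in the text:
# 95% execution success, selected for the wrong task class 30% of the time.
p_contract = 1.00    # assume an unambiguous contract (Layer 1)
p_execution = 0.95   # per-tool success rate (Layer 2)
p_selection = 0.70   # correct-selection rate (Layer 3)

p_e2e = p_contract * p_execution * p_selection
print(f"P(E2E success) = {p_e2e:.3f}")  # 0.665, well below the 95% headline
```

A team staring only at the 95% per-tool dashboard would never see the 33-point gap that selection errors open up at the task level.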
The Registry Scale Challenge
The quality framework above is tractable for curated, organization-managed registries. The challenge compounds significantly at ecosystem scale. The open-source tool ecosystem has grown rapidly, with major registries hosting tens of thousands of community-built skills. Analysis of these registries suggests that a meaningful fraction are low-quality by the Layer 1 and Layer 2 criteria above — and that the registry discovery problem (how does an orchestrator find the right tool across a large heterogeneous registry?) is itself a Layer 3 challenge that current retrieval approaches do not fully solve.
A secondary concern at ecosystem scale: tool quality is also a security surface. Several documented supply-chain incidents in community registries have involved malicious tools using typosquatted names — exploiting the fact that the orchestrator selects by name similarity and description match, not by provenance verification. For enterprise deployments, registry provenance and signing should be treated as Layer 1 contract attributes, not as out-of-scope infrastructure concerns.
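A crude first-pass defense against the typosquatting pattern is a name-proximity check at registration time: flag entries whose names are suspiciously close to a trusted tool without being identical. The trusted-set contents and the similarity threshold below are illustrative, and this is a heuristic, not a substitute for provenance verification and signing:

```python
from difflib import SequenceMatcher

# Illustrative allowlist of verified, signed tools.
TRUSTED = {"fetch_url", "send_email", "create_invoice"}


def typosquat_candidates(name: str, trusted: set = TRUSTED,
                         threshold: float = 0.85) -> list:
    """Return trusted tool names that a new registry entry's name is
    suspiciously similar to (but not equal to), the pattern behind the
    documented typosquatting incidents."""
    if name in trusted:
        return []
    return [t for t in trusted
            if SequenceMatcher(None, name, t).ratio() >= threshold]
```

An entry with a non-empty candidate list would be held for manual provenance review rather than admitted by description match alone.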
Research notes, half-baked ideas. Probably overthought, definitely over-architected.