Methodology

How we ship agents into production departments.

A 9-step deployment-first process — three phases, five Academy tracks, one measurement frame, and a clear handover. Built and refined across six engagements in healthcare, government, research, marketing, executive search, and B2B sales.


our philosophy

Three commitments behind every step.

The methodology is the visible part. These three commitments are why it produces a different outcome.

Commitment 01

Diagnose before prescribing

Clients almost always present symptoms, not root causes. We treat every initial brief as a hypothesis. We follow the constraint, not the symptom. We never accept the first answer — and we surface assumptions explicitly before we commit to an architecture.

Commitment 02

Deploy or it doesn't count

A finding that doesn't change behavior is theater. Every recommendation we make is tied to a specific decision the client has to take, with named trade-offs, named preconditions, and a measurable definition of success. We deploy what we recommend — or we don't recommend it.

Commitment 03

Transfer the skill

The deliverable is not a deck and not a dashboard — it's a client team that can build, evaluate, govern, and extend agents without us in the room. Every engagement runs an embedded Academy that produces in-house operators, engineers, evaluators, and governance leads.

the process

9 steps. 3 phases. 1 outcome: autonomous operations.

Every Orchestrary engagement runs through this loop. The phases compress or expand to match the department's complexity, but the steps and the order are the same.

Phase 1 · Ignite · Diagnose, frame, prove value fast · 3–7 days

Diagnostic

60-minute partner-led discovery. We map the department's real workflows, identify the binding constraint, and stress-test the brief. We surface assumptions the client didn't know they were making.

Output → Reframed problem statement with named constraint and three falsifiable hypotheses.

Opportunity Map

Decompose the department into atomic, agent-suitable workflow steps. Score each on volume, structure, ROI, and risk. Pick the 3–5 highest-leverage candidates for the first wave.

Output → Ranked opportunity list with effort/value scoring and EU AI Act risk tier.
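The scoring step can be sketched in a few lines of Python. This is an illustration only: the 1 to 5 scales, the weights, and the names `WorkflowStep`, `score`, and `first_wave` are assumptions made for the example, not the scoring model used in an engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowStep:
    """One atomic workflow step, scored 1 (low) to 5 (high) on each axis."""
    name: str
    volume: int
    structure: int
    roi: int
    risk: int


# Hypothetical weights. Risk is subtracted, so riskier steps rank lower.
WEIGHTS = {"volume": 0.3, "structure": 0.2, "roi": 0.4, "risk": 0.3}


def score(step: WorkflowStep) -> float:
    """Weighted sum of the value axes minus the weighted risk."""
    return (
        WEIGHTS["volume"] * step.volume
        + WEIGHTS["structure"] * step.structure
        + WEIGHTS["roi"] * step.roi
        - WEIGHTS["risk"] * step.risk
    )


def first_wave(steps: list[WorkflowStep], size: int = 5) -> list[WorkflowStep]:
    """Return the highest-scoring candidates, best first."""
    return sorted(steps, key=score, reverse=True)[:size]
```

Calling `first_wave(steps, size=3)` on the decomposed department would return the three strongest candidates. In practice the weights are a judgment call to agree with the client before any ranking is shared.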

Ignite Demo

We deploy one live agent against one real workflow on real data, in a sandboxed environment. Stakeholders watch it work end-to-end. The demo doubles as the first system test.

Output → 1 working agent + measurement baseline + go/no-go decision artifact.

Phase 2 · Pilot · Build the first wave. Run the Academy. Measure everything · 1–3 weeks

Architecture

Design the runtime, the integrations, the secrets management, the observability stack. Choose Claude Code or OpenClaw based on data sovereignty. Wire to your IDP, your CI, your existing systems.

Output → Production architecture with private endpoints, audit log, secrets vault, and rollback plan.

Build & Skill

Senior consultants write the agent skills, custom tools, evaluation suites, and SKILL.md / AGENTS.md guides. Everything ships as code in your repo — no proprietary surface area, no lock-in.

Output → 3–5 production agents with golden datasets and regression tests.

Academy Wave 1

In parallel with build, we run the first cycle of the 5-track Academy. Operators learn the runtime in their own terminals. Engineers learn to write tools. Governance leads learn the policy frame.

Output → 8–25 trained operators + 3–6 trained engineers + signed governance charter.

Phase 3 · Scale · Hand over, harden, expand · Ongoing

Production Cutover

Move the first wave from pilot to production. Real traffic, real money, real consequences. We sit on the bridge for the first 14 days. Incident response, on-call, postmortems — all transferred to the client team by day 30.

Output → Live agents in production with on-call rotation, runbooks, and SLOs.

Continuous Evaluation

Drift detection, regression suites, golden-dataset gates in CI. The client's evaluation cohort runs the QA function. We provide a quarterly "agent health" review and a model-upgrade playbook.

Output → Self-running QA function with CI gates, drift alarms, and quarterly review cadence.
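A golden-dataset gate can be as small as a pass-rate check that CI runs before every release. The sketch below assumes exact-match comparison and hypothetical thresholds (`min_rate`, `max_drop`); real suites usually need domain-specific comparison logic, and the function names are illustrative.

```python
from typing import Callable

# A golden case pairs an input with the output the agent is expected to produce.
GoldenCase = tuple[str, str]


def pass_rate(agent: Callable[[str], str], golden: list[GoldenCase]) -> float:
    """Fraction of golden cases where the agent output matches exactly."""
    if not golden:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for prompt, expected in golden
        if agent(prompt).strip() == expected.strip()
    )
    return passed / len(golden)


def gate(
    agent: Callable[[str], str],
    golden: list[GoldenCase],
    baseline: float,
    min_rate: float = 0.95,   # hypothetical absolute floor
    max_drop: float = 0.02,   # hypothetical tolerated drop versus the last release
) -> bool:
    """Allow a release only if quality is above the floor and has not drifted down."""
    rate = pass_rate(agent, golden)
    return rate >= min_rate and (baseline - rate) <= max_drop
```

A CI job would call `gate(...)` with the candidate agent and the last release's pass rate as `baseline`, and fail the build when it returns `False`. The same comparison against a stored baseline doubles as a crude drift alarm after a model upgrade.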

Handover & Expansion

Final knowledge transfer. The client's engineers ship agent #6 themselves while we observe. We move into a strategic advisory role — quarterly check-ins, model updates, the next department. The engagement ends. The capability stays.

Output → Self-sufficient agent factory + advisory retainer (optional) + roadmap for the next department.

academy

Five tracks. Eight to twenty-five operators. One in-house agent factory.

The Academy runs in parallel with the deployment work. By the end of Phase 3, every track has a self-sufficient client lead — and Orchestrary moves on.

Track 01

Operator basics

Every team member · 8–25 people

Drive Claude Code or OpenClaw in their own terminal. Prompt patterns, file context, planning loops, MCP tools, the agent's failure modes. No previous coding required.

Track 02

Workflow design

Senior staff · 4–8 people

Decompose a department workflow into agent-suitable atomic steps. Design data interfaces. Write the SKILL.md / AGENTS.md files that make the agent reliable in production.

Track 03

Tool building

Engineers · 3–6 people

Write the small Python / TypeScript tools the agent calls. These tools are the difference between a chatbot and an agent that actually does work. The track covers MCP servers, integration patterns, error handling, and idempotency.
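Idempotency and error handling are the two tool-building habits that are easiest to show in code. The sketch below is a minimal, assumed pattern rather than a prescribed one: `IdempotentTool` is a hypothetical wrapper, and the in-memory cache stands in for whatever durable store a real deployment would use.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentTool:
    """Wraps a side-effecting function so a retry with the same arguments runs it once."""

    def __init__(self, func: Callable[..., Any]) -> None:
        self._func = func
        # In-memory only for illustration; a production tool would persist this.
        self._results: dict[str, dict[str, Any]] = {}

    def _key(self, kwargs: dict[str, Any]) -> str:
        # Stable key: identical arguments always hash to the same value.
        payload = json.dumps(kwargs, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, **kwargs: Any) -> dict[str, Any]:
        key = self._key(kwargs)
        if key in self._results:
            return self._results[key]
        try:
            result = {"ok": True, "value": self._func(**kwargs)}
        except Exception as exc:
            # Failures are returned as data and not cached, so the agent can retry.
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
        self._results[key] = result
        return result
```

If the agent retries `tool(title="Reset password")` after a timeout, the wrapped function runs once and the second call returns the stored result. Returning errors as structured data, instead of raising, gives the agent something it can reason about.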

Track 04

Evaluation

QA cohort · 2–4 people

Build and run the in-house quality function. Golden datasets, regression suites, drift detection, hallucination tests, model-upgrade gates. Learn to say no to a release.

Track 05

Governance

Leadership · 2–4 people

The policy frame: what agents can touch, who reviews, how to audit, how to roll back. EU AI Act risk-tier mapping. The CIO or CTO learns to answer the board's questions without us in the room.

measurement

Five numbers — for every agent, every department, every engagement.

If we can't put these on your BI dashboard within 30 days of cutover, the agent doesn't stay in production. Underestimated risk and overestimated ROI both destroy the engagement, so we measure both, conservatively, from day one.

⏱

Time saved

Hours per week reclaimed by the team, measured against a 4-week pre-deployment baseline.

€

Cost reduced

Direct cost displaced or avoided. Conservative — never includes "soft" productivity multipliers.

↑

Throughput

Volume of work completed per unit of human time. The most honest indicator of leverage.

★

Quality

Domain-specific quality metric — bid win-rate, ticket resolution, claim accuracy, etc. Per-agent.

⚠

Error rate

Hallucinations, escalations, rollbacks. Tracked tighter than positive metrics, by design.
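Three of the five numbers reduce to simple arithmetic. The sketch below assumes the baseline is the mean of four weekly hour totals and that the error rate is counted per agent action; both are assumptions made for the example, and the function names are illustrative.

```python
from statistics import mean


def hours_saved_per_week(baseline_weeks: list[float], current_week_hours: float) -> float:
    """Hours reclaimed versus the mean of a 4-week pre-deployment baseline."""
    if len(baseline_weeks) != 4:
        raise ValueError("baseline must contain exactly four weekly totals")
    return mean(baseline_weeks) - current_week_hours


def throughput(items_completed: int, human_hours: float) -> float:
    """Work items completed per hour of human time."""
    if human_hours <= 0:
        raise ValueError("human_hours must be positive")
    return items_completed / human_hours


def error_rate(hallucinations: int, escalations: int, rollbacks: int, total_actions: int) -> float:
    """Share of agent actions that ended in a hallucination, escalation, or rollback."""
    if total_actions <= 0:
        raise ValueError("total_actions must be positive")
    return (hallucinations + escalations + rollbacks) / total_actions
```

For example, a team that spent 40, 42, 38, and 40 hours a week on a workflow before deployment and 31 hours afterwards has reclaimed 9 hours a week against a baseline of 40.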

governance

Built for the EU AI Act — by default.

Every Orchestrary deployment ships with a governance frame mapped to EU AI Act risk tiers, GDPR, and the client's existing IT security posture. Compliance is not an afterthought — it shapes the architecture.

⛨

Risk-tier mapping

Every agent is mapped to its EU AI Act risk tier before the first prompt is written. The tier determines the architecture — what data the agent sees, what actions it takes, what review gates exist.

Article 6 / Annex III tier classification per use case
Human-in-the-loop gates for high-risk categories
Logging & audit trail to satisfy Articles 12–15
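The idea that the tier determines the review gates can be expressed as a lookup table the runtime consults before an agent acts. The sketch below is illustrative only: the tier labels follow the commonly used informal names, the control values are a hypothetical policy, and none of it is a legal classification under the Act.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    MINIMAL = "minimal"
    LIMITED = "limited"
    HIGH = "high"
    PROHIBITED = "prohibited"


@dataclass(frozen=True)
class Controls:
    deployable: bool             # may this use case be deployed at all?
    human_review_required: bool  # must a person approve each action?


# Hypothetical policy table; the real mapping is decided per use case.
CONTROLS = {
    RiskTier.MINIMAL: Controls(deployable=True, human_review_required=False),
    RiskTier.LIMITED: Controls(deployable=True, human_review_required=False),
    RiskTier.HIGH: Controls(deployable=True, human_review_required=True),
    RiskTier.PROHIBITED: Controls(deployable=False, human_review_required=True),
}


def may_execute(tier: RiskTier, human_approved: bool) -> bool:
    """Decide whether an agent action may run under the tier's controls."""
    controls = CONTROLS[tier]
    if not controls.deployable:
        return False
    return human_approved or not controls.human_review_required
```

Keeping the table in code means a change to the governance policy shows up as a reviewed diff in the client's repository.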

◈

Data sovereignty

Choose Claude Code (managed) or OpenClaw (on-prem, EU-resident). For sovereignty-sensitive clients we run the entire stack inside the client VNet — no model API calls leave the perimeter.

OpenClaw on private compute · zero outbound model calls
Private endpoints for every dependency (vector store, secrets, BI)
GDPR-compatible logging & pseudonymization built in

✓

Auditability

Every prompt, every tool call, every output is logged with replay. Auditors can step through any agent action with full input/output capture, model version, and prompt hash.

Per-action audit log with cryptographic chain of custody
Replayable from any point — for debugging or audit
Quarterly external audit pack generated automatically
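One common way to get a tamper-evident chain of custody is a hash chain, where each log entry commits to the hash of the entry before it. The sketch below shows that general technique using Python's standard `hashlib` and `json` modules. It is not a description of the logging implementation that ships with a deployment, and the entry fields are assumptions.

```python
import hashlib
import json
from typing import Any

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, action: dict[str, Any]) -> str:
    """Hash the previous entry's hash together with this action's content."""
    body = json.dumps({"prev": prev_hash, "action": action}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log: list[dict[str, Any]], action: dict[str, Any]) -> dict[str, Any]:
    """Append an action (prompt, tool call, or output) linked to the prior entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "action": action, "hash": _digest(prev_hash, action)}
    log.append(entry)
    return entry


def verify(log: list[dict[str, Any]]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if _digest(entry["prev"], entry["action"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

An auditor who holds the most recent hash can run `verify` over an exported log to confirm that no earlier entry was altered after the fact.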

↺

Rollback & kill switch

Every agent ships with a kill switch exposed to the client's ops team and a documented rollback path. We pre-rehearse rollback during Phase 2 — not for the first time during an incident.

One-command kill switch · no Orchestrary involvement required
Versioned skill packs · rollback to any prior release
Pre-mortem & rehearsed incident response

what you get

The complete deliverable set.

Everything ships as code in your repository. No proprietary surface area, no maintenance dependency, no exit fee.

▤

Diagnostic dossier

Reframed problem, ranked opportunities, EU AI Act risk-tier map, named constraint. The document the board reads.

End of Phase 1

⌬

Production agent fleet

3–5 agents running in your stack with private endpoints, audit log, secrets vault, and rollback plan.

End of Phase 2

⊞

Skill & tool library

Every SKILL.md, every Python/TS tool, every prompt — versioned in your repo. Your engineers extend it from here.

End of Phase 2

★

Evaluation suite

Golden datasets, regression tests, CI gates, drift alarms. Wired to your CI/CD before the first cutover.

Phase 2 → 3

▣

Trained team

8–25 operators · 3–6 engineers · 2–4 evaluators · 2–4 governance leads. Certified in-house. Independent.

End of Phase 3

◇

Governance & audit pack

Signed governance charter, audit log structure, EU AI Act conformity package, rollback runbook. Auditor-ready.

End of Phase 3

start the diagnostic

Day 7 with Orchestrary looks very different from day 7 with anyone else.

Book a free 60-minute discovery call. We'll diagnose where agents would actually move your numbers — and tell you honestly if they wouldn't. No pitch deck.


or email us at hello@orchestrary.com
