Well-Architected AI: The "Go-Live" Playbook for Production Copilots

Master Class Series

Part 1: Architecture RAG Pattern & Azure Doc Intelligence Part 2: Implementation Step-by-Step Production Guide

Part 3: Go-Live Well-Architected Playbook

Production survives an adversary.

A demo just survives a presentation. This is the playbook for crossing the gap. We stop guessing and start engineering: dual-lane isolation, evidence-based refusal contracts, and Day-2 operations that keep you out of the headlines.

Get the Production Starter Kit

Don't start from a blank slate. Download the full UKLifeLabs Wave 1 Pack including:

Architecture Diagrams (Visio/Mermaid)
APIM Policy Pack (12+ policies)
Bicep/Terraform Modules (AKS, Search, AI)
Security Audit Checklist (Excel)

Download Kit (ZIP) View Source Repo

*Includes the official APIM Bicep Module (PremiumV2).

Architecting Production Ready AI Systems

The Cast

Upendra

Lead Architect

Trinity

Cloud Engineer

Morpheus

Security Architect

Project Manager

Decisions, RACI, milestones.

Scene 1: The Question that Matters

Project Manager: “We go live next week. Are we production-ready?”

Trinity: “The chat works. The ingestion pipeline is moving documents. It looks great in the playground.”

Morpheus: “That’s not production, Trinity. That’s just a working endpoint. A production system survives an adversary; a demo just survives a presentation.”

Upendra: “Morpheus is right. In this new era, 'production' is a high bar. It means being safe, reliable, cost-controlled, testable, and operable. Anything else is just compute.”

1) Why AI Architecture is Different

Traditional software fails in predictable ways. AI applications fail in surprising ways:

Non-determinism: the same question might get different answers
Hallucinations: the model states facts that aren’t in the data
Token runaway: costs spike quietly as prompts bloat
Prompt injection: users trick the model into bypassing safety rules

To counter this, we don’t just code. We architect guardrails.

2) WAF Pillars → Production Guardrails

Upendra turns “principles” into controls. This is what makes the design review easy.

WAF Pillar	What it means for AI workloads	Wave 1 guardrails to implement
Reliability	Survive throttling + timeouts + backlogs	Bulkheads (2 lanes), queue-based ingestion, retries/backoff, circuit breaker
Security	Prevent data leakage + bypass routes	APIM boundary, private endpoints, security trimming, content safety
Cost Optimization	Stop silent spend creep	Token budgets at gateway, per-product quotas, caching, PTU strategy
Operational Excellence	Run it daily without heroics	Trace IDs, dashboards, runbooks, incident playbooks, versioning
Performance Efficiency	Keep p95 stable under load	Retrieval time budget, topK caps, hybrid retrieval, graceful refusal paths

3) Reference Architecture: Two Lanes, One Gateway

To prevent downstream chaos, we separate the workload into two isolated lanes with one boundary.

Ingestion Lane (The Truth Factory): Documents move from extraction to indexing. We insert a Queue for backpressure, so upload spikes do not crash Search.
Chat Lane (The Intelligence Path): This is the user path. The UI never calls Search or OpenAI directly.

Why this works: bulkhead isolation. Ingestion can be on fire and chat still holds p95.

4) The Five-Layer Model

Upendra stops the “patchwork architecture” by layering the workload.

Layer	Responsibility	Azure component
1. Client	UI only, keep it thin	Teams / Web App / Open WebUI
2. Intelligence	Orchestrator where rules live	Orchestrator API
3. Inferencing	Model calls + deployment versioning	Azure OpenAI / Foundry
4. Knowledge	Retrieval + grounding	Azure AI Search + ADLS
5. Tools	Actions + business APIs	Internal APIs via APIM

Rule: the client is a view. The orchestrator is the brain.

5) The AI Workload Design Loop

Trinity: “Once deployed, we are done, right?”
Upendra: “In AI, you are never done. You ship, then you correct drift.”

Wave 1 needs this loop:

Build: prompt, retrieval, policies, limits
Measure: citation coverage, refusal rate, p95 latency, token/request
Adapt: tuning chunking/retrieval, policy updates, index rebuilds
Control change: version prompts, indexes, policies, model deployments

6) Platform Decisions (Wave 1 Tips)

AKS vs ACA

AKS: best for regulated production patterns (network control, segmentation, multiple internal services)
ACA: good for rapid pilots, fewer ops, but less control for complex enterprise segmentation

Training vs Inference Compute

Inference should be stable, monitored, and protected behind APIM
Training/fine-tuning is usually batch + transient. Shut it down when idle
Prefer managed options (Azure ML) over "always on" VM clusters for sporadic training jobs.

7) The Three Golden Production Contracts

Upendra insists most copilots fail because they lack clear contracts.

Contract 1: Evidence-or-Refusal

If grounding exists → answer + citations.
If grounding is missing → refuse.

Enforce it server-side using a validator. Not just prompt wording.

Contract 2: The Audit Trail

A “200 OK” is not an audit. Log what influenced the answer associated with the `requestId`: chunk IDs, filter logic, model deployment, and index version.

Contract 3: Permission-Aware Retrieval

The model is not your security boundary. Retrieval is. Security trimming must be enforced using validated identity claims so users only see authorised content.

8) Grounding Data Design

Morpheus: “Show me the mechanism that prevents cross-team leaks.”
Upendra: “It’s not the model. It’s the metadata and filters.”

Minimum Chunk Metadata (Wave 1)

`docId`, `chunkId`, `content`, `vector`
`pageNumber`, `sourcePath`
`allowedGroups[]` (or ACL tags)
`classification`, `ingestedAt`, `indexVersion`

9) Data Platform Strategy

Don’t build an extra data platform “because AI”. Start simple with ADLS for raw/extracted content and AI Search for vectors. Add complexity (Fabric/OneLake) only when specific aggregation or governance needs demand it.

10) Operations: Monitor Signals, Not Hopes

Model: Refusal rate, p95 latency, 429 throttle events.
Cost: Tokens per request, cost per route (chat vs ingestion).
Retrieval: Logic drift, index freshness, no-results rate.
Ingestion: Queue depth, processing failures, document throughput.

11) Day-2 Operating Model

This avoids “who owns it?” fights during incidents.

Area	Primary owner	What they own
APIM + identity boundary	Platform team	Auth, rate limits, token budgets, routing
Orchestrator API	App team	Prompting, validation, contracts, fallback logic
Search + Retrieval	Data team	Index schema, chunking, filters, freshness
Security + Compliance	Security team	Content safety, audit reviews, policy sign-off

12) Testing and Gates

Credibility comes from a Golden Test Set of 30–50 questions including "must refuse" scenarios and prompt injection attempts. Block release if citation coverage drops or cross-group retrieval is detected.

13) GenAIOps Change Control

Version everything that changes behavior: `promptVersion`, `indexVersion`, `policyVersion`. Your release gates must verify security trimming and injection resistance before every deployment.

14) Responsible AI (Minimum Controls)

Responsible AI is not a slide. It’s enforcement. Ensure citations are transparent, content safety filters are active (input/output), and PII usage is explicitly governed.

Scene 2: The Definition of Done

Client Sponsor: “So, what did we really ship?”

Morpheus: “A Copilot that doesn’t leak data, doesn’t collapse under load, and leaves an evidence trail for every answer.”

Upendra: “That is Well-Architected AI. The rest is just compute.”

Ready to operationalize your AI?

I help organizations turn stalled pilots into governed production engines.

Contact Me Download Playbook

Back to Insights