Marketing wants to chat with contracts. HR wants to chat with policies. But if you just "turn on" a model, you leak data. This is your production blueprint for a regulated Document Intelligence Copilot: zero leakage, full audit trails, and strict cost controls using Azure API Management and Application Gateway for Containers.
Get the Production Starter Kit
Don't start from a blank slate. Download the full UKLifeLabs Wave 1 Pack, including:
- Architecture Diagrams (Visio/Mermaid)
- APIM Policy Pack (12+ policies)
- Bicep/Terraform Modules (AKS, Search, AI)
- Security Audit Checklist (Excel)
The Cast
The Story
Scene 1: The "Chat with PDF" Demand
Project Manager: "Marketing wants a tool to 'Chat with Contracts'. HR wants 'Chat with Policies'. Can we just turn on a Copilot for everyone?"
Trinity: "If we just 'turn it on', Marketing might see HR's salary data. A generic chatbot has no idea who is allowed to see what."
Upendra: "Exactly. We need to build an airport, not just a library. APIM is the customs and security checkpoint. Nothing reaches the model without a boarding pass."
Morpheus: "And citations are non-negotiable. If the model can't prove where it found the answer, it refuses to speak. Zero hallucinations allowed."
Upendra: "Here is the blueprint for a regulated, audit-ready Document Intelligence Copilot."
1) TL;DR (30 seconds)
UKLifeLabs needs employees to upload large document packs and ask questions.
But answers must be:
- provable (citations)
- traceable (audit trail)
- access-controlled (no leakage)
- cost-controlled (no runaway spend)
This blog gives you a buildable reference architecture:
UI → APIM (AI Gateway) → RAG Orchestrator → AI Search → Azure OpenAI + Document Intelligence → Citations + Audit
🏁 Wave 1 Scope: What is IN and OUT?
- IN: PDF Ingestion, Layout Extraction, Keyword+Vector Search, Citation-based Chat, Entra ID Auth.
- OUT: OCR for handwriting, Multi-modal (Images), Complex Reasoning (Agents), Writes/Updates to docs.
2) Design principles (the Microsoft review-board version)
- APIM-first boundary: no direct calls from UI to Search, models, or ingestion.
- Two-lane architecture: chat stays fast, ingestion stays heavy and async.
- Evidence-or-refusal: answers must include citations (or the system refuses).
- Least privilege by default: retrieval filters enforce the user's access.
- Cost is a control plane concern: token budgets and rate limits are enforced at APIM.
- Operationally boring: everything is observable, alertable, and repeatable as code.
3) The Airport + Research Library analogy
We use the "Airport" analogy to explain the security model to stakeholders, but here is the strict mapping to technical components.
Rule: Nobody reaches the experts without passing security.
3.5) Architecture Evolution: The Path to 2026+
⚠️ Critical Update: Kubernetes Ingress Landscape is Changing
In November 2025, the Kubernetes project announced that the community ingress-nginx controller will be retired in March 2026. This affects millions of production deployments worldwide.
What This Means for You:
- Community NGINX Ingress: ❌ No updates after March 2026
- AKS Application Routing Add-on: ⚠️ Supported until November 2026
- Application Gateway for Containers: ✅ Long-term Microsoft support
Microsoft's Strategic Direction:
- Application Gateway for Containers (AGC) - Available now
  - Azure-native Layer 7 load balancer
  - Kubernetes Gateway API support
  - Built-in WAF and security features
  - Direct pod communication (better performance)
- Gateway API with Istio - Coming H1 2026
  - Advanced traffic management
  - Service mesh capabilities
This Architecture Uses: Application Gateway for Containers (Future-Proof)
Want the complete implementation guide with setup steps, cost analysis, and migration strategies? Read the Complete Implementation Guide →
4) The architecture (practical reference stack)
UI layer (pick ONE)
UI is replaceable. The boundary is not.
- Option A: Open WebUI – Fastest route. Treat it like a product.
- Option B: Microsoft-native UI – Copilot Studio or Teams (best for adoption).
- Option C: Custom portal – Best control for upload flows and audit viewing.
APIM is the AI Gateway (non-negotiable)
APIM is the platform control plane. It enforces:
- JWT validation (identity gate)
- Rate limits (traffic control)
- Token budgets (cost control)
- Audit logs (traceability)
🛡️ The Audit Log Schema (Compliance Gold)
Don't just log "200 OK". Regulators need to know what was sent. Your APIM `log-to-eventhub` policy must capture this structure:
{
  "timestamp": "2026-01-18T10:00:00Z",
  "user_id": "upendra@company.com",
  "department": "IT_Architecture",
  "request_id": "req-guid-123",
  "action": "rag_query",
  "input_tokens": 150,
  "output_tokens": 300,
  "retrieved_documents": [
    {
      "doc_id": "HR_Policy.pdf",
      "index_version": "v1.2",
      "classification": "Internal"
    }
  ],
  "model_deployment": "gpt-4-turbo-0125"
}
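A small check like the following can run in a pipeline test or an Event Hub consumer to confirm that every audit event carries the fields listed above. It is a minimal sketch: the function name and the idea of validating downstream of APIM are assumptions, and only the field names come from the schema.

```python
# Field names are taken from the audit schema; everything else is illustrative.
REQUIRED_AUDIT_FIELDS = (
    "timestamp", "user_id", "department", "request_id", "action",
    "input_tokens", "output_tokens", "retrieved_documents", "model_deployment",
)
REQUIRED_DOCUMENT_FIELDS = ("doc_id", "index_version", "classification")


def missing_audit_fields(event: dict) -> list[str]:
    """Return the names of required fields that are absent from an audit event."""
    missing = [name for name in REQUIRED_AUDIT_FIELDS if name not in event]
    # Each retrieved document must also identify itself and its classification.
    for i, doc in enumerate(event.get("retrieved_documents", [])):
        missing += [
            f"retrieved_documents[{i}].{name}"
            for name in REQUIRED_DOCUMENT_FIELDS
            if name not in doc
        ]
    return missing
```

An empty list means the event is complete; anything else is worth alerting on, because a gap in the audit trail is hard to repair after the fact.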
Runtime layer (AKS vs ACA)
Recommendation: use AKS for production. It fits regulated enterprise patterns (network segmentation, control). Use ACA only for pilots.
Deep Dive: Unsure which to pick? Read AI Hosting Decision Tree: AKS vs. ACA vs. Web Apps.
Data + RAG layer
- ADLS Gen2: Raw docs, extracted JSON, audit artifacts.
- Document Intelligence: Extracts text/fields from PDFs.
- Azure AI Search: Chunk index + vector search (Standard S1+).
5) The two flows you must separate
This prevents production pain.
Flow A: Chat (user-facing, low latency)
- UI → APIM `/v1/private/chat`
- APIM validates JWT + limits
- Orchestrator queries AI Search with access filters
- Orchestrator calls Azure OpenAI with retrieved chunks
- Response returns with citations
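The orchestrator's part of this flow can be sketched as a single function with its dependencies injected. This is an illustration under assumptions, not the actual orchestrator: `search_chunks` and `generate_answer` are hypothetical callables standing in for the AI Search and Azure OpenAI calls, and the refusal string is made up.

```python
from typing import Callable


def handle_chat(
    question: str,
    user_groups: list[str],
    search_chunks: Callable[[str, list[str]], list[dict]],
    generate_answer: Callable[[str, list[dict]], str],
) -> dict:
    """Retrieve access-filtered chunks, then answer with citations or refuse."""
    # Retrieval is filtered by the caller's groups before the model sees anything.
    chunks = search_chunks(question, user_groups)
    if not chunks:
        # Evidence-or-refusal: no accessible evidence means no answer.
        return {"answer": None, "citations": [], "refusal_reason": "no_accessible_evidence"}
    answer = generate_answer(question, chunks)
    citations = [
        {"id": c["id"], "filename": c["filename"], "page_number": c["page_number"]}
        for c in chunks
    ]
    return {"answer": answer, "citations": citations, "refusal_reason": None}
```

Injecting the two callables keeps the function testable without network access, which also makes it easy to prove in a unit test that an empty retrieval never reaches the model.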
📜 The Data Contract: How we prove it
Your generic chatbot fails because it returns text. A proper RAG system returns Evidence. The frontend must enforce this schema:
{
  "answer": "The standard policy allows 15 days of PTO...",
  "citations": [
    {
      "id": "doc-123",
      "filename": "HR_Policies_2025.pdf",
      "page_number": 5,
      "text_snippet": "...employees accrue 1.25 days per month...",
      "relevance_score": 0.89
    }
  ],
  "refusal_reason": null
}
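One way to enforce this contract is a validator that rejects any response that has neither citations nor a refusal. The sketch below is written in Python for consistency with the orchestrator; a frontend would port the same rules to its own language. The function name and error strings are assumptions.

```python
# Citation fields every entry must carry; relevance_score is treated as optional here.
REQUIRED_CITATION_FIELDS = ("id", "filename", "page_number", "text_snippet")


def validate_chat_response(payload: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the payload is valid."""
    errors = []
    citations = payload.get("citations") or []
    if payload.get("refusal_reason") is None:
        # Not a refusal, so the answer must exist and be backed by evidence.
        if not payload.get("answer"):
            errors.append("answer is empty but no refusal_reason was given")
        if not citations:
            errors.append("answer has no citations")
    for i, citation in enumerate(citations):
        for name in REQUIRED_CITATION_FIELDS:
            if name not in citation:
                errors.append(f"citations[{i}] is missing {name}")
    return errors
```

A response that fails validation should be shown to the user as a refusal, not as an uncited answer.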
Flow B: Ingestion (async, heavy workload)
- Document lands in storage
- Event triggers ingestion worker
- Document Intelligence extracts text
- Chunking + embeddings → Index into AI Search
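The chunking step can be as simple as a sliding window over the extracted text. The blueprint does not prescribe a strategy, so the character-based window, the default sizes, and the function name below are all assumptions chosen for illustration.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of max_chars that overlap by `overlap` characters."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Tiny example with a 4-character window and 2 characters of overlap.
print(chunk_text("abcdefghij", max_chars=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk. Production pipelines often chunk on layout boundaries (headings, paragraphs) instead, which Document Intelligence output makes possible.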
6) Security trimming (do not skip this)
RAG must enforce permissions at retrieval time. Every chunk should carry `department`, `classification`, and `allowedGroups`.
Every query should filter by the user's Entra groups. If you skip this, you will leak data across teams.
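For Azure AI Search, the group check is typically expressed as an OData filter over the `allowedGroups` collection using `search.in`. The helper below is a sketch: the function name is hypothetical, and it assumes group identifiers are Entra object IDs (GUIDs), which contain no commas.

```python
def build_group_filter(group_ids: list[str], field: str = "allowedGroups") -> str:
    """Build an OData filter that matches chunks shared with any of the user's groups."""
    if not group_ids:
        # Never fall back to an unfiltered search; the caller should refuse instead.
        raise PermissionError("user has no groups; refusing to search without a filter")
    for group_id in group_ids:
        if "," in group_id:
            raise ValueError(f"group id must not contain a comma: {group_id!r}")
    # OData string literals escape a single quote by doubling it.
    joined = ",".join(group_id.replace("'", "''") for group_id in group_ids)
    return f"{field}/any(g: search.in(g, '{joined}', ','))"


print(build_group_filter(["g1", "g2"]))
# → allowedGroups/any(g: search.in(g, 'g1,g2', ','))
```

The resulting string goes into the `filter` parameter of the search request. Build it server-side from the validated token's group claims, never from anything the client sends in the request body.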
7) Quota planning (TPM/RPM)
Quota is not a mystery. It’s a budget. Think of Azure OpenAI quota like a Family Data Plan:
- Subscription Quota: The total allowance for the whole family (e.g., 2M TPM per region).
- Deployments: The individual devices. Each needs its own cap so the teenager (Batch Job) doesn't starve the parents (Chat App).
- TPM (Tokens Per Minute): The speed limit. This is your primary throttle.
- 429 Errors: This isn't a bug; it's a guardrail. It means the plan works.
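Treating 429 as a guardrail means the caller has to back off instead of hammering the endpoint. A minimal sketch of the delay calculation, assuming the 429 response may carry a `Retry-After` header expressed in seconds (the function name and default values are illustrative):

```python
from typing import Optional


def retry_delay_seconds(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 30.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # Prefer the server's own hint when it parses as a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # unparseable header: fall through to exponential backoff
    # Exponential backoff: base, 2*base, 4*base, ... limited to `cap`.
    return min(cap, base * (2 ** attempt))
```

In a real client you would usually add random jitter to the computed delay so that many callers throttled at the same moment do not all retry together.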
🏗️ Design Decision: Why 550k TPM?
We don't plan by gut feel. We calculate target throughput based on specific load testing:
(140 req/min peak) × (3,000 tokens/req) = 420k TPM
We added 30% headroom for spikes (420k × 1.3 ≈ 546k), landing at ~550k TPM. We purposely chose Standard (Pay-as-you-go) over PTU because our traffic is "bursty", not constant. PTU is for stable, 24/7 base loads.
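The sizing arithmetic is easy to keep in a script so it can be rerun whenever load-test numbers change. The function name is an assumption; the inputs are the figures quoted above.

```python
def required_tpm(peak_requests_per_min: int, tokens_per_request: int,
                 headroom_percent: int = 30) -> tuple[int, int]:
    """Return (base TPM, TPM including headroom) using integer arithmetic."""
    base = peak_requests_per_min * tokens_per_request
    with_headroom = base * (100 + headroom_percent) // 100
    return base, with_headroom


base, with_headroom = required_tpm(140, 3000)
print(base, with_headroom)
# → 420000 546000
```

546,000 is then rounded up to a ~550k TPM target.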
8) The Architecture Board
Use these prompts to generate your official documentation diagrams. Standard notation matters.
⚠️ The "Silent Killer": Private DNS
If your code works locally but fails inside AKS with 403 Forbidden or time-outs, it is almost always DNS. Ensure your Private Endpoints are registered in the `privatelink.openai.azure.com` and `privatelink.search.windows.net` zones and linked to the AKS VNet.
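A quick way to confirm the diagnosis is to resolve the service hostname from inside a pod and check whether the answer is a private address. The helper names below are hypothetical, and the hostname in the usage note is a placeholder, not a real endpoint.

```python
import ipaddress
import socket


def resolve_ips(hostname: str) -> list[str]:
    """Resolve a hostname to its unique IP addresses using the pod's DNS settings."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def check_private_resolution(hostname: str, resolver=resolve_ips) -> dict[str, bool]:
    """Map each resolved IP to True if it falls in a private address range."""
    return {ip: ipaddress.ip_address(ip).is_private for ip in resolver(hostname)}
```

Run `check_private_resolution("<your-resource>.openai.azure.com")` from a pod in the cluster. If any address maps to `False`, the pod is resolving the public endpoint, which usually means the private DNS zone is missing or not linked to the VNet.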
9) The Build Room (Copy-Paste)
Real production means infrastructure as code. Here are the core modules you need.
A) APIM Policy: The "Airport Security" Check
Don't rely on code for security. Enforce it at the gateway.
<!-- Validate the caller's JWT before anything else -->
<validate-jwt header-name="Authorization" failed-validation-httpcode="401">
    <openid-config url="https://login.microsoftonline.com/{tenantId}/v2.0/.well-known/openid-configuration" />
    <required-claims>
        <claim name="aud">
            <value>{audience}</value>
        </claim>
    </required-claims>
</validate-jwt>
<!-- Traffic control: cap the request rate per caller IP -->
<rate-limit-by-key calls="500" renewal-period="60" counter-key="@(context.Request.IpAddress)" />
<!-- Cost control: enforce a token budget per APIM subscription -->
<azure-openai-token-limit tokens-per-minute="10000" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="false" />
B) Repo Structure (The "Standard")
/src
  /orchestrator (Python/FastAPI)
  /ingestion (Dotnet Worker)
/infra (Bicep)
  /modules
    /apim
    /ai-search
    /openai
  main.bicep
/policies (APIM XML)
  /fragments
    /security.xml
    /audit.xml
/tests
  /load-testing (Locust)
C) The Pipeline (GitHub Actions Snippet)
Friends don't let friends deploy broken XML. Validate it before the merge.
name: Validate APIM Policies
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # xmllint ships in libxml2-utils; install it in case the runner image lacks it.
      - name: Install xmllint
        run: sudo apt-get update && sudo apt-get install -y libxml2-utils
      # Don't deploy. Just check if the XML is valid.
      - name: Check XML Syntax
        run: |
          find ./policies -name "*.xml" -print0 | xargs -0 -I {} xmllint --noout {}
      - name: Run OPA Policy Check
        run: ./scripts/check-compliance.sh # e.g., ensure no "star" allows
10) Go-live checklist
- UI calls APIM only
- JWT validation enabled
- APIM Products split (internal vs external)
- Token budgets enforced
- Citations required rule enabled
- Audit logs include chunk IDs + model deployment ID