Document Intelligence Copilot on Azure: The Production-Ready Blueprint

Master Class Series

Part 1: Architecture RAG Pattern & Azure Doc Intelligence

Part 2: Implementation Step-by-Step Production Guide Part 3: Go-Live Well-Architected Playbook

Stop building toy copilots.

Marketing wants to chat with contracts. HR wants to chat with policies. But if you just "turn on" a model, you leak data. This is your production blueprint for a regulated Document Intelligence Copilot: zero leakage, full audit trails, and strict cost controls using Azure API Management and Application Gateway for Containers.

Get the Production Starter Kit

Don't start from any blank slate. Download the full UKLifeLabs Wave 1 Pack including:

Architecture Diagrams (Visio/Mermaid)
APIM Policy Pack (12+ policies)
Bicep/Terraform Modules (AKS, Search, AI)
Security Audit Checklist (Excel)

Download Kit (ZIP)

The Cast

Upendra

Lead Architect

Trinity

Cloud Engineer

Morpheus

Sec Architect

Project Manager

Plans, Progress, Risks & Issues

The Story

Scene 1: The "Chat with PDF" Demand

Project Manager: "Marketing wants a tool to 'Chat with Contracts'. HR wants 'Chat with Policies'. Can we just turn on a Copilot for everyone?"

Trinity: "If we just 'turn it on', Marketing might see HR's salary data. A generic chatbot has no idea who is allowed to see what."

Upendra: "Exactly. We need to build an airport, not just a library. APIM is the customs and security checkpoint. Nothing reaches the model without a boarding pass."

Morpheus: "And citations are non-negotiable. If the model can't prove where it found the answer, it refuses to speak. Zero hallucinations allowed."

Upendra: "Here is the blueprint for a regulated, audit-ready Document Intelligence Copilot."

1) TL;DR (30 seconds)

UKLifeLabs needs employees to upload large document packs and ask questions.

But answers must be:

provable (citations)
traceable (audit trail)
access-controlled (no leakage)
cost-controlled (no runaway spend)

This blog gives you a buildable reference architecture:

UI → APIM (AI Gateway) → RAG Orchestrator → AI Search → Azure OpenAI + Document Intelligence → Citations + Audit

🏁 Wave 1 Scope: What is IN and OUT?

IN: PDF Ingestion, Layout Extraction, Keyword+Vector Search, Citation-based Chat, Entra ID Auth.
OUT: OCR for handwriting, Multi-modal (Images), Complex Reasoning (Agents), Writes/Updates to docs.

2) Design principles (the Microsoft review-board version)

APIM-first boundary: no direct calls from UI to Search, models, or ingestion.
Two-lane architecture: chat stays fast, ingestion stays heavy and async.
Evidence-or-refusal: answers must include citations (or the system refuses).
Least privilege by default: retrieval filters enforce the user's access.
Cost is a control plane concern: token budgets and rate limits are enforced at APIM.
Operationally boring: everything is observable, alertable, and repeatable as code.

3) The Airport + Research Library analogy

We use the "Airport" analogy to explain the security model to stakeholders, but here is the strict mapping to technical components.

✈️ The Analogy	⚙️ The Technical Solution	🛑 The Production Reality
The Boarding Pass	Entra ID + APIM <validate-jwt>	Models have no auth. If you hit the endpoint directly, you bypass security. APIM forces the gate.
Airport Security	APIM Token Limits & Throttling	One department running a bulk job can starve the entire company. Quotas prevent "noisy neighbors".
Baggage Scanner	Document Intelligence (Layout Model)	PDFs are binary blobs. You cannot "search" pixels. You must extract structure first.
Restricted Lounge	AI Search Security Trimming	If you search for "Salary", the index must only return hits that match your `group_id`. Only APIM passes this context.
The Expert Panel	Azure OpenAI (GPT-4o)	Models are reasoning engines, not databases. They should only answer based on what the "Lounge" (Search) provided.

Rule: Nobody reaches the experts without passing security.

Target Throughput

1.1M TPM

Latency Target (P95)

< 4.0s

Rec. SKU (Search)

Standard S1

Rec. SKU (APIM)

Standard v2

3.5) Architecture Evolution: The Path to 2026+

⚠️ Critical Update: Kubernetes Ingress Landscape is Changing

In late 2024, the Kubernetes community announced that the ingress-nginx controller will be retired in March 2026. This affects millions of production deployments worldwide.

What This Means for You:

Community NGINX Ingress: ❌ No updates after March 2026
AKS Application Routing Add-on: ⚠️ Supported until November 2026
Application Gateway for Containers: ✅ Long-term Microsoft support

Microsoft's Strategic Direction:

Application Gateway for Containers (AGC) - Available now
- Azure-native Layer 7 load balancer
- Kubernetes Gateway API support
- Built-in WAF and security features
- Direct pod communication (better performance)
Gateway API with Istio - Coming H1 2026
- Advanced traffic management
- Service mesh capabilities

This Architecture Uses: Application Gateway for Containers (Future-Proof)

Want the complete implementation guide with setup steps, cost analysis, and migration strategies? Read the Complete Implementation Guide →

4) The architecture (practical reference stack)

UI layer (pick ONE)

UI is replaceable. The boundary is not.

Option A: Open WebUI – Fastest route. Treat it like a product.
Option B: Microsoft-native UI – Copilot Studio or Teams (best for adoption).
Option C: Custom portal – Best control for upload flows and audit viewing.

APIM is the AI Gateway (non-negotiable)

APIM is the platform control plane. It enforces:

JWT validation (identity gate)
Rate limits (traffic control)
Token budgets (cost control)
Audit logs (traceability)

🛡️ The Audit Log Schema (Compliance Gold)

Don't just log "200 OK". Regulators need to know what was sent. Your APIM log-to-eventhub policy must capture this structure:

{
  "timestamp": "2026-01-18T10:00:00Z",
  "user_id": "upendra@company.com",
  "department": "IT_Architecture",
  "request_id": "req-guid-123",
  "action": "rag_query",
  "input_tokens": 150,
  "output_tokens": 300,
  "retrieved_documents": [
    { 
      "doc_id": "HR_Policy.pdf", 
      "index_version": "v1.2", 
      "classification": "Internal" 
    }
  ],
  "model_deployment": "gpt-4-turbo-0125"
}

Runtime layer (AKS vs ACA)

Recommendation: use AKS for production. It fits regulated enterprise patterns (network segmentation, control). Use ACA only for pilots.

Deep Dive: Unsure which to pick? Read AI Hosting Decision Tree: AKS vs. ACA vs. Web Apps.

Data + RAG layer

ADLS Gen2: Raw docs, extracted JSON, audit artifacts.
Document Intelligence: Extracts text/fields from PDFs.
Azure AI Search: Chunk index + vector search (Standard S1+).

5) The two flows you must separate

This prevents production pain.

Flow A: Chat (user-facing, low latency)

UI → APIM `/v1/private/chat`
APIM validates JWT + limits
Orchestrator queries AI Search with access filters
Orchestrator calls Azure OpenAI with retrieved chunks
Response returns with citations

📜 The Data Contract: How we prove it

Your generic chatbot fails because it returns text. A proper RAG system returns Evidence. The frontend must enforce this schema:

{
  "answer": "The standard policy allows 15 days of PTO...",
  "citations": [
    {
      "id": "doc-123",
      "filename": "HR_Policies_2025.pdf",
      "page_number": 5,
      "text_snippet": "...employees accrue 1.25 days per month...",
      "relevance_score": 0.89
    }
  ],
  "refusal_reason": null
}

Flow B: Ingestion (async, heavy workload)

Document lands in storage
Event triggers ingestion worker
Document Intelligence extracts text
Chunking + embeddings → Index into AI Search

6) Security trimming (do not skip this)

RAG must enforce permissions at retrieval time. Every chunk should carry `department`, `classification`, and `allowedGroups`.

Every query should filter by the user's Entra groups. If you skip this, you will leak data across teams.

7) Quota planning (TPM/RPM)

Quota is not a mystery. It’s a budget. Think of Azure OpenAI quota like a Family Data Plan:

Subscription Quota: The total allowance for the whole family (e.g., 2M TPM per region).
Deployments: The individual devices. Each needs its own cap so the teenager (Batch Job) doesn't starve the parents (Chat App).
TPM (Tokens Per Minute): The speed limit. This is your primary throttle.
429 Errors: This isn't a bug; it's a guardrail. It means the plan works.

🏗️ Design Decision: Why 550k TPM?

We don't plan by gut feel. We calculate target throughput based on specific load testing:

(140 req/min peak) × (3,000 tokens/req) = 420k TPM

We added 30% Headroom for spikes, landing at ~550k TPM. We purposely chose Standard (Pay-as-you-go) over PTU because our traffic is "bursty", not constant. PTU is for stable, 24/7 base loads.

8) The Architecture Board

Use these prompts to generate your official documentation diagrams. Standard notation matters.

Shared Hub

Prod AI Sub

User

Cloudflare

Tunnel

App Gateway

WAF v2

Azure Firewall

IDPS

APIM

Gateway

AKS Cluster

Orchestrator

OpenAI

HTTPS

TLS

INSP

mTLS/JWT

⚠️ The "Silent Killer": Private DNS

If your code works locally but fails inside AKS with 403 Forbidden or time-outs, it is almost always DNS. Ensure your Private Endpoints are registered in the privatelink.openai.azure.com and privatelink.search.windows.net zones and linked to the AKS VNET.

9) The Build Room (Copy-Paste)

Real production means infrastructure as code. Here are the core modules you need.

A) APIM Policy: The "Airport Security" Check

Don't rely on code for security. Enforce it at the gateway.


<!-- Validate JWT before anything else -->
<validate-jwt header-name="Authorization" failed-validation-httpcode="401">
    <openid-config url="https://login.microsoftonline.com/{tenantId}/v2.0/.well-known/openid-configuration" />
    <required-claims>
        <claim name="aud">
            <value>{audience}</value>
        </claim>
    </required-claims>
</validate-jwt>

<!-- Enforce Token Budget (Cost Control) -->
<rate-limit-by-key calls="500" renewal-period="60" counter-key="@(context.Request.IpAddress)" />
<azure-openai-token-limit tokens-per-minute="10000" counter-key="@(context.Subscription.Id)" />

B) Repo Structure (The "Standard")


/src
  /orchestrator (Python/FastAPI)
  /ingestion (Dotnet Worker)
/infra (Bicep)
  /modules
    /apim
    /ai-search
    /openai
  main.bicep
/policies (APIM XML)
  /fragments
    /security.xml
    /audit.xml
/tests
  /load-testing (Locust)

C) The Pipeline (GitHub Actions Snippet)

Friends don't let friends deploy broken XML. Validate it before the merge.


name: Validate APIM Policies
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      # Don't deploy. Just check if the XML is valid.
      - name: Check XML Syntax
        run: |
          find ./policies -name "*.xml" -print0 | xargs -0 -I {} xmllint --noout {}
          
      - name: Run OPA Policy Check
        run: ./scripts/check-compliance.sh # e.g., ensure no "star" allows

10) Go-live checklist

UI calls APIM only
JWT validation enabled
APIM Products split (internal vs external)
Token budgets enforced
Citations required rule enabled
Audit logs include chunk IDs + model deployment ID

Essential Resources

Watch these sessions to understand the "Security Sandwich" architecture in depth.

Retrieval Augmented Generation with Azure AI Search

RAG in Azure AI Search

Azure Friday - Intro to RAG with Azure OpenAI

Build your own Copilot

Microsoft Mechanics - Copilot extensibility & Plugins

Vector Search Deep Dive

Microsoft Mechanics - RAG at Scale & Vector Search

Ready to operationalize your Azure journey?

I help organizations turn stalled cloud initiatives into execution engines.

Contact Me View the Toolkit

Back to Insights