🚀 Build This Architecture Faster

Don't start from scratch. Get the high-resolution "Two Doors" diagrams, the exact APIM Policy XML code, and implementation guide.

Download Architecture Kit (ZIP)

*Includes Diagram, Policy XML & Readme.


Two Doors, One Rulebook: How TrustBank Scaled AI Agents Without Losing Control

TrustBank wanted AI agents for customers (internet-scale) and AI agents for employees (private-only). The CIO wanted speed. The CISO wanted an audit-proof control story. The solution was a hybrid pattern using dual Azure API Management (APIM) gateways with consistent "AI gateway" policies across both.

TL;DR - Executive Summary

Reading time: 15 minutes | Interactive demos included

What is an AI agent?

An AI agent is an application that uses an AI model to understand a goal, decide steps, and call tools (APIs, search, ticketing, databases) to complete the task.

A chatbot mostly talks. An agent can take actions. That is why it needs stronger controls than a normal API.


The Cast

  • Upendra Kumar, Lead Architect: architecture, standards, trade-offs.
  • Trinity, Cloud Engineer: build, automation, operations.
  • Morpheus, Security Architect: identity, data controls, audit.
  • Upendra Kumar, Technical Consultant: delivery strategy, operating models, scale-ready roadmaps.
  • Mr. Customer, Customer Leadership: risk acceptance and operating model.
  • Mr. Project Manager: decisions, RACI, milestones.

The Strategic Gap

Before governance, teams tend to connect their apps directly to the model endpoints.

The Problem: Three Predictable Failures

Without centralized governance, direct model access creates:

💰 Uncontrolled Cost (Token spikes drain budget)
🔓 Shadow Integrations (Inconsistent access control)
🚫 Weak Governance (No single audit point)

The CISO requirement was clear:

Controls must be enforced centrally, and we must be able to prove it.
AI Gateway Concept Animation

Figure: The Conceptual Flow of a Secure AI Gateway (Source: Azure Samples)

Architecture Decision: Dual APIM

Decision: The "Two Doors" Pattern

Door 1: Public APIM-External

For customer-facing agents. Protected at the edge by WAF and DDoS protection.

Door 2: Private APIM-Internal

For employee agents. Accessible only via VPN/ExpressRoute.

🍕
The Analogy
Two Doors, One Kitchen

Imagine a pizza restaurant with two entrances: Door 1 (Public) is open to the street for walk-in customers, while Door 2 (Private) is a VIP entrance requiring a keycard. Both doors lead to the same kitchen (Azure OpenAI) and share the same "Jar of Tokens" (quota pool).

Here's the magic—One Rulebook: The Bouncer at Door 1 (APIM-External) and the Bouncer at Door 2 (APIM-Internal) use the exact same rulebook to check IDs and enforce token limits. If the jar goes empty, both bouncers hold up the 429: Not Today sign simultaneously.

Why Two Doors? Door 1 has a WAF shield (security guard at the street entrance), while Door 2 is hidden from the street (only accessible via VPN). But both enforce the same governance—one audit trail, one quota policy, centralized control.

Both gateways use APIM's AI gateway capabilities to govern LLM endpoints and MCP/tool APIs.

Reference Diagram

Reference Architecture: Azure OpenAI Service with Azure API Management Gateway

Figure 1: Enterprise GenAI Gateway Pattern (Source: Microsoft Azure Samples)

Network Configuration

“One Rulebook”

The same governance standard applies to both APIM instances: token budgets, semantic caching, content safety screening, and multi-region failover, all enforced through identical policies.
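
To keep that rulebook literally identical, one option is to author the shared controls once as an APIM policy fragment and include it from the API policies on both gateways. A minimal sketch, with an illustrative fragment name rather than TrustBank's exact configuration:

<!-- Applied to the AI APIs on BOTH APIM-External and APIM-Internal -->
<inbound>
    <base />
    <!-- Shared rulebook: token limits, semantic caching, content safety, failover routing -->
    <include-fragment fragment-id="ai-gateway-rulebook" />
</inbound>

Policy fragments live inside an APIM instance, so the same fragment is deployed to both instances from a single source (for example, the IaC pipeline), which keeps any drift between the two doors visible and fixable.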

The Architect's Checklist

Ready to build a GenAI Gateway? Use this checklist to verify your posture before go-live.

Definition of done: model and tool access is routed through controlled gateway paths with consistent policy and traceability.

Day 1 Scope

External Customer Agents (Phase 1)

Allowed:

Not Allowed:

Internal Employee Agents (Phase 1)

Allowed:

Not Allowed (until extra gates exist):

Four Challenging Scenarios

The Crisis

Bot Storm & Cost Spike

A marketing event drove massive traffic. Bots piled in. Token spend threatened to drain the quarterly budget in 48 hours.

The Fix

Deterministic Throttling

💰 Budget Saved 🛡️ DDoS Blocked
  • WAF bot controls at the edge.
  • APIM enforced hard token budgets.
  • Outcome: 429 (Rate Limit) & 403 (Quota) responses stop the abuse.
Token Rate Limiting Flow

Technical Control: Token Limits

We use the azure-openai-token-limit policy. It tracks token consumption per counter key (here, the caller's APIM subscription). When the bucket is empty, the gateway returns a 429 Too Many Requests immediately, saving the backend from overload.

View APIM Policy: Token Rate Limiting
<!-- Enforce token limits per subscription -->
<azure-openai-token-limit 
    counter-key="@(context.Subscription.Id)" 
    tokens-per-minute="50000" 
    estimate-prompt-tokens="true" 
    remaining-tokens-variable-name="remainingTokens" />
💸
What This Shows
The "Leftovers" Effect

Direct Access is like ordering a new pizza every time someone says "I'm hungry" ($300/day). Gateway with Caching is like saying "There's already pizza on the table!" for 60% of requests ($120/day). The calculator below shows the difference at a 60% cache-hit rate.

The "Cost of Chaos" Calculator

Direct Access
$300
No Caching
Gateway Controlled
$120
60% Cache Hit
Gateway Savings: $180/ day
📚
The Analogy
The "Shortcut Library"

Imagine calling a world-renowned philosopher for every simple question—it's slow and expensive! Instead, our Gateway Librarian keeps a stack of Sticky Notes for common questions. If we've answered it before, we show the note instantly. The philosopher sleeps, and you save $5.

Semantic Caching Flow

Smart Caching

We enabled Semantic Caching. Unlike simple URL caching, this uses vector similarity. If User A asks "What's the wifi pass?" and later User B asks "Wifi password please?", the gateway detects they are the same intent and serves the cached answer instantly. Zero tokens used.

View APIM Policy: Semantic Caching
<!-- Enable Semantic Caching for Azure OpenAI -->
<azure-openai-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-endpoint"
    embeddings-deployment-name="text-embedding-3-small" />
    
<azure-openai-semantic-cache-store duration="3600" />

Proof Artifacts: Token usage by consumer, blocked calls, top callers.
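
One way to produce that evidence is APIM's azure-openai-emit-token-metric policy, which emits prompt, completion, and total token counts as custom metrics that can be sliced per consumer. A minimal sketch; the namespace and the Client IP dimension are illustrative choices, not TrustBank's exact setup:

<!-- Emit token counts per consumer for usage and top-caller reporting -->
<azure-openai-emit-token-metric namespace="trustbank-ai">
    <dimension name="Subscription ID" />
    <dimension name="API ID" />
    <dimension name="Client IP" value="@(context.Request.IpAddress)" />
</azure-openai-emit-token-metric>

With Application Insights attached to the gateway, these custom metrics back the token-usage-by-consumer and top-caller views.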

✉️
The Analogy
The "Celebrity Mail Inspector"

The LLM is like a famous celebrity. Some people try to send letters with "Secret Commands" or mean words. The Gateway is the Mail Inspector who opens everything first. If it sees a "Jailbreak" attempt, it shreds the letter instantly. The celebrity never even sees the attack.

The Risk

Prompt Injection Attack

External users tried "Ignore instructions" prompts to extract internal data or trigger unsafe actions (Jailbreak).

The Fix

Content Safety Shield

🛡️ 100% Block Rate ⚡ Zero Latency
  • No Route: External agent has zero path to internal tools.
  • Gateway Guard: Pre-flight content scan.
Content Safety Flow

Technical Control: Content Safety

Before any prompt reaches the LLM, the Gateway sends it to Azure AI Content Safety using the llm-content-safety policy. If it detects a jailbreak or injection attempt, hate speech, or other harmful content, the request is blocked immediately. The LLM never sees the attack.

View APIM Policy: Content Safety
<!-- Screen prompts with Azure AI Content Safety before they reach the LLM -->
<llm-content-safety backend-id="content-safety-endpoint" shield-prompt="true">
    <!-- shield-prompt enables Prompt Shields (jailbreak / injection detection) -->
    <categories output-type="FourSeverityLevels">
        <category name="Hate" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Sexual" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
The Problem

Model Degradation

Primary East US 2 model slowed down. Latency spiked +500ms, risking user trust.

The Fix

Circuit Breaker

⚡ Auto-Failover 💰 Zero Downtime
  • Routing: Traffic shifted to UK South.
  • Speed: Transparent to end-users.
The Analogy
The "Stadium Generator"

Imagine the big game is on, and the city's power grid (Primary Instance) starts to flicker. In milliseconds, the Stadium Backup Generator (Secondary Region) kicks in. The lights stay on, the game continues, and the crowd never even knows there was a power surge.

Load Balancing Flow

Technical Control: Load Balancing

We use APIM's load-balanced backend pools. If the primary East US 2 deployment slows down (high latency) or returns 5xx errors, the circuit breaker trips and the Gateway automatically shifts traffic to the secondary UK South deployment. This failover happens in milliseconds, invisible to the user.
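
In policy terms the failover is mostly invisible: the API simply targets a load-balanced backend pool, and the pool's priority and circuit-breaker rules (configured on the APIM backend resources, for example via Bicep) decide which region serves each call. A rough sketch, with an illustrative pool name:

<!-- Inbound: route to the pool; APIM picks a healthy backend (East US 2 first, UK South on failover) -->
<inbound>
    <base />
    <set-backend-service backend-id="openai-backend-pool" />
</inbound>
<!-- Backend: a single retry lets a slow or tripped backend fail over within the same request -->
<backend>
    <retry condition="@(context.Response.StatusCode >= 500)" count="1" interval="1" first-fast-retry="true">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>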

🚗
The Analogy
Grandpa's Old Car

Your family has a rule: "Everyone must have a driver's license with a photo ID." But Grandpa has a 1950s license with no photo—it's valid, but doesn't meet modern standards.

The Compromise: Grandpa can ONLY drive on the private family driveway (APIM-Internal), you install a dashcam in his car (enhanced logging), and you set a deadline: "Grandpa, you have 6 months to get a new license" (time-boxed exception).

He's not blocked from essential trips, he's not on the public highway where he could cause chaos, you're tracking everything, and there's a plan to fix it.

The Exception

Legacy App (No Auth)

A critical Ops system couldn't adopt the new identity standards. Risk of a total blocker.

The Compromise

Private Lane

🛡️ Internal Only 💰 Time-Boxed
  • Gate: Allowed ONLY via APIM-Internal.
  • Audit: Enhanced logging enabled.
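
On APIM-Internal, the "private lane" can be expressed as an exception policy scoped to the legacy app's API: restrict who can call it and turn up the audit trail while the time-boxed exception is open. A sketch only; the IP range and logger name are illustrative:

<!-- Legacy-app exception, deployed to APIM-Internal only -->
<inbound>
    <base />
    <!-- Accept calls only from the legacy app's internal subnet -->
    <ip-filter action="allow">
        <address-range from="10.20.0.0" to="10.20.255.255" />
    </ip-filter>
    <!-- Enhanced audit logging for the duration of the exception -->
    <log-to-eventhub logger-id="audit-eventhub">
        @(DateTime.UtcNow.ToString("o") + " | legacy-ops-app | " + context.Request.Url.Path)
    </log-to-eventhub>
</inbound>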

Operating Model

Weekly KPIs

Go-Live Checklist

Resources

Video Vault (Must Watch)

Expert deep-dives on the patterns used in this architecture.

  • Azure APIM Deep Dive (John Savill)
  • Secure GenAI with AI Content Safety


Join the Conversation

See what the community is saying about this architecture on LinkedIn.
