Two Doors, One Rulebook: Scaling AI Agents

TrustBank wanted AI agents for customers (internet-scale) and AI agents for employees (private-only). The CIO wanted speed. The CISO wanted an audit-proof control story. The solution was a hybrid pattern using dual Azure API Management (APIM) gateways with consistent “AI gateway” policies across both.

TL;DR - Executive Summary

The Challenge: Scale AI agents for both public customers and private employees while maintaining audit-proof governance
The Solution: Dual Azure API Management gateways (APIM-External + APIM-Internal) with unified "AI gateway" policies
The Impact: 60% cost reduction via semantic caching, 100% security coverage, zero-downtime failover, and centralized audit trail


What Is an AI Agent?

An AI agent is an application that uses an AI model to understand a goal, decide steps, and call tools (APIs, search, ticketing, databases) to complete the task.

A chatbot mostly talks. An agent can take actions. That is why it needs stronger controls than a normal API.

The Cast

Lead Architect (Upendra)

Architecture, standards, trade-offs.

Cloud Engineer (Trinity)

Build, automation, operations.

Security Architect (Morpheus)

Identity, data controls, audit.

Technical Consultant (Upendra)

Delivery strategy, operating models, scale-ready roadmaps.

Customer Leadership

Risk acceptance & operating model.

Project Manager

Decisions, RACI, milestones.

The Strategic Gap

Before governance, teams tend to connect directly to:

model endpoints (Azure OpenAI / Foundry Models), and
tool APIs (search, DB, ticketing).

The Problem: Three Predictable Failures

Without centralized governance, direct model access creates:

💰 Uncontrolled Cost (Token spikes drain budget)

🔓 Shadow Integrations (Inconsistent access control)

🚫 Weak Governance (No single audit point)

The CISO requirement was clear:

Controls must be enforced centrally, and we must be able to prove it.

Figure: The Conceptual Flow of a Secure AI Gateway (Source: Azure Samples)

Architecture Decision: Dual APIM

Decision: The "Two Doors" Pattern

Door 1: Public APIM-External

For customer-facing agents. Protected by a WAF and DDoS protection at the edge.

Door 2: Private APIM-Internal

For employee agents. Accessible only via VPN/ExpressRoute.

🍕

The Analogy

Two Doors, One Kitchen

Imagine a pizza restaurant with two entrances: Door 1 (Public) is open to the street for walk-in customers, while Door 2 (Private) is a VIP entrance requiring a keycard. Both doors lead to the same kitchen (Azure OpenAI) and share the same "Jar of Tokens" (quota pool).

Here's the magic—One Rulebook: The Bouncer at Door 1 (APIM-External) and the Bouncer at Door 2 (APIM-Internal) use the exact same rulebook to check IDs and enforce token limits. If the jar goes empty, both bouncers hold up the 429: Not Today sign simultaneously.

Why Two Doors? Door 1 has a WAF shield (security guard at the street entrance), while Door 2 is hidden from the street (only accessible via VPN). But both enforce the same governance—one audit trail, one quota policy, centralized control.

Both gateways implement the AI gateway capabilities in APIM, which govern LLM endpoints and MCP/tool APIs.

Reference Diagram

Reference Architecture: Azure OpenAI Service with Azure API Management Gateway

Figure 1: Enterprise GenAI Gateway Pattern (Source: Microsoft Azure Samples)

Network Configuration

Model endpoints are configured for private access using Private Endpoint + private DNS. Where required, public network access is disabled so Private Endpoint is the exclusive path.
Tooling systems (for example, Azure AI Search) use the same private connectivity approach.
APIM-Internal is deployed in internal VNet mode (or equivalent private-only access pattern), so it is reachable only within controlled networks.
Egress is restricted so workloads cannot call model endpoints directly. Access is permitted only through approved gateway paths.

“One Rulebook”

This is the governance standard applied to both APIM instances:

Identity at the gateway (who is calling).
Request throttling (requests per time window).
Token governance (tokens per minute and/or token quotas) using APIM GenAI policies, including azure-openai-token-limit.
Audit-ready logging with consistent fields and correlation IDs (caller, environment, model/deployment, tool invoked, token usage, latency, backend route).
Tool allowlisting (agents can call only approved tools, not arbitrary URLs).
Exceptions are time-boxed with an owner, expiry date, and compensating controls.
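
The audit-ready logging rule above can be pictured as one structured record per gateway call. This is a minimal sketch, assuming a JSON log line; the field names, the `AuditRecord` class, and the `build_audit_record` helper are illustrative and are not an APIM log schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AuditRecord:
    """One gateway log entry. Field names are illustrative assumptions."""
    correlation_id: str
    caller: str
    environment: str
    deployment: str
    tool_invoked: Optional[str]
    total_tokens: int
    latency_ms: float
    backend_route: str


def build_audit_record(caller, environment, deployment, tool_invoked,
                       total_tokens, latency_ms, backend_route,
                       correlation_id=None):
    """Build a record, generating a correlation ID when none was supplied."""
    return AuditRecord(
        correlation_id=correlation_id or str(uuid.uuid4()),
        caller=caller,
        environment=environment,
        deployment=deployment,
        tool_invoked=tool_invoked,
        total_tokens=total_tokens,
        latency_ms=latency_ms,
        backend_route=backend_route,
    )


def to_log_line(record):
    """Serialize with sorted keys so both gateways emit identical field order."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical call: an internal agent used a CMDB lookup tool.
record = build_audit_record("hr-assistant", "prod", "chat-deployment",
                            "cmdb_lookup", 812, 431.5, "primary")
print(to_log_line(record))
```

Using the same record shape on both gateways is what lets one query answer "who called which model or tool, and at what cost" across both doors.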

The Architect's Checklist

Ready to build a GenAI Gateway? Use this checklist to verify your posture before go-live.

Dual APIM instances (isolated public/private doors)
Managed Identity for model and tool authentication
Token limits (TPM) configured via APIM policy
Semantic caching enabled for ROI optimization
Azure AI Content Safety pre-flight scan integrated

Definition of done: model and tool access is routed through controlled gateway paths with consistent policy and traceability.

Day 1 Scope

External Customer Agents (Phase 1)

Allowed:

Read-only knowledge and safe lookups (public FAQs, product guidance, status checks with strict scope).

Not Allowed:

Any internal tools.
High-risk actions (account changes, payments, entitlements).

Internal Employee Agents (Phase 1)

Allowed:

Enterprise search and knowledge retrieval.
CMDB and operational context lookups.
Low-risk workflow actions (for example, ticket drafts or controlled ticket creation).

Not Allowed (until extra gates exist):

High-risk write actions without step-up controls (approval and/or human-in-the-loop).
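
The Phase 1 scope above amounts to a per-door tool allowlist plus a step-up gate for risky actions. The sketch below shows that decision logic under stated assumptions: every tool name, the `STEP_UP_REQUIRED` set, and the `authorize_tool` function are hypothetical and only illustrate the rule, not how TrustBank implemented it.

```python
# Hypothetical tool names, one allowlist per gateway ("door").
ALLOWLISTS = {
    "external": {"faq_search", "product_guidance", "status_check"},
    "internal": {"enterprise_search", "cmdb_lookup", "ticket_draft",
                 "change_request_submit"},
}

# Allowlisted tools that still need approval or a human in the loop.
STEP_UP_REQUIRED = {"change_request_submit"}


def authorize_tool(door, tool, approved=False):
    """Return (allowed, reason) for a tool call arriving at a given door."""
    if tool not in ALLOWLISTS.get(door, set()):
        return False, "tool not on allowlist"
    if tool in STEP_UP_REQUIRED and not approved:
        return False, "step-up approval required"
    return True, "ok"


print(authorize_tool("external", "faq_search"))
print(authorize_tool("external", "cmdb_lookup"))
print(authorize_tool("internal", "change_request_submit"))
print(authorize_tool("internal", "change_request_submit", approved=True))
```

Unknown doors and unknown tools both fall through to a denial, so the default is closed rather than open.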

Four Challenging Scenarios

The Crisis

Bot Storm & Cost Spike

A marketing event drove massive traffic. Bots piled in. Token spend threatened to drain the quarterly budget in 48 hours.

The Fix

Deterministic Throttling

💰 Budget Saved 🛡️ DDoS Blocked

WAF bot controls at the edge.
APIM enforced hard token budgets.
Outcome: 429 (rate limit) and 403 (quota exceeded) responses stop abuse.

Technical Control: Token Limits

We use the azure-openai-token-limit policy. It tracks token consumption per counter key (here, the APIM subscription ID). When a caller exhausts its tokens-per-minute budget, the gateway returns 429 Too Many Requests immediately, so the excess traffic never reaches the backend.

View APIM Policy: Token Rate Limiting

<!-- Enforce token limits per subscription -->
<azure-openai-token-limit 
    counter-key="@(context.Subscription.Id)" 
    tokens-per-minute="50000" 
    estimate-prompt-tokens="true" 
    remaining-tokens-variable-name="remainingTokens" />
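
To make the per-key budget concrete, here is a simplified fixed-window counter in Python. It is an illustration of the idea only: the `TokenLimiter` class is hypothetical, and APIM's internal accounting may differ from this sketch.

```python
class TokenLimiter:
    """Fixed one-minute window token budget, tracked per counter key."""

    def __init__(self, tokens_per_minute):
        self.limit = tokens_per_minute
        self.windows = {}  # counter_key -> (window_index, tokens_used)

    def try_consume(self, counter_key, tokens, now_seconds):
        """Return (status_code, remaining_tokens) for one request."""
        window = int(now_seconds // 60)
        start, used = self.windows.get(counter_key, (window, 0))
        if start != window:
            # A new minute started, so the budget resets.
            start, used = window, 0
        if used + tokens > self.limit:
            self.windows[counter_key] = (start, used)
            return 429, self.limit - used
        used += tokens
        self.windows[counter_key] = (start, used)
        return 200, self.limit - used


limiter = TokenLimiter(tokens_per_minute=50000)
print(limiter.try_consume("sub-a", 30000, now_seconds=0))   # (200, 20000)
print(limiter.try_consume("sub-a", 30000, now_seconds=10))  # (429, 20000)
print(limiter.try_consume("sub-b", 30000, now_seconds=10))  # (200, 20000)
print(limiter.try_consume("sub-a", 30000, now_seconds=61))  # (200, 20000)
```

The second call is rejected without touching the backend, while a different key ("sub-b") keeps its own budget.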

💸

What This Shows

The "Leftovers" Effect

Direct Access is like ordering a new pizza every time someone says "I'm hungry" ($300/day). Gateway with Caching is like saying "There's already pizza on the table!" for 60% of requests, which leaves only 40% of requests to pay for ($120/day).

The "Cost of Chaos" Calculator

Daily token volume: 100M tokens
Direct access (no caching): $300/day
Gateway controlled (60% cache hit): $120/day
Gateway savings: $180/day
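
The calculator's numbers follow from simple arithmetic. The sketch below assumes a blended price of $3 per million tokens (implied by $300 for 100M tokens) and assumes cache hits cost nothing, which ignores the small cost of embeddings lookups; `daily_cost` is a hypothetical helper.

```python
def daily_cost(tokens_per_day, price_per_million, cache_hit_rate=0.0):
    """Cost of the tokens that still reach the model after cache hits."""
    billable_tokens = tokens_per_day * (1.0 - cache_hit_rate)
    return billable_tokens / 1_000_000 * price_per_million


TOKENS_PER_DAY = 100_000_000
PRICE_PER_MILLION = 3.0  # assumed blended rate, implied by $300 per 100M tokens

direct = daily_cost(TOKENS_PER_DAY, PRICE_PER_MILLION)
cached = daily_cost(TOKENS_PER_DAY, PRICE_PER_MILLION, cache_hit_rate=0.60)

print(f"Direct access:   ${direct:,.2f}/day")
print(f"Gateway + cache: ${cached:,.2f}/day")
print(f"Savings:         ${direct - cached:,.2f}/day")
```

With these inputs the script reproduces the $300, $120, and $180 figures; changing `cache_hit_rate` shows how sensitive the savings are to the real hit rate.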

📚

The Analogy

The "Shortcut Library"

Imagine calling a world-renowned philosopher for every simple question—it's slow and expensive! Instead, our Gateway Librarian keeps a stack of Sticky Notes for common questions. If we've answered it before, we show the note instantly. The philosopher sleeps, and you save $5.

Smart Caching

We enabled semantic caching. Unlike simple URL caching, it compares prompts by vector similarity. If User A asks "What's the wifi pass?" and User B later asks "Wifi password please?", the gateway detects the same intent and serves the cached answer instantly. No completion tokens are used; the only model call is the embeddings lookup.

View APIM Policy: Semantic Caching

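<!-- Inbound: look up a semantically similar cached response. -->
<!-- The embeddings deployment (text-embedding-3-small) is set in the backend URL. -->
<azure-openai-semantic-cache-lookup
    score-threshold="0.05"
    embeddings-backend-id="embeddings-endpoint"
    embeddings-backend-auth="system-assigned" />

<!-- Outbound: store the response for one hour (3600 seconds). -->
<azure-openai-semantic-cache-store duration="3600" />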
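
The similarity check behind a semantic cache can be sketched in a few lines. This example treats the threshold as a maximum cosine distance (smaller means more similar), uses hand-written toy vectors in place of a real embedding model, and omits expiry; the `SemanticCache` class and the vectors are assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


class SemanticCache:
    def __init__(self, score_threshold):
        self.score_threshold = score_threshold
        self.entries = []  # list of (embedding, response)

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def lookup(self, embedding):
        """Return the closest cached response within the threshold, else None."""
        best = None
        for cached_embedding, response in self.entries:
            distance = cosine_distance(embedding, cached_embedding)
            if distance <= self.score_threshold:
                if best is None or distance < best[0]:
                    best = (distance, response)
        return best[1] if best else None


# Toy embeddings standing in for a real embeddings model.
wifi_pass = [0.90, 0.10, 0.00]        # "What's the wifi pass?"
wifi_password = [0.88, 0.12, 0.02]    # "Wifi password please?"
flights = [0.00, 0.20, 0.95]          # "Search flights"

cache = SemanticCache(score_threshold=0.05)
cache.store(wifi_pass, "Guest wifi details are on the intranet.")
print(cache.lookup(wifi_password))  # close enough: cached answer
print(cache.lookup(flights))        # unrelated: None, goes to the model
```

A tighter threshold lowers the risk of serving a wrong cached answer at the price of fewer hits, which is the main tuning trade-off.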

Proof Artifacts: Token usage by consumer, blocked calls, top callers.

✉️

The Analogy

The "Celebrity Mail Inspector"

The LLM is like a famous celebrity. Some people try to send letters with "Secret Commands" or mean words. The Gateway is the Mail Inspector who opens everything first. If it sees a "Jailbreak" attempt, it shreds the letter instantly. The celebrity never even sees the attack.

The Risk

Prompt Injection Attack

External users tried "Ignore instructions" prompts to extract internal data or trigger unsafe actions (Jailbreak).

The Fix

Content Safety Shield

🛡️ 100% Block Rate ⚡ Zero Latency

No Route: External agent has zero path to internal tools.
Gateway Guard: Pre-flight content scan.

Technical Control: Content Safety

Before any prompt reaches the LLM, the gateway sends it to Azure AI Content Safety using the llm-content-safety policy. If the scan detects a prompt attack (jailbreak or injection) or content at or above a category's severity threshold, the gateway rejects the request with a 403. The LLM never sees the attack.

View APIM Policy: Content Safety

<!-- Scan inputs before they hit the LLM; shield-prompt covers jailbreak and injection attempts. -->
<!-- Threshold values are illustrative; tune them per category. -->
<llm-content-safety backend-id="content-safety-endpoint" shield-prompt="true">
    <categories output-type="FourSeverityLevels">
        <category name="Hate" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Sexual" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
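
The pre-flight gate boils down to a block-or-forward decision on scan results. The sketch below assumes the scanner returns a severity score per category plus a prompt-attack flag; the threshold values, the 403 status, and the `preflight_decision` function are assumptions of this example, and no real Content Safety call is made.

```python
# Hypothetical thresholds: block when severity is at or above the value.
THRESHOLDS = {"Hate": 4, "SelfHarm": 4, "Sexual": 4, "Violence": 4}


def preflight_decision(severities, attack_detected, thresholds=THRESHOLDS):
    """Return (status_code, reason) given scan results for one prompt.

    severities: dict of category -> severity score from the scanner.
    attack_detected: True if the scanner flagged a jailbreak or injection.
    """
    if attack_detected:
        return 403, "prompt attack detected"
    for category, limit in thresholds.items():
        severity = severities.get(category, 0)
        if severity >= limit:
            return 403, f"{category} severity {severity} at or above {limit}"
    return 200, "forwarded to model"


print(preflight_decision({"Hate": 0, "Violence": 2}, attack_detected=False))
print(preflight_decision({"Violence": 6}, attack_detected=False))
print(preflight_decision({}, attack_detected=True))
```

Only prompts that pass every check are forwarded, so a blocked request costs a scan but no model tokens.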


The Problem

Model Degradation

The primary East US 2 model deployment slowed down. Latency spiked by 500 ms, risking user trust.

The Fix

Circuit Breaker

⚡ Auto-Failover 💰 Zero Downtime

Routing: Traffic shifted to UK South.
Speed: Transparent to end-users.

⚡

The Analogy

The "Stadium Generator"

Imagine the big game is on, and the city's power grid (Primary Instance) starts to flicker. In milliseconds, the Stadium Backup Generator (Secondary Region) kicks in. The lights stay on, the game continues, and the crowd never even knows there was a power surge.

Technical Control: Load Balancing

We use APIM backend pools: load-balanced backend resources that a policy selects with set-backend-service, with a circuit breaker rule on each backend. If the primary East US 2 deployment returns 5xx errors or slows down enough that requests time out, the breaker trips and the gateway shifts traffic to the secondary UK South instance. Once the breaker trips, rerouting is automatic and invisible to the user.
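
The failover behavior can be modeled as a failure counter per backend plus priority-based selection. This is a simplified sketch: the `Backend` class, the failure threshold of 3, and the 60-second trip duration are assumptions chosen for illustration, not TrustBank's configuration.

```python
class Backend:
    """A backend with a simple circuit breaker (illustrative only)."""

    def __init__(self, name, priority, failure_threshold=3, trip_seconds=60):
        self.name = name
        self.priority = priority  # lower number is preferred
        self.failure_threshold = failure_threshold
        self.trip_seconds = trip_seconds
        self.failures = 0
        self.open_until = 0.0  # breaker is open (backend skipped) until then

    def available(self, now):
        return now >= self.open_until

    def record(self, status_code, now):
        """Count consecutive 5xx responses; trip the breaker at the threshold."""
        if status_code >= 500:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = now + self.trip_seconds
                self.failures = 0
        else:
            self.failures = 0


def pick_backend(backends, now):
    """Choose the available backend with the best priority, or None."""
    candidates = [b for b in backends if b.available(now)]
    return min(candidates, key=lambda b: b.priority) if candidates else None


primary = Backend("eastus2", priority=1)
secondary = Backend("uksouth", priority=2)
pool = [primary, secondary]

for t in (0, 1, 2):                     # three consecutive 503s from the primary
    primary.record(503, now=t)

print(pick_backend(pool, now=3).name)   # uksouth while the breaker is open
print(pick_backend(pool, now=63).name)  # eastus2 again after the trip expires
```

A timeout can be recorded as a 504 in this model, which is how sustained high latency ends up tripping the breaker.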

🚗

The Analogy

Grandpa's Old Car

Your family has a rule: "Everyone must have a driver's license with a photo ID." But Grandpa has a 1950s license with no photo—it's valid, but doesn't meet modern standards.

The Compromise: Grandpa can ONLY drive on the private family driveway (APIM-Internal), you install a dashcam in his car (enhanced logging), and you set a deadline: "Grandpa, you have 6 months to get a new license" (time-boxed exception).

He's not blocked from essential trips, he's not on the public highway where he could cause chaos, you're tracking everything, and there's a plan to fix it.

The Exception

Legacy App (No Auth)

A critical Ops system couldn't adopt the new identity standards, which risked blocking the rollout entirely.

The Compromise

Private Lane

🛡️ Internal Only 💰 Time-Boxed

Gate: Allowed ONLY via APIM-Internal.
Audit: Enhanced logging enabled.

Operating Model

Weekly KPIs

% of model calls routed through gateways (target: 100%)
Token spend by app/team vs budget
p95 latency and error rate per agent API
Number of failovers
Exception count and burn-down trend
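
Two of these KPIs are easy to compute from gateway and egress logs. The sketch below assumes each call is a dict with `via_gateway` and `latency_ms` keys, uses the nearest-rank method for p95, and runs on made-up sample data; the helper names are hypothetical.

```python
def gateway_coverage(calls):
    """Percent of model calls that were routed through a gateway."""
    if not calls:
        return 0.0
    routed = sum(1 for call in calls if call["via_gateway"])
    return 100.0 * routed / len(calls)


def p95(values):
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# Made-up sample: 20 calls, one of which bypassed the gateway.
calls = [{"via_gateway": i != 7, "latency_ms": 100 * (i + 1)} for i in range(20)]

print(gateway_coverage(calls))                 # 95.0
print(p95([c["latency_ms"] for c in calls]))   # 1900
```

Any coverage figure below 100% points to a workload still calling a model endpoint directly, which is the first thing to chase in the weekly review.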

Go-Live Checklist

[] APIM-External and APIM-Internal deployed and scoped
[] Model endpoints private (Private Endpoint + private DNS)
[] Tool services private (Search/DB/Storage)
[] Gateway identity enforced
[] Request throttles + token quotas enabled
[] Token metrics enabled
[] Tool allowlist implemented
[] Failover tested
[] Abuse tests executed
[] Exceptions process live

Resources

Video Vault (Must Watch)

Expert deep-dives on the patterns used in this architecture.

Azure APIM Deep Dive

Secure GenAI with AI Content Safety

Official Docs: Implement Azure OpenAI Gateway policies
