Architecture

The Hub-and-Spoke Lie:
Production-Grade Networking for the Real World

Architect's Reality Check
The "textbook" Hub-and-Spoke diagram is a lie.
Real-world networks don't just "connect"—they inspect, block, and audit. In production, unmonitored peering isn't a bridge; it's an uncontained blast radius.
The Silent Risk

Reliability failures are often "Day 2" ops failures: drift, silent blocks, and opaque paths. When the Incident Commander demands an RCA (Root Cause Analysis) after a Sev-1 outage, a static Visio diagram is an admission of failure.

The Production Standard

This is not "Hello World". This is a Production Landing Zone built for auditability, using the "Trinity of Observability" to mathematically prove network health.

Observability isn't a Day 2 'extra'. It's the validation logic that proves your security policies are actually enforced. Without it, you aren't an architect; you're just a gambler hoping that packets find their way.

1. The Architecture: Hub-and-Spoke Reality

We implemented a Microsoft-native landing zone that mandates inspection and continuously validates assertions. This isn't just about connectivity; it's about Control.

New to Azure? Think of VNet Peering like a direct highway between two cities. By default, cars just drive through. A User Defined Route (UDR) is like a mandatory exit ramp: it forces every car to pass through a security toll booth (the Firewall) before it can finish its journey.
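As a rough illustration of the "exit ramp", the Terraform sketch below attaches a default route to a Spoke subnet so that outbound traffic is handed to the Firewall. The resource names and all four variables are hypothetical placeholders rather than anything taken from the actual codebase, and the BGP attribute name assumes version 4.x of the azurerm provider (3.x used `disable_bgp_route_propagation = true` instead).

```hcl
# Hypothetical inputs; supply these from your own Hub and Spoke modules.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "spoke_subnet_id" { type = string }
variable "firewall_private_ip" { type = string }

resource "azurerm_route_table" "spoke" {
  name                = "rt-spoke-via-firewall"
  location            = var.location
  resource_group_name = var.resource_group_name

  # Stop gateway-learned (BGP) routes from offering a path around the Firewall.
  # azurerm 4.x attribute name (assumption about your provider version).
  bgp_route_propagation_enabled = false

  route {
    name                   = "default-via-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = var.firewall_private_ip
  }
}

# The route table has no effect until it is associated with a subnet.
resource "azurerm_subnet_route_table_association" "spoke" {
  subnet_id      = var.spoke_subnet_id
  route_table_id = azurerm_route_table.spoke.id
}
```

One association is needed per Spoke subnet; a subnet without it silently falls back to the default peering routes.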

Azure Landing Zone Topology: Gold Standard Blueprint

The Core Components

2. Design Decision: Parallel Edge Security

A common debate in Azure networking is "Serial Chaining" (WAF -> Firewall -> App) versus "Parallel" (WAF -> App, Firewall -> Internet). We chose Parallel (Zero Trust Edge).

Parallel Edge Security: Gold Standard Framework

Why Parallel?

3. Security: Defense in Depth

Security is not a single appliance; it's layers. Our design implements a "Defense in Depth" strategy where no single failure results in compromise.

Defense in Depth: Gold Standard Engineering

The Non-Negotiables

  1. Single Inspection Plane: Azure Firewall Premium is the only path for east-west and egress traffic.
  2. Mandatory WAF: Inbound HTTP(S) must pass through App Gateway WAF v2 in Prevention Mode.
  3. Micro-segmentation: NSGs are applied on every subnet with a "Default Deny" mindset.
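A minimal sketch of what a "Default Deny" NSG could look like in Terraform, assuming a workload subnet that should only accept HTTPS from the Application Gateway subnet. The names, the variables, and the choice of port 443 are illustrative assumptions, not the rule set of the landing zone described here.

```hcl
# Hypothetical inputs.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "workload_subnet_id" { type = string }
variable "appgw_subnet_cidr" { type = string }

resource "azurerm_network_security_group" "workload" {
  name                = "nsg-workload"
  location            = var.location
  resource_group_name = var.resource_group_name

  # Explicit allow: only the App Gateway subnet may reach the workload on 443.
  security_rule {
    name                       = "allow-https-from-appgw"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = var.appgw_subnet_cidr
    destination_address_prefix = "*"
  }

  # Lowest-priority custom rule. It is evaluated before the platform defaults
  # (AllowVnetInBound, AllowAzureLoadBalancerInBound), so it also blocks
  # load balancer probes unless you add an allow rule for them.
  security_rule {
    name                       = "deny-all-inbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "workload" {
  subnet_id                 = var.workload_subnet_id
  network_security_group_id = azurerm_network_security_group.workload.id
}
```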

4. The "Trinity" of Observability

This is where most implementations fail. They build the network but forget to build the eyes to see it. We implement three pillars to "prove the negative."

Network Observability: Gold Standard Trinity

The Proof of Life: Below are the actual outcomes of the Trinity implementation. These aren't just logs—they are the operational brain of your Landing Zone.

Azure Traffic Analytics: Authentic VNet Flow Logs Dashboard
Azure Connection Monitor: Authentic Path Validation Results
| Tool | Role | What it does |
| --- | --- | --- |
| VNet Flow Logs (Recommended) | The Wide Lens | Records entire VNet traffic. Captures encryption status. The modern standard. |
| NSG Flow Logs (Legacy) | The Macro Lens | Records granular Subnet/NIC events. Use only for legacy parity. |
| Traffic Analytics | The Brain | Visualizes flows, detecting malicious IPs and drift. |
| Connection Monitor | The Pulse | Tests critical paths (VM -> Google DNS) 24/7. |
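To make "The Pulse" concrete, here is a hedged Terraform sketch of a Connection Monitor test from a VM to Google DNS. The resource name, endpoint names, variables, the 60-second frequency, and the choice of TCP port 53 are assumptions for illustration; the source VM is also assumed to already run the monitoring agent that Connection Monitor requires.

```hcl
# Hypothetical inputs.
variable "network_watcher_id" { type = string }
variable "location" { type = string }
variable "source_vm_id" { type = string }
variable "log_analytics_workspace_resource_id" { type = string }

resource "azurerm_network_connection_monitor" "egress" {
  name               = "cm-spoke-egress"
  network_watcher_id = var.network_watcher_id
  location           = var.location

  # Source: a VM inside the Spoke.
  endpoint {
    name               = "spoke-vm"
    target_resource_id = var.source_vm_id
  }

  # Destination: an external address that should be reachable via the Firewall.
  endpoint {
    name    = "google-dns"
    address = "8.8.8.8"
  }

  test_configuration {
    name                      = "tcp-53"
    protocol                  = "Tcp"
    test_frequency_in_seconds = 60

    tcp_configuration {
      port = 53
    }
  }

  # A test group ties sources, destinations, and test configurations together.
  test_group {
    name                     = "vm-to-google-dns"
    source_endpoints         = ["spoke-vm"]
    destination_endpoints    = ["google-dns"]
    test_configuration_names = ["tcp-53"]
  }

  # Results land in Log Analytics, where alert rules can query them.
  output_workspace_resource_ids = [var.log_analytics_workspace_resource_id]
}
```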

💡 Architect's Note: VNet Flow Logs vs. NSG Flow Logs

Microsoft is moving from NSG Flow Logs to the newer VNet Flow Logs engine. While they look similar, the distinction matters for two simple reasons:

  • Simplicity: VNet Flow Logs are enabled at the VNet level, preventing the "missed subnet" configuration drift common with NSG-level logging.
  • Encryption Visibility: Only VNet Flow Logs can report the Encryption Status of traffic (critical if using VNet Encryption).

When to use which?

  • ✅ USE VNet Flow Logs: For all new deployments ("Greenfield") and when you need to audit VNet Encryption.
  • ⛔ DO NOT USE NSG Flow Logs: Unless you have a specific legacy requirement to monitor only a single NIC in a shared subnet to save fractional storage costs.
  • Blind Spots: Remember that traffic to a Private Endpoint cannot be recorded at the endpoint itself. You must capture it at the source VM (the client side).
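The following sketch shows one way a VNet-level flow log with Traffic Analytics might be declared in Terraform. It assumes an azurerm provider version recent enough to expose `target_resource_id` on the flow log resource; the resource name, the 30-day retention, the 10-minute interval, and every variable are hypothetical.

```hcl
# Hypothetical inputs.
variable "network_watcher_name" { type = string }
variable "network_watcher_rg" { type = string }
variable "vnet_id" { type = string }
variable "storage_account_id" { type = string }
variable "workspace_guid" { type = string }        # Log Analytics workspace ID (GUID)
variable "workspace_resource_id" { type = string } # Full ARM resource ID
variable "workspace_region" { type = string }

resource "azurerm_network_watcher_flow_log" "spoke_vnet" {
  name                 = "fl-spoke-vnet"
  network_watcher_name = var.network_watcher_name
  resource_group_name  = var.network_watcher_rg

  # Targeting the VNet (not an NSG) covers every subnet, including future ones.
  target_resource_id = var.vnet_id
  storage_account_id = var.storage_account_id
  enabled            = true

  retention_policy {
    enabled = true
    days    = 30
  }

  # Traffic Analytics processes the raw logs into the workspace.
  traffic_analytics {
    enabled               = true
    workspace_id          = var.workspace_guid
    workspace_region      = var.workspace_region
    workspace_resource_id = var.workspace_resource_id
    interval_in_minutes   = 10
  }
}
```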

5. The Topology Dilemma: Where does the WAF go?

One of the most heated debates in Azure architecture is the placement of the Application Gateway (WAF) relative to the Azure Firewall. Microsoft Learn documents three core patterns. Here is the realist's take on when to use which:

1. The Parallel Pattern (The Pragmatist)

Layout: AppGW and Firewall sit side-by-side. Web traffic hits AppGW; Non-web hits Firewall.

Parallel Architecture: Gold Standard Blueprint
  • ✅ Pros: Simplest routing. No double-encryption capability needed. Cheapest.
  • ⛔ Cons: Azure Firewall does NOT inspect the web traffic. You rely 100% on the WAF.
  • Verdict: Use this for 90% of standard enterprise apps where WAF protection is sufficient.
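Because the Parallel pattern leans entirely on the WAF, the policy mode matters. Below is a minimal Terraform sketch of a WAF policy in Prevention mode; the policy name, the variables, and the OWASP 3.2 rule set version are assumptions chosen for illustration.

```hcl
# Hypothetical inputs.
variable "resource_group_name" { type = string }
variable "location" { type = string }

resource "azurerm_web_application_firewall_policy" "edge" {
  name                = "wafpol-edge"
  location            = var.location
  resource_group_name = var.resource_group_name

  policy_settings {
    enabled = true
    # "Detection" only logs matches; "Prevention" blocks them.
    mode = "Prevention"
  }

  managed_rules {
    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}

# Attach it by setting firewall_policy_id on the azurerm_application_gateway
# resource (not shown here).
```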

2. AppGW -> Firewall (The Fortress)

Layout: Traffic hits WAF first, then is piped through Firewall for heavy inspection.

Fortress Architecture: Gold Standard Blueprint
  • ✅ Pros: "Zero Trust". Firewall sees everything, even web packets. Preserves Client IP (via X-Forwarded-For).
  • ⛔ Cons: Complex UDRs. Needs Azure Firewall Premium to meaningfully inspect encrypted web traffic (TLS Inspection).
  • Verdict: Mandatory for Banking/High-Compliance where Defense in Depth is audited.

3. Firewall -> AppGW (The Shield)

Layout: Firewall filters bad IPs at the edge, THEN passes clean traffic to WAF.

Shield Architecture: Gold Standard Blueprint
  • ✅ Pros: Stops DDoS/Scanners before they touch your expensive WAF instances.
  • ⛔ Cons: Major Trap: The WAF only sees the Firewall's IP, not the client IP, unless you enable expensive Proxy modes.
  • Verdict: Rare. Use only if you have a massive IP-blocking requirement at the edge.

The Veteran's Trap: Asymmetric Routing

In a real-world Hub-and-Spoke, traffic often fails because it takes two different paths. The request goes through the Firewall; the response tries to bypass it via a default peering route. The Firewall sees a one-way connection and "gracefully" drops the packet (Stateful Filtering).

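The Fix: Use a Gateway UDR. Apply a route table to the GatewaySubnet (in the Hub) with a route for each Spoke address prefix pointing to the Firewall's internal IP. Azure does not support a 0.0.0.0/0 UDR on the GatewaySubnet, so the Spoke prefixes have to be listed explicitly. This ensures both legs of the conversation are forced through the same inspector.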

The Airport Analogy: Imagine you go through Airport Security to catch a flight. The guard checks your ID and lets you in. If you try to return by walking through the unmanned baggage exit, the security system freaks out. Why? Because the guards at the exit never saw you leave, and the guards at the entrance are still waiting for you to walk back past them. A Stateful Firewall is that guard: if it doesn't see the "Hello" (Request), it will never allow the "Goodbye" (Response).

6. Troubleshooting the Trenches: The "404 Ghost"

The "404 Ghost" is the single most expensive support ticket in Azure networking. It’s rarely a 'down' service; it’s almost always a Host header mismatch on the request the gateway sends to the backend.

The Problem: What actually happened

Imagine you went to a high-security office building and said to the receptionist: “Hi, I’m here to meet someone.”

The receptionist looks at you and says: “Who exactly? I only let people in if you ask for a specific name.”

That’s what our initial 404 Not Found error was. The Application Gateway was knocking on the App Service’s door but saying something vague like: “I’m here for a website.”

Azure App Service is a multi-tenant environment. It says: “I only respond if you call me by my exact name.” That “exact name” is the Host header.

Why App Service rejected the request

The Fix: Use its Real Name

We told the Gateway: “When you talk to the App Service, use its real name, not a generic one.”

pick_host_name_from_backend_address = true

This tells the Gateway: “Look at the backend URL (app-spoke2-obs…), and use that exact hostname in every request.”

The Takeaway: The app was fine. The gateway was just using the wrong name. Once it used the app’s real hostname, the 404 disappeared, and traffic flowed normally.
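For context, here is roughly where that setting lives. This is a fragment, not a deployable gateway: the resource name, settings name, probe name, and timeout values are hypothetical, and the other blocks an Application Gateway requires are deliberately omitted. The probe setting is included on the assumption that health probes should carry the same Host header as real traffic.

```hcl
resource "azurerm_application_gateway" "edge" {
  # ... name, location, sku, gateway_ip_configuration, frontend and listener
  # blocks, backend_address_pool, and request_routing_rule omitted ...

  backend_http_settings {
    name                  = "https-to-app-service"
    protocol              = "Https"
    port                  = 443
    cookie_based_affinity = "Disabled"
    request_timeout       = 30
    probe_name            = "app-service-health"

    # Send the backend's own FQDN as the Host header instead of the
    # listener's hostname, so App Service can match the request to the app.
    pick_host_name_from_backend_address = true
  }

  probe {
    name                = "app-service-health"
    protocol            = "Https"
    path                = "/"
    interval            = 30
    timeout             = 30
    unhealthy_threshold = 3

    # Reuse the Host header chosen by the backend HTTP settings above.
    pick_host_name_from_backend_http_settings = true
  }
}
```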

7. The Cost of Reality (FinOps)

"Why Premium? Why not Standard?" This is the most common question I get. The answer is simple: TLS Inspection.

Standard Firewall is blind to encrypted traffic, which is where the vast majority of modern malware C2 (command-and-control) traffic hides. If you can't inspect inside the HTTPS tunnel, you aren't secure; you're just routing packets.

The Expert's Scalability Check: In high-throughput Landing Zones, watch out for SNAT Port Exhaustion. If your Spokes are making thousands of concurrent outbound API calls, the Firewall's Public IP pool will run dry.
Solution: Scale horizontally by adding more Public IPs to the Firewall instance (up to 250 supported).
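One way to express that horizontal scaling in Terraform is to drive the number of Public IPs from a single variable. This is a sketch under assumptions: the names, the default count of 2, and the variables are hypothetical, and the Firewall is shown with only the attributes needed to illustrate the pattern.

```hcl
# Hypothetical inputs.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "firewall_subnet_id" { type = string }
variable "firewall_policy_id" { type = string }

variable "firewall_public_ip_count" {
  type    = number
  default = 2
}

resource "azurerm_public_ip" "fw" {
  count               = var.firewall_public_ip_count
  name                = "pip-fw-${count.index}"
  location            = var.location
  resource_group_name = var.resource_group_name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_firewall" "hub" {
  name                = "fw-hub"
  location            = var.location
  resource_group_name = var.resource_group_name
  sku_name            = "AZFW_VNet"
  sku_tier            = "Premium"
  firewall_policy_id  = var.firewall_policy_id

  # One ip_configuration per Public IP. Only the first one carries the
  # subnet reference; the rest just add SNAT capacity.
  dynamic "ip_configuration" {
    for_each = azurerm_public_ip.fw
    content {
      name                 = "ipconfig-${ip_configuration.key}"
      public_ip_address_id = ip_configuration.value.id
      subnet_id            = ip_configuration.key == 0 ? var.firewall_subnet_id : null
    }
  }
}
```

Raising `firewall_public_ip_count` then becomes the single knob to turn when SNAT utilization climbs.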

Don't forget the Observability tax: While NSG Flow Logs are cheap (pennies/GB), looking at them isn't. Traffic Analytics processing and the Connection Monitor agents (approx. $25/month per node) add up. However, compared to the $10,000/hour cost of a Sev-1 outage where you "guess" the root cause, it's a negligible insurance premium.

8. Future Hardening

To move from "Baseline" to "Enterprise Standard," the roadmap includes:



📚 Deep Dive Resource Kit

Curated assets for building and auditing production-grade Azure environments.

Ready to deploy?

The full Terraform codebase, including the Windows Server 2022 refactor and Connection Monitor logic, is available on GitHub.


Architect's Post-Deployment Checklist

Before you hand over the keys to the platform team, ensure you can answer 'Yes' to these five high-stakes questions:

  1. UDR Enforcement: Do all Spoke subnets have a 0.0.0.0/0 UDR pointing to the Firewall's private IP in the Hub (and is BGP route propagation disabled)?
  2. Encryption Audit: Have you used VNet Flow Logs to confirm that traffic between critical nodes is reported as encrypted?
  3. Host Header Integrity: Is pick_host_name_from_backend_address set to true on all App Gateway backend HTTP settings to kill the "404 Ghost"?
  4. SNAT Scaling: Have you audited the Firewall's SNAT port utilization for Spokes making high-frequency external API calls?
  5. Path Validation: Is Connection Monitor testing the Hub-to-Internet and Hub-to-Hybrid paths every 60 seconds with active alerting?
🏆

Architecture Audit Certified

Your Landing Zone meets the Senior Architect standard for production-grade networking.
