In distributed enterprise architectures, leveraging independent high-availability (HA) pairs of Palo Alto firewalls across multiple data centers (DCs) provides robust redundancy for core network services. However, managing session state—particularly for UDP-based applications like wireless tunnels (CAPWAP/GRE/IPsec-ESP-UDP) and VoIP signaling (SIP)—during full-site outages and subsequent recovery introduces subtleties. Building on our prior discussions of active-active HA with OSPF steering, this post examines two recovery scenarios when traffic fails back from a secondary DC (DcB) to the primary (DcA) after a complete DcA outage. We’ll detail expected behaviors, potential disruptions, and Palo Alto Networks’ recommended mitigations, informed by PAN-OS 11.2+ best practices for multi-site deployments.
Infrastructure Overview (Text Diagram for Clarity):
         [Upstream Routers / Internet]
                     |
                     | OSPF (low-cost paths)
                     v
+---------------------------+     +----------------------+
|   DcA HA Pair (Primary)   |     |   Internal Clients   |
| +----------+ +----------+ |     | - Cisco APs (CAPWAP) |
| | DcA-fw01 | | DcA-fw02 | |     | - Aruba APs (GRE/    |
| | (Active) | | (Active) | |     |   IPsec-ESP-UDP)     |
| +----------+ +----------+ |     | - iPhone SIP Phones  |
|       HA2 Sync Link       |     +----------------------+
+---------------------------+                |
        | OSPF reconvergence                 v
        | (high cost on failover)   [Services: WLC, PBX]
        v                                    |
+---------------------------+                |
|  DcB HA Pair (Secondary)  | <--------------+
| +----------+ +----------+ |
| | DcB-fw01 | | DcB-fw02 | |
| | (Active) | | (Active) | |
| +----------+ +----------+ |
|       HA2 Sync Link       |
+---------------------------+
   (No cross-DC session sync)
This diagram illustrates the routed topology: Traffic prefers DcA via OSPF costs, fails over to DcB on full-site outage, with independent HA pairs per DC. Clients (e.g., APs/phones) sit behind the FWs, accessing services like WLC/PBX.
Context Recap: Assume two independent HA pairs (DcA: fw01/fw02; DcB: fw01/fw02) with no cross-DC session synchronization (standard for WAN-separated sites due to latency/bandwidth constraints). A full DcA outage routes traffic via OSPF to DcB-fw01, where new sessions form. On DcA recovery, traffic reconverges back. UDP sessions (default 30-second inactivity timeout) lack inherent teardown, amplifying stale state risks.
Scenario Setup: Full DcA Outage and DcB Failover
- Outage Trigger: Both DcA-fw01 and fw02 go down (e.g., power failure or maintenance).
- Routing Shift: OSPF reconverges (sub-second with tuned timers), directing traffic to DcB-fw01.
- Session Impact on DcB: No prior state exists, so new UDP sessions are created on first packets. Apps recover via keepalives (e.g., CAPWAP re-joins in 10-60s; SIP re-registers in 30-300s). FW-level disruption is minimal, but app-layer delays occur.
Recovery to DcA depends on outage duration relative to UDP timeouts. Case 2 represents the “chaos” scenario—brief outages lead to stale session conflicts, creating more unpredictability and disruption than Case 1’s clean recovery.
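Which case applies comes down to whether a pre-outage 5-tuple still matches a live entry when traffic returns. A minimal Python sketch of that lookup logic (the 30-second default and the check-staleness-on-lookup model are simplifying assumptions for illustration, not PAN-OS internals):

```python
from dataclasses import dataclass

UDP_TIMEOUT = 30  # PAN-OS default UDP inactivity timeout, in seconds

@dataclass
class Session:
    last_seen: float  # timestamp of the last packet that matched this entry

    def is_stale(self, now: float) -> bool:
        return now - self.last_seen > UDP_TIMEOUT

# Session table keyed by 5-tuple (src IP, src port, dst IP, dst port, proto)
table: dict[tuple, Session] = {}

def handle_packet(five_tuple: tuple, now: float) -> str:
    sess = table.get(five_tuple)
    if sess and not sess.is_stale(now):
        sess.last_seen = now
        return "matched-existing"   # Case 2: packet hits the pre-outage entry
    table[five_tuple] = Session(last_seen=now)
    return "new-session"            # Case 1: old entry aged out, clean rebuild

tup = ("10.0.0.5", 5246, "10.9.9.9", 5246, "udp")  # e.g. a CAPWAP control flow
handle_packet(tup, now=0.0)     # pre-outage session -> "new-session"
handle_packet(tup, now=20.0)    # traffic back after 20s -> "matched-existing"
handle_packet(tup, now=1820.0)  # traffic back after 30min -> "new-session"
```

The second call is the Case 2 hazard: the returning packet reuses stale state instead of building a fresh session against the current network reality.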
Case 1: Failover Back After 30 Minutes (UDP Sessions Timed Out)
In this scenario, the DcA outage exceeds the default UDP timeout (30 seconds of inactivity), allowing all pre-outage sessions on DcA FWs to age out and flush from the session table.
Expected Behavior
- Clean Slate on DcA Recovery:
  - DcA-fw01 resumes as active (the HA pair re-syncs configs but finds empty session tables, with no stales).
  - Incoming traffic from clients (e.g., APs, IP phones) arrives with no matching entries.
  - PAN-OS creates fresh sessions based on the current 5-tuple (source/destination IP/port, protocol), applying policies, NAT, and inspections anew.
- UDP App Recovery:
  - CAPWAP (Cisco APs): APs, having re-associated on DcB, send new heartbeat/join requests. DcA-fw01 pinholes the tunnel immediately; no conflicts.
  - GRE/IPsec-ESP-UDP (Aruba APs): Tunnels re-establish after the keepalive timeout; a quick ~10-30s flap.
  - SIP (IP Phones): Phones re-register seamlessly; mid-call media resumes if the PBX side persists, or restarts cleanly.
- Overall Impact: Negligible at the FW layer (<1s per flow). Total recovery mirrors the initial DcB failover: app-dependent (seconds to minutes). No "ghosting" or policy mismatches (e.g., "vsys1+intrazone-default").
Potential Edge Cases
- If custom UDP timeouts are longer (e.g., 3600s via app overrides), a 30-minute outage might not fully flush—treat as Case 2.
- Asymmetric routing during reconvergence could drop initial packets; monitor with `show session all filter application eq sip`.
Case 2: Failover Back Within 30 Seconds (Stale Sessions Persist)—The Chaos Scenario
Here, the outage is brief, so pre-outage UDP sessions remain active in DcA-fw01/fw02’s tables when traffic returns. This case introduces the most chaos due to stale state interference, often resulting in unpredictable app behaviors and requiring manual intervention.
Expected Behavior
- Stale Session Conflicts:
  - DcA-fw01 receives packets matching pre-outage 5-tuples, hitting stale entries.
  - Forwarding occurs based on old state (e.g., prior NAT/port mappings), but remote endpoints (e.g., WLC/PBX) may have timed out or re-negotiated on DcB, causing:
    - One-way traffic: Replies don't match stale pinholes, leading to drops.
    - Duplicate sessions: If apps re-initiated on DcB (new ports), DcA creates parallel entries, risking resource contention or policy loops.
- UDP App Disruptions:
  - CAPWAP: APs detect the mismatch and flap tunnels (10-60s); clients de-associate.
  - GRE/IPsec-ESP-UDP: Keepalives fail against stale entries, triggering re-registration (seconds to minutes).
  - SIP: Signaling/media packets route via old state, causing jitter, call drops, or re-register loops (30-180s).
- Overall Impact: Potential 30s-10min disruption on recovery, worse than the initial failover. Symptoms: traffic stuck in "aged-out" state or hitting "vsys1+intrazone-default", plus app flaps. This is the classic "ghosting" pain point, where UDP's lack of inherent teardown amplifies the mess.
Detection
- CLI: `show session all filter application eq capwap` reveals stale entries with high byte counts but no recent activity.
- Logs: Threat/traffic logs flag "session mismatch" or incomplete flows.
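To act on these signals in bulk, a small script can flag likely-stale candidates. The record shape below (`id`/`app`/`idle_sec`/`bytes`) is a hypothetical parse of the CLI/API output, not a documented schema; adapt the field names to whatever your collector actually emits:

```python
UDP_TIMEOUT = 30  # PAN-OS default UDP inactivity timeout, seconds

def stale_candidates(sessions: list[dict], timeout: int = UDP_TIMEOUT) -> list[dict]:
    # High byte counts but long idle time is the classic "ghost" signature:
    # the session carried real traffic before the outage, then went silent.
    return [s for s in sessions if s["idle_sec"] > timeout and s["bytes"] > 0]

sessions = [
    {"id": 101, "app": "capwap", "idle_sec": 95, "bytes": 884_213},  # likely ghost
    {"id": 102, "app": "sip",    "idle_sec": 2,  "bytes": 10_400},   # live flow
]
[s["id"] for s in stale_candidates(sessions)]  # -> [101]
```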
Palo Alto Recommendations for Handling These Cases
Palo Alto Networks emphasizes proactive design and automation for multi-site HA, as session sync is intra-pair only (via HA2 link). From PAN-OS documentation (e.g., HA Concepts for Multi-Site Deployments) and KB articles on UDP ghosting:
General Best Practices
- Session Table Management:
  - Monitor via `show session info` and SNMP counters; set alerts for high stale ratios.
  - Use `clear session all` (or filtered, e.g., `clear session all filter application eq sip`) post-recovery for immediate cleanup, ideal for Case 2.
- UDP-Specific Mitigations:
  - Enable `set session teardown-upon-fwd-zonechange yes` (PAN-OS 11.2+): auto-tears down UDP sessions on detected path/zone shifts (e.g., DcB→DcA). Addresses stales in Case 2 with minimal side effects (brief UDP blips); verify with `show session info`.
  - Tune timeouts judiciously: the default 30s suits quick recovery (Case 1); override per-app (Objects > Applications > Override) for persistence without excess staleness.
- Cross-DC Design:
  - App Resilience: Select UDP apps with robust keepalives/retry (e.g., SIP with short re-register timers). For tunnels, enable app-layer failover.
  - Routing Symmetry: Use PBF or OSPF to enforce bidirectional paths; adjust costs for hysteresis (e.g., a higher return threshold) to avoid flap-induced stales.
  - Automation: Script recovery via the XML API (e.g., auto-clear on HA state change) or integrate with orchestration tools like Ansible/Orchestrator.
- Testing and Monitoring:
  - Simulate full-site outages/recoveries quarterly; measure UDP flap times.
  - Leverage Panorama for centralized logging; correlate with `flow_basic` counters for stale detection.
- Advanced Options:
  - For stretched L2 (rare): HA over distance with dedicated HA links, but not recommended for multi-DC due to latency.
  - Consider GlobalProtect or Prisma Access for app-level state offload in hybrid setups.
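The automation idea above can be prototyped against the PAN-OS XML API. A minimal sketch that builds the `op` request mirroring `clear session all filter application sip`; the host and key are placeholders, and the exact XML command shape should be validated against your PAN-OS version's API browser before firing it from an HA-event hook:

```python
import urllib.parse

def build_clear_session_url(host: str, api_key: str, application: str) -> str:
    # XML rendering of `clear session all filter application <app>`
    # (verify this shape in the firewall's API browser; it is an assumption here).
    cmd = (
        "<clear><session><all><filter>"
        f"<application>{application}</application>"
        "</filter></all></session></clear>"
    )
    query = urllib.parse.urlencode({"type": "op", "cmd": cmd, "key": api_key})
    return f"https://{host}/api/?{query}"

url = build_clear_session_url("dca-fw01.example.net", "REDACTED", "sip")
# A recovery script would GET this URL (e.g., via requests) only after
# confirming DcA is active again, to flush stale Case 2 sessions in one shot.
```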
In both cases, the key is balancing timeout aggression with app tolerance—shorter for fast recovery (Case 1), paired with teardown for stales (Case 2).
Key Takeaways
Multi-DC HA excels for site resilience but demands app-aware session handling. Case 1 offers “forgiving” recovery; Case 2 (the chaos) underscores the need for proactive clears or teardown features to avert UDP pitfalls. By aligning PAN-OS tools with your OSPF dynamics, disruptions shrink to sub-minute norms.
Encountered cross-DC flap in your env? Share configs or symptoms below—we’re all in this packet chase together.