Troubleshooting Palo Alto Active-Active Firewall Failover: Resolving Stuck Sessions in "vsys1+intrazone-default" (Part 1)

In enterprise network environments, high-availability configurations are essential for maintaining service continuity. Recently, during the implementation of two Palo Alto PA-1420 firewalls in active-active mode, we encountered an issue during failover testing that impacted specific application traffic. By configuring OSPF with differentiated costs to preferentially route traffic to Firewall 01, we aimed to simplify monitoring and troubleshooting. However, failover simulations revealed disruptions in key wireless and VoIP sessions. This post details the problem, root cause, resolution, and preventive measures, drawing from hands-on experience to assist IT professionals managing similar setups.

Update (October 26, 2025): A follow-up query regarding the PA-1420 model on PAN-OS 11.2.4-h7 highlights a related known issue with UDP session “ghosting” after failover events (Palo Alto enhancement PAN-244348). This can manifest as stuck sessions, potentially leading to policy resolution errors like “vsys1+intrazone-default” in active-active HA setups with dynamic routing changes (e.g., OSPF failover). While not explicitly listed as a known issue in the 11.2.4 release notes, deployments on this version may still encounter it if the mitigation is not applied.

To prevent recurrence, enable the session teardown feature via CLI:

set session teardown-upon-fwd-zonechange yes

Verify with:

show session info

This enhancement, available in PAN-OS 11.2 and later, automatically tears down affected UDP sessions during forwarding path or zone changes, eliminating the need for manual clears. For full details, refer to Palo Alto KB: UDP sessions stuck after failover.

Configuration Overview: Active-Active HA with OSPF Traffic Steering

The deployment utilized two Palo Alto PA-1420 firewalls in an active-active high-availability (HA) configuration to enable load balancing and redundancy. To ensure predictable traffic flow for easier log analysis:

OSPF costs were adjusted on upstream routers, assigning lower costs to paths leading to Firewall 01 and higher costs to Firewall 02.
This directed the majority of traffic to Firewall 01 under normal conditions, while allowing failover to Firewall 02 during outages or maintenance.
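The steering behaviour described above can be modelled in a few lines. This is only a sketch of lowest-cost next-hop selection, not router configuration; the hop names `fw01`/`fw02` and the cost values 10 and 20 are illustrative assumptions, and real OSPF would load-share across equal-cost paths rather than pick one.

```python
from typing import Dict, FrozenSet, Optional


def preferred_next_hop(
    path_costs: Dict[str, int],
    down: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return the reachable next hop with the lowest cost, or None if all are down."""
    # Drop any hop whose path has been withdrawn (e.g. interfaces shut on Firewall 01).
    candidates = {hop: cost for hop, cost in path_costs.items() if hop not in down}
    if not candidates:
        return None
    # Lowest cost wins; this simplified model ignores equal-cost multipath.
    return min(candidates, key=candidates.get)


# Hypothetical costs: lower toward Firewall 01, higher toward Firewall 02.
costs = {"fw01": 10, "fw02": 20}

print("normal:  ", preferred_next_hop(costs))
print("failover:", preferred_next_hop(costs, down=frozenset({"fw01"})))
```

Under normal conditions the model selects Firewall 01; once its path is marked down, the only remaining candidate is Firewall 02, which is the reroute that later exposes the stale-session problem.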

The objective was to achieve seamless failover with minimal impact on mission-critical services, including wireless access points (APs) and VoIP communications.

Failover Testing: Selective Application Disruptions

During controlled failover tests—initiated by shutting down Firewall 01’s interfaces—OSPF reconverged successfully, redirecting traffic to Firewall 02. Basic connectivity, such as ICMP pings and HTTP sessions, remained unaffected. However, certain stateful, UDP-based applications experienced outages:

Cisco APs to Wireless LAN Controller (WLC): CAPWAP tunnels failed to maintain associations, leading to client de-authentication, DHCP delays, and authentication timeouts.
Aruba APs to WLC: Tunnel establishment using GRE and IPsec-ESP-UDP protocols broke down, preventing AP registration and heartbeat signals.
iPhone SIP Traffic: SIP signaling for VoIP calls dropped, resulting in failed call setups and no audio paths.

These issues were isolated to long-lived UDP flows and did not affect TCP-dominant applications. Initial diagnostics confirmed intact HA synchronization, correct policy matches on Firewall 02, and no routing anomalies.

Reversing the failover restored functionality, but repeated tests reproduced the symptoms consistently.

Root Cause Analysis: Stale Sessions and VSYS Policy Mismatches

Further investigation focused on the session tables on Firewall 02 during failover. We filtered the session table by application, one filter per command:

show session all filter application capwap
show session all filter application gre
show session all filter application ipsec-esp-udp
show session all filter application sip

The output showed sessions stuck with policies resolving to “vsys1+intrazone-default”.

In Palo Alto Networks terminology:

“Intrazone-default” refers to the implicit intra-zone allow rule.
The “vsys1+” prefix indicates a potential multi-vsys context mismatch or residual state from HA session synchronization.

These were not fresh sessions but remnants of pre-failover states that persisted post-sync. In active-active setups with OSPF-driven rerouting, the aggressive session mirroring can lead to incomplete state flushing for UDP applications like CAPWAP, GRE/IPsec-ESP-UDP, and SIP, causing packets to be trapped in an unresolved policy context.

Resolution: Session Table Clearance

To mitigate the issue, we executed a targeted session purge on Firewall 02:

clear session all

This command cleared all active sessions, forcing connections to re-establish. A full clear was acceptable in our test environment; in production, a more granular filter (e.g., clear session all filter application capwap) would limit the disruption to the affected applications.
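For the granular approach, the per-application commands can be generated rather than typed by hand. The sketch below builds the CLI strings for the four applications from our tests, plus an equivalent XML API operational request. The XML nesting is an assumption that it mirrors the CLI hierarchy, and nothing is sent to a firewall here; confirm both the filter syntax (CLI tab-completion) and the XML form on your PAN-OS version before relying on them.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Applications that showed stuck sessions in this post's failover tests.
AFFECTED_APPS = ["capwap", "gre", "ipsec-esp-udp", "sip"]


def cli_clear_command(app: str) -> str:
    """CLI form of a per-application session clear."""
    return f"clear session all filter application {app}"


def xml_op_command(app: str) -> str:
    """XML form of the same operation (assumed to mirror the CLI hierarchy)."""
    return (
        "<clear><session><all><filter>"
        f"<application>{escape(app)}</application>"
        "</filter></all></session></clear>"
    )


def api_query(app: str) -> str:
    """Query string for an XML API op request; the API key is supplied separately."""
    return urlencode({"type": "op", "cmd": xml_op_command(app)})


for app in AFFECTED_APPS:
    print(cli_clear_command(app))
    print("  /api/?" + api_query(app))
```

Keeping the application list in one place makes it easy to review exactly which flows a clear will touch before running it in production.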

Immediate Outcomes:

Cisco APs re-associated via CAPWAP within seconds.
Aruba APs rebuilt GRE and IPsec-ESP-UDP tunnels successfully.
iPhone SIP sessions normalized, enabling call functionality.
No further traffic was observed in “vsys1+intrazone-default.”

Subsequent failover tests, post-clearance, completed without incidents, confirming the resolution.

Key Takeaways and Best Practices

This experience highlights the interplay between HA session synchronization and dynamic routing in active-active configurations. To prevent recurrence:

Proactive Session Management: Implement scheduled drains before planned failovers. For PAN-OS 11.2+, enable set session teardown-upon-fwd-zonechange yes to auto-teardown stuck UDP sessions during path changes.
Deeper Dive on the Command: This CLI setting addresses a common pain point in failover scenarios where UDP sessions become “ghosted” or stuck due to changes in the forwarding plane, such as zone shifts triggered by Policy-Based Forwarding (PBF) failures, path monitoring issues on static routes, or HA-driven rerouting like our OSPF failover.
Purpose: It instructs the firewall to automatically terminate affected UDP sessions upon detecting a forwarding plane zone change. This prevents lingering sessions from interfering with new traffic flows, which could otherwise trap packets in mismatched policy states (e.g., “vsys1+intrazone-default”). Without it, synced HA states might not fully flush, leading to disruptions in UDP-heavy apps like VoIP (SIP), wireless tunnels (CAPWAP/GRE/IPsec), or DNS.
How It Works: When a zone or path change occurs in the dataplane, the firewall evaluates active UDP sessions and tears down those impacted by the shift. This forces clean re-establishment of sessions on the new path, ensuring policy re-resolution and state consistency. It’s selective to UDP because these flows lack TCP’s inherent teardown mechanisms.
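When to Enable It: Ideal for environments with dynamic routing (OSPF/BGP), multi-path setups, or frequent HA tests, especially if UDP apps show post-failover symptoms like call drops or tunnel flaps. Enable it globally via the CLI. Operational-mode set session commands are typically not persistent across reboots, so check whether your PAN-OS version needs a configuration-mode equivalent and a commit to make the setting stick.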
Verification: Post-enable, run show session info to confirm the setting (look for “teardown-upon-fwd-zonechange: yes”). During tests, monitor traffic logs for sessions ended with “unknown reason” (indicating teardown) and cross-check system logs for zone change events. If issues persist, consider XML API scripting for automated clears as a fallback.
This feature has been a game-changer in our lab, reducing manual interventions by proactively handling the “ghosting” we encountered.
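OSPF Optimization: Keep convergence as fast as the platform allows by tuning hello/dead intervals, and keep cost differentials moderate (e.g., 10 vs. 20) to balance steering and responsiveness. Standard hello/dead timers alone will not give sub-second convergence, so measure the actual reroute time during tests.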
HA Configuration Validation: Verify multi-vsys alignments and enable session owner synchronization if using virtual systems.
Comprehensive Testing: Incorporate application-specific scripts simulating CAPWAP, GRE/IPsec-ESP-UDP, and SIP traffic alongside standard connectivity checks.
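One way to approach the application-specific testing above is a probe that holds a single long-lived UDP flow open across the failover and records when replies stop and resume. The sketch below does not emulate CAPWAP, GRE, or SIP; it only reproduces the property that mattered in our case, a UDP flow with a fixed 5-tuple. It assumes a UDP echo responder on the far side of the firewalls, which was not part of the original setup, and the target address and ports are placeholders.

```python
import socket
import time
from typing import List, Tuple

ProbeResult = Tuple[float, bool]  # (send time in seconds, reply received)


def probe(target: Tuple[str, int], source_port: int, count: int,
          interval: float = 1.0, timeout: float = 0.5) -> List[ProbeResult]:
    """Send `count` datagrams on one UDP flow and record whether each was echoed."""
    results: List[ProbeResult] = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", source_port))  # fixed source port keeps a single 5-tuple
        sock.settimeout(timeout)
        for seq in range(count):
            payload = str(seq).encode()
            sent = time.monotonic()
            sock.sendto(payload, target)
            try:
                data, _ = sock.recvfrom(1024)
                ok = data == payload  # a late echo of an older probe counts as a miss
            except OSError:  # covers timeouts and ICMP-triggered errors
                ok = False
            results.append((sent, ok))
            time.sleep(max(0.0, interval - (time.monotonic() - sent)))
    return results


def outage_windows(results: List[ProbeResult]) -> List[Tuple[float, float]]:
    """Collapse consecutive failed probes into (first_failure, last_failure) windows."""
    windows: List[Tuple[float, float]] = []
    start = last = None
    for sent, ok in results:
        if not ok:
            if start is None:
                start = sent
            last = sent
        elif start is not None:
            windows.append((start, last))
            start = None
    if start is not None:
        windows.append((start, last))
    return windows


# Synthetic results standing in for a real run: two probes lost mid-test.
sample = [(0.0, True), (1.0, False), (2.0, False), (3.0, True)]
print(outage_windows(sample))
```

In a real test, `probe(("192.0.2.10", 7000), source_port=40000, count=120)` (placeholder values) would run while Firewall 01's interfaces are shut. A window that never closes suggests the flow is stuck on a stale session rather than simply waiting for OSPF to reconverge.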

Palo Alto’s active-active HA is robust for scalable environments, but attention to session state during routing changes is critical. For deeper insights, refer to the Palo Alto Networks documentation on HA session synchronization.

If you’ve encountered similar challenges in Palo Alto HA deployments or OSPF-integrated firewalls, we encourage sharing experiences in the comments to foster collective IT knowledge.
