A Deep Dive into Dead Peer Detection and NAT-T in IPsec VPNs

What is Dead Peer Detection (DPD)?

Dead Peer Detection (DPD) is a mechanism used in IPsec VPNs to determine whether the remote VPN peer (the other endpoint of the VPN tunnel) is still reachable and responsive. IPsec VPNs rely on Security Associations (SAs) to manage encryption and authentication, but if one peer becomes unreachable—due to network issues, a device reboot, or a configuration change—the other peer might not immediately know. This can lead to “stale” SAs, where one side thinks the tunnel is still active while the other side has dropped it. DPD addresses this by periodically checking the peer’s status and taking action if the peer is deemed “dead.”

Purpose: To detect if the remote peer is unresponsive and, if so, to tear down the stale SA and optionally attempt to reestablish the tunnel.
How It Works: DPD sends periodic “hello” messages (called DPD R-U-THERE messages) to the peer. If the peer doesn’t respond after a certain number of attempts, it’s considered dead.

How DPD Works in Detail

DPD operates at the IKE (Internet Key Exchange) level, which is responsible for setting up and managing the IPsec SAs. Here’s the step-by-step process:

DPD Initiation:
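- One peer (e.g., the initiator of the VPN tunnel) sends an IKE informational message called an “R-U-THERE” message to the other peer. (R-U-THERE and R-U-THERE-ACK are the IKEv1 notifications defined in RFC 3706; IKEv2 performs the same liveness check with an empty INFORMATIONAL exchange.)
- This message is sent over the IKE SA (Phase 1 SA) using the same ports as IKE (UDP 500, or UDP 4500 when NAT-T is in use).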
Peer Response:
- The receiving peer, if alive, responds with an “R-U-THERE-ACK” message to confirm it’s still active.
- This response indicates that the peer is reachable and the IKE SA is still valid.
Timeout and Retry Logic:
- The initiating peer waits for a response within a configured timeout period (e.g., 10 seconds).
- If no response is received, the peer retries the R-U-THERE message a set number of times (e.g., 3 retries).
- Total timeout = interval × number of retries (e.g., 10 seconds × 3 retries = 30 seconds).
Action on Failure:
- If the peer fails to respond after all retries, it’s declared “dead.”
- The initiating peer can take one of several actions, depending on the configuration:
  - Clear: Tear down the IKE SA and all associated IPsec SAs (Phase 2 SAs), effectively dropping the tunnel.
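  - Hold: Remove the dead SA but keep the tunnel policy in place so that the next matching packet triggers a fresh negotiation on demand (in strongSwan, this is done by installing a trap policy).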
  - Restart: Tear down the SA and immediately attempt to reestablish the tunnel by initiating a new IKE negotiation.
Periodic Checks:
- DPD messages are sent at a configured interval (e.g., every 10 seconds) to continuously monitor the peer’s status.
- Some implementations use “on-demand” DPD, where checks are only sent if there’s no recent traffic, to reduce overhead.
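
The timeout and retry logic above can be sketched as a small simulation. This is a minimal sketch, not any vendor's implementation: `DpdConfig`, `run_dpd`, and the `peer_alive` callback are hypothetical names introduced for illustration, and the model assumes one probe per interval with the peer declared dead one interval after the last unanswered probe.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DpdConfig:
    interval: float = 10.0   # seconds between R-U-THERE probes
    max_misses: int = 3      # unanswered probes tolerated before "dead"
    action: str = "restart"  # "clear", "hold", or "restart"


def run_dpd(cfg: DpdConfig, peer_alive: Callable[[float], bool],
            horizon: float) -> List[Tuple[float, str]]:
    """Simulate DPD probing up to `horizon` seconds and return a timeline."""
    events: List[Tuple[float, str]] = []
    misses = 0
    t = 0.0
    while t <= horizon:
        events.append((t, "send R-U-THERE"))
        if peer_alive(t):
            events.append((t, "recv R-U-THERE-ACK"))
            misses = 0  # any answer resets the failure counter
        else:
            misses += 1
            if misses >= cfg.max_misses:
                # The wait for the last probe expires one interval later.
                events.append((t + cfg.interval, f"peer dead -> {cfg.action}"))
                return events
        t += cfg.interval
    return events


# A peer that never answers: probes at 0s, 10s, 20s, declared dead at 30s.
for when, what in run_dpd(DpdConfig(), lambda t: False, horizon=60.0):
    print(f"t={when:>4.0f}s  {what}")
```

With the defaults, the last timeline entry is the "peer dead" event at 30 seconds, which matches interval × retries = 10 × 3.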

Why DPD is Important

Prevents Stale SAs: Without DPD, a peer might continue sending traffic over a tunnel that the other side has already dropped, leading to packet loss.
Improves Recovery: DPD allows the VPN to quickly detect a dead peer and take action (e.g., reestablish the tunnel), reducing downtime.
Handles Network Issues: In scenarios with unstable networks (e.g., a remote site with a dynamic IP), DPD ensures the tunnel can recover from temporary disruptions.

DPD in Action: Example Scenario

Let’s consider your scenario: a remote site with a dynamic IP behind a NAT router, connecting to a core site via an IPsec VPN.

Setup:
- Remote Site VPN Gateway: Dynamic IP (e.g., 192.168.1.10 behind NAT, public IP 198.51.100.5).
- Core Site VPN Gateway: Static IP (203.0.113.1).
- DPD Settings: Interval = 10 seconds, Timeout = 30 seconds (3 retries), Action = Restart.
Scenario:
- The remote site’s NAT router reboots, causing the NAT mapping to be lost.
- The core site sends an R-U-THERE message to the remote site every 10 seconds.
- After the NAT router reboot, the core site’s probes no longer reach the remote gateway because the NAT mapping is gone, so no R-U-THERE-ACK comes back.
- The core site retries 3 times (10 seconds each), totaling 30 seconds.
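- After 30 seconds with no response, the core site declares the remote site “dead,” tears down the IKE SA, and attempts to reestablish the tunnel.
- The core site cannot reach a gateway behind NAT until a new mapping exists, so recovery usually completes when the remote site (ideally running DPD itself) sends IKE traffic outbound again. Once the new NAT mapping is in place, the new IKE negotiation succeeds and the tunnel is restored.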

DPD Configuration Example

Here’s how to configure DPD on two common platforms:

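Cisco (IOS):
    crypto isakmp keepalive 10 3
- 10: Interval (send R-U-THERE every 10 seconds).
- 3: Retry interval in seconds, not a retry count. After a missed reply, IOS re-sends the probe every 3 seconds; Cisco’s documentation describes five missed retries before the peer is marked down, so detection here takes roughly 10 + 5 × 3 = 25 seconds rather than 30.
- Action is typically “clear” by default, but some platforms allow “restart” with additional commands.
strongSwan (Linux):
    # In ipsec.conf
    conn remote-to-core
        dpddelay=10s
        dpdtimeout=30s
        dpdaction=restart
- dpddelay=10s: Interval (send R-U-THERE every 10 seconds).
- dpdtimeout=30s: Total timeout (30 seconds, equivalent to 3 retries). This option applies to IKEv1 only; with IKEv2, strongSwan relies on its normal retransmission timeouts.
- dpdaction=restart: Restart the tunnel if the peer is dead.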

Text Diagram: DPD Process

Here’s a text-based diagram showing the DPD process between the remote and core sites:

[Remote Site VPN Gateway] ---------------- [Internet] ---------------- [Core Site VPN Gateway]
(Dynamic IP: 192.168.1.10)                 |                      (Static IP: 203.0.113.1)
(Public IP: 198.51.100.5 via NAT)          |                      (DPD: 10s interval, 30s timeout)

1. Core sends R-U-THERE (t=0s) ------------> |
2. No response (NAT mapping lost)            |
3. Core retries R-U-THERE (t=10s) ---------> |
4. No response                               |
5. Core retries R-U-THERE (t=20s) ---------> |
6. No response                               |
7. Core declares peer dead (t=30s)           |
8. Core tears down SA and restarts ---------> |
9. Remote responds to new IKE negotiation <---|
10. Tunnel reestablished -------------------->|

DPD Considerations

Overhead: DPD messages add a small amount of network overhead, but this is negligible in most cases.
False Positives: If the interval and timeout are too short, temporary network glitches might cause the peer to be falsely declared dead.
Asymmetry: DPD can be configured on one side only, but it’s best to enable it on both peers for consistent behavior.
On-Demand DPD: Some implementations (e.g., Cisco) support on-demand DPD, where messages are sent only if there’s no recent traffic, reducing unnecessary checks.
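
The difference between periodic and on-demand DPD comes down to one decision: whether to send a probe right now. Here’s a minimal sketch of that decision, following the simplified description above (probe only when nothing has been heard from the peer recently). The function name and parameters are hypothetical, and real implementations may add further conditions, such as only probing when there is outbound traffic waiting.

```python
def should_probe(now: float, last_inbound: float, last_probe: float,
                 delay: float, on_demand: bool = True) -> bool:
    """Decide whether to send an R-U-THERE probe at time `now` (seconds)."""
    if now - last_probe < delay:
        return False  # never probe more often than once per `delay`
    if on_demand:
        # Recent inbound traffic already proves the peer is alive.
        return now - last_inbound >= delay
    return True       # periodic mode: probe regardless of traffic


# Peer sent traffic 5s ago: on-demand mode skips the probe, periodic sends it.
print(should_probe(now=100, last_inbound=95, last_probe=80, delay=10))
print(should_probe(now=100, last_inbound=95, last_probe=80, delay=10,
                   on_demand=False))
```

The first call prints False and the second prints True, which is exactly the overhead saving that on-demand mode offers on a busy tunnel.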

What is NAT Traversal (NAT-T)?

NAT Traversal (NAT-T) is a technique that allows IPsec VPNs to function when one or both endpoints are behind a Network Address Translation (NAT) device, such as a home router. NAT devices modify the IP headers of packets (e.g., changing the source IP from a private to a public IP) and usually track connections by port number. Both behaviors can break IPsec: AH authenticates the IP header that NAT rewrites, and ESP has no port numbers for the NAT device to track.

Purpose: To enable IPsec VPNs to work through NAT devices by encapsulating IPsec packets in a way that NAT can handle.
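How It Works: NAT-T encapsulates ESP packets in UDP packets, which NAT devices can process without breaking the IPsec security mechanisms. AH cannot be carried this way, because it authenticates the very header fields that NAT rewrites, so NAT-T deployments use ESP.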

Why NAT Breaks IPsec

IPsec uses two main protocols for securing data:

ESP (Encapsulating Security Payload): Encrypts and authenticates the payload.
AH (Authentication Header): Authenticates the entire packet, including the immutable fields of the IP header (such as the source and destination addresses).

NAT modifies the IP header (e.g., changing the source IP from 192.168.1.10 to 198.51.100.5), which causes issues:

AH: Since AH includes the IP header in its authentication hash, any modification by NAT will cause the hash to fail, and the packet will be dropped.
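ESP: ESP doesn’t authenticate the outer IP header, so an address change alone doesn’t invalidate it. However, ESP has no port numbers, so a port-translating NAT has nothing to build a mapping on, and in transport mode the inner TCP/UDP checksums break because they cover the rewritten addresses.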
Port Issues: IKE uses UDP 500 for negotiation, but NAT devices might not maintain mappings for long periods, and ESP (IP protocol 50) isn’t a standard TCP/UDP protocol, so many NAT devices can’t handle it properly.
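
To make the AH versus ESP difference concrete, here’s a toy model of the two integrity checks. It is a sketch under loud assumptions: the packet is a plain dictionary, the key is made up, and HMAC-SHA256 stands in for whatever algorithm is negotiated. It does not reproduce the real AH or ESP wire formats; it only shows which bytes each check covers.

```python
import hashlib
import hmac
import socket

KEY = b"demo-integrity-key"  # hypothetical shared key, illustration only


def icv(data: bytes) -> bytes:
    """Integrity check value: HMAC-SHA256 over the covered bytes."""
    return hmac.new(KEY, data, hashlib.sha256).digest()


def addr_bytes(pkt: dict) -> bytes:
    """Source and destination IPv4 addresses as 8 raw bytes."""
    return socket.inet_aton(pkt["src"]) + socket.inet_aton(pkt["dst"])


def make_packet(src: str, dst: str, payload: bytes) -> dict:
    pkt = {"src": src, "dst": dst, "payload": payload}
    pkt["ah_icv"] = icv(addr_bytes(pkt) + payload)  # AH-style: covers addresses
    pkt["esp_icv"] = icv(payload)                   # ESP-style: payload only
    return pkt


def nat_rewrite(pkt: dict, public_ip: str) -> dict:
    """Model source NAT: only the source address changes."""
    return {**pkt, "src": public_ip}


def ah_valid(pkt: dict) -> bool:
    expected = icv(addr_bytes(pkt) + pkt["payload"])
    return hmac.compare_digest(pkt["ah_icv"], expected)


def esp_valid(pkt: dict) -> bool:
    return hmac.compare_digest(pkt["esp_icv"], icv(pkt["payload"]))


pkt = make_packet("192.168.1.10", "203.0.113.1", b"application data")
natted = nat_rewrite(pkt, "198.51.100.5")
print(ah_valid(pkt), esp_valid(pkt))        # True True
print(ah_valid(natted), esp_valid(natted))  # False True
```

After the address rewrite, the AH-style check fails while the ESP-style check still passes, which is why NAT-T only needs to solve the port problem for ESP.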

How NAT-T Works in Detail

NAT-T solves these issues by encapsulating IPsec packets in UDP, which NAT devices can handle. Here’s the step-by-step process:

NAT Detection During IKE Negotiation:
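- During the initial IKE exchange (Phase 1 in IKEv1, IKE_SA_INIT in IKEv2), both peers exchange NAT detection payloads: “NAT-Discovery” (NAT-D) payloads in IKEv1, or NAT_DETECTION_SOURCE_IP and NAT_DETECTION_DESTINATION_IP notifications in IKEv2.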
- These payloads contain hashes of the source and destination IP addresses and ports.
- By comparing the hashes, each peer can detect if a NAT device is present in the path:
  - If the received hash doesn’t match the expected hash (based on the observed source IP/port), a NAT device is detected.
Switch to UDP Encapsulation:
- If a NAT device is detected, both peers agree to use NAT-T.
- IKE traffic (normally on UDP 500) switches to UDP 4500.
- ESP traffic (normally IP protocol 50) is encapsulated in UDP 4500 packets (called “UDP-encapsulated ESP”).
NAT Keepalives:
- To prevent the NAT device from timing out the UDP mapping (idle UDP mappings often expire after tens of seconds to a few minutes), the peers send periodic “NAT keepalive” packets.
- These are tiny UDP packets sent on port 4500 (e.g., every 20 seconds) carrying a single byte with the value 0xFF; they exist only to keep the NAT mapping alive and are discarded by the receiver.
ESP-in-UDP Encapsulation:
- ESP packets are wrapped in a UDP header with source and destination ports set to 4500 (the NAT device may rewrite the source port in transit).
- The NAT device can now treat these packets like regular UDP traffic, maintaining the mapping and allowing the packets to traverse the NAT.
Handling Dynamic IPs:
- If the remote site’s public IP changes (e.g., due to a dynamic IP reassignment), NAT-T combined with IKEv2’s MOBIKE can update the tunnel endpoints to use the new IP.
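
The NAT detection step can be sketched in a few lines. This follows the IKEv2 form, where the notification data is a SHA-1 digest over the two SPIs, the IP address, and the port; IKEv1’s NAT-D payload has the same shape but uses the negotiated hash. The function names and SPI values below are made up for illustration, and the sketch handles IPv4 only.

```python
import hashlib
import socket
import struct


def nat_detection_hash(spi_i: bytes, spi_r: bytes, ip: str, port: int) -> bytes:
    """SHA-1 over initiator SPI, responder SPI, IPv4 address, and port."""
    return hashlib.sha1(
        spi_i + spi_r + socket.inet_aton(ip) + struct.pack("!H", port)
    ).digest()


def nat_in_path(claimed_hash: bytes, spi_i: bytes, spi_r: bytes,
                observed_ip: str, observed_port: int) -> bool:
    """True if the sender's view of its own address differs from what we saw."""
    expected = nat_detection_hash(spi_i, spi_r, observed_ip, observed_port)
    return claimed_hash != expected


spi_i = bytes.fromhex("1122334455667788")  # hypothetical initiator SPI
spi_r = bytes(8)                           # responder SPI is zero in the first message

# The remote gateway hashes the address it believes it is sending from.
claimed = nat_detection_hash(spi_i, spi_r, "192.168.1.10", 500)

# The core site hashes the source address it actually observed on the wire.
print(nat_in_path(claimed, spi_i, spi_r, "198.51.100.5", 500))   # True
print(nat_in_path(claimed, spi_i, spi_r, "192.168.1.10", 500))   # False
```

Because only hashes are exchanged, neither peer has to reveal its private address in the clear to learn that something rewrote the packet along the way.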

NAT-T in Action: Example Scenario

Let’s revisit your scenario: a remote site with a dynamic IP behind a NAT router, connecting to a core site.

Setup:
- Remote Site VPN Gateway: Private IP 192.168.1.10, behind NAT (public IP 198.51.100.5).
- Core Site VPN Gateway: Static IP 203.0.113.1.
- NAT-T Settings: Enabled, NAT keepalive interval = 20 seconds.
Scenario:
- The remote site initiates an IKE negotiation with the core site.
- During IKE_SA_INIT, both peers detect the NAT device (because the core site sees the remote site’s source IP as 198.51.100.5, not 192.168.1.10).
- IKE switches to UDP 4500, and ESP packets are encapsulated in UDP 4500.
- The remote site sends NAT keepalive packets every 20 seconds to maintain the NAT mapping.
- The IPsec tunnel is established, and traffic flows securely through the NAT device.

NAT-T Configuration Example

Here’s how to configure NAT-T on two common platforms:

Cisco (IOS):
    crypto isakmp nat keepalive 20
- 20: NAT keepalive interval (send keepalives every 20 seconds).
- NAT-T is enabled by default in most modern Cisco devices.
strongSwan (Linux):
    # In strongswan.conf
    charon {
        keep_alive = 20s
    }
- keep_alive = 20s: Send NAT keepalive packets every 20 seconds. This is a global charon setting in strongswan.conf rather than a per-connection option in ipsec.conf.
- NAT-T is enabled by default in strongSwan (uses UDP 4500 automatically if NAT is detected).

Text Diagram: NAT-T Process

Here’s a text-based diagram showing the NAT-T process between the remote and core sites:

[Remote Site VPN Gateway] --- [NAT Router] --- [Internet] --- [Core Site VPN Gateway]
(Private IP: 192.168.1.10)   (Public IP: 198.51.100.5)      (Static IP: 203.0.113.1)

1. IKE_SA_INIT (UDP 500) -----------------------> | -----------------------> |
   (NAT-D payloads detect NAT)                    |                        |
2. Switch to UDP 4500 for IKE and ESP <---------- | <----------------------- |
3. ESP-in-UDP 4500 (encrypted traffic) ---------> | -----------------------> |
4. NAT Keepalive (UDP 4500, every 20s) ---------> | -----------------------> |
5. Secure traffic flows through NAT <------------ | <----------------------- |

NAT-T Packet Structure

Here’s a simplified view of how an ESP packet is encapsulated with NAT-T:

Without NAT-T (Normal ESP):
    [IP Header (Src: 192.168.1.10, Dst: 203.0.113.1)] [ESP Header] [Encrypted Payload] [ESP Trailer] [ESP Auth]
- NAT changes the source IP to 198.51.100.5. ESP carries no port numbers, so the NAT device has nothing to key a mapping on (and AH would fail its integrity check outright).
With NAT-T (ESP-in-UDP), as seen after the NAT device:
    [IP Header (Src: 198.51.100.5, Dst: 203.0.113.1)] [UDP Header (Port 4500)] [ESP Header] [Encrypted Payload] [ESP Trailer] [ESP Auth]
- The UDP header allows the NAT device to maintain a mapping (e.g., 192.168.1.10:4500 on the inside to 198.51.100.5:4500 on the outside, for traffic to 203.0.113.1:4500).
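
Since IKE, ESP, and NAT keepalives all share UDP 4500 once NAT-T is active, the receiver has to tell them apart. Here’s a minimal sketch of the encapsulation and that demultiplexing, following the RFC 3948 conventions: IKE messages on port 4500 start with four zero bytes (the non-ESP marker), ESP packets start with a non-zero SPI, and a keepalive is a single 0xFF byte. The function names and sample bytes are illustrative, and the UDP checksum is simply left at zero.

```python
import struct

NON_ESP_MARKER = b"\x00\x00\x00\x00"  # prefixes IKE messages on UDP 4500
KEEPALIVE = b"\xff"                   # one-byte NAT keepalive payload


def udp_encapsulate_esp(esp_packet: bytes, src_port: int = 4500,
                        dst_port: int = 4500) -> bytes:
    """Prefix an ESP packet with an 8-byte UDP header (checksum left at 0)."""
    if esp_packet[:4] == NON_ESP_MARKER:
        raise ValueError("ESP SPI must be non-zero")
    length = 8 + len(esp_packet)  # UDP length covers header plus payload
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + esp_packet


def classify_4500_payload(udp_payload: bytes) -> str:
    """Tell apart what can arrive on UDP 4500 once NAT-T is active."""
    if udp_payload == KEEPALIVE:
        return "nat-keepalive"
    if udp_payload[:4] == NON_ESP_MARKER:
        return "ike"
    return "esp"


# Hypothetical ESP packet: SPI, sequence number, then 16 opaque bytes.
esp = struct.pack("!II", 0xC0FFEE01, 1) + b"\xaa" * 16
datagram = udp_encapsulate_esp(esp)
print(len(datagram) - len(esp))             # 8 bytes of UDP overhead
print(classify_4500_payload(datagram[8:]))  # esp
print(classify_4500_payload(b"\xff"))       # nat-keepalive
```

The zero-SPI check is what makes the scheme unambiguous: a valid ESP packet can never start with the four zero bytes that mark an IKE message.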

NAT-T Considerations

Performance Overhead: Encapsulating ESP in UDP adds a small overhead (8 bytes for the UDP header), but this is negligible.
Firewall Rules: Ensure firewalls allow UDP 500 (initial IKE) and UDP 4500 (NAT-T) traffic.
NAT Timeout: NAT devices time out idle UDP mappings, often after tens of seconds to a few minutes. NAT keepalives prevent this as long as the keepalive interval is shorter than the NAT device’s timeout.
Dynamic IPs: NAT-T works well with dynamic IPs, especially when combined with IKEv2’s MOBIKE, which can update the tunnel endpoints if the public IP changes.
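
The relationship between the NAT timeout and the keepalive interval can be checked with a small model. This is a sketch with assumed numbers: the 60-second UDP timeout is a placeholder rather than a value from any particular router, and `mapping_held` is a hypothetical helper that treats every keepalive as a refresh of the mapping.

```python
from typing import Optional


def mapping_held(udp_timeout: float, idle: float,
                 keepalive: Optional[float] = None) -> bool:
    """True if the NAT mapping stays valid for the whole idle period."""
    last_refresh = 0.0  # mapping created at t=0 by real tunnel traffic
    if keepalive:
        t = keepalive
        while t <= idle:
            if t - last_refresh >= udp_timeout:
                return False  # mapping expired before this keepalive arrived
            last_refresh = t
            t += keepalive
    return idle - last_refresh < udp_timeout


# Five idle minutes behind a NAT with an assumed 60-second UDP timeout.
print(mapping_held(udp_timeout=60, idle=300))                # False
print(mapping_held(udp_timeout=60, idle=300, keepalive=20))  # True
print(mapping_held(udp_timeout=60, idle=300, keepalive=90))  # False
```

The third case shows why the interval matters: keepalives sent less often than the NAT timeout arrive after the mapping has already expired.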

How DPD and NAT-T Work Together

In your scenario, DPD and NAT-T are both critical for maintaining a stable IPsec VPN:

NAT-T: Ensures the IPsec tunnel can traverse the NAT router at the remote site by encapsulating traffic in UDP 4500 and using NAT keepalives to maintain the mapping.
DPD: Monitors the health of the tunnel by checking if the core site is still reachable. If the NAT mapping is lost (e.g., due to a router reboot), DPD will detect the failure and attempt to reestablish the tunnel.

Combined Example Scenario

Setup: Same as above (remote site behind NAT, core site with static IP).
Event: The NAT router at the remote site reboots, causing the NAT mapping to be lost.
NAT-T Role: Before the reboot, NAT-T ensures the tunnel works through the NAT device, and NAT keepalives (every 20 seconds) maintain the mapping.
DPD Role: After the reboot, the core site sends R-U-THERE messages every 10 seconds. After 30 seconds of no response, it declares the remote site dead and restarts the tunnel.
Recovery: Once the remote site is back online and sends IKE traffic outbound (which creates a new NAT mapping), the new IKE negotiation completes and the tunnel is reestablished, with NAT-T handling the new mapping.

Summary

Dead Peer Detection (DPD):
- Detects if the remote peer is unresponsive by sending R-U-THERE messages.
- Configurable interval (e.g., 10 seconds), timeout (e.g., 30 seconds), and action (e.g., restart).
- Critical for recovering from network issues or peer failures.
NAT Traversal (NAT-T):
- Allows IPsec to work through NAT devices by encapsulating IKE and ESP in UDP 4500.
- Uses NAT-D payloads to detect NAT, switches to UDP 4500, and sends NAT keepalives to maintain mappings.
- Essential for scenarios where one or both peers are behind NAT, especially with dynamic IPs.