01 — Why eBPF for overlay troubleshooting
Overlay networks — VXLAN, GRE, Geneve — add encapsulation layers that make traditional packet analysis painful. When a packet traverses a VXLAN tunnel between two hypervisors, tcpdump on the outer interface shows you the encapsulated UDP frame but tells you nothing about what happened to the inner packet inside the kernel. You're debugging through a wall.
eBPF changes this by letting you attach probes directly to kernel functions — the exact points where encapsulation happens, where routing decisions are made, where packets get dropped. Instead of inferring what the kernel did from what you see on the wire, you observe the kernel's decisions directly.
The scripts below are exploratory tools, not production daemons. They're designed to be dropped onto a node, run for a few minutes during an active investigation, and removed. Think of them as a specialized toolkit that sits between tcpdump (too low-level for overlay debugging) and Kentik/Prometheus (too aggregated for ad hoc RCA).
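These scripts require Linux 5.8+ with BTF (BPF Type Format) enabled, and bpftrace or bcc installed. The drop watcher additionally needs Linux 5.17+, because it reads the reason field that the kfree_skb tracepoint gained in that release. Most modern distributions (Ubuntu 22.04+, RHEL 9+) ship with BTF support in the default kernel. Verify with ls /sys/kernel/btf/vmlinux.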
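The prerequisites above can be checked in one pass before copying any script to a node. The following is a minimal sketch, not part of the original toolkit: the function names version_ge and preflight are hypothetical, and the version comparison assumes GNU sort with -V support.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version string A is >= B (GNU sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# preflight: report kernel version, BTF, and bpftrace availability.
preflight() {
    local kver
    kver="$(uname -r)"
    if version_ge "$kver" 5.8; then
        echo "kernel $kver: ok"
    else
        echo "kernel $kver: older than 5.8"
    fi
    if [ -r /sys/kernel/btf/vmlinux ]; then
        echo "BTF: ok"
    else
        echo "BTF: /sys/kernel/btf/vmlinux missing"
    fi
    if command -v bpftrace >/dev/null 2>&1; then
        echo "bpftrace: ok"
    else
        echo "bpftrace: not installed"
    fi
}

# Usage: source this file, then run: preflight
```

Running preflight needs no root privileges, so it is safe to run first on a production-adjacent host.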
02 — VXLAN encap/decap tracer
The most common overlay debugging scenario: traffic enters a VXLAN tunnel on one host and never arrives at the destination. Is it an encapsulation failure? A VNI mismatch? A routing issue on the underlay? This script traces the vxlan_xmit and vxlan_rcv kernel functions to show you exactly what's happening at each end of the tunnel.
#!/usr/bin/env bpftrace
/*
 * vxlan_trace.bt - Trace VXLAN encap/decap events
 * Shows source/dest IPs, packet length, and device for each event
 * Usage: sudo bpftrace vxlan_trace.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <net/vxlan.h>

// Transmit path: arg0 is the skb, which still holds the inner frame,
// so the addresses printed here are the inner (overlay) IPs.
// Assumes the inner packet is IPv4.
kprobe:vxlan_xmit
{
    $skb = (struct sk_buff *)arg0;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    printf("ENCAP src=%s dst=%s len=%d dev=%s\n",
        ntop($iph->saddr), ntop($iph->daddr),
        $skb->len, $skb->dev->name);
}

// Receive path: arg1 is the skb. At this point network_header still
// points at the outer IP header, so these are the underlay (VTEP) IPs.
kprobe:vxlan_rcv
{
    $skb = (struct sk_buff *)arg1;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    printf("DECAP src=%s dst=%s len=%d dev=%s\n",
        ntop($iph->saddr), ntop($iph->daddr),
        $skb->len, $skb->dev->name);
}

END
{
    printf("\nTrace complete.\n");
}
What to look for
Run this on both ends of the tunnel simultaneously. If you see ENCAP events on the source but no DECAP on the destination, the underlay is dropping the outer UDP packet — check MTU (VXLAN adds 50 bytes of overhead), firewall rules on UDP 4789, and underlay routing. If DECAP fires but the inner packet never reaches the application, the problem is post-decapsulation — likely an FDB or ARP issue in the overlay.
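The vxlan_xmit probe fires on every encapsulated packet. On a busy host, this generates significant output. Use a bpftrace predicate such as /comm == "nginx"/ to narrow it, or add rate limiting for production-adjacent troubleshooting. Keep in mind that comm is whichever task happens to be on the CPU when the probe fires, so it is only meaningful for locally generated traffic; forwarded packets usually show a kernel thread or an unrelated process.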
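One way to cut the output volume is to count events in a map and print a summary on an interval instead of printing every packet. The helper below is a sketch under assumptions: write_encap_counter is a hypothetical name, the 5-second interval is arbitrary, and the embedded bpftrace program assumes the device name can be used directly as a map key on your bpftrace version.

```shell
#!/usr/bin/env bash
# write_encap_counter PATH: write an aggregating variant of the tracer to PATH.
# Hypothetical helper; the bpftrace program inside is a sketch, not a drop-in
# replacement for vxlan_trace.bt.
write_encap_counter() {
    cat > "$1" <<'EOF'
#include <linux/skbuff.h>
#include <linux/netdevice.h>

// Count encapsulated packets per VXLAN device instead of printing each one.
kprobe:vxlan_xmit
{
    $skb = (struct sk_buff *)arg0;
    @encap_by_dev[$skb->dev->name] = count();
}

// Print and reset the counters every 5 seconds.
interval:s:5
{
    print(@encap_by_dev);
    clear(@encap_by_dev);
}
EOF
}

# Usage (needs root and bpftrace):
#   write_encap_counter /tmp/vxlan_count.bt && sudo bpftrace /tmp/vxlan_count.bt
```

Map updates stay in the kernel and only the summary crosses to user space, which is why this form tends to be gentler on a busy host than a per-packet printf.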
03 — Kernel drop watcher
When packets vanish inside the kernel, kfree_skb is the function that frees the socket buffer — and the call stack at that point tells you exactly why. This is the eBPF equivalent of Brendan Gregg's approach to understanding kernel behavior through the functions that terminate work, not just the functions that initiate it.
#!/usr/bin/env bpftrace
/*
 * drop_watch.bt - Trace kernel packet drops with stack traces
 * Identifies WHERE and WHY the kernel drops packets
 * Requires the tracepoint's reason field (Linux 5.17+)
 * Usage: sudo bpftrace drop_watch.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>

tracepoint:skb:kfree_skb
{
    $skb = (struct sk_buff *)args->skbaddr;
    $reason = args->reason;

    // Skip reason 0 and decode only IPv4 frames: args->protocol is the
    // EtherType (0x0800 for IPv4), so the iphdr cast below is meaningful.
    if ($reason > 0 && args->protocol == 0x0800) {
        $iph = (struct iphdr *)($skb->head + $skb->network_header);
        time("%H:%M:%S ");
        printf("DROP reason=%d proto=%d src=%s dst=%s\n",
            $reason, $iph->protocol,
            ntop($iph->saddr), ntop($iph->daddr));
        printf(" stack: %s\n", kstack);
    }
}

interval:s:30
{
    printf("--- still tracing (30s heartbeat) ---\n");
}
Reading the output
The kernel stack trace is the key diagnostic. If you see nf_hook_slow in the stack, a netfilter rule (iptables/nftables) is dropping the traffic — check your security policies. If you see ip_rcv_finish followed by ip_route_input_noref, it's a routing table miss. For overlay networks specifically, watch for drops in vxlan_rcv (VNI lookup failure) or udp_queue_rcv_skb (the outer UDP packet never made it to the VXLAN handler).
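The reason field maps to the kernel's skb_drop_reason enum (the SKB_DROP_REASON_* values). Names such as NOT_SPECIFIED, NO_SOCKET, and NETFILTER_DROP tell you the category of drop before you even read the stack. The numbering has shifted between kernel releases (on 5.18 and later, NOT_SPECIFIED is 1 and NO_SOCKET is 2), so decode numbers against the running kernel rather than a fixed table. The tracepoint's format file under /sys/kernel/debug/tracing/events/skb/kfree_skb/ lists the number-to-name mapping.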
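Because the numeric values depend on the kernel build, it can help to decode them on the node itself. The sketch below parses the number-to-name pairs that the tracepoint's format file prints. The function name drop_reason_name is hypothetical, and the path in the usage comment assumes tracefs is reachable under /sys/kernel/debug/tracing.

```shell
#!/usr/bin/env bash
# drop_reason_name NUM: read tracepoint "format" text on stdin and print the
# symbolic name for drop reason NUM, or UNKNOWN if it is not listed.
drop_reason_name() {
    local want="$1"
    grep -oE '\{ *[0-9]+, *"[A-Za-z0-9_]+" *\}' \
        | sed -E 's/\{ *([0-9]+), *"([^"]+)" *\}/\1 \2/' \
        | awk -v want="$want" '
            $1 == want { print $2; found = 1; exit }
            END { if (!found) print "UNKNOWN" }'
}

# Usage (reading the format file typically needs root):
#   drop_reason_name 2 < /sys/kernel/debug/tracing/events/skb/kfree_skb/format
```

Parsing the format file avoids hard-coding a table that may not match the kernel you are debugging.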
04 — Connection tracking observer
In hybrid cloud environments where overlay networks cross security zones, connection tracking is often the invisible culprit. A conntrack table that's 90% full won't show symptoms until it hits 100% and starts dropping new connections silently. This script watches conntrack events in real time and flags table pressure.
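#!/usr/bin/env bpftrace
/*
 * conntrack_watch.bt - Monitor conntrack table activity
 * Tracks new connections, deletions, and table pressure
 * Usage: sudo bpftrace conntrack_watch.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>

// Mainline kernels expose no nf_conntrack tracepoints, so count new
// entries where they are confirmed into the table. The protocol is read
// from the IPv4 header (assumes IPv4; 1 = ICMP, 6 = TCP, 17 = UDP).
kprobe:__nf_conntrack_confirm
{
    $skb = (struct sk_buff *)arg0;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    @new_conns = count();
    @by_proto[$iph->protocol] = count();
}

kprobe:nf_conntrack_destroy
{
    @destroyed = count();
}

// Alert on conntrack allocation failures (table full). The function
// returns ERR_PTR(-ENOMEM) rather than NULL, so match the error-pointer
// range. It is a static function and may be inlined on some builds.
kretprobe:__nf_conntrack_alloc
/(int64)retval < 0 && (int64)retval >= -4095/
{
    time("%H:%M:%S ");
    printf("ALERT: conntrack alloc FAILED, table likely full\n");
    printf(" comm=%s pid=%d\n", comm, pid);
    @alloc_failures = count();
}

interval:s:10
{
    time("%H:%M:%S ");
    printf("--- 10s summary ---\n");
    print(@new_conns);
    print(@destroyed);
    print(@alloc_failures);
    print(@by_proto);
    clear(@new_conns);
    clear(@destroyed);
    clear(@alloc_failures);
    clear(@by_proto);
}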
When to reach for this
Conntrack issues are sneaky in overlay environments because encapsulated traffic can create conntrack entries at both the outer (underlay) and inner (overlay) layers, so a host handling VXLAN traffic can consume roughly twice the entries you would expect from the inner flows alone. If you're seeing intermittent connection failures that don't correlate with CPU or memory pressure, run this script and watch for alloc_failures; the kernel also logs "nf_conntrack: table full, dropping packet" when this happens. The fix is usually bumping nf_conntrack_max, but this script tells you whether that's actually the problem before you start tuning blind.
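Table pressure can also be read directly from the conntrack sysctls, which is a quick cross-check before or after running the tracer. A minimal sketch follows; the function names are hypothetical, and the /proc paths exist only when the nf_conntrack module is loaded.

```shell
#!/usr/bin/env bash
# conntrack_usage_pct COUNT MAX: print table usage as an integer percentage.
conntrack_usage_pct() {
    local count="$1" max="$2"
    if [ "$max" -le 0 ]; then
        echo "max must be > 0" >&2
        return 1
    fi
    echo $(( count * 100 / max ))
}

# conntrack_report: read the live counters and print current usage.
conntrack_report() {
    local dir=/proc/sys/net/netfilter
    if [ -r "$dir/nf_conntrack_count" ] && [ -r "$dir/nf_conntrack_max" ]; then
        local count max
        count="$(cat "$dir/nf_conntrack_count")"
        max="$(cat "$dir/nf_conntrack_max")"
        echo "conntrack usage: $(conntrack_usage_pct "$count" "$max")% ($count/$max)"
    else
        echo "nf_conntrack not loaded (no $dir/nf_conntrack_count)"
    fi
}

# Usage: conntrack_report
```

A reading that sits near 100% while alloc_failures climbs is a strong hint that raising nf_conntrack_max is the right fix.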
05 — TCP retransmit tracker
TCP retransmits inside overlay tunnels are the silent performance killer. The application sees latency spikes; the network team sees clean counters on the physical interfaces. The retransmit is happening at the inner (overlay) layer, invisible to standard interface monitoring. This script is a targeted adaptation of the approach used in BCC's tcpretrans tool, with per-destination and per-port counters so you can match retransmits to overlay endpoints.
#!/usr/bin/env bpftrace
/*
 * overlay_retrans.bt - TCP retransmit tracker for overlay troubleshooting
 * Correlates retransmits with overlay tunnel endpoints
 * Usage: sudo bpftrace overlay_retrans.bt
 */

#include <linux/tcp.h>
#include <net/sock.h>

tracepoint:tcp:tcp_retransmit_skb
{
    $sk = (struct sock *)args->skaddr;

    // Addresses come from the socket (IPv4 assumed). Ports come from the
    // tracepoint, which already converts them to host byte order; the
    // socket fields are in network byte order and would print swapped.
    $saddr = ntop($sk->__sk_common.skc_rcv_saddr);
    $daddr = ntop($sk->__sk_common.skc_daddr);

    time("%H:%M:%S ");
    printf("RETRANS %s:%d -> %s:%d state=%d\n",
        $saddr, args->sport, $daddr, args->dport,
        $sk->__sk_common.skc_state);

    @retrans_by_dest[$daddr] = count();
    @retrans_by_port[args->dport] = count();
    @total = count();
}

tracepoint:tcp:tcp_retransmit_synack
{
    printf("SYNACK RETRANS (handshake reply not acknowledged)\n");
    @synack_retrans = count();
}

interval:s:30
{
    printf("\n--- 30s retransmit summary ---\n");
    print(@total);
    print(@retrans_by_dest);
    print(@retrans_by_port);
    print(@synack_retrans);
    clear(@total);
    clear(@retrans_by_dest);
    clear(@retrans_by_port);
    clear(@synack_retrans);
}
The MTU trap
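If handshakes complete but retransmits cluster on large data segments, the most likely culprit is an MTU mismatch at the tunnel boundary. VXLAN adds 50 bytes of overhead (20 outer IP + 8 UDP + 8 VXLAN header + 14 inner Ethernet). If the underlay MTU is the standard 1500 and the overlay isn't accounting for this, the SYN and SYNACK get through (both are small), but full-size data segments exceed the effective MTU and are silently dropped when the ICMP fragmentation-needed replies don't make it back to the sender. Heavy SYNACK retransmits point somewhere else: the handshake itself is failing, typically because the SYNACK or the final ACK is being dropped by a filter or an asymmetric path. Small packets passing while full-size segments retransmit is the MTU fingerprint.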
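The overhead arithmetic is easy to get wrong under pressure, so here it is as a small worked example. This is a sketch assuming VXLAN over an IPv4 underlay; the function names are hypothetical, and the ping target in the usage comment is the lab's overlay address.

```shell
#!/usr/bin/env bash
# Per-packet VXLAN overhead counted against the underlay IP MTU:
# outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14) = 50 bytes.
VXLAN_OVERHEAD=$(( 20 + 8 + 8 + 14 ))

# overlay_mtu UNDERLAY_MTU: largest inner IP packet that fits unfragmented.
overlay_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# ping_payload MTU: largest ICMP echo payload for an MTU
# (subtracts the 20-byte IPv4 header and the 8-byte ICMP header).
ping_payload() {
    echo $(( $1 - 28 ))
}

# Usage: probe the overlay path with the DF bit set (Linux iputils ping).
#   ping -M do -c 3 -s "$(ping_payload "$(overlay_mtu 1500)")" 172.16.0.2
```

With a 1500-byte underlay, a payload of 1422 bytes should pass over the overlay and 1423 should fail, which gives a quick confirmation of the effective MTU.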
06 — Security policy violations
In zero trust environments, netfilter (iptables/nftables) and CNI network policies both make drop decisions that are logged poorly, if at all. This script traces the nf_hook_slow path where netfilter verdicts are rendered, giving you real-time visibility into which netfilter hook paths are dropping traffic and in which task context.
#!/usr/bin/env bpftrace
/*
 * policy_drops.bt - Trace netfilter/security policy drops
 * Shows which hook paths are dropping traffic in real time
 * Usage: sudo bpftrace policy_drops.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/netfilter.h>

// nf_hook_slow() returns 1 when the packet should continue, 0 when it was
// stolen or queued, and a negative errno (-EPERM by default) when a hook
// returned NF_DROP. retval is 64-bit, so cast to the int return type.
kretprobe:nf_hook_slow
/(int32)retval < 0/
{
    time("%H:%M:%S ");
    printf("NF_DROP comm=%s pid=%d\n", comm, pid);
    printf(" stack: %s\n", kstack);
    @drops_by_comm[comm] = count();
    @drop_stacks[kstack] = count();
}

interval:s:30
{
    printf("\n--- 30s policy drop summary ---\n");
    print(@drops_by_comm);
    printf("\nTop drop stacks:\n");
    print(@drop_stacks, 3);
    clear(@drops_by_comm);
    clear(@drop_stacks);
}
Debugging Kubernetes network policies
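When a Kubernetes CNI enforces a NetworkPolicy, the enforcement mechanism is typically iptables rules (Calico's default dataplane) or eBPF programs (Cilium). Only the iptables and nftables case shows up here as a netfilter verdict; drops made by eBPF programs attached at tc or XDP never pass through nf_hook_slow, so use the CNI's own tooling for those. The @drop_stacks aggregation groups drops by their kernel call stack. A cluster of drops with the same stack signature tells you which hook point is blocking traffic: ip_forward in the stack points at the FORWARD chain, ip_local_deliver at INPUT. Cross-reference with iptables -L -v --line-numbers and look for rules whose packet counters are climbing to identify the policy.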
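The cross-referencing step can be narrowed by listing only the DROP and REJECT rules that have actually matched packets. The filter below is a sketch that assumes the column layout of iptables -L -v -n --line-numbers (num, pkts, bytes, target, ...); active_drop_rules is a hypothetical name.

```shell
#!/usr/bin/env bash
# active_drop_rules: read `iptables -L -v -n --line-numbers` output on stdin
# and print DROP/REJECT rules whose packet counter is non-zero.
active_drop_rules() {
    awk '
        /^Chain / { chain = $2; next }
        ($4 == "DROP" || $4 == "REJECT") && $2 != "0" {
            printf "%s rule %s: %s packets (%s)\n", chain, $1, $2, $4
        }'
}

# Usage (needs root):
#   sudo iptables -L -v -n --line-numbers | active_drop_rules
```

Running it twice a few seconds apart and comparing the counts shows which rule is matching the traffic you are chasing.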
07 — Overlay latency histogram
The final piece: measuring the overhead that the overlay adds. This script measures the time spent inside vxlan_xmit, from function entry to return, which covers encapsulation plus the synchronous hand-off of the encapsulated packet to the underlay transmit path. The result is a histogram of encapsulation latency: the tax your overlay is charging per packet.
#!/usr/bin/env bpftrace
/*
 * overlay_latency.bt - Histogram of VXLAN encap latency
 * Measures the per-packet cost of overlay encapsulation
 * Usage: sudo bpftrace overlay_latency.bt
 */

kprobe:vxlan_xmit
{
    @start[tid] = nsecs;
}

kretprobe:vxlan_xmit
/@start[tid]/
{
    $delta = nsecs - @start[tid];
    @encap_latency_ns = hist($delta);
    @avg_ns = avg($delta);
    delete(@start[tid]);
}

interval:s:10
{
    printf("\n--- VXLAN encap latency (10s window) ---\n");
    print(@encap_latency_ns);
    print(@avg_ns);
    clear(@encap_latency_ns);
    clear(@avg_ns);
}

END
{
    clear(@start);
}
Interpreting the histogram
Healthy VXLAN encapsulation typically completes in 2–10 microseconds. If you're seeing a bimodal distribution — most packets fast, but a tail at 50+ microseconds — the likely cause is cache misses in the FDB (forwarding database) lookup. The VXLAN driver looks up the destination VTEP for each inner MAC address; a cold or oversized FDB forces memory fetches that dominate the encap path. The histogram makes this bimodal pattern visible where an average alone would hide it.
These scripts scratch the surface. For production observability, consider exporting the eBPF map data to Prometheus via a BCC-based exporter, and visualizing in Grafana. The ad hoc scripts here are designed for RCA; a production pipeline would use the same kernel attachment points but aggregate differently. That pipeline architecture is a separate writeup.
08 — Setting up the test lab
Every script above was built against a reproducible test environment. The repository includes ebpf_overlay_lab.sh — a self-contained lab harness that builds a VXLAN overlay topology using Linux network namespaces. No VMs, no cloud accounts, no extra hardware. If you have a Linux box with a 5.8+ kernel, you can run the full lab.
What the lab builds
The script creates two network namespaces connected by a veth pair (the underlay), then builds a VXLAN tunnel on top (the overlay). This gives you a complete two-host topology on a single machine:
[ns-host-a]                              [ns-host-b]
10.0.0.1 (veth-a)     ──────   (veth-b) 10.0.0.2       underlay
172.16.0.1 (vxlan-a)  ══════  (vxlan-b) 172.16.0.2     overlay (VNI 100)
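For readers who want to see what such a topology involves, the sketch below reproduces the diagram with plain iproute2 commands. It is not the repository's ebpf_overlay_lab.sh: the /24 prefix lengths, the run and build_lab helper names, and the dry-run default are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the two-namespace VXLAN topology. Commands are printed by
# default; set DRY_RUN=0 and run as root to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_lab() {
    run ip netns add ns-host-a
    run ip netns add ns-host-b
    # Underlay: one veth pair with an end in each namespace.
    run ip link add veth-a netns ns-host-a type veth peer name veth-b netns ns-host-b
    run ip -n ns-host-a addr add 10.0.0.1/24 dev veth-a
    run ip -n ns-host-b addr add 10.0.0.2/24 dev veth-b
    run ip -n ns-host-a link set veth-a up
    run ip -n ns-host-b link set veth-b up
    # Overlay: point-to-point VXLAN, VNI 100, UDP port 4789.
    run ip -n ns-host-a link add vxlan-a type vxlan id 100 local 10.0.0.1 remote 10.0.0.2 dstport 4789 dev veth-a
    run ip -n ns-host-b link add vxlan-b type vxlan id 100 local 10.0.0.2 remote 10.0.0.1 dstport 4789 dev veth-b
    run ip -n ns-host-a addr add 172.16.0.1/24 dev vxlan-a
    run ip -n ns-host-b addr add 172.16.0.2/24 dev vxlan-b
    run ip -n ns-host-a link set vxlan-a up
    run ip -n ns-host-b link set vxlan-b up
}

# Usage: build_lab              (prints the commands)
#        DRY_RUN=0 build_lab    (as root: creates the topology)
```

The dry-run default makes it possible to review every command before anything touches the host's network configuration.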
Prerequisites
You need Linux 5.8+ with BTF enabled, bpftrace, iproute2, iptables, and tc. Most modern distributions (Ubuntu 22.04+, RHEL 9+) have everything out of the box. Verify BTF with ls /sys/kernel/btf/vmlinux.
Quick start
# Clone the repo
git clone https://github.com/emptyquark/network-tshoot-tools.git
cd network-tshoot-tools

# Build the VXLAN topology
sudo ./ebpf_overlay_lab.sh setup

# Run a scenario (1-6)
sudo ./ebpf_overlay_lab.sh scenario 1

# Run the corresponding eBPF tracer
sudo bpftrace scripts/vxlan_trace.bt
The six scenarios
Each scenario injects a specific failure condition, shows you the before (healthy) and after (broken) states, and explains exactly what to look for in the tracer output:
Scenario 1 — VXLAN Tunnel Failure
Blocks UDP 4789 with iptables. You see ENCAP events without matching DECAP — the classic one-sided tunnel problem.
Scenario 2 — Kernel Packet Drops
Injects netfilter DROP/REJECT rules. The drop watcher shows nf_hook_slow in the stack traces, pinpointing which rule chain killed the traffic.
Scenario 3 — Conntrack Exhaustion
Shrinks nf_conntrack_max to 10 and floods 50 UDP connections. The observer catches alloc_failures as the table overflows.
Scenario 4 — The MTU Trap
Drops underlay MTU to 1400 while the overlay stays at 1500. Small pings succeed; large TCP segments trigger retransmits. Small packets passing while full-size segments retransmit is the diagnostic fingerprint.
Scenario 5 — Security Policy Violations
Adds selective iptables drops — ICMP blocked, TCP 8080 blocked, everything else passes. The tracer clusters drops by stack signature to identify the specific rule.
Scenario 6 — Latency Degradation
Uses tc netem to add 1ms delay to the underlay. The histogram shifts right, showing how overlay encapsulation cost changes under congestion.
Cleanup
# Remove all lab resources
sudo ./ebpf_overlay_lab.sh teardown
The teardown is clean — it removes the network namespaces, which automatically destroys all interfaces and routes inside them. Nothing persists on your system after teardown.