eBPF for Overlay Networks — Sathya Ganesan w/Opus 4.6

01 — Why eBPF for overlay troubleshooting

Overlay networks — VXLAN, GRE, Geneve — add encapsulation layers that make traditional packet analysis painful. When a packet traverses a VXLAN tunnel between two hypervisors, tcpdump on the outer interface shows you the encapsulated UDP frame but tells you nothing about what happened to the inner packet inside the kernel. You're debugging through a wall.

eBPF changes this by letting you attach probes directly to kernel functions — the exact points where encapsulation happens, where routing decisions are made, where packets get dropped. Instead of inferring what the kernel did from what you see on the wire, you observe the kernel's decisions directly.

The scripts below are exploratory tools, not production daemons. They're designed to be dropped onto a node, run for a few minutes during an active investigation, and removed. Think of them as a specialized toolkit that sits between tcpdump (too low-level for overlay debugging) and Kentik/Prometheus (too aggregated for ad hoc RCA).

Prerequisites

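These scripts require Linux 5.8+ with BTF (BPF Type Format) enabled and bpftrace installed. drop_watch.bt also reads the reason field of the kfree_skb tracepoint, which was added in Linux 5.17, so that script needs a newer kernel than the rest. Most modern distributions (Ubuntu 22.04+, RHEL 9+) ship with BTF support in the default kernel. Verify with ls /sys/kernel/btf/vmlinux.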
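
A minimal preflight sketch for the two checks above. The helper name kernel_at_least is hypothetical, and the version parsing assumes a release string in the usual `uname -r` form:

```shell
#!/bin/sh
# Preflight sketch: kernel version and BTF presence.

# kernel_at_least RELEASE MAJOR MINOR
# Succeeds when RELEASE (uname -r form, e.g. 5.15.0-91-generic)
# is at least MAJOR.MINOR.
kernel_at_least() {
    rel_major=${1%%.*}            # text before the first dot
    rest=${1#*.}                  # text after the first dot
    rel_minor=${rest%%[!0-9]*}    # leading digits of what remains
    [ "$rel_major" -gt "$2" ] ||
        { [ "$rel_major" -eq "$2" ] && [ "$rel_minor" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 5 8; then
    echo "kernel: ok"
else
    echo "kernel: older than 5.8"
fi

if [ -r /sys/kernel/btf/vmlinux ]; then
    echo "BTF: ok"
else
    echo "BTF: /sys/kernel/btf/vmlinux not found"
fi
```

The same helper can gate a single script on a newer kernel, for example kernel_at_least "$(uname -r)" 5 17.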

02 — VXLAN encap/decap tracer

The most common overlay debugging scenario: traffic enters a VXLAN tunnel on one host and never arrives at the destination. Is it an encapsulation failure? A VNI mismatch? A routing issue on the underlay? This script traces the vxlan_xmit and vxlan_rcv kernel functions to show you exactly what's happening at each end of the tunnel.

vxlan_trace.bt bpftrace

#!/usr/bin/env bpftrace
/*
 * vxlan_trace.bt - Trace VXLAN encap/decap events
 * Shows source/dest IPs, packet length, and device at each end
 * Usage: sudo bpftrace vxlan_trace.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <net/vxlan.h>

// vxlan_xmit(skb, dev): skb is the inner frame before encapsulation,
// so these are overlay addresses (assumes an IPv4 inner packet)
kprobe:vxlan_xmit
{
    $skb = (struct sk_buff *)arg0;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);

    printf("ENCAP  src=%s dst=%s len=%d dev=%s\n",
        ntop($iph->saddr),
        ntop($iph->daddr),
        $skb->len,
        $skb->dev->name);
}

// vxlan_rcv(sk, skb): the outer headers are still in place on entry,
// so these are the underlay (VTEP) addresses
kprobe:vxlan_rcv
{
    $skb = (struct sk_buff *)arg1;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);

    printf("DECAP  src=%s dst=%s len=%d dev=%s\n",
        ntop($iph->saddr),
        ntop($iph->daddr),
        $skb->len,
        $skb->dev->name);
}

END
{
    printf("\nTrace complete.\n");
}

What to look for

Run this on both ends of the tunnel simultaneously. If you see ENCAP events on the source but no DECAP on the destination, the underlay is dropping the outer UDP packet — check MTU (VXLAN adds 50 bytes of overhead), firewall rules on UDP 4789, and underlay routing. If DECAP fires but the inner packet never reaches the application, the problem is post-decapsulation — likely an FDB or ARP issue in the overlay.
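
One way to compare the two ends after the fact is to save each tracer's output and count events. This is a minimal sketch; the file names src.log and dst.log and both helper names are hypothetical, and it only counts lines without matching individual packets:

```shell
#!/bin/sh
# Capture on each host first, for example:
#   sudo bpftrace vxlan_trace.bt > src.log    (source host)
#   sudo bpftrace vxlan_trace.bt > dst.log    (destination host)

# count_events TAG FILE: number of lines in FILE that start with TAG
count_events() {
    grep -c "^$1 " "$2" || true   # grep exits 1 on zero matches; still prints 0
}

# compare_tunnel_ends SRC_LOG DST_LOG
compare_tunnel_ends() {
    sent=$(count_events ENCAP "$1")
    seen=$(count_events DECAP "$2")
    echo "encap=$sent decap=$seen"
    if [ "$sent" -gt 0 ] && [ "$seen" -eq 0 ]; then
        echo "verdict: no DECAP on destination, suspect the underlay"
    fi
}
```

Counts will rarely match exactly, because the two captures start and stop at different moments. The useful signal is a destination count of zero, or one far below the source.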

Performance Note

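The vxlan_xmit probe fires on every encapsulated packet, so a busy host generates a lot of output. Narrow it with a bpftrace predicate such as /comm == "nginx"/, or add rate limiting for production-adjacent troubleshooting. Keep in mind that comm is the task running when the probe fires, so a comm filter matches locally originated traffic but not packets forwarded in softirq context.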

03 — Kernel drop watcher

When packets vanish inside the kernel, kfree_skb is the function that frees the socket buffer — and the call stack at that point tells you exactly why. This is the eBPF equivalent of Brendan Gregg's approach to understanding kernel behavior through the functions that terminate work, not just the functions that initiate it.

drop_watch.bt bpftrace

#!/usr/bin/env bpftrace
/*
 * drop_watch.bt - Trace kernel packet drops with stack traces
 * Identifies WHERE and WHY the kernel drops packets
 * Usage: sudo bpftrace drop_watch.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>

tracepoint:skb:kfree_skb
{
    $skb = (struct sk_buff *)args->skbaddr;
    $reason = args->reason;   // field present on Linux 5.17+

    // Skip events with no recorded reason, and anything that is not
    // IPv4 (ethertype 0x0800) so the iphdr cast below is valid
    if ($reason > 0 && args->protocol == 0x0800) {
        $iph = (struct iphdr *)($skb->head + $skb->network_header);

        time("%H:%M:%S  ");
        printf("DROP reason=%d proto=%d src=%s dst=%s\n",
            $reason,
            $iph->protocol,
            ntop($iph->saddr),
            ntop($iph->daddr));
        printf("  stack: %s\n", kstack);
    }
}

interval:s:30
{
    printf("--- still tracing (30s heartbeat) ---\n");
}

Reading the output

The kernel stack trace is the key diagnostic. If you see nf_hook_slow in the stack, a netfilter rule (iptables/nftables) is dropping the traffic — check your security policies. If you see ip_rcv_finish followed by ip_route_input_noref, it's a routing table miss. For overlay networks specifically, watch for drops in vxlan_rcv (VNI lookup failure) or udp_queue_rcv_skb (the outer UDP packet never made it to the VXLAN handler).

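The reason field maps to the kernel's skb_drop_reason enum (the SKB_DROP_REASON_* constants). Names like NOT_SPECIFIED, NO_SOCKET, and NETFILTER_DROP tell you the category of drop before you even read the stack. The numeric values have shifted between kernel releases, so resolve them against the running kernel instead of a fixed table: the print fmt line in /sys/kernel/tracing/events/skb/kfree_skb/format lists the number-to-name mapping.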
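
The number-to-name mapping for drop reasons can be read from the running kernel's tracepoint format file. The sketch below assumes tracefs is mounted at /sys/kernel/tracing and that the print fmt line carries entries shaped like { 2, "NO_SOCKET" }; the helper name drop_reason_name is hypothetical:

```shell
#!/bin/sh
# drop_reason_name NUM [FORMAT_FILE]
# Prints the symbolic name the kernel associates with drop reason NUM.
# Reading the default file usually requires root.
drop_reason_name() {
    num=$1
    fmt=${2:-/sys/kernel/tracing/events/skb/kfree_skb/format}
    # Split the __print_symbolic() list into one "{ N, "NAME" }" per line,
    # then keep the name whose number matches exactly.
    grep -o '{ *[0-9][0-9]*, *"[^"]*" *}' "$fmt" |
        sed -n "s/^{ *$num, *\"\([^\"]*\)\" *}\$/\1/p" |
        head -n 1
}

# Usage (as root):
#   drop_reason_name 2
```

If the function prints nothing, the number is not in this kernel's table, or the kernel predates the reason field.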

04 — Connection tracking observer

In hybrid cloud environments where overlay networks cross security zones, connection tracking is often the invisible culprit. A conntrack table that's 90% full won't show symptoms until it hits 100% and starts dropping new connections, with only a rate-limited "nf_conntrack: table full, dropping packet" line in the kernel log as evidence. This script watches conntrack activity in real time and flags table pressure.

conntrack_watch.bt bpftrace

#!/usr/bin/env bpftrace
/*
 * conntrack_watch.bt - Monitor conntrack table activity
 * Tracks new connections, deletions, and table pressure
 * Usage: sudo bpftrace conntrack_watch.bt
 */

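#include <net/netfilter/nf_conntrack_tuple.h>

// Mainline kernels have no nf_conntrack tracepoints, so count new
// entries at allocation time. arg2 is the original-direction tuple.
// This counts allocation attempts, including any that fail.
kprobe:__nf_conntrack_alloc
{
    $orig = (struct nf_conntrack_tuple *)arg2;
    @new_conns = count();
    @by_proto[$orig->dst.protonum] = count();
}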

kprobe:nf_conntrack_destroy
{
    @destroyed = count();
}

kprobe:__nf_conntrack_confirm
{
    @confirmed = count();
}

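// Alert on conntrack allocation failures (table full or out of memory).
// __nf_conntrack_alloc() returns ERR_PTR(-ENOMEM) on failure, not NULL,
// so match the ERR_PTR range (-4095..-1) instead of zero
kretprobe:__nf_conntrack_alloc
/(int64)retval < 0 && (int64)retval >= -4095/
{
    time("%H:%M:%S  ");
    printf("ALERT: conntrack alloc FAILED - table likely full\n");
    printf("  comm=%s pid=%d\n", comm, pid);
    @alloc_failures = count();
}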

interval:s:10
{
    time("%H:%M:%S  ");
    printf("--- 10s summary ---\n");
    print(@new_conns);
    print(@confirmed);
    print(@destroyed);
    print(@alloc_failures);
    print(@by_proto);
    clear(@new_conns);
    clear(@confirmed);
    clear(@destroyed);
    clear(@alloc_failures);
    clear(@by_proto);
}

When to reach for this

Conntrack issues are sneaky in overlay environments because encapsulated traffic can create conntrack entries at both the outer (underlay) and inner (overlay) layers. When both layers are tracked in the same namespace, a host handling VXLAN traffic can roughly double its conntrack table consumption. If you're seeing intermittent connection failures that don't correlate with CPU or memory pressure, run this script and watch for alloc_failures. The fix is usually bumping nf_conntrack_max, but this script tells you whether that's actually the problem before you start tuning blind.
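
Before attaching any probes, the table fill level can be read straight from procfs. A minimal sketch, assuming the nf_conntrack module is loaded so that the two sysctl files exist; the helper name conntrack_pct is hypothetical:

```shell
#!/bin/sh
# conntrack_pct COUNT MAX: integer percentage of the table in use
conntrack_pct() {
    if [ "$2" -le 0 ]; then
        echo 0                      # avoid dividing by zero
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max

if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    pct=$(conntrack_pct "$(cat "$count_file")" "$(cat "$max_file")")
    echo "conntrack table: ${pct}% full"
else
    echo "nf_conntrack sysctls not present (module not loaded?)"
fi
```

A reading near 100 alongside alloc_failures in the tracer output is a strong hint that the table size is the bottleneck.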

05 — TCP retransmit tracker

TCP retransmits inside overlay tunnels are the silent performance killer. The application sees latency spikes; the network team sees clean counters on the physical interfaces. The retransmit is happening at the inner (overlay) layer, invisible to standard interface monitoring. This script adapts the approach used in BCC's tcpretrans tool. It reports every TCP retransmit on the host and aggregates by destination, so overlay flows stand out by their overlay addresses.

overlay_retrans.bt bpftrace

#!/usr/bin/env bpftrace
/*
 * overlay_retrans.bt - TCP retransmit tracker for overlay interfaces
 * Correlates retransmits with overlay tunnel endpoints
 * Usage: sudo bpftrace overlay_retrans.bt
 */

#include <linux/tcp.h>
#include <net/sock.h>

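tracepoint:tcp:tcp_retransmit_skb
{
    $sk = (struct sock *)args->skaddr;
    $saddr = $sk->__sk_common.skc_rcv_saddr;
    $daddr = $sk->__sk_common.skc_daddr;
    // skc_num is already host byte order; skc_dport is network byte
    // order, so swap its two bytes before printing
    $lport = $sk->__sk_common.skc_num;
    $dport = $sk->__sk_common.skc_dport;
    $dport = ($dport >> 8) | (($dport << 8) & 0x00FF00);

    time("%H:%M:%S  ");
    printf("RETRANS %s:%d -> %s:%d state=%d\n",
        ntop($saddr),
        $lport,
        ntop($daddr),
        $dport,
        $sk->__sk_common.skc_state);

    @retrans_by_dest[ntop($daddr)] = count();
    @retrans_by_port[$dport] = count();
    @total = count();
}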

tracepoint:tcp:tcp_retransmit_synack
{
    printf("SYNACK RETRANS (handshake reply or final ACK being lost)\n");
    @synack_retrans = count();
}

interval:s:30
{
    printf("\n--- 30s retransmit summary ---\n");
    print(@total);
    print(@retrans_by_dest);
    print(@retrans_by_port);
    print(@synack_retrans);
    clear(@total);
    clear(@retrans_by_dest);
    clear(@retrans_by_port);
    clear(@synack_retrans);
}

The MTU trap

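VXLAN over IPv4 takes 50 bytes out of the underlay MTU: 20 outer IP + 8 UDP + 8 VXLAN header + 14 inner Ethernet. The outer Ethernet header sits outside the MTU and does not count. If the underlay MTU is the standard 1500 and the overlay interface is also left at 1500 instead of 1450, small packets fit but full-size segments do not. The SYN and SYNACK are both well under 100 bytes, so the handshake normally completes; it is the first full-size data segments that get dropped and show up as repeated RETRANS lines for the same flow. Small packets passing while full-size segments retransmit is the diagnostic fingerprint. Heavy SYNACK retransmits point to a different problem: small packets are being lost as well, typically the SYNACK or the final ACK on the return path, so check tunnel reachability and filtering before blaming MTU.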
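
The MTU arithmetic can be worked through in a few lines. This sketch assumes an IPv4 underlay and counts the overhead as it applies to the underlay IP MTU; the helper names are hypothetical:

```shell
#!/bin/sh
# Bytes VXLAN consumes inside the underlay IP MTU (IPv4 underlay):
# outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14 = 50
VXLAN_OVERHEAD=50

# vxlan_overlay_mtu UNDERLAY_MTU: largest safe MTU for the overlay interface
vxlan_overlay_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# max_ping_payload MTU: largest ICMP echo payload that fits in MTU
# (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header)
max_ping_payload() {
    echo $(( $1 - 28 ))
}

underlay=1500
overlay=$(vxlan_overlay_mtu "$underlay")
echo "underlay MTU $underlay -> overlay MTU $overlay"
echo "probe: ping -M do -s $(max_ping_payload "$overlay") <overlay-peer>"
```

The ping line is printed, not run. With iputils ping, -M do sets the don't-fragment bit, so a payload one byte larger than the printed size should fail across a correctly sized tunnel.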

06 — Security policy violations

In zero trust environments, netfilter (iptables/nftables) and CNI network policies both make drop decisions that are logged poorly, if at all. This script traces nf_hook_slow, the path where netfilter verdicts are rendered, and reports each drop with its process context and kernel stack, so you can see in real time which hook path is dropping traffic.

policy_drops.bt bpftrace

#!/usr/bin/env bpftrace
/*
 * policy_drops.bt - Trace netfilter/security policy drops
 * Shows which rules are rejecting traffic in real time
 * Usage: sudo bpftrace policy_drops.bt
 */

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/netfilter.h>

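// nf_hook_slow() returns 1 when the packet is accepted, a negative
// errno (-EPERM by default) when a hook returns NF_DROP, and 0 when
// the packet was queued or stolen
kretprobe:nf_hook_slow
/(int32)retval < 0/
{
    time("%H:%M:%S  ");
    printf("NF_DROP  comm=%s pid=%d\n", comm, pid);
    printf("  stack: %s\n", kstack);
    @drops_by_comm[comm] = count();
    @drop_stacks[kstack] = count();
}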

interval:s:30
{
    printf("\n--- 30s policy drop summary ---\n");
    print(@drops_by_comm);
    printf("\nTop drop stacks:\n");
    print(@drop_stacks, 3);
    clear(@drops_by_comm);
    clear(@drop_stacks);
}

Debugging Kubernetes network policies

When a Kubernetes CNI (Calico, Cilium) enforces a NetworkPolicy, the enforcement mechanism is typically iptables rules or eBPF programs. Only the iptables/nftables path produces a netfilter verdict that this script can see; drops made in an eBPF datapath (Cilium, or Calico in eBPF mode) never reach nf_hook_slow and need the CNI's own drop tooling. For the netfilter case, the @drop_stacks aggregation groups drops by their kernel call stack. A cluster of drops with the same stack signature identifies the hook point that is blocking traffic. Cross-reference with the packet counters in iptables -L -v -n --line-numbers to identify the rule.
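
The iptables cross-reference can be narrowed to rules that have actually matched traffic. A minimal sketch, assuming the column layout printed by iptables -L -v -n --line-numbers (num, pkts, bytes, target, ...); the helper name nonzero_drops is hypothetical:

```shell
#!/bin/sh
# nonzero_drops: reads `iptables -L -v -n --line-numbers` output on stdin
# and prints "CHAIN RULE_NUMBER PACKETS" for each DROP or REJECT rule
# whose packet counter is not zero.
nonzero_drops() {
    awk '
        $1 == "Chain" { chain = $2; next }
        ($4 == "DROP" || $4 == "REJECT") && $2 != "0" { print chain, $1, $2 }
    '
}

# Usage (as root):
#   iptables -L -v -n --line-numbers | nonzero_drops
```

Running it twice a few seconds apart and comparing counters shows which rule is dropping traffic right now as opposed to at some point since the counters were last reset.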

07 — Overlay latency histogram

The final piece: measuring the overhead that the overlay adds. This script measures the time delta between a packet entering the VXLAN encapsulation path and the encapsulated packet being queued for transmission. The result is a histogram of encapsulation latency — the tax your overlay is charging per packet.

overlay_latency.bt bpftrace

#!/usr/bin/env bpftrace
/*
 * overlay_latency.bt - Histogram of VXLAN encap latency
 * Measures the per-packet cost of overlay encapsulation
 * Usage: sudo bpftrace overlay_latency.bt
 */

kprobe:vxlan_xmit
{
    @start[tid] = nsecs;
}

kretprobe:vxlan_xmit
/@start[tid]/
{
    $delta = nsecs - @start[tid];
    @encap_latency_ns = hist($delta);
    @avg_ns = avg($delta);
    delete(@start[tid]);
}

interval:s:10
{
    printf("\n--- VXLAN encap latency (10s window) ---\n");
    print(@encap_latency_ns);
    print(@avg_ns);
    clear(@encap_latency_ns);
    clear(@avg_ns);
}

END
{
    clear(@start);
}

Interpreting the histogram

Healthy VXLAN encapsulation typically completes in 2–10 microseconds. If you're seeing a bimodal distribution — most packets fast, but a tail at 50+ microseconds — the likely cause is cache misses in the FDB (forwarding database) lookup. The VXLAN driver looks up the destination VTEP for each inner MAC address; a cold or oversized FDB forces memory fetches that dominate the encap path. The histogram makes this bimodal pattern visible where an average alone would hide it.
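
To put a number on the tail, the printed histogram can be post-processed. The sketch below assumes bpftrace's usual hist() text layout, with power-of-two buckets such as [2K, 4K) where K means 1024; the helper name hist_tail_pct is hypothetical. Single-value buckets like [0] and [1] are ignored, which is harmless for nanosecond latencies:

```shell
#!/bin/sh
# hist_tail_pct THRESHOLD_NS: reads one bpftrace hist() printout on stdin
# and prints the integer percentage of samples in buckets whose lower
# bound is at or above THRESHOLD_NS.
hist_tail_pct() {
    awk -v thr="$1" '
        /^\[[0-9]+[KMG]?, / {
            lo = $1
            sub(/^\[/, "", lo)     # strip leading bracket
            sub(/,$/, "", lo)      # strip trailing comma
            mult = 1
            if (lo ~ /K$/)      { mult = 1024;       sub(/K$/, "", lo) }
            else if (lo ~ /M$/) { mult = 1048576;    sub(/M$/, "", lo) }
            else if (lo ~ /G$/) { mult = 1073741824; sub(/G$/, "", lo) }
            n = $3                 # sample count is the third field
            total += n
            if (lo * mult >= thr) tail += n
        }
        END {
            if (total > 0) printf "%d\n", tail * 100 / total
            else print 0
        }
    '
}

# Usage: save one 10s window to a file, then
#   hist_tail_pct 50000 < window.txt
```

Because the buckets are powers of two, a 50000 ns threshold effectively starts counting at the [64K, 128K) bucket.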

Going Further

These scripts scratch the surface. For production observability, consider exporting the eBPF map data to Prometheus via a BCC-based exporter, and visualizing in Grafana. The ad hoc scripts here are designed for RCA; a production pipeline would use the same kernel attachment points but aggregate differently. That pipeline architecture is a separate writeup.

08 — Setting up the test lab

Every script above was built against a reproducible test environment. The repository includes ebpf_overlay_lab.sh — a self-contained lab harness that builds a VXLAN overlay topology using Linux network namespaces. No VMs, no cloud accounts, no extra hardware. If you have a Linux box with a 5.8+ kernel, you can run the full lab.

What the lab builds

The script creates two network namespaces connected by a veth pair (the underlay), then builds a VXLAN tunnel on top (the overlay). This gives you a complete two-host topology on a single machine:

topology diagram

  [ns-host-a]                              [ns-host-b]
   10.0.0.1 (veth-a) ────── (veth-b) 10.0.0.2        underlay
   172.16.0.1 (vxlan-a) ════ (vxlan-b) 172.16.0.2     overlay (VNI 100)
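
For orientation, the topology above can be approximated with a handful of iproute2 commands. This is a sketch, not the repository's ebpf_overlay_lab.sh: the /24 prefix lengths and the function name lab_commands are assumptions, while the interface names, addresses, and VNI come from the diagram. The function only prints the commands, so it is safe to run unprivileged:

```shell
#!/bin/sh
# lab_commands: print the commands for a two-namespace VXLAN topology.
lab_commands() {
    cat <<'EOF'
ip netns add ns-host-a
ip netns add ns-host-b
ip link add veth-a netns ns-host-a type veth peer name veth-b netns ns-host-b
ip -n ns-host-a addr add 10.0.0.1/24 dev veth-a
ip -n ns-host-b addr add 10.0.0.2/24 dev veth-b
ip -n ns-host-a link set veth-a up
ip -n ns-host-b link set veth-b up
ip -n ns-host-a link add vxlan-a type vxlan id 100 local 10.0.0.1 remote 10.0.0.2 dstport 4789 dev veth-a
ip -n ns-host-b link add vxlan-b type vxlan id 100 local 10.0.0.2 remote 10.0.0.1 dstport 4789 dev veth-b
ip -n ns-host-a addr add 172.16.0.1/24 dev vxlan-a
ip -n ns-host-b addr add 172.16.0.2/24 dev vxlan-b
ip -n ns-host-a link set vxlan-a up
ip -n ns-host-b link set vxlan-b up
EOF
}

lab_commands
```

To apply the commands for real, pipe the output into a root shell, for example lab_commands | sudo sh -e. Deleting the two namespaces with ip netns del removes everything again.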

Prerequisites

You need Linux 5.8+ with BTF enabled, bpftrace, iproute2 (which provides both ip and tc), and iptables. Most modern distributions (Ubuntu 22.04+, RHEL 9+) package all of these. Verify BTF with ls /sys/kernel/btf/vmlinux.

Quick start

terminal 1 bash

# Clone the repo
git clone https://github.com/emptyquark/network-tshoot-tools.git
cd network-tshoot-tools

# Build the VXLAN topology
sudo ./ebpf_overlay_lab.sh setup

# Run a scenario (1-6)
sudo ./ebpf_overlay_lab.sh scenario 1

terminal 2 bash

# Run the corresponding eBPF tracer
sudo bpftrace scripts/vxlan_trace.bt

The six scenarios

Each scenario injects a specific failure condition, shows you the before (healthy) and after (broken) states, and explains exactly what to look for in the tracer output:

Scenario 1 — VXLAN Tunnel Failure

Blocks UDP 4789 with iptables. You see ENCAP events without matching DECAP — the classic one-sided tunnel problem.

vxlan_trace.bt iptables

Scenario 2 — Kernel Packet Drops

Injects netfilter DROP/REJECT rules. The drop watcher shows nf_hook_slow in the stack traces, pinpointing which rule chain killed the traffic.

drop_watch.bt netfilter

Scenario 3 — Conntrack Exhaustion

Shrinks nf_conntrack_max to 10 and floods 50 UDP connections. The observer catches alloc_failures as the table overflows.

conntrack_watch.bt conntrack UDP flood

Scenario 4 — The MTU Trap

Drops underlay MTU to 1400 while the overlay stays at 1500. Small pings succeed; large TCP segments trigger retransmits. Small packets passing while full-size segments retransmit is the diagnostic fingerprint.

overlay_retrans.bt MTU mismatch VXLAN overhead

Scenario 5 — Security Policy Violations

Adds selective iptables drops — ICMP blocked, TCP 8080 blocked, everything else passes. The tracer clusters drops by stack signature to identify the specific rule.

policy_drops.bt iptables zero trust

Scenario 6 — Latency Degradation

Uses tc netem to add 1ms delay to the underlay. The histogram shifts right, showing how overlay encapsulation cost changes under congestion.

overlay_latency.bt tc netem histogram

Cleanup

teardown bash

# Remove all lab resources
sudo ./ebpf_overlay_lab.sh teardown

The teardown is clean — it removes the network namespaces, which automatically destroys all interfaces and routes inside them. Nothing persists on your system after teardown.
