BPF Design Flaw in the Linux Kernel
February 4, 2023
Hey there! Are you someone who likes to get their hands dirty with the technical details? If so, you’ll be excited to know that even the kernel has its flaws. No prior knowledge of BPF is necessary, but it would help if you have some familiarity with Linux, system calls, and context switching. Let me know if there’s anything else I can help you with!
I’ll start by teasing you with this code example from the Linux Documentation, to show you that in the world we live in, even documentation can not be trusted. Don’t worry, we will explain it later:
sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0)
/* ... bail out ... */
ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0)
/* ... bail out ... */
Introduction
BPF (Berkeley Packet Filter) is a feature of the Linux kernel that allows to run sandboxed programs in the kernel. Originally, BPF was designed for network packet filtering, but later it was extended (eBPF) to allow a wide range of use cases, including tracing, profiling, and security. This article focuses on the original BPF implementation, and not on eBPF.
BPF is an incredibly powerful tool, but its Linux kernel implementation has a design flaw which can allow unwanted packets to infiltrate the program. If these packets are not dealt with properly, they can lead to bugs and security issues. This article outlines the flaw and provides guidance on how to avoid it.
History
Originally, BPF was introduced in the BSD kernel in 1991. Steven McCanne and Van Jacobson laid out the groundwork for the BPF design in a paper published in 1992. One of the first use cases for the technology was to enhance tools like tcpdump
and arpwatch
.
In some references it is mentioned that in 1997 (Linux kernel version 2.1.75) the first implementation of BPF was added to the Linux kernel. Linux named this feature LSF (Linux Socket Filter) and made some modifications to the original BPF design. For the purpose of this article, LSF == BPF
.
BPF Explained
User-space programs need a way to retrieve packets from the network, therefore, the kernel socket API was invented. But retrieving all the packets from the network is not efficient, therefore, the kernel provides a way to filter the packets before they are delivered to the user-space program. And this is where BPF comes in. BPF was designed to be simple enough to ensure security (sandboxing), and powerful enough to allow complex filtering.
This is how it works: the user-space program attaches a BPF program - a set of instructions - to a socket, and the kernel executes the program on each packet. The BPF program can decide to accept or reject the packet. If accepted, the packet is delivered to the user-space program; if rejected, it is dropped.
Here is a code example from the official documentation - don’t worry, we will explain it line by line:
struct sock_filter code[] = {
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 8, 0x000086dd },
/* truncted for simplicity */
{ 0x06, 0, 0, 0x0000ffff },
{ 0x06, 0, 0, 0x00000000 },
};
struct sock_fprog bpf = {
.len = ARRAY_SIZE(code),
.filter = code,
};
sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (sock < 0)
/* ... bail out ... */
ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
if (ret < 0)
/* ... bail out ... */
/* ... */
close(sock);
So how does it work? Let’s break it down:
-
The user-space program creates a BPF program, which is an array of
struct sock_filter
objects. Each object contains 4 fields:code
- the instruction codejt
- jump if truejf
- jump if falsek
- generic multiuse field
The BPF program is a sequence of instructions, and each instruction is executed in order. The program can jump to another instruction, and it can also return a value. An easy way to generate such bpf program is to use
tcpdump
;$ tcpdump -iany port 22 -dd
-
A raw socket is created. A raw socket is a socket that allows the user-space program to access the network directly. The
htons(ETH_P_ALL)
argument specifies that the socket will receive all the packets from the network. -
The filter is attached to the socket using the
SO_ATTACH_FILTER
option for thesetsockopt
system call.
Can you spot the problem?
The Problem
So here it is: The issue is that when you create the raw socket
, the kernel starts sending packets to the user-space program. If the execution of setsockopt
does not occur before the first packet arrives, the user-space program will receive packets without a filter.
sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
/* Exactly here, between these two syscalls,
our socket might receive unfiltered packets */
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
Now imagine the unfortunate programmers who are trying to filter packets using BPF. They are unaware of the issue, and they don’t realize that when reading and parsing packets from their socket, they need to make sure the packet is valid. I’ve personally seen this happen in real-world code, and it was a nightmare to debug.
Wait, but how did no one notice this issue?
Mainly because it happens so rarely. The probability of a packet arriving between the two system calls is very low, yet it still happens. Generally, it requires a context switch to occur between the two system calls, but it may also occur if there is a large amount of network traffic, or just bad luck.
How can I avoid this issue?
So let’s think this through. We can’t prevent the kernel from sending packets to the socket, but maybe we can start with a clean slate. A valid option is to consume all the packets in the socket after attaching the filter. This way, we can be sure that the first packet the rest of the program will read from the socket is filtered.
while (recv(sock, buf, sizeof(buf), MSG_DONTWAIT) > 0)
/* ... */
This solution will work, but it is not ideal. In an overloaded computer with a lot of network traffic, the recv
system call may block for a long time.
An improved solution could be to start with setting a filter to “block all packets”, then to consume all the packets in the socket, and finally to attach the real filter. This way, we know that we need to consume only a small amount of packets, and we can be sure that the first packet the rest of the program will read from the socket is filtered.
struct sock_filter code[] = {
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 8, 0x000086dd },
/* truncted for simplicity */
{ 0x06, 0, 0, 0x0000ffff },
{ 0x06, 0, 0, 0x00000000 },
};
struct sock_filter block_all_code[] = {
{ 0x06, 0, 0, 0x00000000 },
};
struct sock_fprog bpf = {
.len = ARRAY_SIZE(code),
.filter = code,
};
struct sock_fprog block_all = {
.len = ARRAY_SIZE(block_all_code),
.filter = block_all_code,
};
// Create a raw socket
sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
/* ... */
// Block all packets
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &block_all, sizeof(block_all));
/* ... */
// Consume all packets
while (recv(sock, buf, sizeof(buf), MSG_DONTWAIT) > 0) {}
// Attach the real filter
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
/* ... */
If you want to explore real-world code that handles this issue, you can check out the libpcap source code. It is used by many programs, including tcpdump
, wireshark
, and nmap
.
It’s worth noting that none of the official documentation mentions this issue. So consider this a friendly reminder to be careful when using BPF.
This article is also available in my Medium