SBN

Systemd is bad parsing and should feel bad

Systemd has a remotely exploitable bug in it’s DHCPv6 client. That means anybody on the local network can send you a packet and take control of your computer. The flaw is a typical buffer-overflow. Several news stories have pointed out that this client was written from scratch, as if that were the moral failing, instead of reusing existing code. That’s not the problem.

The problem is that it was written from scratch without taking advantage of the lessons of the past. It makes the same mistakes all over again.
In the late 1990s and early 2000s, we learned that parsing input is a problem. The traditional ad hoc approach you were taught in school is wrong. It’s wrong from an abstract theoretical point of view. It’s wrong from the practical point of view, error prone and leading to spaghetti code.
The first thing you need to unlearn is byte-swapping. I know that this was some sort epiphany you had when you learned network programming but byte-swapping is wrong. If you find yourself using a macro to swap bytes, like the be16toh() macro used in this code, then you are doing it wrong.
But, you say, the network byte-order is big-endian, while today’s Intel and ARM processors are little-endian. So you have to swap bytes, don’t you?
No. As proof of the matter I point to every other language other than C/C++. They don’t don’t swap bytes. Their internal integer format is undefined. Indeed, something like JavaScript may be storing numbers as a floating points. You can’t muck around with the internal format of their integers even if you wanted to.
An example of byte swapping in the code is something like this:

In this code, it’s taking a buffer of raw bytes from the DHCPv6 packet and “casting” it as a C internal structure. The packet contains a two-byte big-endian length field, “option->len“, which the code must byte-swap in order to use.

Among the errors here is casting an internal structure over external data. From an abstract theory point of view, this is wrong. Internal structures are undefined. Just because you can sort of know the definition in C/C++ doesn’t change the fact that they are still undefined.
From a practical point of view, this leads to confusion, as the programmer is never quite clear as to the boundary between external and internal data. You are supposed to rigorously verify external data, because the hacker controls it. You don’t keep double-checking and second-guessing internal data, because that would be stupid. When you blur the lines between internal and external data, then your checks get muddled up.
Yes you can, in C/C++, cast an internal structure over external data. But just because you can doesn’t mean you should. What you should do instead is parse data the same way as if you were writing code in JavaScript. For example, to grab the DHCP6 option length field, you should write something like:

The thing about this code is that you don’t know whether it’s JavaScript or C, because it’s both, and it does the same thing for both.

Byte “swapping” isn’t happening. We aren’t extracting an integer from a packet, then changing it’s internal format. Instead, we are extracting two bytes and combining them. This description may seem like needless pedantry, but it’s really really important that you grok this. For example, there is no conditional macro here that does one operation for a little-endian CPU, and another operation for a big-endian CPU — it does the same thing for both CPUs. Whatever words you want to use to describe the difference, it’s still profound and important.
The other thing that you shouldn’t do, even though C/C++ allows it, is pointer arithmetic. Again, it’s one of those epiphany things C programmers remember from their early days. It’s something they just couldn’t grasp until one day they did, and then they fell in love with it. Except it’s bad. The reason you struggled to grok it is because it’s stupid and you shouldn’t be using it. No other language has it, because it’s bad.
I mean, back in the day, it was a useful performance optimization. Iterating through an array can be faster adding to pointer than incrementing an index. But that’s virtually never the case today, and it just leads to needless spaghetti code. It leads to convoluted constructions like the following at the heart of this bug where you have to both do arithmetic on the pointer as well as on the length which you are checking against. This nonsense leads to confusion and ultimately, buffer overflows.

In a heckofalot of buffer overflows over the years, there’s a construct just like this lurking near the bug. If you are going to do a rewrite of code, then this is a construct you need to avoid. Just say no to pointer arithmetic.

In my code, you see a lot of constructs where it’s buf, offset, and length. The buf variable points to the start of the buffer and is never incremented. The length variable is the max length of the buffer and likewise never changes. It’s the offset variable that is incremented throughout.
Because of simplicity, buffer overflow checks become obvious, as it’s always “offset + x < length“, and easy to verify. In contrast, here is the fix for the DHCPv6 buffer overflow. That this is checking for an external buffer overflow is less obvious:

Now let’s look at that error code. That’s not what ENOBUFS really means. That’s an operating system error code that has specific meaning about kernel buffers. Overloading it for userland code is inappropriate.

That argument is a bit pedantic I grant you, but that’s only the start. The bigger issue is that it’s identifying the symptom not the problem. The ultimate problem is that the code failed to sanitize the original input, allowing the hacker to drive computation this deep in the system. The error isn’t that the buffer is too small to hold the output, the original error is that the input was too big. Imagine if this gets logged and the sysadmin reviewing dmesg asks themselves how they can allocate bigger buffers to the DHCP6 daemon. That is entirely the wrong idea.
Again, we go back to lessons of 20 years that this code ignored, the necessity of sanitizing input.
Now let’s look at assert(). This is a feature in C that we use to double-check things in order to catch programming mistakes early. An example is the code below, which is checking for programming mistakes where the caller of the function may have used NULL-pointers:

This is pretty normal, but now consider this other use of assert().

This isn’t checking errors by the programmer here, but is instead validating input. That’s not what you are supposed to use asserts for. This are very different things. It’s a coding horror that makes you shriek and run away when you see it. In my fact, that’s my Halloween costume this year, using asserts to validate network input.

This reflects a naive misunderstanding by programmers who don’t understand the difference between out-of-band checks validating the code, and what the code is supposed to be doing validating input. Like the buffer overflow check above, EINVAL because a programmer made a mistake is a vastly different error than EINVAL because a hacker tried to inject bad input. These aren’t the same things, they aren’t even in the same realm.
Conclusion

Rewriting old code is a good thing — as long as you are fixing the problems of the past and not repeating them. We have 20 years of experience with what goes wrong in network code. We have academic disciplines like langsec that study the problem. We have lots of good examples of network parsing done well. There is really no excuse for code that is of this low quality.
This code has no redeeming features. It must be thrown away and rewritten yet again. This time by an experienced programmer who know what error codes mean, how to use asserts properly, and most of all, who has experience at network programming.

*** This is a Security Bloggers Network syndicated blog from Errata Security authored by Robert Graham. Read the original post at: https://blog.erratasec.com/2018/10/systemd-is-bad-parsing-and-should-feel.html

Secure Guardrails