Why Meltdown exists

So I thought I’d answer this question. I’m not a “chipmaker”, but I’ve been optimizing low-level x86 assembly language for a couple of decades.

The tl;dr version is this: the CPUs have no bug. The results are correct, it’s just that the timing is different. CPU designers will never fix the general problem of undetermined timing.
CPUs are deterministic in the results they produce. If you add 5+6, you always get 11 — always. On the other hand, the amount of time they take is non-deterministic. Run a benchmark on your computer. Now run it again. The amount of time it took varies, for a lot of reasons.
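To make that concrete, here is a minimal sketch (the `work` function and the loop count are just illustrative choices of mine). It runs the same computation five times: the results match every time, while the timings will usually differ from run to run.

```python
import time

def work():
    # A fixed computation: the result never changes between runs.
    return sum(i * i for i in range(200_000))

results, timings = [], []
for _ in range(5):
    start = time.perf_counter()
    results.append(work())
    timings.append(time.perf_counter() - start)

# Same answer every time...
print("results identical:", len(set(results)) == 1)
# ...but the elapsed time typically varies between runs.
print("timings (ms):", [round(t * 1000, 3) for t in timings])
```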
That CPUs take an unknown amount of time is an inherent problem in CPU design. Even if you do everything right, “interrupts” from clock timers and network cards will still cause undefined timing problems. Therefore, CPU designers have thrown the concept of “deterministic time” out the window.
The biggest source of non-deterministic behavior is the high-speed memory cache on the chip. When a piece of data is in the cache, the CPU accesses it immediately. When it isn’t, the CPU has to stop and wait for slow main memory. Other things happening in the system impact the cache, unexpectedly evicting recently used data for one purpose in favor of data for another purpose.
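A toy model shows the effect. This is a simulation, not a measurement of real hardware: the `ToyCache` class, its least-recently-used eviction policy, and the cycle costs are all assumptions made for illustration, and each address simply stands for one cache line.

```python
from collections import OrderedDict

HIT_COST, MISS_COST = 1, 100  # illustrative cycle counts, not measured values

class ToyCache:
    """A tiny least-recently-used cache; each address stands for one line."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, address):
        """Return the simulated cost of touching `address`."""
        if address in self.lines:
            self.lines.move_to_end(address)   # mark as most recently used
            return HIT_COST
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        return MISS_COST

cache = ToyCache(capacity=4)
first = cache.access(0x1000)           # cold miss
second = cache.access(0x1000)          # hit: the data is now cached
for other in range(0x2000, 0x2004):    # unrelated activity fills the cache
    cache.access(other)
third = cache.access(0x1000)           # our data was evicted: miss again
print(first, second, third)            # prints: 100 1 100
```

The same access costs 100, then 1, then 100 again, purely because of what else happened in between.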
Hackers love “non-deterministic”, because while such things are unknowable in theory, they are often knowable in practice.
That’s the case of the granddaddy of all hacker exploits, the “buffer overflow”. From the programmer’s perspective, the bug will result in just the software crashing for undefinable reasons. From the hacker’s perspective, they reverse engineer what’s going on underneath, then carefully craft buffer contents so the program doesn’t crash, but instead continues to run the code the hacker supplies within the buffer. Buffer overflows are undefined in theory, well-defined in practice.
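Here is a rough simulation of why the overflow is well-defined in practice. Python has no raw stack to smash, so the frame layout (a 16-byte buffer followed directly by an 8-byte return address), the helper names, and both addresses are hypothetical stand-ins for what a real C program would have in memory.

```python
BUFFER_SIZE = 16

def make_frame(return_address):
    # Toy stack frame: a 16-byte buffer followed by an 8-byte return address.
    return bytearray(BUFFER_SIZE) + return_address.to_bytes(8, "little")

def unchecked_copy(frame, data):
    # Like strcpy: copies everything and never checks BUFFER_SIZE.
    frame[0:len(data)] = data

def saved_return(frame):
    return int.from_bytes(frame[BUFFER_SIZE:BUFFER_SIZE + 8], "little")

frame = make_frame(0x401000)           # hypothetical legitimate return address
payload = b"A" * BUFFER_SIZE + (0xDEADBEEF).to_bytes(8, "little")
unchecked_copy(frame, payload)         # 24 bytes written into a 16-byte buffer
print(hex(saved_return(frame)))        # prints: 0xdeadbeef
```

Once the attacker knows the layout, the bytes past the end of the buffer land exactly on the return address, every time.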
Hackers have already been exploiting these definable/undefinable timing problems with the cache for a long time. An example is cache timing attacks on AES. Table-based AES implementations read lookup tables from memory as they encrypt things, at positions that depend on the key. By playing with the cache, evicting things, timing things, you can figure out the pattern of memory accesses, and hence the secret key.
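A heavily simplified sketch of the idea follows. It models only a single first-round table lookup indexed by plaintext XOR key, and it replaces real cache timing with a set of "touched" lines. The line geometry (16 table entries per cache line) and the function names are my assumptions, not any particular AES implementation.

```python
ENTRIES_PER_LINE = 16  # assumption: 4-byte table entries, 64-byte cache lines

def victim_lookup(plaintext_byte, key_byte, touched_lines):
    """Toy first-round lookup: the table index depends on the secret key."""
    index = plaintext_byte ^ key_byte
    touched_lines.add(index // ENTRIES_PER_LINE)   # the cache side effect

def recover_high_nibble(secret_key_byte, plaintext_byte=0x3C):
    touched = set()            # the attacker has evicted the table beforehand
    victim_lookup(plaintext_byte, secret_key_byte, touched)
    (line,) = touched          # in reality: time each line, exactly one is fast
    # line == (p ^ k) >> 4, so the key's high nibble is line ^ (p >> 4).
    return line ^ (plaintext_byte >> 4)

secret = 0xA7
print(hex(recover_high_nibble(secret)))   # prints: 0xa
```

Knowing the plaintext and which cache line was touched gives away the top four bits of the key byte. Real attacks repeat this over many encryptions and combine it with later rounds to recover the rest.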
Such cache timing attacks have been around since the beginning, really, and it’s simply an unsolvable problem. Instead, we have workarounds, such as changing our crypto algorithms to not depend upon cache, or better yet, implementing them directly in the CPU (such as Intel’s specialized AES-NI instructions).
What’s happened today with Meltdown is that incompletely executed instructions, which discard their results, do affect the cache. We can then recover those partial/temporary/discarded results by measuring the cache timing. This has been known for a while, but we couldn’t figure out how to successfully exploit this, as this paper from Anders Fogh reports. Hackers fixed this, making it practically exploitable.
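The mechanism can be sketched as a toy simulation. To be clear, this is not an exploit and it touches no real hardware: `ToyCPU`, its `cached` set, the 4096-byte spacing of probe slots, and the cycle costs are all assumptions chosen to mirror the shape of the attack as publicly described.

```python
PAGE = 4096  # spacing between probe slots so each byte value gets its own line

class ToyCPU:
    def __init__(self, kernel_memory):
        self.kernel_memory = kernel_memory
        self.cached = set()        # probe-array offsets currently "in cache"

    def transient_read(self, address):
        """User code reads kernel memory; the fault arrives too late."""
        value = self.kernel_memory[address]   # executed speculatively
        self.cached.add(value * PAGE)         # dependent load touches the cache
        # The permission check fails here. The value is discarded
        # architecturally, but the cache footprint above is not rolled back.
        return None

    def probe_time(self, offset):
        return 1 if offset in self.cached else 100   # illustrative costs

def leak_byte(cpu, address):
    cpu.cached.clear()                        # flush the probe array
    assert cpu.transient_read(address) is None   # nothing is returned directly
    timings = [cpu.probe_time(v * PAGE) for v in range(256)]
    return timings.index(min(timings))        # the one fast slot is the byte

cpu = ToyCPU(kernel_memory=b"secret")
print(bytes(leak_byte(cpu, i) for i in range(6)))   # prints: b'secret'
```

The read itself never hands back a value, yet which of the 256 probe slots is fast afterwards reveals the byte.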
As a CPU designer, Intel has few good options.
Fixing cache timing attacks is an impossibility. They can do some tricks, such as allowing some software to reserve part of the cache for private use, for special crypto operations, but the general problem is unsolvable.
Keeping the “incomplete results” from affecting the cache is also difficult. Intel has the fastest CPUs, and the reason is this kind of speculative execution. The other CPU designers have the same problem: fixing the three problems identified today would cause massive performance issues. They’ll come up with improvements, probably, but not complete solutions.
Instead, the fix is within the operating system. Frankly, it’s a needed change that should’ve been done a decade ago. They’ve just been putting it off because of the performance hit. Now that the change has been forced to happen, CPU designers will probably figure out ways to mitigate the performance cost.
Thus, the Intel CPU you buy a year from now will have some partial fixes for these exact problems without addressing the larger security concerns. It will also have performance enhancements to make the operating system patches faster.
But the underlying theoretical problem will never be solved, and is essentially unsolvable.

This is a Security Bloggers Network syndicated blog post authored by Robert Graham. Read the original post at: Errata Security

One thought on “Why Meltdown exists”

  • January 5, 2018 at 5:17 am

    As a former x86 assembly language veteran, I’d like to share a point of view.

    There has always been an informal partnership between CPU & chipset makers and firmware/OS vendors, where when new silicon is released with defects, those defects get patched where possible in software and then usually get fixed in the next version of the silicon.

    However, to suggest that the sorts of side channel opportunities exploited by Meltdown and Spectre (elements of which, such as inference from cache timings, have been known about for some time) should be fixed long-term in the OS places a burden on OS vendors to address not just these publicly known issues but possible future ones as well.

    I don’t understand the intricacies of superscalar speculative execution, but it seems that the root cause of the problem is that a pipeline execution engine doesn’t carry out any form of memory protection checking prior to accessing memory, but only after the event. This allows access to protected memory, whose data values can then be determined by the exemplar pseudo-code shared on some sites and measurement of cache timings, even if the actual execution sequence and data access is wound back as a result of a memory protection exception occurring.

    The current approach to pipeline execution seems to be that “if the memory access was not legitimate, the results of such an instruction stream need to be discarded”. Shouldn’t a privilege check be carried out just prior to any memory accesses during speculative execution, and execution terminated if protection breached rather than allowing an illegal access to take place by the execution unit? While this might impact performance a little – the protection check has to take place at some point anyway – wouldn’t it be a better way to address the root cause?
