Retbleed Security Fix Makes Linux Go 70% Slower

by Richi Jennings on September 13, 2022

The Linux kernel workaround for the ‘Retbleed’ vulnerability is causing a huge slowdown in tests. Performance runs of VMware guests show results up to 70% worse on slightly old hardware.

In a way, this was to be expected: When your workaround to a bug in a performance booster is to neuter the performance booster, you’d kinda expect the performance not to be so … boosted? Anyway, it’s hugely worrying for owners of older cloud server fleets.

Good thing Intel is working on a better fix. In today’s SB Blogwatch, we speculate to accumulate.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: Horrific heatwave.

Must Try Harder

What’s the craic? Michael Larabel reports—“VM Performance Tanks Up To 70% Due To Intel Retbleed Mitigation”:

“Performance horror show”
A performance regression in Linux 5.19 [is] affecting compute performance up to -70%, networking up to -30%, and storage up to -13%. … The heavy hitting regressions are known and a side effect of the Intel Retbleed mitigation for older processors.
…
VMware engineers were concerned by the performance regression when running Linux virtual machines with VMware ESXi. … The hardware they were testing was an Intel Xeon “Skylake” server with 112 threads and 2TB of RAM. … Retbleed further impacts the … performance cost beyond Spectre, Meltdown, and the other CPU speculative execution vulnerabilities of recent years.
…
Due to Retbleed, Intel Indirect Branch Restricted Speculation (IBRS) is flipped on by default for affected CPUs. … Intel engineer [Peter Zijlstra] previously referred to the Retbleed/IBRS situation as a “performance horror show.”

Ret-whatnow? Bill Toulas repeats the scary stat—“70% drop”:

“Abuser of the mitigation”
Retbleed is a speculative execution attack discovered in July 2022 that can leverage return instructions in the CPU to extract sensitive information. Examples of data that Retbleed can leak include items contained in kernel memory, such as root password hashes. … Speculative execution is a performance-enhancing feature on modern processors [but it] has negative consequences from the perspective of security because it makes side-channel attacks possible.
…
Spectre … was mitigated with the “Retpoline” fix, a software-based solution that had minimal performance impact. Retbleed is, in fact, not only a bypass of the Retpoline fix, but an abuser of the mitigation … to inject branch targets in the kernel address space.

Subsequently, Simon Sharwood says—“Retbleed fix slugs Linux VM performance”:

“Torvalds appears not to be concerned”
Because speculative execution exists to speed processing, it is no surprise that disabling it impacts performance. A 70 percent decrease in computing performance will, however, have a major impact on application performance that could lead to unacceptable delays for some business processes.
…
Intel Skylake CPUs [date from] between 2015 and 2017 [and so] will still be present in many server fleets. … Many VMware users will likely have Skylake CPUs in production, or (perhaps unwittingly) use them in clouds.
…
Subsequent CPUs addressed the underlying issues that allowed Retbleed and other Spectre-like attacks. … Emperor penguin Linus Torvalds appears not to be concerned by the situation.
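
To put that 70 percent figure in wall-clock terms, here is a rough bit of arithmetic. It assumes the quoted percentages are drops in throughput, which the reports do not spell out, and the function name is made up for illustration.

```python
def slowdown_factor(throughput_drop):
    """Convert a fractional throughput drop into a run-time multiplier.

    Assumption: the benchmark figure is a loss of throughput, so a job that
    ran at rate 1.0 now runs at rate (1 - drop) and takes 1 / (1 - drop)
    times as long.
    """
    if not 0 <= throughput_drop < 1:
        raise ValueError("drop must be in the range [0, 1)")
    return 1.0 / (1.0 - throughput_drop)


# Worst-case figures quoted in the Larabel excerpt.
for label, drop in [("compute", 0.70), ("networking", 0.30), ("storage", 0.13)]:
    print(f"{label}: about {slowdown_factor(drop):.2f}x as long")
```

On that reading, a 70 percent drop means the same job takes roughly 3.3 times as long, which is why a number that sounds like "a bit under twice as slow" is actually much worse.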

Presumably because Linus knows what’s coming next? Peter Zijlstra eventually explains—“[PATCH v2 00/59] x86/retbleed”:

“Massive improvement”
At long last a new version of the call depth tracking patches. … As a refresher; the theory behind call depth tracking:
…
The Return-Stack-Buffer (RSB) is a 16 deep stack that is filled on every call. … Once the RSB is empty, the CPU falls back to other predictors—e.g., the Branch History Buffer—which can be mistrained by user space and misguides the (return) speculation path to a disclosure gadget of your choice.
…
Call depth tracking is designed to break this speculation path by stuffing speculation trap calls into the RSB whenever the RSB is running low. This way the speculation stalls. … Stuffing at the 12th return is sufficient to break the speculation before it hits the underflow [and] brings the signal to noise ratio down to the crystal ball level. … Netperf-TCP [and] perfsyscall [show] a massive improvement with a major reduction in system CPU usage.
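
The idea in that excerpt can be sketched as a toy model. This is not the kernel patch: it only uses the two numbers from the quote (a 16-deep RSB, stuffing at the 12th return), and the counter scheme and function name are simplifying assumptions made for illustration.

```python
RSB_DEPTH = 16  # from the quote: "a 16 deep stack"
STUFF_AT = 12   # from the quote: "stuffing at the 12th return"


def count_underflows(events, depth_tracking):
    """Replay 'C' (call) and 'R' (return) events against a toy RSB.

    Returns the number of returns that found the toy RSB empty, which is
    where a real CPU would fall back to other predictors. The toy starts
    with an empty RSB.
    """
    rsb = 0            # valid entries currently in the toy RSB
    net_returns = 0    # hypothetical software counter: returns minus calls
    underflows = 0
    for event in events:
        if event == "C":
            rsb = min(rsb + 1, RSB_DEPTH)          # deeper calls overwrite old entries
            net_returns = max(net_returns - 1, 0)
        elif event == "R":
            net_returns += 1
            if depth_tracking and net_returns >= STUFF_AT:
                rsb = RSB_DEPTH                    # refill with speculation-trap entries
                net_returns = 0
            if rsb == 0:
                underflows += 1                    # RSB empty: other predictors take over
            else:
                rsb -= 1
    return underflows


# A call chain 20 deep, then 20 returns.
workload = "C" * 20 + "R" * 20
print(count_underflows(workload, depth_tracking=False))
print(count_underflows(workload, depth_tracking=True))
```

Without tracking, the last four returns of the 20-deep chain run off the end of the 16-entry buffer. With the toy counter, the buffer is refilled at the 12th return, so none of them do.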

Is there a workaround? Sure, but it’s not safe for servers, says Sunspark:

mitigations=off in the grub kernel options.

You don’t need mitigations if you’re a server where you control every single running service and application. You don’t need mitigations if you’re [not a server—unless] you click on everything randomly and say “yes” to every prompt.
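
Before flipping that switch, it helps to know what the kernel thinks it is doing. A minimal sketch, assuming a Linux box that exposes the usual sysfs vulnerabilities directory (the retbleed entry only shows up on kernels that know about Retbleed); the function name is invented for this example.

```python
from pathlib import Path

# Standard sysfs location on Linux; absent on other systems.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory=SYSFS_DIR):
    """Return {vulnerability name: status line} for each entry in directory.

    Returns an empty dict if the directory does not exist.
    """
    directory = Path(directory)
    status = {}
    if not directory.is_dir():
        return status
    for entry in sorted(directory.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "(unreadable)"
    return status


if __name__ == "__main__":
    for name, state in read_mitigation_status().items():
        print(f"{name}: {state}")
```

Run it before and after changing boot options to confirm the change actually took effect.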

I do not. Nate Amsden has his mind blown:

I was surprised at the hit … for the original Spectre/Meltdown performance hits on Linux. … I have been actively avoiding the fixes for years whether it be firmware updates or microcode updates and using “spectre_v2=off nopti” as kernel boot options. … I have always felt these are super overhyped scenarios.
…
I was given a couple of Lenovo T450s laptops … and I put Linux Mint 20 on them. They both had 12G of ram and SSDs, but both felt so slow they were practically unusable. [And then] I realized I hadn’t put those “spectre_v2=off nopti” settings in the kernel yet. I put them in and it was amazing the improvement in performance.
…
I really couldn’t believe the improvement in performance. For just basic desktop tasks it blows my mind.
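
If you inherit a machine and want to know whether someone has already done this, the boot options are visible in /proc/cmdline. A small sketch, assuming Linux; the helper name and the short list of options it looks for are choices made for this example, not an exhaustive set.

```python
# Mitigation-related kernel parameters mentioned in this post, plus a few
# closely related ones. Not a complete list.
WATCHED = ("mitigations", "spectre_v2", "retbleed", "pti", "nopti")


def mitigation_options(cmdline):
    """Pick mitigation-related options out of a kernel command line string.

    Options of the form key=value map to the value; bare flags map to True.
    Simplification: splits on whitespace, so quoted values containing
    spaces are not handled.
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in WATCHED:
            found[key] = value if sep else True
    return found


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(mitigation_options(f.read()))
```

An empty result means none of the watched options are set, so the kernel is using its defaults.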

Hey, Intel! I want my money back. Tough cookies, thinks mainframe lover oiaohm:

With CPU defects, in most cases, as long as the CPU will still work the same as what it did when you bought it you cannot get a refund. … x86 servers are basically the cheap choice with no ongoing payments.

When it comes to security with x86 you are basically buying it “as is” … so sometimes you get bitten. (Yes, with the recent x86 CPU design faults people have been bitten a lot.)

But this Anonymous Coward blames VMware:

This problem has been known and discussed for many months now and affects all sorts of subsystems. They are working on mitigations that hopefully have less of a performance impact in the future.

VMware is late to the game, ignorant, or just ****posting at this point. They were told to stop on the mailing list. Nothing they posted is new information.

Meanwhile, cabirum waxes philosophical:

Is it even possible to have speculative execution and not be vulnerable to related exploits?

And Finally:

In which Simon uses the word “perfunctory”

You have been reading SB Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi or [email protected]. Ask your doctor before reading. Your mileage may vary. E&OE. 30.

Image sauce: Elina Emurlaeva (via Unsplash; leveled and cropped)