Nick Walter, Principal Architect
At House of Brick we’ve been closely following the news regarding the recently revealed Meltdown and Spectre vulnerabilities, which affect modern CPU architectures. Meltdown, as first widely reported by The Register, concerns a flaw in speculative execution in Intel CPUs that may allow user-level code to read kernel-level memory in certain cases. The similar Spectre vulnerability is more subtle, as it revolves around manipulating the contents of CPU-level buffers, but so far it has only been shown to retrieve contents from user-level memory and not kernel-level memory. These vulnerabilities raise the frightening possibility that many protected memory areas, possibly containing highly sensitive information, could be open to access by malicious programs. Worse yet is the possibility that the vulnerability could cross guest/hypervisor boundaries in virtualized systems and allow a single guest VM running malicious code to gain access to protected memory areas of other guest VMs, or even of the hypervisor itself.
Potential Performance Implications
Patches have already been released by all major OS and hypervisor vendors to fully mitigate Meltdown and partially protect against Spectre. An unfortunate side effect is that the mitigations carry a performance penalty for certain workloads. As the Linux kernel is open source, the exact mitigation strategy, known as Page Table Isolation (PTI), is public and available for inspection. What we can glean from the PTI patches is that the kernel can no longer trust the hardware-level protection of kernel memory from user space processes, and thus kernel memory can no longer be mapped into the virtual address space of a user-level process. This means that system calls made by user-level processes become a good deal more expensive for the CPU to execute, because the processor-level Translation Lookaside Buffers (TLBs) that cache virtual memory address translations have to be flushed and reloaded whenever execution crosses between user mode and the kernel. In practical terms, this translates to a slowdown for applications that make a lot of system calls to perform I/O to either the network or disk. For slow I/O subsystems, most of the time in a kernel call for I/O is actually spent performing the I/O, so the slowdown won’t be very noticeable. However, for very high-speed I/O, such as local loopback network communications, interactions with named pipes, or I/O to very fast disk or flash arrays, the slowdown becomes more prominent.
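To make the reasoning above concrete, here is a minimal Python sketch, not a rigorous benchmark. The function names time_syscalls and relative_slowdown are our own, hypothetical helpers. The first gives a rough per-call cost for a cheap system call (interpreter overhead is included, so the figure is only useful for comparing the same host before and after patching). The second applies a simple model in which the patch adds a fixed cost to every kernel entry and exit; the latency figures in the example are illustrative assumptions, not measurements.

```python
import os
import time


def time_syscalls(n_calls: int = 200_000) -> float:
    """Return the average wall-clock cost, in nanoseconds, of one cheap system call.

    The figure includes Python loop and call overhead, so treat it as a
    relative number for before/after comparisons on the same machine.
    """
    start = time.perf_counter_ns()
    for _ in range(n_calls):
        os.getppid()  # very little kernel work, so entry/exit cost dominates
    elapsed = time.perf_counter_ns() - start
    return elapsed / n_calls


def relative_slowdown(io_time_ns: float, entry_exit_ns: float, added_ns: float) -> float:
    """Fractional slowdown of one I/O call under a simple additive model.

    io_time_ns    -- time spent actually performing the I/O
    entry_exit_ns -- kernel entry/exit cost before patching
    added_ns      -- extra entry/exit cost assumed to come from the patch
    """
    return added_ns / (io_time_ns + entry_exit_ns)


if __name__ == "__main__":
    print(f"approx. cost per system call: {time_syscalls():.0f} ns")

    # Illustrative assumptions only: 10 ms per I/O for a slow disk,
    # 20 microseconds for a fast flash array, 500 ns added per call.
    for label, io_ns in (("slow disk", 10_000_000), ("fast flash", 20_000)):
        pct = relative_slowdown(io_ns, entry_exit_ns=100, added_ns=500) * 100
        print(f"{label}: about {pct:.3f}% slower per call")
```

Under this model the same added cost per call is negligible when the device dominates the call time and becomes visible only when the I/O itself is very fast, which matches the pattern described above.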
The exact nature and degree of that slowdown will be a critical issue for many House of Brick customers. Some early reports speculate that slowdowns could range from 5% to 30% for certain workloads. While these numbers are only general speculation for most platforms, benchmarking tests have already been performed on a Linux system patched with the fixes. Michael Larabel published the first Linux benchmark numbers of a patched system at Phoronix. These benchmarks suggest that the performance hit is, as expected, minimal except in I/O-heavy tests against very fast storage. Unfortunately for many House of Brick customers, the kind of high-performance business-critical workloads that merit costly all-flash array storage, such as critical SQL Server or Oracle databases, are the workloads where a noticeable performance hit is the least palatable.
The news of this security vulnerability and the potential performance implications of the fix can lead to a dilemma for IT departments already dealing with performance challenges in business-critical workloads. This has implications for many aspects of IT operations. The looming threat of a potential out-of-cycle emergency patch is only the first issue to consider. There’s also the longer-term issue of when and how to refresh hardware with platforms that don’t have the hardware vulnerability.
Key Considerations and Recommendations
The key questions to consider, in House of Brick’s estimation, are:
- Can the patching be deferred? Immediate patching of high threat issues is an IT industry best practice for very good reason, but it may be tempting to defer patching on business-critical systems that are already operating at (or near) the edge of their performance envelope. House of Brick recommends that any decision to defer patching should be well documented and made in close consultation with security teams, compliance officers, and legal counsel.
- Should hardware refreshes be postponed? Any House of Brick customer on the brink of ordering new server hardware should definitely take a moment to consider whether the purchase is imminently needed, or if it can be deferred until CPUs without the hardware vulnerability are available. Intel has, as of yet, released no public statement on when chips without the vulnerability will be released.
- Will this impact cloud workloads or cloud migration plans? Public cloud environments present the largest area of exposure to these vulnerabilities, because there is no way to be sure that a particular virtual instance isn’t sharing hardware with a malicious one that might be running Meltdown or Spectre attack code. Major cloud providers like Amazon Web Services (AWS) and Microsoft Azure are well aware of this issue and have already begun patching their hypervisors. AWS/Azure customers will most likely need to perform an out-of-cycle emergency patch of the guest OS on their own cloud instances. There have already been threads in AWS forums about performance reductions noticed after instances were rebooted for patching, indicating that the performance impacts are already affecting certain workloads. House of Brick doesn’t recommend abandoning current public cloud workloads or planned migrations to the cloud. Rather, House of Brick suggests evaluating whether any planned or in-progress migrations can be deferred, in order to take stock of the performance impacts and ensure that adequate capacity is being provisioned for the workloads being migrated. As always, House of Brick stands ready to advise customers looking to optimize the performance of their business-critical applications.
- How should potential performance impacts be managed? The worst way to find out that a business-critical workload cannot perform acceptably is to wait until users are affected or an operational outage occurs. When patching for Meltdown and Spectre, it is strongly recommended to be as proactive as possible in assessing the performance impacts of these patches. Ensure that good before/after metrics on system performance are available, so that CPU utilization, I/O latency, and overall system throughput can be compared between the pre-patch and post-patch states. Quickly identifying which workloads suffer the worst slowdowns from the Meltdown/Spectre patches gives IT staff the opportunity to provision extra capacity or redistribute workloads, mitigating the performance hit before slowdowns or outages become visible to end users.
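As one way to act on the before/after comparison recommended in the last item, here is a minimal Python sketch. The metric names, the 5% threshold, and the sample values are all hypothetical placeholders. In practice the samples would come from whatever monitoring tooling is already collecting CPU, I/O latency, and throughput data.

```python
from statistics import mean

# Metrics where a larger value is good; for all others a larger value is worse.
HIGHER_IS_BETTER = {"throughput_ops_per_s"}


def percent_change(before, after):
    """Percent change in the mean of `after` relative to the mean of `before`."""
    base = mean(before)
    return (mean(after) - base) / base * 100.0


def find_regressions(pre, post, threshold_pct=5.0):
    """Return {metric: percent change} for metrics that worsened by more than threshold_pct."""
    regressions = {}
    for name, before in pre.items():
        change = percent_change(before, post[name])
        # Flip the sign for metrics where a drop is the bad direction.
        worse_by = -change if name in HIGHER_IS_BETTER else change
        if worse_by > threshold_pct:
            regressions[name] = round(change, 1)
    return regressions


# Illustrative samples only, not measurements from any real system.
pre_patch = {
    "cpu_busy_pct": [40.0, 42.0, 38.0],
    "io_latency_ms": [1.0, 1.2, 0.8],
    "throughput_ops_per_s": [1000.0, 1000.0, 1000.0],
}
post_patch = {
    "cpu_busy_pct": [41.0, 42.0, 40.0],
    "io_latency_ms": [1.2, 1.4, 1.0],
    "throughput_ops_per_s": [900.0, 900.0, 900.0],
}

print(find_regressions(pre_patch, post_patch))
```

With the sample data above, the latency and throughput metrics are flagged while the small rise in CPU utilization stays under the threshold. Comparing means is deliberately simple; for noisy workloads, percentiles or longer sampling windows would likely give a fairer picture.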
House of Brick realizes that these vulnerabilities present an extra challenge to IT departments, many of which are already fully committed to day-to-day operational activities. However, with proper management, these vulnerabilities shouldn’t present a catastrophic performance or security threat. Performance analysis is one of our core competencies here at House of Brick, so anyone concerned about analyzing performance, or about optimizing to offset the performance penalty of the patches for these vulnerabilities, should feel free to reach out to us for assistance.