Understanding vNUMA (Virtual Non-Uniform Memory Access)

Sometime last year, I had a conversation about some interesting concepts around vSphere design, and vNUMA came up along with some considerations for large virtual machines. It’s funny how starting a conversation on one topic can lead to something completely different and address some of the issues, or even expose some of the constraints, that we might otherwise miss.

As I am preparing for the VCAP-DCD/DCA I figured why not just create a post explaining the basics of UMA, NUMA, vNUMA, and SMP. I find it pretty important to understand these details because they play an important part in designing VMware vSphere Environments.

 

SYMMETRIC MULTIPROCESSING (SMP)

To keep it simple, SMP architecture allows the processors in a multiprocessor server to share a single bus and memory while being controlled by a single operating system. In other words, applications that are optimized for multi-threading can take advantage of servers that are equipped with multiple processors. Most modern Unix and Windows operating systems support SMP architecture. The same idea applies to CPUs with multiple cores: each core is treated as a separate CPU, which adds further multi-threading benefit for applications and operating systems that support it.

However, as you’ve probably realized, sharing memory and bus can lead to a performance issue if we start adding more processors.

SMP architecture within a server


Uniform Memory Access (UMA), which may also be known as Shared Memory Architecture (SMA), is where the CPUs within a single server all share the same memory uniformly.

 

NON-UNIFORM MEMORY ACCESS (NUMA)

NUMA architecture works by grouping or clustering CPU and Memory together to create what we call a NUMA node. The key benefit of NUMA architecture is to reduce memory latency and increase application memory performance by grouping memory and CPU together in multiprocessor servers.

Now, the memory within the same NUMA node becomes local to the CPU and provides dedicated memory access for that particular CPU. Every time the CPU tries to access memory in a different NUMA node, it is considered remote access. Remote access means higher latency and higher latency can translate to reduced application memory performance.
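To put a rough number on the cost of remote access, here is a minimal sketch that computes the average memory latency for a workload as a weighted mix of local and remote accesses. The latency figures (80 ns local, 130 ns remote) are made-up illustrative defaults, not measurements from any particular server.

```python
def average_memory_latency_ns(local_fraction, local_ns=80.0, remote_ns=130.0):
    """Weighted average latency for a mix of local and remote memory accesses.

    local_fraction: share of accesses served from the CPU's own NUMA node (0..1).
    local_ns / remote_ns: hypothetical latencies, chosen only for illustration.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    for fraction in (1.0, 0.75, 0.5):
        latency = average_memory_latency_ns(fraction)
        print(f"{fraction:.0%} local accesses: {latency:.1f} ns on average")
```

The more accesses stay inside the local NUMA node, the closer the average gets to the local latency, which is the whole point of keeping a workload and its memory on the same node.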

NUMA architecture

Numa_System_vmdc

 

It is also important to understand why NUMA plays a key role when it comes to virtualization. In fact, Frank Denneman wrote a great article which I will link here. He also has an article on NUMA Scheduling if you’re interested, you can find it here.

 

VIRTUAL NON-UNIFORM MEMORY ACCESS

In virtualized environments, the hypervisor has to manage compute resources really well for the sake of virtual machine performance. Remember, we can place many virtual machines on one host (that’s the idea behind server virtualization), but those virtual machines still need to run just as well as if they were running on a physical host. Virtual Non-Uniform Memory Access allows for better virtual machine placement. When we refer to placement, we’re referring to how the virtual machine is aligned with, and placed on, the NUMA nodes.

With vNUMA enabled, the hypervisor will be able to map out a reference NUMA topology of the underlying system and then present this topology to the virtual machine.

vNUMA disabled

As you can see, the virtual machine above does not really know about NUMA. There is no reference topology of the underlying NUMA environment, so the virtual machine will use any CPU/Memory for its operating system and applications.

Here’s another diagram with vNUMA enabled

As you can see from the second image, the virtual machine nicely aligns with the NUMA node, allowing for better performance. Where this gets interesting is when we’re dealing with large virtual machines. By large, I mean virtual machines that have more vCPUs or vCores than the underlying physical CPU has cores. We will talk about it next.

For vNUMA support, we have to be at vSphere 5 or later, and the virtual machine needs to be at hardware version 8 or later.
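If the guest is Linux, one way to check which topology it has actually been presented with is to read the NUMA node information the kernel exposes under /sys/devices/system/node. The sketch below assumes a Linux guest; the function names are my own, and it simply returns an empty result on systems where that directory does not exist.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def guest_numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} as seen by the guest, or {} if unavailable."""
    topology = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return topology
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            topology[node_id] = parse_cpulist(cpulist.read_text())
    return topology


if __name__ == "__main__":
    nodes = guest_numa_topology()
    if not nodes:
        print("No NUMA information exposed on this system")
    for node_id in sorted(nodes):
        print(f"node {node_id}: CPUs {nodes[node_id]}")
```

A guest that reports a single node containing every CPU has not been given a multi-node topology, while a guest that reports several nodes can make its own NUMA-aware scheduling and memory decisions.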

 

DEALING WITH LARGE VIRTUAL MACHINES (MONSTER VM)

One of the main reasons for vNUMA is to address large virtual machines that have more vCPUs/vCores than a single NUMA node can provide. For example, let’s say that you have a NUMA node with 1 CPU and 8 cores, but your monster virtual machine requires 1 vCPU with 16 cores. Without vNUMA enabled, your virtual machine will access any available CPU/Memory regardless of the latency, because it sees these resources as one big pool of available Memory and CPU. Such placement can result in poor application performance.

By enabling vNUMA, the large virtual machines will be presented with the underlying NUMA nodes and placed intelligently.
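To make the idea concrete, here is a simplified sketch of how a wide virtual machine could be divided into virtual NUMA nodes that each fit inside one physical node. This is my own illustrative model with a hypothetical function name, not the actual logic ESXi uses when it sizes vNUMA nodes.

```python
import math


def split_vcpus_across_nodes(vcpus, cores_per_node):
    """Return the number of vCPUs placed in each virtual NUMA node.

    Uses the fewest nodes that can hold all vCPUs and spreads the vCPUs
    as evenly as possible. Simplified model, not the ESXi scheduler.
    """
    if vcpus < 1 or cores_per_node < 1:
        raise ValueError("vcpus and cores_per_node must be positive")
    nodes = math.ceil(vcpus / cores_per_node)
    base, extra = divmod(vcpus, nodes)
    return [base + 1 if i < extra else base for i in range(nodes)]


if __name__ == "__main__":
    # The example from the text: 16 vCores on hosts with 8 cores per NUMA node.
    print(split_vcpus_across_nodes(16, 8))
    # A virtual machine that fits in a single node needs no splitting.
    print(split_vcpus_across_nodes(8, 8))
```

In the 16-core example from above, this model yields two virtual nodes of 8 vCPUs each, so each half of the virtual machine can be kept close to the memory of one physical NUMA node.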

You can refer to the two diagrams above for an illustration of vNUMA.

 

CONCLUSION

vNUMA plays an important part when dealing with large virtual machines. It’s important that we understand the inner workings of such architecture because, let’s be realistic, benchmarking virtual machine or application performance is a lot more than just testing memory and CPU. There has to be architecture in the background, like NUMA, that helps optimize the communication between the virtual machines and the CPU/Memory of the servers.

Andrey Pogosyan

Andrey Pogosyan is a Virtualization Architect whose focus is on infrastructure virtualization involving mainly VMware and Citrix products. Having worked in the IT industry for 10+ years, Andrey has had the opportunity to fulfill many different roles ranging from Desktop Support all the way up to Architecture and Implementation. Most recently, Andrey has taken a great interest in the datacenter technology stack encompassing Virtualization, mainly VMware vSphere\View, Citrix XenApp\XenDesktop and Storage (EMC, HP, NetApp).

10 Responses

  1. Awesome article Andrey! Thanks for sharing

  2. Great post Andrey, well explained.

    Thought I would share the fact that if you enable CPU HotAdd, this disables vNUMA. See KB http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2040375

  3. Gabriel Santamaria says:

    Very clear and concise, nice visuals

  4. Vincent says:

    Hello Andrew,

    Great blog about vNUMA. I’ve still one thing that’s not clear to me.

    By using the default setting of Node Interleaving (disabled), the system will build a System Resource Affinity Table (SRAT).
    ESX uses the SRAT to understand which memory bank is local to a pCPU and tries to allocate local memory to each vCPU of the virtual machine.
    By using local memory, the CPU can use its own memory controller and does not have to compete for access to the shared interconnect (bandwidth) and reduce the amount of hops to access memory (latency).
    Source: http://frankdenneman.nl/2010/12/28/node-interleaving-enable-or-disable/

    Is vNUMA enabled/disabled based on the availability of NUMA architecture, or does it depend on whether node interleaving is enabled or disabled?

    E.g. vNUMA is enabled when a VM has 9 vCPUs or more while node interleaving is disabled (by default) and ESXi uses SRAT.

    Cheers,
    Vincent

    • Node interleaving is typically introduced with NUMA ready servers. What that means is that aside from just having NUMA capable server, the option of Node Interleaving should also be disabled. This will force the ESXi host to use SRAT to better place virtual machines on the correct NUMA node.

      Node Interleaving simply lets the CPU choose where to place the memory, so if you think about it, when it is disabled, ESXi will need to rely on SRAT to properly place the virtual machine on the correct NUMA node.

      In some cases, enabling Node Interleaving can increase performance, but not in the case of ESXi, where you’re hosting multiple virtual machines.

      NUMA architecture + Interleaving Disabled = vNUMA / SRAT
