Sunday, January 31, 2016

High CPU in Windows VM

Summary: When a VMware resource pool constrains CPU for a VM in the pool, the guest OS can report high CPU usage while VMware shows the VM as not particularly busy. CPU Ready % in esxtop is high for the VM. The resolution is to raise the CPU limit in the resource pool settings, power off some VMs, or check for processes in the VMs using excessive CPU.

I had been struggling with a Windows 2012 R2 VM running on VMware. I have a few hundred of these, but this one happened to be a domain controller and was painful to remote desktop to. Further, it runs on a hyper-converged Nutanix host in India. I knew the host and storage were fine. I even had the fine folks at Nutanix help look at it, just in case there was some odd storage issue with this one VM.

To be honest, I have a couple more VMs experiencing issues. One thing we did find was that CPU Ready percent was often high. One of the traps we fall into is seeing ample free CPU on the host, around 40% in this case, and assuming all is well. But free CPU on the host does not mean VMs are not waiting for CPU. Clearly mine were, with 10 or 15% Ready.
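esxtop shows Ready as a percentage directly, but the vCenter performance charts report CPU Ready as a summation in milliseconds. A minimal sketch of the usual conversion, assuming the real-time chart's 20-second sample interval; the function name and the per-vCPU averaging are my own choices for illustration:

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage.

    ready_ms   -- summation value read from the performance chart
    interval_s -- chart sample interval in seconds (20 for real-time)
    vcpus      -- divide by the vCPU count to get a per-vCPU average
                  (an assumption; use 1 to keep the VM-wide figure)
    """
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


# 2000 ms of ready time in a 20 s sample is 10% Ready
print(cpu_ready_percent(2000))
```

A value of 3000 ms on the real-time chart works out to 15% by the same arithmetic, which is the range I was seeing.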

I manually control memory and CPU for each of my resource pools. Basically, each resource pool gets a percentage of usable CPU and memory. I deduct 20% of both CPU and memory from the total of all hosts in the cluster, then calculate what that means in KB and MHz and allocate these per resource pool. This way, I can ensure the hardware never exceeds 80% resource utilization. I had to create a spreadsheet to calculate these values for me. It is tedious, and when I add a host to the cluster or, worse, a new resource pool, there is a lot of updating to the spreadsheet and then updating resource pool settings.
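The spreadsheet arithmetic is simple enough to sketch in a few lines. This is only an illustration of the calculation described above, not my actual spreadsheet; the host sizes, pool names, and percentages are made-up example values:

```python
def pool_allocations(host_mhz, host_mem_kb, pool_pct, reserve_pct=20):
    """Split usable cluster capacity across resource pools.

    host_mhz    -- CPU capacity of each host in MHz
    host_mem_kb -- memory capacity of each host in KB
    pool_pct    -- pool name -> percent of usable capacity (should sum to 100)
    reserve_pct -- percent of total capacity held back (20 keeps usage at 80%)
    """
    usable_mhz = sum(host_mhz) * (100 - reserve_pct) // 100
    usable_kb = sum(host_mem_kb) * (100 - reserve_pct) // 100
    return {
        name: (usable_mhz * pct // 100, usable_kb * pct // 100)
        for name, pct in pool_pct.items()
    }


# Hypothetical cluster: three hosts with 40,000 MHz and 256 GB each
limits = pool_allocations(
    host_mhz=[40000, 40000, 40000],
    host_mem_kb=[268435456] * 3,
    pool_pct={"prod": 50, "test": 30, "infra": 20},
)
for name, (mhz, kb) in limits.items():
    print(f"{name}: {mhz} MHz, {kb} KB")
```

Adding a host means appending to the two lists, and adding a pool means rebalancing the percentages, after which every pool limit changes and has to be re-entered in the resource pool settings.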

So, I checked the resource pool usage for this problem VM, and it is close to the max, with around 5% of CPU resources free. So, I have this VM that is basically idle but using 90% CPU in the OS. VMware performance stats show it as rather idle. Remote Desktop is painfully slow, even just to open Control Panel. It is performing poorly as a DC. Yet there are plenty of free resources on the host.

I see the VM has an exceedingly high and pretty constant CPU Ready %. What could cause this, I thought. Then I realized the only thing that would create this condition is the resource pool limit constraining the VMs in the pool, despite the pool looking OK.

So, I add 20,000 MHz to the resource pool, and the CPU usage for the pool in the summary display makes this weird jump. I can see now that the resource pool was actually under duress and the VMs were being heavily constrained. Within a few minutes the VM returns to normal. The CPU % in Windows drops to a normal few percent, RDP is working normally, and the CPU Ready % for the VM is under 1% again.

