ESXi host performance issues and the importance of thermal paste

We haven’t had our HP BL460c Gen8s with the new 12-core Xeon E5-2697 v2 processors for long. Last week we started to get e-mails from the help desk that users were complaining about sluggish performance in Citrix. Oddly, all of the XenApp VMs happened to live on the same ESXi host. I say oddly because performance issues rarely line up as neatly as they did here. We immediately evacuated the host and admitted it to the infirmary cluster.

The host was powered down over the weekend and we ran 40 loops of full HP Insight Diagnostics hoping to catch a CPU or memory error, but no such luck; all tests passed. Our next test yielded results we couldn’t quite explain. I loaded the host up with 12 Stress Linux VMs with four vCPUs each and ran the following command on all of them:

stress --cpu 2000 --verbose

Normally, this would put enough of a load on any host to peg the CPU usage at 100%, but not this tough guy. We could not push the CPU usage much beyond 50%. Ok. Game on, Rambo!

stress --cpu 10000 --verbose

The host still refused to go much beyond 50%. This didn’t make any sense; the host should have buckled under this sort of test.

The host won't go above 50%

As an aside: when we started to look at the load on the individual cores we weren’t sure what to make of it. When we pushed the CPU over the edge we noticed the two sockets clamping down to ~25% and ~30% respectively. This didn’t make any sense at first, but we believe we pushed the processors to the point where there were no cycles left for hyper-threading to do anything, rendering 24 of the 48 logical processors useless. Try it on one of your healthy test hosts with hyper-threading and let me know what you see. It will look the same.
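The 24-of-48 figure follows from the host's topology. Here is a quick sketch of that arithmetic, using only numbers given in the post (two sockets, 12 cores each, hyper-threading on, 12 stress VMs with four vCPUs each). The constant and function names are illustrative.

```python
# Host topology from the post: a two-socket BL460c Gen8 with 12-core
# Xeon E5-2697 v2 processors and hyper-threading enabled.
SOCKETS = 2
CORES_PER_SOCKET = 12
THREADS_PER_CORE = 2

# Test load from the post: 12 stress VMs with four vCPUs each.
STRESS_VMS = 12
VCPUS_PER_VM = 4


def logical_processors(sockets, cores_per_socket, threads_per_core):
    """Number of logical processors the hypervisor sees."""
    return sockets * cores_per_socket * threads_per_core


physical_cores = SOCKETS * CORES_PER_SOCKET
logical = logical_processors(SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE)
hyperthread_siblings = logical - physical_cores
total_vcpus = STRESS_VMS * VCPUS_PER_VM

print(f"physical cores:       {physical_cores}")
print(f"logical processors:   {logical}")
print(f"hyper-thread siblings: {hyperthread_siblings}")
print(f"stress vCPUs:         {total_vcpus}")
```

The 48 stress vCPUs match the 48 logical processors one-to-one, so the test load was sized to occupy every logical processor on the host.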

The clamping effect didn't make sense

We noticed the host would not go much over 240 watts and suspected a power issue, but the host’s BIOS settings looked OK.

  • Power Profile = Maximum Performance
  • HP Power Regulator = HP Static High Performance Mode
  • Intel QPI Link Power Management = Disabled
  • Minimum Processor Idle Power State = No C-States
  • Minimum Processor Idle Power Package State = No Package State

HP recommends allowing OS control for power on VMware, but that was not the issue in this exercise.

HP wasn’t much help. They looked at the Active Health System logs, the test results, and the VMware logs and determined this was definitely not a hardware issue. HP said it must be a VMware configuration issue (I’m still looking for the setting to cut my host performance in half). In order to validate or disprove this theory we pulled the disks from the faulty host and placed them in a brand new, identical spare. As expected, our problems went away when we ran the exact same tests.

The new hardware resolved the issues

The new host could push past the previous 240 watt barrier

We thought this still had to be a BIOS configuration issue. My teammate told me that we had a script in our Altiris imaging environment that would pull all the BIOS settings and write them out to an XML file. We were going to diff the XML files from the faulty and the new hosts. I’m thankful we went this route, because whatever pre-boot Linux distro was loaded started to throw the following error: “Package temperature above threshold, cpu clock throttled”
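For anyone who wants to try the BIOS comparison we had planned, here is a minimal sketch of diffing two settings exports. The post does not show the format the Altiris script produces, so the `<Setting name="..." value="..."/>` layout, the setting names, and the sample values below are all made up for illustration and would need to be adapted to the real files.

```python
import xml.etree.ElementTree as ET


def load_settings(xml_text):
    """Return {name: value} from a hypothetical <Setting name= value=/> export."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): el.get("value") for el in root.iter("Setting")}


def diff_settings(faulty, healthy):
    """Map each differing setting to (faulty_value, healthy_value).

    A value of None means the setting is absent from that export.
    """
    names = sorted(set(faulty) | set(healthy))
    return {
        name: (faulty.get(name), healthy.get(name))
        for name in names
        if faulty.get(name) != healthy.get(name)
    }


# Made-up sample exports, NOT real output from the Altiris script.
FAULTY_XML = """<Bios>
  <Setting name="PowerProfile" value="Maximum Performance"/>
  <Setting name="MinProcIdlePowerState" value="C6"/>
</Bios>"""

HEALTHY_XML = """<Bios>
  <Setting name="PowerProfile" value="Maximum Performance"/>
  <Setting name="MinProcIdlePowerState" value="No C-States"/>
</Bios>"""

differences = diff_settings(load_settings(FAULTY_XML), load_settings(HEALTHY_XML))
for name, (bad, good) in differences.items():
    print(f"{name}: faulty={bad!r} healthy={good!r}")
```

Comparing by setting name rather than running a line-by-line text diff keeps the result stable even if the two exports list their settings in a different order.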

This makes sense. After looking in iLO we saw both CPUs were running at 69°C (156°F), versus ~40°C (104°F) on other hosts running normally. Ambient temperature was 15°C (59°F). Yes, our data center is almost cold enough to be a walk-in beer cooler.

It didn’t seem to matter whether the CPUs were under load or idle; the temperature would not stray from 69°C. This had to be an issue with the temperature sensors, I thought. So we pulled the host and removed the heat sinks so we could look at the CPUs through a thermal camera we borrowed from engineering. The results were inconclusive because the CPUs had enough time to cool while we dismantled the server. However, upon removing the heat sinks I noticed how little care had been taken in applying the thermal paste.

We require more thermal paste

After performing the inconclusive thermal test I cleaned off the old paste and gave the prescribed 2.5ml of paste per processor.

2.5ml of paste applied to each processor

We brought the host back to life and began to run stress tests once more. I couldn’t believe what we saw…

New Paste yielded perfect results

CPU usage was at 100% and temperatures stayed in the 40°C range for over an hour.

Temps have returned to normal

Had I not been there, I wouldn’t have believed it. But this makes perfect sense. If the CPUs can’t properly transfer their heat to the heat sinks, they are forced to clock down to keep from burning up. To confirm this, we found a second host that had one processor running at 40°C and another at 67°C.

Can you guess which was which?

My takeaways from this are:

  1. Thermal paste really can impact performance!
  2. I wish the HP Active Health System logs would include when CPU’s clock down to prevent overheating.
  3. CPU clock throttled error messages don’t appear in the ESXi logs.
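Since neither the Active Health System logs nor the ESXi logs surfaced the throttling, one workaround is to watch the CPU temperatures themselves. Below is a minimal sketch that flags any CPU running well above the ~40°C we saw on healthy hosts. The 15°C margin is an arbitrary assumption, not a documented throttle threshold, and the host and CPU labels are made up for illustration. The readings are the ones from this post.

```python
def hot_cpus(readings_c, baseline_c=40.0, margin_c=15.0):
    """Return the names of CPUs whose temperature exceeds baseline + margin.

    readings_c maps a CPU label to its temperature in degrees Celsius.
    baseline_c is the temperature seen on healthy hosts (~40 C in the post).
    margin_c is an arbitrary allowance chosen for this sketch.
    """
    limit = baseline_c + margin_c
    return sorted(name for name, temp in readings_c.items() if temp > limit)


# Temperatures reported in the post; the labels are hypothetical.
readings = {
    "faulty-host/cpu1": 69,
    "faulty-host/cpu2": 69,
    "second-host/cpuA": 40,
    "second-host/cpuB": 67,
}

suspects = hot_cpus(readings)
print(suspects)
```

A fixed baseline is used instead of comparing the two sockets within one host because both CPUs on our first host ran at 69°C, so a socket-to-socket comparison would have missed it.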

Matt Bradford

18 Comments

  1. Definitely interesting. I also wonder about the quality of the thermal paste being used. It looks as though the white paste in the original configuration ran down the CPU as if it melted when heated. The paste that you used seems to be, correct me if I’m wrong, Arctic Silver. I don’t have any idea what the other white stuff would be. Did you by chance take a picture of the coverage using your five dot method?

    • Hi Matt,
      I don’t know what the former paste was, and I’ll have to get back to you on the new stuff. But I’ll bet you’re right. We had a few other hosts with similar but less pronounced issues and I used AOS-silicone XT on those. I don’t have any pictures of the coverage but I checked one and it seemed more than adequate and will only get better as heat and pressure are applied. The five dot method was suggested by two veteran HP resources, so who am I to argue? 🙂 Thank you for your comment!

  2. Hi Matt,

    well done on this.

    I am trying to do the same test as we are having similar problems with our Citrix environment.

    I am kind of new to server troubleshooting, but how did you find out what the temperature stats were? Is there an option in iLO for this?

    Your assistance is much appreciated.

    • Hi Bala,
      You can get the temperature readings from both iLO and the vSphere client.

      In iLO 3 and 4 click on System Information and select the Temperatures tab.

      In the vSphere Web Client select your host, click on the Monitor tab, select Hardware Status, and expand Temperature.
      In the vSphere C# Client select your host, click on the Hardware Status tab, and expand Temperature.

      Best of luck!
      Matt

  3. Hi Matt,

    Thanks for getting back to me on this. I checked a couple of hosts where customers have been experiencing slowness. At the moment the CPU is not hitting 100%, which is not a surprise, but the temperature readings are coming up at 40°C, which is the same as your reading.

    do you think the server is behaving the same as yours?

    Cheers

  4. Hi Bala,

    There may be something else going on in your environment. If the CPUs are reading 40°C then they shouldn’t be clocking down. One of the simpler things to check would be to make sure the power profile of the host is at the very least set to maximum performance. Otherwise I’d be looking closely at the performance of the VMs and hosts to identify where the constraints are.

    Hope this helps.
    Matt

  5. Nice catch, Matt. Quality parts (and specifically paste) are something most of us probably take for granted. Scanning through my farm of G8s, I do see some CPUs that are potentially experiencing this issue, especially when three out of four CPUs in a specific blade are at 40 and the fourth is in the high 50s or low 60s.

    These stats can also be polled via SNMP to ESXi (if the HP management suite is loaded) or iLO directly under the .1.3.6.1.4.1.232.6.2.6.8.1.4.* OID. A dual socket BL460 series will use .2 and .3 to represent CPU 1/2 respectively (and .2 thru .5 for a quad socket BL660).
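As a rough sketch of the index mapping described above, the snippet below builds the per-CPU OID strings from the base OID, starting at index 2 for CPU 1. It only constructs strings and makes no SNMP calls. The trailing `*` in the OID above may hide additional index components, so treat the exact suffix layout as an assumption and confirm it with a walk of the base OID on your own hardware. The function and constant names are illustrative.

```python
# Base OID quoted in the comment above (without the trailing wildcard).
CPU_TEMP_BASE_OID = ".1.3.6.1.4.1.232.6.2.6.8.1.4"

# Per the comment, CPU 1 is at index .2, CPU 2 at .3, and so on.
FIRST_CPU_INDEX = 2


def cpu_temp_oids(socket_count):
    """Map 'CPU n' labels to OID strings for a blade with socket_count sockets.

    Assumes the CPU index is appended directly to the base OID; verify this
    against a real walk before relying on it.
    """
    if socket_count < 1:
        raise ValueError("socket_count must be at least 1")
    return {
        f"CPU {n}": f"{CPU_TEMP_BASE_OID}.{FIRST_CPU_INDEX + n - 1}"
        for n in range(1, socket_count + 1)
    }


# Dual-socket BL460 series and quad-socket BL660, as described above.
for label, oid in cpu_temp_oids(2).items():
    print(label, oid)
for label, oid in cpu_temp_oids(4).items():
    print(label, oid)
```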

  6. Is there a defined CPU temperature that will trigger a throttle? I have several BL460c blades with CPU temps of ~58°C, and many in the low 40s.

    Do you have the specific Linux distro that showed that “cpu throttled” error? Or any other way to tell if the CPU is being throttled on a BL460c?

    Thanks!

  7. Matt, thanks for the post. It’s a bit disconcerting that ESXi didn’t surface any warning around the CPU throttling, but the Altiris boot image did. You think this is just a gap in the ESXi alerting?

  8. Would using a tool inside a VM allow you to monitor the clock speed? If all power saving options are off, the guest CPU reporting tool should match the rated CPU speed for the host. Good catch. I found one host running at 65°C and stepped down as well. Not an HP server either.