We haven’t had our HP BL460c Gen8s with the new Xeon E5-2697 v2 12-core processors for long. Last week we started to get e-mails from the help desk that users were complaining about sluggish performance in Citrix. Oddly, all of the affected XenApp VMs happened to live on the same ESXi host. I say oddly because performance issues rarely line up as neatly as they did here. We immediately evacuated the host and admitted it to the infirmary cluster.
The host was powered down over the weekend and we ran 40 loops of full HP Insight Diagnostics hoping to catch a CPU or memory error, but no such luck; all tests passed. Our next test yielded results we couldn’t quite explain. I loaded the host up with 12 Stress Linux VMs with four vCPUs each and ran the following command on all of them:
stress --cpu 2000 --verbose
Normally, this would put enough of a load on any host to peg the CPU usage at 100%, but not this tough guy. We could not push the CPU usage much beyond 50%. Ok. Game on, Rambo!
stress --cpu 10000 --verbose
The host still refused to go much beyond 50%. This didn’t make any sense; the host should buckle under this sort of test.
As an aside: when we started to look at the load on the individual cores, we weren’t sure what to make of it. When we pushed the CPU over the edge, we noticed the two sockets clamping down to ~25% and ~30%. This didn’t make any sense at first, but we believe we pushed the processors to the point where there were no cycles left for hyper-threading to use, rendering 24 of the 48 logical processors useless. Try it on one of your healthy test hosts with hyper-threading and let me know what you see. It will look the same.
We noticed the host would not go much over 240 watts and suspected a power issue, but the host’s BIOS settings looked OK.
- Power Profile = Maximum Performance
- HP Power Regulator = HP Static High Performance Mode
- Intel QPI Link Power Management = Disabled
- Minimum Processor Idle Power State = No C-States
- Minimum Processor Idle Power Package State = No Package State
HP wasn’t much help. They looked at the Active Health System logs, the test results, and the VMware logs and determined this was definitely not a hardware issue. HP said it must be a VMware configuration issue (I’m still looking for the setting to cut my host performance in half). To validate or disprove this theory, we pulled the disks from the faulty host and placed them in a brand new, identical spare. As expected, the problems went away when we ran the exact same tests on the spare.
We thought this still had to be a BIOS configuration issue. My teammate told me that we had a script in our Altiris imaging environment that would pull all the BIOS settings and spit them out to an XML file. We were going to diff the XML files from the faulty and the new hosts. I’m thankful we went this route, because whatever pre-boot Linux distro was loaded started to throw the following error: “Package temperature above threshold, cpu clock throttled”
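If you want to check a captured console or kernel log for the same symptom, here is a minimal Python sketch. It only looks for the message text quoted above. The optional "CPUn:" prefix and the "Core" variant are my assumptions about how the kernel tags these lines, and `count_throttle_events` is a made-up helper name, not part of any tool.

```python
import re

# The message text comes from the error quoted above. The optional "CPUn:"
# prefix and the "Core" variant are assumptions, not taken from our logs.
THROTTLE_RE = re.compile(
    r"(?:CPU(?P<cpu>\d+):\s*)?"
    r"(?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled"
)


def count_throttle_events(log_text):
    """Count throttle messages in log_text, keyed by (scope, cpu number).

    The cpu number is None when the line carries no "CPUn:" prefix.
    """
    counts = {}
    for line in log_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match is None:
            continue
        cpu = match.group("cpu")
        key = (match.group("scope"), int(cpu) if cpu is not None else None)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Feed it the text of whatever log you saved from the pre-boot environment. A non-empty result means the processors reported throttling at least once while that log was being written.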
This makes sense. After looking in iLO we saw both CPUs were running at 69°C (156°F) versus ~40°C (104°F) on other hosts running normally. Ambient temperature was 15°C (59°F). Yes, our data center is almost cold enough to be a walk-in beer cooler.
It didn’t seem to matter if the CPUs were under load or idle; the temperature would not stray from 69°C. This had to be an issue with the temperature sensors, I thought. So we pulled the host and removed the heat sinks so we could look at the CPUs through a thermal camera we borrowed from engineering. The results were inconclusive because the CPUs had enough time to cool while we dismantled the server. However, upon removing the heat sinks I noticed how carelessly the thermal paste had been applied.
After performing the inconclusive thermal test I cleaned off the old paste and applied the prescribed 2.5 ml of paste to each processor.
We brought the host back to life and began to run stress tests once more. I couldn’t believe what we saw…
CPU usage was at 100% and temperatures stayed in the 40°C range for over an hour.
Had I not been there, I wouldn’t have believed it. But this makes perfect sense. If the CPUs can’t properly transfer their heat to the heat sinks, they are forced to clock down to keep from burning up. As further evidence, we found a second host that had one processor running at 40°C and the other at 67°C.
Can you guess which was which?
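A comparison like this is easy to automate once you have per-socket temperatures collected from iLO or elsewhere. The sketch below is only an illustration: the host names, the extra readings, and the 15°C margin are hypothetical, and only the 40°C and 67°C pair comes from the host described above. It flags any socket that runs well above the fleet median.

```python
import statistics


def flag_hot_sockets(temps_c, max_delta_c=15.0):
    """Return (baseline, flagged) for a {host: [socket temps in °C]} mapping.

    baseline is the median of all readings; flagged lists (host, socket index,
    temperature) for every socket more than max_delta_c above that baseline.
    """
    readings = [temp for sockets in temps_c.values() for temp in sockets]
    baseline = statistics.median(readings)
    flagged = []
    for host, sockets in temps_c.items():
        for index, temp in enumerate(sockets):
            if temp - baseline > max_delta_c:
                flagged.append((host, index, temp))
    return baseline, flagged


# Hypothetical fleet; only the 40/67 pair mirrors the second host above.
fleet = {
    "esx-test-01": [40, 41],
    "esx-test-02": [40, 67],
    "esx-test-03": [39, 40],
}
baseline, hot = flag_hot_sockets(fleet)
print(f"baseline {baseline}°C, hot sockets: {hot}")
```

Using the median as the baseline keeps one or two badly pasted processors from dragging the reference temperature up with them.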
My takeaways from this are:
- Thermal paste really can impact performance!
- I wish the HP Active Health System logs would include when CPUs clock down to prevent overheating.
- CPU clock throttled error messages don’t appear in ESXi logs.
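On that last point, a Linux live environment can at least show whether throttling has happened since boot. This is a rough sketch, assuming an Intel x86 host whose kernel exposes the per-CPU `thermal_throttle` counters under `/sys/devices/system/cpu`. That path will not exist on ESXi itself or on every Linux system, and `read_throttle_counts` is just an illustrative helper name.

```python
import glob
import os


def read_throttle_counts(sysfs_root="/sys/devices/system/cpu"):
    """Read thermal throttle counters, keyed by (cpu directory, counter file).

    Returns an empty dict when the counters are not exposed on this system.
    """
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "thermal_throttle", "*_throttle_count"
    )
    counts = {}
    for path in glob.glob(pattern):
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        counter = os.path.basename(path)
        try:
            with open(path) as handle:
                counts[(cpu_dir, counter)] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip counters that are unreadable or not plain integers.
            continue
    return counts


if __name__ == "__main__":
    for (cpu_dir, counter), value in sorted(read_throttle_counts().items()):
        if value > 0:
            print(f"{cpu_dir} {counter} = {value}")
```

Any non-zero count means the processor has throttled at least once since boot, which is the hint we were missing while the host was running ESXi.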