It was a Tuesday morning and things were beginning to ramp up in the Sys Admin office. We were doing our routine AM checks when we noticed a production VM had the word “inaccessible” next to it in the console. The VM was completely unresponsive; we couldn’t open a console in vSphere and it wasn’t responding to pings. None of the typical vSphere operations did anything. Thirty minutes later we started to get calls that one of the spam filters (a VM on a different host) had gone down. Same issue: the VM was labeled “inaccessible” and was dead to the world. Finally, a while later, a third VM on yet another host showed the same symptoms.
We hadn’t seen this sort of thing in VMware before. One thought was to remove the VMs from inventory and import them back into vCenter. But what if there was something more to it? The fact that it happened on three different hosts at three different times led us to believe this wasn’t just an environmental thing. We decided to place the hosts in maintenance mode and reboot them, after which the VMs returned to normal operation.
VMware support found the same errors in the logs on all three hosts:
2014-07-22T13:07:17.211Z cpu16:33706)HBX: 2692: Waiting for timed out [HB state abcdef02 offset 4087808 gen 201 stampUS 330309104827 uuid 53c95722-895ca497-ff59-00215a9b0500 jrnl drv 14.60] on vol 'SAN_Datastore1'
2014-07-22T13:07:21.682Z cpu4:1158818)WARNING: lpfc: lpfc_abort_handler:2989: 1:(0):0748 abort handler timed out waiting for aborting I/O xri x4c5 to complete: ret xbad0001, cmd x88, tgt_id x1, lun_id x0
2014-07-22T13:07:21.683Z cpu24:1271247)WARNING: lpfc: lpfc_abort_handler:2989: 1:(0):0748 abort handler timed out waiting for aborting I/O xri x513 to complete: ret xbad0001, cmd x88, tgt_id x1, lun_id x0
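If you want to check your own hosts for the same signature, a rough sketch like the one below can pull the lpfc abort-handler timeouts out of a copy of the vmkernel log and count them per target and LUN. The regular expression is derived only from the sample lines above, so treat it as an assumption that may need adjusting for other driver versions; the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern built from the sample warnings above (an assumption, not taken
# from any driver documentation). It captures the timestamp, the exchange
# ID (xri), and the target/LUN the aborted I/O was addressed to.
ABORT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\s+\S+\)WARNING: lpfc: "
    r"lpfc_abort_handler:\d+: .*?abort handler timed out .*?"
    r"xri (?P<xri>x[0-9a-f]+) .*?"
    r"tgt_id (?P<tgt>x[0-9a-f]+), lun_id (?P<lun>x[0-9a-f]+)"
)


def find_abort_timeouts(text):
    """Return one dict per lpfc abort-handler timeout found in the log text."""
    return [m.groupdict() for m in ABORT_RE.finditer(text)]


def count_by_target(events):
    """Count timeouts per (tgt_id, lun_id) pair."""
    return Counter((e["tgt"], e["lun"]) for e in events)


# The two lpfc warnings from the log excerpt, plus an unrelated line that
# the pattern should skip.
SAMPLE = (
    "2014-07-22T13:07:17.211Z cpu16:33706)HBX: 2692: Waiting for timed out "
    "[HB state abcdef02] on vol 'SAN_Datastore1'\n"
    "2014-07-22T13:07:21.682Z cpu4:1158818)WARNING: lpfc: lpfc_abort_handler:2989: "
    "1:(0):0748 abort handler timed out waiting for aborting I/O xri x4c5 to "
    "complete: ret xbad0001, cmd x88, tgt_id x1, lun_id x0\n"
    "2014-07-22T13:07:21.683Z cpu24:1271247)WARNING: lpfc: lpfc_abort_handler:2989: "
    "1:(0):0748 abort handler timed out waiting for aborting I/O xri x513 to "
    "complete: ret xbad0001, cmd x88, tgt_id x1, lun_id x0\n"
)

events = find_abort_timeouts(SAMPLE)
for (tgt, lun), n in count_by_target(events).items():
    print(f"target {tgt} lun {lun}: {n} abort timeout(s)")
```

To run it against a real log, read the file into a string and pass that to find_abort_timeouts in place of SAMPLE. Seeing the same target and LUN on several hosts is only a hint that the problem sits in the storage path rather than in any one VM, not proof of it.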
Support recommended we downgrade the LPFC (Emulex OneConnect OCe11100 HBA) driver from 10.2.216.7 to 10.0.727.44. If you’ve followed our saga, you might recall VMware recommended version 10.2.216.7 in PSOD Blues Part 2. Apparently we had been misinformed. Nice!
Meanwhile, one of our vendor’s engineers mentioned he had seen similar anomalies with other customers. There was just something about running the ESXi HP build on Gen 8 hardware that caused frequent PSODs and even inaccessible VMs. His recommendation was to run the “generic” VMware ESXi build, as he had seen no issues after dozens of implementations with Gen 8 servers. All of this seemed to make perfect sense. My co-worker and I had never seen any instability at our previous jobs, and neither of us had run vendor builds. Within a week and a half we had rebuilt every one of our hosts without any downtime (isn’t virtualization great?). The only change I made was to downgrade the bundled HPSA driver from 126.96.36.199 to 188.8.131.52 as described in PSOD Blues Part 1.
It’s been just over a week since rebuilding our hosts and so far things have been stable. I’ll give an update here in a few weeks on how things are going.