PSOD Blues (when drivers crash) part 1

Hyper-V was an only child in our environment until a few large outages caught the eye of management and end users. A fellow admin and I pitched the perfect solution. Sure, VMware will cost you more, but you are buying rock-solid stability. Besides, neither of us had ever observed an ESXi outage before.

I had never seen the fabled purple screen of death (PSOD) outside of blog posts or Google searches. Maybe it was karma keeping us humble, or maybe it was more.

If you get a purple screen, you’re going to have a bad time!

We asked for it all: distributed switches and host profiles, because we wanted our environment to be consistent. Testing went smoothly and we were ready to start accepting production tenants. The first three hosts went in smoothly and we had VMs in the double digits. We were confident and, more importantly, management was confident. So much so that it was decided to start moving our XenApp environment off of Hyper-V and onto VMware. Our new platform was going to be in the spotlight in front of 10,000 users.

The migration of our Hyper-V PVS image to VMware was challenging, but we eventually got it done. Our XenApp users were now on a new hypervisor and had no idea a change of this scale had even occurred. The migration went perfectly and things were going great until 11:58 am on May 31st, 2014. Sitting in my inbox was an e-mail with the subject “vSphere HA initiated a virtual machine failover action.” Shit.

This was one of our XenApp cluster hosts and the outage affected maybe 40 users. Thankfully it decided to go down on a Saturday. Had it gone down on a weekday, it could have impacted 500 users. The PSOD showed Exception 14 in world blah blah blah. A quick Google search turned up VMware KB1020181, which explained that a page was being requested that hadn’t been loaded into memory. Needing further explanation, I opened a case with VMware.

If you’ve ever opened a case with VMware you know how complicated it can be just to get things going. What’s the difference between my account number and my customer number? Why do I have to provide the order number for my license? You’ll be doing yourself and your fellow admins a huge favor by compiling all that information beforehand rather than fumbling through the customer portal. We use Confluence for our in-house documentation, and all this information is now in there. Aside from that, VMware support is amazing.

The support engineer asked for an export of our system logs from vCenter. This was pretty straightforward, but I found the following steps to be more direct and useful for my own purposes.

Start the SSH service on your host. You don’t normally keep this service running, do you?
Open an SSH session and log on to the host.
Change directory to /var/core.
Execute the following to extract the relevant information from the VMkernel dump:

# esxcfg-dumppart -L vmkernel-zdumpfilename.1
Created file vmkernel-log.1

Open WinSCP and download vmkernel-log.1 to your local system.
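If /var/core holds more than one dump, it helps to pick the most recent one before running the extract. The following is a minimal sketch, assuming the dump files share the `vmkernel-zdump` prefix used in the command above; `newest_zdump` is a hypothetical helper name, not a VMware tool.

```shell
#!/bin/sh
# newest_zdump: hypothetical helper, not part of ESXi.
# Prints the most recently modified file whose name starts with
# "vmkernel-zdump" in the given directory (default: /var/core).
newest_zdump() {
    dir="${1:-/var/core}"
    # ls -t lists newest first; errors are hidden when nothing matches,
    # in which case the function prints nothing.
    ls -t "$dir"/vmkernel-zdump* 2>/dev/null | head -n 1
}
```

On the host, the result could then be passed to the extract command, for example `esxcfg-dumppart -L "$(newest_zdump)"`.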

This log file showed the following:

2014-05-31T15:57:32.579Z cpu14:9607609)@BlueScreen: #PF Exception 14 in world 9607609:vmklinux_9:h IP 0x418018b7e594 addr 0x0 PTEs:0x206ba74023;0x12f727023;0x12f728023;0x0;
2014-05-31T15:57:32.579Z cpu14:9607609)Code start: 0x418018200000 VMK uptime: 29:20:14:10.166
2014-05-31T15:57:32.579Z cpu14:9607609)0x412426e5dda0:[0x418018b7e594]hpsa_update_scsi_devices@#+0x39c stack: 0x410b0c037060
2014-05-31T15:57:32.580Z cpu14:9607609)0x412426e5de20:[0x418018b7f28f]hpsa_scan_start@#+0x187 stack: 0x412426e5de60
2014-05-31T15:57:32.580Z cpu14:9607609)0x412426e5de90:[0x418018b807af]hpsa_kickoff_rescan@#+0x20f stack: 0x0
2014-05-31T15:57:32.580Z cpu14:9607609)0x412426e5df30:[0x41801890175d]kthread@com.vmware.driverAPI#9.2+0x185 stack: 0x0
2014-05-31T15:57:32.581Z cpu14:9607609)0x412426e5df80:[0x4180188fee5b]LinuxStartFunc@com.vmware.driverAPI#9.2+0x97 stack: 0x100d
2014-05-31T15:57:32.581Z cpu14:9607609)0x412426e5dfd0:[0x4180182bb14f]vmkWorldFunc@vmkernel#nover+0x83 stack: 0x0
2014-05-31T15:57:32.581Z cpu14:9607609)0x412426e5dff0:[0x418018453532]CpuSched_StartWorld@vmkernel#nover+0xfa stack: 0x0
2014-05-31T15:57:32.584Z cpu14:9607609)base fs=0x0 gs=0x418043800000 Kgs=0x0
2014-05-31T15:57:32.584Z cpu14:9607609)vmkernel 0x0 .data 0x0 .bss 0x0
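The two most useful pieces of this log are the `@BlueScreen` line and the top frame of the backtrace, which names the function that faulted (here `hpsa_update_scsi_devices`, pointing at the hpsa driver). A rough sketch of pulling just those two lines out, assuming the log format shown above; `psod_summary` is a made-up name for illustration.

```shell
#!/bin/sh
# psod_summary: hypothetical helper, not a VMware tool.
# Line 1 of the output: the exception message from the @BlueScreen line.
# Line 2 of the output: the symbol in the first backtrace frame.
psod_summary() {
    log="$1"
    # The exception line is tagged with "@BlueScreen:".
    grep -m 1 '@BlueScreen:' "$log" | sed 's/^.*@BlueScreen: //'
    # Backtrace frames look like:
    #   ...)0x<stack addr>:[0x<instruction addr>]symbol@module+0xoff stack: ...
    grep -m 1 -E '\)0x[0-9a-f]+:\[0x[0-9a-f]+\]' "$log" \
        | sed -E 's/^.*\]([^ ]+) stack:.*$/\1/'
}
```

On the host this would be run as `psod_summary /var/core/vmkernel-log.1`.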

VMware support verified that our PSOD was caused by the HPSA driver, as documented in KB2075978, which references HP Advisory c04302261. HPSA version 5.5.0.58 is known to cause PSODs under ESXi 5.5. A quick check verified that we were running the version in question.

# cat /proc/scsi/hpsa/2
HP HPSA Driver (v 5.5.0.58-1OEM)
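With more than a handful of hosts, the same check is easier to script. Here is a small sketch, assuming the banner format shown above; `hpsa_version` and `is_affected` are hypothetical helper names, and the only version flagged is 5.5.0.58 because that is the one identified in this post.

```shell
#!/bin/sh
# hpsa_version: hypothetical helper. Reads the driver banner on stdin, e.g.
#   HP HPSA Driver (v 5.5.0.58-1OEM)
# and prints just the numeric version (5.5.0.58).
hpsa_version() {
    sed -n 's/^HP HPSA Driver (v \([0-9.]*\).*$/\1/p'
}

# is_affected: succeeds (exit 0) when the version is the one this post
# found to be bad. Extend the list if the advisory names other versions.
is_affected() {
    case "$1" in
        5.5.0.58) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a host it might be used as `is_affected "$(hpsa_version < /proc/scsi/hpsa/2)" && echo "affected driver"`. The `2` in the path is the controller number from this post and may differ on other hosts.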

At this point 5.5.0.60 hadn’t been released yet so the only solution was to roll back to 5.5.0.50.

To install the async driver:

Browse to the local datastore on the host and upload the offline-bundle.zip. In this case we used hpsa-5.5.0-offline_bundle-1287942.zip.
Place the host in maintenance mode
Enable SSH on the host and open an SSH session as root
Verify the current version of the driver to be replaced as documented above
Change to the root of the datastore

# cd /vmfs/volumes/localdatastore/

Copy the offline bundle to the /var/log/vmware folder

# cp hpsa-5.5.0-offline_bundle-1287942.zip /var/log/vmware

Install the drivers

# esxcli software vib install -d /var/log/vmware/hpsa-5.5.0-offline_bundle-1287942.zip
Installation Result
Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
Reboot Required: true
VIBs Installed: Hewlett-Packard_bootbank_scsi-hpsa_5.5.0.50-1OEM.550.0.0.1198611
VIBs Removed: Hewlett-Packard_bootbank_scsi-hpsa_5.5.0.58-1OEM.550.0.0.1331820
VIBs Skipped:

Reboot the host

# reboot -f

After the reboot, I verified that the driver was properly loaded:

# cat /proc/scsi/hpsa/2
HP HPSA Driver (v 5.5.0.50-1OEM)
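When repeating this rollback across several hosts, the install result printed by esxcli can be checked by script rather than by eye. This is a sketch only, assuming the output was saved to a file and has the `Field: value` layout shown in the install step; `vib_result_field` is a hypothetical helper name.

```shell
#!/bin/sh
# vib_result_field: hypothetical helper, not part of esxcli.
# Reads saved "esxcli software vib install" output on stdin and prints
# the value of the named field, e.g. "Reboot Required" -> "true".
vib_result_field() {
    field="$1"
    # Leading whitespace is tolerated in case the real output is indented.
    sed -n "s/^[[:space:]]*${field}:[[:space:]]*//p"
}
```

For example, `vib_result_field "VIBs Installed" < install.out` should show the 5.5.0.50 VIB, and `vib_result_field "Reboot Required" < install.out` should print `true` before the reboot. The file name `install.out` is only a placeholder.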

Although 5.5.0.60 has since been released, we’ve decided to stay on 5.5.0.50 to be safe, and it has treated us well thus far.


July 22, 2014
Alfonso Casimiro

Dear Matt,

First of all, thanks for publishing this info. It was really helpful, and I can now consider the PSOD on a new deployment with VMware 5.5 and HP hardware solved.

FYI, there is a typo in the esxcfg-dumppart command in your post (a “p” is missing). In my case I upgraded to version 5.5.0.60-1OEM, and I hope this will fix the problem. I will let you know if I find more issues with the new driver.
  
  July 22, 2014
  Matt Bradford
  
  Hi Alfonso,
  
  Thanks for sharing, I’m glad this helped you out and please keep us updated on how 5.5.0.60 is working out for you. We’re still on 5.5.0.50 and it’s been stable thus far. Thank you for the correction, the article has been updated.
  
  Cheers!

August 21, 2014
Alfonso Casimiro

Hi Matt,

I have had the servers running with the new driver for one month. These servers perform intensive I/O tasks (a BI solution is installed in the VMs), and there have been no issues at all since I installed the 5.5.0.60 driver.

And again, thank you! 🙂
  
  August 22, 2014
  Matt Bradford
  
  That is excellent news! Thank you for the update.

Matt Bradford
