The leap second is coming. Is your VMware environment ready?

Another year is almost in the books. Love it or hate it, 2016 has been an interesting year. The Chicago Cubs won their first World Series in 108 years, Britain voted to leave the EU, a real estate developer with a reality show was elected president, and we lost beloved celebrities like Alan Rickman. Nooooo!!

Just like villain extraordinaire Hans Gruber at the end of Die Hard, 2016 is hanging on to every last second. On December 31st at 23:59:59 UTC, an extra second will be added to account for inconsistencies in the Earth’s rotation. How does this impact today’s computers, whose time structures define a day as exactly 86,400 seconds?
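To make the 86,400-second assumption concrete, here is a small Python sketch (not taken from any VMware or SUSE document). It uses only the standard library, and the variable and function names are illustrative. It shows that epoch arithmetic leaves no room for 23:59:60, and that Python's own datetime type refuses to represent it.

```python
import calendar
from datetime import datetime

# Epoch seconds on either side of the 2016 leap second (UTC).
before = calendar.timegm((2016, 12, 31, 23, 59, 59))
after = calendar.timegm((2017, 1, 1, 0, 0, 0))

# Two real seconds pass between these instants, but the epoch
# count only advances by one because every day is assumed to
# contain exactly 86,400 seconds.
gap = after - before

# Plain arithmetic maps 23:59:60 onto the same value as midnight,
# so the leap second has no epoch number of its own.
leap = calendar.timegm((2016, 12, 31, 23, 59, 60))


def can_represent_leap_second():
    """Return True if datetime accepts a seconds value of 60."""
    try:
        datetime(2016, 12, 31, 23, 59, 60)
        return True
    except ValueError:
        return False


print(gap, leap == after, can_represent_leap_second())
```

This is why operating systems have to do something special at the leap second: the timestamp format itself cannot express it.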

In June of 2012, a number of websites, including Reddit, experienced outages because their Linux hosts couldn’t cope with the additional second. Linux has built-in high-resolution timers (known as hrtimers), which the kernel uses for anything that requires precisely timed events, such as drivers. When the leap second was added, the hrtimer subsystem got confused and caused systems to freeze. The issue had been patched in March of that year, but updates often take a long time to reach users’ production environments.

There are two ways modern Linux kernels handle leap seconds. The first method is to simply roll the system clock back one second once the UTC clock strikes midnight. The second method is called slew mode, in which tiny delays are inserted over a period of time so the clock absorbs the extra second gradually. This method is sometimes preferred when it’s acceptable for the system clock to be slightly off for a while. Take a syslog server, for example. When the clock is rolled back, events that occur around midnight can end up with timestamps that are out of order, so time slew might be the more desirable approach.
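The difference between the two approaches can be sketched with a toy model. This is a deliberately simplified simulation, not how the kernel or ntpd is actually implemented. The function names, the 0.5 ms per second slew rate, and the sample event times are all assumptions chosen for illustration.

```python
SLEW_RATE = 0.0005  # assumed slew rate: 0.5 ms of correction per second


def stepped_clock(t, leap_at):
    """Clock reading if the clock is rolled back one second at leap_at."""
    return t if t < leap_at else t - 1.0


def slewed_clock(t, leap_at):
    """Clock reading if the extra second is absorbed gradually."""
    if t < leap_at:
        return t
    # The clock falls back a little each second until one full
    # second has been absorbed, then tracks real time again.
    absorbed = min(1.0, (t - leap_at) * SLEW_RATE)
    return t - absorbed


def is_ordered(stamps):
    """True if the timestamps never go backwards."""
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


# Two log events in real-time order, one just before and one just
# after the leap (times are elapsed seconds in this toy model).
leap_at = 100.0
events = [99.5, 100.2]

stepped = [stepped_clock(t, leap_at) for t in events]
slewed = [slewed_clock(t, leap_at) for t in events]

print("stepped ordered:", is_ordered(stepped))
print("slewed ordered:", is_ordered(slewed))
```

In this model the stepped clock stamps the second event earlier than the first, which is exactly the syslog confusion described above, while the slewed clock keeps them in order at the cost of being slightly wrong for a while.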

How does Windows handle leap seconds? The answer is, it doesn’t really. When a leap second occurs, Windows time will simply be one second ahead of the actual time. It’s not until the next sync with an NTP server that Windows makes the adjustment. All this is fine, as most applications aren’t impacted by the change. Drift happens, and Windows corrects its time as a part of normal operations, so a leap second isn’t usually anything special.

VMware recently published KB 2147498, which lists the VMware products that are unaffected by the leap second as well as the products that could be affected. You’ll notice the affected products are appliances that run on SUSE Linux operating systems. SUSE published document 7016150 to address known issues caused by leap seconds in 2015. A follow-up, document 7017873, was released in 2016; it says SUSE is unaware of any new bugs and points back to document 7016150 for guidance. In most cases, the recommendation is to either apply an update or enable slew mode 24 hours prior to the leap second and keep it enabled for 48 hours afterward. VMware and SUSE recommend enabling slew mode if an update isn’t possible between now and the end of the month. VMware has documented the process for enabling slew mode in KB 2121016.
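For planning purposes, the 24-hours-before and 48-hours-after guidance translates into a concrete UTC window. The sketch below only does the date arithmetic and does not enable anything (follow KB 2121016 for the actual procedure). The function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The leap second is inserted after 23:59:59 UTC on December 31, 2016.
LEAP = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)


def slew_window(leap, before_hours=24, after_hours=48):
    """Return (enable_by, keep_until) for the recommended slew window."""
    enable_by = leap - timedelta(hours=before_hours)
    keep_until = leap + timedelta(hours=after_hours)
    return enable_by, keep_until


enable_by, keep_until = slew_window(LEAP)
print("Enable slew mode by:", enable_by.isoformat())
print("Keep it enabled until:", keep_until.isoformat())
```

Remember to convert these UTC times to your local time zone when scheduling the change.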

The recommendation to enable slew mode 24 hours prior to the leap second injection is because an adjtime() syscall can occur at any time within this window. If this happens, the system’s clock will be corrected and there will be no need for time slew. Enabling slew mode more than 24 hours prior to the leap second injection should be fine; however, don’t enable it any earlier than you absolutely have to. If you enable it too early and your system clock drifts, it can take a long time for the clock to sync back up.

Once the leap second has occurred and ntpd notices that its time is off, slew mode will slow down the virtual frequency of the software clock by about 0.5 ms per second. This will continue until the local system time matches the time of the NTP server. At that rate, making up one full second takes about 2,000 seconds, or approximately 33 minutes. The recommendation to keep slew mode enabled for 48 hours after the leap second injection is for safety. If slew mode is disabled before the local system time has synchronized with the NTP server, the system clock will be adjusted by stepping the time forward or backward.
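The 33-minute figure is simple division, and the same arithmetic gives a feel for why a larger drift takes so long to correct in slew mode. This is a back-of-the-envelope sketch; the function name is made up for illustration, and the 0.5 ms per second rate is the figure quoted above.

```python
def slew_duration_seconds(offset_seconds, slew_rate=0.0005):
    """Time needed to absorb an offset at a fixed slew rate.

    slew_rate is the correction applied per second of real time
    (0.0005 means 0.5 ms per second).
    """
    return offset_seconds / slew_rate


one_second = slew_duration_seconds(1.0)
print("1 s offset:", round(one_second / 60, 1), "minutes")

# A hypothetical 30-second drift at the same rate, for comparison.
thirty_seconds = slew_duration_seconds(30.0)
print("30 s offset:", round(thirty_seconds / 3600, 1), "hours")
```

A single leap second is absorbed in about half an hour, but a clock that has drifted by tens of seconds would need many hours, which is why enabling slew mode too early is discouraged.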

Thankfully, ESX/ESXi versions 3.5 and newer are not impacted because they use the RFC 1589 clock model. This model includes a leap-warning condition that tells dependent processes a second is going to be added or subtracted. There has never been a subtraction since the leap second was introduced in 1972, but the procedure allows for one because of the somewhat unpredictable nature of Earth’s rotation.

It’s tough to say what might happen if the leap second insertion is left unaddressed. The bugs documented by SUSE are possible but not certain to occur. Your appliance could experience the equivalent of an administrator changing the time, or it could hang in a timing deadlock and require a reboot. We can’t say for certain what will happen, and therefore urge everyone to err on the side of caution.

Matt Bradford
