The hectic week of the leap second
Thursday, July 5, 2012 2:43:10 PM
- we've been hit by the leap second announcement bug
- others around the internet have been, as well
- ...not to mention those hit by the leap second itself
- Wired mentioned my name in two articles, which started a "citation spree" around the globe
- I've finally followed through on my plan to join Twitter (@brontolinux)
- I've been appointed as one of the CFEngine champions 2012
I'll try to sum up, and conclude with a take-away lesson for the next leap second.
One big misunderstanding in the press worldwide was about how many bugs were involved. There have been at least two distinct problems connected to the leap second: the first one started to happen right after June 30th, 00:00:00 UTC, at the leap second announcement, while the other(s) were caused by the insertion of the leap second itself when implemented as a step back of the clock by one second, and are more like the "classic" case. Let's dissect them.
The announcement bug
What is a leap second announcement? When a leap second is scheduled, and the NTP service is in use to synchronise the system clocks, a leap second announcement is sent to the clients at the first opportunity in the 24 hours preceding the actual insertion. Upon receiving the announcement, ntpd (the reference implementation of the NTP protocol) asks the kernel to arm the leap second. As it turns out, the request is repeated every time the announcement is received again.
There was a bug in the Linux kernel which, it seems, was introduced in version 2.6.26 and patched in the 3.3 or 3.4 series (I couldn't find a definitive version for that). When the kernel is affected and the machine is under heavy load, a leap second announcement can make the kernel hang badly.
This bug was completely unexpected to everyone: to my knowledge, it's the first time that we have been hit by a problem at the announcement (but if you know of similar bugs in the past, please let me know!).
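To make the announcement a bit more concrete: in the NTP packet format, the warning travels in the two-bit Leap Indicator (LI) field at the top of the first header byte, followed by a three-bit version number and a three-bit mode. The snippet below is only a minimal sketch that decodes that first byte; the function name and the sample packet are made up for illustration, and it is not how ntpd itself is written.

```python
# Meaning of the two-bit Leap Indicator (LI) field in the NTP header.
LEAP_INDICATOR = {
    0: "no warning",
    1: "last minute of the day has 61 seconds (insertion)",
    2: "last minute of the day has 59 seconds (deletion)",
    3: "unknown (clock not synchronised)",
}


def parse_first_byte(packet: bytes):
    """Return (leap_indicator, version, mode) from an NTP packet.

    Hypothetical helper: only the first header byte is inspected.
    """
    if not packet:
        raise ValueError("empty packet")
    first = packet[0]
    leap_indicator = first >> 6      # top two bits
    version = (first >> 3) & 0x7     # next three bits
    mode = first & 0x7               # low three bits
    return leap_indicator, version, mode


# Made-up sample: LI=1, version 4, mode 4 (server), padded to 48 bytes.
sample = bytes([0b01_100_100]) + bytes(47)
li, version, mode = parse_first_byte(sample)
print(f"LI={li} ({LEAP_INDICATOR[li]}), version={version}, mode={mode}")
```

A client that sees LI=1 knows that a second will be inserted at the end of the current UTC day, and this is the information that ntpd passes on to the kernel.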
The insertion bug
Other bugs are of the more classic type, and happened on those systems where the leap second is implemented as a step back of one second at midnight UTC, and the applications don't cope with the repeated second between 23:59:59 and 00:00:00. MySQL, Firefox, Thunderbird, Java... were affected, again on Linux, due to another bug that made CPUs spike to 100%. No crash then, but some systems became unresponsive -- not to mention the power consumption in big datacenters, with the CPUs drawing electricity like crazy.
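The "repeated second" is easy to picture with a toy model. The sketch below simulates a wall clock that steps back by one second at a chosen instant and lists the places where two consecutive readings go backwards. All names and numbers are invented for illustration; it does not reproduce the CPU-spike bug, only the clock behaviour that applications have to cope with.

```python
def stepped_wall_clock(true_elapsed: float, leap_at: float) -> float:
    """Reading of a wall clock that steps back one second at leap_at.

    Both arguments are real seconds elapsed since an arbitrary start.
    """
    if true_elapsed < leap_at:
        return true_elapsed
    return true_elapsed - 1.0


def backward_steps(readings):
    """Return the (previous, current) pairs where the clock went backwards."""
    return [(a, b) for a, b in zip(readings, readings[1:]) if b < a]


# Sample the clock every half second; the step happens at t = 2.0.
samples = [stepped_wall_clock(t / 2, leap_at=2.0) for t in range(8)]
print(samples)
print(backward_steps(samples))
```

Code that measures intervals by subtracting two wall-clock readings can get a negative or too-short result across the step. In Python, `time.monotonic()` is the standard-library clock that is guaranteed not to go backwards, which is why it is the safer choice for timeouts.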
A number of other "classic" crashes may have happened, but I've seen no reports about those.
The leap second has usually got little attention in the media, but this time it was different. The reason is pretty simple: a number of well known websites (LinkedIn and Reddit, to mention just a few) were badly impacted by the bug, and that could not go unnoticed. But how did it happen that the media found... me? If this is not the first time you hit my blog, you know that I posted a lot about the leap second recently, but no: that's not enough to make me famous for a week. Let's go back to the morning of June 29th.
Let the game begin
For us at $WORK, everything started on Friday, June 29th, with a crash of our mail system storage. It was totally unrelated to the leap second -- it was a hardware failure -- but it counted as an interesting "distraction".
The storage was repaired early on the 30th, and we started to bring up the systems using it. And there, the mail system didn't want to stay up, showing interesting and sudden crashes, so sudden that kdump could not log the crash.
While a colleague of mine was trying to hunt down the root cause of the crashes, we saw a problem in another datacenter. The cause was identified as another bug, but we couldn't understand what triggered it. We didn't connect it with the leap second announcement, and again that would count as a "distraction" later, when leap-second related problems started to arise.
Around noon (give or take an hour) we started seeing system crashes all across the board: Europe or USA, east or west, north or south, crashes started to happen in a seemingly random way. Someone started to suspect this could be related to the leap second, but I was sceptical. After all, the leap second had not yet happened and we had only had the ntp servers and clients get the announcement. How much harm could it do, right?
Yet it was strange that similar servers with similar OSs, even with different kernel versions, were consistently crashing all over the world. My colleague Bron decided to look further afield, and posted his now famous question on ServerFault, asking if someone else was experiencing bad crashes. In a few hours, he got a lot of good feedback (and later, the post was added on Wikipedia to the bibliography of the entry "Leap second"). Eventually, he mentioned my posts regarding the leap second, and it was probably there that Wired found me.
Anyway, when I got back in touch with my colleagues, it had become quite clear what the cause of the problem was. Some crazy people around the internet were suggesting disabling ntpd and rebooting all machines to disarm the leap second: the more servers to reboot, the crazier the idea. But there my earlier experiments came in handy: I pointed Bron to a script of mine, which he modified and used to disarm the leap second without a reboot. I don't know how many other people used it, but the more people found it useful, the happier I am.
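The script itself is not reproduced here, but the idea behind disarming can be sketched. On Linux, an armed leap second shows up as a flag in the kernel clock status word: STA_INS for an insertion, STA_DEL for a deletion (the values below are the ones defined in linux/timex.h). The snippet only shows the bit manipulation on a made-up status value; the function names are hypothetical, and actually reading and writing the status requires the adjtimex(2) system call and root privileges, which are left out on purpose.

```python
# Clock status flags as defined in <linux/timex.h>.
STA_PLL = 0x0001   # phase-locked loop updates enabled
STA_INS = 0x0010   # insert a leap second at the end of the day
STA_DEL = 0x0020   # delete a leap second at the end of the day


def leap_armed(status: int) -> bool:
    """True if the status word has a leap second insertion or deletion armed."""
    return bool(status & (STA_INS | STA_DEL))


def disarm(status: int) -> int:
    """Return the status word with both leap second flags cleared."""
    return status & ~(STA_INS | STA_DEL)


# Made-up example: kernel discipline active, leap second insertion armed.
status = STA_PLL | STA_INS
print(hex(status), leap_armed(status))
print(hex(disarm(status)), leap_armed(disarm(status)))
```

Keep in mind that, since ntpd repeats its request every time the announcement arrives, clearing the flag once is not enough as long as ntpd keeps receiving announcements and talking to the kernel.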
On July 1st, I noticed my name on Wired, and in other media that took it as a source. That was strange, and exciting. However, there was an inaccuracy, in that the article said that I had foretold the specific issues that had happened in the preceding days, so I wrote an email to Cade Metz to set the record straight. He promptly corrected the article, and asked if he could interview me on the phone for an article they were going to publish the following day. We did the interview, and there I was in the second article; in the same article as Linus Torvalds... I don't believe we are at the same level, honestly (and you can tell for yourself who's the better of the two), but that was incredible again.
Of course, this short spike of fame, and the links in Wired's article, brought more people than usual to my blog, and a number of comments. It was because of one of those comments that I finally decided to get a Twitter account, something that I had been planning for a long time but never did. The feedback was great there, too, with 20 people following me after a very short time.
From July 1st to today, July 5th, I counted some 30 articles around the globe mentioning my name and blog posts in relation to the leap second. I'll try to review those articles before I go back to the "very unknown person" status. (And, by the way, it's too bad that many of those articles are just a copy and paste of another one...)
CFEngine champion appointment
Yesterday, July 4th, it was made official that I was appointed as one of the four CFEngine champions for 2012. I already knew that, but I kept quiet and waited for the news to be published. That was very strange, too. I am on the same list as people like Diego Zamboni, Aleksey Tsalolikhin or Neil Watson (to mention just a few of the previous champions), whose contributions to the community make mine just disappear by comparison. But that was great, and I thank CFEngine AS a lot for that.
Take-away lessons for the next leap second
As it turns out, a lot of the hassle could be avoided if the kernel discipline were disabled in ntpd before the announcement. That will be added to my procedure next time: more than 24 hours before the announcement, restart ntpd with kernel discipline disabled! This, of course, unless someone finally decides to implement "the Mills' technique" in the system. We'll see; we have at least 6 months before the leap second happens again.