Debugging NTP again (part 4 and last)
Wednesday, August 18, 2010 9:19:38 AM
The problem appears to be caused by a kernel bug. I don't know whether it has been fixed in more recent kernels or patches. In any case, you can work around the bug by adding disable kernel to your ntp.conf. In addition, you should set xen.independent_wallclock to zero and set your clocksource to xen.
Sure, this would not have been possible without the help of other people, so let me tell the full story and acknowledge those who deserve it.
I sent an email to the SAGE list on Monday. I got no answer.
Yesterday I noticed on the support.ntp.org website that there is an #ntp support channel on freenode. I went there and asked the question again. After a few hours, someone proposed a workaround: setting independent_wallclock to 1 on dom0 and on all domUs. This kind of worked (see the aggregate graph from around second 40000 to around second 55000), but it was not what I wanted: domUs should sync with dom0, and dom0 should sync with NTP; that's just how it should be.
Luckily, another guy popped up, and gave more useful suggestions. An excerpt of our conversation follows:
(16:55:22) mlichvar: bronto: i think i have seen the same problem some time ago
(16:55:48) bronto@freenode: mlichvar: good... erm... sort of ;) How did you manage to solve it?
(16:56:26) mlichvar: bronto: i was just helping one guy and he didn't solve it :)
(16:56:45) mlichvar: bronto: it looked like broken PLL in the kernel
(16:57:47) bronto@freenode: mlichvar: erm... what's a PLL? :(
(16:59:44) mlichvar: the thing that adjusts offset and frequency
(17:00:14) mlichvar: in the offset plot it looked like the time constant was too short
(17:07:12) mlichvar: there is one easy thing you could try first
(17:07:52) mlichvar: disabling kernel discipline by adding "disable kernel" to ntp.conf
(17:08:04) mlichvar: if that works, it's definitely a kernel bug
(17:14:26) ***bronto@freenode adding disable kernel to ntp.conf on the Xen servers
This happens somewhere around second 55000. As you can see, the two clocks step again and then start converging towards zero, albeit with a different pattern than the "stable" Xen servers. And if you look at today's graph, the offset is now small enough that you can see the tiny "hiccups" on all the plots.
And, as mlichvar says, we have a kernel bug here. Setting independent_wallclock to 1 everywhere somehow sidesteps the bug, but it doesn't solve the problem; it merely works around it. The correct solution, that is, the one that configures ntpd as it should be, is the one listed at the top.
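
To double-check that a Xen host actually has the three settings listed at the top, something like the following Python sketch might help. It is only an illustration, not a tested tool: the file locations (/etc/ntp.conf, /proc/sys/xen/independent_wallclock and the clocksource0 sysfs entry) are assumptions about a typical Xen-patched Linux system and may differ on yours, and the function names are made up for this example. It also ignores the case where a later "enable kernel" line overrides the "disable kernel" one.

```python
from pathlib import Path

# Assumed locations; adjust them for your distribution and kernel.
NTP_CONF = Path("/etc/ntp.conf")
WALLCLOCK = Path("/proc/sys/xen/independent_wallclock")
CLOCKSOURCE = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def kernel_discipline_disabled(conf_text: str) -> bool:
    """Return True if a 'disable' line in ntp.conf lists the 'kernel' flag."""
    for line in conf_text.splitlines():
        # Drop comments, then split the remaining text into words.
        words = line.split("#", 1)[0].split()
        if words and words[0] == "disable" and "kernel" in words[1:]:
            return True
    return False


def read_value(path: Path):
    """Return the stripped content of a file, or None if it is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def check_settings() -> dict:
    """Map each recommended setting to True (in place) or False."""
    conf = read_value(NTP_CONF) or ""
    return {
        "disable kernel in ntp.conf": kernel_discipline_disabled(conf),
        "xen.independent_wallclock is 0": read_value(WALLCLOCK) == "0",
        "clocksource is xen": read_value(CLOCKSOURCE) == "xen",
    }


if __name__ == "__main__":
    for name, ok in check_settings().items():
        print(f"{'OK ' if ok else 'BAD'} {name}")
```

Run it on dom0 and on each domU; any line marked BAD points at a setting that still differs from the configuration described above.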