Re-arming existing callouts slightly optimized

After general testing, I narrowed down most of the missed-callout problems to my modification of callout_reset_on() in which I remove and reinsert (the old way) callouts that are pending and still on the callout queue.

EDIT: Well, it turns out some interrupts are being missed, not callouts. :-\ Will have to find better ways to test.

Instead of removing and re-inserting the callouts, I’m now using the MINHEAP_KEY_CHANGE() macro to simply modify the key and rearrange the callout queue as appropriate. In the limited time I’ve been testing it, everything seems to be working much better. I still suspect that some callouts are being delivered later than they should, though ktrdump(8) doesn’t show anything obvious. No callouts are being missed, though, and the wifi LED as well as the CPU frequency scaling are no longer getting stuck.
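To illustrate the idea, here is a minimal user-space sketch of changing a key in place in an array-backed min-heap. It is not the actual MINHEAP_KEY_CHANGE() macro: every demo_* name, the struct layouts and the fixed capacity are assumptions made up for this example. Only the concept (each callout remembers its own index, so a re-arm is a sift up or a sift down from that index) comes from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a callout; only the fields the heap needs. */
struct demo_callout {
	int    c_time;     /* expiry tick, used as the heap key */
	size_t selfindex;  /* current position in the heap array */
};

#define DEMO_HEAP_MAX 64

struct demo_heap {
	struct demo_callout *slot[DEMO_HEAP_MAX];
	size_t               len;
};

/* Store c at index i and keep its selfindex in sync. */
static void
demo_place(struct demo_heap *h, size_t i, struct demo_callout *c)
{
	h->slot[i] = c;
	c->selfindex = i;
}

/* Move the entry at index i towards the root while it beats its parent. */
static void
demo_sift_up(struct demo_heap *h, size_t i)
{
	struct demo_callout *c = h->slot[i];

	while (i > 0) {
		size_t parent = (i - 1) / 2;
		if (h->slot[parent]->c_time <= c->c_time)
			break;
		demo_place(h, i, h->slot[parent]);
		i = parent;
	}
	demo_place(h, i, c);
}

/* Move the entry at index i towards the leaves while a child beats it. */
static void
demo_sift_down(struct demo_heap *h, size_t i)
{
	struct demo_callout *c = h->slot[i];

	for (;;) {
		size_t child = 2 * i + 1;
		if (child >= h->len)
			break;
		if (child + 1 < h->len &&
		    h->slot[child + 1]->c_time < h->slot[child]->c_time)
			child++;	/* pick the smaller child */
		if (c->c_time <= h->slot[child]->c_time)
			break;
		demo_place(h, i, h->slot[child]);
		i = child;
	}
	demo_place(h, i, c);
}

static void
demo_insert(struct demo_heap *h, struct demo_callout *c)
{
	assert(h->len < DEMO_HEAP_MAX);
	demo_place(h, h->len, c);
	h->len++;
	demo_sift_up(h, c->selfindex);
}

/* Re-arm a pending callout: change its key in place, then restore order. */
static void
demo_key_change(struct demo_heap *h, struct demo_callout *c, int new_time)
{
	int old_time = c->c_time;

	assert(c->selfindex < h->len && h->slot[c->selfindex] == c);
	c->c_time = new_time;
	if (new_time < old_time)
		demo_sift_up(h, c->selfindex);
	else if (new_time > old_time)
		demo_sift_down(h, c->selfindex);
}
```

Compared with a remove followed by an insert, the key change touches a single root-to-leaf path once, and the callout never leaves the queue in between.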

From the kernel trace dumps I also see that on a newly-booted system, the callout queue is pretty small: only about 65 callouts are pending at any given time. After starting X, a window manager, a browser and a few terminals, the queue grows to around 85. With heavy network transfers going on, it can go up to 120 pending callouts. I’ve not been able to create a situation where the number of pending callouts is higher. It seems like such a waste to pre-allocate about 18000 (!! as my kern.ncallout key says at the moment) when only a couple hundred are going to be used at any given time.

I also re-tested the binary heap implementation with over 1200 callouts in the queue, and it passes all tests, so I don’t expect any more bugs in it.

So barring any significant discoveries or malfunction, I consider this phase of the project completed. Next week I’ll be adding new callout API functions as well as (still) modifying kern_timeout.c to have my own locking scheme (having a queue implementation on which performing any operation leaves it in a consistent state means we can get away with much finer-grained locking). During this phase I’m hoping to implement some of the suggestions that rwatson has: more info on this mailing list post.

To wrap up, here is another example of why I dislike perforce:

/usr/home/pvaibhav/p4/calloutapi/src/sys$ p4 open kern/kern_timeout.c

Path ‘/usr/home/pvaibhav/p4/calloutapi/src/sys/kern/kern_timeout.c’ is not under client view ‘/home/pvaibhav/p4’

/usr/home/pvaibhav/p4/calloutapi/src/sys$ cd ~/p4/calloutapi/src/sys/
~/p4/calloutapi/src/sys$ p4 open kern/kern_timeout.c

//depot/projects/soc2009/calloutapi/src/sys/kern/kern_timeout.c#4 - opened for edit

Lame.

Issues with hardware: LAPIC et al

I did some more “naive” testing: using the system for general and not-so-general tasks, trying to identify any ill effects. The first problem I encountered (which I had been simply enduring for 2+ months) was sluggish system response. Turns out it was because my powerd configuration was set to adaptive instead of hiadaptive. I had always thought the sluggish response was due to bad Intel xorg drivers. Anyway, with that out of the way, and a relatively fast system (graphics as bad as always, however), I resumed testing other parts.

First off I noticed that the periodic blinks of the wireless card LED were sometimes taking longer than usual, and sometimes they were totally stuck in the “on” state. I had never paid attention to this earlier, but with the new kernel and new callout system, I was trying to find every single clue that might point at a problem.

I found something very frustrating: the local APIC, the hardware that delivers the preferred timer interrupts, apparently loses ticks when the processor is in a deep sleep state (C-state >= C2). As if a variable Time Stamp Counter (TSC) losing ticks in slower P-states was not enough trouble on a Pentium M, the LAPIC has to lose ticks too, in C-states.

After setting the lowest C-state to only C1 (hw.acpi.cpu.cx_lowest=C1), and accepting the corresponding increase in CPU temperature in this scorching 43 degree C Indian summer, the wireless LED started behaving.

I also made a few minor changes in softclock():

  1. Reverted to using a cached value of cc_monoticks (our per-CPU monotonically increasing tick counter; I could just use the global variable ticks but I decided to keep the timers per-CPU. This is kinda irrelevant now as this whole infrastructure will be replaced soon). Earlier I had switched to using cc_monoticks directly, despite the fact that it could be increased by hardclock when it calls callout_tick(). The reasoning was that softclock() expires present and past callouts, so even if cc_monoticks had been increased while we were still processing the callouts, any newly-“missed” callouts would still get expired anyway, and this would save us having to reschedule softclock() again on the next HW tick. However, I changed back to using the cached value and allowing reschedules, because not doing so would mean softclock() could potentially run for a long time if it gets behind hardclock, as it keeps “chasing” newly-missed callouts. Since cc_lock is held during this process, things get really nasty. So: idea dropped.

  2. I put back the idea of checking how many callouts softclock() has to examine during one tick, and temporarily unlocking and relocking cc_lock to “give interrupts a chance.” This is made easier for us because, in general, the callout queue cannot be changed in such a way that unlocking/relocking makes resuming normal operation impossible. In other words, even if some other thread inserts or removes a callout while we have dropped cc_lock, the callout queue will still be in a consistent state (because each insertion/removal preserves the heap property). There are no links and pointers to manage: internally the heap uses an array and each callout knows its own “selfindex.” So no harm done. However, the important part is that we must extract the head only after we have reacquired the callout queue lock, which brings me to the last change:

  3. I made a mistake when adding the previous feature: I made softclock() extract the head first, and then drop the callout lock temporarily. This was the wrong order, since during the time the lock was dropped, someone could have inserted a callout which ended up as the head, so the head we had already extracted was no longer the “real” head. So I changed it to do this dropping/relocking business first, and only extract the head after we have reacquired the lock. In other words, only operate on a callout while holding the callout queue lock.
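The ordering from points 2 and 3 can be sketched in user space as follows. This is a toy model, not the kern_timeout.c code: the queue is a small sorted array standing in for the heap, the lock is a plain flag standing in for cc_lock, and every demo_* name (including the "intruder" that plays another thread inserting during the unlocked window) is invented for the example. The loop compares against a cached tick value, as in point 1.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX 16

/* Toy queue: expiry ticks kept sorted, smallest first (stands in for the heap). */
struct demo_queue {
	int    when[DEMO_MAX];
	size_t len;
	int    locked;		/* toy stand-in for cc_lock */
};

static void demo_lock(struct demo_queue *q)   { assert(!q->locked); q->locked = 1; }
static void demo_unlock(struct demo_queue *q) { assert(q->locked);  q->locked = 0; }

static void
demo_insert(struct demo_queue *q, int when)
{
	size_t i;

	assert(q->locked && q->len < DEMO_MAX);
	for (i = q->len; i > 0 && q->when[i - 1] > when; i--)
		q->when[i] = q->when[i - 1];
	q->when[i] = when;
	q->len++;
}

static int
demo_extract_head(struct demo_queue *q)
{
	int head;
	size_t i;

	assert(q->locked && q->len > 0);
	head = q->when[0];
	for (i = 1; i < q->len; i++)
		q->when[i - 1] = q->when[i];
	q->len--;
	return head;
}

/* Runs while the lock is dropped; models whatever another thread does. */
typedef void (*demo_window_fn)(struct demo_queue *);

/*
 * Expire everything due at or before the cached tick `now`, recording the
 * order in out[]. After every `batch` callouts, drop and retake the lock
 * FIRST, and only then look at the head again.
 */
static size_t
demo_softclock(struct demo_queue *q, int now, size_t batch,
    demo_window_fn window, int *out, size_t outmax)
{
	size_t n = 0, examined = 0;

	demo_lock(q);
	while (q->len > 0 && q->when[0] <= now && n < outmax) {
		if (++examined > batch) {
			examined = 0;
			demo_unlock(q);		/* give others a chance */
			if (window != NULL)
				window(q);
			demo_lock(q);
			continue;		/* re-check the head under the lock */
		}
		out[n++] = demo_extract_head(q);
	}
	demo_unlock(q);
	return n;
}

/* Hypothetical intruder: inserts an earlier callout once, during the window. */
static void
demo_intruder(struct demo_queue *q)
{
	static int done;

	if (done)
		return;
	done = 1;
	demo_lock(q);
	demo_insert(q, 1);
	demo_unlock(q);
}
```

With a queue holding 3, 4 and 5, a batch size of 1 and the intruder inserting 1 during the first unlocked window, the expiry order is 3, 1, 4, 5. Had the head been extracted before dropping the lock, 4 would have been handled ahead of the newly inserted 1.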

I’ll continue doing more casual tests, although eventually I have to devise a proper plan to make sure things are working before I can move on and start adding new timer hardware interfaces. I’ve just read up on DTrace and am thinking of ways to use it for testing the callout system (which already implements a couple of DTrace probes). But I’m also planning to continue using ktr(9). As the old saying goes:

Real programmers don’t use DTrace, they use printf(..)

Hehe.

Done: Prototype implementation of binary-heap based callout system

After a couple of hours of debugging and analyzing ktr(9) traces in ddb, I managed to figure out the problem that was causing a kernel panic just 3 hardclock ticks into bootup. I had conveniently forgotten to unlock a mutex in callout_tick(). Well, not exactly forgotten: I had overlooked the case where there were no callouts to expire during the current hardclock tick. The mutex should have been unlocked regardless of whether we are scheduling softclock() to run or not.
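A minimal sketch of the corrected shape, assuming a toy lock flag in place of the real mutex: the tick_* names, the struct fields and the counter standing in for "schedule softclock()" are all made up for illustration and are not the actual callout_tick() code. The point is only that the unlock sits on the common path rather than inside the "something expired" branch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; field names are invented for this sketch. */
struct tick_cpu {
	bool locked;		/* toy stand-in for the callout mutex */
	int  monoticks;		/* per-CPU tick counter */
	int  next_expiry;	/* earliest pending expiry, -1 if queue is empty */
	int  softclock_runs;	/* stands in for scheduling softclock() */
};

/* Locking twice without unlocking trips the assert, like a non-recursive mutex. */
static void tick_lock(struct tick_cpu *cc)   { assert(!cc->locked); cc->locked = true; }
static void tick_unlock(struct tick_cpu *cc) { assert(cc->locked);  cc->locked = false; }

static void
tick_callout_tick(struct tick_cpu *cc)
{
	bool need_softclock;

	tick_lock(cc);
	cc->monoticks++;
	need_softclock = (cc->next_expiry >= 0 &&
	    cc->next_expiry <= cc->monoticks);
	tick_unlock(cc);	/* always unlock, even when nothing expires */

	if (need_softclock)
		cc->softclock_runs++;
}
```

In this model, leaving the unlock inside the `if (need_softclock)` branch would make the very next tick fail the assertion in tick_lock().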

As soon as I fixed this, the kernel booted up well and everything seems to be working so far:

[pvaibhav@matrix:src/sys]$ uptime
 4:58AM  up  2:07, 2 users, load averages: 0.13, 0.32, 0.39

However, my laptop’s wireless LED seems to get stuck every now and then, when it should be blinking depending on traffic. I’m not sure if this is because the signal is weak right now, or whether the callouts needed by the LED taskqueue in the iwi(4) driver are not being serviced properly.

Testing has also only been done on my uniprocessor setup, so it remains to be seen how it performs in an SMP environment. I’ve left the locking semantics largely untouched from the original implementation so there shouldn’t be any problems. I also need to devise a plan to test the whole thing more thoroughly than simply... using the kernel. Perhaps it’ll be interesting to see how (if at all) performance is affected by using the binary heap and not having to loop through all the callouts stored in each “bucket” of the old callout wheel during a hardware interrupt.
