Status at end of GSoC

So GSoC has officially come to an end, it’s time to summarize what has been and has not been achieved during this time.

Two of the goals were achieved, while the third is quite far off right now.

  1. The callout “wheel” was replaced with a binary heap. This forms the basis of the suggested improvements to the callout subsystem. The binary heap was implemented as a set of macros which provide a min-heap as well as a max-heap, for various general-purpose priority queues. More information can be found in the comments in the kernel source file sys/sys/bheap.h and in the previous blog posts.

  2. The new API was implemented from the “client-side”, i.e. the functionality is usable by callout consumers now, instead of the existing API. It was greatly simplified, though there could be some more modifications to make it more useful. The API stands as follows :

  • callout_handle_t — The handle (actually a typedef’s struct) used to identify a single callout (can be reused). The client should malloc a callout_handle_t and zero it out before using it.

  • hwclocktick_t — A semi-opaque hardware-dependent clock tick count. Clients need to use auxiliary functions to convert to/from real time and hardware-dependent time. The hwclocktick_t is limited to nanosecond resolution (because the conversion functions currently use timespec)

  • absolute_time_to_hw_time(const struct timespec* a, hwclocktick_t* t) — This function converts a timespec to a hardware dependent tick count. The timespec structure can hold time in the format of seconds + nanosecond.

  • hw_time_to_absolute_time(const hwclocktick_t* t, struct timespec* a) — This function converts given hardware-dependent ticks into a timespec.

  • time_interval_to_hw_time(int interval, hwclocktick_t* h, enum time_scale ts) — This is probably the function which will be used the most. It converts a given number of real time units into a hwclocktick_t which can be later used to schedule a callout. The time_scale is used to describe what units interval is specified in, and can be one of: TIMESCALE_SECOND, TIMESCALE_MS, TIMESCALE_US (microsecond), TIMESCALE_NS or TIMESCALE_MINUTE.

The above 3 functions can be used to create a hwclocktick_t of the appropriate duration. If the same duration will be used frequently, it makes sense to do the conversion once and cache the resultant hwclocktick_t.

Behind the scenes, the callout subsystem will “ask” the current timer hardware to do the conversion for it. This will use the Kobj kernel object system with a common interface that each timer hardware will need to specify. Currently this is not in a working state yet, so the functions use traditional hz arithmetic to do the conversions. In any case, this is transparent to the callout client.

Once the required hardware-dependent tick count is generated, a new callout can be armed by calling:

callout_arm(callout_handle_t* c, hwclocktick_t when, void (*func)(void*), void* arg, int flags, mtx* mutex) — Pass the allocated callout handle, the required callback time as a hw tick, a pointer to the callback function, an argument to pass to it, some flags (same as old API) and any mutex as required. This combines the functionality of callout_init(), callout_init_mtx(), callout_reset() etc. all in one function. This might be a good thing or a bad thing.

There is no callout_init[_mtx]() anymore. The callout_arm() function will handle initialization. There is no reason to call callout_cancel() on an unarmed callout, so explicitly initializing it is not needed. But if your design does require it, make sure to zero out the callout_handle_t structure after memory for it is acquired. Calls to callout_cancel() with an unarmed callout handle will simply fail. Results of calling callout_cancel() without a zeroed-out, unarmed callout are undefined!

If you frequently need to re-arm callouts, for which only the callout time changes, callout_rearm() can be used as a shorthand. Obviously callout_rearm() should only be called with a callout handle that has been armed before (else results are undefind).

Calling callout_arm() on an already pending callout will also simply reschedule it, so if you need a different callback function or have to pass another argument in addition to rescheduling it, this method will suffice. Note that this is different from the current behaviour where the function and argument must also match for most functions.

There is also callout_[re]arm_within() — this function can be used to arm or re-arm a callout with an imprecise time-out. That is to say, one can specify a time period during which the callout can be serviced. This will allow for optimizations in the future. Currently we take the lazy approach and schedule such callouts to the farthest time specified.

callout_cancel() and callout_drain() remain basically the same. When the underlying implementation will be rewritten, atomic operation of callout_cancel() will be guaranteed - i.e. if the callback function has not already been executed, upon return of callout_cancel(), the callout will be guaranteed to have been stopped. There will not be a grey-zone during which if callout_cancel() is called, the callout may still run. This is not implemented yet!

That is about it for the new API. As mentioned before they are currently simply wrappers around existing implementation, and both implementations currently exist in the kernel (in my experimental branch).

I modified the iwi driver to use the new API as a quick test and demonstration. It works, but we’re not using it in the most efficient way yet. I’ll be checking the driver’s logic in more detail so we can replace most instances of callout_arm() with callout_rearm() instead and avoid internally re-initializing the callout structure.

On the Kobj side, I’ve started writing the interface and implemented a timer_hz class which will source off the traditional periodic timer via hardclock(). The idea is that various timer hardware can implement this timer interface to vend its services, and register with the callback system at boot-time (or even run-time). The user (or system) will then select whichever timer it needs. This part has also not been implemented yet.

So to summarize:

  1. The callout internally uses a binary heap which should be more efficient specially when re-arming existing callouts.

  2. The new API is basically usable, although still works as a wrapper layer over the existing implementation.

  3. Multiple hardware timer support is still in its very early implementation stages, Kobj will be used for this.

  4. Tickless capability is dependent on point 3 being finished.

I plan to continue working on the project so hopefully in the next month or two, #3 and subsequently #4 will be done and we can have a high-resolution, tickless capable callout subsystem in FreeBSD!

Comments (View)

First post!

I decided to maintain a development blog for my Google Summer of Code 2009 project with FreeBSD. The main intent is to document my thought process, successes and failures through this project: mostly as a scratch pad. I hope this will also be useful for those curious to track the progress.

Background information for the project can be found on the FreeBSD wiki. In short, the project’s goal is to redesign the way the FreeBSD kernel handles callouts/timeouts. The changes include a new API for clients, underlying data structure changes for better efficiency, and the most important of all (in my opinion): the possibility of making the kernel tickless by doing away with periodic timer interrupts, and using programmable deadline timers exclusively.

Notes about the current status and implementation details:

  1. A generic binary heap has been implemented, and priority queue based on it now works. Both MIN and MAX heaps are working, although we will only consume the MINHEAP variant. The design is similar to the SLIST, TAILQ and friends found in sys/queue.h. The advantage of using this implementation of a priority queue is that the binary heap, on which this PQ is based, provides O(1) time complexity on an average for most operations including inserting and removing nodes or changing their priority. Although the upper bound time complexity is still O(log n), in general this will only occur occasionally when a new callout is scheduled which must bubble up to the root (ie., have the smallest timeout).

  2. callout_clock() was modified to peek at the first pending callout, and schedule the software clock handler softclock() only if the expiry time has arrived (or is past, if ticks were missed). Peeking at the first item is O(1) worst-case, so this operation is extremely efficient.

  3. softclock(), the software interrupt handler, will extract items in priority-order and if their expiry time has come or is in the past, will run the associated function and remove the callout from the queue. It then checks the next callout in line and repeats this process until the expiry time of the next callout is found to be in the future. At this point, softclock()’s job is done (for now).

  4. Auxiliary functions to reset/stop callouts have been implemented, and they all use the new priority queue.

The current issues to note are:

  1. The kernel doesn’t boot to desktop! Although the build is successful and the kernel can run until the time the first callout is scheduled, the kernel panics with the message mi_switch: switch in a critical section. I am currently trying to debug this, although it’s made difficult by the fact that I can’t obtain a dump of the crash because the harddrives are not even ready when this occurs. I’m looking to use ktr(9), the kernel trace facility to see if we can figure out where the problem is.

  2. The kernel still has to service periodic interrupts. The tickless-work has not yet started — and won’t be until the current implementation works with binary heaps as its queue instead of the call “wheel.” Because of this, we’re still using hardclock ticks as the timing unit, and not the opaque hwclocktick_t that I have proposed in the plan. Once the binary heap priority queue is working, I’ll start to implement the common timer interface (most of this already exists, just have to add functions to get frequency and convert between walltime and hardware time). Only after this can we begin implementing the new functions which work with opaque time values.

  3. The extremely old timeout() and untimeout() calls are still supported, but they are a pain to manage. These functions do not mandate clients to allocate their own callout structures, so the kernel must do it for them. Right now, the existing method of allocating a fixed number of callouts locally, is used. We are manually managing the callouts that have been used and freed, using a linked list. In the future I want to use uma(9) zone allocator which is meant for this kind of work. However, I couldn’t find a way to create a zone if I already have some memory preallocated (although this is mentioned in the man pages!). So this part has been postponed for later.

That’s about it for now. Later I’ll post more info about the heavyweight debug session I’m about to begin. Let’s hope I’m able to track it down!

Comments (View)