Kernel development [LWN.net]

Kernel release status

The current 2.6 development kernel is 2.6.28-rc6, released by Linus on November 20, just before he fled town for a scuba diving trip. (At least one assumes he fled town; it is not the best season for ocean sports in Portland.) It includes a number of fixes, including one for the high-profile vmalloc() regression. The long-format changelog has the details.

The current stable 2.6 kernel is 2.6.27.7, also released on November 20. It includes a fair number of fixes, including one with a CVE number attached.


Quotes of the week

+/*
+ * "Define 'is'", Bill Clinton
+ * "Define 'if'", Steven Rostedt
+ */
+#define if(cond) if (__builtin_constant_p((cond)) ? !!(cond) :		\
+	({								\
+		int ______r;						\
+		static struct ftrace_branch_data			\
+			__attribute__((__aligned__(4)))			\
+			__attribute__((section("_ftrace_branch")))	\
+			______f = {					\
+				.func = __func__,			\
+				.file = __FILE__,			\
+				.line = __LINE__,			\
+			};						\
+		______r = !!(cond);					\
+		if (______r)						\
+			______f.hit++;					\
+		else							\
+			______f.miss++;					\
+		______r;						\
+	}))

-- Steven Rostedt debuts the new "if"

Working on lkml often sounds like everyone is screaming NO, channeling nothing but stop energy. Sometimes people are, but more often what they really mean is you just have to take your time and do things right. Admittedly it is a lot of iteration, but Linux is a noble pursuit.

-- Robert Love

But let's look at the problem which we're actually trying to solve. Developer A wishes to write some kernel monitoring/controlling code, so he is forced to stick it on his website, keep reminding people to download updates, act as an independent target of other people's patches, etc, etc. It's all a pain and horror, so developer A gives up and implements his userspace code in the kernel instead. It is, as a result, technically inferior and English-only, but at least it got there.

-- Andrew Morton


Ksplice and kreplace

By Jonathan Corbet
November 24, 2008

Rebooting a system to apply a security update is a pain. In some situations, it's more than a pain; for various reasons, many systems cannot be taken down at all without compromising the work they are supposed to be doing. Back in April, LWN looked at Ksplice, a mechanism designed to enable the installation of kernel updates without the need to reboot the system. Since then, work has continued on Ksplice, a new version has been posted, and the project is starting to push toward mainline inclusion. So another look is called for.

The core idea behind Ksplice remains the same: when given a source tree and a patch, it builds the kernel both with and without the patch and looks at the differences. To that end, the compilation procedure is modified to put every function and data structure into its own executable section. That makes life a little harder for the compiler and the linker, but developers are notably insensitive to the difficulties faced by those tools. With things split up this way, it is relatively easy to identify a minimal set of changes in the binary kernel image which result from the patch. Ksplice can then, with some care, patch the new code into the running kernel. Once this work is done, the old kernel is running the new code without ever having been rebooted.

This technique works well for code changes, but different challenges come with changes to data structures. Back in April, Ksplice could not handle that kind of change. Even so, the project's developers claimed to be able to apply the bulk of the kernel's security updates using Ksplice. Since then, though, the developers have applied some energy to this problem. With the addition of a couple of new techniques - which require extra effort on the part of the person preparing the patch for Ksplice - it is now possible to apply 100% of the 65 non-DoS security patches released for the kernel since 2005.

In some cases, a kernel patch will simply require that a data structure be initialized differently. The way to handle this change in an update through Ksplice is to modify the relevant data structures on the fly. To effect such changes, a patch can be modified to include code like the following:

    #include <ksplice-patch.h>

    ksplice_apply(void (*func)());

While Ksplice is applying the changes - and while the rest of the system is still stopped - the given func will be called. It can then go rooting through the kernel's data structures, changing things as needed. For example, CVE-2008-0007 came about as a result of a failure by some drivers to set the VM_DONTEXPAND flag on certain vm_area_struct structures. Ksplice is able to apply the fix to the drivers without trouble, but that is not helpful for any incorrectly-initialized VMAs present on the running system. So the modifications to the patch add some functions which set VM_DONTEXPAND on existing VMAs, then use ksplice_apply() to cause those functions to be executed. The result is a fully-fixed system.
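As a rough illustration of the kind of fixup function described for CVE-2008-0007, the following user-space sketch walks a list of already-existing VMA-like records and sets the missing flag. The structure `fake_vma`, the flag values, and `fixup_existing_vmas()` are stand-ins invented for this example; they are not the kernel's definitions, and this is not the code from the actual Ksplice patch. A real hook passed to ksplice_apply() takes no arguments and would have to locate the affected objects itself.

```c
#include <stddef.h>

/* Illustrative flag values only; not the kernel's definitions. */
#define FAKE_VM_IO         0x1UL
#define FAKE_VM_DONTEXPAND 0x2UL

/* Hypothetical stand-in for struct vm_area_struct. */
struct fake_vma {
	unsigned long flags;
	int owned_by_fixed_driver;  /* nonzero if the patched driver made it */
	struct fake_vma *next;
};

/*
 * The sort of work a ksplice_apply() hook might do: run once, while the
 * rest of the system is stopped, to repair objects that were set up by
 * the old, buggy code. Returns the number of records changed.
 */
int fixup_existing_vmas(struct fake_vma *head)
{
	int fixed = 0;

	for (struct fake_vma *v = head; v != NULL; v = v->next) {
		if (v->owned_by_fixed_driver &&
		    !(v->flags & FAKE_VM_DONTEXPAND)) {
			v->flags |= FAKE_VM_DONTEXPAND;
			fixed++;
		}
	}
	return fixed;
}
```

Running the function a second time over the same list changes nothing, which is the property one would want from a fixup that might be applied to a system in an unknown state.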

Changes to data structure definitions are harder. If a structure field is removed, the Ksplice version of the patch can just leave it in place. But the addition of a new field requires more complicated measures. Simply replacing the allocated structures on the fly seems impractical; finding and fixing all pointers to those structures would be difficult at best. So something else is needed.

For Ksplice, that something else is a "shadow" mechanism which allocates a separate structure to hold the new fields. Using shadow structures is a fair amount of additional work; the original patch must be changed in a number of places. Code which allocates the affected structure must be modified to allocate the shadow as well, and code which frees the structure must be changed in similar ways. Any reference to the new field(s) must, instead, look up the shadow structure and use that version of the field. All told, it looks like a tiresome procedure which has a significant chance of introducing new bugs. There is also the potential for performance issues caused by the linear linked list search performed to find the shadow structures. The good news is that it is only rarely necessary to modify a patch in this way.
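The shadow idea can be sketched in user space as a side table keyed by the address of the original object. Everything here (`struct widget`, `struct widget_shadow`, and the `shadow_*()` helpers) is hypothetical and chosen for illustration; it is not Ksplice's interface. The lookup is a linear list search, which is where the performance concern mentioned above comes from.

```c
#include <stdlib.h>

/* Hypothetical original structure; its layout cannot change at run time. */
struct widget {
	int id;
};

/* Holds the field(s) the patch wanted to add to struct widget. */
struct widget_shadow {
	const struct widget *owner;  /* key: address of the original object */
	int new_field;
	struct widget_shadow *next;
};

static struct widget_shadow *shadow_list;

/* Called wherever the patched code allocates a widget. */
struct widget_shadow *shadow_attach(const struct widget *w)
{
	struct widget_shadow *s = calloc(1, sizeof(*s));

	if (s == NULL)
		return NULL;
	s->owner = w;
	s->next = shadow_list;
	shadow_list = s;
	return s;
}

/* Linear search; returns NULL for objects that have no shadow. */
struct widget_shadow *shadow_lookup(const struct widget *w)
{
	for (struct widget_shadow *s = shadow_list; s != NULL; s = s->next)
		if (s->owner == w)
			return s;
	return NULL;
}

/* Called wherever the patched code frees a widget. */
void shadow_detach(const struct widget *w)
{
	for (struct widget_shadow **pp = &shadow_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->owner == w) {
			struct widget_shadow *dead = *pp;

			*pp = dead->next;
			free(dead);
			return;
		}
	}
}
```

Objects allocated before the update was applied have no shadow, so every user of the new field has to cope with `shadow_lookup()` returning NULL.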

The Ksplice developers do not appear to be done yet; from the latest patch posting:

We're currently working on the problem of making it feasible to apply the entire stable tree using Ksplice. Although Ksplice's original evaluation focused on patches for CVEs, we understand the idea that "security bugs are just 'normal bugs'" (i.e., tracking security bugs separately from normal bugs can be difficult and isn't necessarily advisable). We ultimately want to provide to long-running machines hot updates for all of the bug fixes that go into the corresponding stable tree.

This is an ambitious goal; a single stable series can add up to hundreds of changes, some of which can be reasonably large. It will be interesting to see how many users are really interested in this particular sort of update; sites running critical systems tend to have older "enterprise" kernels which are no longer receiving stable tree updates. But a Ksplice which is flexible enough to handle that kind of update stream should also be useful for distributors wanting to provide no-reboot patches to their customers.

Meanwhile, Nikanth Karthikesan has posted a facility called kreplace. On the surface, it looks similar to Ksplice, but the goal is a little different: its purpose is to allow a developer to quickly try out a change on a running kernel. Kreplace works by simply patching out and replacing one or more functions in the kernel. Kreplace may have its value, but the initial reaction has not been greatly enthusiastic. Among other things, it has been pointed out that Ksplice also has a facility to allow for quick experimentation with changes - though it will be quick only if the developer is already set up to use Ksplice with the running kernel.

A final concern with either of these solutions is that they are, for all practical purposes, employing rootkit techniques. A mechanism which can be used by distributors to patch running systems can also be (mis)used by others. Vendors of binary-only modules could, for example, use Ksplice or kreplace to get around GPL-only exports and other inconvenient features of contemporary kernels. Crackers could also use it, of course, but they already have their own rootkit tools and gain no real benefit from an officially-supported runtime patching mechanism. Whether this aspect of Ksplice is of concern to the development community may be seen in the coming months as this code gets closer to mainline inclusion.


Character devices in user space

By Jake Edge
November 25, 2008

There is a lot of functionality (things like filesystems and device drivers) that is normally considered to be a kernel task but has, over time, been allowed to move into user space. The UIO user space driver framework came along in 2.6.23, while filesystems in user space (FUSE) have been around since 2.6.14. Tejun Heo would like to see this idea broadened even further with the character devices in user space (CUSE) patches.

At first blush, the uses for a character device implemented in user space are not obvious. Looking a bit deeper, though, one finds numerous programs, both open and closed source, that rely on legacy character drivers. Those drivers are currently in the kernel, but would not need to be if there were a way to implement them in user space. In addition, older, deprecated interfaces, such as Open Sound System (OSS), can be better supported without constantly fiddling with the in-kernel emulation.

Providing better OSS support is one of the prime motivators for CUSE, as Heo announced in a linux-kernel posting introducing the OSS proxy. The proxy uses CUSE to implement the /dev/dsp, /dev/adsp, and /dev/mixer devices that programs using OSS expect. Adrian Bunk didn't necessarily see this as a good thing:

Sorry for being destructive, but 6 years after ALSA went into the kernel we are slightly approaching the point where all applications support ALSA.

The application you list on your webpage is UML host sound support, and I'm wondering why you don't fix that instead of working on a better OSS emulation?

But Heo sees the current state of OSS emulation as a rather complicated mess that, for better or worse, needs cleaning up:

We now have in-kernel OSS emulation which can't mux with other streams, aoss [ALSA OSS emulation] with its own supported and broken list and can also be routed through PA [PulseAudio] by configuring ALSA right and then padsp [PA OSS emulation] with its own supported and broken list and nothing works good enough. So, if we have one thing which just works, we can in time put all those to rest.

But there are other uses for CUSE too. Greg Kroah-Hartman notes that legacy software for talking to Palm Pilots, much of which is binary-only, expects to talk to a /dev/pilot serial port. The kernel carries around a driver, but "a libusb userspace program can handle all of the data to the USB device instead". So CUSE could be used to eventually remove another crufty driver from the kernel, while still maintaining compatibility with old user space code.

CUSE is implemented on top of FUSE, as there is a fair amount of overlap between them. Character devices and filesystems implement many of the same file operations (things like open(), close(), read(), and write()), which makes them a good match. Heo has a separate patchset for FUSE that implements additional operations for filesystems, some of which will be used by CUSE.

The additional FUSE operations include an implementation of ioctl() that is necessarily rather ugly. Because an ioctl implementation can access memory in unpredictable ways—and those data structures can be arbitrarily deep—there needs to be a mechanism for user-space CUSE devices to read and write that memory. The CUSE server does not have direct access to the caller's memory, so a multi-step ioctl() with retries must be implemented. This particular bit of ugliness is only allowed for in-kernel use, so that CUSE (or other things like it) can allow "unrestricted" ioctl() implementations. All FUSE filesystems are still required to have "restricted" ioctls where the kernel can determine the direction and amount of data that is transferred. poll() support has also been added to FUSE, which, in turn, requires a separate patch that allows poll() callbacks to sleep (described in this article).
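To make the retry idea concrete, here is a small user-space simulation. The article does not describe the real FUSE message format, so the types and functions below (`sum_request`, `copied_range`, `server_ioctl()`, `do_unrestricted_ioctl()`) are assumptions made up for this sketch. The "server" may only read bytes that the "kernel" has copied for it; when it finds a pointer it needs to follow, it names the range and asks to be called again.

```c
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 4
#define MAX_BYTES  64

/* What the caller passes to ioctl(): a structure containing a pointer. */
struct sum_request {
	const unsigned char *data;
	size_t len;
};

/* One range of caller memory that the "kernel" copied for the server. */
struct copied_range {
	const void *addr;               /* address in the caller's memory */
	size_t len;
	unsigned char bytes[MAX_BYTES]; /* private copy given to the server */
};

enum reply { REPLY_DONE, REPLY_RETRY };

/* Server side: reads only in[].bytes; asks for more memory via *want_*. */
enum reply server_ioctl(const void *arg, const struct copied_range *in,
			int nin, const void **want_addr, size_t *want_len,
			unsigned int *result)
{
	struct sum_request req;

	if (nin == 0) {                 /* step 1: need the request header */
		*want_addr = arg;
		*want_len = sizeof(req);
		return REPLY_RETRY;
	}
	memcpy(&req, in[0].bytes, sizeof(req));
	if (nin == 1) {                 /* step 2: follow embedded pointer */
		*want_addr = req.data;
		*want_len = req.len;
		return REPLY_RETRY;
	}
	*result = 0;                    /* step 3: everything is present */
	for (size_t i = 0; i < in[1].len; i++)
		*result += in[1].bytes[i];
	return REPLY_DONE;
}

/* "Kernel" side: copy what is asked for, then call the server again. */
int do_unrestricted_ioctl(const void *arg, unsigned int *result)
{
	struct copied_range in[MAX_RANGES];
	int nin = 0;

	for (;;) {
		const void *addr;
		size_t len;

		if (server_ioctl(arg, in, nin, &addr, &len, result) == REPLY_DONE)
			return nin;     /* number of retries that were needed */
		if (nin == MAX_RANGES || len > MAX_BYTES)
			return -1;      /* refuse unreasonable requests */
		in[nin].addr = addr;
		in[nin].len = len;
		memcpy(in[nin].bytes, addr, len); /* models copy_from_user() */
		nin++;
	}
}
```

The server cannot know in advance how deep the caller's data structures go, so the loop keeps going until the server stops asking; the caps on range count and size stand in for whatever limits a real implementation would impose.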

Once the FUSE changes are in place, the actual implementation of CUSE is relatively small, weighing in at around 1000 lines plus some housekeeping to rename and export FUSE symbols. At its core, it pairs a FUSE-mounted filesystem, which connects to the device implemented in user space, with the kernel-exported character device, binding the two together. FUSE handles the interaction with the user-space code, in the same way that it does for a filesystem.

CUSE creates a device for commands, /dev/cuse, which is opened by a program that wants to implement a particular character device. CUSE queries the opener to determine which device it is implementing and then creates the device node. For most operations, CUSE just hands off to FUSE, but for open() it, instead, opens a file from the FUSE mount, storing the file handle for use by later operations.

In many ways, CUSE is a kind of impedance matching layer that creates something that acts like a character device, but has no hardware directly behind it. This allows CUSE to ignore things like hardware interrupts; those would need to be handled by something else, typically a downstream driver—the soundcard driver in the OSS proxy case. This is one of the big differences between UIO and CUSE. UIO is much more like a regular kernel device driver that requires kernel code to handle interrupts. CUSE drivers, on the other hand, can be created without ever touching kernel space.

The only objection so far seems to be Bunk's complaint about supporting OSS when it has been deprecated for so long. As Heo points out, though, there are still many applications that only support OSS. In addition, all of the code that has been submitted is "way smaller than the in-kernel ALSA OSS emulation which is somewhat painful to use these days", Heo says. Since there are other potential users of CUSE, not just the OSS proxy, it would seem that, absent any major objections, CUSE could make it into 2.6.29.


Driver API: sleeping poll(), exclusive I/O memory, and DMA API debugging

By Jonathan Corbet
November 24, 2008

There are currently a number of proposed driver API changes being discussed on the lists. None of them are major, but they are worth being aware of.

poll()

Most of the functions in the file_operations structure are concerned with I/O. So it is not surprising that these functions are allowed to sleep. Except that, as it turns out, one of them - poll() - cannot. There is nothing inherent in the poll() or select() system calls which would require the driver poll() callback to be nonblocking; this requirement is, instead, a result of the implementation. In essence, the core poll() implementation looks like this:

    for (;;)
        set_current_state(TASK_INTERRUPTIBLE)
        for each fd to poll
            ask driver if I/O can happen
            add current process to driver wait queue
        if one or more fds are ready
            break
        schedule_timeout_range(...)
The problem is relatively straightforward: if a specific driver chooses to sleep in its poll() callback, the current task state will get set back to TASK_RUNNING and schedule_timeout_range() will return immediately. So a sleeping driver turns the main loop into a busy-wait.

The solution, as developed by Tejun Heo, is also straightforward. His patch causes sys_poll() to define a custom wakeup function which, in turn, sets a new triggered flag when called. That eliminates the need to put the process into TASK_INTERRUPTIBLE for the duration of the main loop; that can be done, instead, right before actually sleeping.
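The difference between the two schemes can be seen in a toy user-space model. Nothing below is kernel code: `struct sim`, `sleeping_driver_poll()`, `sim_schedule()`, `poll_old()`, and `poll_new()` are invented for this sketch, and the model assumes a single device that becomes ready either after a real sleep or after a fixed number of wasted loop passes.

```c
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct sim {
	enum task_state state;  /* state of the polling task */
	int ticks_until_ready;  /* loop passes until the device has data */
	int triggered;          /* set by the custom wakeup function */
	int busy_spins;         /* schedule calls that returned at once */
	int real_sleeps;        /* schedule calls that actually slept */
};

/* A driver poll() callback that sleeps internally (on a mutex, say). */
static int sleeping_driver_poll(struct sim *s)
{
	s->state = TASK_RUNNING;        /* side effect of having slept */
	return s->ticks_until_ready == 0;
}

/* Stand-in for schedule_timeout_range(): sleeps only if state allows. */
static void sim_schedule(struct sim *s)
{
	if (s->state == TASK_RUNNING) {
		s->busy_spins++;        /* returns immediately */
		if (s->ticks_until_ready > 0)
			s->ticks_until_ready--;
		return;
	}
	s->real_sleeps++;               /* sleep until the device wakes us */
	s->ticks_until_ready = 0;
	s->triggered = 1;               /* done by the wakeup function */
	s->state = TASK_RUNNING;
}

/* Old scheme: the task state is set at the top of the loop. */
void poll_old(struct sim *s)
{
	for (;;) {
		s->state = TASK_INTERRUPTIBLE;
		if (sleeping_driver_poll(s))
			break;
		sim_schedule(s);
	}
	s->state = TASK_RUNNING;
}

/* New scheme: the state is set just before sleeping, unless triggered. */
void poll_new(struct sim *s)
{
	for (;;) {
		if (sleeping_driver_poll(s))
			break;
		s->state = TASK_INTERRUPTIBLE;
		if (!s->triggered)
			sim_schedule(s);
		s->state = TASK_RUNNING;
		s->triggered = 0;
	}
}
```

With `ticks_until_ready` set to 5, `poll_old()` records five immediate returns and no sleeps, while `poll_new()` records one real sleep and no spins.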

Most driver writers can remain unaware of this change, which looks highly likely to be merged for 2.6.29. But, for those who need it, there will be one more degree of flexibility in the implementation of poll() callbacks.

Exclusive I/O memory

For a while, developers involved in the hunt for the e1000e corruption bug thought that the X server might be the problem. The real bug turned out to be elsewhere, but the suspicion cast upon X led to the development of a new API designed to make it harder for user-space programs to interfere with the operation of an in-kernel driver.

In particular, it seemed sensible to prevent user space from manipulating I/O memory which has been allocated by device drivers. This can be achieved by not allowing an mmap() call on /dev/mem to map regions already given to drivers. If the STRICT_DEVMEM configuration option is set, the kernel will protect its own memory from mapping by user space; protecting I/O memory is really just a matter of extending that mechanism.

Arjan van de Ven has implemented that feature in his MMIO exclusivity patch. He chose, however, not to make this protection the default. Instead, drivers which want exclusive access to an I/O memory region should call one of these new functions:

    int pci_request_region_exclusive(struct pci_dev *pdev, int bar,
                                     const char *res_name);
    int pci_request_regions_exclusive(struct pci_dev *pdev,
                                      const char *res_name);
    int pci_request_selected_regions_exclusive(struct pci_dev *pdev,
                                               int bars,
                                               const char *res_name);

There is also a new, low-level allocation macro:

    request_mem_region_exclusive(start, n, name);

In each case, these functions are equivalent to their non-exclusive cousins, except for the changed name and the resulting exclusive allocation.

There may be cases where a developer wants to be able to map a region from user space on a development system, regardless of what the driver thinks. For such situations, there is a new iomem=relaxed boot parameter. When relaxed is selected, exclusive allocations are not enforced. Clearly this is not an option which one would want to set on a production system, but it may be useful in development environments.
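The policy described above amounts to a simple check at mmap() time. The following user-space sketch models it; `struct io_region` and `devmem_map_allowed()` are hypothetical names, and the sketch assumes a kernel where the STRICT_DEVMEM-style checking is already enabled. It is not the code from the MMIO exclusivity patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one I/O memory region claimed by a driver. */
struct io_region {
	unsigned long start;
	unsigned long len;
	bool exclusive;  /* claimed through one of the *_exclusive() calls */
};

/*
 * Decide whether user space may map [start, start + len) via /dev/mem.
 * 'relaxed' models booting with iomem=relaxed.
 */
bool devmem_map_allowed(const struct io_region *regions, size_t nregions,
			unsigned long start, unsigned long len, bool relaxed)
{
	if (relaxed)
		return true;    /* exclusive allocations are not enforced */

	for (size_t i = 0; i < nregions; i++) {
		const struct io_region *r = &regions[i];
		bool overlaps = start < r->start + r->len &&
				r->start < start + len;

		if (overlaps && r->exclusive)
			return false;
	}
	return true;
}
```

Regions claimed through the ordinary, non-exclusive calls remain mappable, which matches the decision not to make the protection the default.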

DMA API debugging

The last topic is not actually an API change, but it's worth a look anyway. The kernel provides a nice API for setting up DMA operations. In many cases, the associated functions do little or no work; the system they are running on does not require any additional effort. The result is that a lot of "tested" driver code may, in fact, have serious errors in its use of the DMA API. When those drivers are run on a different system - one with an I/O memory management unit (IOMMU) in particular - those errors could lead to no end of unpleasant behavior.

Kernel developers like the idea of finding bugs before they bite users on remote systems. To help make that happen with the DMA API, Joerg Roedel has posted a new DMA API debugging facility. This feature, when built into the kernel, should make it possible to find a number of previously-hidden bugs in device drivers. It has, in fact, already turned up a few problems with in-tree drivers, mostly in the networking subsystem.

Use of this facility simply requires enabling a configuration option; the API itself does not change. Once it's enabled, this code will check for a number of problems, including freeing DMA buffers with a different size than was given at allocation time, freeing buffers which were never allocated at all, mixing coherent and non-coherent functions on the same buffer, confusion over I/O directions, and more. Each of these problems might slip by on a developer's test system, but might create havoc where an IOMMU is being used. When a problem is found, a warning and stack traceback are logged.
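A few of the checks listed above can be modeled with a small bookkeeping table. This is a user-space sketch under stated assumptions: `debug_map()`, `debug_unmap()`, and the error codes are invented for illustration, and the real facility logs a warning and stack traceback rather than returning a value.

```c
#include <stddef.h>

enum dma_dir { DIR_TO_DEVICE, DIR_FROM_DEVICE };

enum dma_err {
	DMA_OK,
	DMA_ERR_NOT_MAPPED,   /* freeing something never allocated */
	DMA_ERR_SIZE,         /* size differs from the one used to map */
	DMA_ERR_DIRECTION,    /* I/O direction differs */
	DMA_ERR_TABLE_FULL
};

#define MAX_MAPPINGS 16

struct dma_record {
	unsigned long addr;
	size_t size;
	enum dma_dir dir;
	int in_use;
};

static struct dma_record table[MAX_MAPPINGS];

/* Remember every mapping so that the matching unmap can be verified. */
enum dma_err debug_map(unsigned long addr, size_t size, enum dma_dir dir)
{
	for (int i = 0; i < MAX_MAPPINGS; i++) {
		if (!table[i].in_use) {
			table[i] = (struct dma_record){ addr, size, dir, 1 };
			return DMA_OK;
		}
	}
	return DMA_ERR_TABLE_FULL;
}

/* Compare an unmap request against what was recorded at map time. */
enum dma_err debug_unmap(unsigned long addr, size_t size, enum dma_dir dir)
{
	for (int i = 0; i < MAX_MAPPINGS; i++) {
		struct dma_record *r = &table[i];

		if (!r->in_use || r->addr != addr)
			continue;
		if (r->size != size)
			return DMA_ERR_SIZE;
		if (r->dir != dir)
			return DMA_ERR_DIRECTION;
		r->in_use = 0;
		return DMA_OK;
	}
	return DMA_ERR_NOT_MAPPED;
}
```

In this model a mismatched unmap leaves the record in place, so a later, correct unmap of the same buffer still succeeds.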

The response to this facility has been positive. The biggest complaint seems to be about the fact that it is implemented as an x86-specific feature. So it will probably have to be made generic before merging - after all, developers on other platforms are entirely capable of introducing DMA-related bugs too. Once it goes in, this feature should probably be enabled on any system used for driver development.


Patches and updates

Linus Torvalds Linux 2.6.28-rc6

Greg KH Linux 2.6.27.7

Dimitri Sivanich SGI RTC: add clocksource/clockevent driver and generic timer vector

Antonio R. Costa [RFC PATCH] Support for AT572D940HF-EK [RFC PATCH]

Keika Kobayashi softirq: Introduce statistics for softirq

Nikanth Karthikesan kreplace: Rebootless kernel updates

Jeff Arnold Ksplice: Rebootless kernel updates

Joe Korty create /proc/timer-wheel-list

Tejun Heo poll: allow f_op->poll to sleep

Mathieu Desnoyers [RFC PATCH] Poll : add poll_wait_set_exclusive (fixing thundering herd problem in LTTng)

Andrey Mirkin In-kernel process restart

Rusty Russell cpumask conversion patches for sched

Peter Zijlstra hrtimer: removing all ur callback modes

Catalin Marinas Kernel memory leak detector (updated)

Joerg Roedel DMA-API debugging facility

Vegard Nossum sysrq-j: emergency shell Nov 22

Török Edwin tracing: userspace stacktraces

Arjan van de Ven ftrace: Add a C/P state tracer to help power optimization Oct 03

Arjan van de Ven [PATCH] scripts: script from kerneloops.org to pretty print oops dumps Nov 05

Tejun Heo CUSE: implement CUSE, take #2

Yu Zhao PCI: Linux kernel SR-IOV support

Shane McDonald Resurrect IT8172 IDE controller driver

Rodolfo Giometti LinuxPPS (Version 8): the PPS Linux implementation.

Arjan van de Ven resource: allow MMIO exclusivity for device drivers

David Daney ide: New libata driver for OCTEON SOC Compact Flash interface.

Hardik Shah V4L2 driver on Tomis DSS patches

Antonio Ospite gspca: ov534 camera driver

hvaibhav@ti.com TVP514x V4L int device driver support

Robert Jarzmik soc_camera: add format translation structure

David Daney libata: Cavium OCTEON SOC Compact Flash driver (v2)

Jaya Kumar [RFC 2.6.27] am300/broadsheet: add E-Ink Broadsheet controller and AM300 kit

Michael Kerrisk CLONE_NEWNET documentation

Gui Jianfeng introduce bio-cgroup into io-throttle

Tejun Heo FUSE: extend FUSE to support more operations, take #2

Tejun Heo FUSE: implement direct mmap

David Howells Permit filesystem local caching [ver #41]

NeilBrown RFC: allow md devices to disappear when not in use.

Evgeniy Polyakov Distributed storage.

Evgeniy Polyakov Inotify: nested attributes support.

Theodore Tso ext4: add fsync batch tuning knobs

Eric Paris [PATCH -v3 0/8] file notification: fsnotify a unified file notification backend

Ying Han page_fault retry with NOPAGE_RETRY

Lee Schermerhorn - support inheritance of mlocks across fork/exec

Patrick McHardy pkt_sched: add DRR scheduler

Eric Dumazet net: Convert TCP/DCCP listening hash tables to use RCU

Stephen Hemminger netdev: generate kobject uevent on network events.

Inaky Perez-Gonzalez merge request for WiMAX kernel stack and i2400m driver

Tetsuo Handa TOMOYO Linux

Mimi Zohar integrity

Rafael J. Wysocki 2.6.28-rc6-git1: Reported regressions from 2.6.27

Rafael J. Wysocki 2.6.28-rc6-git1: Reported regressions 2.6.26 -> 2.6.27

Hans de Goede libv4l release: 0.5.6 (The UVC release)

Luis R. Rodriguez CRDA and wireless-regdb release
