Inevitably, though, it can't define the full behaviour of an ACPI system. It doesn't explicitly state what should happen if you violate the spec, for instance. Obviously, in a just and fair world, no systems would violate the spec. But in the grim meathook future that we actually inhabit, systems do. We lack the technology to go back in time and retroactively prevent this, and so we're forced to deal with making these systems work.
This ends up being a pain in the neck in the x86 world, but it could be much worse. Way back in 2008 I wrote something about why the Linux kernel reports itself to firmware as "Windows" but refuses to identify itself as Linux. The short version is that "Linux" doesn't actually identify the behaviour of the kernel in a meaningful way. "Linux" doesn't tell you whether the kernel can deal with buffers being passed when the spec says it should be a package. "Linux" doesn't tell you whether the OS knows how to deal with an HPET. "Linux" doesn't tell you whether the OS can reinitialise graphics hardware.
Back then I was writing from the perspective of the firmware changing its behaviour in response to the OS, but it turns out that it's also relevant from the perspective of the OS changing its behaviour in response to the firmware. Windows 8 handles backlights differently to older versions. Firmware that's intended to support Windows 8 may expect this behaviour. If the OS tells the firmware that it's compatible with Windows 8, the OS has to behave compatibly with Windows 8.
In essence, if the firmware asks for Windows 8 support and the OS says yes, the OS is forming a contract with the firmware that it will behave in a specific way. If Windows 8 allows certain spec violations, the OS must permit those violations. If Windows 8 makes certain ACPI calls in a certain order, the OS must make those calls in the same order. Any firmware bug that is triggered by the OS not behaving identically to Windows 8 must be dealt with by modifying the OS to behave like Windows 8.
This sounds horrifying, but it's actually important. The existence of well-defined OS behaviours means that the industry has something to target. Vendors test their hardware against Windows, and because Windows has consistent behaviour within a version the vendors know that their machines won't suddenly stop working after an update. Linux benefits from this because we know that we can make hardware work as long as we're compatible with the Windows behaviour.
That's fine for x86. But remember when I said it could be worse? What if there were a platform that Microsoft weren't targeting? A platform where Linux was the dominant OS? A platform where vendors all test their hardware against Linux and expect it to have a consistent ACPI implementation?
Our even grimmer meathook future welcomes ARM to the ACPI world.
Software development is hard, and firmware development is software development with worse compilers. Firmware is inevitably going to rely on undefined behaviour. It's going to make assumptions about ordering. It's going to mishandle some cases. And it's the operating system's job to handle that. On x86 we know that systems are tested against Windows, and so we simply implement that behaviour. On ARM, we don't have that convenient reference. We are the reference. And that means that systems will end up accidentally depending on Linux-specific behaviour. Which means that if we ever change that behaviour, those systems will break.
So far we've resisted calls for Linux to provide a contract to the firmware in the way that Windows does, simply because there's been no need to - we can just implement the same contract as Windows. How are we going to manage this on ARM? The worst case scenario is that a system is tested against, say, Linux 3.19 and works fine. We make a change in 3.21 that breaks this system, but nobody notices at the time. Another system is tested against 3.21 and works fine. A few months later somebody finally notices that 3.21 broke their system and the change gets reverted, but oh no! Reverting it breaks the other system. What do we do now? The systems aren't telling us which behaviour they expect, so we're left with the prospect of adding machine-specific quirks. This isn't scalable.
Supporting ACPI on ARM means developing a sense of discipline around ACPI development that we simply haven't had so far. If we want to avoid breaking systems we have two options:
1) Commit to never modifying the ACPI behaviour of Linux.
2) Exposing an interface that indicates which well-defined ACPI behaviour a specific kernel implements, and bumping that whenever an incompatible change is made. Backward compatibility paths will be required if firmware only supports an older interface.
(1) is unlikely to be practical, but (2) isn't a great deal easier. Somebody is going to need to take responsibility for tracking ACPI behaviour and incrementing the exported interface whenever it changes, and we need to know who that's going to be before any of these systems start shipping. The alternative is a sea of ARM devices that only run specific kernel versions, which is exactly the scenario that ACPI was supposed to be fixing.
 Defined by implementation, not defined by specification
 Windows may change behaviour between versions, but always adds a new _OSI string when it does so. It can then modify its behaviour depending on whether the firmware knows about later versions of Windows.
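The mechanism is simple enough to sketch. Conceptually (this is a simplified illustration, not Linux's actual `acpi_osi` implementation, and the function name is made up), the OS keeps a list of interface strings it is willing to claim, and each claim is a promise to behave like the corresponding Windows version:

```c
#include <stddef.h>
#include <string.h>

/* Simplified sketch of an _OSI responder: the OS advertises the set
 * of Windows versions whose ACPI behaviour it emulates.  Claiming a
 * string is a promise to behave like that Windows version. */
static const char *supported_interfaces[] = {
    "Windows 2001",   /* Windows XP */
    "Windows 2006",   /* Windows Vista */
    "Windows 2009",   /* Windows 7 */
    "Windows 2012",   /* Windows 8 */
};

static int osi_query(const char *interface)
{
    for (size_t i = 0; i < sizeof(supported_interfaces) /
                           sizeof(supported_interfaces[0]); i++)
        if (strcmp(interface, supported_interfaces[i]) == 0)
            return 1;   /* firmware will take the matching code path */
    return 0;
}
```

Firmware typically probes from newest to oldest and takes the newest code path the OS admits to, which is exactly why answering "yes" to "Windows 2012" commits the OS to Windows 8 behaviour.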
I primarily work on Linux, so I put this in my Emacs config:
; Linux mode for C
(setq c-default-style '((c-mode . "linux")
                        (other . "gnu")))
However, other projects like QEMU have their own style preferences. So here’s what I added to use a different style for that. First, I found the qemu C style defined here. Then, to only use this on some C code, we attach a hook that only overrides the default C style if the filename contains “qemu”, an imperfect but decent-enough test.
(defconst qemu-c-style
  '((indent-tabs-mode . nil)
    (c-basic-offset . 4)
    (tab-width . 8)
    (c-comment-only-line-offset . 0)
    (c-hanging-braces-alist . ((substatement-open before after)))
    (c-offsets-alist . ((statement-block-intro . +)
                        (substatement-open . 0)
                        (label . 0)
                        (statement-cont . +)
                        (innamespace . 0)
                        (inline-open . 0)))
    (c-hanging-braces-alist . ((brace-list-open)
                               (brace-list-intro)
                               (brace-list-entry)
                               (brace-list-close)
                               (brace-entry-open)
                               (block-close . c-snug-do-while)
                               ;; structs have hanging braces on open
                               (class-open . (after))
                               ;; ditto if statements
                               (substatement-open . (after))
                               ;; and no auto newline at the end
                               (class-close))))
  "QEMU C Programming Style")

(c-add-style "qemu" qemu-c-style)

(defun maybe-qemu-style ()
  (when (and buffer-file-name
             (string-match "qemu" buffer-file-name))
    (c-set-style "qemu")))

(add-hook 'c-mode-hook 'maybe-qemu-style)
Since my blogging tsunami almost a month ago, I’ve been pretty quiet. The reason being that I’ve been heads down working on some new features for trinity which have turned out to be a lot more involved than I initially anticipated.
Trinity does all of its work in child processes continually forked off from a main process. For a long time I’ve had “investigate using pthreads” as a TODO item, but after various conversations at kernel summit, I decided to bump the priority of that up a little, and spend some time looking at it. I initially guessed that it would take maybe a few weeks to have something usable, but after spending some time working on it, every time I make progress on one issue, it becomes apparent that there’s something else that is also going to need changing.
I’m taking a week off next week to clear my head and hopefully return to this work with fresh eyes, and make more progress, because so far it’s been mostly frustrating, and there may be an easier way to solve some of the problems I’ve been hitting. Sidenote: In the 15+ years I’ve been working on Linux, this is the first time I recall actually ever using pthreads in my own code. I can’t say I’ve been missing out.
Unrelated to that work, a month or so ago I came up with a band-aid fix for a problem where trinity would corrupt its own structures. That ‘fix’ turned out to break the post-mortem work I implemented a few months prior, so I’ve spent some time this week undoing that, and thinking about how I’m going to fix that properly. But before coming up with a fix, I needed to reproduce the problem reliably, and naturally now that I’ve added debug code to determine where the corruption is coming from, the bug has gone into hiding.
I need this vacation.
This is a small release; the more notable changes in man-pages-3.72 are the addition of three new pages by Peter Schiffer that document glibc commands used for memory profiling and malloc tracing:
Just an FYI, I lost my GPG key a few months back during an upgrade, and have created a new one. This was signed by folk at LinuxCon/KS last month.
The new key ID / fingerprint is: D950053C / 8327 23D0 EF9D D46D 9AC9 C03C AD98 4BBF D950 053C
Please use this key and not the old one!
Have you ever played the "press any key to stop autoboot" game, followed by copying boot commands from your notes, because you wanted to keep the boot loader in its original (early project phases) or final (late project phases) configuration? Have you reached level 2, playing the autoboot game over the internet?
If so, you may want to take a look at the boot shell (bs) from the Not Universal Test System project. In the ideal case, it knows how to turn the target off and on, break into autoboot, boot your target in development mode, and log in as root when userland is ready.
Upstream Review Training Slides
As well as many smaller fixes to various pages, the more notable changes in man-pages-3.71 are the following:
- A subheading "C library/kernel ABI differences" has been added to various pages that describe cases where the glibc wrapper for a system call provides some different behavior from the underlying system call.
- Details about the ancient Linux libc (which ceased to be used in the late 1990s) have now been removed from various pages.
- A new group_member(3) page documents the group_member() library function.
- A new isfdtype(3) page documents the isfdtype() library function.
- Vince Weaver added documentation of various new perf_event_open() features to the perf_event_open(2) page.
- I've added documentation of a number of /proc files to the proc(5) page.
It became clear that there is much confusion over read and write; e.g., Linus thought read() was like write(), whereas I thought (prior to my last post) that write() was like read(). Both wrong…
Both Linux and v6 UNIX always returned from read() once data was available (v6 didn’t have sockets, but they had pipes). POSIX even suggests this:
The value returned may be less than nbyte if the number of bytes left in the file is less than nbyte, if the read() request was interrupted by a signal, or if the file is a pipe or FIFO or special file and has fewer than nbyte bytes immediately available for reading.
But write() is different. Presumably so simple UNIX filters didn’t have to check the return and loop (they’d just die with EPIPE anyway), write() tries hard to write all the data before returning. And that leads to a simple rule. Quoting Linus:
Sure, you can try to play games by knowing socket buffer sizes and look at pending buffers with SIOCOUTQ etc, and say “ok, I can probably do a write of size X without blocking” even on a blocking file descriptor, but it’s hacky, fragile and wrong.
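That rule is why portable code can treat a short read as normal but still wants a helper that loops on write() for the rarer cases (signals, non-blocking descriptors) where a short write can happen. A minimal sketch (write_all is a made-up helper name):

```c
#include <errno.h>
#include <unistd.h>

/* Keep calling write() until all len bytes are sent, or a real error
 * occurs.  Returns 0 on success, -1 with errno set on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte: retry */
            return -1;
        }
        p += n;             /* short write: advance and loop */
        len -= n;
    }
    return 0;
}
```

On a regular file or an uncontended blocking pipe the loop body usually runs exactly once, which is precisely the "write() tries hard" behaviour described above.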
I’m travelling, so I built an Ubuntu-compatible kernel with a printk() into select() and poll() to see who else was making this mistake on my laptop:
cups-browsed: (1262): fd 5 poll() for write without nonblock
cups-browsed: (1262): fd 6 poll() for write without nonblock
Xorg: (1377): fd 1 select() for write without nonblock
Xorg: (1377): fd 3 select() for write without nonblock
Xorg: (1377): fd 11 select() for write without nonblock
This first one is actually OK; fd 5 is an eventfd (which should never block). But the rest seem to be sockets, and thus probably bugs.
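The fix for that class of bug is to put the descriptor into non-blocking mode before select()ing or poll()ing for writability, so a subsequent large write() returns a short count or EAGAIN instead of stalling the process. A sketch (set_nonblock and wait_writable are hypothetical helper names):

```c
#include <fcntl.h>
#include <poll.h>

/* Put fd into non-blocking mode.  Returns -1 on error. */
static int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Wait until fd is writable.  With the fd non-blocking, a following
 * write() can return a short count or EAGAIN, but can't block. */
static int wait_writable(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    return poll(&pfd, 1, -1);
}
```

The caller then loops: poll for POLLOUT, write what it can, and repeat with the unwritten remainder.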
What’s worse are the Linux select() and poll() man pages:
A file descriptor is considered ready if it is possible to perform the corresponding I/O operation (e.g., read(2)) without blocking.
... those in writefds will be watched to see if a write will not block...
POLLOUT Writing now will not block.
Man page patches have been submitted…
For the last of these breakdowns, I’ll focus on fifth place: networking.
Linux supports many different network protocols, so I spent quite a while splitting the net/ tree into per-protocol components. The result looks like this.
The networking code has gotten noticeably better over the last year. When I initially introduced these components, they were all well into double figures. Now even crap like DECnet has gotten better (both users will be very happy).
“Everything else” above is actually a screw-up on my part. For some reason around 50 or so netfilter issues haven’t been categorized into their appropriate component. The remaining ~70 are quite a mix, but nearly all are small numbers of issues spread across many components. Things like 9p, atm, ax25, batman, can, ceph, l2tp, rds, rxrpc, tipc, vmwsock, and x25. The Lovecraftian protocols you only ever read about.
So networking is in pretty good shape considering just how much stuff it supports. While there are 24 issues in a common protocol like ipv4, they tend to be mostly benign things rather than OMG 24 WAYS THE NSA IS OWNING YOUR LINUX RIGHT NOW.
That’s the last of these breakdowns I’ll do for now. I’ll do this again maybe in six months to a year, if things are dramatically different, but I expect any changes to be minor and incremental rather than anything too surprising.
After I get back from kernel summit and recover from travelling, I’ll start a series of posts showing code examples of the top checkers.
In fourth place on the list of hottest areas of the kernel as seen by Coverity, is drivers/net/wireless.
I mentioned in my drivers/staging examination that the Realtek wifi drivers stood out as especially problematic. Here we see the same situation. Larry Finger has been working on cleaning up this (and other drivers) for some time, but it apparently still has a long way to go.
It’s worth noting that “Atheros” here is actually a number of drivers (ar5523, ath10k, ath5k, ath6k, ath9k, carl9170, wcn36xx, wil6210). I’ve not had time to break those down into smaller components yet, though a quick look shows that ath9k in particular accounts for a sizable portion of those 74 issues.
I was actually surprised at how low the iwlwifi and b43 counts were. I guess there’s something to be said for ubiquitous hardware.
What of all the ancient wireless drivers? The junky pcmcia/pccard drivers like orinoco and friends?
They’re in with those 65 “everything else” bugs, and make up < 5-6 issues each. Considering their age, and lack of any real maintenance these days, they’re in surprisingly good shape.
Just for fun, here’s how the drivers above compare against the wireless drivers currently in staging.
The filesystem code shows up in the number two position of the list of hottest areas of the kernel. Like the previous post on drivers/scsi, this isn’t because “the filesystem code is terrible”, but more that Linux supports so many filesystems that the cumulative effect of issues present in all of them adds up to a figure that dominates the statistics.
The breakdown looks like this.
fs/*.c accounts for the VFS core, AIO, binfmt parsers, eventfd, epoll, timerfd, xattr code and a bunch of assorted miscellany. Little wonder it shows up so high: it’s around 62,000 LOC by itself. Of all the entries on the list, this is perhaps the most concerning area, given it affects every filesystem.
A little more concerning perhaps is that btrfs is so high on the list. Btrfs is still seeing a lot of churn each release, so many of these issues come and go, but it seems to be holding roughly at the same rate of new incoming issues each release.
EXTn covers ext2, ext3, and ext4 combined. Not too bad considering that’s around 74,000 LOC combined (and another 15K LOC for jbd/jbd2).
The CIFS, NFS and OCFS filesystems stand out as potential concerns, especially if those issues are triggerable over the wire.
XFS has been improving over the past year. It was around 60-70 when I started doing regular scans, and continues to move downward each release, with few new issues getting added.
The remaining filesystems: not too shabby. Especially considering some of the niche ones don’t get a lot of attention.
drivers/scsi showed up in third place in the list of hottest areas of the kernel. Breaking it down into sub-components, it looks like this.
All these components have been steadily improving over the last year. The obvious stand-out is “Everything else” that looks like it needs to be broken out into more components.
But drivers/scsi is one area of the kernel where we have a *lot* of legacy drivers, many of them 10-15 years old. (Remarkably, some of these are even still in regular use). Looking over the list of filenames matching the “Everything else” component, pretty much every driver that isn’t broken out into its own component is on the list. 3w-9xxx, NCR5380, aacraid, advansys, aic94xx, arcmsr, atp870, bnx2i, cxgbi, dc395x, dpt_i2o, eata, esas2, fdomain, fnic, gdth, hpsa, imm, ipr, ips, mvsas, mvumi, osst, pmcraid, qla1280, qlogicfas, stex, storvsc_drv, sym53x8xx, tmscsim.
None of these are particularly worse than the others, most averaging less than a half dozen issues each.
Ignoring the problems I currently have adding more components, it’s not particularly helpful to break it down further when the result is going to be components with a half dozen issues. It’s not that there’s a few awful drivers dragging down the average, it’s that there’s so many of them, and they all contribute a little bit of awful.
Something I’d like to component-ize, but can’t easily without crafting and maintaining ugly regexps, is the core scsi functionality and its libraries. The problem is that drivers/scsi/*.c includes both legacy drivers, and also scsi core functionality & library functions. I discussed potentially moving all the old drivers to a “legacy” or “vintage” sub-directory at LSF/MM earlier this year with James, but he didn’t seem overly enthusiastic. So it’s going to continue to be lumped in with “Everything else” for now.
The difficulty with figuring out whether many of these issues are real concerns is that because they are hardware drivers, the scanner has no way of knowing what range of valid responses the HBA will return. So there are a number of issues which are of the form “This can’t actually happen, because if the HBA returned this, then we would have called this other function instead”.
Not a problem unique to SCSI, and something that’s seen across many different parts of the kernel.
And for those ancient 15-year-old drivers? It’s tough to find someone who either remembers how they work at a chip level, or cares enough to go back and revisit them.
In my previous post, I mentioned that drivers/staging took the top spot for number of issues in a component.
Here’s a ‘zoomed in’ look at the sub-components under drivers/staging.
Everything else in drivers/staging/ (40 other uncategorized drivers): 95 issues.
Some of the sub-components with fewer than 10 issues are likely to have their categories removed soon. When they were initially added, the open issue counts were higher, but over time they’ve improved to the point where they could just be lumped in with “everything else”.
When Lustre was added back in 3.12, it caused a noticeable jump in new issues detected: the largest delta from any single addition since I’ve been doing regular scans. It’s continuing to make progress, with 20 or so issues being knocked out each release, and few new issues being introduced. Lustre doesn’t suffer overly from any one issue, but has a grab-bag of issues from the many checkers that Coverity has.
Amusingly, Lustre is the only part of the kernel that has Coverity annotations in the code.
Second on the list is the bcm Wimax driver. This has been around in staging for years, and has had a metric shitload of checkpatch-type stylistic changes made to it, but relatively few actual functionality fixes. (Confession: I was guilty of ~30 of those cleanups myself, but I couldn’t bear to look at the 1906-line bcm_char_ioctl function. Splitting that up did have a nice side-effect though.) A lot of the issues in this driver are duplicates due to a problem in a macro being picked up as a new issue for every instance it gets used.
Something that sticks out in this list is the cluster of rtl* drivers. At time of writing there are seven drivers for various Realtek wireless chips, all of varying quality. Much of the code between these drivers is cut-and-pasted from previous drivers. It seems each time Realtek rev new silicon, they do another code-drop with a new driver. Worse yet, many of the fixes that went into the kernel variants don’t make it back to the driver they based their new work on. There have been numerous cases where a bug fixed in one driver has been reintroduced in a new variant months later. There’s a ton of work going on here, and a lot more needed.
Somewhat depressingly, even the not-in-staging rtlwifi driver that lives in drivers/net/wireless has ~100 issues. Many of them the exact same issues as those in the staging drivers.
As bad as it seems, staging is serving its purpose for the most part, and things have gotten a lot quieter each merge window when the staging tree gets pulled. It’s only when it contains something new and huge like Lustre that it really shows up noticeably in the daily stats after each scan. The number of new issues being added are generally lower than the number being fixed. For the 3.17 pull for example, 67 new issues, 132 eliminated. (Note: Those numbers are kernel wide, not *just* staging, but staging made up the majority of the results change on that day).
Something that bothers me slightly is that a number of drivers have ‘escaped’ drivers/staging into the kernel proper, with a high number of open issues. That said, many of those escapees are no worse than drivers that were added 10+ years ago when standards were lower. More on that in a future post.
One of the time-consuming parts of organizing the data generated by Coverity has been sorting it into categories, (or components as Coverity refers to them). A component is a wildcard (or exact filename) that matches a specific subsystem, driver, filesystem etc.
As the Linux kernel has thousands of drivers, it isn’t really practical to add a component per-driver, so I started by generalizing into subsystems, and from there, broke down the larger groupings into per-driver components, while still leaving an “everything else” catch-all for drivers within a subsystem that hadn’t been broken out.
According to discussions I’ve had with Coverity, we are actually one of the more ‘heavy’ users of components, and we’ve hit a few scalability problems as we’ve added more and more of them, which has been another reason I’ve not broken things down beyond the ~150 components we have so far. Also, if a component has fewer than 10 or so issues, it’s really not worth the effort of splitting up. (I may revise that cut-off even higher at some point just to keep things manageable).
Before the big reveal, some caveats:
- Something having ‘100’ issues may not be 100 individual problems. For example, if a problem is in a macro, Coverity flags a new issue for every use of that macro. For something heavily used, like a formatted printk debug wrapper, this could account for many, many warnings.
- Many of these issues aren’t actual bugs. At the same time, the checker isn’t wrong, but has no way to infer that the use is ok. I’ll explain more about these in a future post when I start showing some actual warnings.
- Sometimes a combination of both the previous points. As an example: The nouveau driver this week had ~100 issues open against it, making it the #1 in the list of drm drivers with issues. Ben Skeggs spent some time going over them all, and closed out 80-90 of them as intentional/false positives, and came away with around a half dozen or so issues that actually need code changes, and around 20 issues that are still undecided. It’s a laborious time-consuming effort to dig through them, and in many cases, only the person who wrote the code can really determine if the intent matches what the code actually does.
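The macro effect from the first caveat is easy to picture. In this hypothetical wrapper (names invented for illustration), one flaw in the macro body gets reported once per expansion site:

```c
#include <stdio.h>

struct device { const char *name; };

/* Hypothetical debug wrapper: if the macro body has a flaw (say, no
 * NULL check on dev before dereferencing it), the scanner reports a
 * separate issue at every expansion, inflating the count for what is
 * really one underlying bug. */
#define DBG(dev, fmt, ...) \
    printf("%s: " fmt "\n", (dev)->name, ##__VA_ARGS__)

static void probe(struct device *dev)
{
    DBG(dev, "probing");      /* expansion 1: one report */
}

static void attach(struct device *dev)
{
    DBG(dev, "attached");     /* expansion 2: same flaw, second report */
}
```

Fixing the macro once makes the whole cluster of reports disappear in the next scan, which is why raw issue counts can move sharply between releases.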
Right now, the top ten ‘hot areas’ of the kernel (these include accumulated broken-out drivers), sorted by number of issues are:
It should come as no surprise really that the staging drivers take the number one spot. If something had beaten it, I think it would have highlighted a somewhat embarrassing problem in our development methods.
In the next posts, I’ll drill down into each of these categories, and figure out exactly why they’re at the top of the list.
For the impatient: once this series is over, I intend to show breakdowns of the various types of issues being detected, but it’s going to take me a while to get to (probably post kernel summit). There’s an absolute ton of data to dig through, and I’m trying to present as much of it in bite-sized chunks as possible, rather than dozens of pages of info.
Next week at kernel summit, I’m going to be speaking about the Coverity scans, and have come up with more material than I have time to present in the short slot, so I’ve decided to turn it into a series of blog posts in a hope to kickstart some discussion ahead of time.
I started doing regular scans against the Linux kernel in July 2013. In that time, I’ve sent a bunch of patches, reported many bugs, and spent hours going through the database categorizing, diagnosing, and closing out issues where possible.
I’ve been doing at least one build per day during each merge window (except obviously on days when there haven’t been any commits), and at least one per -rc once the merge window closes.
A few people have asked me about the config file that I use for the builds.
It’s pretty much an ‘allmodconfig’, except where choices have to be made, I’ve tried to pick the general case that a distribution would select. For some of these, I will occasionally flip between them (for eg, SLAB/SLOB/SLUB, PREEMPT_NONE/PREEMPT_VOLUNTARY/PREEMPT) just for coverage. In total, currently 6955 CONFIG_ options are enabled, 117 disabled. (None by choice, they are all the deselected parts of multi-choice options).
The builds are done x86-64 only. At this time, it’s the only architecture Coverity scan supports. I do have CONFIG_COMPILE_TEST set, so non-x86 drivers that can be built do get scanned. The architecture-specific code in arch/ and the drivers not covered by COMPILE_TEST are the only parts of the kernel we’re not covering.
Builds take about an hour on a 24-core Nehalem. The results are then uploaded to a server, which takes another 20 minutes. Then a script kicks something at Coverity to pick up the new tarball and scan it. This can take any number of hours: at best around 5-6, at worst I’ve seen it take as long as 12. This hopefully answers why I don’t do even more builds, or builds of variant trees. (Although I’m still trying to figure out a way to scan linux-next while having it inherit the results of the issues already marked in Linus’ tree.) Thankfully much of the build/upload/scan process is automated, so I can do other things while I wait for it to finish.
Over the year, the overall defect density has been decreasing.
Moving in the right direction, though things have slowed a little the last few releases. At least in part due to my spending more time on Trinity than going through the Coverity backlog. The good news is that the incoming rate of new bugs each window has also slowed.
New issues are getting jumped on faster than before when they are introduced. Many developers have signed up for accounts and are looking over their subsystems each release, which is great. It means I have to spend less time sending email :)
Eventually I hope that Coverity implements a feature I asked for allowing each component to have a designated email address that new reports get sent to. With that in place, plus active triage on the backlog, a real dent could be made in the ~4700 outstanding issues.
Throughout the past year Coverity has made a number of improvements server-side, some at the behest of the scans, resulting in fewer false positives being found by some checkers. A good example of this was some additional heuristics being added to spot intentional ‘missing break in switch statement’ situations. I’ve also been in constant communication whenever an interesting bug was found upstream that Coverity didn’t detect, so over time, additional checkers should be added to catch more bugs.
How do we compare against other projects?
I picked a few at random.
FreeBSD: 0.54 defect density (~15m LOC); 14655 total, 6446 fixed, 8093 outstanding.
Firefox: 0.70 defect density (~5.4m LOC); 9008 total, 5066 fixed, 3786 outstanding.
Linux: 0.53 defect density (~9m LOC); 13337 total, 7202 fixed, 4761 outstanding.
Python: 0.03(!) defect density (~400k LOC); 1030 total, 895 fixed, 3 outstanding.
(LOC based on C preprocessor output)
FreeBSD’s defect density is pretty much the same as Linux’s right now, despite having a lot more code. I think they include all their userspace in their scans too, so it has picked up gcc, sendmail, binutils, etc.
The Python people have made a big effort to keep their defect density low (afaik, the lowest of all projects in scan). They did, however, have a lot fewer issues to begin with, and have a much smaller codebase. Firefox by comparison seems to have a lot of the same problems Linux has: a large corpus of pre-existing issues, and a large codebase (probably with few people with ‘global’ knowledge).
In my next post, I’ll go into some detail about where some of the denser areas of the kernel are for Coverity issues. Much of it should be no surprise (old, unmaintained/neglected code etc), but there are a few interesting cases.
update : added FreeBSD statistics.
update 2 : (hi hackernews!) added blurb about coverity improvements.
Before my talk, Michael Tautschnig gave a presentation (based on this paper) on an interesting prototype tool (called “mole,” with the name chosen because the gestation period of a mole is about 42 days) that helps identify patterns of usage in large code bases. It is early days for this tool, but one could imagine it growing into something quite useful, perhaps answering questions such as “what are the different ways in which the Linux kernel uses reference counting?” He also took care to call out the many disadvantages of testing, which include not being able to test all paths, all possible races, all possible inputs, or all possible much of anything, at least not in finite time.
I started my talk with an overview of split counters, where each CPU (or task or whatever) updates its own counter, and the aggregate counter is read out by summing all the per-CPU counters. There was some concern expressed by members of the audience about the possibility of inconsistent results. For example, if one CPU adds five, another CPU adds seven, and a third CPU adds 11 to an initially zero counter, then two CPUs reading out the counter might see 12 and 18, respectively, which are inconsistent (they differ by six, and no CPU added six). To their credit, the attendees picked right up on a reasonable correctness criterion. The idea is that the aggregate counter's value varies with time, and that any given reader will be guaranteed to return a value between that of the counter when the reader started and that of the counter when the reader ended: Consistency is neither needed nor provided in a great number of situations.
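A split counter can be sketched in a few lines of C (a simplified illustration; a real kernel implementation would use per-CPU variables and would need to worry about torn reads of a long on some architectures):

```c
#define NR_CPUS 4

/* Split counter: each CPU increments only its own slot, so updates
 * never contend.  Readers sum all slots, racing benignly with
 * concurrent updates. */
static long counter[NR_CPUS];

static void counter_add(int cpu, long n)
{
    counter[cpu] += n;      /* only this CPU writes this slot */
}

static long counter_read(void)
{
    long sum = 0;

    for (int i = 0; i < NR_CPUS; i++)
        sum += counter[i];  /* may see a mix of old and new values */
    return sum;
}
```

A reader that overlaps updates can return any value between the counter's value at the start of the read and its value at the end, which is exactly the correctness criterion the audience arrived at.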
I then gave my usual introduction to RCU, and of course it is always quite a bit of fun introducing RCU to people who have never encountered anything like it. There was quite a bit of skepticism initially, as well as a lot of questions and comments.
I then turned to validation, noting the promise of some formal-validation tooling. I ended by saying that although I agreed with the limitations of testing called out by the previous speaker, the fact remains that a number of people have devised tests that have found RCU bugs (thank you, Stephen, Dave, and Fengguang!), but no one has yet devised a hard-core formal-validation tool that has found any bugs in RCU. I also pointed out that this is definitely not because there are no bugs in RCU! (Yes, I have gotten rid of all day-one bugs in RCU, but only by having also gotten rid of all day-one code in RCU.) When asked if I meant bugs in RCU usage or in RCU itself, I replied “Either would be good.” Several people wrote down where to find RCU in the Linux kernel, so it will be interesting to see what they come up with. (Perhaps all too interesting!)
There were several talks on analyzing weakly ordered systems, but keep in mind that for these guys, even x86 is weakly ordered. After all, it allows prior stores to be reordered with later loads.
Another interesting talk was given by Kapil Vaswani on the topic of wait freedom. Recall that in a wait-free algorithm, every process is guaranteed to make some progress in a finite time, even in the presence of arbitrarily long delays for any given process. In contrast, in a lock-free algorithm, at least one process is guaranteed to make some progress in a finite time, again, even in the presence of arbitrarily long delays for any given process. It is worth noting that neither of these guarantees is sufficient for real-time programs, which require a specified amount of progress (not merely some progress) in a bounded amount of time (not merely a finite amount of time). Wait-freedom and lock-freedom are nevertheless important forward-progress guarantees, and there are numerous other similar guarantees, including obstruction freedom, deadlock freedom, starvation freedom, and many more besides.
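The distinction is easiest to see with a shared counter; the following is my own illustrative sketch, not code from the talk. An atomic fetch-and-add is wait-free because every caller finishes in a bounded number of steps, while a compare-and-swap loop is merely lock-free because any given caller can lose the CAS race indefinitely while other callers succeed.

```c
#include <stdatomic.h>

/* Wait-free: every caller completes in a bounded number of steps,
 * no matter what the other threads are doing. */
static long inc_wait_free(_Atomic long *ctr)
{
	return atomic_fetch_add(ctr, 1) + 1;
}

/* Lock-free: some caller always succeeds, but an unlucky caller can
 * retry forever under contention, so no per-caller bound exists. */
static long inc_lock_free(_Atomic long *ctr)
{
	long old = atomic_load(ctr);

	/* On failure, the CAS reloads old with the current value. */
	while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
		;
	return old + 1;
}
```

Both functions compute the same result; the difference is purely in the forward-progress guarantee each one offers to an individual caller.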
It turns out that most software in production, even in real-time systems, is not wait-free, which has been a source of consternation for many researchers for quite some time. Kapil went on to describe how Alistarh et al. showed that, roughly speaking, given a non-hostile scheduler and crash-free execution, lock-free algorithms have wait-free behavior.
The interesting thing about this is that you can take it quite a bit farther, and those of you who know me well won't be surprised to learn that I did just that in a question to the speaker. If you have a non-hostile scheduler, crash-free execution, FIFO locks, bounded lock-hold times, no lock nesting, a finite number of processes, and so on, you can obtain the benefits of the more aggressive forward-progress guarantees. The general idea is that if you have at most N processes, and if the maximum lock-hold time is T, then you can wait at most (N-1)T time to acquire a given lock. (Those wishing greater rigor should read Bjoern B. Brandenburg's dissertation — Full disclosure: I was on Bjoern's committee.) In happy contrast to the authors of the paper mentioned in the previous paragraph, the speaker and audience seemed quite intrigued by this line of thought.
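That bound is simple enough to state as code; the function name here is mine. The reasoning is as above: with a FIFO lock, at most N-1 other processes can be queued ahead of a given acquirer, and each holds the lock for at most T.

```c
/* Worst-case time to acquire a FIFO lock shared by nproc processes,
 * assuming no holder keeps it longer than max_hold: at most the
 * nproc-1 other processes can be queued ahead of you, once each. */
static double max_fifo_lock_wait(int nproc, double max_hold)
{
	return (nproc - 1) * max_hold;
}
```

So four processes with a 2ms maximum hold time give a 6ms worst-case acquisition wait; note how quickly this degrades as N grows, which is why bounded lock nesting and short critical sections matter so much in practice.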
In all, it was an extremely interesting and thought-provoking time. With some luck, perhaps we will see some powerful software tools introduced by this group of researchers.
With the 3.17 merge window opening up this week, it’s been kinda busy.
I also made a few enhancements to Trinity, so it found some bugs that have been there for a while.
- btrfs extents related insert_state() BUG. This was some new code I added to Trinity to open a bunch of real files on startup, and keep reusing those fds for various file operations. Until now we only really used pre-existing files from sysfs, procfs and /dev. The idea is to eventually flesh things out so it starts doing fsx-like operations on these real files. (I’ve still no idea how feasible it is to add the integrity checking that fsx does, given Trinity is threaded. Tricky). Shortly after reporting that bug, I inadvertently made it not reproducible when I “cleaned up” some code that was using fileno() to get the fd of an fopen()’d stream, making it just use open() instead. So it turns out that bug only happens on file streams. Upon realizing this, I wrote a bunch of code in Trinity so it now randomly chooses whether to use fileno(fopen()) or a plain open(). Nothing additional has fallen out (yet).
- Tuesday saw an x86 merge, and with it something bothersome. I started seeing watchdog hangs. At the time I reported this, it always seemed to happen during boot up from power off. Later in the week I started hitting it during fuzzing sessions too.
- Sending IPIs from NMI context ends badly, film at 11.
- RCU warning spew from tracepoints. Quickly fixed up.
- Some fun with fault injection broke procfs.
- The same fault injector found that btrfs handles memory failure badly in extent allocation.
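The stream-versus-raw-fd choice mentioned in the first item above can be sketched like this. This is a hypothetical reconstruction of the idea, not Trinity's actual code; get_test_fd() is a name I made up.

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical sketch: half the time obtain the fd through a stdio
 * stream via fileno(), half the time via a raw open(), since some
 * kernel bugs only reproduce through file streams. */
int get_test_fd(const char *path)
{
	if (rand() & 1) {
		FILE *f = fopen(path, "r");

		if (f == NULL)
			return -1;
		/* Deliberately do not fclose(): the fd must stay usable. */
		return fileno(f);
	}
	return open(path, O_RDONLY);
}
```

Either branch hands back an fd that the fuzzer's file operations can reuse, so both code paths through the kernel get exercised over a long run.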
In addition to this, I started pulling together a talk for kernel summit based on all the stuff that Coverity has been finding. I’ll eventually get around to turning those into blog posts too, as there’s a lot of material.