Anyway, the upload of the points I collected is now finished, so the situation near Prague should improve rapidly.
What is the expected range of wifi? Normally 300 meters is quoted, but I frequently see much bigger distances in my testing. Is it possible that there are multiple transmitters with the same MAC address?
First day back after a week off. Skimmed through thousands of emails.
Damage so far:
- Router OOM’d itself to the point of locking up completely last night. Normally I reboot it daily into Linus’ kernel of the day, but as I was on vacation, something kept leaking memory over the period of a week. Updated everything to current, with a plan to keep an eye on it over the next week. Looking at the mrtg graphs, the memory goes down every night after cron runs, and is never reclaimed.
- hit a GPF in aio_migratepage
- Still seeing some bugs from before my vacation that aren’t fixed yet.
- My AMD test box completely died. It powers up for a second and then turns itself off. Annoying, as this was my I/O test machine.
Yesterday, I got to play with the Mozilla location service again, and decided to check their progress... (and learn how to deal with JSON over HTTP in the process). The results are not exactly pretty: my GSM database had over 72000 measurements, with 2065 unique cells (mostly in the Czech Republic). Of those, Mozilla knew the location of 6 cells...
(I'm now uploading my measurements to Mozilla, so results for the center of the Czech Republic should get a bit better.)
If someone is interested in the code, it is in the tui project on SourceForge.
The situation with wifi seems to be a bit better, but the scripts are slow, so it will take a few hours before I know the results.
To avoid repeating all the same mistakes again I've written up some of the lessons learned while botching the job for the drm/i915 driver. Most of these only cover technicalities and not the big-picture issues like what the command submission ioctl exactly should look like. Learning these lessons is probably something every GPU driver has to do on its own.
Prerequisites

First the basics - without these the fail already starts with the need for a 32bit compat layer:

- Only use fixed-size integers. To avoid conflicts with typedefs in userspace the kernel has special types like __s64. Use them.
- Align everything to the natural size and use explicit padding. 32bit platforms don't necessarily align 64bit values to 64bit boundaries, but 64bit platforms do. So we always need padding to the natural size to get this right.
- Pad the entire struct to a multiple of 64bits - the structure size will otherwise differ on 32bit versus 64bit. Which hurts when passing arrays of structures to the kernel. Or with the ioctl structure size checking that e.g. the drm core does.
- Pointers are __u64, cast from/to a uintptr_t on the userspace side and from/to a void __user * in the kernel. Try really hard not to delay this conversion or worse, fiddle the raw __u64 through your code since that diminishes the checking that tools like sparse can provide.
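Putting these rules together, an ioctl argument struct could look like the following sketch (all names are made up purely for illustration):

    struct foo_dummy_args {	/* hypothetical example, not a real interface */
    	__u64 data_ptr;	/* userspace pointer stored as a __u64 */
    	__u32 flags;
    	__u32 pad;	/* explicit padding, keeps the size a multiple of 64 bits */
    };

    /* kernel side: convert exactly once, as early as possible */
    void __user *data = (void __user *)(uintptr_t)args->data_ptr;

    /* userspace side */
    args.data_ptr = (__u64)(uintptr_t)buffer;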
Basics

With the joys of writing a compat layer avoided we can take a look at the basic fumbles. Neglecting these will make backward and forward compatibility a real pain. And since getting things wrong on the first attempt is guaranteed you will have a second iteration or at least an extension for any given interface.
- Have a clear way for userspace to figure out whether your new ioctl or ioctl extension is supported on a given kernel. If you can't rely on old kernels rejecting the new flags/modes or ioctls (since doing that was botched in the past) then you need a driver feature flag or revision number somewhere.
- Have a plan for extending ioctls with new flags or new fields at the end of the structure. The drm core checks the passed-in size for each ioctl call and zero-extends any mismatches between kernel and userspace. That helps, but isn't a complete solution since newer userspace on older kernels won't notice that the newly added fields at the end get ignored. So this still needs a new driver feature flag.
- Check all unused fields and flags and all the padding for whether it's 0, and reject the ioctl if that's not the case. Otherwise your nice plan for future extensions is going right down the gutters since someone will submit an ioctl struct with random stack garbage in the yet unused parts. Which then bakes in the ABI that those fields can never be used for anything else but garbage.
- Have simple testcases for all of the above.
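The zero-checking just described is only a few lines; a minimal sketch, reusing the hypothetical struct from above (FOO_SUPPORTED_FLAGS is equally made up):

    if (args->flags & ~FOO_SUPPORTED_FLAGS)
    	return -EINVAL;	/* unknown flag bits are rejected */
    if (args->pad != 0)
    	return -EINVAL;	/* keeps the padding usable for future extensions */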
Fun with Error Paths

Nowadays we don't have any excuse left any more for drm drivers being neat little root exploits. Which means we both need full input validation and solid error handling paths - GPUs will die eventually in the oddest cornercases anyway:
- The ioctl must check for array overflows. Also it needs to check for over/underflows and clamping issues of integer values in general. The usual example is sprite positioning values fed directly into the hardware with the hardware just having 12 bits or so. Works nicely until some odd display server doesn't bother with clamping itself and the cursor wraps around the screen.
- Have simple testcases for every input validation failure case in your ioctl. Check that the error code matches your expectations. And finally make sure that you only test for one single error path in each subtest by submitting otherwise perfectly valid data. Without this an earlier check might reject the ioctl already and shadow the codepath you actually want to test, hiding bugs and regressions.
- Make all your ioctls restartable. First X really loves signals and second this will allow you to test 90% of all error handling paths by just interrupting your main test suite constantly with signals. Thanks to X's love for signals you'll get an excellent base coverage of all your error paths pretty much for free for graphics drivers. Also, be consistent with how you handle ioctl restarting - e.g. drm has a tiny drmIoctl helper in its userspace library. The i915 driver botched this with the set_tiling ioctl, and now we're stuck forever with some arcane semantics in both the kernel and userspace. (See the sketch after this list.)
- If you can't make a given codepath restartable make a stuck task at least killable. GPUs just die and your users won't like you more if you hang their entire box (by means of an unkillable X process). If the state recovery is still too tricky have a timeout or hangcheck safety net as a last-ditch effort in case the hw has gone bananas.
- Have testcases for the really tricky cornercases in your error recovery code - it's way too easy to create a deadlock between your hangcheck code and waiters.
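For illustration, the restartable pattern might look like this sketch; the wait queue and completion check are made-up names, and the userspace loop is essentially what the drmIoctl helper does:

    /* kernel side: an interruptible wait makes the ioctl restartable */
    ret = wait_event_interruptible(dev->fence_wq, foo_request_done(dev));
    if (ret)
    	return ret;	/* -ERESTARTSYS, transparently restarted after a signal */

    /* userspace side: retry when a signal interrupted the ioctl */
    do {
    	ret = ioctl(fd, request, arg);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));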
Time, Waiting and Missing it

GPUs do almost everything asynchronously, so we need to time operations and wait for outstanding ones. This is really tricky business; at the moment none of the ioctls supported by the drm/i915 driver get this fully right. Which means there are still tons more lessons to learn here.
- Use CLOCK_MONOTONIC as your reference time, always. It's what alsa, drm and v4l use by default nowadays. But let userspace know which timestamps are derived from different clock domains like your main system clock (provided by the kernel) or some independent hardware counter somewhere else. Clocks will mismatch if you look close enough, but if performance measuring tools have this information they can at least compensate. If your userspace can get at the raw values of some clocks (e.g. through in-command-stream performance counter sampling instructions) consider also exposing those.
- Use __u64 nanoseconds to specify time. It's not the most convenient time specification, but it's mostly the standard.
- Check that input time values are normalized and reject them if not. Note that the kernel native struct ktime has a signed integer for both seconds and nanoseconds, so beware here.
- For timeouts, use absolute times. If you're a good fellow and made your ioctl restartable, relative timeouts tend to be too coarse and can indefinitely extend your wait time due to rounding on each restart. Especially if your reference clock is something really slow like the display frame counter. With a spec lawyer hat on this isn't a bug since timeouts can always be extended - but users will surely hate you if their neat animations start to stutter due to this. (A sketch of computing an absolute timeout follows after this list.)
- Consider ditching any synchronous wait ioctls with timeouts and just deliver an asynchronous event on a pollable file descriptor. It fits much better into event driven applications' main loop.
- Have testcases for corner-cases, especially whether the return values for already-completed events, successful waits and timed-out waits are all sane and suited to your needs.
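Computing an absolute timeout in userspace is cheap; a sketch, assuming a hypothetical wait ioctl struct with a timeout_ns field:

    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);	/* same reference clock as the kernel */
    wait_args.timeout_ns = (__u64)ts.tv_sec * 1000000000ULL +
    		       ts.tv_nsec + relative_timeout_ns;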
Leaking Resources, Not

A full-blown drm driver essentially implements a little OS, but specialized to the given GPU platforms. Which means a driver needs to expose tons of handles for different objects and other resources to userspace. Doing that right entails its own little set of pitfalls:
- Always attach the lifetime of your dynamically created resources to the lifetime of a file descriptor. Consider using a 1:1 mapping if your resource needs to be shared across processes - fd-passing over unix domain sockets also simplifies lifetime management for userspace.
- Always have O_CLOEXEC support.
- Ensure that you have sufficient insulation between different clients. By default pick a private per-fd namespace which forces any sharing to be done explicitly. Only go with a more global per-device namespace if the objects are truly device-unique. One counterexample in the drm modeset interfaces is that the per-device modeset objects like connectors share a namespace with framebuffer objects, which mostly are not shared at all. A separate namespace, private by default, for framebuffers would have been more suitable.
- Think about uniqueness requirements for userspace handles. E.g. for most drm drivers it's a userspace bug to submit the same object twice in the same command submission ioctl. But then if objects are shareable userspace needs to know whether it has seen an imported object from a different process already or not. I haven't tried this myself yet due to lack of a new class of objects, but consider using inode numbers on your shared file descriptors as unique identifiers - it's how real files are told apart, too. Unfortunately this requires a full-blown virtual filesystem in the kernel.
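In the kernel, tying a resource to a file descriptor (with O_CLOEXEC supported from day one) can be done with the anon inode helpers; a rough sketch with made-up names:

    /* the release hook of foo_obj_fops drops the object once the
     * last reference to the fd is gone */
    fd = anon_inode_getfd("foo-obj", &foo_obj_fops, obj,
    		      O_RDWR | O_CLOEXEC);
    if (fd < 0)
    	return fd;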
Last, but not Least

Not every problem needs a new ioctl:
- Think hard whether you really want a driver-private interface. Of course it's much quicker to push a driver-private interface than to engage in lengthy discussions for a more generic solution. And occasionally doing a private interface to spearhead a new concept is what's required. But in the end, once the generic interface comes around, you'll end up maintaining two interfaces. Indefinitely.
- Consider other interfaces than ioctls. A sysfs attribute is much better for per-device settings, or for child objects with fairly static lifetimes (like output connectors in drm with all the detection override attributes). Or maybe only your testsuite needs this interface, and then debugfs with its disclaimer of not having a stable ABI would be better.
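For the per-device settings case a sysfs attribute is only a handful of lines; a hypothetical read-only example (names made up):

    static ssize_t foo_knob_show(struct device *dev,
    			     struct device_attribute *attr, char *buf)
    {
    	struct foo_device *fdev = dev_get_drvdata(dev);

    	return sprintf(buf, "%u\n", fdev->knob);
    }
    static DEVICE_ATTR_RO(foo_knob);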
| Opened since 2013-11-08  | 2 | 18 | 6 | 3 | (29) |
| Closed since 2013-11-08  | 5 | 19 | 3 | 3 | (30) |
| Changed since 2013-11-08 | 7 | 38 | 6 | 7 | (58) |
bugs bugs bugs.
- softirq lockdep trace during interface bringup. This only affected 32-bit. I hadn’t tested x86-32 for a month or so, so this was a worthwhile use of time.
- Another sysfs lockdep trace.
- perf code using smp_processor_id() in preemptible code.
Made a bunch of small improvements to trinity (mostly fixing up warnings). Coming closer to another point release before merging some interesting stuff.
Found a bunch of bugs with trinity today, after tweaking some code that had caused it to hang when closing bluetooth sockets. (Still not sure I want to commit the workaround I came up with). Now that it’s back in action, the roadkill is piling up.
- an rcu locking bug in sys_getcwd(). Al already had a fix for this queued, and it’s now fixed in Linus’ tree.
- a lockdep trace from sysfs
- another recursive locking lockdep trace in the coredump code.
- an oops in tcp_get_metrics, that Eric Dumazet fixed up pretty quickly (not yet merged).
This specifically excludes testing with humans somewhere in the loop. We are extremely limited in our validation resources; every time we put something new onto the "manual testing" plate something else will fall off.
I've let this float for quite a while both internally in Intel and on the public mailing lists. Thanks to everyone who provided valuable input. Essentially this just codifies the already existing expectations from me as the maintainer, but those requirements haven't really been clear and a lot of emotional discussions ensued. With this we should now have solid guidelines and can go back to coding instead of blowing through so much time and energy on waging flamewars.
Why?

There are a bunch of reasons why having good tests is great:
- More predictability. Right now test coverage often only comes up as a topic when I drop my maintainer review onto a patch series. Which is too late, since it'll delay the otherwise working patches and so massively frustrates people. I hope by making test requirements clear and up-front we can make the upstreaming process more predictable. Also, if we have good tests from the get-go there should be much less need for me to drop patches from my trees after having them merged.
- Less bikeshedding. In my opinion test cases are an excellent means to settle bikesheds - in the past we've seen cases of endless back-and-forths where writing a simple testcase would have shown that all proposed color flavours are actually broken.
The even more important thing is that fully automated tests allow us to legitimately postpone cleanups. If the only testing we have is manual testing then we have only one shot at testing a feature, namely when the developer tests it. So it had better be perfect. But with automated tests we can postpone cleanups with too high risks of regressions until a really clear need is established. And since that need often never materializes we'll save work.
- Better review. For me it often helps a lot more to review tests in-depth than the actual code. This is especially true for reviewing userspace interface additions.
- Actionable regression reports. Only if we have a fully automated testcase do we have a good chance that QA reports a regression within just a few days. Everything else can easily take weeks (for platforms and features which are explicitly tested) to months (for stuff only users from the community notice). And especially now that many more shipping products depend upon a working i915.ko driver we just can't do this any more.
- Better tests. A lot of our code is really hard to test in an automated fashion, and pushing the frontier of what is testable often requires a lot of work. I hope that by making tests an integral part of any feature work and so forcing more people to work on them and think about testing we'll advance the state of the art at a brisker pace.
Risks and Buts

But like with every change, not everything is all glorious and fun:
- Bikeshedding on tests. This plan is obviously not too useful if we just replace massive bikeshedding on patches with massive bikeshedding on testcases. But right now we do almost no review on i-g-t patches so the risk is small. Long-term the review requirements for testcases will certainly increase, but as with everything else we simply need to strive for a good balance that gets just the right amount of review.
Also if we really start discussing tests before having written massive patch series we'll do the bikeshedding while there's no real rebase pain. So even if the bikeshedding just shifts we'll benefit I think, especially for really big features.
- Technical debt in test coverage. We have a lot of old code which still completely lacks testcases. Which means that even small feature work might be on the hook for a big pile of debt restructuring. I think this is inevitable occasionally. But I think that doing an assessment of the current state of test coverage of the existing code before starting a feature instead of when the patches are ready for merging should help a lot, before everyone is invested into patches already and mounting rebase pain looms large.
Again we need to strive for a good balance between "too many tests to write up-front for old code" and "missing tests that only the final review and validation uncovers creating process bubbles".
- Upstreaming of product stuff. Product guys are notoriously busy and writing tests is actual work. On the other hand the upstream codebase feeds back into all product trees (and the upstream kernel), so requirements are simply a bit higher. And I also don't think that we can push the testing of some features fully to product teams, since they'll be pissed really quickly if every update they get from us breaks their stuff. So if these additional test requirements (compared to the past) means that some product patches won't get merged, then I think that's the right choice.
- But ... all the other kernel drivers don't do this. We're also one of the biggest drivers in the kernel, with a code churn rate roughly 5x worse than anything else and a pretty big (and growing) team. Also, we're often the critical path in enabling new platforms in the fast-paced mobile space. Different standards apply.
Test Coverage Expectations

Since the point here is to make the actual test requirements known up-front we need to settle on clear expectations.
- Tests must fully cover userspace interfaces. By this I mean exercising all the possible options, especially the usual tricky corner cases (e.g. off-by-one array sizes, overflows). It also needs to include tests for all the userspace input validation (i.e. correctly rejecting invalid input, including checks for the error codes); a minimal sketch of such a test follows after this list. For userspace interface additions technical debt really must be addressed. This means that when adding a new flag and we currently don't have any tests for those flags, then I'll ask for a testcase which fully exercises all the flag values we currently support on top of the new interface addition.
- Tests need to provide a reasonable baseline coverage of the internal driver state. The idea here isn't to aim for full coverage, that's an impossible and pointless endeavor. The goal is to have a good starting point of tests so that when a tricky corner case pops up in review or validation that it's not a terribly big effort to add a specific testcase for it. This is very much a balance thing to get right and we need a bit of experience to get a good handle here.
- Issues discovered in review and final validation need automated test coverage. This includes any bugs found after a feature has already landed and is even more important for regressions. The reasoning is that anything which slipped the developer's attention is tricky enough to warrant an explicit testcase, since in a later refactoring there's a good chance that it'll be missed again. This has a bit of a risk to delay patches, but if the basic test coverage is good enough as per the previous point it really shouldn't be an issue.
- Finally we need to push the testable frontier with new ideas like pipe CRCs, modeset state cross checking or arbitrary monitor configuration injection (with fixed EDIDs and connector state forcing). The point here is to foster new crazy ideas, and the expectation is very much not that developers then need to write testcases for all the old bugfixes that suddenly became testable. That workload needs to be spread out over a bunch of features touching the relevant area. This only really applies to features and code paths which are currently in the "not testable" bucket anyway.
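To illustrate the input validation expectation, a minimal i-g-t style subtest might look roughly like this (the ioctl and its struct are hypothetical):

    struct foo_dummy_args args;
    int ret;

    /* otherwise perfectly valid input, with exactly one invalid field */
    memset(&args, 0, sizeof(args));
    args.flags = ~0u;	/* no such flags exist, so this must be rejected */

    ret = drmIoctl(fd, DRM_IOCTL_FOO_DUMMY, &args);
    assert(ret == -1 && errno == EINVAL);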
Specific testcases in i-g-t are obviously the preferred form, but for some features that's just not possible. In such cases in-kernel self-checks like the modeset state checker or fifo underrun reporting are really good approaches. Two caveats apply:
- The test infrastructure really should be orthogonal to the code being tested. In-line asserts that check for preconditions are really nice and useful, but since they're closely tied to the code itself they have a good chance to be broken in the same ways.
- The debug feature needs to be enabled by default, and it needs to be loud. Otherwise no one will notice that something is amiss. So currently the fifo underrun reporting doesn't really count since it only causes debug level output when something goes wrong. Of course it's still a really good tool for developers, just not yet for catching regressions.
- Manual testing. We are ridiculously limited on our QA manpower. Every time we drop something onto the "manual testing" plate something else will drop off. Which means in the end that we don't really have any test coverage. So if patches don't come with automated tests, in-kernel cross-checking or some other form of validation attached they need to have really good reasons for doing so.
- Testing by product teams. The entire point of Intel OTC's "upstream first" strategy is to have a common codebase for everyone. If we break product trees every time we feed an update into them because we can't properly regression test a given feature then the value of upstreaming features is greatly diminished. In my opinion this could potentially doom collaborations with product teams. We just can't have that.
This means that when product teams submit patches upstream they also need to submit the relevant testcases as patches to i-g-t.
Process Adjustments

The important piece is really to not start thinking about tests only when everything else is done.
- For big features we should have an upfront discussion about the test coverage and what all should be done (like any coverage gaps for existing code and features to fill, a new crazy test infrastructure idea to implement as a proof of concept or what kinds of tests would provide a reasonable base coverage). For really big features writing a quick test plan and everyone signing off on it could be useful. Especially to be able to learn and improve once everything has landed and the usefulness of the tests is much clearer.
- Tests should be implemented together with the feature or bugfix and should be ready about the same time. Having both pieces at hand should help development, testing and review.
- If we decide that new test infrastructure is required or that there's a large gap in the coverage of existing code then that should be done before the main feature is developed. Otherwise we'll suffer again the pains of rebase hell for no gain.
- Finally developers are not expected to run the full testsuite before submitting patches. The test suite currently simply takes too long to run and we don't have any good centralized infrastructure to speed things up by running tests on multiple machines in parallel. And then proper testing requires a wide array of different platforms anyway, so full regression testing is still squarely a job for our QA. Of course we need to improve our infrastructure and also make it easier to run a useful subset of tests while developing patches.
From the YouTube video:
Size is more than 85 PB; the number of objects was not mentioned (unless I missed it). It processes about 80 million requests each hour (from a half-joking remark). At least they revealed one number, which is welcome.
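(If that half-joking figure is accurate, 80 million requests an hour works out to roughly 22,000 requests per second.)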
Nodes are boxes of 90 drives of 3 TB each, on a 10G network. They use SSDs for the account and container servers (Acct/Cont).
Switched from Pound to HAproxy, using Intel hw SSL termination. Maybe I should retire Pound from Fedora, too?
One thing I noticed is that Swift was a cathedral: driven by real-life requirements, 5 people sat in a little room for 9 months and wrote 10,000 lines of killer code. Only then was it opened up, included into OpenStack, etc. What would ESR say?
Also... The failure rate of hard drives is 10% per year! That includes the one in your laptop.
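(At that rate, a box with 90 drives would need roughly nine drive replacements per year.)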
There's another problem nearby: 1) anything on the CAN bus is considered trusted (how could an attacker get there?) and 2) car radios now have bluetooth/wifi/USB and a CAN connection and are easy to hack (but how could a hacked radio be a problem? And yes, from the automotive summit it is pretty clear this will get worse). I actually thought about buying a bluetooth OBD-2 adapter for the next car; then I realized what a bad idea that is.
I'm quite happy that my car does not have power steering or a CAN bus, and has a mechanical clutch, brakes and ignition key. It _does_ have ABS, so a bad computer could probably kill my brakes. [I could probably kill that computer by turning off the ignition, but I don't think I would do it fast enough in an emergency. There is still vacuum for two or three brake assists with the engine off.]
Unfortunately, my horse is completely fly-by-wire. Unintended acceleration happens basically whenever there are other horses around...
After Linus pulled in a bunch of trees including drivers/staging yesterday, I was expecting the worst this morning after seeing the overnight results.
So I was taken aback somewhat when I saw that after last night’s run, we got 49 new issues, but eliminated 57. A definite move in the right direction, especially after a big merge of 1832 patches.
3.12’s staging merge was a lot uglier due to the addition of lustre, a huge body of code with quite a few potential defects that need reviewing. For 3.13, there’s nothing really of comparable size (at least so far).
Of the 49 new results, most of them are in staging, but there’s a handful in IIO, MIC and USB.
| Opened since 2013-11-01  | 1 | 24 | 10 | 7  | (42) |
| Closed since 2013-11-01  | 2 | 10 | 6  | 4  | (22) |
| Changed since 2013-11-01 | 8 | 45 | 14 | 11 | (78) |
Linus seems to have started merging stuff for 3.13rc again. 1800 or so new changes today, in USB, drivers/staging, sysfs, s390, mips, NFS.
Interested to see what impact staging has on tomorrow’s Coverity statistics. (Today the overall defect density got down to 0.61, the lowest since I’ve been tracking it).
Now that 3.12 has been released, I’ve been looking back over the coverity statistics for the last few months.
A combination of slogging through the backlog of old reports, fixing up occasional bugs, increased interest after speaking about it at kernel summit, and some work on modelling various functions. Barely scratching the surface really, but 3.12 has made a bigger dent in the backlog than previous attempts. (I’ve no pre 3.11 statistics, but given how long many of the reports have been in their database, it seems apparent that there’s been no regular work other than occasional poking).
With Linus being off for a week before the 3.13 merge window opens, hopefully things aren’t going to be too crazy when he gets back.
The platform formerly known as Valleyview and now called Baytrail has seen a lot of improvements. Jesse Barnes has crawled through bugzilla and fixed a lot of little issues all over the place. Chon Ming Lee also provided a few patches to initialize the chip without the VBIOS' help. Jani Nikula has provided the first cut at MIPI DSI support, but that's all still rather preliminary and there's still a bit of work to do.
This release has also seen a lot of changes, most of them preparatory work, for new power management features. Jani implemented SWSCI support, which is a new kind of OpRegion/ACPI based low power management framework for Intel graphics. By itself it's not terribly interesting, but unfortunately a lot of platform power management code in the ACPI reference implementations for Haswell platforms is linked to SWSCI. So now the power gating for e.g. audio codecs is contingent on the gfx being switched off, which obviously doesn't make too much sense. We can blame this on Windows' power management framework being sub-par, but unfortunately we can't fix it. There's also been a lot of work in handling different power domains better (not all of which managed to get into 3.13): Ville Syrjälä implemented VGA power domains and Imre Deak and Paulo Zanoni reworked the power domain handling in general a lot. This is all preparatory work for power saving features on newer platforms, so stay tuned for more in the next kernel cycle. For the curious, most of the patches have already been posted to intel-gfx.
Another power/performance feature is the gpu boost/deboost logic from Chris Wilson. This will help for interactive gpu workloads where the hardware based boost logic is too slow to adapt. On the flip side the kernel is now also much more aggressive at deboosting the gpu when nothing is going on, which should improve power consumption figures for very light, but spikey workloads.
Damien Lespiau provided basic support for 3D/stereo displays on HDMI. Currently there's only a small test application available, but patches to support 3D modes in Wayland are already in the works. Hopefully soon we'll have glxgears spinning in 3D! As a quick aside I've just noticed that in my 3.12 release notes I've forgotten to mention the HDMI 4K support, again from Damien.
Under the hood of the driver we've seen more VMA prep patches for PPGTT from Ben Widawsky. And Ville has again been really busy trying to beat our watermark code into shape for a brave new world where atomic updates of the primary plane and any set of sprites actually work in a reliable way.
Now the one feature that makes me really happy is the display CRC support from Damien and He Shuang. This exposes the hardware checksum features through debugfs. Since these checksums update for each frame displayed and the tap point in the display pipe can be selected to be after all the cursor/plane/sprite blending, color correction and scaling have been applied this will finally allow us to test a lot of the modesetting code in a fully automated way. Ville has already provided a testcase for cursor placement and found a few more bugs while writing it.
And of course there's been countless little bugfixes and improvements to the driver internals: DP fixes from Jani, improved tuning values for Haswell DDI ports from Paulo, the option to compile the driver without CONFIG_FB support in the kernel, small improvements to the hw context support from Ben and Chris and tons of other stuff.
Oh and: We should have a little surprise ready for 3.13, too.
| Opened since 2013-10-01  | 18  | 62  | 26 | 9  | (115) |
| Closed since 2013-10-01  | 61  | 256 | 9  | 5  | (331) |
| Changed since 2013-10-01 | 243 | 135 | 33 | 17 | (428) |
Kernel testing does happen, but no one ever has been or ever will be happy with the state of it, as there will never be consensus on what is a representative set of benchmarks or adequate hardware coverage. Any such exercise is doomed to miss something, but the objective is to catch the bulk of the problems, particularly the obvious ones that crop up over and over again, rather than to be perfect. Developers allegedly do their own testing, although it can be limited by hardware availability. Different organisations do their own internal testing that is sometimes publicly visible, such as SUSE and other distributions reporting through their various developers and their QA departments, and Intel routinely testing within their performance team, with Fengguang Wu's work on continual kernel testing being particularly visible. There are also some people that do public benchmarking, such as Phoronix, regardless of what people think of the data and how it is presented. This coverage is not unified and some workloads will never be adequately simulated by benchmarks, but in combination it means there is a hopeful chance that regressions will be caught. Overall, as a community, I think we still rely on enterprise distributions and their partners doing a complete round of testing on major releases as a backstop. It's not an ideal situation but we are never going to have enough dedicated physical, financial or human resources to do a complete job.
I continue to use MMTests for the bulk of my own testing and its capabilities and coverage continue to grow even though I do not advertise it any more. Over the last two years I have occasionally run tests on latest mainline kernels but it was very sporadic. Basically, if I was going away for a week travelling and thought of it then I would queue up tests for recent mainline kernels. If I had some free time later then I might look through the results. I was never focused on catching regressions before mainline releases and I probably never will be due to the amount of time it consumes, and I'm not testing-orientated per se. More often I would use the data to correlate bug reports with the closest equivalent mmtests and see whether a window could be identified where the regression was introduced and why. This is reactionary, based on bug reports, and to combat this there are times when I am otherwise idle that I would like to preemptively look for regressions. Unfortunately when this happens my test data is rarely up to date, so the regression has to be reverified against the latest kernel. By the time that test completes the free time is gone and the opportunity missed. It would be nice to always have recent data to work with.
SUSE runs Hackweeks during which developers can work on whatever they want and the last one was October 7-11th, 2013. I took the opportunity to write "melbot" (name was a joke) which is meant to do all the grunt automation work that the real Mel should be doing but never has enough time for. There are a lot of components but none of them are particularly complex. It has a number of basic responsibilities.
- Manage remote power and serial consoles
- Reserve, prepare and release machines from supported test grids
- Deploy distribution images
- Install packages
- Build kernel rpms or from source, install and reboot to the new kernel
- Monitor git trees for new kernels it should be testing
- Install and configure mmtests
- Run mmtests, log results, generate (not a very pretty) report
There is a test co-ordinator server and a number of test clients that are part of a grid where both a local grid and a grid within SUSE are supported. To watch git trees, build rpms if necessary, queue jobs and report on jobs there is a Bob The Builder script. Kernel deployment and test execution is the responsibility of melbot. Starting the whole thing going is easy and looks something like
$ screen -S bob-the-builder -d -m /srv/melbot/bob-the-builder/bob-loop.sh
$ screen -S melbot-MachinaA -d -m /srv/melbot/melbot/melbot-loop.sh MachinaA
$ screen -S melbot-MachinaB -d -m /srv/melbot/melbot/melbot-loop.sh MachinaB

Once melbot is running it'll check the job queue, run any necessary tests and record the results. If any problems are encountered that it cannot handle automatically, including tests taking longer than expected, melbot emails me a notification.
None of the source for melbot is released because it is a giant hack and requires undocumented manual installation. I doubt it would be of use anyway as companies with enough resources probably have their own automation already. SUSE internally has automation for hardware management and test deployment that melbot reuses large chunks of. If I can crank out basic server-side automation in a week then a dedicated team can do it and probably a lot better. The key for me is that there is now a web page containing recent mainline kernel comparisons for mmtests. Each directory there corresponds to a similarly named configuration file in the top-level directory configs/ in mmtests. As I write this, the results are not fully up to date yet as Melbot has only been running 12 days on this small test grid and will take another 5-10 days to fully catch up. Once it has caught up, it'll check for recent kernels to test on the 17th of every month and will continually update as long as it is running. As I am paying the electricity bill on this, it might not always be running!
These test machines are new as I lost most of my test grid over the last two months due to hardware failure, and all my previous results were discarded. I have not looked through these results in detail as I'm not long back from kernel summit, but let's look through a few now and see what sort of hilarity might be waiting. The two test machines are ivor and ivy and I'm not going to describe what type of machines they are. FWIW, they are low-end machines with single-disk rotary storage.
Page allocator performance (ivor)
Page allocator performance (ivy)
kernbench is a simple kernel compilation benchmark. Ivor is up to 3.10-stable and is not showing any regressions. 3.10.0 was a mess but it got fixed in -stable and is not generally showing any regressions. Ivy is only up as far as 3.4.66 but it looks like elapsed time was regressing at that point when it did not during 3.4, implying that a regression may have been introduced to -stable there. Worth keeping an eye on to see what more recent kernels look like in a week or so.
aim9 is a microbenchmark that is not very reliable but can be an indicator of greater problems. It's not reliable as it is almost impossible to bisect with and is sensitive to a lot of factors. Regardless, ivor and ivy are both seeing a number of regressions, and the system CPU time is negligible, so something weird is going on. There are small amounts of IO going on in the background, probably from the monitoring, so it could be interference, but it seems too low to affect the type of tests that are running. Interrupts are surprisingly high in recent kernels, no idea why.
vmr-stream catches nasty regressions related to cache coloring. It is showing nothing interesting today.
page allocator microbench shows little of interest. It shows 3.0 sucked in comparison to 3.0-stable but it's known why. 3.10 also sucked but got fixed according to ivor.
pft is a page fault microbench. Ivor shows that 3.10 once again completely sucked and while 3.0.17 was better, it's still regressing slightly and the system CPU is higher so something screwy is going on there. Ivy looks ok currently but it has a long way to go.
So some minor things there -- pft is the greatest concern: pinning down why the system CPU usage is higher and whether it got fixed in a later mainline kernel. If it got fixed later then there may be a case for backporting the fix, but it would depend on who is using 3.10 longterm kernels.
Local network performance (ivor)
Local network performance (ivy)
netperf-udp runs UDP_STREAM from netperf on localhost. Performance is completely mutilated: it went to hell somewhere between 3.2 and 3.4 and has been worse to varying degrees ever since, according to both ivor and ivy.
netperf-tcp tells a different tale. On ivor it regressed between 3.0 and 3.2 but 3.4 was not noticeably worse and it was massively improved in 3.10-stable by something. This possibly indicates that the network people are paying closer attention to TCP than UDP, but it could also indicate that loopback testing of the network stack is not common as it is not usually that interesting. Ivy has not progressed far enough but looked like it saw similar regressions to ivor for a while.
tbench4 generally looks ok, not great, but not bad enough to care, although it is interesting to note that 3.10.17 is more variable than earlier kernels, on ivor at least.
Of that, the netperf-udp performance would be of greatest concern. Given infinite free time, and if that machine were free, it is fortunately trivial to bisect these problems at least. There is a bisection script that uses the same infrastructure to bisect, build, install and test kernels. It just has to be tuned to pick the value that is "bad". If the results for netperf-udp are still bad when the 3.12 tests complete then I'll bisect what happened in 3.4 and report accordingly. I'm guessing that it'll be dropped as loopback is just not the common case.
There are probably a lot of surprises in there and at some point I should spend a day reading through them all, bisect any problems and file bugs as appropriate. I have no idea when I will find the time to do that. There is always the temptation that when I have that free time that I'll extend melbot to find those bugs and bisect them for me if the class of regressions continue to be fairly obvious. By rights I should coordinate with Fengguang Wu to run many of the short-lived tests with his automation and automatically identify basic regressions that way. As always, there is no shortage of solutions, just of time to execute and maintain them.
Finally recovered from jetlag after kernel summit. Trip home was pretty hellish. *Never* spending overnight in an airport again.
Spent the day getting back into the swing of things catching up on mail, updating kernels everywhere etc.
Tomorrow, some real work. I came home with a ton of suggestions from everyone at kernel summit for things they’d like to see trinity try to do, so I’m going to braindump some of those tomorrow, and have a crack at implementing some of them.
Before we celebrate the death of prelink properly (no more insufferable cron jobs), let us toast its excellence, particularly its robustness under very difficult circumstances: when failure cannot be tolerated. A mistake in roto-rooting your libraries means the box fails to boot -- or worse. I want my code to be like Jakub's et al.
Throughout these years I felt free to rpm -e prelink and expected everything to keep working fine, including all the yum upgrades.
I think the main reason prelink died is that it attacked the problem of optimizing bad software: mainly making OpenOffice and Firefox start quicker. Once bad software became good, prelink faded. Its benefits in the age of Python are not as great, and it helps not at all if you do not abuse shared libraries. The lesson here is, you cannot really fix bad software with a thin wrapper of good software.
And now, yay.