Beowulf Bulk Data Server - Disk Test Results
| Goals |
| Problems with testing disks from user space or: Why Erik thought it was really necessary to write a kernel module |
There's a kernel thread called bdflush whose task it is to select buffers to flush to the disk when more free buffers are needed. Depending on how much memory you have, the process of walking the buffer lists looking for victim buffers to flush can become quite expensive. For example, on a 200 MHz PPro system with 128 MB of RAM, the flushing system was expensive enough that the hard disk was not even kept busy with the writes it selected (as evidenced by the hard disk light turning off at times).
Although these problems are most pronounced when writing, reading still suffers from the memory flooding effect (although more slowly and with clean buffers). bdflush still has to come along and toss out those buffers and, as I mentioned, it can take a while to decide which ones to kill.
These huge backlogs can also lead to very inconsistent timings when doing tests from user land with commands like time dd if=/dev/zero of=/dev/hda.
Another item that is not a major concern while doing tests, but will be very important in our final application, is what kind of impact doing this kind of IO will have on other operations in the system. Doing this kind of IO with unrestricted caching leads to a situation where you can't keep your process's pages in memory.
In an effort to decrease the influence of the buffer cache on my tests, I wrote a kernel module to manage the buffers myself and also control when blocks are actually scheduled to be written. (In the normal scheme of things, you depend on bdflush to schedule your blocks to be written.)
The key to solving these problems is controlling your backlog of dirty blocks and, most importantly, telling the buffer cache when it can forget blocks. (See the bforget() function in linux/fs/buffer.c.) You can control your backlog from user space using O_SYNC, but anybody who has used that knows what the performance impact is like.
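To get a feel for the O_SYNC trade-off mentioned above, here is a minimal user-space sketch in Python. It is not the kernel module described here, and it does not reproduce its measurements. The block size, block count, scratch file name and the helper names `timed_write` and `compare` are all assumptions made for illustration. It writes the same small amount of data twice, once through the buffer cache and once with O_SYNC, and reports the throughput of each.

```python
import os
import tempfile
import time

BLOCK_SIZE = 64 * 1024   # hypothetical block size for this sketch
BLOCK_COUNT = 64         # 4 MiB in total, small enough to run quickly


def timed_write(path, extra_flags=0):
    """Write BLOCK_COUNT blocks to path and return throughput in bytes/sec."""
    block = b"\0" * BLOCK_SIZE
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(BLOCK_COUNT):
            written = 0
            # os.write may write fewer bytes than requested, so loop.
            while written < BLOCK_SIZE:
                written += os.write(fd, block[written:])
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    # Guard against a zero interval on very coarse clocks.
    return BLOCK_SIZE * BLOCK_COUNT / max(elapsed, 1e-9)


def compare(directory=None):
    """Return (buffered, synchronous) throughput for a scratch file."""
    # O_SYNC is not defined on every platform; fall back to no extra flag.
    sync_flag = getattr(os, "O_SYNC", 0)
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "scratch.bin")
        buffered = timed_write(path)
        synchronous = timed_write(path, sync_flag)
    return buffered, synchronous


if __name__ == "__main__":
    buffered, synchronous = compare()
    print(f"buffered: {buffered / 1e6:.1f} million bytes/sec")
    print(f"O_SYNC:   {synchronous / 1e6:.1f} million bytes/sec")
```

The buffered figure mostly measures how fast memory can be filled, which is exactly the backlog effect described above. The O_SYNC figure is closer to what the disk can sustain, at the cost of waiting for every write.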
| Testing software |
The timing stops when the requested number of blocks have been read into memory. The tester will take a few moments more to wait for (and clean up) the blocks that were on the request queue but hadn't finished yet.
The test software is available here.
| The PIIX3's "independent" IDE channels aren't really that independent after all. |
Side note: You don't gain anything performance-wise by adding a slave disk on a channel, since only one disk can be used at a time and you can't select the other disk while a request is in progress. The ATA/ATAPI-4 specs talk about an optional "overlapped feature set" which would allow you to select the other device on the bus while one is busy. This should allow a performance gain with a slave disk but, to my knowledge, there's no support for this in the Linux kernel.
| The PIIX3 is prone to fits of nervousness |
hdc: timeout waiting for DMA
hdc: irq timeout: status=0x58 { DriveReady SeekComplete DataRequest }
I'm not entirely sure what was causing that. It turns out that by turning up the PCI Master Latency Timer, you can make this message go away. (MLT was set to 32, I changed it to 64.) This suggests that the PIIX3's timer was running out before it was finished and the chip apparently can't handle this situation gracefully.
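For readers who want to check the latency timer on their own hardware, the following Python sketch shows where the value lives. The latency timer is the byte at offset 0x0D of the standard PCI configuration header. The sysfs path used in `read_latency_timer` is an assumption about a current Linux system (it is not how the value was inspected for these tests), and the device address in the docstring is purely hypothetical.

```python
LATENCY_TIMER_OFFSET = 0x0D  # Latency Timer register in the PCI configuration header


def latency_timer(config):
    """Return the latency timer (in PCI clocks) from raw config-space bytes."""
    if len(config) <= LATENCY_TIMER_OFFSET:
        raise ValueError("configuration space dump is too short")
    return config[LATENCY_TIMER_OFFSET]


def read_latency_timer(device_address):
    """Read the timer for a device such as '0000:00:07.1' (hypothetical address).

    Assumes a Linux system that exposes PCI config space through sysfs.
    """
    path = f"/sys/bus/pci/devices/{device_address}/config"
    with open(path, "rb") as handle:
        return latency_timer(handle.read(64))


if __name__ == "__main__":
    # Synthetic 64-byte header with the timer set to 64, the value used above.
    header = bytearray(64)
    header[LATENCY_TIMER_OFFSET] = 64
    print(latency_timer(bytes(header)))
```

Keeping the parsing separate from the file access makes it possible to try the function on a saved configuration dump without touching real hardware.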
| Device | Interface | IDE Channel | Master/Slave | |
| hda | Intel PIIX3 | 0 | Master | |
| hdb | Intel PIIX3 | 0 | Slave | |
| hdc | Intel PIIX3 | 1 | Master | |
| hdd | Intel PIIX3 | 1 | Slave | |
| hde | Promise Ultra33 | 0 | Master | (*) |
| hdg | Promise Ultra33 | 1 | Master | (*) |
| Quantum Fireball SE 8.4 GB |
| Disks used | Read (MB per sec) | Write (MB per sec) |
| hda | 11.220 | 9.723 |
| hdb | 11.226 | 9.723 |
| hda, hdb | 11.193 | 9.607 |
| hda, hdc | 12.098 | 10.750 |
| hda, hdb, hdc, hdd | 12.098 | 10.799 |
| hde | 11.269 | 11.268 |
| hde, hdg | 22.512 | 22.497 |
| hdc, hde, hdg | 32.884 | 30.540 |
| Maxtor Diamond MAX 88400D8 |
| Disks used | Read (MB per sec) | Write (MB per sec) |
| hda | 11.307 | 7.111 |
| hdb | 11.306 | 7.198 |
| hda, hdb | 11.197 | 7.397 |
| hda, hdc | 11.937 | 10.801 |
| hda, hdb, hdc, hdd | 11.939 | 10.785 |
| hde | 11.292 | 7.443 |
| hde, hdg | 22.599 | 14.656 |
| hdc, hde, hdg | 32.830 | 21.327 |
| IBM Deskstar 8 DHEA-38451 |
| Disks used | Read (MB per sec) | Write (MB per sec) |
| hda | 9.732 | 7.145 |
| hdb | 9.740 | 9.124 |
| hda, hdb | 9.724 | 9.984 |
| hda, hdc | 12.070 | 10.735 |
| hda, hdb, hdc, hdd | 12.071 | 10.787 |
| hde | 9.756 | 9.609 |
| hde, hdg | 19.493 | 19.161 |
| hdc, hde, hdg | 28.902 | 28.129 |
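One way to read these tables is to ask how close a two-channel run comes to twice the single-disk figure. The short Python sketch below does that arithmetic for the read column of the Quantum Fireball SE table. The `scaling_efficiency` helper and the dictionary names are illustrative only, and the single-disk baselines (hda for the PIIX3, hde for the Promise card) are a choice made for this example.

```python
# Read throughput figures copied from the Quantum Fireball SE table.
SINGLE_DISK = {"PIIX3": 11.220, "Promise": 11.269}     # hda, hde
TWO_CHANNELS = {"PIIX3": 12.098, "Promise": 22.512}    # hda+hdc, hde+hdg


def scaling_efficiency(aggregate, single, disks):
    """Fraction of ideal linear scaling achieved by `disks` disks together."""
    return aggregate / (single * disks)


if __name__ == "__main__":
    for controller in ("PIIX3", "Promise"):
        efficiency = scaling_efficiency(
            TWO_CHANNELS[controller], SINGLE_DISK[controller], disks=2
        )
        print(f"{controller}: {efficiency:.1%} of ideal two-disk scaling")
```

With these numbers the two PIIX3 channels together deliver only a little more than half of the ideal figure, while the two Promise channels come within a fraction of a percent of it, which is consistent with the observation that the PIIX3's channels are not really independent.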