When looking at data sheets and benchmarks for hard disks I often see fantastic transfer rates of over 200 MB/s while my real life experience is vastly different.
We run a busy Linux file server providing both NFS and Samba file access, backed by a hardware RAID6 running an ext3 filesystem.
Experience tells me that the worst performance is to be expected when there are competing read and write accesses to the file system. Looking at existing benchmarks, this pattern does not seem to get exercised a lot, and especially not in a way that resembles the expected activity of a file server. The best approximation I have seen is people running competing dd processes, which is not even close to reality.
I have therefore developed a new benchmark to better approximate the activities of a file server. The test concerns itself mainly with reading and writing files in a tree with a file size distribution inspired by the content of a real home directory partition. See the FsOpBench Homepage for details.
The benchmark does not only measure throughput. It measures the time required for each operation it executes and then reports detailed statistics.
Armed with this new benchmark I went ahead to figure out how to make our file server faster.
Over the years, Linux has gained a fair number of knobs one can twiddle to optimize filesystem performance (the commands after this list show how each one is set):
at the lowest level there is the choice of IO scheduler; opinions seem to be split between cfq and deadline.
when working with ext3 there are three ways to handle the journaling: data=journal, data=ordered and data=writeback, the popular wisdom being that data=writeback is the fastest but also the most risky option. The journal settings mainly affect write performance since read access does not touch the journal.
the noatime or relatime mount options, which prevent or limit atime updates on read access.
keeping the journal inside the filesystem, on an external disk, or on an SSD.
whether or not to activate the write barrier.
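To make these knobs concrete, here is a rough sketch of how each one is typically set; the device names and mount points below are placeholders, not our actual setup:

  # choose the IO scheduler per block device (cfq or deadline)
  echo deadline > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/queue/scheduler        # the active scheduler is shown in brackets

  # atime handling, journaling mode and write barriers are ext3 mount options
  mount -o relatime,data=ordered,barrier=1 /dev/sda1 /srv/files

  # move the journal to an external device (with /dev/sda1 unmounted)
  mke2fs -O journal_dev /dev/sdb1           # turn /dev/sdb1 into a journal device
  tune2fs -O ^has_journal /dev/sda1         # drop the internal journal
  tune2fs -J device=/dev/sdb1 /dev/sda1     # attach the external journal

Note that data= is a mount-time option; switching the journaling mode means unmounting and mounting again.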
We are looking at 72 potential combinations of these settings (2 schedulers × 3 journaling modes × 2 atime options × 3 journal locations × 2 barrier settings), ample work for the benchmark program. And that is not even taking into account other file systems like xfs, ext4 or btrfs. To further complicate the situation, there is a lot of activity in kernel development regarding IO performance, so the kernel version is bound to play a role too. In this evaluation I have worked with the Ubuntu server kernel for Hardy (2.6.24) as well as the latest official kernel 2.6.31.2.
The first time the benchmark is run, it sets up a 20 GB file tree. This tree is then used as input for the read operations.
In normal operation, the benchmark works as follows:
Sync and drop all caches (see the commands right after this list)
Fork 1 reader, going through file tree number 0
Fork 3 readers going through file trees 1 to 3 in parallel
Fork 3 writers creating new file trees while the 3 readers from the previous step continue to traverse their respective trees. For this test, the performance measurement starts after the writers have been left running for 30 seconds to fill up caches.
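Flushing and dropping the caches in step 1 boils down to the standard Linux incantation (root required):

  sync                                # push dirty pages out to disk
  echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes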
The benchmark executes many thousand filesystem operations and measures the time required to execute each one. It then builds min, max, average, median and stdev statistics from these numbers.
The tests have been conducted on a dual quad-core Intel Xeon E5520 system with an Areca 1222 RAID controller running a RAID6 configuration on 7 SATA WD 1002FBYS drives. The tests were the only activity on the system.
Before looking at the individual results, there is one general observation: all configurations show an amazingly high standard deviation. Even in the simplest test with a single reader process, there is often a factor of 1000 between the median and the slowest measurement.
1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 31792  min 0.002 ms  max 112.924 ms  avg 0.063 ms  med 0.004 ms  std 1.121
B lstat file    cnt 29652  min 0.008 ms  max 10.614 ms   avg 0.046 ms  med 0.018 ms  std 0.377
C open file     cnt 23228  min 0.015 ms  max 0.137 ms    avg 0.018 ms  med 0.017 ms  std 0.005
D rd 1st byte   cnt 23228  min 0.170 ms  max 102.035 ms  avg 0.595 ms  med 0.270 ms  std 2.582
E read rate     57.546 MB/s (data)   22.278 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 10164  min 0.002 ms  max 93.665 ms   avg 0.129 ms  med 0.004 ms  std 1.566
B lstat file    cnt 9502   min 0.008 ms  max 109.577 ms  avg 0.105 ms  med 0.018 ms  std 1.360
C open file     cnt 7514   min 0.015 ms  max 0.112 ms    avg 0.018 ms  med 0.017 ms  std 0.006
D rd 1st byte   cnt 7514   min 0.175 ms  max 228.477 ms  avg 2.202 ms  med 0.286 ms  std 9.472
E read rate     19.244 MB/s (data)   7.009 MB/s (readdir + open + 1st byte + data)
Running the same test on a single hard disk shows pretty similar results. Only the standard deviation seems to be a tad lower for the RAID setup.
In the graph below I have plotted the values for D (time to read 1st byte).
While the majority of readings stay low, a growing number of outliers burst out on top. Even worse, at the 10 second mark the read process got stuck for more than a second.
Analysis of the data shows the following recipe for optimal performance on ext3 on LVM on the Areca RAID6: use the cfq scheduler together with data=ordered journaling.
For the rest of the settings the benchmark does not render clear indications. I assume the following is good enough: relatime, barrier=0 and keeping the journal inside the filesystem.
The table below shows the results gathered when running this on 2.6.24.
Linux 2.6.24-24-server relatime barrier=0 fs=ext3 disk=areca RAID6 journal=int data=ordered scheduler=cfq
**********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 31375  min 0.002 ms  max 78.354 ms   avg 0.061 ms  med 0.005 ms  std 0.779
B lstat file    cnt 29251  min 0.008 ms  max 28.075 ms   avg 0.050 ms  med 0.020 ms  std 0.460
C open file     cnt 22874  min 0.015 ms  max 0.116 ms    avg 0.018 ms  med 0.017 ms  std 0.005
D rd 1st byte   cnt 22874  min 0.173 ms  max 95.596 ms   avg 0.612 ms  med 0.268 ms  std 2.674
E read rate     57.529 MB/s (data)   21.947 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 10105  min 0.002 ms  max 99.035 ms   avg 0.120 ms  med 0.004 ms  std 1.706
B lstat file    cnt 9448   min 0.008 ms  max 99.366 ms   avg 0.120 ms  med 0.019 ms  std 1.834
C open file     cnt 7479   min 0.015 ms  max 0.115 ms    avg 0.018 ms  med 0.017 ms  std 0.006
D rd 1st byte   cnt 7479   min 0.171 ms  max 136.923 ms  avg 2.157 ms  med 0.284 ms  std 8.937
E read rate     18.230 MB/s (data)   6.934 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 3886   min 0.037 ms  max 41.745 ms   avg 0.078 ms  med 0.041 ms  std 0.872
G wr 1st byte   cnt 3886   min 0.007 ms  max 0.130 ms    avg 0.008 ms  med 0.007 ms  std 0.004
H write close   cnt 3886   min 0.012 ms  max 0.110 ms    avg 0.016 ms  med 0.016 ms  std 0.005
I mkdir         cnt 334    min 0.019 ms  max 1821.824 ms avg 26.962 ms med 0.025 ms  std 162.606
J write rate    522.051 MB/s (data)  218.912 MB/s (open + 1st byte + data + close)
A read dir      cnt 3847   min 0.002 ms  max 89.878 ms   avg 0.187 ms  med 0.004 ms  std 1.995
B lstat file    cnt 3604   min 0.008 ms  max 26.308 ms   avg 0.159 ms  med 0.019 ms  std 1.329
C open file     cnt 2878   min 0.015 ms  max 0.148 ms    avg 0.019 ms  med 0.018 ms  std 0.007
D rd 1st byte   cnt 2878   min 0.179 ms  max 925.522 ms  avg 7.404 ms  med 0.296 ms  std 42.786
E read rate     10.198 MB/s (data)   2.480 MB/s (readdir + open + 1st byte + data)
For 2.6.31.2 the results are a bit better. The number of writes completed rises dramatically, and the large delays and high standard deviation associated with write close seem to have gone.
Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=areca RAID6 journal=ext data=ordered scheduler=cfq
**********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 33964  min 0.001 ms  max 84.159 ms   avg 0.056 ms  med 0.004 ms  std 0.666
B lstat file    cnt 31685  min 0.007 ms  max 19.915 ms   avg 0.051 ms  med 0.022 ms  std 0.402
C open file     cnt 24842  min 0.014 ms  max 0.677 ms    avg 0.021 ms  med 0.017 ms  std 0.011
D rd 1st byte   cnt 24842  min 0.175 ms  max 90.800 ms   avg 0.560 ms  med 0.270 ms  std 1.631
E read rate     64.667 MB/s (data)   24.137 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 11299  min 0.001 ms  max 20.214 ms   avg 0.099 ms  med 0.004 ms  std 0.866
B lstat file    cnt 10563  min 0.007 ms  max 87.001 ms   avg 0.112 ms  med 0.022 ms  std 1.424
C open file     cnt 8357   min 0.014 ms  max 0.430 ms    avg 0.021 ms  med 0.017 ms  std 0.016
D rd 1st byte   cnt 8357   min 0.176 ms  max 216.665 ms  avg 2.171 ms  med 0.290 ms  std 10.698
E read rate     25.330 MB/s (data)   7.764 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 11669  min 0.036 ms  max 25.387 ms   avg 0.072 ms  med 0.042 ms  std 0.590
G wr 1st byte   cnt 11669  min 0.006 ms  max 0.637 ms    avg 0.008 ms  med 0.007 ms  std 0.010
H write close   cnt 11669  min 0.013 ms  max 0.354 ms    avg 0.020 ms  med 0.019 ms  std 0.015
I mkdir         cnt 1096   min 0.021 ms  max 685.149 ms  avg 24.203 ms med 0.029 ms  std 94.218
J write rate    280.469 MB/s (data)  162.076 MB/s (open + 1st byte + data + close)
A read dir      cnt 7364   min 0.001 ms  max 50.615 ms   avg 0.115 ms  med 0.004 ms  std 1.068
B lstat file    cnt 6884   min 0.007 ms  max 51.400 ms   avg 0.118 ms  med 0.023 ms  std 1.068
C open file     cnt 5446   min 0.014 ms  max 1.265 ms    avg 0.021 ms  med 0.017 ms  std 0.022
D rd 1st byte   cnt 5446   min 0.180 ms  max 324.864 ms  avg 3.516 ms  med 0.294 ms  std 19.900
E read rate     16.190 MB/s (data)   4.828 MB/s (readdir + open + 1st byte + data)
Keeping the journal on an external device should improve performance, or so it would seem, since the journaled writes will not get in the way of the reads. In our testing we did not find all that much evidence for this theory. The median numbers are the same regardless. The only substantial improvement is the lower standard deviation for 'H write close'. The lower overall data rate is probably due to the 'I mkdir' calls taking twice as long on average.
Linux 2.6.24-24-server relatime barrier=0 fs=ext3 disk=areca RAID6 journal=ext data=ordered scheduler=cfq
**********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 31227  min 0.002 ms  max 95.552 ms   avg 0.063 ms  med 0.004 ms  std 0.940
B lstat file    cnt 29114  min 0.008 ms  max 92.322 ms   avg 0.054 ms  med 0.021 ms  std 0.842
C open file     cnt 22770  min 0.015 ms  max 0.115 ms    avg 0.018 ms  med 0.017 ms  std 0.005
D rd 1st byte   cnt 22770  min 0.174 ms  max 105.196 ms  avg 0.614 ms  med 0.270 ms  std 2.749
E read rate     57.985 MB/s (data)   21.930 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 10136  min 0.002 ms  max 111.241 ms  avg 0.118 ms  med 0.004 ms  std 1.706
B lstat file    cnt 9479   min 0.008 ms  max 103.033 ms  avg 0.109 ms  med 0.018 ms  std 1.488
C open file     cnt 7506   min 0.015 ms  max 0.129 ms    avg 0.018 ms  med 0.017 ms  std 0.006
D rd 1st byte   cnt 7506   min 0.172 ms  max 133.411 ms  avg 2.190 ms  med 0.288 ms  std 9.303
E read rate     18.601 MB/s (data)   6.941 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 2992   min 0.037 ms  max 23.378 ms   avg 0.059 ms  med 0.041 ms  std 0.534
G wr 1st byte   cnt 2992   min 0.007 ms  max 0.164 ms    avg 0.008 ms  med 0.007 ms  std 0.005
H write close   cnt 2992   min 0.013 ms  max 205.499 ms  avg 0.088 ms  med 0.017 ms  std 3.757
I mkdir         cnt 303    min 0.020 ms  max 1241.212 ms avg 18.730 ms med 0.025 ms  std 118.599
J write rate    4.203 MB/s (data)    4.129 MB/s (open + 1st byte + data + close)
A read dir      cnt 5514   min 0.002 ms  max 98.726 ms   avg 0.189 ms  med 0.004 ms  std 2.123
B lstat file    cnt 5157   min 0.008 ms  max 62.170 ms   avg 0.129 ms  med 0.019 ms  std 1.326
C open file     cnt 4089   min 0.015 ms  max 0.126 ms    avg 0.018 ms  med 0.017 ms  std 0.007
D rd 1st byte   cnt 4089   min 0.174 ms  max 651.012 ms  avg 4.577 ms  med 0.296 ms  std 21.041
E read rate     9.454 MB/s (data)    3.342 MB/s (readdir + open + 1st byte + data)
And the same for 2.6.31 (again faster writes):
Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=areca RAID6 journal=/dev/journal/scratch_a data=ordered scheduler=cfq
**********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 33964  min 0.001 ms  max 84.159 ms   avg 0.056 ms  med 0.004 ms  std 0.666
B lstat file    cnt 31685  min 0.007 ms  max 19.915 ms   avg 0.051 ms  med 0.022 ms  std 0.402
C open file     cnt 24842  min 0.014 ms  max 0.677 ms    avg 0.021 ms  med 0.017 ms  std 0.011
D rd 1st byte   cnt 24842  min 0.175 ms  max 90.800 ms   avg 0.560 ms  med 0.270 ms  std 1.631
E read rate     64.667 MB/s (data)   24.137 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 11299  min 0.001 ms  max 20.214 ms   avg 0.099 ms  med 0.004 ms  std 0.866
B lstat file    cnt 10563  min 0.007 ms  max 87.001 ms   avg 0.112 ms  med 0.022 ms  std 1.424
C open file     cnt 8357   min 0.014 ms  max 0.430 ms    avg 0.021 ms  med 0.017 ms  std 0.016
D rd 1st byte   cnt 8357   min 0.176 ms  max 216.665 ms  avg 2.171 ms  med 0.290 ms  std 10.698
E read rate     25.330 MB/s (data)   7.764 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 11669  min 0.036 ms  max 25.387 ms   avg 0.072 ms  med 0.042 ms  std 0.590
G wr 1st byte   cnt 11669  min 0.006 ms  max 0.637 ms    avg 0.008 ms  med 0.007 ms  std 0.010
H write close   cnt 11669  min 0.013 ms  max 0.354 ms    avg 0.020 ms  med 0.019 ms  std 0.015
I mkdir         cnt 1096   min 0.021 ms  max 685.149 ms  avg 24.203 ms med 0.029 ms  std 94.218
J write rate    280.469 MB/s (data)  162.076 MB/s (open + 1st byte + data + close)
A read dir      cnt 7364   min 0.001 ms  max 50.615 ms   avg 0.115 ms  med 0.004 ms  std 1.068
B lstat file    cnt 6884   min 0.007 ms  max 51.400 ms   avg 0.118 ms  med 0.023 ms  std 1.068
C open file     cnt 5446   min 0.014 ms  max 1.265 ms    avg 0.021 ms  med 0.017 ms  std 0.022
D rd 1st byte   cnt 5446   min 0.180 ms  max 324.864 ms  avg 3.516 ms  med 0.294 ms  std 19.900
E read rate     16.190 MB/s (data)   4.828 MB/s (readdir + open + 1st byte + data)
Whether the external journal was kept on an SSD or on a physical hard drive made no notable difference in this setup either. Keeping multiple journals on a single SSD is bound to be more efficient due to the lower seek time, but this was not tested.
The setup that works well on the RAID performs really badly on a single hard drive, for both 2.6.24 and 2.6.31:
The chart below might be a bit misleading since the results F to J for the writer are actually quite good, except that it managed to do only a single 'round' of writing in the 30 seconds allocated. The reason for this is that the benchmark sends a signal to the writer process to start measuring. After 30 seconds it gets a second signal to print the statistics. If the writer is blocked while receiving a signal, it will only act on it once the blockage is over. In this case it got both the start and the stop signal while blocked. Eventually it got one round of writing done, and this is what we can see in the table below.
Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=hdd journal=int data=ordered scheduler=cfq
**********************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 31834  min 0.001 ms  max 10.841 ms   avg 0.087 ms  med 0.004 ms  std 0.692
B lstat file    cnt 29737  min 0.007 ms  max 16.426 ms   avg 0.066 ms  med 0.023 ms  std 0.460
C open file     cnt 23450  min 0.014 ms  max 0.238 ms    avg 0.022 ms  med 0.021 ms  std 0.011
D rd 1st byte   cnt 23450  min 0.173 ms  max 32.441 ms   avg 0.635 ms  med 0.277 ms  std 1.448
E read rate     79.409 MB/s (data)   23.406 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 3111   min 0.001 ms  max 44.984 ms   avg 0.486 ms  med 0.004 ms  std 3.079
B lstat file    cnt 2917   min 0.007 ms  max 45.703 ms   avg 0.625 ms  med 0.024 ms  std 3.489
C open file     cnt 2333   min 0.014 ms  max 0.145 ms    avg 0.021 ms  med 0.018 ms  std 0.011
D rd 1st byte   cnt 2333   min 0.178 ms  max 124.612 ms  avg 8.013 ms  med 0.414 ms  std 13.877
E read rate     7.819 MB/s (data)    2.153 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 57     min 0.039 ms  max 0.099 ms    avg 0.047 ms  med 0.043 ms  std 0.013
G wr 1st byte   cnt 57     min 0.006 ms  max 0.014 ms    avg 0.007 ms  med 0.006 ms  std 0.001
H write close   cnt 57     min 0.014 ms  max 0.077 ms    avg 0.021 ms  med 0.020 ms  std 0.009
I mkdir         cnt 1      min 0.065 ms  max 0.065 ms    avg 0.065 ms  med 0.065 ms  std 0.000
J write rate    370.348 MB/s (data)  204.618 MB/s (open + 1st byte + data + close)
A read dir      cnt 12795  min 0.001 ms  max 52.534 ms   avg 0.149 ms  med 0.004 ms  std 1.309
B lstat file    cnt 11941  min 0.007 ms  max 30.476 ms   avg 0.129 ms  med 0.023 ms  std 1.091
C open file     cnt 9382   min 0.014 ms  max 0.233 ms    avg 0.021 ms  med 0.017 ms  std 0.012
D rd 1st byte   cnt 9382   min 0.177 ms  max 6524.964 ms avg 3.705 ms  med 0.297 ms  std 69.637
E read rate     15.355 MB/s (data)   4.385 MB/s (readdir + open + 1st byte + data)
When choosing the deadline scheduler instead, the fortunes get reversed. Now the writers starve the readers:
Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=hdd journal=int data=ordered scheduler=deadline
***************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 30388  min 0.001 ms  max 85.786 ms   avg 0.088 ms  med 0.004 ms  std 0.839
B lstat file    cnt 28377  min 0.007 ms  max 36.928 ms   avg 0.067 ms  med 0.023 ms  std 0.521
C open file     cnt 22343  min 0.014 ms  max 0.248 ms    avg 0.022 ms  med 0.022 ms  std 0.011
D rd 1st byte   cnt 22343  min 0.173 ms  max 97.215 ms   avg 0.646 ms  med 0.278 ms  std 1.995
E read rate     69.505 MB/s (data)   22.314 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 1703   min 0.001 ms  max 44.963 ms   avg 0.926 ms  med 0.004 ms  std 4.559
B lstat file    cnt 1597   min 0.007 ms  max 64.839 ms   avg 0.861 ms  med 0.025 ms  std 4.284
C open file     cnt 1277   min 0.014 ms  max 0.107 ms    avg 0.021 ms  med 0.018 ms  std 0.010
D rd 1st byte   cnt 1277   min 0.182 ms  max 150.754 ms  avg 14.381 ms med 14.457 ms std 13.847
E read rate     3.977 MB/s (data)    1.191 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 8210   min 0.036 ms  max 1385.255 ms avg 2.361 ms  med 0.041 ms  std 33.087
G wr 1st byte   cnt 8210   min 0.006 ms  max 0.121 ms    avg 0.007 ms  med 0.006 ms  std 0.003
H write close   cnt 8210   min 0.013 ms  max 128.845 ms  avg 0.035 ms  med 0.019 ms  std 1.422
I mkdir         cnt 758    min 0.019 ms  max 1481.164 ms avg 11.147 ms med 0.038 ms  std 88.370
J write rate    173.795 MB/s (data)  14.317 MB/s (open + 1st byte + data + close)
A read dir      cnt 130    min 0.001 ms  max 454.968 ms  avg 14.884 ms med 0.004 ms  std 66.037
B lstat file    cnt 120    min 0.007 ms  max 300.857 ms  avg 12.510 ms med 0.027 ms  std 54.107
C open file     cnt 92     min 0.015 ms  max 0.050 ms    avg 0.022 ms  med 0.021 ms  std 0.006
D rd 1st byte   cnt 92     min 0.541 ms  max 524.495 ms  avg 186.560 ms med 217.043 ms std 133.519
E read rate     0.264 MB/s (data)    0.088 MB/s (readdir + open + 1st byte + data)
The only way to get decent performance is to sacrifice some data integrity guarantees by switching to data=writeback journaling.
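For a single-disk data partition this amounts to a mount entry along the following lines (device and mount point are placeholders; since data= is a mount-time option, the filesystem has to be unmounted and mounted again after changing it):

  # example /etc/fstab entry for a data partition on a single disk
  /dev/sdb1   /srv/files   ext3   relatime,data=writeback,barrier=0   0   2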
Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=hdd journal=int data=writeback scheduler=cfq
************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 30861  min 0.001 ms  max 65.866 ms   avg 0.090 ms  med 0.004 ms  std 0.794
B lstat file    cnt 28824  min 0.007 ms  max 74.237 ms   avg 0.067 ms  med 0.022 ms  std 0.625
C open file     cnt 22713  min 0.014 ms  max 0.313 ms    avg 0.022 ms  med 0.020 ms  std 0.012
D rd 1st byte   cnt 22713  min 0.170 ms  max 103.009 ms  avg 0.660 ms  med 0.281 ms  std 2.058
E read rate     77.399 MB/s (data)   22.723 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 3355   min 0.001 ms  max 106.079 ms  avg 0.519 ms  med 0.004 ms  std 3.581
B lstat file    cnt 3139   min 0.007 ms  max 76.925 ms   avg 0.487 ms  med 0.024 ms  std 3.384
C open file     cnt 2487   min 0.014 ms  max 0.270 ms    avg 0.021 ms  med 0.018 ms  std 0.012
D rd 1st byte   cnt 2487   min 0.183 ms  max 128.951 ms  avg 7.814 ms  med 0.367 ms  std 14.997
E read rate     9.094 MB/s (data)    2.244 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 10780  min 0.036 ms  max 473.607 ms  avg 0.109 ms  med 0.041 ms  std 4.610
G wr 1st byte   cnt 10780  min 0.006 ms  max 17.067 ms   avg 0.008 ms  med 0.006 ms  std 0.164
H write close   cnt 10780  min 0.012 ms  max 0.428 ms    avg 0.018 ms  med 0.018 ms  std 0.008
I mkdir         cnt 1004   min 0.018 ms  max 19592.233 ms avg 29.447 ms med 0.026 ms std 644.768
J write rate    240.950 MB/s (data)  129.961 MB/s (open + 1st byte + data + close)
A read dir      cnt 9256   min 0.001 ms  max 32.343 ms   avg 0.146 ms  med 0.004 ms  std 1.222
B lstat file    cnt 8638   min 0.007 ms  max 45.528 ms   avg 0.138 ms  med 0.023 ms  std 1.309
C open file     cnt 6783   min 0.014 ms  max 0.266 ms    avg 0.021 ms  med 0.017 ms  std 0.013
D rd 1st byte   cnt 6783   min 0.178 ms  max 312.909 ms  avg 2.819 ms  med 0.297 ms  std 16.400
E read rate     14.757 MB/s (data)   5.163 MB/s (readdir + open + 1st byte + data)
Note the HUGE max delay for mkdir in this example: a hang of over 19 seconds to get a single mkdir through. But at least the overall data throughput seems to be OK.
The deadline scheduler did not do well in any of the scenarios. While multiple competing readers are handled gracefully, they suffer a major performance impact as soon as the writers start. The sample below is one of the 'faster' variants. The behaviour is the same across the board, for RAID as well as for HDD. The situation does not change in 2.6.31 either.
Linux 2.6.24-24-server atime barrier=0 fs=ext3 disk=areca RAID6 journal=ext data=ordered scheduler=deadline
***********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 31695  min 0.002 ms  max 74.903 ms   avg 0.058 ms  med 0.005 ms  std 0.667
B lstat file    cnt 29559  min 0.008 ms  max 88.191 ms   avg 0.057 ms  med 0.022 ms  std 0.837
C open file     cnt 23155  min 0.015 ms  max 0.161 ms    avg 0.018 ms  med 0.017 ms  std 0.005
D rd 1st byte   cnt 23155  min 0.171 ms  max 100.230 ms  avg 0.581 ms  med 0.267 ms  std 2.224
E read rate     56.348 MB/s (data)   22.497 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 13642  min 0.002 ms  max 80.890 ms   avg 0.114 ms  med 0.004 ms  std 1.327
B lstat file    cnt 12763  min 0.008 ms  max 57.040 ms   avg 0.100 ms  med 0.019 ms  std 0.982
C open file     cnt 10131  min 0.015 ms  max 0.147 ms    avg 0.019 ms  med 0.017 ms  std 0.009
D rd 1st byte   cnt 10131  min 0.172 ms  max 118.771 ms  avg 1.498 ms  med 0.298 ms  std 4.785
E read rate     24.196 MB/s (data)   9.419 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 7547   min 0.037 ms  max 2388.053 ms avg 0.920 ms  med 0.041 ms  std 38.511
G wr 1st byte   cnt 7547   min 0.007 ms  max 7.754 ms    avg 0.010 ms  med 0.007 ms  std 0.103
H write close   cnt 7547   min 0.012 ms  max 1.143 ms    avg 0.018 ms  med 0.016 ms  std 0.028
I mkdir         cnt 706    min 0.018 ms  max 5190.903 ms avg 33.380 ms med 0.025 ms  std 349.304
J write rate    434.229 MB/s (data)  37.918 MB/s (open + 1st byte + data + close)
A read dir      cnt 1019   min 0.002 ms  max 125.373 ms  avg 0.261 ms  med 0.004 ms  std 4.132
B lstat file    cnt 947    min 0.008 ms  max 24.082 ms   avg 0.130 ms  med 0.019 ms  std 1.196
C open file     cnt 738    min 0.015 ms  max 0.133 ms    avg 0.020 ms  med 0.018 ms  std 0.008
D rd 1st byte   cnt 738    min 0.175 ms  max 3676.070 ms avg 27.504 ms med 0.303 ms  std 241.728
E read rate     1.733 MB/s (data)    0.612 MB/s (readdir + open + 1st byte + data)
While cfq is the default scheduler on most desktop setups, deadline is still popular on servers.
With btrfs being included in 2.6.31 and hailed as the solution to all our troubles, I gave it a whirl too and found the results to be quite amazing. As long as there is no read/write competition, the new filesystem puts everyone to shame.
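For anyone who wants to try this at home, creating and mounting a btrfs test volume takes one command each; the device path is a placeholder, and nobarrier matches the setting shown in the result header below:

  mkfs.btrfs /dev/mapper/vg-scratch                          # destroys any data on the device
  mount -o nobarrier /dev/mapper/vg-scratch /mnt/btrfs-test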
Linux 2.6.31.2-test nobarrier fs=btrfs disk=areca RAID scheduler=cfq
********************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 87133  min 0.001 ms  max 19.047 ms   avg 0.020 ms  med 0.003 ms  std 0.349
B lstat file    cnt 81349  min 0.006 ms  max 32.409 ms   avg 0.034 ms  med 0.023 ms  std 0.262
C open file     cnt 63997  min 0.013 ms  max 0.128 ms    avg 0.016 ms  med 0.016 ms  std 0.003
D rd 1st byte   cnt 63997  min 0.013 ms  max 24.931 ms   avg 0.175 ms  med 0.121 ms  std 0.642
E read rate     181.640 MB/s (data)  71.359 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 52157  min 0.001 ms  max 61.196 ms   avg 0.051 ms  med 0.003 ms  std 0.757
B lstat file    cnt 48669  min 0.006 ms  max 22.695 ms   avg 0.062 ms  med 0.026 ms  std 0.504
C open file     cnt 38205  min 0.013 ms  max 0.130 ms    avg 0.017 ms  med 0.016 ms  std 0.004
D rd 1st byte   cnt 38205  min 0.014 ms  max 41.009 ms   avg 0.349 ms  med 0.144 ms  std 1.313
E read rate     132.131 MB/s (data)  40.885 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 20     min 0.065 ms  max 35.891 ms   avg 1.951 ms  med 0.126 ms  std 7.789
G wr 1st byte   cnt 20     min 0.006 ms  max 0.018 ms    avg 0.009 ms  med 0.008 ms  std 0.003
H write close   cnt 20     min 0.018 ms  max 1848.102 ms avg 168.527 ms med 2.059 ms std 408.952
I mkdir         cnt 6      min 0.035 ms  max 0.085 ms    avg 0.051 ms  med 0.048 ms  std 0.018
J write rate    0.036 MB/s (data)    0.028 MB/s (open + 1st byte + data + close)
A read dir      cnt 3774   min 0.001 ms  max 157.739 ms  avg 0.134 ms  med 0.003 ms  std 3.215
B lstat file    cnt 3536   min 0.007 ms  max 752.500 ms  avg 2.737 ms  med 0.029 ms  std 27.926
C open file     cnt 2821   min 0.015 ms  max 0.119 ms    avg 0.018 ms  med 0.017 ms  std 0.005
D rd 1st byte   cnt 2821   min 0.021 ms  max 2406.329 ms avg 7.716 ms  med 0.149 ms  std 85.023
E read rate     9.633 MB/s (data)    2.480 MB/s (readdir + open + 1st byte + data)
I also put ext4 through its paces... Its overall behaviour seems to be the same as with ext3, and the same settings render the best performance. Overall, the single reader scenario seems to suffer a performance drop of 20% to 30%, while the three reader scenario gains about 30%. The large maximum latencies have become bigger, if anything.
Linux 2.6.31.2-test relatime barrier=0 fs=ext4 disk=areca RAID6 journal=ext data=ordered scheduler=cfq
*******************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 23537  min 0.001 ms  max 88.099 ms   avg 0.064 ms  med 0.004 ms  std 1.019
B lstat file    cnt 21968  min 0.007 ms  max 97.830 ms   avg 0.032 ms  med 0.025 ms  std 0.685
C open file     cnt 17263  min 0.014 ms  max 0.174 ms    avg 0.020 ms  med 0.017 ms  std 0.010
D rd 1st byte   cnt 17263  min 0.178 ms  max 96.332 ms   avg 0.877 ms  med 0.280 ms  std 3.350
E read rate     41.393 MB/s (data)   15.868 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir      cnt 15622  min 0.001 ms  max 108.698 ms  avg 0.095 ms  med 0.003 ms  std 1.361
B lstat file    cnt 14572  min 0.007 ms  max 26.369 ms   avg 0.033 ms  med 0.023 ms  std 0.354
C open file     cnt 11421  min 0.015 ms  max 0.266 ms    avg 0.020 ms  med 0.017 ms  std 0.013
D rd 1st byte   cnt 11421  min 0.178 ms  max 138.505 ms  avg 1.312 ms  med 0.295 ms  std 4.754
E read rate     24.273 MB/s (data)   10.016 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open    cnt 5364   min 0.041 ms  max 1604.204 ms avg 0.405 ms  med 0.047 ms  std 21.980
G wr 1st byte   cnt 5364   min 0.006 ms  max 0.419 ms    avg 0.007 ms  med 0.007 ms  std 0.008
H write close   cnt 5364   min 0.011 ms  max 2128.445 ms avg 1.329 ms  med 0.017 ms  std 39.354
I mkdir         cnt 490    min 0.026 ms  max 1039.008 ms avg 2.528 ms  med 0.033 ms  std 47.242
J write rate    8.110 MB/s (data)    5.698 MB/s (open + 1st byte + data + close)
A read dir      cnt 5916   min 0.001 ms  max 114.718 ms  avg 0.243 ms  med 0.004 ms  std 3.678
B lstat file    cnt 5531   min 0.008 ms  max 99.922 ms   avg 0.094 ms  med 0.025 ms  std 1.941
C open file     cnt 4382   min 0.015 ms  max 1.918 ms    avg 0.022 ms  med 0.018 ms  std 0.034
D rd 1st byte   cnt 4382   min 0.179 ms  max 632.008 ms  avg 3.857 ms  med 0.332 ms  std 20.912
E read rate     8.363 MB/s (data)    3.542 MB/s (readdir + open + 1st byte + data)
Have a look at the results in detail:
setups with data=ordered and the cfq scheduler provide the best performance balance for RAID6 on the Areca controller.
normal hard disk systems only provide decent performance under read/write competition when running in data=writeback mode with cfq.
the major problem of the whole setup is the high number of outliers and the unpredictable service time of the IO calls. The standard deviation is way too high, and maximum wait times of several seconds are unacceptable.
the upgrade from 2.6.24 to 2.6.31.2 brings some gains in throughput, maybe a factor of two or so. The random delays seem to stay the same though.
the deadline scheduler did not have advantages in any of our test cases.
btrfs promises major performance gains but competing readers and writers are not handled well either.
fsopbench allows a quick assessment of a system's ability to satisfy competing read and write requests.
Olten, October 9th, Tobi Oetiker