Running a Linux server on a hardware RAID6 / LVM setup, we are plagued by the fact that heavy activity on one file system impacts performance on all of them. If there is an active writer on one file system (especially one doing metadata updates), all other file systems face extreme performance degradation. Read performance in particular falls right through the floor, and response times become large and highly fluctuating.
The problem seems to exist even on simple single-disk systems, as explained in Ubuntu bug 131094.
We have tried all sorts of things, like the
noatime,data=journal
mount options, various I/O schedulers and /proc/sys/vm parameters, unfortunately with only limited success.
With the arrival of solid state flash disks in the consumer market, a new opportunity presented itself: keeping the ext3 journal on a fast external device. Since SSDs have minimal seek time, we expected them to be the ideal medium for keeping a journal.
We went for the new OCZSSD2-1S32G (32GB SATA2 from OCZ) since it got some good reviews for its write speed, especially when compared to the offerings of Samsung. Interestingly enough the OCZ disk identified itself as a 'SAMSUNG MCBQE32G5MPP-0VA' to the Linux kernel. Oh well.
So tonight, after I had connected that new disk to a spare SATA port I was ready to go.
I booted the box into single user mode and unmounted all file systems
umount -a
Then I partitioned the SSD (make sure that you actually pick the SSD and not your live disk since the disk numbering may have changed since you added the additional device). I used the disk/by-id devices just to be sure:
cfdisk /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
An ext3 journal has a maximum size of 400 MB (with 4k blocks), and an external journal always takes up a whole partition, so you need one small partition per file system. If you can, use LVM to do that, since otherwise you will hit the SCSI limit of 15 partitions pretty quickly. With LVM you would do:
pvcreate /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
vgcreate journal /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396
lvcreate -L 400M -n my-dev journal
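To see where the 400 MB figure comes from, here is a small sketch of the arithmetic. It assumes the limit is 102,400 file system blocks per journal, which is what 400 MB at 4k blocks works out to. The function name is made up for illustration.

```shell
# Assumption: a journal may hold at most 102400 file system blocks,
# which matches the 400 MB quoted above for 4k blocks.
MAX_JOURNAL_BLOCKS=102400

# Print the largest journal size in MB for a block size given in bytes.
journal_max_mb() {
    block_size=$1
    echo $(( MAX_JOURNAL_BLOCKS * block_size / 1024 / 1024 ))
}

journal_max_mb 4096   # prints 400
journal_max_mb 1024   # prints 100
```

The block size of an existing file system shows up in the "Block size" line of tune2fs -l, so it is worth checking before you settle on the size of the journal volumes.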
Once the partitions are created, they have to be formatted for journal duty. I added a label to the journal so that I could find the partition more easily later.
mke2fs -O journal_dev -L j-my-dev /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396-part1
or if you used lvm
mke2fs -O journal_dev -L j-my-dev /dev/journal/my-dev
Now drop the current journal from the cleanly unmounted file system. This assumes that you use LVM to manage your partitions and that the volume group for the partitions is called "local"
tune2fs -O ^has_journal /dev/local/my_dev
and add the journal device. While adding the journal device, we also switch to journal_data mode. This is important, as it will make all meta-data and all data go to our fast journal first without any disk dependency. I also use the label assigned above.
tune2fs -o journal_data -j -J device=LABEL=j-my-dev /dev/local/my_dev
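A quick way to check that the switch to journal_data took effect is to look at the superblock listing. The helper below is only a sketch: it assumes that tune2fs -l prints a "Default mount options:" line, and the function name is hypothetical.

```shell
# Read "tune2fs -l" output on stdin and succeed only if journal_data is
# among the default mount options. The -w flag keeps journal_data_ordered
# and journal_data_writeback from matching.
has_journal_data() {
    grep '^Default mount options:' | grep -qw 'journal_data'
}

# Hypothetical usage with the volume from the example above:
# tune2fs -l /dev/local/my_dev | has_journal_data && echo "journal_data is set"
```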
After the SSD journal was attached to all the file systems except for the root file system, I ran a
mount -a
just to make sure they were all ok and then went for a reboot. A few minutes later the system was back up and running fine.
If you have to do this for many partitions, I would strongly advise using a script for the transition.
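Such a script could look roughly like the sketch below. It is not the script I used; it just strings together the commands shown above. It assumes that the file systems live in the volume group "local", that the journals go into the volume group "journal", and that each journal volume gets the same name as the file system volume it serves. The function names and the DRY_RUN switch are made up for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch: move the ext3 journal of several LVM volumes to the SSD.
# Assumptions: FS_VG holds the file systems, J_VG holds the journals,
# and journal and file system volumes share the same name.
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands
FS_VG=${FS_VG:-local}
J_VG=${J_VG:-journal}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create, format and attach an external journal for one volume.
move_journal() {
    name=$1
    label="j-$name"
    # ext2/ext3 volume labels are limited to 16 characters.
    if [ "${#label}" -gt 16 ]; then
        echo "label $label is longer than 16 characters" >&2
        return 1
    fi
    run lvcreate -L 400M -n "$name" "$J_VG" || return 1
    run mke2fs -O journal_dev -L "$label" "/dev/$J_VG/$name" || return 1
    run tune2fs -O '^has_journal' "/dev/$FS_VG/$name" || return 1
    run tune2fs -o journal_data -j -J "device=LABEL=$label" "/dev/$FS_VG/$name" || return 1
}

# Pass the volume names of the cleanly unmounted file systems as arguments.
for name in "$@"; do
    move_journal "$name" || { echo "failed on $name" >&2; exit 1; }
done
```

Run it once with the default DRY_RUN=1 and read the output carefully, then set DRY_RUN=0 for the real thing. Note that the journal device has to be formatted with the same block size as the file system that will use it.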
After running the setup for a few days, I draw the following conclusions:
The general slowness of all file access caused by a single heavy writer is reduced so much that it no longer interferes with daily work.
The hardlink backup (using rsync to keep a copy of the files, with hardlinks to those that have not changed) is about twice as fast.
The tape based backup (bacula, running at the same time as the hardlink backup) is about twice as fast as well.
In other words, having an external journal with a HW RAID setup is a MUST.
Using a single SSD to store the journal may raise reliability concerns, since we are introducing a single point of failure into the system. The chance of the single SSD going up in smoke is probably quite a bit higher than that of the RAID6 developing such a problem, because individual failed disks in the array can be replaced easily.
I have asked on the ext3-users mailing list what would happen if one lost the journal disk in such a context. My interpretation of Theodore "ext3" Ts'o's reply is the following:
In most cases when something goes wrong the journal will get disabled automatically.
The worst, "highly unlikely" case is "losing a full inode table block's worth of inodes". In general the loss should be limited to the last few minutes' worth of data.
Use SMART to monitor the health status of the SSD, since it will know when it starts running out of replacement blocks before it actually dies.
The discussion on the ext3-users list prompted Ted to re-check the code and find some issues which he will create patches for, so watch the kernel log!
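The SMART check mentioned above can be automated. The following is a minimal sketch, assuming smartmontools is installed and that smartctl -H prints its usual "self-assessment test result" line for this drive. The function name and the cron usage are hypothetical.

```shell
# Read "smartctl -H" output on stdin and succeed only if the drive
# reports an overall health status of PASSED.
smart_ok() {
    grep -q 'self-assessment test result: PASSED'
}

# Hypothetical usage from cron, with the device path used earlier:
# smartctl -H /dev/disk/by-id/scsi-SATA_SAMSUNG_MCBQE32SY816A2396 | smart_ok \
#     || echo "journal SSD reports SMART trouble" | mail -s "SSD warning" root
```

The overall verdict is a coarse signal. The attribute table from smartctl -A is where a shrinking pool of replacement blocks would show up first, but which attribute to watch depends on the drive.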
And from earlier conversations I draw:
So for my part, I am confident that the added risk is worth the performance we gain, but decide for yourself!