Home NAS with BTRFS

I just finished building a Network Attached Storage server for home use.  Backups, mostly, but also kind of as a media hub.  Here’s a run-down of the setup, with an exploration of some of the pitfalls of RAID and of the cool features of modern file systems.

Firstly, a couple of pics:

Shown above is the new, small NAS (“faithless”) and the old one (“backfire”).  The old one had a few disadvantages: it was huge, noisy and power-hungry.  For storage, I had 3x250GB SATA drives, each set up with EXT4.  Two of the drives were removable (one is shown partly removed in the photo).  In theory, one of the drives would be kept off-site.  I would rsync from client machines (mostly my laptop) to the fixed internal drive, and then rsync from there to the removable drive in the chassis.  Then, I would swap the removable drive with the off-site drive.

It’s a great theory, but in practice I didn’t back up to the file server as often as I should have, and I practically never swapped the off-site drive.  Finally, I was worried (maybe unreasonably) about latent sector errors with EXT4, given its limited checksumming.

A Word About Latent Sector Error

So, the SATA spec tells us that we should expect an error every 10¹³ bits.  10¹³ works out to about 2⁴³ bits, or 2⁴⁰ bytes.  That’s 1TB¹.  It’s a little worrying to me that, if I fill up my 1TB drive, there’s a good chance of at least one error lurking somewhere.

EXT4 checksums its journal, which AFAIK will protect against errors caused by sync failures (i.e. power failure during disk I/O).  But it’s not going to protect against latent sector errors.  To do that, you need checksumming on all of the file data, along the lines of what ZFS or BTRFS provides.
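
ZFS, for example, can verify every one of those checksums on demand.  A minimal sketch, assuming a hypothetical pool named ‘tank’:

    # re-read every block in the pool and verify it against its checksum
    zpool scrub tank

    # report any checksum errors found (and repaired, where redundancy allows)
    zpool status -v tank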

Oh, ZFS, We Hardly Knew Ye

ZFS is an amazing filesystem.  I don’t think the benchmarks at Phoronix give it enough credit: ZFS’s default behaviour is to flush to persistent storage on every sync, which is the only way to guarantee data consistency in the absence of a battery-backed on-disk cache.  Further, the one-two punch of the intent log and the adaptive read cache means that, if you’ve set up your hybrid storage pools correctly, you can get some mind-boggling performance without sacrificing the fault tolerance of your storage pool.
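
For the curious, assembling a hybrid pool looks roughly like this.  It’s a sketch with hypothetical Solaris device names, not a tested configuration:

    # mirrored data vdev on spinning disks, one SSD for the intent log
    # (ZIL), and a second SSD as the adaptive read cache (L2ARC)
    zpool create tank mirror c0t0d0 c0t1d0 \
        log c0t2d0 \
        cache c0t3d0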

The problem with ZFS is that it runs best on Solaris.  I don’t like where Solaris is heading, and I have no reason to believe that it supports the hardware platform I’m using for the new NAS.  There have been a few attempts to port ZFS to Linux, but the most promising effort is still in its infancy.  So it’s not an option for this project.

BTRFS is a good option.  Its performance numbers are impressive, and it offers the same RAID-0 and RAID-1 configurations as ZFS, plus redundant data checksums.  It’s got snapshot capabilities similar to ZFS.  I’ll miss the ZFS send/recv functionality; rsync will get me part of the way there, but it’s not a great solution.  I’m also eagerly anticipating a proper fsck utility for BTRFS.
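
For illustration, BTRFS snapshots operate on subvolumes.  A rough sketch, with hypothetical paths:

    # keep the data in a subvolume so it can be snapshotted
    sudo btrfs subvolume create /mnt/data/backups

    # take a cheap copy-on-write snapshot of it
    sudo btrfs subvolume snapshot /mnt/data/backups /mnt/data/backups-snap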

RAID-1 Fault Tolerance

If you look at the Wikipedia article on RAID, it claims that N-way mirroring (RAID-1) offers a Hamming distance of N.  That strikes me as a bit of an exaggeration, because it assumes that checksumming is being done at a higher level.  What I mean is: if you have a 2-disk mirror and one of the disks has a latent sector error, to my knowledge there is no clear way to figure out which disk has gone bad, because you have no way of telling what the data should have looked like.  In an N-way mirror, a clever RAID controller could probably find agreement among the other N-1 disks in the volume and rewrite the bad sector (or, more likely, isolate it), but even if that functionality exists, I doubt it would be present in a consumer-grade $20 controller.  In my opinion, plain RAID-1 offers no tolerance of latent sector errors at all, because there is no checksum to verify against.  You effectively get one when you go with RAID-5 or RAID-6, which is pretty much the route that ZFS/BTRFS has taken.

Enough With The Preamble, Already

There are 5 disks inside the NAS: 1x128GB OCZ Vertex SSD and 4x1TB WD Caviars.  The SSD is set up as the boot drive (Ubuntu 10.10); it also hosts all the application software.  As an aside, I will never again build a computer without an SSD.  The performance boost is incredible.  Boot time is under 10 seconds.

The CPU is a dual-core Intel Atom D525 (1.8GHz).  I maxed out the motherboard with 4GB of DDR3-1333.

The four 1TB drives are set up in a multi-device BTRFS filesystem, RAID-10 with redundant metadata.  I’ve disabled write caching (sudo hdparm -W 0 /dev/sd[bcde]).  With write caching enabled, I can do sequential writes of up to 120MB/sec from the SSD to the BTRFS filesystem.  With write caching disabled, that drops to a still-respectable 60MB/sec.  I expect that single-threaded sequential I/O, heavily biased towards writes, will be the primary use of this system.
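
For reference, creating a filesystem like this is nearly a one-liner.  The device names match my setup, but treat this as a sketch (with a hypothetical mount point) rather than a transcript of exactly what I typed:

    # data and metadata both striped and mirrored across all four drives
    sudo mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # let the kernel discover all members of the multi-device filesystem,
    # then mount it via any one of them
    sudo btrfs device scan
    sudo mount /dev/sdb /mnt/data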

I originally shared the filesystem over NFS, but I’m accessing it mostly from OS X devices, and it seems like the OS X NFS client really, really sucks.  I couldn’t do better than about 10MB/sec over NFS, and large sequential writes caused the client to become unresponsive for long periods of time.  Anyway, it’s still exported over NFS with the ‘sync’ option, but I mostly access it via rsync+SSH.  In that mode of operation, I can get about 20MB/sec sequential write over my home network.
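
Concretely, the export and a typical transfer look something like this.  The subnet and paths are hypothetical stand-ins for my actual config:

    # /etc/exports on the NAS: 'sync' forces writes to hit disk before
    # the server acknowledges them
    /mnt/data  192.168.1.0/24(rw,sync,no_subtree_check)

    # from a client, push a directory over rsync+SSH
    rsync -avP ~/Documents/ faithless:/mnt/data/backups/documents/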

To solve the off-site backup problem, I think I’ll get a 1TB external drive and rsync from the BTRFS filesystem to the external drive.  eSATA would be really convenient here, but I’ll probably have to stick with USB due to hardware limitations.  ZFS send/receive would also be really useful for incrementally transferring the filesystem to an external disk.  The absence of that functionality in BTRFS means I’ll probably stick with rsync.
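
The plan, more or less, assuming the external drive gets mounted at a hypothetical /mnt/offsite:

    # mirror the filesystem to the external drive; --delete keeps the
    # copy exact by removing files that no longer exist on the source
    sudo rsync -aHv --delete /mnt/data/ /mnt/offsite/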

1: Useful to remember: 2¹⁰=1KB, 2²⁰=1MB, 2³⁰=1GB, 2⁴⁰=1TB.  If you’re going to complain about how 1GB=1000MB, please, don’t.  We’re approximating here.