Home NAS with BTRFS

I just finished building a Network Attached Storage server for home use.  Backups, mostly, but also kind of as a media hub.  Here’s a run-down of the setup, with an exploration of some of the pitfalls of RAID and of the cool features of modern file systems.

Firstly, a couple pics:

Shown above is the new, small NAS (“faithless”) and the old one (“backfire”).  The old one had a few disadvantages: it was huge, noisy and power-hungry.  For storage, I had 3x250GB SATA drives, each set up with EXT4.  Two of the drives were removable (one is shown partly removed in the photo).  In theory, one of the drives would be kept off-site.  I would rsync from client machines (mostly my laptop) to the fixed internal drive, and then rsync from there to the removable drive in the chassis.  Then, I would swap the removable drive with the off-site drive.

It’s a great theory, but in practice I didn’t back up to the file server as often as I should have, and I practically never swapped the off-site drive.  Finally, I was worried (maybe unreasonably) about latent sector errors with EXT4, due to its limited checksumming.

A Word About Latent Sector Error

So, the SATA spec tells us that we should expect an error every 10^13 bits.  10^13 works out to about 2^43 bits, or 2^40 bytes.  That’s 1TB[1].  It’s a little worrying to me that, if I fill up my 1TB drive, there’s a good chance of at least one error lurking somewhere.
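A quick back-of-the-envelope calculation makes the point (the 10^13 figure is the SATA spec number quoted above; the class and method names here are just for illustration):

```java
// Back-of-the-envelope: expected latent errors when reading a full drive,
// given an unrecoverable error rate of one error per 10^13 bits.
public class LatentErrorMath {

    // driveBytes * 8 gives bits; multiply by the per-bit error rate.
    static double expectedErrorsPerFullRead(double driveBytes, double errorsPerBit) {
        return driveBytes * 8 * errorsPerBit;
    }

    public static void main(String[] args) {
        double oneTB = Math.pow(2, 40);   // 2^40 bytes
        System.out.printf("Expected errors per full 1TB read: %.2f%n",
                expectedErrorsPerFullRead(oneTB, 1e-13));   // ~0.88
    }
}
```

So reading the full drive once gives you, in expectation, almost one latent error.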

EXT4 checksums its journal, which AFAIK will protect against errors caused by sync failures (ie. power failure during disk I/O).  But it’s not going to protect against latent sector errors.  To do that, you need checksumming on all the file data, along the lines of what ZFS or BTRFS provides.
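To make the distinction concrete, here’s a minimal sketch of block-level checksumming in Java — nothing like the real BTRFS/ZFS on-disk format, just the idea of recording a CRC alongside each data block at write time and verifying it at read time:

```java
import java.util.zip.CRC32;

// Illustrative only: per-block checksums let you *detect* silent corruption,
// which journal-only checksumming (as in EXT4) cannot do for file data.
public class BlockChecksum {

    // Compute a CRC32 over the block; a real filesystem would store this
    // in metadata, separate from the data block itself.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    // At read time: does the block still match the checksum recorded at write time?
    static boolean verify(byte[] block, long storedCrc) {
        return checksum(block) == storedCrc;
    }
}
```

Flip a single bit in a 4KB block and `verify` fails, which is exactly the latent sector error case that slips past a filesystem without data checksums.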

Oh, ZFS, We Hardly Knew Ye

ZFS is an amazing filesystem.  I don’t think the benchmarks at Phoronix give it enough credit: ZFS’s default behaviour is to flush to persistent storage on every sync, which is the only way to guarantee data consistency in the absence of battery-backed on-disk cache.  Further, the one-two punch of the intent log and the adaptive read cache mean that, if you’ve set up your hybrid storage pools correctly, you can get some mind-boggling performance without sacrificing the fault-tolerance of your storage pool.

The problem with ZFS is that it runs best in Solaris.  I don’t like where Solaris is heading, and I have no reason to believe that it supports the hardware platform I’m using for the new NAS.  There have been a few attempts to port ZFS to Linux, but the most promising effort is still in its infancy.  So it’s not an option for this project.

BTRFS is a good option.  Its performance numbers are impressive and it offers the same RAID-0 and RAID-1 configurations as ZFS, plus redundant data checksums.  It’s got snapshot capabilities similar to ZFS.  I’ll miss the ZFS send/recv functionality; rsync will get me part of the way, but it’s not a great solution.  I’m eagerly anticipating a proper fsck utility for BTRFS.

RAID-1 Fault Tolerance

If you look at the Wikipedia article on RAID, it claims that N-way mirroring (RAID-1) offers a Hamming distance of N.  To me, this is a bit of an exaggeration, because it assumes that checksumming is being done at a higher level.  What I mean is: if you have a 2-disk mirror and one of the disks has a latent sector error, to my knowledge there is no reliable way to figure out which disk has gone bad, because you have no way of telling what the data should have looked like.  In an N-way mirror, a clever RAID controller could probably find agreement among the other N-1 disks in the volume and isolate the bad sector, but even if that functionality exists I don’t think you’ll find it in a consumer-grade $20 controller.  In my opinion, a bare 2-way RAID-1 mirror offers no fault tolerance against silent corruption at all, because there is no checksum to verify against.  You effectively get one when you go with RAID-5 or RAID-6, which is pretty much the route that ZFS/BTRFS has taken.
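Here’s the argument in code form — a hypothetical sketch, not anything a real RAID controller runs.  With only two disagreeing mirror copies there is nothing to arbitrate between them; add an independently stored checksum and the bad copy is immediately identifiable:

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Illustrative sketch: deciding which side of a 2-way mirror is good.
public class MirrorRepair {

    static long crc(byte[] b) {
        CRC32 c = new CRC32();
        c.update(b);
        return c.getValue();
    }

    // Returns the index of the good copy (0 or 1), or -1 when the copies
    // disagree and there is no checksum to arbitrate — the bare RAID-1 case.
    static int pickGoodCopy(byte[] copy0, byte[] copy1, Long storedCrc) {
        if (Arrays.equals(copy0, copy1)) return 0;  // no conflict, either will do
        if (storedCrc == null) return -1;           // RAID-1 alone: undecidable
        if (crc(copy0) == storedCrc) return 0;
        if (crc(copy1) == storedCrc) return 1;
        return -1;                                  // both copies corrupted
    }
}
```

Without the checksum the function can only shrug; with it, a scrub can rewrite the bad copy from the good one.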

Enough With The Preamble, Already

There are 5 disks inside the NAS: 1x128GB OCZ Vertex SSD and 4x1TB WD Caviars.  The SSD is set up as the boot drive (Ubuntu 10.10); it also hosts all the application software.  As an aside, I will never again build a computer without an SSD.  The performance boost is incredible.  Boot time is sub-10-seconds.

The CPU is a dual-core Intel Atom D525 (1.8GHz).  I maxed out the motherboard with 4GB of DDR3-1333.

The four 1TB drives are set up in a multi-device BTRFS filesystem, RAID-10 with redundant metadata.  I’ve disabled write caching (sudo hdparm -W 0 /dev/sd[bcde]).  With write caching enabled, I can do sequential writes up to 120MB/sec from SSD to the BTRFS filesystem.  With write caching disabled, that drops to a still-respectable 60MB/sec.  I expect that single-threaded sequential I/O, heavily biased towards writes, will be the primary use of this system.

I originally shared the filesystem through NFS but I’m accessing it mostly from OS X devices, and it seems like the OS X NFS client really, really sucks.  I couldn’t do better than about 10MB/sec over NFS and large sequential writes cause the client to become unresponsive for long periods of time.  Anyways, it’s still exported over NFS with the ‘sync’ option but I mostly access it via rsync+SSH.  In that mode of operation, I can get about 20MB/sec sequential write over my home network.

To solve the off-site backup problem, I think I’ll get a 1TB external drive and rsync from the BTRFS filesystem to the external drive.  eSATA would be really convenient here, but I’ll probably have to stick with USB due to hardware limitations.  ZFS send/receive would also be really useful for incrementally transferring the filesystem to an external disk; the absence of that functionality in BTRFS means I’ll probably stick with rsync.

[1]: Useful to remember: 2^10=1KB, 2^20=1MB, 2^30=1GB, 2^40=1TB.  If you’re going to complain about how 1GB=1000MB, please, don’t.  We’re approximating here.

Java implementation of Bhandari's Algorithm

Say you want to route two (or three, or four…) paths through a network between the same endpoints, and you want to make sure that they don’t overlap, so that if one path fails you can be pretty confident that the other paths are unaffected. The “naive” approach here might be to use Dijkstra’s algorithm to find the shortest path through the network, then remove those links from the network and run it again to find the shortest path in the modified network.

This works sometimes but there are a bunch of cases where it can fail, because what you really need to do is find the shortest disjoint pair of paths. In this book, Dr. Ramesh Bhandari of AT&T describes many different techniques to find these paths.  On the flight back from London I thought I’d try implementing his algorithm for shortest edge-disjoint pair routing.  I didn’t quite get it done on one battery charge, but I’ve finished off a first attempt.  The results are on GitHub, here.

Note: the unit test here shows a topology where the naive approach of running Dijkstra twice will fail, but Bhandari’s algorithm will find the correct pair of paths.
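For a rough idea of how the algorithm works, here is a compact, self-contained sketch — not the code from the GitHub repository; the class names and the little five-node example are mine.  The steps: find the shortest path, reverse its edges with negated weights, find a second shortest path in the modified graph (Bellman-Ford, since there are now negative weights), then overlay the two paths and cancel any opposite-direction edges:

```java
import java.util.*;

// Simplified sketch of Bhandari's shortest edge-disjoint pair algorithm.
public class DisjointPair {

    // Undirected edge = two directed arcs.
    static void addEdge(Map<String, Map<String, Integer>> g, String u, String v, int w) {
        g.computeIfAbsent(u, k -> new HashMap<>()).put(v, w);
        g.computeIfAbsent(v, k -> new HashMap<>()).put(u, w);
    }

    // Bellman-Ford, because step 2 introduces negative edge weights.
    static List<String> shortestPath(Map<String, Map<String, Integer>> g,
                                     String src, String dst) {
        Map<String, Integer> dist = new HashMap<>();
        Map<String, String> prev = new HashMap<>();
        dist.put(src, 0);
        for (int i = 0; i < g.size() - 1; i++)
            for (String u : g.keySet()) {
                Integer du = dist.get(u);
                if (du == null) continue;
                for (Map.Entry<String, Integer> e : g.get(u).entrySet()) {
                    Integer dv = dist.get(e.getKey());
                    if (dv == null || du + e.getValue() < dv) {
                        dist.put(e.getKey(), du + e.getValue());
                        prev.put(e.getKey(), u);
                    }
                }
            }
        if (!dist.containsKey(dst)) return null;
        LinkedList<String> path = new LinkedList<>();
        for (String v = dst; v != null; v = prev.get(v)) path.addFirst(v);
        return path;
    }

    // (Assumes a disjoint pair actually exists in g.)
    static Set<List<String>> findPair(Map<String, Map<String, Integer>> g,
                                      String src, String dst) {
        List<String> p1 = shortestPath(g, src, dst);

        // Step 2: remove P1's arcs in the path direction; negate the reverse arcs.
        Map<String, Map<String, Integer>> g2 = new HashMap<>();
        for (String u : g.keySet()) g2.put(u, new HashMap<>(g.get(u)));
        for (int i = 0; i + 1 < p1.size(); i++) {
            String u = p1.get(i), v = p1.get(i + 1);
            int w = g2.get(u).remove(v);
            g2.get(v).put(u, -w);
        }
        List<String> p2 = shortestPath(g2, src, dst);

        // Step 3: overlay P1 and P2, cancelling opposite-direction ("interleaving") arcs.
        Set<List<String>> arcs = new HashSet<>();
        for (List<String> p : List.of(p1, p2))
            for (int i = 0; i + 1 < p.size(); i++) {
                List<String> rev = List.of(p.get(i + 1), p.get(i));
                if (!arcs.remove(rev)) arcs.add(List.of(p.get(i), p.get(i + 1)));
            }

        // Walk the surviving arcs from src twice to recover the disjoint pair.
        Map<String, Deque<String>> out = new HashMap<>();
        for (List<String> a : arcs)
            out.computeIfAbsent(a.get(0), k -> new ArrayDeque<>()).add(a.get(1));
        Set<List<String>> pair = new HashSet<>();
        for (int k = 0; k < 2; k++) {
            List<String> path = new ArrayList<>(List.of(src));
            String v = src;
            while (!v.equals(dst)) { v = out.get(v).poll(); path.add(v); }
            pair.add(path);
        }
        return pair;
    }
}
```

On the trap topology — S-A(1), A-B(1), B-T(1), S-B(3), A-T(3) — the shortest path is S-A-B-T, and removing it leaves no second route, so naive double-Dijkstra fails.  Bhandari’s cancellation step trades away the shared A-B link and yields the disjoint pair S-A-T and S-B-T.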

Finding popular images in Apache Logs

I’m sure there are a million web pages out there that cover this already but here’s how I do it:

First, grep the interesting URIs out of the access log.  Here I look for everything ending in .png or .jpg – a better regex would filter out all the WordPress stuff, all the thumbnails, all the stuff that didn’t return 200, etc.

cat logs/www_log | awk '{print $7}' | grep -E '\.(jpg|png)$'

Then a quick awk script will indicate the number of times each image has been downloaded.  Because this script keeps all the image URIs in an associative array, it probably won’t scale to millions of images.  (But it will scale to millions of hits on a few images)

awk '{ linecount[$0]++ } END { for (elem in linecount) print elem" "linecount[elem] }' | sort -r -n -k 2

Pipe the two commands together for UNIXy goodness. Produces output like so (with a more aggressive filter regex in step 1):

/wp-content/uploads/2010/01/IMG_7203.jpg 1543
/wp-content/uploads/2010/01/IMG_5761.jpg 1466
/wp-content/uploads/2010/01/IMG_5765.jpg 857
/wp-content/uploads/2010/01/IMG_5764.jpg 732
/wp-content/uploads/2010/01/IMG_5763.jpg 678
/wp-content/uploads/2010/01/IMG_7197.jpg 409
/wp-content/uploads/2010/01/IMG_7198.jpg 406
/wp-content/uploads/2010/01/box-plans-6.jpg 102
/wp-content/uploads/2010/01/box-plans-5.png 87
/wp-content/uploads/2010/01/box-plans-4.png 34
/wp-content/uploads/2010/01/box-plans-2.png 33
/wp-content/uploads/2010/01/box-plans-1.png 26
/wp-content/uploads/2010/01/box-plans-3.png 22

The Frustromantic Box, Part 4: Software

The final post in the Frustromantic Box series deals with the software side of things.

I’ll just let the code speak for itself, mostly.  I just want to say “thanks” to all the Arduino developers for the great libraries, and to Mikal Hart in particular for his work on the TinyGPS and NewSoftSerial libraries.

I’ve moved the code to github. Follow the instructions there to grab a copy of the code plus its dependencies.

The Frustromantic Box, Part 1: Intro

A few months ago, Hack A Day featured an ingenious hack called the Reverse Geocache Puzzle by a gentleman named Mikal Hart (please note that “Reverse Geocache Puzzle” is Mikal’s trademark – my unabashed clone of his project will hereafter be known as the Frustromantic Box).  As soon as I saw it, I knew I had to build one.  It took a lot longer than I thought it would, but I had the finished product ready in time for my wife’s Christmas present.  She still hasn’t opened it, so I have to be a little cagey here.

I’ll describe this project in a series of posts.  This post – the first – will go over the high-level design of the box.


Android Swingworker

I’ve been fooling around with the Android development platform. It’s quite an adjustment, coming from the iPhone world. I thought I’d post a simple equivalent to the SwingWorker class that works with Android’s Event Dispatch Thread. Here it is:
package ca.razorwire.util;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import android.os.Handler;
import android.os.Looper;
import android.util.Log;

public abstract class UIWorker {
    private static final ExecutorService __execSvc;
    private static final Handler __edtHandler;

    private static final String TAG = "UIWorker";

    static {
        __execSvc = Executors.newSingleThreadExecutor();
        __edtHandler = new Handler( Looper.getMainLooper() );
    }

    public abstract void doInBackground();

    public void done() {
        // Log.d(TAG, "done() executing in thread " + Thread.currentThread().getName() );
        // This space intentionally left blank
    }

    public void execute() {
        // Log.d(TAG, "execute() executing in thread " + Thread.currentThread().getName() );
        final Runnable doneRunner = new Runnable() {
            public void run() {
                done();
            }
        };

        final Runnable bgRunner = new Runnable() {
            public void run() {
                // Log.d(TAG, "bgRunner executing in thread " + Thread.currentThread().getName() );
                doInBackground();
                __edtHandler.post( doneRunner );
            }
        };

        __execSvc.submit( bgRunner );
    }
}
And here’s a sample usage:
UIWorker worker = new UIWorker() {
    public void doInBackground() {
        Log.d( TAG, "Sleeping for a bit." );
        try { Thread.sleep( 10000 ); } catch ( Exception ignore ) {}
    }
};

Log.d( TAG, "Executing worker…" );
worker.execute();
Log.d( TAG, "Returned from execute()" );

If you uncomment the Log calls in UIWorker, you should see some output like this:

07-08 12:14:11.015: DEBUG/rweeks(1078): Executing worker...
07-08 12:14:11.015: DEBUG/UIWorker(1078): execute() executing in thread main
07-08 12:14:11.025: DEBUG/UIWorker(1078): bgRunner executing in thread pool-1-thread-1
07-08 12:14:11.035: DEBUG/rweeks(1078): Sleeping for a bit.
07-08 12:14:11.035: DEBUG/rweeks(1078): Returned from execute()
07-08 12:14:21.038: DEBUG/UIWorker(1078): done() executing in thread main

Public domain code, no guarantees, seems to work.