Doesn’t quite have the same ring to it as ‘A Year in Provence’, does it? Oh well, never mind. 😉
After a year of using Solaris and ZFS for a home fileserver, I thought I would share my experiences here to give an insight into things that worked or did not work.
Also, others have asked me to give a summary of my experiences of using ZFS to highlight strong and weak areas, and to give a critique.
Where to start?
Well, in my original setup I had two systems running Solaris SXCE: one was a NAS and the other was a backup machine.
The NAS had a zpool utilising one vdev (virtual device) of three drives in a RAID-Z1 configuration.
The backup machine utilised a non-redundant configuration of old, different-sized drives that I had lying around.
In practice, as I was cautious of trusting a new system, I didn’t put all of my data on the NAS initially. I just put media like music, video and photos on there, which I had masters of elsewhere. I also used this NAS as a kind of ‘dumping ground’ for copies of existing data from various machines and external drives. In short, I was not really using the NAS as intended. In time though, having used it for over a year now without any data loss, and without a single error reported by a pool scrub (which checks the integrity of all data in the pool), my trust and confidence in Solaris and ZFS have grown.
So far, I have upgraded the NAS twice: the first time to increase storage capacity, and the second time to increase both storage capacity and redundancy, reducing the likelihood of data loss through failing drives. Because (1) I wanted to keep a single vdev for simplicity and (2) to date (2009-05-01) it is still not possible to add drives to an existing RAID-Z vdev, upgrades have been more painful than should be necessary. In reality my upgrades meant having to:
- back up to two targets: (1) one 1TB drive, and (2) the backup system
- destroy the storage pool
- recreate the new storage pool with the extra drive(s)
- restore the data back into the pool from the backups
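As a rough sketch of those four steps, and not the exact commands used for this NAS, the cycle could look like the following. The function only prints the commands so they can be reviewed first; the pool names, snapshot name and disk names are hypothetical, and a second backup target would simply repeat the send step.

```shell
#!/bin/sh
# Sketch only: prints the commands for one backup/destroy/recreate/restore
# cycle instead of running them. Pool, snapshot and disk names are hypothetical.
# Usage: plan_upgrade POOL BACKUP_POOL SNAPSHOT LAYOUT DISK...
plan_upgrade() {
    pool=$1 backup=$2 snap=$3 layout=$4
    shift 4
    cat <<EOF
zfs snapshot -r ${pool}@${snap}
zfs send -R ${pool}@${snap} | zfs receive -F -d ${backup}
zpool destroy ${pool}
zpool create ${pool} ${layout} $*
zfs send -R ${backup}@${snap} | zfs receive -F -d ${pool}
EOF
}
```

For example, `plan_upgrade tank backup 20090501 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0` prints the five commands for moving a pool called tank to a four-drive RAID-Z2 vdev. Check that the backups are readable before running the destroy step.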
I am quite aware that this pain is often not encountered by enterprise users, as they have more resources and thus buy large amounts of storage up-front when they plan their purchases of storage kit. And when they upgrade existing storage systems, they are likely to be adding multiple drives at a time, such as one or more additional vdevs of multiple drives each. Thus the current restriction of not being able to grow an existing RAID-Z vdev is mainly an issue for home users and, even then, there are workarounds, like the one I have shown, painful as they are.
The upside to this pain, however, was that it forced me to learn how to do (1) a full backup, (2) incremental backups, and (3) a full restore from backups.
I will write a whole post on this upgrade process to explain step-by-step the approach I used, which should help others in a similar situation, as there are a number of potential pitfalls for the unwary.
Although I think that I was fairly fortunate in my choice of hardware for my NAS, there were a few lessons learned, which will influence decisions when building future storage systems:
- Research the proposed hardware thoroughly, as flaky driver support will give a miserable experience.
- The motherboard I used for the NAS, an Asus M2N-SLI Deluxe, had a fault whereby the second network port would frequently fail to initialise on POST. I currently use only one network port so that’s not a big problem.
- With my hardware and SATA drivers (nv_sata), I encountered a rare lockup situation when copying files within the same pool, but I have not encountered the bug since last year so perhaps it’s been fixed?
- Power management features on the AMD processor (AMD Athlon X2 BE-2350) I used were non-existent, as the CPU operated at only one fixed frequency – i.e. it was unable to switch to a slower frequency when idle. The processor used little power though, so it was not all bad.
- The system never managed to successfully enter and recover from S3 suspend mode. Thus, I turned the system off when not in use. This turned out not to be a real problem for me, but it was one of my original wishes.
The innovations unveiled within recent Intel and AMD processor designs look interesting in terms of power economy – see Intel’s Core i7 architecture (Nehalem), and AMD’s Phenom II, which have many improvements over the original Phenom processors.
Also, Intel’s Atom processor and associated chipset and D945GCLF2 motherboard look interesting for very small NAS systems utilising a simple 2-way mirror. Unfortunately, the motherboard has only two SATA connectors, which makes it unsuitable for a more substantial NAS using more than two drives, although I’ve seen that Zhong Tom Wang got round this limitation with his Intel Atom-based ZFS home NAS box. However, lack of ECC memory support is a pity.
The latest Intel Core i7-based Xeon 5500 series of processors have 15 P-states, so can select an appropriate CPU frequency dynamically according to load. Also, as Intel have worked very closely with Sun to ensure Solaris/OpenSolaris has great support for their new Core i7 processors, you can be pretty sure that support for power management has advanced greatly in the last year. Check out these videos to see what I mean:
- Sun and Intel joint engineering: Solaris and Core i7 innovations
- OpenSolaris and Nehalem – a digital short by Intel’s David Stewart.
I don’t know anything about how well Solaris/OpenSolaris supports the new power states available in the new AMD Phenom II processors, so if anyone has info, please add a comment below.
Update 2010-01-20: Since originally writing this article, there have been several new processor developments:
- Intel Pine Trail platform, including the new Pineview processors: single/dual-core 1.66GHz 64-bit second-generation Atom processors that integrate processor, chipset, memory controller and GPU, and use incredibly low amounts of power, ranging from 5.5W TDP for the single-core Intel Atom N450 to a miserly 13W TDP for the top-of-the-range dual-core Intel Atom D510. Unfortunately for ZFS usage, they don’t support ECC memory, which makes them unsuitable where data integrity is of paramount importance. See the AnandTech report for more details: AnandTech: Intel Atom D510: Pine Trail Boosts Performance, Cuts Power
- Intel Xeon 5600-series (Westmere-EP; the ‘Gulftown’ codename belongs to the six-core desktop part): Due to be released around March 2010, although it is anticipated that Apple will announce a new 2010 edition Mac Pro model using these processors before then. These should offer six cores instead of the four cores of the Xeon 5500-series (Nehalem). Unfortunately for consumers, Intel has priced these processors as enterprise devices, making them unsuitable as home NAS processors due to cost, even though they support ECC memory.
- AMD Athlon II X2 ‘e’ models: AMD have released interesting low power versions of their dual-core 64-bit processors, for example the AMD Athlon II X2 235e, which is rated at 45W TDP and has CPU frequency scaling to use less power when the NAS is idle. As with most AMD processors, these provide ECC memory support in the memory controller inside the processor package.
- AMD Phenom II X4 ‘e’ models: AMD have released the much improved Phenom II range of quad-core processors and, additionally, have now produced lower power versions of these, denoted with an ‘e’ suffix, such as the AMD Phenom II X4 905e, rated at 65W TDP. It supports CPU frequency scaling for lower power usage when the processor is idle, and provides ECC memory support in the memory controller inside the processor package. These are a little more expensive than the standard models, but are more suitable for a NAS due to lower power usage. These processors seem to offer the ideal combination of (1) low power when the NAS is idle, and (2) extra processor power when required for various interesting ZFS features like compression, deduplication, encryption and triple parity (RAID-Z3) calculations, plus sufficient power for 10GbE for fast LAN communications, if required for things like video editing.
I originally chose ECC memory for added robustness, and I will continue to use it for future builds, as the cost premium is minimal for the added peace of mind it gives through its ability to detect and correct bit errors in memory. Many people don’t consider ECC memory important. I disagree. Garbage in memory caused by flipped bits, once written to disk, will not be what you would like to read back from disk. Enterprise server systems use ECC memory, and there’s a reason for that. 😉
Please see the important updates below before continuing.
In terms of drives, the 3.5″ Western Digital ‘green’ WD10EADS 1TB drive is currently looking good on price, idle power usage (2.8W), read/write power usage (5.4W), and also noise and vibration issues, as these drives operate at 5400 RPM despite confusing & conflicting information out there. Read/write performance is quite respectable, however, due to built-in innovations, allegedly. They should be perfect for a general purpose home NAS where you want lots of cheap, reliable storage that doesn’t sound like a jet-plane, and consumes little power.
The Western Digital WD15EADS model, which is a 1.5TB version of the same drive with very similar specs, is around the same price per GB and looks like a good choice too for larger storage pools. Currently the WD20EADS 2TB version of the same drive is just too expensive per GB to use, unless you’re building a monster Blu-ray server.
Update #1 2010-01-20: Please note that serious problems with the current range of Western Digital Green drives are being reported in various fora, and so I can not recommend these drives as suitable for use in a RAID system, and Western Digital do not recommend them as suitable for RAID systems either. Please see here for more details:
Update #2 2010-01-20: As of the date of this update, the price sweet-spot is the 1.5TB drives, with the 2TB drives still a little too expensive and currently not good value for money, although this will surely change in the coming few months. It’s quite difficult to find good, reliable, consumer-priced SATA drives for RAID use. See my comments listed by manufacturer below.
- Western Digital Green drives, which would have been my first choice, have to be ruled out for the reasons cited in the update above. Also, see my notes below.
- Seagate: I will check reports of currently available models.
- Hitachi have the HDS722020ALA330 2TB model, but I have not seen comments on this, although it appears to be a 5-platter model, which is not desirable for the reasons cited below. I will seek out reports on it and also seek out any 1.5TB model reports.
- Samsung has the HD154UI, a 1.5TB drive with three platters, which seems to have good customer ratings at newegg.com. They also have a very recent 2TB four-platter model, the HD203WI, released around 2009-12, which has good customer ratings so far at newegg.com, but it might be too early to make an informed buying decision yet.
In seeking desirable drives, look for those with the fewest platters, as fewer platters generally means lower noise, vibration and heat, and better reliability. As of January 2010, 500GB per platter is the highest available data density, so for 1.5TB drives look for 3-platter drives, and for 2TB drives look for 4-platter drives.
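The platter arithmetic is just a ceiling division of capacity by per-platter density. A minimal sketch, with a hypothetical function name and the 500GB figure as the default:

```shell
#!/bin/sh
# Fewest platters needed for a given capacity at a given per-platter density.
# Both values are in GB; the result is an integer ceiling division.
min_platters() {
    capacity_gb=$1
    per_platter_gb=${2:-500}   # 500GB per platter, the January 2010 figure
    echo $(( (capacity_gb + per_platter_gb - 1) / per_platter_gb ))
}

min_platters 1500   # 1.5TB drive: prints 3
min_platters 2000   # 2TB drive: prints 4
```

A drive that uses more platters than this minimum for its capacity is built from older, lower-density platters.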
I think Western Digital has made a really big mistake recently with their Green drive range. First of all, they appear to have some serious technical issues with these drives. Also, they appear to have marked this Green range of drives as unsuitable for RAID usage, even though the low price, low rotational speed (5400 RPM) and low power usage make them an obvious candidate for consumer RAID drives. If this is an intentional decision in order to create market segmentation between consumer and enterprise drives, then it is a pity, as there are many potential buyers of these drives for consumer RAID applications where low price, high capacity and power economy are of primary importance, with performance of secondary importance. I have seen reports from users claiming that WDTLER.EXE, the utility used to improve error handling in a RAID environment, no longer works with newer revisions of the Green drives. This alone points to an attempt to make users buy their enterprise SATA drives for RAID environments, like the WD2002FYPS 2TB model, but these are around 50% more expensive. In effect, this has removed Western Digital as a choice for consumer-priced RAID drives. If this situation changes, I will update this viewpoint. If you have evidence that I am wrong in my interpretation, please leave a comment below.
Storage is the heart of a NAS, so special consideration should be given to the disk SATA controller. ZFS should have full control of the disks, so JBOD mode is all that you need (Just a Bunch Of Disks, i.e. no custom RAID controller hardware, firmware or software). However, drivers for on-motherboard SATA controllers may or may not be robust, or even available, and so it may be worth considering a SATA controller card for a future build, whose driver is known to be 100% rock solid. This will help guarantee no weird issues with storage ruining your day. 🙂
Update 2010-01-20: I am currently using the SuperMicro AOC-USAS-L8i SATA/SAS controller, and have been very impressed with it. Please see here for more details: Home Fileserver: Mirrored SSD ZFS root boot.
This is a great value 8-port SATA/SAS controller, but it uses a SuperMicro UIO bracket, which needs to be removed for use in a standard tower case, although this is easy to do. For an alternative adapter which is 100% identical in terms of hardware, but uses a standard bracket for a tower case, see the LSISAS3081E-R, although this is around 50% or so more expensive than the SuperMicro equivalent, for some reason.
SuperMicro also make a low-profile version called the AOC-USASLP-L8i. These models are all SATA 2 3Gbps per lane models.
SuperMicro has recently released new adapter models, the AOC-USAS2-L8i and the AOC-USAS2-L8e which are able to provide 6 Gbps of per-lane bandwidth for the new ranges of high-speed SATA 3 SSD devices. The AOC-USAS2-L8i model also has RAID capability, whereas the AOC-USAS2-L8e model does not. As ZFS requires JBOD and does not need hardware RAID, the AOC-USAS2-L8e model looks to be the best adapter to use for up to 8 internal SATA drives. However, I am awaiting confirmation whether this card is compatible with Solaris and ZFS. Normally, it should be compatible. Check this thread for further details:
New Supermicro SAS/SATA controller: AOC-USAS2-L8e in SOHO NAS and HD HTPC.
These practical experiences and the latest technological updates will make it easier to choose a processor, motherboard and SATA controller for future storage hardware.
Watch this space.
ZFS allows a system builder to design as much redundancy as required, or none at all, into his/her storage systems.
Many people choose single parity (RAID-Z1) for multi-drive arrays, as this gives an efficient data to parity ratio — i.e. you can use most of your drive capacity for data storage, and only the capacity of one drive is used for parity data. It is this parity data that is used to rebuild drives in the event that files get corrupted or drives fail. So it *does* have value, immense value, but because it is unavailable for data storage, many people see it as wasted space. When you suffer a loss scenario you will be thankful for parity data though, as ZFS will use the parity data to put things back to normal.
My NAS originally started off with a storage pool consisting of a three-drive RAID-Z1 vdev, which was a cheap way to get started, and it was the right choice for me at the time: the capacity of two drives for data, and one for parity.
A RAID-Z1 configuration means that the NAS will survive one drive failure without data loss. A second drive loss means your data is toast. Yikes!
This is very important to consider, as drives are often bought together at the same time, and so they will most likely be from the same manufacturing batch, meaning that any faults in design, materials or manufacturing process will often cause drives to fail around the same time. This means that it is quite likely that when one drive fails, it is only a matter of time before a second drive fails, and that time period may be very short.
This becomes important when you have little time to rebuild your storage array. The process of rebuilding the lost data from parity data onto a replacement drive is called resilvering, and it is critical that a second drive does not fail during this resilvering process, otherwise you will lose your data!
Some further research into drive failures, together with the fact that putting your data onto a NAS is like ‘putting all your eggs in one basket’, has caused me to reconsider the use of RAID-Z1.
With the information I now know, I consider the use of double parity far more robust, as it gives much greater protection against data loss by allowing your system to survive two drive failures. With RAID-Z2 you are effectively buying yourself more time when you need to replace a failed drive. Also, should a second drive fail during the resilvering process when rebuilding the first failed drive, you will still not lose data. Only if a third drive should fail will you lose your data.
For these reasons, I have now upgraded my system to use a single RAID-Z2 vdev configuration.
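To put rough numbers on the trade-off between capacity and redundancy, here is a small sketch. It is an approximation that ignores ZFS metadata overhead, and the six-drive RAID-Z2 layout is a hypothetical example rather than the configuration of this NAS.

```shell
#!/bin/sh
# Approximate usable capacity of a single RAID-Z vdev:
# (number of drives - parity drives) * size of one drive.
# Usage: raidz_usable_gb DRIVES PARITY DRIVE_SIZE_GB
raidz_usable_gb() {
    drives=$1 parity=$2 size_gb=$3
    echo $(( (drives - parity) * size_gb ))
}

raidz_usable_gb 3 1 1000   # three 1TB drives in RAID-Z1: prints 2000
raidz_usable_gb 6 2 1500   # hypothetical six 1.5TB drives in RAID-Z2: prints 6000
```

The parity argument is also the number of drive failures the vdev can survive: 1 for RAID-Z1, 2 for RAID-Z2.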
As using RAID alone is not an excuse not to do backups, in addition to using a RAID-Z2 based NAS, it is important to have some other system to do backups onto.
And if you are really serious about not losing data, the next thing to consider is taking a copy off-site, possibly on a high capacity drive or two, to guard against possible loss due to fire. With ZFS this is trivial to achieve by inserting a couple of high-capacity drives, creating a new pool from them, and then using ‘zfs send/receive’ to copy the file systems to the new pool, and finally typing ‘zpool export’ to flush all writes and cleanly detach the pool before removing the drives and transporting them off-site.
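A minimal sketch of that off-site sequence follows. It only prints the commands; the pool names, snapshot name and disk names are hypothetical, and whether the off-site drives form a mirror or a plain stripe is left to the caller.

```shell
#!/bin/sh
# Sketch only: prints the commands for creating an off-site copy of a pool.
# Usage: plan_offsite SOURCE_POOL OFFSITE_POOL SNAPSHOT VDEV_SPEC...
plan_offsite() {
    src=$1 dst=$2 snap=$3
    shift 3
    cat <<EOF
zpool create ${dst} $*
zfs snapshot -r ${src}@${snap}
zfs send -R ${src}@${snap} | zfs receive -F -d ${dst}
zpool export ${dst}
EOF
}
```

For example, `plan_offsite tank offsite 20090501 mirror c2t0d0 c2t1d0` prints the four commands for a mirrored pair. At the other end, `zpool import offsite` makes the pool available again.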
For those interested in preserving data (who isn’t?), I found the following site to be educational:
- The Tao Of Backup
- Tao Of Backup Wailing Wall (very sad tales of loss — sniff, sniff!)
- The Tao Of Backup Audit Questionnaire
- The Tao Of Backup Sanctuary
Snapshots are magical. Simple, but magical. I use snapshots like one would use the ‘save game’ feature in a video game, just before opening a door in a game like Doom, Quake etc…
In these games, it takes a long time for feeble players like me to progress through all the levels, so I learned to save the game just before opening a door, as there was invariably a nasty beast waiting behind it which would result in ‘Game Over’ being displayed. 🙂
In the same way with ZFS, before attempting any operation that is significant, I always snapshot the file systems in the pool. That way, should I make a mistake and type ‘rm -fr *’, I can easily recover by typing ‘zfs rollback tank/fs@snapshot’ and everything magically returns to the state before typing the normally disastrous command. That’s magic!
You can even roll back from an OS upgrade that didn’t go well, if you take a snapshot of the OS file system before the upgrade.
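One thing worth knowing is that ‘zfs rollback’ acts on one file system at a time, so after a recursive snapshot each file system has to be rolled back separately. A minimal sketch, with a hypothetical helper that turns a list of file system names into rollback commands for review:

```shell
#!/bin/sh
# Read file system names on stdin and print one rollback command per name.
# Usage: zfs list -H -o name -t filesystem -r tank | rollback_cmds 20090501
rollback_cmds() {
    snap=$1
    while IFS= read -r fs; do
        if [ -n "$fs" ]; then
            echo "zfs rollback ${fs}@${snap}"
        fi
    done
}
```

Piping the output to `sh` would run the commands. Rolling back to anything other than the most recent snapshot also needs the `-r` flag, which destroys the snapshots taken after it.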
This command has become my friend:
# zfs snapshot -r tank@20090501
This little beastie will make a snapshot recursively through all the file systems within your storage pool, assuming your pool is called tank. The snapshot name given to the snapshots for each file system will be ‘20090501’ in this case. Use a new date for each occasion, or qualify further with a timestamp or sequence number if you are doing lots of snapshots on the same day. Recursive snapshotting like this makes it easy when doing large incremental backups too.
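The naming scheme can be generated rather than typed. A small sketch, assuming a hypothetical helper that appends an optional sequence number to today's date:

```shell
#!/bin/sh
# Build a snapshot name of the form POOL@YYYYMMDD or POOL@YYYYMMDD-SUFFIX.
# Usage: snapshot_name POOL [SUFFIX]
snapshot_name() {
    pool=$1
    suffix=$2
    echo "${pool}@$(date +%Y%m%d)${suffix:+-${suffix}}"
}
```

It would be used as `zfs snapshot -r "$(snapshot_name tank)"` for the first snapshot of the day, and `zfs snapshot -r "$(snapshot_name tank 2)"` for a second one.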
Now that I have started to take more snapshots, I have also learnt the necessary incantations to do full and incremental backups using ‘zfs send’ and ‘zfs receive’. These are amazingly powerful and, done right, make it fairly simple to do regular incremental backups recursively through a hierarchy of file systems. I will detail all this in a later post. I used these techniques when doing pool upgrades for increasing capacity and redundancy levels (RAID-Z1 –> RAID-Z2).
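Until that post exists, here is a hedged sketch of what one incremental step might look like; it is not necessarily the exact incantation used here. It assumes the backup pool already holds the previous recursive snapshot from an earlier full send, and all names are hypothetical. As with the other sketches, it only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the commands for one recursive incremental backup.
# Usage: plan_incremental POOL BACKUP_POOL PREVIOUS_SNAPSHOT NEW_SNAPSHOT
plan_incremental() {
    pool=$1 backup=$2 prev=$3 curr=$4
    cat <<EOF
zfs snapshot -r ${pool}@${curr}
zfs send -R -i ${pool}@${prev} ${pool}@${curr} | zfs receive -F -d ${backup}
EOF
}
```

For example, `plan_incremental tank backup 20090501 20090508` prints a send of only the changes between the two snapshots. The `-F` on the receiving side rolls the backup back to its latest snapshot first, discarding any changes made there since.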
Sharing and NFSv4 ACLs
I started by creating simple CIFS shares to a single computer, which was a Macintosh. I was using simple Unix-style permissions, and all was well.
Well, not quite. I frequently saw permissions problems when moving data around and it got to be a pain sometimes: back to the command line, chmod 755/644, chown user:group * etc.
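That kind of clean-up can be wrapped in a small function. This sketch assumes 755 is meant for directories and 644 for regular files, and leaves ownership alone, since chown needs the right privileges and the correct user:group for your setup.

```shell
#!/bin/sh
# Reset Unix-style permissions under a directory tree:
# directories to 755 (rwxr-xr-x), regular files to 644 (rw-r--r--).
# Usage: reset_perms DIR
reset_perms() {
    dir=$1
    find "$dir" -type d -exec chmod 755 {} +
    find "$dir" -type f -exec chmod 644 {} +
}
```

For example, `reset_perms /tank/media` would reset everything under a hypothetical media file system.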
Then I discovered the ACLs chapter of the ZFS Administration Guide. It looked powerful but far from trivial.
When you get these right, you can set up nicely behaving file systems, with inheritance of properties and permissions etc. Getting them right is a bit of work though and, so far, I have failed to find an idiot’s guide to NFSv4 ACLs, as used by ZFS now. If someone can send me the URL of one, I’d be delighted. If not, then I see another long post ahead for me to write one day…
In a later post I will detail my findings, as these NFSv4 ACLs seem to be the future, and they offer more flexibility than standard Unix-style permissions, although at a cost, it seems.
Sharing read/write file systems with a Windows user/box led to some interesting discoveries relating to ACLs and user accounts too, which I will also try to document later.
iSCSI was a piece of fun, and it allowed fast transfers! But because my backup machine was not switched on very often, the NAS was taking ages to start and shut down as it tried to connect to the backup machine’s iSCSI resources. I don’t use this any longer. Useful for systems which are on 24/7 though.
Trunking (link aggregation) was a pain to set up on both ends and required an 802.3ad-compliant switch, but it did give considerably faster transfers between my dual-GbE Mac Pro and the dual-GbE NAS when it worked. I say ‘when it worked’ because of the bug in initialising the second GbE port on the Asus M2N-SLI Deluxe motherboard on POST. Thus, I broke my trunked network connections and returned to simple single GbE network connections.
When I get more serious hardware one day, I will probably revisit this area, and buy myself a decent user-friendly switch like the HP ProCurve 1800-8G 8-port managed switch that works with browsers other than MS Internet Explorer, unlike my Linksys SRW2008 (Cisco low-end). I have returned to my previous switch, a DLink DGS-1008D green ethernet 8-port unmanaged Gigabit switch, which works great for single GbE-connected machines.
I have learnt a lot over the last year about ZFS, and using it has convinced me that I made the right choice in selecting both Solaris and ZFS. But I have more to learn, and a lot more to write about, which I hope to do when I get some more time.
I would be interested in hearing about the experiences of other ZFS users, so feel free to add a comment below.