Home Fileserver: A Year in ZFS

Doesn’t quite have the same ring to it as ‘A Year in Provence’, does it? Oh well, never mind. ;-)

After a year of using Solaris and ZFS for a home fileserver, I thought I would share my experiences here to give an insight into things that worked or did not work.

Also, others have asked me to give a summary of my experiences of using ZFS to highlight strong and weak areas, and to give a critique.

Where to start?

Well, in my original setup I had two systems running Solaris Express Community Edition (SXCE): one was a NAS and the other was a backup machine.

The NAS had a zpool utilising one vdev (virtual device) of three drives in a RAID-Z1 configuration.

The backup machine utilised a non-redundant configuration of old, different-sized drives that I had lying around.

In practice, as I was cautious of trusting a new system, I didn’t put all of my data on the NAS initially. I just put media like music, video and photos on there, which I had masters of elsewhere. I also used this NAS as a kind of ‘dumping ground’ for copies of existing data from various machines and external drives. In short, I was not really using this NAS as intended. In time though, having used it for over a year now, and having experienced no data loss, and not even one error reported by a pool scrub (which checks the integrity of all data in the pool), my trust and confidence in Solaris and ZFS have grown.

Upgrades

So far, I have upgraded the NAS twice: the first time to increase storage capacity, and the second time to increase both storage capacity and redundancy, reducing the likelihood of data loss through failing drives. Because (1) I wanted to keep a single vdev for simplicity and (2) as of this writing (2009-05-01) it is still not possible to add drives to an existing RAID-Z vdev, upgrades have been more painful than should be necessary. In practice, my upgrades meant having to:

  1. back up to two targets: (1) one 1TB drive, and (2) the backup system
  2. destroy the storage pool
  3. recreate the new storage pool with the extra drive(s)
  4. restore the data back into the pool from the backups
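A rough sketch of these four steps as a dry run follows. It is an illustration under assumptions, not necessarily the exact commands used for the upgrades described here: the pool names `tank` and `backup`, the snapshot name, the `raidz2` layout and the device names are all placeholders, only one backup target is shown, and the function just prints the commands so that they can be reviewed before anything destructive is typed.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for a destroy-and-recreate pool upgrade.
# Pool names, snapshot name and device names are placeholders (assumptions).
print_upgrade_plan() {
    pool=$1; backup=$2; snap=$3; shift 3
    # 1. snapshot every file system and replicate the lot to the backup pool
    echo "zfs snapshot -r ${pool}@${snap}"
    echo "zfs send -R ${pool}@${snap} | zfs receive -F -d ${backup}"
    # 2. destroy the old pool (irreversible, so verify the backup first)
    echo "zpool destroy ${pool}"
    # 3. recreate the pool with the remaining arguments as its drives
    echo "zpool create ${pool} raidz2 $*"
    # 4. restore the file systems from the backup pool
    echo "zfs send -R ${backup}@${snap} | zfs receive -F -d ${pool}"
}

# Example: print the plan for a four-drive RAID-Z2 pool (hypothetical devices)
print_upgrade_plan tank backup 20090501 c1t0d0 c1t1d0 c1t2d0 c1t3d0
```

Printing rather than executing is a deliberate choice here: `zpool destroy` cannot be undone, so it seems safer to check the backup (for example with a scrub of the backup pool) before running step 2 by hand.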

I am quite aware that enterprise users often do not encounter this pain, as they have more resources and buy large amounts of storage up-front when they plan their purchases. And when they upgrade existing storage systems, they are likely to add multiple drives at a time, as one or more additional vdevs of multiple drives each. Thus the current restriction of not being able to grow an existing vdev is mainly an issue for home users and, even then, there are workarounds like the one I have shown, painful as they are.

The upside to this pain, however, was that it forced me to learn how to do (1) a full backup, (2) incremental backups, and (3) a full restore from backups.

I will write a whole post on this upgrade process to explain step-by-step the approach I used, which should help others in a similar situation, as there are a number of potential pitfalls for the unwary.

Hardware issues

Although I think that I was fairly fortunate in my choice of hardware for my NAS, there were a few lessons learned, which will influence decisions when building future storage systems:

  • Research the proposed hardware thoroughly, as flakey driver support will give a miserable experience.
  • The motherboard I used for the NAS, an Asus M2N-SLI Deluxe, had a fault whereby the second network port would frequently fail to initialise on POST. I currently use only one network port so that’s not a big problem.
  • With my hardware and SATA drivers (nv_sata), I encountered a rare lockup situation when copying files within the same pool, but I have not encountered the bug since last year so perhaps it’s been fixed?
  • Power management features on the AMD processor (AMD Athlon X2 BE-2350) I used were non-existent, as the CPU operated at only one fixed frequency – i.e. it was unable to switch to a slower frequency when idle. The processor used little power though, so it was not all bad.
  • The system never managed to successfully enter and recover from S3 suspend mode. Thus, I turned the system off when not in use. This turned out not to be a real problem for me, but it was one of my original wishes.

Processors

The innovations unveiled within recent Intel and AMD processor designs look interesting in terms of power economy – see Intel’s Core i7 architecture (Nehalem), and AMD’s Phenom II, which have many improvements over the original Phenom processors.

Also, Intel’s Atom processor and associated chipset and D945GCLF2 motherboard look interesting for very small NAS systems utilising a simple 2-way mirror. Unfortunately, due to only having two SATA connectors on the motherboard, it makes this unsuitable for a more substantial NAS using more than two drives, although I’ve seen that Zhong Tom Wang got round this limitation with his Intel Atom-based ZFS home NAS box. However, lack of ECC memory support is a pity.

The latest Intel Core i7-based Xeon 5500 series of processors have 15 P-states, so can select an appropriate CPU frequency dynamically according to load. Also, as Intel have worked very closely with Sun to ensure Solaris/OpenSolaris has great support for their new Core i7 processors, you can be pretty sure that support for power management has advanced greatly in the last year. Check out these videos to see what I mean:

I don’t know anything about how well Solaris/OpenSolaris supports the new power states available in the new AMD Phenom II processors, so if anyone has info, please add a comment below.

Update 2010-01-20: Since originally writing this article, there have been several new processor developments:

  • Intel Pine Trail platform, including the new Pineview processors: single/dual-core 1.66GHz 64-bit second-generation Atom processors which integrate processor, chipset, memory controller and GPU, and use incredibly low amounts of power, ranging from 5.5W TDP for the single-core Intel Atom N450 through to a miserly 13W TDP for the top-of-the-range dual-core Intel Atom D510. Unfortunately for ZFS usage, they don’t support ECC memory, which makes them unsuitable where data integrity is of paramount importance. For more details, see the AnandTech report: AnandTech: Intel Atom D510: Pine Trail Boosts Performance, Cuts Power
  • Intel Xeon 5600-series (Gulftown): Due to be released around March 2010, although it is anticipated that Apple will announce a new 2010 edition Mac Pro model using these processors before then. These should offer 6-cores instead of the 4-core Xeon 5500-series (Nehalem). Unfortunately for consumers, Intel has priced these processors as enterprise devices, making them unsuitable as home NAS processors due to cost, even though they support ECC memory.
  • AMD Athlon II X2 ‘e’ models: AMD have released interesting low-power versions of their dual-core 64-bit processors, for example the AMD Athlon II X2 235e, rated at 45W TDP, with CPU frequency scaling to use less power when the NAS is idle. As with most AMD processors, these provide ECC memory support in the memory controller inside the processor package.
  • AMD Phenom II X4 ‘e’ models: AMD have released the much improved Phenom II range of quad-core processors and, additionally, have now produced lower-power versions of these, denoted with an ‘e’ suffix, such as the AMD Phenom II X4 905e, rated at 65W TDP, supporting CPU frequency scaling for lower power usage when the processor is idle, and providing ECC memory support in the memory controller inside the processor package. These are a little more expensive than the standard models, but are more suitable for a NAS due to lower power usage. These processors seem to offer the ideal combination of (1) low power when the NAS is idle, and (2) extra processor power when required for various interesting ZFS features like compression, deduplication, encryption and triple parity (RAID-Z3) calculations, plus sufficient power for 10GbE for fast LAN communications, if required for things like video editing.

ECC memory

I originally chose ECC memory for added robustness, and I will continue to use it for future builds, as the cost premium is minimal for the added peace of mind it gives in its ability to detect and correct single-bit errors in memory. Many people don’t consider ECC memory important. I disagree. Bits flipped in memory and then written to disk will not be what you would like to read back from disk. Enterprise server systems use ECC memory, and there’s a reason for that. ;-)

Drives

Please see the important updates below before continuing.

In terms of drives, the 3.5″ Western Digital ‘green’ WD10EADS 1TB drive is currently looking good on price, idle power usage (2.8W), read/write power usage (5.4W), and also noise and vibration issues, as these drives operate at 5400 RPM despite confusing & conflicting information out there. Read/write performance is quite respectable, however, due to built-in innovations, allegedly. They should be perfect for a general purpose home NAS where you want lots of cheap, reliable storage that doesn’t sound like a jet-plane, and consumes little power.

The Western Digital WD15EADS model which is a 1.5TB version of the same drive, with very similar specs is around the same price per GB and looks like a good choice too for larger storage pools. Currently the WD20EADS 2TB version of the same drive is just too expensive per GB to use, unless you’re building a monster Blu-ray server.

Update #1 2010-01-20: Please note that serious problems with the current range of Western Digital Green drives are being reported in various fora, and so I can not recommend these drives as suitable for use in a RAID system, and Western Digital do not recommend them as suitable for RAID systems either. Please see here for more details:
http://opensolaris.org/jive/thread.jspa?threadID=121871&tstart=0

Update #2 2010-01-20: As of the date of this update, the price sweet-spot is the 1.5TB drives, with the 2TB drives still a little too expensive and currently not good value for money, although this will surely change in the coming few months. It’s quite difficult to find good, reliable, consumer-priced SATA drives for RAID use. See my comments listed by manufacturer below.

  • Western Digital Green drives, which would have been my first choice, have to be ruled out for the reasons cited in the update above. Also, see my notes below.
  • Seagate: I will check reports of currently available models.
  • Hitachi have the HDS722020ALA330 2TB model, but I have not seen comments on this, although it appears to be a 5-platter model, which is not desirable for the reasons cited below. I will seek out reports on it and also seek out any 1.5TB model reports.
  • Samsung has a HD154UI 1.5TB drive having 3-platters which seems to have good customer ratings at newegg.com, and they also have a very recent HD203WI 2TB 4-platter model released around 2009-12, which has good customer ratings so far at newegg.com, but it might be too early to make an informed buying decision yet.

In seeking desirable drives, one looks for drives containing the fewest platters, from the low noise, low vibration, low heat and good reliability perspectives. As of January 2010, 500GB per platter is the highest available data density, so for 1.5TB drives look for 3-platter drives, and for 2TB drives look for 4-platter drives.
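The platter arithmetic above is just a ceiling division, which can be sketched in a couple of lines of shell. The function name is made up for illustration, and the figures are nominal marketing gigabytes.

```shell
#!/bin/sh
# Minimum number of platters for a drive of a given capacity,
# at a given data density per platter (both in nominal GB).
platters_needed() {
    capacity_gb=$1; per_platter_gb=$2
    # integer ceiling division: round up to a whole platter
    echo $(( (capacity_gb + per_platter_gb - 1) / per_platter_gb ))
}

platters_needed 1500 500   # 1.5TB drive at 500GB per platter
platters_needed 2000 500   # 2TB drive at 500GB per platter
```

A drive with more platters than this minimum is using an older, lower-density platter design.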

I think Western Digital has made a really big mistake recently with their Green drive range. First of all, they appear to have some serious technical issues with these drives. Also, they appear to have marked this Green range of drives as unsuitable for RAID usage, even though the low price, low rotational speed (5400 RPM) and low power usage make them an obvious candidate for consumer RAID drives. If this is an intentional decision in order to create market segmentation between consumer and enterprise drives, then it is a pity, as there are many potential buyers of these drives for consumer RAID applications where low price, high capacity and power economy are of primary importance, with performance of secondary importance. I have seen reports from users claiming that the WDTLER.EXE utility, which is used to improve error handling in a RAID environment, no longer works with newer revisions of the Western Digital Green drives. This alone points to an attempt to make users buy the much more expensive enterprise SATA drives for RAID environments, like the WD2002FYPS 2TB model, which is around 50% more expensive. In effect, this has removed Western Digital as a choice for consumer-priced RAID drives. If this situation changes, I will update this viewpoint. If you have evidence that I am wrong in my interpretation, please leave a comment below.

SATA controller

Storage is the heart of a NAS, so special consideration should be given to the disk SATA controller. ZFS should have full control of the disks, so JBOD mode is all that you need (Just a Bunch Of Disks, i.e. no custom RAID controller hardware, firmware or software). However, drivers for on-motherboard SATA controllers may or may not be robust, or even available, and so it may be worth considering a SATA controller card for a future build, whose driver is known to be 100% rock solid. This will help guarantee no weird issues with storage ruining your day. :)

Update 2010-01-20: I am currently using the SuperMicro AOC-USAS-L8i SATA/SAS controller, and have been very impressed with it. Please see here for more details: Home Fileserver: Mirrored SSD ZFS root boot.

This is a great value 8-port SATA/SAS controller, but it uses a SuperMicro UIO bracket, which needs to be removed for use in a standard tower case, although this is easy to do. For an alternative adapter which is 100% identical in terms of hardware, but uses a standard bracket for a tower case, see the LSISAS3081E-R, although this is around 50% or so more expensive than the SuperMicro equivalent, for some reason.

SuperMicro also make a low-profile version called the AOC-USASLP-L8i. These models are all SATA 2 3Gbps per lane models.

SuperMicro has recently released new adapter models, the AOC-USAS2-L8i and the AOC-USAS2-L8e which are able to provide 6 Gbps of per-lane bandwidth for the new ranges of high-speed SATA 3 SSD devices. The AOC-USAS2-L8i model also has RAID capability, whereas the AOC-USAS2-L8e model does not. As ZFS requires JBOD and does not need hardware RAID, the AOC-USAS2-L8e model looks to be the best adapter to use for up to 8 internal SATA drives. However, I am awaiting confirmation whether this card is compatible with Solaris and ZFS. Normally, it should be compatible. Check this thread for further details:
New Supermicro SAS/SATA controller: AOC-USAS2-L8e in SOHO NAS and HD HTPC.

These practical experiences and latest technological updates will make it easier to choose future storage hardware for a processor, motherboard and SATA controller.

Watch this space.

Redundancy

ZFS allows a system builder to design as much or as little redundancy (even none) into a storage system as required.

Many people choose single parity (RAID-Z1) for multi-drive arrays, as this gives an efficient data to parity ratio — i.e. you can use most of your drive capacity for data storage, and only the capacity of one drive is used for parity data. It is this parity data that is used to rebuild drives in the event that files get corrupted or drives fail. So it *does* have value, immense value, but because it is unavailable for data storage, many people see it as wasted space. When you suffer a loss scenario you will be thankful for parity data though, as ZFS will use the parity data to put things back to normal.

RAID-Z1 vdevs

My NAS originally started off with a storage pool consisting of a three-drive RAID-Z1 vdev, which was a cheap way to get started, and it was the right choice for me at the time: the capacity of two drives for data, and one for parity.
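The data-to-parity trade-off can be put into numbers with a small sketch. It assumes equal-sized drives in a single RAID-Z vdev and ignores metadata overhead and the GB/GiB difference, so real usable space will be somewhat lower; the function name and the 1000GB drive size are illustrative only.

```shell
#!/bin/sh
# Approximate usable capacity (GB) of a single RAID-Z vdev:
# the capacity of (drives - parity) drives holds data.
# Ignores metadata overhead and GB/GiB differences (assumption).
raidz_usable_gb() {
    drives=$1; drive_gb=$2; parity=$3
    echo $(( (drives - parity) * drive_gb ))
}

raidz_usable_gb 3 1000 1   # three drives, single parity (RAID-Z1)
raidz_usable_gb 4 1000 2   # four drives, double parity (RAID-Z2)
```

With these example sizes both layouts give the capacity of two drives for data; the second simply spends one more drive on parity.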

A RAID-Z1 configuration means that the NAS will survive one drive failure without data loss. A second drive loss means your data is toast. Yikes!

This is very important to consider, as drives are often bought together at the same time, and so they will most likely be from the same manufacturing batch, meaning that any faults in design, materials or manufacturing process, will often cause drives to fail around the same time. This means that it is quite likely that when one drive fails, it is only a matter of time before a second drive fails, and that time period may be very short.

This becomes important when you have little time to rebuild your storage array. The process of rebuilding the lost data from parity data onto a replacement drive is called resilvering, and it is critical that a second drive does not fail during this resilvering process, otherwise you will lose your data!

RAID-Z2 vdevs

After some further research into drive failures, and considering that putting your data onto a NAS is like ‘putting all your eggs in one basket’, I reconsidered the use of RAID-Z1.

With the information I now know, I consider the use of double parity far more robust, as it gives much greater protection against data loss by allowing your system to survive two drive failures. With RAID-Z2 you are effectively buying yourself more time when you need to replace a failed drive. Also, should a second drive fail during the resilvering process when rebuilding the first failed drive, you will still not lose data. Only if a third drive should fail will you lose your data.

For these reasons, I have now upgraded my system to use a single RAID-Z2 vdev configuration.

RAID alone is no substitute for backups so, in addition to using a RAID-Z2 based NAS, it is important to have some other system to back up onto.

And if you are really serious about not losing data, the next thing to consider is taking a copy off-site, possibly on a high capacity drive or two, to guard against possible loss due to fire. With ZFS this is straightforward: insert a couple of high-capacity drives, create a new pool from them, use ‘zfs send’ and ‘zfs receive’ to copy the file systems to the new pool, and finally type ‘zpool export’ followed by the pool name to complete the write process before removing the drives and transporting them off-site.
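The off-site copy described above might look roughly like the following dry run. The pool name `offsite`, the mirror layout, the snapshot name and the device names are assumptions for illustration; the function only prints the commands. Note that the pool is exported with `zpool export`, which flushes outstanding writes and marks the pool as safe to remove.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for making an off-site copy of a pool.
# Pool names, snapshot name and devices are placeholders (assumptions).
print_offsite_plan() {
    src=$1; dst=$2; snap=$3; shift 3
    # create the temporary pool; remaining arguments are the vdev layout
    echo "zpool create ${dst} $*"
    # snapshot the source recursively and replicate it to the new pool
    echo "zfs snapshot -r ${src}@${snap}"
    echo "zfs send -R ${src}@${snap} | zfs receive -F -d ${dst}"
    # export the pool before pulling the drives
    echo "zpool export ${dst}"
}

# Example: two hypothetical drives as a mirror
print_offsite_plan tank offsite 20090501 mirror c2t0d0 c2t1d0
```

When the drives come back on-site, `zpool import offsite` should make the pool available again for the next round.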

For those interested in preserving data (who isn’t?), I found the following site to be educational:

Snapshots

These are magical. Simple, but magical. I use snapshots like one would use the ‘save game’ feature in a video game, just before opening a door in a game like Doom, Quake etc…

In these games, it takes a long time for feeble players like me to progress through all the levels, so I learned to save the game just before opening a door, as there was invariably a nasty beast waiting behind it which would result in ‘Game Over’ being displayed. :)

In the same way with ZFS, before attempting any operation that is significant, I always snapshot the file systems in the pool. That way, should I make a mistake and type ‘rm -fr *’, I can easily recover by typing ‘zfs rollback tank/fs@snapshot’ and everything magically returns to the state before typing the normally disastrous command. That’s magic!

You can even rollback from an OS upgrade that didn’t go well, if you do a snapshot of the OS file system before the upgrade.
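As a minimal sketch of this ‘save game’ habit, the pair of commands below brackets a risky operation. The file system name `tank/fs` and the snapshot name are placeholders, and the function prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: print a snapshot/rollback pair around a risky operation.
# The file system and snapshot names are placeholders (assumptions).
print_checkpoint_plan() {
    fs=$1; snap=$2
    echo "zfs snapshot ${fs}@${snap}"   # the 'save game' point
    echo "zfs rollback ${fs}@${snap}"   # only if something went wrong
}

print_checkpoint_plan tank/fs before-cleanup
```

One caveat: by default `zfs rollback` only returns to the most recent snapshot of a file system. Going back further requires the `-r` option, which destroys the snapshots taken after the one being rolled back to, so it is worth pausing before using it.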

This command has become my friend:

# zfs snapshot -r tank@20090501

This little beastie will make a snapshot recursively through all the file systems within your storage pool, assuming your pool is called tank. The snapshot name given to the snapshots for each file system will be ‘20090501’ in this case. Use a new date for each occasion, or qualify further with a timestamp or sequence number if you are doing lots of snapshots on the same day. Recursive snapshotting like this makes it easy when doing large incremental backups too.
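To avoid typing the date by hand, the snapshot name can be generated. This is a small sketch using the standard `date` command; the `YYYYMMDD-HHMMSS` format and the function names are just one possible convention, not something ZFS requires.

```shell
#!/bin/sh
# Build a snapshot name from the current date and time,
# in the form YYYYMMDD-HHMMSS (an arbitrary but sortable convention).
snapshot_name() {
    date +%Y%m%d-%H%M%S
}

# Print (rather than run) the recursive snapshot command for a pool.
print_snapshot_cmd() {
    pool=$1
    echo "zfs snapshot -r ${pool}@$(snapshot_name)"
}

print_snapshot_cmd tank
```

Because the name sorts chronologically, a plain `zfs list -t snapshot` listing can be read in order.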

Backups

Now that I have started to take more snapshots, I have also learnt the necessary incantations to do full and incremental backups using ‘zfs send’ and ‘zfs receive’. These are amazingly powerful and, done right, make it fairly simple to do regular incremental backups recursively through a hierarchy of file systems. I will detail all this in a later post. I used these techniques when doing pool upgrades for increasing capacity and redundancy levels (RAID-Z1 –> RAID-Z2).
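Pending that later post, here is a hedged sketch of what one incremental round can look like. It assumes a full replication stream of the earlier snapshot has already been received on the backup pool, and that pool and snapshot names are placeholders; the function prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for one recursive incremental backup.
# Assumes ${pool}@${prev} already exists on both the source and the backup
# pool from an earlier full 'zfs send -R'. All names are placeholders.
print_incremental_backup() {
    pool=$1; prev=$2; new=$3; backup=$4
    # take the new recursive snapshot
    echo "zfs snapshot -r ${pool}@${new}"
    # send only the changes between the previous and the new snapshot
    echo "zfs send -R -i ${pool}@${prev} ${pool}@${new} | zfs receive -F -d ${backup}"
}

print_incremental_backup tank 20090501 20090508 backup
```

The `-i` option sends the difference between two snapshots; `-I` can be used instead when the intermediate snapshots should be carried across as well.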

Sharing and NFSv4 ACLs

I started by creating simple CIFS shares to a single computer, which was a Macintosh. All was well. I was using simple Unix-style permissions, and all was well.

Well, not quite. I frequently saw permissions problems when moving data around and it got to be a pain sometimes: back to the command line, chmod 755/644, chown user:group * etc.
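The manual fix-up mentioned above can be wrapped in a small helper. This is a sketch only: it resets every directory under a given path to 755 and every regular file to 644, which will also strip the execute bit from scripts and binaries, and it leaves ownership (the `chown` part) alone. The function name and example path are made up.

```shell
#!/bin/sh
# Reset Unix permissions below a directory:
# directories become 755 (rwxr-xr-x), regular files 644 (rw-r--r--).
# Caution: this also removes the execute bit from scripts and binaries.
reset_permissions() {
    root=$1
    find "$root" -type d -exec chmod 755 {} +
    find "$root" -type f -exec chmod 644 {} +
}

# Example (hypothetical path):
#   reset_permissions /tank/media
```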

Then I discovered the ACLs chapter of the ZFS Administration Guide. It looked powerful but far from trivial.

When you get these right, you can set up nicely behaving file systems, with inheritance of properties and permissions etc., but getting them right is a bit of work and, so far, I have failed to find an idiot’s guide to NFSv4 ACLs, as used by ZFS now. If someone can send me the URL of one, I’d be delighted. If not, then I see another long post ahead for me to write one day…

In a later post I will detail my findings, as these NFSv4 ACLs seem to be the future, and they offer more flexibility than standard Unix-style permissions, although at a cost, it seems.

Sharing read/write file systems with a Windows user/box led to some interesting discoveries relating to ACLs and user accounts too, which I will also try to document later.

iSCSI

This was a piece of fun, and it allowed fast transfers! But due to the fact that my backup machine was not switched on very often, the NAS was taking ages to start and shut down as it was trying to connect to the backup machine iSCSI resources. I don’t use this any longer. Useful for systems which are on 24/7 though.

Trunking

This was a pain to set up on both ends and required an 802.3ad-compliant switch, but it did give considerably faster transfers between my dual-GbE Mac Pro and the dual-GbE NAS when it worked. It did not always work because of the bug in initialising the second GbE port on the Asus M2N-SLI Deluxe motherboard on POST. Thus, I broke my trunked network connections and returned to simple single GbE network connections.

When I get more serious hardware one day, I will probably revisit this area, and buy myself a decent user-friendly switch like the HP ProCurve 1800-8G 8-port managed switch that works with browsers other than MS Internet Explorer, unlike my Linksys SRW2008 (Cisco low-end). I have returned to my previous switch, a D-Link DGS-1008D green ethernet 8-port unmanaged Gigabit switch, which works great for single GbE-connected machines.

Conclusion

I have learnt a lot over the last year about ZFS, and using it has convinced me that I made the right choice in selecting both Solaris and ZFS. But I have more to learn, and a lot more to write about, which I hope to do when I get some more time.

I would be interested in hearing from any other users of ZFS to hear their experiences — feel free to add a comment below.

For more ZFS Home Fileserver articles see here: A Home Fileserver using ZFS. Alternatively, see related articles in the following categories: ZFS, Storage, Fileservers, NAS.

47 Responses to “Home Fileserver: A Year in ZFS”

  1. Hi Simon,

    I’ve had ZFS based NAS similar to yours running for a little over 10 months now (partially thanks to your clear articles).

    I did not really have any hardware issues (quad core phenom, 8gb ram) other than consistently slowish transfer speeds from my MacBook Pro – tellingly it sped up a lot when I upgraded the MBP disk, but I think some of it is due to the open Solaris drivers for the onboard NIC on the M3N78-EMH M/B. I did try an Intel NIC and that was slower still, so I am not really sure what’s going on.

    I have a 5 * 750gb RAIDZ1 array, and I have never worried too much about data loss; I figure that if a drive dies I can power down and buy a new one. However what you say about safety and down time makes sense and I think I would switch to Z2 if I had a spare SATA port. I’m going to need another controller at some point anyway unless I want to junk the existing disks as I will run out of capacity in another 6 months, so I will probably make the switch then. It’s a bit noisy, but that’s because of a cheap case and a 5 in 3 hotswap drive cage which has an 80mm fan and no dampening – definitely not needed, but it does look good.

    Like you I had/have occasional issue with permissions, mainly when I play with NFS mounts rather than CIFS. I did look at the ACL stuff, but decided it was too complicated to dive into until I had a block of time. There does seem to be need for a clear summary or idiots guide, *hint* *hint*.

    Timeslider in the gui makes taking regular snapshots for backups trivial, and I might try and see how hard it is to point it at cloud based storage lacking anywhere better to back up to.

    I have an iSCSI target running on the NAS which I point the MBP’s Timemachine at – it works great (thanks to whoever posted the links in the comments of one of your previous threads).

    All in all I am very happy with it. It gives me the data security and centrality that I wanted, a place to run any other server code that I need, and is far more performant and cheaper than an off the shelf NAS box with anything near the same capacity.

    The only thing that worries me about it now is Oracle playing silly buggers with either ZFS or Open Solaris! I can see them keeping ZFS going, it makes sense as their storage layer, but I can see them killing off Open Solaris development, and putting a commercial licence on future ZFS versions. Though there also seems to be the possibility of opening up the licence so it can be included in the linux kernel as Oracle push their linux stack fairly hard. If that happens, I would be tempted to switch to a linux distro; solaris may be a technically better OS, but software support and momentum count for a lot.

  2. Hi Lee,

    Thanks, and good to see you’ve had success with ZFS too.

    Without knowing the exact specs of your MacBook Pro’s hard drive, it sounds quite possible that this could have been the cause of slow transfers, as laptops use small 2.5″ drives spinning at 5400 RPM or less, for low power consumption to preserve battery life.

    When you switch to a RAID-Z2 system, after sending the pool file system data to your new machine, you could use your five 750GB drive array as a large backup target: almost 3.75TB of non-redundant backup space, or almost 3TB of RAID-Z1 protected space for backups…

    Yep, you’re right, when expanding these things, one quickly runs out of SATA ports on the motherboard, and a SATA controller card then becomes necessary.

    OK, after a bit more searching, I’ll see if I can bring myself to write an ‘NFSv4 ACLs for complete idiots’ post… trouble is, it will probably be full of errors as I have had great difficulty locating any comprehensible, yet detailed explanation of exactly how to use a lot of the power in NFSv4 ACLs… but I suppose it will be better than nothing, and if more knowledgeable people spot errors they can correct me.

    I have resisted using Time Slider so far, as I wanted to learn how to do all ZFS admin from the command line, so I understood the underlying mechanisms. But I must admit that when they add network-based incremental backups from the snapshots taken, then it will be harder to resist using the ZFS Automatic Snapshot service written by Tim Foster and Time Slider from the Gnome UI.

    And regarding cloud-based backups from your snapshots, you might find this recent post from Tim Foster interesting if you haven’t already read it: Automatic snapshots into The Cloud.

    Good to see that you got the iSCSI target working for backing-up your MBP with Time Machine. I didn’t try it again recently, and now that my data is off the Mac (mostly) and onto the NAS, there is less incentive to look again at Time Machine. For getting a speedy clone of the boot disk I found SuperDuper to be pretty good.

    And I have to agree with you — a NAS like this is cheaper, has more capacity and out-performs virtually any off the shelf NAS system out there and, crucially, due to using ZFS, it offers far more protection than these expensive mainstream alternatives.

    Like yourself, I too have concerns about what will happen to Solaris/OpenSolaris and ZFS when Oracle get hold of it. As you say, I think they will keep Solaris going, and hopefully they will see OpenSolaris as a kind of experimental bug-testing version of Solaris that is free for general use, and just charge commercial license fees for the Solaris 10/11 enterprise systems etc. Again, hopefully Oracle will restrict license fees for ZFS to commercially-supported installations — i.e. SLA contracts. But we’ll just have to wait and see.

    Also, what will be interesting with Oracle, will be that they will own two file systems: (1) ZFS from Sun and (2) Btrfs. I must admit that I know nothing about Btrfs, but it appears to be very immature compared to ZFS. Financially, it would probably make sense to kill off Btrfs and replace it with ZFS, as it’s probably years away from reaching ZFS’ level of maturity. Then they could GPL the ZFS license to make it available for Linux kernel usage, the Linux crowd would jump for joy, and Oracle would become heroes and gain kudos. And if they only charge for supported installations with SLA’s, it could be that everyone wins. But that’s probably the best scenario outcome.

    Anyone out there care to give their opinions on what they think Oracle will do with ZFS and Btrfs?

    And imagine if Oracle did make ZFS use the GPL licence so that Linux could allow itself to use ZFS at the kernel level. I’m not sure that I would want to switch to Linux from Solaris if that happened, but it would be nice to have the possibility to do so. But Solaris has so many nice features like CIFS sharing for ZFS file systems too, and Zones which is a marvel in itself. And DTrace… the list goes on…

    Cheers,
    Simon

  3. Simon,

    I’ve had my ZFS NAS up since last October. My decision to use ZFS (and my introduction to the technology) was based on your posts. I bought 4 x 1.5TB seagate drives (yes, the ones that had bad firmware problems) but luckily I have had zero problems with the drives. They are in RAIDZ1 on an ASUS MB and 8 GB ram. My home network has a couple of macs, and the one on gig ethernet is able to transfer around 80 mbps over NFS. I got Time Machine setup to work over samba to the machine, even over wireless (but I abandoned this until I can wire ethernet around to some other machines).

    I am hoping that the next release of OpenSolaris will allow me to upgrade to it from SXCE b99. I’ve been stuck with this build because every other build has trouble seeing all of my drives after reboot, and sometimes loses them while the machine is running. I’m pretty sure it is some kind of driver error with my MB. But i’m so used to Ubuntu and a package manager that I’d really like to have the solaris package manager to install some addons and get virtualbox up with ubuntu to stream media around my home, and possibly get LDAP up for permissions things.

    I do not know enough about the two companies so I cannot make an educated assumption on what Oracle will do with this project, but I’m hoping they’ll keep it around for us all. Putting ZFS on the GPL would definitely make me consider using Ubuntu as the main OS instead of having to virtualize it, but I still enjoy solaris. Also, Apple is going to add ZFS read/write support at least for mac OS X server http://www.apple.com/server/macosx/snowleopard/ and hopefully for the desktop. This would be great to be able to use ZFS to administer a multi-drive firewire or eSATA enclosure.

    Keep up the great posts, I learn something new every time I read them. Now that classes are done for the summer I’ll hopefully get some playtime with my machine.

    Brian

  4. Simon,

    Just want to say I’ve found your posts to be immensely helpful in deciding my NAS solution. I’ve purchased an AMD-based system and some drives and am working on getting OpenSolaris set up now. Unfortunately I ran into a snag: my workstation’s Realtek network card can’t seem to get up to GigE speeds when connecting to the switch, so it looks like I’ll have to get a new network card before I can actually try using the setup.

    I will likely do a writeup in my blog once I get things more figured out.

  5. Hi Brian,

    Thanks for the compliments, and good to see you’ve had generally positive experiences too.

    Pity to hear about possible motherboard hardware driver issues — I know, these are no fun!

    Like you say, getting LDAP running might be nice — I need to take a look how to do that too, I think.

    Indeed OS X server 10.6 aka Snow Leopard will support ZFS, but I don’t think the client will, which is a pity.

    Have fun!

    Simon

  6. Hi Kamil,

    Thanks a lot and glad the posts have been useful.

    Which speed do you get with your Realtek NIC when using your switch? I assume it’s a gigabit switch too. And what are you transferring to/from, and does it also use a gigabit NIC, and which sharing protocol are you using: NFS or CIFS? And which category cable are you using on BOTH legs of the link to/from the switch: category 5, 5e or 6?

    I’ve seen people not realise they were using category 5 cable on one of the cables and say they can’t get high speeds, or discovered later that the other NIC is only a 10/100 Mbit/s NIC.

    Yes, writing up these experiences can be really useful, even for oneself when asking “how did I do that again?” some time later!

    Simon

  7. Simon,

    I’ve found chapter 6 of http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1 to be quite helpful as it’s the definitive reference for NFS v4 ACLs.

    It is not, however, a “to accomplish this, use these ACL entries” guide. But it does provide an essential background to understand others’ guides/suggestions regarding ACL use.

    Kyle

  8. Thanks for that Kyle. I was using something similar from the same source: http://www.ietf.org/rfc/rfc3530.txt
    However, I see that the date of your URL is 2008 and mine is 2003, so your link should be more up-to-date.

    Thanks,
    Simon

  9. Great post. I am in the process of implementing ZFS in an enterprise environment, with several high-capacity servers to be used as NAS. Your article was a great help in setting up a testing environment. Now I will move on to work on ACLs, which are unfortunately essential when living in Windows Active Directory. The second thing for me will be iSCSI, since we use that technology heavily for virtual machines. Once again, great article, and I’m looking forward to your posts.

  10. Simon,

    Thanks for your posts regarding ZFS. They were very helpful. I built a server last year with the following hardware:

    CPU: AMD X2 4850E
    MB: XFX MDA72P7509 NF750A (6x SATA, 2x PCIe x16 ports)
    RAM: 4GB ECC RAM

    The onboard NIC is a Marvell 88E8056 and is not supported, as of OpenSolaris 2009.06. I used an Intel 1000GT PCI NIC instead. Initially I had two of these Intel NICs and set them up as a trunk on a Cisco 2970, but I didn’t get the expected results. The 802.3ad standard does not round robin packets like you’d think it would. It uses src/dest hashing instead. Also, the eSATA port on the motherboard requires a SATA pass through cable to be plugged into one of the six ports on the motherboard. So if you want to use eSATA, you can only use 5 internal SATA ports. I may get a cheap pci-express eSATA card later.
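    To illustrate why src/dest hashing limits a single transfer to one link, here is a minimal sketch. The hash (CRC32 of the address pair) and the MAC addresses are assumptions made up for illustration; real switches and NICs use their own vendor-specific hashing policies.

```python
import zlib

def pick_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Choose a trunk member for a frame from its source/destination pair.

    Hypothetical policy: CRC32 of the address pair modulo the link count.
    Real switches and NICs use their own, vendor-specific hash inputs.
    """
    key = f"{src_mac}>{dst_mac}".encode("ascii")
    return zlib.crc32(key) % n_links

NAS = "00:00:5e:00:53:01"  # made-up addresses for illustration
CLIENTS = ["00:00:5e:00:53:10", "00:00:5e:00:53:11", "00:00:5e:00:53:12"]

for client in CLIENTS:
    link = pick_link(client, NAS, n_links=2)
    print(f"{client} -> NAS always uses link {link}")
```

    Because the result depends only on the address pair, every frame between one client and the NAS lands on the same link, so a single client never sees more than one link's worth of bandwidth. The trunk only helps when several clients happen to hash to different links.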

    I was using a single 80 GB IDE drive for the OS, and 6x 300GB SATA drives in raidz. Those drives are old and way past their warranty, so I decided to upgrade the drives with 1TB drives. I only bought 2 so far at $90 each. I’ll buy more later.

    My new storage is configured like this:

    OS: 2x 300GB SATA drives in a mirror (old, but never used, were spares)
    DATA: 2x 1TB SATA drives in a mirror (will soon expand to a 4x 1TB raidz2)

    That will max out my SATA ports. I decided to mirror the OS since it should increase performance and redundancy. I am also no longer running SXCE. It was too much work and I don’t like SVR4 style package management and I don’t like bi-weekly upgrades. I prefer to use IPS, so I am running OpenSolaris 2009.06.

  11. I was at first also very disappointed about the fact that adding single disks to raidz was not possible. But after playing with the whole system I am not any more, although if you are on a really tight budget it might still be interesting.
    If you want to add space you have basically 2 options, which are both good.
    1. You can swap all drives one by one, and when you’re done your pool has magically grown. ;)
    2. You can add another raidz vdev to your pool.
    The first option is nice when you don’t want to add SATA ports or storage bays. And if a disk dies you can decide to get a bigger drive as its replacement right away.
    The second option is good for when you have run out of space. When this happens it usually means you need at least double the amount of space to be good.
    I tend to fill up space faster and faster over time, so adding another raidz is not such a bad idea. Now you could say that you will be spending another drive on parity without getting the extra safety you would get with raidz2 or a hot spare drive. However, you are forgetting what you do get back: the 2 raidz vdevs will be striped with each other. Of course not for the old data, and if your old pool was filled to the rim not for the new data either, but for the rest, yes.
    So my conclusion is that, however sad it is that you cannot add a single drive to a raidz vdev, in reality even in budget home settings I don’t think it is a needed feature. Maybe in a super tight budget home setting, but hell, just throw away some data then ;) . For the money people spend on a video card you can buy a set of 5 hard disks, which makes a nicely sized raidz1.
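    The two options can be compared with a rough capacity model. This is only a sketch under simplifying assumptions: usable space of a raidz vdev is taken as (drives minus parity) times the smallest drive, metadata overhead is ignored, and the drive sizes are invented for illustration.

```python
def raidz_usable_gb(drive_sizes_gb, parity=1):
    """Rough usable space of one raidz vdev.

    Every member only counts for as much as the smallest drive,
    and `parity` drives' worth of space goes to redundancy.
    """
    return (len(drive_sizes_gb) - parity) * min(drive_sizes_gb)

# Option 1: swap the drives of a 3-drive raidz1 one by one (750 GB -> 1500 GB).
drives = [750, 750, 750]
print("before:", raidz_usable_gb(drives))
for i in range(len(drives)):
    drives[i] = 1500
    print(f"after swapping drive {i + 1}:", raidz_usable_gb(drives))

# Option 2: keep the old vdev and add a second raidz1 vdev to the pool.
pool = [[750, 750, 750], [1500, 1500, 1500]]
print("two vdevs:", sum(raidz_usable_gb(v) for v in pool))
```

    In this model the extra space only appears once the last small drive has been replaced, which is why option 1 means buying and resilvering a full set of drives before seeing any benefit.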

  12. Thanks Joe, sounds like you’ve had a lot of ‘fun’ too with the hardware :)

    I had mixed results with the trunking too, but it was definitely a lot faster when I had it working. But the trunk suffered from a buggy 2nd GbE port that often failed to initialize on the Asus M2N-SLI Deluxe motherboard.

    Like yourself, I too have run out of SATA ports: soon it’s time to get another SATA card too :)

    Again, like yourself I’m looking to mirror the OS boot disks, and due to breaking a pin on the boot drive’s IDE connector, I took the opportunity to install OpenSolaris 2009.06. However, as I now want to use this NAS to do more stuff, my configuration is likely to get more complex, so I’m now thinking of creating a mirrored, multiple boot environment-capable setup using OpenSolaris 2009.06 with a couple of SSDs. This case has run out of vibration-dampened, silicone-grommeted drive bays, so it’s SSDs for me, despite their ridiculous $/GB cost right now. Two 30GB SSDs should do nicely.

  13. Thanks Jarek, and good luck with your enterprise setup — which kit/setup are you using for your ZFS NAS?

    You’ve probably already discovered all you need to know by now regarding ACLs to be used in Windows environments, but if not there is a useful post I found that was posted by an experienced Windows systems administrator that might be interesting — see here:
    http://breden.org.uk/2009/05/10/home-fileserver-zfs-file-systems/#comment-9524

  14. Hi Wouter,

    Like you said, if you’re on a tight budget then adding a single drive to an existing vdev, like a raidz vdev, if it were possible right now, would be an interesting possibility.

    But like you say, other ways of upgrading the storage pool do exist. I think in previous posts, some people were confused when I said you can’t expand an existing raidz vdev by adding additional drives, and thought I was saying that you can’t expand a zpool storage pool, which is not true, of course, as you correctly pointed out.

    For me personally, the second option you mentioned of adding a new raidz vdev was not an option due to lack of SATA ports on the motherboard. And the first option of swapping a disk at a time frightens me a lot, due to the multiple resilvering process required. See these comments:
    http://breden.org.uk/2008/09/01/home-fileserver-raidz-expansion/#comment-4046
    http://breden.org.uk/2008/09/01/home-fileserver-raidz-expansion/#comment-4057

    And of course, comparing the two configurations of (1) two 3-drive raidz1 vdevs, against (2) one 6-drive raidz2 vdev, we can see that both (1) and (2) use the capacity of 4 drives for data and the capacity of 2 drives for parity, but the important difference is that (2) has double parity across the whole raidz2 vdev, and so is more protected against data loss — i.e. (2) can survive any 2 drives failing before data loss occurs, whilst (1) can only survive one drive failing in each of the raidz1 vdevs — if two drives fail in either raidz1 vdev then data loss occurs.
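    The comparison can be checked by brute force. The following sketch enumerates every possible pair of failed drives for both layouts and counts how many leave the pool intact; the only rule assumed is that a vdev survives as long as it has lost no more drives than its parity level.

```python
from itertools import combinations

def pool_survives(failed, vdevs, parity):
    """A pool survives if no vdev has lost more drives than its parity level."""
    return all(len(failed & vdev) <= parity for vdev in vdevs)

def count_survivable(vdevs, parity, n_failures):
    """Return (survivable combinations, total combinations) of failed drives."""
    drives = sorted(set().union(*vdevs))
    combos = list(combinations(drives, n_failures))
    ok = sum(pool_survives(set(c), vdevs, parity) for c in combos)
    return ok, len(combos)

two_raidz1 = [{0, 1, 2}, {3, 4, 5}]   # configuration (1)
one_raidz2 = [{0, 1, 2, 3, 4, 5}]     # configuration (2)

print(count_survivable(two_raidz1, parity=1, n_failures=2))  # (9, 15)
print(count_survivable(one_raidz2, parity=2, n_failures=2))  # (15, 15)
```

    So with two simultaneous drive failures, 6 of the 15 possible pairs (the ones where both drives sit in the same raidz1 vdev) lose the pool in configuration (1), while configuration (2) survives all 15.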

  15. Simon,
    Even if adding one drive on a tight budget might be nice, I am trying to look at the bigger picture here. Even if the budget is tight, it is probably more economical to add space in an order of magnitude greater than 2. Otherwise you just spend your time creating space instead of using it. So this is viable even for a home setup.
    If you do not need this amount of extra space then you probably should rethink whether you need more space at all. All content is growing exponentially over time: think about cameras with more megapixels, newer games that don’t fit on that single-sided 360KB floppy any more, programs that used to be distributed on floppy and are now on DVD and soon Blu-ray, etc.
    As for the choice of method once you have realised how much more space you need: if you physically have more ports/bays then you choose that path; if not, get bigger disks.

    For the resilvering, I agree it is a strain on your drives if you are really worried about it, but then again what about normal use? And if the disks are that bad, maybe you should have changed them before.
    You can lower the chance of something going wrong during resilvering with a disk you did not exchange by doing a scrub beforehand and, although this one is tough on Solaris, an extended SMART check.
    Now compare the SMART data of the disks and choose the one you think is most likely to fail. Personally I think that after the previous two intensive operations the disk will survive the resilvering process just fine. And before changing the next disk you can check the SMART data and choose again accordingly.

    And for the safety of 2 raidz vdevs versus 1 raidz2: I know you do not get extra safety. I just mentioned that instead of safety you get a bit more speed, since the 2 raidz vdevs will be “striped” by ZFS.

    p.s. my experience with failing hard drives is not very pleasant either. In fact right now I am making a one-to-one backup of my laptop drive since it had some pauses…. just a gut feeling, but the previous 2 gut feelings were correct so better safe than sorry here.

  16. Hi Wouter,

    I agree that one would save time adding 2 drives once, rather than 1 drive twice, and you’re right — having more space available than we need right now is a good idea. I just used similar sized disks for consistency to ensure it worked well, rather than upgrading to bigger drives — but that is a good option I will use on another system sometime.

    Speed is a little irrelevant here right now as I’m currently using a single GbE connection limiting network transfer speeds to around 50 MBytes/sec, so having 2 RAID-Z1 vdevs is not of interest to me with this NAS, but it might be of interest to others with different configurations, or who often need to transfer large amounts of data within the NAS — i.e. not across the network.

    Failing drives are not fun — good luck with your backup!

    Cheers,
    Simon

  17. Simon,
    What I meant with “to add space in an order of magnitude greater than 2” was not adding drives in pairs but at least doubling your current capacity. I am not sure how much space you have at the moment, but let’s say it is about 3TB of usable space. Now when this is getting full, it is either because you have a bunch of garbage or because there is actual data there. In the data case you probably want to go to at least 6TB of space when you decide to upgrade. That sounds like a lot now, but if you could fill 3TB, you will be able to fill the next 3TB quite a lot faster.
    About hardware choices: for ECC memory one could decide to get an AMD solution, as a lot of those accept ECC memory. And at least 6 SATA ports :)
    For myself I just decided to use 5TB HDs and be done with it for a while. Even though I have all the hardware lying around, I am still not able to use it all because of a delay in building the case. The reason I did not get a standard case is that putting 1 DVD drive, 1 OS HD and 5 data disks in a thermally sound (I should say silent) package is not so easy. Now I have a decent design which makes even less noise than my laptop (a MacBook), but I still need to put it all together. The reason it is silent is that those Samsung drives do not make much sound and produce remarkably little heat. A big CPU cooler, a 45W CPU and a 300W be quiet! PSU are very efficient, and it will be running at 170W max. I did a measurement with one 500GB Samsung SATA drive, DVD and floppy, running burnK7, FurMark and a SMART self-test while inserting a DVD: that came to 100W at the socket, and 50W idle. Damn, I forgot to put in a floppy.. ;) I needed it for an XP install. The noisiest part will be the OS HD for now, since the USB sticks I bought to run the OS on are error prone and I need to send them back.
    After that I still have some software issues, since my data machine also needs to be the media center, and although I did eventually compile movieplayer on Solaris, the sound chip in my Asus M3N78-EM board does not play nice (a serious case of clipping).
    Now my final setup will be a 32-bit Ubuntu with Boxee as media player; this will also function as a Samba PDC and a DNS and DHCP server, and finally it will run Apache for my website/blog. Virtualised on top of that: a QEMU CentOS CalWeaver or Asterisk server, and a QEMU EON with direct access to the 5 data disks. Except for the EON install, which has some USB issues with QEMU, it is already running on my older box with Asterisk. In the end I would like to add a Firefly media server for iTunes, a SharePoint-style CMS based on Plone, and finally maybe connect my TV card to the TV cable, since I do not have a television.
    You might wonder why 32-bit: well, Boxee is currently available only in 32-bit.
    As for the laptop drive, I am happy to report it is still fine ;)

  18. Hi Simon,

    thanks to your very enjoyable posts (from a geek’s point of view), I have been the owner of a homemade ZFS-based NAS for about two weeks now. It’s working great so far, even though I struggled a bit when authenticating from Windows to the CIFS server at first. This is now resolved, but it would have been nice if you had developed that part of the installation process in one of your articles (unless you never encountered any such problem, in which case you don’t know your luck). Comparatively, accessing the same shares as an NFSv4 filesystem from Linux was painless, even setting up the UID/GID mappings went smoothly. I haven’t tried from a Mac yet, but it should be even more painless.

    Anyway, I actually wanted to warn about a small usage issue that I have bumped into during my tests, and maybe someone could help me understand it, and maybe even fix it.

    I’ve always planned this NAS to mainly hold three kinds of file: my photos (those 14 megapixel RAW shots sure take up some space), my music & movies, and finally games. As hinted earlier, my main computer is a Windows/Linux dual-boot setup, and I use the Microsoft OS to play games. That’s roughly it. The issue that has me puzzled is that, even though I’ve bought a small D-Link Gigabit switch serving its purpose perfectly when playing music or movies, moving files around, or when browsing & managing my photos, launching applications from the NAS has proved to be quite a problem. It works, mind you, but with a very annoying symptom: a game that used to launch in about 10 seconds now takes almost 1 minute to be available. These launch times were measured in a simple manner: start a stopwatch as I double-click the executable, stop it as I reach the main menu. After that, loading times in the game itself are very much the same as what I used to experience when it was installed on my internal 320GB hard drive. I have looked around for clues, but the info I gathered was either outdated (from when Gigabit home networks were neither practical nor possible), or irrelevant to my situation (i.e. an application server that actually *runs* the applications and streams them to clients).

    Alone and facing these cold machines, I poked away at a few settings, without any of them making any difference: compression=on/off, atime=off, deactivating Windows security warnings (i.e. running an executable from a network drive), etc. There must be some technical aspect to loading-an-executable-into-memory-and-then-running-it that I am not aware of, because I don’t see why it would take that much longer when done off of a network drive (again, especially when only the *loading* part is actually crippled, the running part is *fine*).

    But overall, this NAS delivers: a few different tools gave me between 50 and 75 MB/s throughput, managing zpools and zfs datasets is amazingly easy and satisfying, though I’ve yet to fully tame those beasts, and the price is much lower than any comparable commercially available product. It could definitely become the be-all-end-all when (and if) I solve that annoying loading times issue.

    One interesting benchmark came from the use of the Intel NAS Performance toolkit available at (warning, this is a WinXP-32-bits-on-Intel-processor-*ONLY* tool). You can test different workloads, and while the best one gave me 77MB/s, some were deep down in the 5MB/s area. From what I understood, such a difference can be blamed on sequential vs. random file access. The whitepaper was also interesting, and one part of it actually makes me think that yes, running applications from a NAS is a different beast altogether: “Consistent with mainstream personal computer usage, we assume that operating system files and executables will be kept locally on the client. (…) Therefore, our traces include only those transactions targeting data/media files and initiated by the application being traced. System generated accesses and accesses to executable program files are excluded.”

    So once again, thanks for all your helpful posts on the matter, and see you in a year when I take a look back on my experience.

    Laurent.

  19. Hi Laurent,

    Glad you enjoyed building your own ZFS-based NAS and mostly got it working nicely now.

    You’re right, I didn’t use Windows clients much to access the ZFS-based NAS so I haven’t encountered and found solutions to Windows-related access/permissions issues. Maybe later :)

    So far I’ve mostly accessed the NAS from a Mac.

    The problem you mention of the slow-loading game is curious. My first question was going to be to ask if you had (1) Gigabit NICs on both computers, (2) Gigabit switch and (3) Category 5e or Category 6 ethernet cabling, but as you mention speeds of 50+ MBytes/sec then that seems OK.

    So the issue looks like something to do with the latency differences between loading directly off local hard drive and the NAS across the network. A pure guess would be that the game load process involves the loading of many little files and that this could cause a small speed difference for each file load to be magnified. But it’s a pure guess.
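    To show how such a guess would play out, here is a toy model of load time as a fixed per-file cost plus the time to stream the data. All the numbers (file count, sizes, latencies, throughputs) are invented for illustration and are not measurements of any real game or NAS.

```python
def load_time_s(n_files, total_mb, per_file_latency_ms, throughput_mb_s):
    """Toy model: fixed cost per file opened, plus time to stream the data."""
    return n_files * per_file_latency_ms / 1000.0 + total_mb / throughput_mb_s

# Hypothetical game start-up: 5000 small files totalling 300 MB.
local = load_time_s(5000, 300, per_file_latency_ms=0.5, throughput_mb_s=60)
nas = load_time_s(5000, 300, per_file_latency_ms=8.0, throughput_mb_s=50)

print(f"local disk: {local:.1f} s")
print(f"NAS share:  {nas:.1f} s")
```

    In this model the raw throughput of the two paths is similar, but the per-file overhead dominates once there are thousands of files, which would match a large slowdown at launch and little difference when loading big level files afterwards.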

    You could perhaps fire up the network diagnostics on both ends of the network next time you load the game and see what speeds you see on each end sending and receiving.

    Then compare those speeds with the disk activity diagnostics when loading the game directly from local hard drive. Doing this might yield some clues. Let me know what you discover.

    Have fun!

    Cheers,
    Simon

  20. Hi,

    I have been using OpenSolaris and ZFS for 18 months now. I found the ACL descriptions in the Sun documentation totally confusing. The following commands have been working for me to get my ACLs under control. The last 6 chmod lines are the key to success: after I executed them, all my ACL problems were solved :)

    I am no expert, so please don’t just copy and paste these commands. Understand them first.

    I have 2 ZFS filesystems. The first, “/tankbig/tank”, is readable by everyone, but only the file owner can write. The second, “/tankbig/share”, is readable and writable by everyone.

    # Set the basic Unix permission bits on both trees
    chmod -R 750 /tankbig/tank
    chmod -R 777 /tankbig/share

    # Give both trees the same owner and group
    chown -R brian:shareme /tankbig/tank
    chown -R brian:shareme /tankbig/share

    # tank: replace the ACL with full control for the owner (A=), then add read/execute entries for group and everyone (A+)
    chmod -R A=owner@:full_set:file_inherit/dir_inherit:allow /tankbig/tank
    chmod -R A+group@:read_set/execute:file_inherit/dir_inherit:allow /tankbig/tank
    chmod -R A+everyone@:read_set/execute:file_inherit/dir_inherit:allow /tankbig/tank

    # share: same pattern, but group and everyone also get full control
    chmod -R A=owner@:full_set:file_inherit/dir_inherit:allow /tankbig/share
    chmod -R A+group@:full_set:file_inherit/dir_inherit:allow /tankbig/share
    chmod -R A+everyone@:full_set:file_inherit/dir_inherit:allow /tankbig/share

    Brian.

  21. Hi

    I found your blog articles to be a very good source of information. The concrete examples pushed me over the edge.

    Yet I wanted to correct a slight error regarding ECC and Intel chipsets.

    I am writing this on my OpenSolaris 2009.06 box,
    which is using an Intel ECC motherboard: Gigabyte EX38-DS4 (this Express 38 chipset can be used with either ECC or non-ECC RAM) with 6 SATA-2 ports on board, 2 PCI-Express slots and 2 GigE network ports
    with 4 GB of ECC RAM
    CPU CoreDuo E2180
    6* Samsung F1
    1* Maxtor 120 (dual boot XP and Solaris)
    2* DVD-Writer
    Corsair 650W PSU (a tad overboard, but I may upgrade the video card)
    Nvidia Quadro NVS-280 PCI card
    1 DELL SATA-SAS card to get 4 additional SATA ports in one PCI-Express slot (could add another 8-port SATA-SAS in the other)
    in an Antec 300 case, which is one of the nicest cases I have ever used and extremely well ventilated (with 2 Noctua 12cm fans on top of the 2 factory-mounted 14cm fans)

    The EX38-DS4 is said to be obsolete despite its advanced features, so I got it cheap. Most of my hardware is Intel/CoreDuo or Intel/Quad, so no AMD this time. I am unaware of a motherboard with EX38 and integrated video.

    This is mostly commodity hardware. The plan calls for a second machine built from left-over components (perhaps even cheaper with normal RAM and Z1) to be synchronized at 300km.

    I also want to offer to the poor souls battling with X11 on OpenSolaris the following:

    If I was to sum up my experience of OpenSolaris:
    A. Getting a copy on Cd-Rom was tough, but creating a bootable USB-Stick was simple enough and works even better (thanks to http://www.genunix.org)
    B. Connecting to Internet was easy (clicking on the appropriate icon…)
    C. Updating the packages was trivial using the “add more software” button
    D. Transferring files from XP computers on the network was painless using the File browser Nautilus to browse the network
    E. Setting up the HP LaserJet networked printer required a dive into the book “OpenSolaris Bible” by Solter, Jelinek and Miner to install and enable CUPS
    BUT
    F. Getting the display to run 1920*1080 in Gnome was a hair-tearing ordeal.

    ===== beadm plug ======
    Before I lost my sanity (and after 2 full reinstalls), I managed to understand how Boot Environments work, and what a difference they make. The OS utility beadm is a godsend!

    beadm allowed me to save full bootable setups, each more refined than the last.
    When an Nvidia driver too recent for my hardware or an incomplete xorg.conf file prevented the system from completing the boot, I used GRUB to boot the previous working environment and then mounted the non-functioning boot environment to discover the errors stored in the X log file, then modified the xorg.conf file in the test environment before unmounting and attempting another boot.
    ===== end of beadm plug =====

    So, readers with Nvidia NVS-280 may follow the steps I had to take:
    1. remove the nvidia graphics package
    2. install the 173.14.20 drivers
    3. generate a default xorg.conf with nvidia-xconfig
    4. modify the xorg.conf file to match the monitors resolution (discovered using Powerstrip under Microsoft XP)
    # LG 2361V-FP
    DisplaySize 510 287
    HorizSync 30.0 - 83.0
    VertRefresh 56.0 - 75.0
    ModeLine "1920x1080" 173.605 1920 2056 2264 2576 1080 1081 1084 1118 -hsync +vsync
    ModeLine "1600x1024" 136.617 1600 1720 1888 2144 1024 1025 1028 1060 -hsync +vsync
    ModeLine "1280x1024" 107.981 1280 1344 1456 1688 1024 1025 1028 1066 +hsync +vsync
    ModeLine "1024x768" 65.027 1024 1064 1200 1344 768 771 777 806 -hsync -vsync
    ModeLine "800x600" 39.971 800 856 984 1056 600 601 605 628 +hsync +vsync
    5. reboot
    6. use the System/Preferences/Screen Resolution application to change settings.
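    As a sanity check on the ModeLine values in step 4, the refresh rates they imply can be computed from the pixel clock and the total horizontal and vertical counts (the last number of each timing group). This is a small sketch of that standard arithmetic; the dictionary simply copies the figures listed above.

```python
# name: (pixel clock in MHz, horizontal total, vertical total)
MODES = {
    "1920x1080": (173.605, 2576, 1118),
    "1600x1024": (136.617, 2144, 1060),
    "1280x1024": (107.981, 1688, 1066),
    "1024x768": (65.027, 1344, 806),
    "800x600": (39.971, 1056, 628),
}

def horiz_khz(clock_mhz, htotal):
    """Horizontal scan rate: pixel clock divided by pixels per scan line."""
    return clock_mhz * 1000.0 / htotal

def refresh_hz(clock_mhz, htotal, vtotal):
    """Vertical refresh: pixel clock divided by pixels per full frame."""
    return clock_mhz * 1e6 / (htotal * vtotal)

for name, (clock, htotal, vtotal) in MODES.items():
    print(f"{name}: {horiz_khz(clock, htotal):.2f} kHz, "
          f"{refresh_hz(clock, htotal, vtotal):.2f} Hz")
```

    Every mode works out to roughly 60 Hz, inside the HorizSync (30 to 83 kHz) and VertRefresh (56 to 75 Hz) ranges declared for the monitor, which is what X needs in order to accept them.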

    Next steps:
    G. Finalize the Raid-Z2 setup
    H. Use beadm to create a copy of the stable BootEnvironment on the ZFS storage (to protect from a failure of the boot disk)
    I. Connect the UPS
    J. Consolidate all pictures and music onto the system
    K. Create ISO from DVD
    L. Serve media to the household: Pics, Music, and films
    M. Upgrade the router to a trunking-capable router
    N. Set up remote server
    O. Automate sync between servers

    In the end, a ZFS home server is not for the faint of heart. Unless you have an OS background and you NEED large capacity with multiple HDs and you are UTTERLY convinced by ZFS, a commercially available NAS such as a QNAP TS-219 RAID-1 NAS with 2*1TB is very easy to set up, and eats VERY VERY VERY little power.

  22. Thanks, and I’m glad these notes helped you get motivated to try it out.

    Which ‘slight error regarding ECC and Intel chipsets’ did you mean?

    Simple 2-drive mirror NAS devices, such as you mention, might appeal to some people, especially on price, ease of use, and power requirements, but these devices lack automatic detection & repair of latent errors (bit rot) when a file is read back. And they have no scrub capability which can run on a regular basis.

    I agree that the Boot Environment functionality in OpenSolaris is a great feature, and I also use it, and it has saved my system when it failed to boot after performing an ‘update all’ to update all packages. Like you say, after getting a boot failure situation, one simply selects the previous boot environment from the GRUB menu. Simple and effective!

    ZFS admin, although reasonably simple, is a learning curve and most people are not technically-minded: they just want something that plugs in and works. One day, when one of the existing NAS manufacturers realises the enormous appliance potential for OpenSolaris, then hopefully the general public can also benefit from robust data storage protection without needing sys admin skills. Maybe Oracle will delve into this market?

    In your ‘next steps’ under item H, you mention about protecting the boot disk from failure. You might also wish to consider using a mirror for your boot pool to make a more robust boot environment. Then you have the dual protective features of (1) a mirror and (2) Boot Environments combined. If that sounds interesting, see here:
    Home Fileserver: Mirrored SSD ZFS root boot

  23. Hi
    Just got the time to read your answer, Simon.
    Thanks for taking the time.

    Indeed, I have unearthed another 120G IDE drive which is likely to end as the boot mirror.
    My system is dual boot. I guess I will format the space on the second drive that corresponds to the XP partition on the first drive as FAT-32, so I can nonetheless easily transfer data between OpenSolaris and XP.

    The point about Intel chipset and ECC was about another of your post “http://breden.org.uk/2008/03/02/home-fileserver-zfs-hardware/”
    in which you wrote:

    #
    Simon on January 7th, 2009 at 22:38
    … Also, ECC memory support was important to me, and I don’t recall finding any suitable socket 775 compatible motherboards that supported ECC — but don’t quote me on that, as it’s a while back now…

    But Intel have got some nice processors that support CPU frequency scaling for lower idle power consumption

    In your blog, as in many discussing cheap systems supporting ECC, the Intel chipset EX38 is often forgotten.

    But it is true that only a few manufacturers made proper motherboards for it.

    Before settling on the Gigabyte EX38-DS4, I tried the Asus P5 WS, which claimed to be server-class with a PCI-X slot.
    This slot does not run at more than 100MHz.
    Furthermore, the BIOS is badly programmed. It does not recognize expansion boards which have too large a ROM, thus preventing the use of some hard disk adapters.

    Nicolas

  24. Hi Nicolas,

    Good luck with the mirror — it’s a really good idea to use a mirror for OS boot.
    It will be easier if you use the whole drive for use within a mirror, as partial drive usage is not recommended within the ZFS Best Practices due to admin complexity etc, but maybe it will work well for you…

    Thanks for the info on the Intel EX38 chipset, as I was not aware of that chipset. I originally did my hardware research around December 2007 for the current ZFS NAS I use, but when I did a quick search for EX38 and related motherboards I see dates from around mid-February 2008 onwards, so maybe this chipset & related motherboards became available a little after the time I researched?

    At that time, it was the case with Solaris that if you wanted low power support using CPU frequency scaling for a 24/7 NAS then it was better to go for Intel processors, but for these Intel processors ECC support was hard to find within cheap commodity hardware (read: non-server stuff, except EX38 perhaps). Whereas with AMD processors you got ECC support with almost every cheap processor they made, but these cheap AMD processors generally didn’t support CPU frequency scaling, so for 24/7 NAS use, power costs became an issue. Things have moved on now though, and I look forward to building a new NAS with the more modern hardware available soon… However, turning the NAS off when not used is a simple solution. ;-)

    Yes, PCI-X seems old tech now, and PCIe gets much better speeds and is also available on almost every motherboard.

    Cheers,
    Simon

  25. Hi Simon,

    There is a very interesting study on DRAM Error rates at http://blogs.zdnet.com/storage/?p=638
    which comments on a Google study. DRAM error rates: Nightmare on DIMM street

    It looks like ECC will be advisable for computers where important work is done or stored.

    On a related subject, have you seen the storage pod at http://storagemojo.com/2009/09/01/cloud-storage-for-100-a-terabyte/ and https://www.backblaze.com/petabytes-on-a-budget-how-to-build-cheap-cloud-storage.html

    Interesting, isn’t it?

    best

    Nicolas

  26. Hello Guys,

    I am VERY interested in your home-nas-zfs solutions and even more so in your success using iScsi with MacBookPro’s

    using 2009-06 opensolaris, and 10.6.1 snowleopard, and globalSAN 3.3.0.43 initiator — I did get it to work, and the persistant options works through a reboot — but — when I close the lid on the MacBook/Pro it hangs the sleep and I have to force-shutdown and reboot

    Any ideas?

    I’m planning on using iSCSI only to maintain a recent full backup of the laptops, via wireless of course.

    thanks!

    Al;

    this report/blog has been _extremely_ interesting!!

    P.S. Can one ONLY serve a ZFS volume via iSCSI? Should I be worried that the backing store is one great honkin’ file? Can I really trust in ZFS to protect all those bits in one file? (Or is this just my old-school superstitions? :)

  27. Hi Nicolas, that’s an interesting report regarding memory errors. It doesn’t surprise me much, and I agree with Robin Harris’ conclusion that there is likely to be much more interest in the coming years in memory error detection and correction technology, like ECC.

    Whilst single bit error detection and correction in existing ECC is great to have, it seems highly desirable to have multi-bit error detection and correction as standard equipment for computer systems, and so perhaps there will be increased interest in memory correction technologies like IBM’s ‘chipkill’, which works by ‘scattering the bits of an ECC word across multiple memory chips’. See: http://en.wikipedia.org/wiki/Chipkill

    Chipkill sounds like RAID for memory, where data is striped across the storage devices, memory chips in this case instead of hard drives. Sounds like it makes a lot of sense.

    Quoting from that wikipedia page, there are other alternatives to IBM’s Chipkill tech from other hardware vendors:
    “The equivalent system from Sun Microsystems is called Extended ECC. The equivalent system from HP is called Chipspare. A similar system from Intel is called SDDC.”

    This REM (REliable Memory) from Divo Systems looks interesting. From their page at
    http://www.divo.com/page4.html they say:

    • Superior protection. For example REM-V6 corrects up to 12 bits in a 36-bit word! Also it allows up to four memory chips to fail completely without any affect on the SIMM’s behavior.
    • Protection logic resides on the SIMM itself that makes it compatible with the regular PC memory and can be used in any existing memory slot. No custom adapters, special motherboards or time consuming installations required. It is just plugged in as any regular memory SIMM.
    • Different levels of protection and various custom error correction schemes are possible.

    Maybe Divo were too far ahead of the pack with this technology at the time? And the price was perhaps too high? And awareness of memory errors too low?

    I saw the stuff on Backblaze, but from what I remember, it looks like it will suffer from overloading and a poor choice of components and file system. If they had used LSI SAS/SATA controllers, Solaris and ZFS instead of the Linux/JFS combo, I think it would have been much more interesting and would provide more solid data integrity and protection. And they didn’t use ECC memory???

    Anyway, forget my quick critique on it, read a report from someone who has spent more time looking at the Backblaze design:
    Some perspective to this DIY storage server mentioned at Storagemojo

    But I think they are doing something important — showing that large amounts of cheap storage can now be a reality, just perhaps not using their current choice of hardware/OS/file system.

    Cheers,
    Simon

  28. Hi Simon,

    I’ve found your blog very helpful and I’m now looking to take the plunge and build myself a NAS.

    One aspect I’m most interested in is the power management side. I wondered if you or anyone had got any further with the WOL feature. I understand with your current NIC you found it didn’t work, but maybe people have had more luck with other cards?

    Also, I’m thinking of holding off for the new Intel i3 processors. They should take advantage of the power saving features in Nehalem and have a lower max power usage. However, from your experience, would ZFS benefit from a more powerful processor?

    David.

  29. Thanks David. I never got suspend/resume to work with this system, so I never followed up on WOL. Maybe I’ll retry it.

    I expect WOL should work with an Intel GbE NIC, but I never tried it.

    You can see quite high CPU utilisation when doing heavy transfers with the processor I have (Athlon X2 BE-2350), presumably the combined network and RAID checksum/striping/parity calculations, but it’s not been a problem, as the NAS spends most of its life idle or doing moderate reads and transfers.

    Good luck and let us know how you get on with your choice of components. One thing, check out ECC memory.

    Cheers,
    Simon

  30. Cost/GB on the 2TB drives: it should be calculated from the entire system cost. If your system cost $1k for the server and you can only fit 8 drives, you have to add $125 to the cost of each drive when figuring the cost/TB. So make sure you take maximum capacity into account when calculating cost/TB. I’m hoping to put six 2TB drives in an HP DL185 for my project, which will cost about $1k for the server, or $2200 with drives. Dual parity gives about 7.5TB usable, so that is $293/TB usable. I will still have 6 bays free. Let’s just hope the drivers work. If I stuck with the cheaper 1.5TB drives, I would only have 5.5TB usable at a cost of $1720, which comes to $312/TB.

    PS I just took off 500GB to allow for the difference between advertised and actual space, so don’t correct me. It’s just an estimate.
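
    The arithmetic in this comment can be reproduced with a minimal sketch like the one below. The function names are made up for illustration; the figures (total system cost, six drives, dual parity, and the rough 0.5TB allowance for advertised versus actual space) are the ones given in the comment, and the capacity formula is only the same back-of-envelope estimate, not what ZFS would actually report.

```python
def cost_per_usable_tb(total_cost, drives, drive_tb, parity_drives=2, overhead_tb=0.5):
    """Return whole-system cost in dollars per usable terabyte.

    overhead_tb is the commenter's rough allowance for the gap between
    advertised and actual capacity; it is an estimate, not a measured value.
    """
    usable_tb = (drives - parity_drives) * drive_tb - overhead_tb
    if usable_tb <= 0:
        raise ValueError("no usable capacity with this layout")
    return total_cost / usable_tb


def chassis_share_per_drive(server_cost, bays):
    """Spread the server cost evenly over every drive bay it can hold."""
    return server_cost / bays


if __name__ == "__main__":
    # Six 2TB drives, $2200 all-in, versus six 1.5TB drives, $1720 all-in.
    print(f"2TB drives:   ${cost_per_usable_tb(2200, 6, 2.0):.0f}/TB usable")
    print(f"1.5TB drives: ${cost_per_usable_tb(1720, 6, 1.5):.0f}/TB usable")
    print(f"Chassis share with 8 bays: ${chassis_share_per_drive(1000, 8):.0f} per drive")
```

    Because the server cost is fixed, the larger drives come out cheaper per usable TB here even though they cost more each.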

  31. Can I ask what settings you’re using for the ECC configuration?

    I’m using an ASUS M4N78 Pro, and in addition to DRAM ECC enable, there are:
    DRAM SCRUB REDIRECT
    4-Bit ECC mode (chipkill – I read somewhere that AMD recommended enabling if using 4 DIMMS)
    DRAM BG SCRUB
    Data Cache BG SCRUB/L2/L3 Cache BG SCRUB

    Do you have these options on yours as well?

    Thanks

  32. Why do people buy the cr@p drives? Nothing over 500GB is worth a damn or reliable.
    I use Supermicro or ASUS server motherboards with ECC RAM. Only slightly more than consumer junk.
    2×4core 54xx and 32GB RAM + case for less than $1500

    LSI/Sun HBAs $20 on eBay. RE2, RE3 or NS drives $60 or less. SAS multiplier $300
    Anything less than 12 drives isn’t a NAS imo

  33. I came across your article regarding ZFS and the Samsung 1.5TB EcoGreen drives. I plan to use these in a mirrored pool array (until I can afford more disks) with OpenSolaris.

    Thanks for blogging about your experiences all around. :)

    My biggest concern is the RealTek 8112L NIC on the motherboard I’ve chosen. The actual ASUS board I’m using is on Sun.com’s HCL for Solaris, so that’s good news. But we know how terrible RealTek chipsets CAN be.

    That being said, I find it funny how many users running OpenSolaris look to that AOC-USAS-L8i card. I guess just because it’s pretty well priced. I’m eyeing it for future upgrades :)

  34. Hello Simon,

    I was wondering if there have been any updates in the hard drive market since your January update.

    All the best.

  35. Hi there,

    As of now (2010-05-27), I’m putting my trust in Samsung HD203WI hard drives, which is a 2TB drive.

    In the consumer 2TB sector right now, this drive leads the pack in that (1) it has consistently high newegg.com user feedback, (2) apparently high reliability, (3) one of the lowest prices, and (4) seems to run cool.

    This drive is a 3.5″ 4-platter, 500GB per platter drive, and spins at 5400 RPM instead of the usual 7200 RPM rotation speed. It has a min transfer speed of around 50MB/s on inner tracks, max speed of around 110MB/s on outer tracks, and an average transfer speed of around 80MB/s. In summary, for a SOHO NAS, this drive runs cool, quiet, is cheap, provides high capacity, and reasonable all-round performance, but is not the fastest drive on the block. Its performance is good despite its low rotational speed, thanks to its high areal density. Also, it is possible to alter (reduce) the error-reporting time, which is interesting for RAID users as it helps prevent drives being kicked from the array, although this needs to be done at each boot, i.e. changes are not persistent.

    I bought some of these drives, after reviewing all the available alternatives. You will find that other manufacturers of consumer 2TB drives are struggling to match the reliability record of this drive.

    Cheers,
    Simon

  36. I have 32 x ST32000542AS running in a zpool with no issues, and the next one is being built now.
    I have stopped using all large drives, due to their inability to survive a scrub.
    The WD1600AAS is used for OS, as it has very good performance.

    This is all on Supermicro SAS chassis and hardware.
    I’ve built quite a few of similar size to this, with the same disks, and had better results with OpenSolaris than Windows.

    Although OpenSolaris is fussy about disks, and performance isn’t as good, I still prefer it for ZFS over the BSD version.
    After ZFS, Linux file systems are simply unbearable. Try a chmod or rm -rf on 40TB of files !!

    Mark.

  37. > I have stopped using all large drives, due to their inability to survive a scrub.

    What do you mean by that? Zpool scrub killed your drive(s)?

  38. Juergen Nickelsen on July 30th, 2010 at 07:01

    To replace my aging and comparatively smallish FreeBSD home server, I turned to (Open)Solaris and ZFS, too.

    The old server was based on a Fujitsu-Siemens Activy 300 HTPC. It had a Celeron CPU with ~700 MHz, 256 MB RAM, two 160 GB Samsung disks. By now this had become a little tight disk-space-wise. I was already thinking about replacing it, as it had been running for over 4 years, with the hardware except the disks even a bit older. When the machine suddenly stopped working one day, and came back only at the second hard reboot, I began the replacement in earnest.

    The new server is built around a Zotac mini-ITX board with NVIDIA ION chipset and Atom 330 CPU. I used a Sharkoon Rebel 9 case, which has space for up to 9 disks, and (for the beginning) 4 EcoGreen Samsung disks with 1.5 TB each. 4 GB RAM (the supported maximum) is okay, although not really abundant. No ECC support, unfortunately. OS is OpenSolaris snv_132, because that was current in February when I built the machine, and I have not yet updated.

    The disks are in two single-mirror pools. I plan to extend the non-root pool by one or two other mirrors once the disks get fuller, although I am not sure if I should rather keep one or two spare disks instead for safety. At the moment I am limited by the 4 SATA ports of the on-board host adapter, as the single PCIe slot is occupied by a network card for the DMZ.

    Once I have moved the networks (DMZ and internal) to VLANs on a single wire at the on-board interface, I can remove the network card and put in another SATA adapter. But there is no rush, as the disks are far from getting full soon. After all, disk space is now nearly 10 times as large as with the previous machine.

    Like the previous one, the server runs a lot of services for my wife and me and the neighbour family: DHCP, DNS (authoritative for my domains plus local resolver), SSH, FTP, Web server, HTTP Proxy, Mail with MTA and IMAPs including a few small mailing lists, Webmail interface, News, SMB (for the PVR), NFS (for my laptop), NTP. Even with all this, the machine is only very lightly loaded, so CPU power, with dual core @ 1.6 GHz and hyperthreading, is plenty. It was nowhere near enough for Dedup, though, which I had enabled for a short while: dramatically slowed disk I/O with long stretches of 100% CPU usage showed me the limit. Without Dedup, performance is very fine. ARC size is a little over 1 GB.

    I do not use OpenSolaris’s auto-snapshot service, because with its different sets for each interval it makes unnecessarily many snapshots. I have written my own script to maintain only one set of regular snapshots, which are then thinned out as they get older. This is configurable per file system; for instance, for the user file systems I make them every 10 minutes and keep those for two hours, then keep them at a distance of an hour for a day, at a distance of 6 hours for 3 days, daily for two weeks, and weekly for six weeks; on other file systems with bigger distances as appropriate.
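
    The script itself is not shown in the comment, so the following is only a rough sketch of how such a thinning policy could be expressed, not the author’s implementation. The tier table mirrors the schedule described above for user file systems; the function name, the bucketing approach and the choice to keep the earliest snapshot in each time bucket are all assumptions made for illustration. It only decides which snapshot timestamps to keep; actually creating and destroying snapshots is left out.

```python
from datetime import datetime, timedelta

# (maximum age, spacing between kept snapshots), as described in the comment.
TIERS = [
    (timedelta(hours=2), timedelta(minutes=10)),
    (timedelta(days=1), timedelta(hours=1)),
    (timedelta(days=3), timedelta(hours=6)),
    (timedelta(weeks=2), timedelta(days=1)),
    (timedelta(weeks=6), timedelta(weeks=1)),
]

EPOCH = datetime(1970, 1, 1)  # fixed reference so bucket boundaries never move


def snapshots_to_keep(times, now):
    """Return the subset of snapshot timestamps the policy would retain.

    Each snapshot falls into the first tier whose maximum age covers it.
    Within a tier, time is cut into fixed buckets of the tier's spacing and
    only the earliest snapshot in each bucket is kept (an assumed rule).
    Snapshots older than the last tier are not kept at all.
    """
    keep = set()
    seen_buckets = set()
    for t in sorted(times):  # oldest first, so the earliest in a bucket wins
        age = now - t
        for tier_index, (max_age, spacing) in enumerate(TIERS):
            if age <= max_age:
                bucket = (tier_index, (t - EPOCH) // spacing)
                if bucket not in seen_buckets:
                    seen_buckets.add(bucket)
                    keep.add(t)
                break
    return keep


if __name__ == "__main__":
    now = datetime(2010, 7, 30, 12, 0)
    # One snapshot every 10 minutes over the last seven weeks.
    times = [now - timedelta(minutes=10 * i) for i in range(7 * 7 * 24 * 6)]
    kept = snapshots_to_keep(times, now)
    print(f"{len(times)} snapshots taken, {len(kept)} kept")
```

    Anchoring the buckets to a fixed epoch rather than to the current time means a snapshot that survives one run will not be dropped by the next run merely because the clock moved on.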

    The previous server was a big success for five years, and I guess the new one will be one, too. Unlike the previous one, the hardware platform is more flexible: I think I will add more disks as necessary, and perhaps replace the motherboard some time to get more RAM and more CPU power.

    Going with OpenSolaris and ZFS was a very good move. I feel much more confident with all data being on weekly scrubbed mirrors, and I really cherish regular snapshots.

  39. How are the Samsung HD203WI drives holding up for you Simon since installing them?

    I’m currently shopping around for possible hardware. If you have any recommendations for current AMD processors/motherboards, I’d appreciate any advice. Obviously ECC memory and low power use. It seems rather hard to find a mobo like what you recommend that is verifiable in being supported by opensolaris. I did find a dealer to buy the mobo you are using, but below is a possible option.

    Mobo I would consider if it works out of the box:
    ASUS M2N68-AM PLUS: http://www.newegg.com/Product/Product.aspx?Item=N82E16813131613

    Ironically, I keep finding Micro ATX mobos to put in a rather large Fractal Design Define R2 case made in Sweden.

  40. Hi Brian, the HD203WI drives have worked without any flaws so far – no read, write or checksum errors on scrubbing the storage pool.

    It’s difficult to recommend something specifically without having tried it personally, but I would have thought an AM3 mobo with an AMD Athlon II X2 ‘e’ low power processor should be fine. The main thing for the mobo is to check out the chipsets to ensure they have good driver support – try looking on the OpenSolaris HCL.

    The mobos I used are old now: (1) Asus M2N-SLI Deluxe (NAS) and (2) M2N-E (backup server).

    If you get one, let me know how that ‘Fractal Design Define R2 Black Pearl’ case turns out.

    Cheers,
    Simon

  41. >> I have stopped using all large drives, due to their inability to survive a scrub.

    >What do you mean by that? Zpool scrub killed your drive(s)?

    More of a clay pigeon shoot.

    Fault Management sees too high an I/O error rate on disks during the scrub, and drops them out of the zpool.

    When there are insufficient working drives left, the zpool goes offline.

    A reboot “may” bring it back, but the inability to scrub to fix or detect errors leads to slow decay and eventual failure and bye bye to 30TB of data.

  42. Hi Simon,

    I have to agree with the others who post here: brilliant site and source of inspiration! Actually to such an extent that during the spring I decided to go with OpenSolaris for my home file storage setup, mainly due to the data integrity provided (read ZFS).

    I paused my project during the spring to wait for the next release, 2010.3, which never came and never will, since OpenSolaris has now been buried by Oracle.

    My question to you is now what your thoughts are and what you would do in my shoes – like so many others.

    Would you consider setting up a storage server today based on the last OpenSolaris release, or rather choose another *nix platform?

    I will mainly use the server for file storage that is as “safe” as possible, with uptime and data integrity in mind.

    Will the current state of OpenSolaris and ZFS be sufficient and “safe”, given that the project is dead and the community might diminish in the years to come?

    What are your thoughts and advice for people like me, who right now hesitate to take the route with OpenSolaris (or the Solaris 11 Express to come)?

    Would appreciate your input and advice.

    Cheers!

    // Pelle

  43. Hi Pelle,

    Thanks for the compliments!

    I haven’t really followed the situation closely regarding Oracle and OpenSolaris, but it does seem a great pity that they have dropped OpenSolaris, even if they are supporting Solaris. I will have to read more regarding the situation about Solaris to be able to answer your question properly.

    Earlier in 2010 I upgraded both my OpenSolaris systems (NAS & clone) to build 134, and from what I read earlier it seems that it was build 134 that was destined to become OpenSolaris 2010.03 or whatever they were going to call it.

    I can say from my own personal experience that build 134 seems to have been very good and I have not experienced any serious bugs with this build, so I can recommend its use if you want to consider using it.

    For the meantime, until I discover what Oracle’s position is on home users using Solaris (i.e. will they charge & will they allow software updates?), I will continue to use build 134.

    However, if Oracle chooses to charge home users/non-profit organisations for using Solaris, even if only for running a home NAS, then I will definitely consider using other operating systems, such as FreeBSD, which also has a ZFS implementation that was ported from the Sun code. This would be a great pity, as I would rather know I am using the reference implementation of ZFS from Sun/Oracle, but that’s a decision I will have to make later, depending on what I discover.

    I heard somewhere that some developers have forked or will fork the OpenSolaris code, but I haven’t looked into this much yet. I believe the OS name is Illumos (see http://www.illumos.org/). If this really happens and has proper developer support, it might turn out to be a good alternative to Solaris, assuming that Oracle want to try charging home users for running Solaris.

    Also, there is OpenIndiana, another fork of the OpenSolaris code. I’m unsure of the exact relationship with Illumos, but you might find the following links interesting:

    OpenIndiana site
    OpenIndiana .iso files for each of their releases
    OpenIndiana downloads page (including instructions for upgrading from OpenSolaris)
    OpenIndiana twitter stream

    I hope this answers your points reasonably well.

    Cheers,
    Simon

  44. Hi Simon,

    I came across your thread today and it gave me some great insight, as I am currently constructing a Fileserver/HTPC.

    The problem I am facing is that I have now just discovered (thanks to your blog) that the AOC-USASLP-L8e have their components on the opposite sides to most. I am using (or plan to) a Norco 4224, and was wondering how difficult this would be in a case like this.

    I do not have any background in electronics/soldering and was wondering how difficult it would be to remove the backplate?

    Cheers,
    Trident911

  45. Hi there,

    Good luck building that system. Just took a look at the Norco RPC-4224, and it is quite a beast with 24 drive capacity utilising six SFF-8087 mini SAS connectors. Three Supermicro SATA/SAS adapters will handle all those drives if fully populated. It should make a really good video storage NAS with massive capacity (up to 48TB raw capacity using the currently-available 2TB drives).
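
    As a quick sanity check on those numbers, here is a small sketch of the sizing arithmetic. The function name is invented for illustration, and it assumes four drives per SFF-8087 connector and eight ports per adapter card (two SFF-8087 connectors each), which is what the figures above imply.

```python
import math


def plan_chassis(bays, drive_tb, lanes_per_connector=4, ports_per_adapter=8):
    """Return (connectors, adapters, raw capacity in TB) for a full chassis.

    lanes_per_connector and ports_per_adapter are assumptions: one SFF-8087
    connector serving four drives, and an eight-port SATA/SAS adapter.
    """
    connectors = math.ceil(bays / lanes_per_connector)
    adapters = math.ceil(bays / ports_per_adapter)
    raw_tb = bays * drive_tb
    return connectors, adapters, raw_tb


if __name__ == "__main__":
    connectors, adapters, raw_tb = plan_chassis(bays=24, drive_tb=2)
    print(f"{connectors} connectors, {adapters} adapters, {raw_tb}TB raw")
```

    Raw capacity is before any redundancy; usable space depends on how the drives are grouped into vdevs.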

    Regarding the fact that the Supermicro cards use a proprietary UIO backplate, you have two options. You can just unscrew the backplate and leave it off: the mobo is horizontal in the RPC-4224 case, so the cards will normally be fine as they will be vertically mounted into the motherboard. Or, for more peace of mind, you can get a PCI 2.0 backplate from a component store and then reverse the screw mountings to fit the card.

    Cheers,
    Simon

  46. Thanks for the reply Simon, having a really tough time in regards to Hardware decisions.

    It’s something I think we all face as geeks, but I can’t seem to find the ‘sweet spot’ in relation to cost/performance. I obviously want to make this as power efficient as possible, but still retain the performance in regards to parity re-generation etc.

    I have a few Build options listed here:

    http://forums.overclockers.com.au/showthread.php?t=908505

    I would welcome any comments if you had the time to read it. I had a look into Atom, and believe the performance from them will be too lackluster. That then leaves me with looking at i5/910e (can’t for the life of me find them in Australia), or the higher end Xeon/AMD X6.

    Thoughts/Suggestions?

  47. Hey Gents!
    I would like to share my experience of building a 48TB storage system. This hardware has 4x FC ports and 2x gigabit NICs (FC & iSCSI) and is based on Gentoo x64, with SCST as the target platform.

    http://log.momentics.ru/homemade-48tb-enterprise-storage-system

    I’d ask the community to review the article and perhaps to speed up its writing by asking questions, by tweet or mail.
    The article is not complete yet, so I expect to fill the gaps by answering questions.

    BTW: it is all written in Russian, but it contains lots of photos and graphs, so it might still be helpful.

    Yours, momentics
