Home Fileserver: Handling pool errors

Until today I had never encountered even one read, write or checksum error on my ZFS NAS. Today I saw one checksum error coming from the mirrored SSD root boot pool which I have just installed.

Unless you’re using a file system like ZFS (or NetApp / Veritas…$$$), you’ll almost certainly never even know that errors are occurring within your storage system. Ignorance is bliss, apparently, but I’d rather have the information available so I can act on it, determine the root cause of the problem and try to prevent a recurrence.

Unresolved storage errors can often lead to bigger problems later, so let’s fix the problem right now while we can.

Checksum error found

As I’ve just recently installed a pair of mirrored SSDs as my ZFS root boot pool on this NAS, I decided to take a quick status check:

# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

	NAME           STATE     READ WRITE CKSUM
	rpool          ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    c11t6d0s0  ONLINE       0     0     0
	    c11t7d0s0  ONLINE       0     0     1

errors: No known data errors
#
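
To pick out the affected device without eyeballing the table, the config section of the output can be parsed. This is only a minimal sketch in Python, assuming the column layout shown above (NAME, STATE, READ, WRITE, CKSUM) with plain integer counters; the function name and the SAMPLE text are made up for illustration and are not part of any ZFS tooling:

```python
import re

# Matches a device row of the config table, e.g.
#   "c11t7d0s0  ONLINE       0     0     1"
# A trailing note such as "128K repaired" is ignored.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\b")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter."""
    bad = {}
    in_config = False
    for line in status_text.splitlines():
        if line.startswith("config:"):
            in_config = True
            continue
        if line.startswith("errors:"):
            in_config = False
        if not in_config:
            continue
        m = ROW.match(line)
        if m:
            counts = tuple(int(m.group(i)) for i in (3, 4, 5))
            if any(counts):
                bad[m.group(1)] = counts
    return bad


# Config section copied from the status output above.
SAMPLE = """\
config:

\tNAME           STATE     READ WRITE CKSUM
\trpool          ONLINE       0     0     0
\t  mirror       ONLINE       0     0     0
\t    c11t6d0s0  ONLINE       0     0     0
\t    c11t7d0s0  ONLINE       0     0     1

errors: No known data errors
"""

print(devices_with_errors(SAMPLE))  # prints {'c11t7d0s0': (0, 0, 1)}
```

The header row is skipped automatically because READ, WRITE and CKSUM are not digits.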

Oh dear, one checksum error found on the second SSD.

Is this caused by a bug in snv_121?

Update 2009-09-21: After installing the latest firmware/BIOS onto the AOC-USAS-L8i HBA and rebooting, I looked through the SAS configuration menus in the card’s BIOS and found two entries that may well throw light on the checksum errors encountered here. I saw the following:

PHY7
Invalid DWord Count:            0x0000191E
Running Disparity Error Count:  0x00001922

No errors in the logs for PHY6 (the other SSD), so that one looks OK, and this tallies up with the info below regarding the SSD that Solaris shows as generating checksum errors during scrub operations.

Whilst I was updating firmware, I installed an HBA utility called ‘lsiutil’ from the LSI site, as this HBA, although manufactured by SuperMicro, contains the LSI LSISAS1068E ASIC. When I ran ‘lsiutil’, I selected the HBA from the menu, selected the ‘Diagnostics’ option, and then selected the ‘Display phy counters’ option, and for the SSD showing errors in the HBA BIOS, I saw the following output:

Adapter Phy 7:  Link Up
  Invalid DWord Count                                       6,429
  Running Disparity Error Count                             6,401
  Loss of DWord Synch Count                                     0
  Phy Reset Problem Count                                       0
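
The HBA BIOS shows these counters in hexadecimal while lsiutil prints them in decimal, so converting one to the other makes the two readings easier to compare. A quick sketch in Python; the values are copied from the two outputs above, and the dictionary names are just for illustration:

```python
# Counter values for PHY7 as shown by the HBA BIOS (hexadecimal).
bios_counts = {
    "Invalid DWord Count": 0x0000191E,
    "Running Disparity Error Count": 0x00001922,
}

# The same counters as printed by lsiutil (decimal).
lsiutil_counts = {
    "Invalid DWord Count": 6429,
    "Running Disparity Error Count": 6401,
}

for name, bios_value in bios_counts.items():
    print(f"{name}: BIOS {bios_value}, lsiutil {lsiutil_counts[name]}")
# prints:
# Invalid DWord Count: BIOS 6430, lsiutil 6429
# Running Disparity Error Count: BIOS 6434, lsiutil 6401
```

Both tools report counts of roughly 6,400 for each counter, so they appear to be describing the same link-level errors on PHY7.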

So, this would indeed appear to be a hardware problem with one of the SSDs… luckily, due to having a mirror, these problems can be fixed and managed for now! Time to see if other people have experienced checksum errors with OCZ Vertex Turbo SSDs… after all, they are overclocked versions of the standard OCZ Vertex devices, so if their quality control process has not correctly identified failing devices, this might be the result. Google time…

Update 2009-09-16: ZFS bug 6869090, which refers to checksum errors on RAID-Z1 / RAID-Z2 vdevs has been fixed now, and will be available in snv_124 which, judging by the usual release frequency of every 14 days, should be released on Friday October 2nd 2009. Presumably, this means that the packages should also be available to OpenSolaris 2009.06 users via the Package Manager tool.

Also, having discussed the mirror checksum errors with some experienced storage professionals, it is thought that these most probably indicate a hardware problem. The hardware is brand new hardware: the AOC-USAS-L8i SATA controller and the OCZ Vertex Turbo SSDs. I will investigate this more later, as it may not be trivial to determine the exact cause, and it doesn’t seem to be causing the system too much of a problem right now… On the other hand, the mirror checksum ‘bug’ submitted has been accepted here: Bug ID 6880994, Checksum failures on mirrored drives, so we’ll see what happens.

Update 2009-09-05: Bug 11201 – Checksum failures on mirrored drives, has been created to report the checksum errors occurring within mirrors, which is the problem I have seen here.

The way to reproduce the checksum errors with a mirrored root boot pool is to:

  1. As root, run Richard Elling’s zcksummon script
  2. Scrub the mirrored root boot pool

If your system is experiencing checksum errors from the scrub on the mirror, then you will see output generated from zcksummon.

For example, run it 2 or 3 times and then diff the results:

# ./zcksummon -s 1024 > out1
# zpool scrub rpool

When the scrub completes, stop zcksummon and review the output in an editor. If you see the same blocks appearing multiple times, then it sounds like the defect/bug reported in Bug 11201 – Checksum failures on mirrored drives.

Repeat the test a couple of times, redirecting the output to different files and then diff the files. In my case, over three test runs, the only things that differed were the timestamps within the files, so the output from zcksummon was effectively the same across each test run.
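
The comparison between runs can also be scripted. What follows is a rough sketch in Python that assumes each output line starts with a timestamp as its first whitespace-separated field. That is only an assumption, since the zcksummon output format is not shown here, and the sample lines are invented purely to exercise the functions:

```python
from collections import Counter


def strip_timestamp(line):
    """Drop the first whitespace-separated field (assumed to be a timestamp)."""
    parts = line.split(None, 1)
    return parts[1].strip() if len(parts) > 1 else ""


def repeated_entries(lines):
    """Return {entry: count} for entries seen more than once, ignoring timestamps."""
    counts = Counter(strip_timestamp(line) for line in lines if line.strip())
    return {entry: n for entry, n in counts.items() if entry and n > 1}


def same_apart_from_timestamps(lines_a, lines_b):
    """True if two runs list the same entries in the same order."""
    a = [strip_timestamp(line) for line in lines_a if line.strip()]
    b = [strip_timestamp(line) for line in lines_b if line.strip()]
    return a == b


# Invented sample lines (NOT real zcksummon output): "<timestamp> <rest>".
run1 = ["1000 entry-a", "1001 entry-b", "1002 entry-a"]
run2 = ["2000 entry-a", "2001 entry-b", "2002 entry-a"]

print(same_apart_from_timestamps(run1, run2))
print(repeated_entries(run1))
```

For real use, the lists would be read from the saved files (for example with open("out1").readlines()), and the timestamp handling adjusted to whatever the script actually prints.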

Please bear this information in mind when reading the rest of the post below.

Update 2009-09-03: ZFS bug 6869090, introduced since snv_120, seems to be the cause of checksum errors in RAID-Z1 & RAID-Z2 vdevs, but not mirrors — see the announcement of the problems found in this forum post: Problem with RAID-Z in builds snv_120 – snv_123
For the checksum errors occurring with mirrors, this might offer some clues.
Bear that in mind when reading the rest of the text below, which I will keep in place.

I wonder if this checksum error is related to a dodgy SSD, or this apparent problem reported in build snv_121, which I’m currently using by connecting to the http://pkg.opensolaris.org/dev/ package repository? See here:
snv_110 -> snv_121 produces checksum errors on Raid-Z pool

The forum post refers to checksum errors found in RAID-Z vdevs, and I’m using a mirror in this boot pool, but it probably is related to the same problem.

There is another reference to the checksum problem here at solarisinternals.com. It mentions that a fix is expected in build snv_124. In the meantime, what to do? Maybe keep an eye on the opensolaris forum post.

If the problem I’m seeing here is related to this reported problem then it’s probably best to roll back to my previous boot environment which should have different ZFS code in it — snv_110? There seems to be no bug id that I could find yet related to this, but hopefully one will appear soon.

Luckily using ZFS I can roll back, as we have snapshots, and before rolling back, I first want to:

  1. try booting my previous boot environment, as this can be done without changing the file system, and I can then check the OS version with a ‘uname -a’ to be sure it’s a version prior to snv_121
  2. assess which files were modified between the last snapshots by something like: # zfs send -i rpool@snap1 rpool@snap2 | zfs receive backup/snaps

However, it’s a bit disconcerting that this error has slipped through the net, considering that the ZFS developers have a comprehensive test suite. Presumably the test suite has no tests that check for unexpected checksum errors yet, otherwise these problems would have been discovered before releasing the code.

I found some bugs that look like they might be related:

Anyway, I’ve rolled back to build 117 of OpenSolaris 2009.06 (dev package repository), as I hear reports of 118 working for someone else — i.e. no checksum errors. When build 124 is announced, hopefully they will have fixed the problem.

Fix the checksum error

Following the instructions listed at the URL displayed in the message:

  1. Determine if the device needs to be replaced: how? can’t find messages in log files so I’ll continue.
  2. Clear the errors using ‘zpool clear’: ok, let me RTFM…
  3. OK RTFM, so let’s issue: # zpool clear rpool c11t7d0s0
  4. Then let’s scrub the pool and check the status to see if the checksum error has been fixed.

Let’s clear the error and check the status to verify that the error has really been cleared:

# zpool clear rpool c11t7d0s0
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on older software versions.
 scrub: none requested
config:

	NAME           STATE     READ WRITE CKSUM
	rpool          ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    c11t6d0s0  ONLINE       0     0     0
	    c11t7d0s0  ONLINE       0     0     0

errors: No known data errors
#

Good. Now, let’s do a scrub on the pool to see if things are OK now:

# zpool scrub rpool
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h1m with 0 errors on Wed Sep  2 11:12:20 2009
config:

	NAME           STATE     READ WRITE CKSUM
	rpool          ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    c11t6d0s0  ONLINE       0     0     2  256K repaired
	    c11t7d0s0  ONLINE       0     0     5  128K repaired

errors: No known data errors
#

We can see messages about stuff repaired, but also it seems it found some more checksum errors.

So, let’s clear all errors, this time on all drives in the pool, i.e. specify the whole pool and not just an individual drive id, scrub again and see what we find after that:

# zpool clear rpool
# zpool scrub rpool
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h1m with 0 errors on Wed Sep  2 11:29:02 2009
config:

	NAME           STATE     READ WRITE CKSUM
	rpool          ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    c11t6d0s0  ONLINE       0     0     0
	    c11t7d0s0  ONLINE       0     0     0

errors: No known data errors

That looks better. Zero errors. Let’s hope all is well.

In the meantime, as ZFS makes it so quick and easy, let’s do a recursive snapshot of the pool and archive it to a separate box on the LAN. Just in case, you never know…:

On the NAS:
# zfs snapshot -r rpool@20090902
# zfs list -t snapshot | grep 20090902 | grep -v dump | grep -v swap
rpool@20090902                                 0      -  81.5K  -
rpool/ROOT@20090902                            0      -    19K  -
rpool/ROOT/be2@20090902                        0      -  6.95G  -
rpool/ROOT/be3@20090902                     526K      -  7.04G  -
rpool/ROOT/opensolaris@20090902                0      -  2.82G  -
rpool/export@20090902                          0      -    21K  -
rpool/export/home@20090902                     0      -   540M  -
# 

On the backup machine ('zfsnas' is the host name of the NAS):
# zfs create backup/snaps
# zfs set sharenfs='rw=zfsnas,root=zfsnas' backup/snaps
# share
-@backup/snaps  /backup/snaps   sec=sys,rw=zfsnas,root=zfsnas   "" 

On the NAS:
# zfs send -Rv rpool@20090902 > /net/192.168.0.45/backup/snaps/rpool.recursive.20090902

Done. Quick and easy.

Will keep an eye on the pool for the next few weeks to see if checksum errors are a regular occurrence or not.

Perhaps I’ll set up a cron task to email me the results of a daily scrub until we’re happy that things are OK. Will post script here once created.
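
As a starting point for such a cron task, here is a minimal sketch in Python. It assumes that `zpool status -x` prints `all pools are healthy` when nothing needs attention, and that a mail relay is listening on localhost. The email addresses and function names are placeholders, and none of this has been run on the NAS described here:

```python
import shutil
import smtplib
import subprocess
from email.message import EmailMessage

# Text that `zpool status -x` prints when no pool needs attention.
HEALTHY_TEXT = "all pools are healthy"


def needs_alert(status_output):
    """Return True unless the output is exactly the 'healthy' message."""
    return status_output.strip() != HEALTHY_TEXT


def build_message(status_output, sender, recipient):
    """Wrap the status text in a plain-text email."""
    msg = EmailMessage()
    msg["Subject"] = "ZFS pool problem reported by zpool status"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(status_output)
    return msg


def main():
    if shutil.which("zpool") is None:
        print("zpool not found; nothing to check")
        return
    # Placeholder addresses and local SMTP relay (assumptions).
    sender = "root@zfsnas.example"
    recipient = "admin@example.com"
    result = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True, check=False
    )
    output = result.stdout + result.stderr
    if needs_alert(output):
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(build_message(output, sender, recipient))


if __name__ == "__main__":
    main()
```

Because a scrub runs asynchronously, the check would need to be scheduled some time after a `zpool scrub rpool` started from root's crontab. It is possible that a pool that merely needs `zpool upgrade` is also reported by `zpool status -x`; that is not verified here.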

Also, it might be worth checking software versions in case there are drivers or firmware I need to update:

* AOC-USAS-L8i firmware check:
– using:
– latest:
* OCZ Vertex Turbo firmware check:
– using:
– latest:

For more ZFS Home Fileserver articles see here: A Home Fileserver using ZFS. Alternatively, see related articles in the following categories: ZFS, Storage, Fileservers, NAS.


5 Responses to “Home Fileserver: Handling pool errors”

  1. The Adapter link error should not be occurring if all is well with the backplane.
    Changing to an LSI SAS3081E-R controller with the latest firmware, instead of an AOC-USASLP-L8i, made the dodgy backplane unusable.
    The adapter link issues were fixed when the backplane was replaced, but not the timeouts seen using iostat -X -e -n. Only a few of the 34 disks give problems.

    e.g.
    ---- errors ---
    s/w h/w trn tot device
    ……
    0 1 10 11 c4t31d0
    0 2 20 22 c4t32d0
    ……
    0 1 10 11 c4t43d0
    0 3 31 34 c4t44d0
    0 1 10 11 c4t45d0
    0 2 20 22 c4t46d0
    0 1 10 11 c4t47d0

  2. Simon,

    Did you ever post the cron script for monitoring your system?

    Great site, thanks for writing about your experiences in building a NAS.
    I’m a newbie so examples are a great way to learn.
    Thanks
    Kamal

  3. Hi Simon,

    I am also curious if you have a solution for the point 4 of your initial Requirement-List:

    4. report any data checksum errors or drive failures to me by email so I can fix problems quickly

    Right now I am wondering how the hell I get notified about a degraded RaidZ without checking zpool status -v every day. As you suggested – getting an E-Mail would be best…

    Thanks,

    Volekr

  4. Hi,

    Just thought I’d mention that the OCZ SSDs are known for having problems. I have one as my boot drive on my media center running Win 7. I kept getting the ‘your drive needs to be checked for errors’ even though it had shut down correctly. The fix is to update the firmware on the SSD. Since updating I have had no more problems. (fingers crossed)

    Regards

    Ian

  5. Thanks Ian, I haven’t checked the firmware level recently, so I should look to see if they have released a newer version. If I remember correctly, the problem was that this version of SSD was a special souped-up version and OCZ didn’t seem too quick on releasing updates for this particular model. But yes, it’s a good idea to have another look.

    Cheers,
    Simon
