Until today I had never encountered even one read, write or checksum error on my ZFS NAS. Today I saw one checksum error coming from the mirrored SSD root boot pool which I have just installed.
Unless you’re using a file system like ZFS (or NetApp / Veritas…$$$), you’ll almost certainly never know that errors are occurring within your storage system. Ignorance is bliss, apparently, but I’d rather have the information available so I can act on it, determine the root cause of the problem and try to prevent a recurrence.
Unresolved storage errors can often lead to bigger problems later, so let’s fix the problem right now while we can.
Checksum error found
As I’ve just recently installed a pair of mirrored SSDs as my ZFS root boot pool on this NAS, I decided to take a quick status check:
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c11t6d0s0  ONLINE       0     0     0
            c11t7d0s0  ONLINE       0     0     1

errors: No known data errors
Oh dear, one checksum error found on the second SSD.
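With only two devices the CKSUM column is easy to read by eye, but on a larger pool a small filter helps. The following is a minimal sketch, not part of the original setup: the function name cksum_errors is made up for illustration, and it assumes the config table keeps the "NAME STATE READ WRITE CKSUM" header shown above.

```shell
#!/bin/sh
# Print "device count" for every row of the `zpool status` config table
# whose CKSUM column is not 0. Reads the status text on stdin.
# cksum_errors is a hypothetical helper name, not a ZFS command.
cksum_errors() {
    awk '
        # The table starts after the "NAME STATE READ WRITE CKSUM" header row.
        $1 == "NAME" && $5 == "CKSUM" { in_table = 1; next }
        # The "errors:" line marks the end of the table.
        $1 == "errors:" { in_table = 0 }
        # Rows have at least five columns; the fifth is the CKSUM count.
        in_table && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# Typical use on a live system (needs ZFS installed):
#   zpool status rpool | cksum_errors
```

Fed the status output above, this would print the single line "c11t7d0s0 1".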
Is this caused by a bug in snv_121?
Update 2009-09-21: After installing the latest firmware/BIOS onto the AOC-USAS-L8i HBA, after a reboot, I looked through the SAS configuration menus in the card’s BIOS, and found two entries that may well throw light on the checksum errors encountered here. I saw the following:
PHY7
Invalid DWord Count: 0x0000191E
Running Disparity Error Count: 0x00001922
No errors in the logs for PHY6 (the other SSD), so that one looks OK, and this tallies up with the info below regarding the SSD that Solaris shows as generating checksum errors during scrub operations.
Whilst I was updating firmware, I installed an HBA utility called ‘lsiutil’ from the LSI site, as this HBA, although manufactured by SuperMicro, contains the LSI LSISAS1068E ASIC. When I ran ‘lsiutil’, I selected the HBA from the menu, selected the ‘Diagnostics’ option, and then selected the ‘Display phy counters’ option, and for the SSD showing errors in the HBA BIOS, I saw the following output:
Adapter Phy 7: Link Up
  Invalid DWord Count                6,429
  Running Disparity Error Count      6,401
  Loss of DWord Synch Count              0
  Phy Reset Problem Count                0
So, this would indeed appear to be a hardware problem with one of the SSDs… luckily, due to having a mirror, these problems can be fixed and managed for now! Time to see if other people have experienced checksum errors with OCZ Vertex Turbo SSDs… after all, they are overclocked versions of the standard OCZ Vertex devices, so if their quality control process has not correctly identified failing devices, this might be the result. Google time…
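The HBA BIOS shows the PHY7 counters in hexadecimal while lsiutil prints them in decimal, so a quick conversion makes the two sets of figures easier to compare. This is only a small sketch; hex_to_dec is a made-up helper name, and it relies on the standard printf utility accepting 0x-prefixed arguments.

```shell
#!/bin/sh
# Convert a 0x-prefixed hexadecimal counter to decimal.
# hex_to_dec is a hypothetical helper name used for illustration.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x0000191E   # Invalid DWord Count from the BIOS screen: 6430
hex_to_dec 0x00001922   # Running Disparity Error Count from the BIOS screen: 6434
```

Both values are in the same range as the 6,429 and 6,401 reported by lsiutil for the same PHY.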
Update 2009-09-16: ZFS bug 6869090, which refers to checksum errors on RAID-Z1 / RAID-Z2 vdevs has been fixed now, and will be available in snv_124 which, judging by the usual release frequency of every 14 days, should be released on Friday October 2nd 2009. Presumably, this means that the packages should also be available to OpenSolaris 2009.06 users via the Package Manager tool.
Also, having discussed the mirror checksum errors with some experienced storage professionals, it is thought that these most probably indicate a hardware problem. The hardware is brand new hardware: the AOC-USAS-L8i SATA controller and the OCZ Vertex Turbo SSDs. I will investigate this more later, as it may not be trivial to determine the exact cause, and it doesn’t seem to be causing the system too much of a problem right now… On the other hand, the mirror checksum ‘bug’ submitted has been accepted here: Bug ID- 6880994 Checksum failures on mirrored drives, so we’ll see what happens.
Update 2009-09-05: Bug 11201 – Checksum failures on mirrored drives, has been created to report the checksum errors occurring within mirrors, which is the problem I have seen here.
The way to reproduce the checksum errors with a mirrored root boot pool is to:
- As root, run Richard Elling’s zcksummon script
- Scrub the mirrored root boot pool
If your system is experiencing checksum errors from the scrub on the mirror, then you will see output generated from zcksummon.
For example, run it 2 or 3 times and then diff the results:
# ./zcksummon -s 1024 > out1
# zpool scrub rpool
When the scrub completes, stop zcksummon and review the output in an editor. If you see the same blocks appearing multiple times, then it sounds like the defect/bug reported in Bug 11201 – Checksum failures on mirrored drives.
Repeat the test a couple of times, redirecting the output to different files, and then diff the files. In my case, over three test runs, the only things that differed were the timestamps within the files, so the output from zcksummon was effectively the same across each test run.
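Because the timestamps always differ, a plain diff of the capture files is noisy. The sketch below compares two files after blanking a number of leading fields. It assumes, without having checked the zcksummon output format, that the timestamp occupies the first N whitespace-separated fields of each line; the function names are made up, and N is something you would set after looking at your own files.

```shell
#!/bin/sh
# Print a file with its first N whitespace-separated fields blanked out.
# Assumption: those leading fields hold the timestamp. Adjust N to match
# the real capture format. On older Solaris, nawk may be needed for -v.
strip_fields() {
    awk -v n="$1" '{ for (i = 1; i <= n; i++) $i = ""; print }' "$2"
}

# Succeed (exit status 0) when two captures match apart from those fields.
same_ignoring_timestamps() {
    [ "$(strip_fields "$1" "$2")" = "$(strip_fields "$1" "$3")" ]
}

# Example, treating the first field of each line as the timestamp:
#   same_ignoring_timestamps 1 out1 out2 && echo "same blocks reported"
```

Holding each file in a shell variable is fine for captures of modest size; for very large files, writing the stripped versions to temporary files and running diff on them would be the safer choice.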
Please bear this information in mind when reading the rest of the post below.
Update 2009-09-03: ZFS bug 6869090, introduced since snv_120, seems to be the cause of checksum errors in RAID-Z1 & RAID-Z2 vdevs, but not mirrors — see the announcement of the problems found in this forum post: Problem with RAID-Z in builds snv_120 – snv_123
For the checksum errors occurring with mirrors, this might offer some clues.
Bear that in mind when reading the rest of the text below, which I will keep in place.
I wonder if this checksum error is related to a dodgy SSD, or this apparent problem reported in build snv_121, which I’m currently using by connecting to the http://pkg.opensolaris.org/dev/ package repository? See here:
snv_110 -> snv_121 produces checksum errors on Raid-Z pool
The forum post refers to checksum errors found in RAID-Z vdevs, and I’m using a mirror in this boot pool, but it probably is related to the same problem.
There is another reference to the checksum problem here at solarisinternals.com. It mentions that the expected ETA for a fix is snv_124 build. In the meantime, what to do? Maybe keep an eye on the opensolaris forum post.
If the problem I’m seeing here is related to this reported problem then it’s probably best to roll back to my previous boot environment which should have different ZFS code in it — snv_110? There seems to be no bug id that I could find yet related to this, but hopefully one will appear soon.
Luckily using ZFS I can roll back, as we have snapshots, and before rolling back, I first want to:
- try booting my previous boot environment, as this can be done without changing the file system, and I can then check the OS version with a ‘uname -a’ to be sure it’s a version prior to snv_121
- assess which files were modified between the last snapshots by something like: # zfs send -i rpool@snap1 rpool@snap2 | zfs receive backup/snaps
However, it’s a bit disconcerting that this error has slipped through the net, considering that the ZFS developers have a comprehensive test suite. Presumably the test suite has no tests that check for unexpected checksum errors yet, otherwise these problems would have been discovered before releasing the code.
I found some bugs that look like they might be related:
Anyway, I’ve rolled back to build 117 of OpenSolaris 2009.06 (dev package repository), as I hear reports of 118 working for someone else — i.e. no checksum errors. When build 124 is announced, hopefully they will have fixed the problem.
Fix the checksum error
Following the instructions listed at the URL displayed in the message:
- Determine if the device needs to be replaced: how? can’t find messages in log files so I’ll continue.
- Clear the errors using ‘zpool clear’: ok, let me RTFM…
- OK RTFM, so let’s issue: # zpool clear rpool c11t7d0s0
- Then let’s scrub the pool and check the status to see if the checksum error has been fixed.
Let’s clear the error and check the status to verify that the error has really been cleared:
# zpool clear rpool c11t7d0s0
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c11t6d0s0  ONLINE       0     0     0
            c11t7d0s0  ONLINE       0     0     0

errors: No known data errors
Good. Now, let’s do a scrub on the pool to see if things are OK now:
# zpool scrub rpool
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h1m with 0 errors on Wed Sep 2 11:12:20 2009
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c11t6d0s0  ONLINE       0     0     2  256K repaired
            c11t7d0s0  ONLINE       0     0     5  128K repaired

errors: No known data errors
The scrub repaired 256K on c11t6d0s0 and 128K on c11t7d0s0, but it also found more checksum errors: two on c11t6d0s0 and five on c11t7d0s0.
So, let’s clear all errors, this time on all drives in the pool, i.e. specify the whole pool and not just an individual drive id, scrub again and see what we find after that:
# zpool clear rpool
# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 0h1m with 0 errors on Wed Sep 2 11:29:02 2009
config:

        NAME           STATE     READ WRITE CKSUM
        rpool          ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c11t6d0s0  ONLINE       0     0     0
            c11t7d0s0  ONLINE       0     0     0

errors: No known data errors
That looks better. Zero errors. Let’s hope all is well.
In the meantime, as ZFS makes it so quick and easy, let’s do a recursive snapshot of the pool and archive it to a separate box on the LAN. Just in case, you never know…:
On the NAS:
# zfs snapshot -r rpool@20090902
# zfs list -t snapshot | grep 20090902 | grep -v dump | grep -v swap
rpool@20090902                      0      -  81.5K  -
rpool/ROOT@20090902                 0      -    19K  -
rpool/ROOT/be2@20090902             0      -  6.95G  -
rpool/ROOT/be3@20090902          526K      -  7.04G  -
rpool/ROOT/opensolaris@20090902     0      -  2.82G  -
rpool/export@20090902               0      -    21K  -
rpool/export/home@20090902          0      -   540M  -

On the backup machine ('zfsnas' is the host name of the NAS):
# zfs create backup/snaps
# zfs set sharenfs='rw=zfsnas,root=zfsnas' backup/snaps
# share
-@backup/snaps  /backup/snaps  sec=sys,rw=zfsnas,root=zfsnas  ""

On the NAS:
# zfs send -Rv rpool@20090902 > /net/192.168.0.45/backup/snaps/rpool.recursive.20090902
Done. Quick and easy.
Will keep an eye on the pool for the next few weeks to see if checksum errors are a regular occurrence or not.
Perhaps I’ll set up a cron task to email me the results of a daily scrub until we’re happy that things are OK. Will post script here once created.
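As a starting point, here is a rough sketch of what such a cron job might look like. It is not the finished script. It assumes that `zpool status -x` prints a line containing "is healthy" for a healthy pool, that the status output contains "scrub in progress" while a scrub runs, and that mailx is available for sending mail. The function names, the POOL and MAILTO defaults and the --run switch are all made up for illustration.

```shell
#!/bin/sh
# Sketch of a daily scrub check. POOL and MAILTO are placeholder defaults.
POOL=${POOL:-rpool}
MAILTO=${MAILTO:-root}

# Succeed when `zpool status -x` describes the pool as healthy.
pool_is_healthy() {
    zpool status -x "$1" | grep "is healthy" >/dev/null
}

# Start a scrub, poll until it is no longer in progress, then mail the
# full status report only if the pool is not healthy.
scrub_and_report() {
    zpool scrub "$1" || return 1
    while zpool status "$1" | grep "scrub in progress" >/dev/null; do
        sleep 60
    done
    if pool_is_healthy "$1"; then
        return 0
    fi
    zpool status -v "$1" | mailx -s "ZFS: problems on pool $1" "$MAILTO"
}

# Only do real work when asked to, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    scrub_and_report "$POOL"
fi
```

A root crontab entry along the lines of `0 3 * * * /path/to/scrub-check.sh --run` (the path is hypothetical) would then run it once a day at 03:00. Mailing only on problems keeps the inbox quiet; to get a report every day regardless, drop the health check and always send the status.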
Also, it might be worth checking software versions in case there are drivers or firmware I need to update:
* AOC-USAS-L8i firmware check:
* OCZ Vertex Turbo firmware check: