Saturday, December 13, 2014

ZFS: RAID-Z Resilvering

Solaris 11.2 introduced a new ZFS pool version: 35 (Sequential resilver).

The new feature is supposed to make disk resilvering (disk replacement, hot-spare synchronization, etc.) much faster. It achieves this by prefetching the relevant metadata first and then reading the data to be resilvered in a mostly sequential manner, instead of in block-pointer traversal order. And it does work!
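
A quick way to confirm that a pool can actually use the feature: zpool upgrade -v lists the ZFS versions supported by the installed software, and the version pool property shows what a given pool is running. A minimal check (using the pool name from the example below) might be:

# zpool upgrade -v | grep -i resilver
# zpool get version XXXXXXXXXXXXXXXXXXXXXXX-0

The pool has to be at version 35 or later, otherwise the old resilver code path is used.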

Here is a real-world case, with real data - over 150 million files of different sizes, most relatively small. Many of them have been deleted and new ones written over time, so I expect the data in the pool to be fairly fragmented. The server is a Sun/Oracle X4-2L with 26x 1.2TB 2.5" 10k RPM SAS disks. The 24 disks in front are presented in pass-through mode and managed by ZFS, configured as 3 RAID-Z pools; the other 2 disks in the rear are configured as RAID-1 on the RAID controller and used for the OS. A disk in one of the pools failed and a hot spare automatically attached:

# zpool status -x
  pool: XXXXXXXXXXXXXXXXXXXXXXX-0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri Dec 12 21:02:58 2014
    3.60T scanned
    45.9G resilvered at 342M/s, 9.96% done, 2h45m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXXXXX-0    DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            spare-0                  DEGRADED     0     0     0
              c0t5000CCA01D5EAE50d0  UNAVAIL      0     0     0
              c0t5000CCA01D5EED34d0  DEGRADED     0     0     0  (resilvering)
            c0t5000CCA01D5BF56Cd0    ONLINE       0     0     0
            c0t5000CCA01D5E91B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F9B00d0    ONLINE       0     0     0
            c0t5000CCA01D5E87E4d0    ONLINE       0     0     0
            c0t5000CCA01D5E95B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F8244d0    ONLINE       0     0     0
            c0t5000CCA01D58B3A4d0    ONLINE       0     0     0
        spares
          c0t5000CCA01D5EED34d0      INUSE
          c0t5000CCA01D5E1E3Cd0      AVAIL

errors: No known data errors
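
Once the resilver onto the hot spare completes, the failed disk still has to be physically replaced. A minimal sketch of the usual follow-up, with c0t...NEWd0 standing in as a hypothetical name for the replacement device:

# zpool replace XXXXXXXXXXXXXXXXXXXXXXX-0 c0t5000CCA01D5EAE50d0 c0t...NEWd0

After the replacement disk finishes resilvering, the hot spare should detach automatically and return to the AVAIL state.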

Let's look at the I/O statistics for the disks involved (the c0 line in the iostat -xnC output is the per-controller aggregate):

# iostat -xnC 1 | egrep "device| c0$|c0t5000CCA01D5EAE50d0|c0t5000CCA01D5EED34d0..."
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 16651.6  503.9 478461.6 69423.4  0.2 26.3    0.0    1.5   1  19 c0
 2608.5    0.0 70280.3    0.0  0.0  1.6    0.0    0.6   3  36 c0t5000CCA01D5E95B0d0
 2582.5    0.0 66708.5    0.0  0.0  1.9    0.0    0.7   3  39 c0t5000CCA01D5F9B00d0
 2272.6    0.0 68571.0    0.0  0.0  2.9    0.0    1.3   2  50 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  503.9    0.0 69423.8  0.0  9.7    0.0   19.3   2 100 c0t5000CCA01D5EED34d0
 2503.5    0.0 66508.4    0.0  0.0  2.0    0.0    0.8   3  41 c0t5000CCA01D58B3A4d0
 2324.5    0.0 67093.8    0.0  0.0  2.1    0.0    0.9   3  44 c0t5000CCA01D5F8244d0
 2285.5    0.0 69192.3    0.0  0.0  2.3    0.0    1.0   2  45 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 1997.6    0.0 70006.0    0.0  0.0  3.3    0.0    1.6   2  54 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 25150.8  624.9 499295.4 73559.8  0.2 33.7    0.0    1.3   1  22 c0
 3436.4    0.0 68455.3    0.0  0.0  3.3    0.0    0.9   2  51 c0t5000CCA01D5E95B0d0
 3477.4    0.0 71893.7    0.0  0.0  3.0    0.0    0.9   3  48 c0t5000CCA01D5F9B00d0
 3784.4    0.0 72370.6    0.0  0.0  3.6    0.0    0.9   3  56 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  624.9    0.0 73559.8  0.0  9.4    0.0   15.1   2 100 c0t5000CCA01D5EED34d0
 3170.5    0.0 72167.9    0.0  0.0  3.5    0.0    1.1   2  55 c0t5000CCA01D58B3A4d0
 3881.4    0.0 72870.8    0.0  0.0  3.3    0.0    0.8   3  55 c0t5000CCA01D5F8244d0
 4252.3    0.0 70709.1    0.0  0.0  3.2    0.0    0.8   3  53 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 3063.5    0.0 70380.1    0.0  0.0  4.0    0.0    1.3   2  60 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 17190.2  523.6 502346.2 56121.6  0.2 31.0    0.0    1.8   1  18 c0
 2342.7    0.0 71913.8    0.0  0.0  2.9    0.0    1.2   3  43 c0t5000CCA01D5E95B0d0
 2306.7    0.0 72312.9    0.0  0.0  3.0    0.0    1.3   3  43 c0t5000CCA01D5F9B00d0
 2642.1    0.0 68822.9    0.0  0.0  2.9    0.0    1.1   3  45 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  523.6    0.0 56121.2  0.0  9.3    0.0   17.8   1 100 c0t5000CCA01D5EED34d0
 2257.7    0.0 71946.9    0.0  0.0  3.2    0.0    1.4   2  44 c0t5000CCA01D58B3A4d0
 2668.2    0.0 72685.4    0.0  0.0  2.9    0.0    1.1   3  43 c0t5000CCA01D5F8244d0
 2236.6    0.0 71829.5    0.0  0.0  3.3    0.0    1.5   3  47 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 2695.2    0.0 72395.4    0.0  0.0  3.2    0.0    1.2   3  45 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 31265.3  578.9 342935.3 53825.1  0.2 18.3    0.0    0.6   1  15 c0
 3748.0    0.0 48255.8    0.0  0.0  1.5    0.0    0.4   2  42 c0t5000CCA01D5E95B0d0
 4367.0    0.0 47278.2    0.0  0.0  1.1    0.0    0.3   2  35 c0t5000CCA01D5F9B00d0
 4706.1    0.0 50982.6    0.0  0.0  1.3    0.0    0.3   3  37 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  578.9    0.0 53824.8  0.0  9.7    0.0   16.8   1 100 c0t5000CCA01D5EED34d0
 4094.1    0.0 48077.3    0.0  0.0  1.2    0.0    0.3   2  35 c0t5000CCA01D58B3A4d0
 5030.1    0.0 47700.1    0.0  0.0  0.9    0.0    0.2   3  33 c0t5000CCA01D5F8244d0
 4939.9    0.0 52671.2    0.0  0.0  1.1    0.0    0.2   3  33 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 4380.1    0.0 47969.9    0.0  0.0  1.4    0.0    0.3   3  36 c0t5000CCA01D5BF56Cd0
^C

These are pretty amazing numbers for RAID-Z - the only way a single 10k RPM disk drive can deliver thousands of reads per second is if most of those reads are almost ideally sequential (see the back-of-the-envelope math after the next sample). From time to time I see even more amazing numbers:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 73503.1 3874.0 53807.0 19166.6  0.3  9.8    0.0    0.1   1  16 c0
 9534.8    0.0 6859.5    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5E95B0d0
 9475.7    0.0 6969.1    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5F9B00d0
 9646.9    0.0 7176.4    0.0  0.0  0.4    0.0    0.0   3  31 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0 3478.6    0.0 18040.0  0.0  5.1    0.0    1.5   2  98 c0t5000CCA01D5EED34d0
 8213.4    0.0 6908.0    0.0  0.0  0.8    0.0    0.1   3  38 c0t5000CCA01D58B3A4d0
 9671.9    0.0 6860.5    0.0  0.0  0.4    0.0    0.0   3  30 c0t5000CCA01D5F8244d0
 8572.7    0.0 6830.0    0.0  0.0  0.7    0.0    0.1   3  35 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 18387.8    0.0 12203.5    0.0  0.1  0.7    0.0    0.0   7  57 c0t5000CCA01D5BF56Cd0
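
A rough back-of-the-envelope calculation shows why these rates imply sequential I/O (assuming typical 10k RPM SAS figures of roughly 4 ms average seek plus 3 ms rotational latency):

    random IOPS ceiling ≈ 1 / (4 ms + 3 ms)        ≈ ~150 reads/s per disk
    observed            ≈ 2,500-9,500+ reads/s per disk
    average read size   ≈ 70,000 KB/s / 2,500 r/s  ≈ 28 KB

At 15-60+ times what the mechanics allow for random I/O, the heads cannot be seeking for most of these reads; the bulk of them must be serviced from sequential runs on the platters (helped along by the drives' own read-ahead).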

It is really good to see the new feature work so well in practice - it makes RAID-Z much more usable in many production environments. The complementary feature that also makes RAID-Z much more practical is the RAID-Z/mirror hybrid allocator, introduced in Solaris 11 (pool version 29), which stores metadata in a mirrored layout and makes metadata access in RAID-Z much faster.

Both features are only available in Oracle Solaris 11 and not in OpenZFS derivatives, although OpenZFS has interesting new features of its own.

Friday, December 05, 2014

ZFS Performance Improvements

One of the blogs I value is Roch Bourbonnais's. He hadn't blogged much in some time, but he is back to blogging! He has listed the main ZFS performance improvements since Oracle took over, and also provided more details on ReARC. There is more to come from him, hopefully soon.

PS: to be clear, he is describing improvements to Oracle's ZFS as found in Solaris 11 and the ZFS Storage Appliance (ZFS-SA), not in OpenZFS.