Saturday, December 13, 2014

ZFS: RAID-Z Resilvering

Solaris 11.2 introduced a new ZFS pool version: 35 (Sequential resilver).

The new feature is supposed to make disk resilvering (disk replacement, hot-spare synchronization, etc.) much faster. It achieves this by first reading ahead some metadata and then trying to read the data to be resilvered in a sequential manner. And it does work!

Here is a real-world case with real data - over 150 million files of various sizes, most of them relatively small. Many had been deleted and new ones written over time, so I expect the data in the pool to be fragmented already. The server is a Sun/Oracle X4-2L with 26x 1.2TB 2.5" 10k SAS disks. The 24 disks in front are presented in pass-through mode and managed by ZFS, configured as 3 RAID-Z pools; the other 2 disks in the rear are configured as RAID-1 in the RAID controller and used for the OS. A disk in one of the pools failed and a hot spare automatically attached:

# zpool status -x
  pool: XXXXXXXXXXXXXXXXXXXXXXX-0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri Dec 12 21:02:58 2014
    3.60T scanned
    45.9G resilvered at 342M/s, 9.96% done, 2h45m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        XXXXXXXXXXXXXXXXXXXXXXX-0    DEGRADED     0     0     0
          raidz1-0                   DEGRADED     0     0     0
            spare-0                  DEGRADED     0     0     0
              c0t5000CCA01D5EAE50d0  UNAVAIL      0     0     0
              c0t5000CCA01D5EED34d0  DEGRADED     0     0     0  (resilvering)
            c0t5000CCA01D5BF56Cd0    ONLINE       0     0     0
            c0t5000CCA01D5E91B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F9B00d0    ONLINE       0     0     0
            c0t5000CCA01D5E87E4d0    ONLINE       0     0     0
            c0t5000CCA01D5E95B0d0    ONLINE       0     0     0
            c0t5000CCA01D5F8244d0    ONLINE       0     0     0
            c0t5000CCA01D58B3A4d0    ONLINE       0     0     0
        spares
          c0t5000CCA01D5EED34d0      INUSE
          c0t5000CCA01D5E1E3Cd0      AVAIL

errors: No known data errors

Let's look at the I/O statistics for the disks involved:

# iostat -xnC 1 | egrep "device| c0$|c0t5000CCA01D5EAE50d0|c0t5000CCA01D5EED34d0..."
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 16651.6  503.9 478461.6 69423.4  0.2 26.3    0.0    1.5   1  19 c0
 2608.5    0.0 70280.3    0.0  0.0  1.6    0.0    0.6   3  36 c0t5000CCA01D5E95B0d0
 2582.5    0.0 66708.5    0.0  0.0  1.9    0.0    0.7   3  39 c0t5000CCA01D5F9B00d0
 2272.6    0.0 68571.0    0.0  0.0  2.9    0.0    1.3   2  50 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  503.9    0.0 69423.8  0.0  9.7    0.0   19.3   2 100 c0t5000CCA01D5EED34d0
 2503.5    0.0 66508.4    0.0  0.0  2.0    0.0    0.8   3  41 c0t5000CCA01D58B3A4d0
 2324.5    0.0 67093.8    0.0  0.0  2.1    0.0    0.9   3  44 c0t5000CCA01D5F8244d0
 2285.5    0.0 69192.3    0.0  0.0  2.3    0.0    1.0   2  45 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 1997.6    0.0 70006.0    0.0  0.0  3.3    0.0    1.6   2  54 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 25150.8  624.9 499295.4 73559.8  0.2 33.7    0.0    1.3   1  22 c0
 3436.4    0.0 68455.3    0.0  0.0  3.3    0.0    0.9   2  51 c0t5000CCA01D5E95B0d0
 3477.4    0.0 71893.7    0.0  0.0  3.0    0.0    0.9   3  48 c0t5000CCA01D5F9B00d0
 3784.4    0.0 72370.6    0.0  0.0  3.6    0.0    0.9   3  56 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  624.9    0.0 73559.8  0.0  9.4    0.0   15.1   2 100 c0t5000CCA01D5EED34d0
 3170.5    0.0 72167.9    0.0  0.0  3.5    0.0    1.1   2  55 c0t5000CCA01D58B3A4d0
 3881.4    0.0 72870.8    0.0  0.0  3.3    0.0    0.8   3  55 c0t5000CCA01D5F8244d0
 4252.3    0.0 70709.1    0.0  0.0  3.2    0.0    0.8   3  53 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 3063.5    0.0 70380.1    0.0  0.0  4.0    0.0    1.3   2  60 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 17190.2  523.6 502346.2 56121.6  0.2 31.0    0.0    1.8   1  18 c0
 2342.7    0.0 71913.8    0.0  0.0  2.9    0.0    1.2   3  43 c0t5000CCA01D5E95B0d0
 2306.7    0.0 72312.9    0.0  0.0  3.0    0.0    1.3   3  43 c0t5000CCA01D5F9B00d0
 2642.1    0.0 68822.9    0.0  0.0  2.9    0.0    1.1   3  45 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  523.6    0.0 56121.2  0.0  9.3    0.0   17.8   1 100 c0t5000CCA01D5EED34d0
 2257.7    0.0 71946.9    0.0  0.0  3.2    0.0    1.4   2  44 c0t5000CCA01D58B3A4d0
 2668.2    0.0 72685.4    0.0  0.0  2.9    0.0    1.1   3  43 c0t5000CCA01D5F8244d0
 2236.6    0.0 71829.5    0.0  0.0  3.3    0.0    1.5   3  47 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 2695.2    0.0 72395.4    0.0  0.0  3.2    0.0    1.2   3  45 c0t5000CCA01D5BF56Cd0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 31265.3  578.9 342935.3 53825.1  0.2 18.3    0.0    0.6   1  15 c0
 3748.0    0.0 48255.8    0.0  0.0  1.5    0.0    0.4   2  42 c0t5000CCA01D5E95B0d0
 4367.0    0.0 47278.2    0.0  0.0  1.1    0.0    0.3   2  35 c0t5000CCA01D5F9B00d0
 4706.1    0.0 50982.6    0.0  0.0  1.3    0.0    0.3   3  37 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0  578.9    0.0 53824.8  0.0  9.7    0.0   16.8   1 100 c0t5000CCA01D5EED34d0
 4094.1    0.0 48077.3    0.0  0.0  1.2    0.0    0.3   2  35 c0t5000CCA01D58B3A4d0
 5030.1    0.0 47700.1    0.0  0.0  0.9    0.0    0.2   3  33 c0t5000CCA01D5F8244d0
 4939.9    0.0 52671.2    0.0  0.0  1.1    0.0    0.2   3  33 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 4380.1    0.0 47969.9    0.0  0.0  1.4    0.0    0.3   3  36 c0t5000CCA01D5BF56Cd0
^C

These are pretty amazing numbers for RAID-Z - the only reason a single disk drive can do so many thousands of reads per second is that most of them must be almost ideally sequential. From time to time I see even more impressive numbers:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 73503.1 3874.0 53807.0 19166.6  0.3  9.8    0.0    0.1   1  16 c0
 9534.8    0.0 6859.5    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5E95B0d0
 9475.7    0.0 6969.1    0.0  0.0  0.4    0.0    0.0   4  30 c0t5000CCA01D5F9B00d0
 9646.9    0.0 7176.4    0.0  0.0  0.4    0.0    0.0   3  31 c0t5000CCA01D5E91B0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5EAE50d0
    0.0 3478.6    0.0 18040.0  0.0  5.1    0.0    1.5   2  98 c0t5000CCA01D5EED34d0
 8213.4    0.0 6908.0    0.0  0.0  0.8    0.0    0.1   3  38 c0t5000CCA01D58B3A4d0
 9671.9    0.0 6860.5    0.0  0.0  0.4    0.0    0.0   3  30 c0t5000CCA01D5F8244d0
 8572.7    0.0 6830.0    0.0  0.0  0.7    0.0    0.1   3  35 c0t5000CCA01D5E87E4d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t5000CCA01D5E1E3Cd0
 18387.8    0.0 12203.5    0.0  0.1  0.7    0.0    0.0   7  57 c0t5000CCA01D5BF56Cd0

It is really good to see the new feature work so well in practice - it is what makes RAID-Z much more usable in many production environments. The other feature which complements this one and also makes RAID-Z much more practical to use is the RAID-Z/mirror hybrid allocator, introduced in Solaris 11 (pool version 29), which makes accessing metadata in RAID-Z much faster.

Both features are only available in Oracle Solaris 11 and not in OpenZFS derivatives, although OpenZFS has its own interesting new features as well.
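
As a side note, you can check whether a given pool is already at a version that includes these features - a quick sketch, using the pool name from the output above: zpool get version shows the pool's current version, while zpool upgrade -v lists all the versions supported by the installed release (29 and 35 among them).

# zpool get version XXXXXXXXXXXXXXXXXXXXXXX-0
# zpool upgrade -v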

Friday, December 05, 2014

ZFS Performance Improvements

One of the blogs I value is Roch Bourbonnais's. He hasn't blogged much in some time but he is back to blogging! He listed the main ZFS performance improvements since Oracle took over, and then also provided more details on ReARC. There is more to come from him, hopefully soon.

PS: to make it clear, he is describing improvements to Oracle's ZFS as found in Solaris 11 and the ZFS-SA, not OpenZFS.

Friday, November 14, 2014

Kernel Zones Internals

There is a very good presentation describing how Kernel Zones are implemented from an architecture point of view.

Kernel Zones: root disk recovery

I've been playing with Solaris Kernel Zones recently. I installed a zone named kz1 on an iSCSI device so I could also test Live Migration. At some point I wanted to modify the contents of the kz1 root disk outside of the zone, so I imported the pool in the global zone (with zpool import -t ...), of course while kz1 was shut down. That worked fine. However, the zone then crashed on boot as it couldn't import its pool. The reason is that when a pool is imported on another system, the phys_path in its ZFS label is updated, and when a kernel zone boots it tries to import its root pool based on phys_path - which may now be invalid, and that was exactly the case for me. The root disk of the kz1 zone was configured as:

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set bootpri=0
set id=0
end

This results in phys_path /zvnex/zvblk@0:b, as the disk driver is virtualized in Kernel Zones. After the pool was imported in the global zone, the phys_path was updated to /scsi_vhci/disk@g600144f0a613c900000054521d550001:b, which won't work in the kernel zone. One way to work around the issue is to create another "rescue" zone with its root disk's id set to 1, and then add the root disk of kz1 to it as an additional, non-bootable disk with id=0.
 
add device
set storage=dev:zvol/dsk/rpool/kz2
set bootpri=0
set id=1
end

add device
set storage=iscsi://iscsi1.test/lunme.naa.600144f0a613c900000054521d550001
set id=0
end

Now in order to update the phys_path to the correct one, the pool needs to be imported and exported in the rescue zone:
 
# zpool import -R /mnt/1 -Nf -t kz1 rpool
# zpool export kz1
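
To double-check that the label now carries a phys_path the kernel zone can actually use, you can inspect the label with zdb from within the rescue zone - just a sketch, the device name below is hypothetical and will differ on your system:

# zdb -l /dev/dsk/c1d0s0 | grep phys_path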

Notice that the rescue zone doesn't even need to have a network configured - in fact all you need is a minimal OS installation with almost everything disabled, and you log in to it via zlogin.

The kz1 zone will now boot just fine. In most cases you don't need to use the above procedure - you should be able to do all the customizations via AI manifests, system profiles and SMF. But in some cases it is useful to be able to manipulate the root disk contents of a kernel zone without actually booting it, or you may need to recover an rpool after it has become unbootable.

Monday, November 03, 2014

Rollback ZFS volume exported over iSCSI

While playing with Kernel Zones on Solaris 11.2 I noticed that once a ZFS volume is shared over iSCSI, I can create a snapshot of it but I can't roll it back - I get a "volume is busy" error message. I found a way to do it:

# stmfadm delete-lu 600144F0A613C900000054521D550001
# zfs rollback pool/idisk0@snap1
# stmfadm import-lu /dev/zvol/rdsk/pool/idisk0
Logical unit imported: 600144F0A613C900000054521D550001
# stmfadm add-view 600144F0A613C900000054521D550001
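
To confirm that the LU and its view are back, you can list the views for the GUID shown in the output above:

# stmfadm list-view -l 600144F0A613C900000054521D550001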

Although this should be easier...

The iSCSI LUN contains a kernel zone image taken just after installation. However, now I couldn't boot into it:

# zoneadm -z kz1 attach
zone 'kz1': error: Encryption key not available. See solaris-kz(5) for configuration migration
zone 'kz1': procedure or restore /etc/zones/keys/kz1.

Right, the man page explains it all - there is host metadata needed to boot a kernel zone, and it is encrypted. Since I rolled back the ZFS volume to a previous installation, the encryption key stored in the zone's configuration was no longer valid. I had to re-create it:

# zoneadm -z kz1 attach -x initialize-hostdata
# zoneadm -z kz1 boot

And now it booted just fine.

Tuesday, October 28, 2014

Oracle Open World 2014

This was my first time attending OOW and I must say I did like it. What a big marketing extravaganza...! It wasn't all marketing though - there were many interesting technical sessions too, but most importantly it was a big opportunity to meet with Oracle engineers and management, both people I had met in the past and some new faces I had only exchanged emails with. It was also interesting to talk to other customers and see what they've been up to - some of them are doing very interesting things. On Sunday there was also a Solaris Customer Advisory Board meeting, which was well worth attending.

One of the things that surprised me was how many other vendors were present there, having their stands - virtually everyone, from startups to large tier one vendors. I guess it is a good opportunity for everyone to meet with their customers (and potentially new customers).

I also presented there on how we use Solaris and took part in a Solaris Customer Panel - both were good experiences.

For more details see Markus Flierl's post.

Friday, August 29, 2014

How ASLR exposed an old bug

Recently I was investigating an issue where OpenAFS server processes were crashing on start-up when ASLR (Address Space Layout Randomization) was enabled. All of them were crashing in the same place. Initially I enabled ASLR globally and restarted the AFS services:
$ sxadm enable -c model=all aslr
$ svcadm restart ms/afs/server
This resulted in core files from all the daemons; let's look at one of them:
$ mdb core.fileserver.1407857175
Loading modules: [ libc.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of fileserver (64-bit) from XXXX
file: /usr/afs/bin/fileserver
initial argv: /usr/afs/bin/fileserver -p 123 -pctspare 20 -L -busyat 50 -rxpck 2000 -rxbind
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=ffffffffb7a94b20
> ::stack
libc.so.1`memset+0x31c()
afsconf_OpenInternal+0x965()
afsconf_Open+0x125()
main+0x118()
0x44611c()
>
All the other daemons crashed in the same place. Let's take a closer look at the core.
> afsconf_OpenInternal+0x965::dis
afsconf_OpenInternal+0x930: movl $0x0,0xfffffffffffffce4(%rbp)
afsconf_OpenInternal+0x93a: movq 0xfffffffffffffce0(%rbp),%r9
afsconf_OpenInternal+0x941: movq -0x8(%rbp),%r8
afsconf_OpenInternal+0x945: movq %r9,0x18(%r8)
afsconf_OpenInternal+0x949: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x94d: movl $0x0,%eax
afsconf_OpenInternal+0x952: call +0x1619 
afsconf_OpenInternal+0x957: movq -0x8(%rbp),%rdi
afsconf_OpenInternal+0x95b: movl $0x0,%eax
afsconf_OpenInternal+0x960: call +0x8ecb <_afsconf_LoadRealms>
afsconf_OpenInternal+0x965: movl $0x0,-0x24(%rbp)
afsconf_OpenInternal+0x96c: jmp +0x6 
afsconf_OpenInternal+0x96e: nop
afsconf_OpenInternal+0x970: jmp +0x2 
afsconf_OpenInternal+0x972: nop
afsconf_OpenInternal+0x974: movl -0x24(%rbp),%eax
afsconf_OpenInternal+0x977: leave
afsconf_OpenInternal+0x978: ret
0x4e4349: nop
0x4e434c: nop
ParseHostLine: pushq %rbp
>
It actually crashes in _afsconf_LoadRealms(); we need a little bit more debug info:
$ truss -u a.out -u :: -vall /usr/afs/bin/fileserver $args
...
/1: -> _afsconf_LoadRealms(0x831790290, 0x1, 0x3, 0x0, 0x5bbbe8, 0x8317952bc)
/1: -> libc:malloc(0x28, 0x1, 0x3, 0x0, 0x28, 0x8317952bc)
...
/1: <- libc:malloc() = 0x8317965c0
/1: -> libc:memset(0x317965c0, 0x0, 0x28, 0x0, 0x28, 0x8317952bc)
/1: Incurred fault #6, FLTBOUNDS %pc = 0x7FFD55C802CC
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
/1: Received signal #11, SIGSEGV [default]
/1: siginfo: SIGSEGV SEGV_MAPERR addr=0x317965C0
It fails just after the first malloc() followed by memset() in _afsconf_LoadRealms(); the corresponding source code is:
local_realms = malloc(sizeof(struct afsconf_realms));
if (!local_realms) {
    code = ENOMEM;
    goto cleanup;
}
memset(local_realms, 0, sizeof(struct afsconf_realms));
queue_Init(&local_realms->list);
The code looks fine... but notice in the truss output above that memset() is being passed a different pointer from the one malloc() returned. It might be a bug in truss, but since this is where it crashes it is probably real. Let's confirm it with another tool and maybe we can also spot a pattern.
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, c62324a50
3 99433 memset:entry 62324a50 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, 10288d120
3 99433 memset:entry 288d120 0 40

$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

CPU ID FUNCTION:NAME
3 99435 malloc:entry 40
3 99437 malloc:return 54, de9479a10
3 99433 memset:entry ffffffffe9479a10 0 40
It looks like the lowest 4 bytes of the pointer returned from malloc() and passed to memset() are always preserved, while the top 4 bytes are mangled. I was curious what it looks like when ASLR is disabled:
$ elfedit -e 'dyn:sunw_aslr disable' /usr/afs/bin/fileserver
$ dtrace -n 'pid$target::_afsconf_LoadRealms:entry{self->in=1}' \
  -n 'pid$target::_afsconf_LoadRealms:return{self->in=0}' \
  -n 'pid$target::memset:entry/self->in/{printf("%p %d %d", arg0, arg1, arg2);}' \
  -n 'pid$target::malloc:entry/self->in/{trace(arg0);}' \
  -n 'pid$target::malloc:return/self->in/{printf("%p, %p", arg0,arg1);}' \
  -c "/usr/afs/bin/fileserver $args"

1 99436 malloc:entry 40
1 99438 malloc:return 54, 5bd170
1 99434 memset:entry 5bd170 0 40
... [ it continues as it doesn't crash ]
Now the pointer passed to memset() is the same as the one returned from malloc() - notice, however, that it fits in 32 bits (all the daemons are compiled as 64-bit). Let's have a look at the core again, at the place where it actually fails:
_afsconf_LoadRealms+0x59: call -0xa7c36 
_afsconf_LoadRealms+0x5e: movl %eax,%eax
_afsconf_LoadRealms+0x60: cltq
_afsconf_LoadRealms+0x62: movq %rax,%r8
_afsconf_LoadRealms+0x65: movq %r8,-0x20(%rbp)
_afsconf_LoadRealms+0x69: movq -0x20(%rbp),%r8
_afsconf_LoadRealms+0x6d: cmpq $0x0,%r8
_afsconf_LoadRealms+0x71: jne +0xd <_afsconf_LoadRealms+0x80>
_afsconf_LoadRealms+0x73: movl $0xc,-0x18(%rbp)
_afsconf_LoadRealms+0x7a: jmp +0x1f5 <_afsconf_LoadRealms+0x274>
_afsconf_LoadRealms+0x7f: nop
_afsconf_LoadRealms+0x80: movl $0x28,-0x48(%rbp)
_afsconf_LoadRealms+0x87: movl $0x0,-0x44(%rbp)
_afsconf_LoadRealms+0x8e: movq -0x48(%rbp),%r8
_afsconf_LoadRealms+0x92: movq %r8,%rdx
_afsconf_LoadRealms+0x95: movl $0x0,%esi
_afsconf_LoadRealms+0x9a: movq -0x20(%rbp),%rdi
_afsconf_LoadRealms+0x9e: movl $0x0,%eax
_afsconf_LoadRealms+0xa3: call -0xa7d10 

Bingo! See the movl and cltq instructions just after the return from malloc(). malloc() returns a 64-bit address, but the compiler thinks it returns a 32-bit value, so it clears the top 4 bytes and then sign-extends the result back to 64 bits - and that is what gets passed to memset(). With ASLR disabled we just happen to get an address low enough that the lowest 4 bytes are sufficient to address it, so the issue doesn't show up; with ASLR we usually end up with a much higher address, where you can't just chop off the top four bytes.

Compilers do this when a function is used without a declaration - with an implicit declaration they assume the return value is an int, which on x86_64 means 32 bits. The fix was trivial: all that was required was to add #include <stdlib.h> and recompile. Now the compiler knows that malloc() returns a 64-bit pointer, the movl/cltq instructions are gone, and there are no more crashes.
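
As a side note, this whole class of bugs can be caught at build time by turning the implicit-declaration warning into an error - a sketch assuming gcc is used (the flag name differs for Solaris Studio cc); file.c is just a placeholder for the file being compiled:

$ gcc -m64 -Werror=implicit-function-declaration -c file.c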

Friday, August 08, 2014

Monitoring Dropped UDP packets

I needed to monitor incoming UDP packets being dropped by the kernel and report some basic information on each of them (source and destination IP:port is enough). There are multiple probes one could use; in particular ip:::drop-in is the best candidate, however it is not fully implemented, so it doesn't provide all the arguments the other probes in the IP provider have (this seems easy to fix). We can intercept the ip_drop_input() function instead (which is where the ip:::drop-in probe is located as well) and juggle pointers to get what we want - once the drop-in probe is fixed, the script below will be shorter and easier to understand.
# ./udp_dropped_packets.d
2014 Aug  6 14:32:14 dropped UDP packet [udpIfStatsInOverflows] from 10.10.1.4:52537 to 10.10.2.9:7001
2014 Aug  6 14:32:15 dropped UDP packet [udpIfStatsInOverflows] from 10.10.1.4:52537 to 10.10.2.9:7001
2014 Aug  6 14:32:17 dropped UDP packet [l2 multicast not forwarded] from 10.10.3.11:137 to 10.10.3.9:137

Notice that the script prints the reason why a packet was dropped; if you are only interested in drops due to lack of buffer space, you need to grep the output for udpIfStatsInOverflows.
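
For example, to watch only the buffer-space drops:

# ./udp_dropped_packets.d | grep udpIfStatsInOverflows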

The script has only been tested on Solaris 11.1 and 11.2, but it probably works on 11.0 and Illumos-based distributions as well.

# cat udp_dropped_packets.d
#!/usr/sbin/dtrace -qa

/* Robert Milkowski */

fbt::ip_drop_input:entry
/ ((ipha_t *)args[1]->b_rptr)->ipha_protocol == IPPROTO_UDP /
{
  this->ipha = (ipha_t *)args[1]->b_rptr;
  this->ip_src = inet_ntoa(&this->ipha->ipha_src);
  this->ip_dst = inet_ntoa(&this->ipha->ipha_dst);

  this->iph_hdr_length = (int)(this->ipha->ipha_version_and_hdr_length & 0xF) << 2;
  this->udpha = (udpha_t *)((char *)this->ipha + this->iph_hdr_length);
  this->src_port = ntohs(this->udpha->uha_src_port);
  this->dst_port = ntohs(this->udpha->uha_dst_port);

  printf("%Y dropped UDP packet [%s] from %s:%d to %s:%d\n", walltimestamp,
          stringof(arg0),
          this->ip_src, this->src_port,
          this->ip_dst, this->dst_port
        );
}

Tuesday, July 22, 2014

Massive msync() speed up on ZFS

MongoDB uses mmap() to access all its files. It also has a special thread which by default wakes up every 60s and calls msync() for all the mmap'ed files, one at a time. When you first start MongoDB all these msyncs are fast, assuming there are no modifications. However, if your server has hundreds of GBs of RAM (or more) and the database is also that big, the msync()s get slower over time - they take longer the more data is cached in RAM. Eventually it can take 50s or more for the thread to finish syncing all of the files, even if there is nothing to write to disk. The actual problem is that while the thread is syncing all the files it holds a global lock, and until it finishes the database is almost useless. If it takes 50s to sync all the files, then the database can process requests for only 10s out of each 60s window...

If you have logging enabled in MongoDB you should see log entries like:

    Tue Apr 29 06:22:02.901 [DataFileSync] flushing mmaps took 33302ms for 376 files
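
To see the distribution of individual msync() call times on a running mongod you can use a pid-provider one-liner along these lines - a sketch only; it assumes the process is named mongod and that a single instance is running:

# dtrace -n 'pid$target:libc:msync:entry {self->ts = timestamp;}' \
         -n 'pid$target:libc:msync:return /self->ts/ {@["msync time (ns)"] = quantize(timestamp - self->ts); self->ts = 0;}' \
         -p $(pgrep -x mongod)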

On Linux this is much faster as it has a special optimization for such a case, which Solaris didn't have. However, Oracle fixed the bug some time ago and now the same database reports:

    Tue Apr 29 12:55:51.009 [DataFileSync] flushing mmaps took 9ms for 378 files

This is over a 3000x improvement!
 
The Solaris Bug ID is 18658199 "Speed up msync() on ZFS by 90000x with this one weird trick",
which is fixed in Solaris 11.1 SRU21 and also in Solaris 11.2.

Note that the fix only helps when an entire file is msync'ed and the underlying file system is ZFS. Any application with similar behavior will benefit. For some large mappings, like 1TB, the improvement can be as much as 178,000x.


Wednesday, April 09, 2014

Slow process core dump

Solaris limits the rate at which a process core dump is generated. There are two tunables with the default values core_delay_usec=10000 and core_chunk=32 (in pages). This essentially means that writing a process core dump is limited to 100 writes of 128KB per second (32x 4KB pages on x86), which is an equivalent of 12.5MB/s - for large processes this will take quite a long time.

I set core_chunk=128 and core_delay_usec=0 via mdb, which increased the core dump write rate from 12.5MB/s to about 300-600MB/s.
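
For reference, here is one way to poke the two variables at runtime with mdb and to make the change persistent across reboots via /etc/system - a sketch only; it assumes both tunables are 32-bit ints (use /Z instead of /W for 64-bit values) and the numbers should be adjusted to your environment:

# echo "core_chunk/W 0t128" | mdb -kw
# echo "core_delay_usec/W 0t0" | mdb -kw
# echo "set core_chunk=128" >> /etc/system
# echo "set core_delay_usec=0" >> /etc/system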

A long core dump generation can cause issues like a delay in restarting a service under SMF. On the other hand, making it too fast might overwhelm the OS and impact other processes running there. Still, I think the defaults are probably too conservative.

update: I just noticed that this has been fixed in Illumos, see https://illumos.org/issues/3673
 

Thursday, April 03, 2014

Home NAS - Data Corruption

I value my personal pictures, so I store them on a home NAS running Solaris/ZFS. The server has two data disks, mirrored. It had been running perfectly fine for the last two years, but recently one of the disk drives returned corrupted data during a ZFS scrub. If it weren't for ZFS I probably wouldn't even know that some pixels are wrong... or perhaps that some pictures are missing... This is exactly why I use ZFS on my home NAS.

# zpool status vault
  pool: vault
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 256K in 3h54m with 0 errors on Tue Mar 11 17:34:22 2014
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     2
            c5t0d0  ONLINE       0     0     0

errors: No known data errors

Friday, January 31, 2014

SMF: svcbundle

One of the things missing from SMF for me is the ability to mark a service to be disabled or enabled after reboot during the postinstall phase of OS installation. One way of doing it is to generate a site profile, but this is rather cumbersome to do from a postinstaller. There is a better way though - svcbundle(1M). For example, to generate a profile which will enable ssh after reboot (or after the manifest-import service is restarted):
svcbundle -s bundle-type=profile -s service-name=network/ssh \
          -s enabled=true >/etc/svc/profile/site/postinstall-ssh.xml
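The profile is picked up the next time the manifest-import service runs, so to apply it immediately without a reboot:

# svcadm restart svc:/system/manifest-import
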
You can modify any parameter you want. For example, to make the sendmail service use /etc/mail/cf/cf/nullclient.mc to generate its configuration each time it starts:
svcbundle -s bundle-type=profile -s service-name=network/smtp -s instance-name=sendmail \
          -s instance-property=config:path_to_sendmail_mc:astring:/etc/mail/cf/cf/nullclient.mc \
          >/etc/svc/profile/site/postinstall-sendmail-nullclient.xml
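
Once the profile has been applied, you can verify that the property took effect - a sketch, assuming the property group and name used above:

# svcprop -p config/path_to_sendmail_mc svc:/network/smtp:sendmail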

Thursday, January 23, 2014

mkdir() performance

Update: the fix is in Solaris 11.1 SRU17 and should be in Solaris 11.2 once it is out. Solaris now has an optimization similar to the Linux one. Network-based file systems like AFS or NFS benefit the most from it.

Recently I came across an issue where 'make install' on a Solaris server was taking *much* more time than on Linux. Files were being installed into an AFS file system. After some debugging I found that GNU install calls mkdir() for every directory component of a specified path and relies on getting EEXIST if a given directory already exists. For example:

$ truss -D -t mkdir /usr/bin/ginstall -c -d \
 /ms/dev/openafs/core/1.6.5-c3/compile/x86_64.sunos64.5.11/sunx86_511/dest/bin
0.0026 mkdir("/ms", 0755)                              Err#17 EEXIST
0.0003 mkdir("dev", 0755)                              Err#30 EROFS
0.0002 mkdir("openafs", 0755)                          Err#30 EROFS
0.0002 mkdir("core", 0755)                             Err#30 EROFS
0.0083 mkdir("1.6.5-c3", 0755)                         Err#17 EEXIST
3.0085 mkdir("compile", 0755)                          Err#17 EEXIST
3.0089 mkdir("x86_64.sunos64.5.11", 0755)              Err#17 EEXIST
0.0005 mkdir("sunx86_511", 0755)                       Err#17 EEXIST
0.0002 mkdir("dest", 0755)                             Err#17 EEXIST
0.0065 mkdir("bin", 0755)                              Err#17 EEXIST
$

Notice that two of the mkdir()s took about 3s each! Now if there are lots of directories to be ginstall'ed, it will take a very long time... I couldn't reproduce it on Linux though. What actually happens is that Linux checks at the VFS layer whether there is a valid dentry with an inode allocated, and if there is, it returns EEXIST without calling the file-system-specific mkdir operation (the equivalent of VOP_MKDIR). The relevant code is:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/namei.c
…
int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
{
        int error = may_create(dir, dentry);
        unsigned max_links = dir->i_sb->s_max_links;

        if (error)
               return error;

        if (!dir->i_op->mkdir)
               return -EPERM;
…
static inline int may_create(struct inode *dir, struct dentry *child)
{
        audit_inode_child(dir, child, AUDIT_TYPE_CHILD_CREATE);
        if (child->d_inode)
               return -EEXIST;
…

Unfortunately, Solaris doesn't have this optimization (though it does optimize a couple of other cases, for example EROFS), so each mkdir() results in VOP_MKDIR being called, and for AFS that means sending a request over the network to a file server and waiting for a reply. That alone makes it slower than on Linux, but it still doesn't explain the 3s.

It turned out that the AFS file server has a throttling mechanism - if a client generates requests which result in errors, then by default the server delays its answers to that client after 10 errors. This can be disabled, or the threshold can be adjusted - see the -abortthreshold option to the file server.
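
For example, the throttling can be tuned (or, with a threshold of 0, disabled) by passing the option to the file server - a sketch only; the placeholder stands for the existing fileserver arguments and how they are set depends on your bos configuration:

# /usr/afs/bin/fileserver <existing options> -abortthreshold 0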

This was also tested (by an Oracle Solaris engineer) over NFS and showed a 100x difference in response time. There is a negligible difference for local file systems.

A bug was opened against Solaris to get it fixed - see Bug 18115102 "mkdir(2) calls VOP_MKDIR() even though the directory or file already exists".

Hopefully it will get fixed soon.

Tuesday, January 21, 2014

Systemtap

I had to measure what's going on in the kernel for a specific VFS call, both on Linux and on Solaris. On Solaris it was quick and easy - I got an answer in a couple of minutes. On Linux... Systemtap didn't like the version of kernel-debuginfo installed (the error message wasn't that obvious either), so I had to uninstall it and install the proper one, as one can't have multiple versions installed - which would be useful with multiple kernels installed...

While I like Systemtap in principle, in practice every time I use it I find it very frustrating, especially when I have to measure something similar on Solaris with DTrace, which is usually so much easier and quicker to do.

Then again, we do not use Systemtap on prod servers... within minutes I unintentionally crashed a Linux box by tracing it with Systemtap :(

Now, Oracle Linux has DTrace... :)

Friday, January 17, 2014

AFS Client - better observability

Wouldn't it be nice if fsstat(1M) and the fsinfo::: DTrace provider worked with the AFS client? It turned out this was trivial to implement - Solaris itself will provide the necessary stats for fsstat and register the fsinfo::: probes if one extra flag is specified. Thanks to Andrew Deason for implementing it in AFS, and big thanks to Oracle's engineers for providing the information on how to do it. See the actual patch to AFS.

Let's see fsstat in action against an AFS mountpoint:
$ fsstat /ms 1
new  name   name  attr  attr lookup rddir  read read  write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0 10.5K     0   173K   939 20.7K 72.8M     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0    26     0    296     0     9 8.06K     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0     0     0      0     0     0     0     0     0 /ms
    0     0     0   170     0  2.36K    10   106  265K     0     0 /ms
    0     0     0   159     0  1.97K     2    90  261K     0     0 /ms
    0     0     0    25     0    331     0 1.05K 4.15M     0     0 /ms
    0     0     0   138     0  1.80K     0 1.48K 5.37M     0     0 /ms
    0     0     0   360     0  4.72K    12 1.42K 5.35M     0     0 /ms
    0     0     0   122     0  1.63K     0 1.30K 4.68M     0     0 /ms
Now the DTrace fsinfo::: provider showing AFS-related operations:
$ dtrace -q -n fsinfo:::'/args[0]->fi_fs == "afs"/ \
               {printf("%Y %s[%d] %s %s\n", walltimestamp, execname, pid, probename, args[0]->fi_pathname);}'

2014 Jan 16 16:49:07 ifstat[1485] getpage /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/aurora
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/aurora/bin
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/aurora
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/aurora/bin
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/fsf
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist
2014 Jan 16 16:49:07 ifstat[1485] lookup /ms/dist/fsf/bin
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/fsf
2014 Jan 16 16:49:07 ifstat[1485] inactive /ms/dist/fsf/bin
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 ifstat[1485] delmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 ifstat[1485] close /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[964] open /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 ifstat[964] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] open /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 tcpstat[965] addmap /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/lib/perl5/auto/Time/HiRes/HiRes.so
2014 Jan 16 16:49:07 tcpstat[1484] getpage /ms/dist/perl5/PROJ/core/5.8.8-2/.exec/ia32.sunos.5.10/bin/perl