Tuesday, July 22, 2014

Massive msync() speed up on ZFS

MongoDB uses mmap() to access all its data files. It also has a dedicated thread which by default wakes up every 60s and calls msync() on each of the mmap'ed files, one at a time. Right after you start MongoDB all these msync() calls are fast, assuming there are no modifications. However, if your server has hundreds of GBs of RAM (or more) and the database is about as big, the msync() calls get slower over time: the more data is cached in RAM, the longer they take. Eventually it can take 50s or more for the thread to sync all the files, even when there is nothing to write to disk. The real problem is that the thread holds a global lock while it syncs, so until it finishes the database is almost useless. If syncing takes 50s, the database can serve requests for only 10s out of each 60s window...
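
To make the pattern concrete, here is a minimal C sketch of what such a flush thread does: map a data file with mmap() and periodically msync() the whole mapping. The file name, the single-file loop and the 60s interval are illustrative assumptions, not MongoDB's actual code.

    /* Sketch only: mmap() a data file and flush the whole mapping every 60s.
     * The file name and the loop are placeholders, not MongoDB's real code. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile.0", O_RDWR);          /* hypothetical data file */
        if (fd == -1) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

        void *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        for (;;) {
            sleep(60);                                /* the periodic flush thread */
            /* Sync the entire mapping back to disk; this is the call that
             * gets slower the more of the file is cached in RAM. */
            if (msync(map, st.st_size, MS_SYNC) == -1)
                perror("msync");
        }
    }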

If you have logging enabled in MongoDB, you should see log entries like:

    Tue Apr 29 06:22:02.901 [DataFileSync] flushing mmaps took 33302ms for 376 files

On Linux this is much faster, as the kernel has a special optimization for this case, which Solaris lacked.
However, Oracle fixed the bug some time ago, and the same database now reports:

    Tue Apr 29 12:55:51.009 [DataFileSync] flushing mmaps took 9ms for 378 files

That is an improvement of over 3000x!
 
The Solaris bug is 18658199 "Speed up msync() on ZFS by 90000x with this one weird trick",
which is fixed in Solaris 11.1 SRU21 and in Solaris 11.2.

Note that the fix only helps when an entire file is msync()'ed and the underlying file system is ZFS. Any application with similar behavior would benefit. For very large mappings, such as 1TB, the improvement can be as much as 178,000x.
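
If you want to check whether a whole-mapping msync() hits the fast path on your system, a rough timing sketch like the one below can help. The default file name and the simplified error handling are assumptions for illustration.

    /* Time one msync() over an entire mapping of the given file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "datafile.0"; /* placeholder name */
        int fd = open(path, O_RDWR);
        if (fd == -1) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

        void *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        msync(map, st.st_size, MS_SYNC);   /* whole mapping: the case the fix targets */
        gettimeofday(&t1, NULL);

        long ms = (t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_usec - t0.tv_usec) / 1000;
        printf("msync of %lld bytes took %ld ms\n", (long long)st.st_size, ms);
        return 0;
    }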