Friday, May 29, 2009


IBM has posted yet another marketing article. I don't know if it is funny or irritating - perhaps both. It is the usual pseudo-technical article from IBM - International Bureau for Misinformation? :) Don't get me wrong - I like IBM and I often admire what they are doing, not only in the server market but for science in general. It's just that their server division seems to be made up of marketing people and nothing more. And not entirely honest ones...

So let's go through some of the revelations in the article.

"HP-UX, HP's flavor of UNIX, is now up to release 11iV3. HP-UX is based on System V and runs on both HP9000 RISC servers and HP Integrity Itanium systems. In this respect, it is similar to Solaris, which can run on their SPARC RISC architecture, as well as x86 machines. AIX can only run on the POWER® architecture; however, given how UNIX is a high-end operating system, it is a positive thing that AIX and the POWER architecture are tightly integrated."

How ridiculous! Do they want to imply that HP-UX or Solaris are not tightly integrated with their respective RISC platforms? Of course they are. The fact is that AIX and HP-UX do not run on the most commonly used platform these days: x86/x64. And this is one of the reasons why they are dying platforms. Sure, they will stay in the market for a long time, mostly because a lot of enterprise customers won't or can't migrate off them quickly. But if you are building a new environment, in almost all cases there is no point, from the customer's perspective, in deploying AIX or HP-UX.

Later in the document there is a section called "Solaris innovations", however they do not actually list Solaris innovations, only a couple of selected feature updates in the 10/08 release (the 05/09 release has already been out for some time). From a marketing point of view it is very clever: if you are not reading the article carefully you would probably be under the impression that there aren't many innovations in Solaris... What about DTrace? SMF? FMA? Branded Zones? Resource Management? Recent improvements for the Intel platform (intelligent power management, MPO, etc.)?

Then they end the section with another astonishing claim:
"These recent improvements to ZFS are very important. When ZFS first came out, it looked incredible, but the root issue was a glaring omission in feature functionality. With this ability now added, Solaris compares favorably in many ways to JFS2 from AIX and VxFs from HP."
What do they smoke? There isn't even any sense in trying to argue with it; IBM can only wish it were true. You won't even find most of the ZFS features everyone cares so much about in IBM's JFS2. ZFS is years ahead of JFS2 - it's a completely different kind of technology and I doubt IBM will ever catch up by building on JFS2 - it would probably be easier to write an fs/lvm from scratch or port ZFS...

Later on they move to "AIX innovations" and they start with:
"AIX 6.1, first released about two years ago, is now available in two editions: standard, which includes only the base AIX, and the Enterprise edition, which includes workload partition manager and several Tivoli® products."
I hope it is not supposed to be an innovation... well, I prefer the Linux or Solaris model where you get ALL the features in the standard OS and entirely for free. And it doesn't matter if it is a low-end x86 server, your laptop, or a large SPARC server with 100+ cores and terabytes of memory... you still use the same Solaris with all the features, and for free if you do not need support.

One of the listed "innovations" in AIX is WPARs... well, they provided that functionality many years after Solaris had Zones (which in turn were inspired by BSD Jails). It will take at least a couple of years for WPARs to mature... Then they claim that "No other UNIX can boast the ability to move over running workloads on a workload partition from one system to another without shutting down the partition". Well, that is not true. You can live migrate LDoms on Solaris, you can live migrate xVM guests on Solaris and you can live migrate Xen guests on Linux. So IBM is trying to catch up here again...

Then there are a lot of false (or incomplete) claims about virtualization technologies in other OSes compared to AIX.

They claim AIX can do:
"Micro-partitioning: This feature allows you to slice up a POWER CPU on as many as 10 logical partitions, each with 1/10th of a CPU. It also allows for the capability of your system to exceed the amount of entitled capacity that the partition has been granted. It does this by allowing for uncapped partitions."
Well, it is all great but... I don't know about HP-UX, but on Solaris you can slice your CPU (be it SPARC or x86 or IBM's mainframes soon...) as much as you want, and of course it does support uncapped partitions (zones). You are not limited to just 10 of them - I know environments with many more than 10, and you can allocate 1/1000th of a CPU or less if you want - basically you don't have any artificial or design limits as in AIX.

Then they move to networking and complain that in Solaris you have to edit text files... how funny is that? btw: if you really want, you can use Webmin, which is a GUI for managing an OS and is delivered with Solaris - much easier than IBM's SMIT, and it works on Linux and a couple of other platforms too... or you can use Visual Panels in Open Solaris, which is more powerful still.

Later on they move to performance tuning and claim that using gazillions of different AIX commands is somehow easier than managing two text files on Solaris... well... whatever. Of course performance tuning is all about observability, because you first need to understand what you want to tune and why, and then measure the effect your changes introduce. Solaris is currently the most observable production OS on the planet, mostly due to DTrace. AIX is far, far behind in this respect.

Why is it so hard to find an article from IBM which at least tries to be objective? Why is everything they do an ultimate marketing machine? Maybe because it works in so many cases...

The plain truth is that AIX was once one of the innovative UNIX flavours in the market, but it fell behind many years ago. And while they do try to catch up here and there, they no longer lead the market and it is a dying platform. If it wasn't for a large legacy enterprise customer base it would have already shared the fate of OS/2. Sure, there are some enthusiasts - there always are for any product - but it doesn't change anything. The future seems to be with Linux, Open Solaris and Windows.

Thursday, May 28, 2009

Wednesday, May 27, 2009

L2ARC Turbo WarmUp

This has been integrated into snv_107.

"The L2ARC warms up at a maximum rate of l2arc_write_max per second,
which currently defaults to 8 Mbytes. This value was picked to minimise
the expense of both finding this data in the ARC, and writing it to current
read-bias SSDs - which maximises workload performance from warm L2ARC devices.

This value isn't ideal if the L2ARC devices are cold or cool - since we'd like
them to warm up as quick as possible. Also, since they are cold - there is
less concern for interfering with L2ARC hits by writing faster.
This also applies to shifts in workload, where the L2ARC is warm with content
that is no longer getting hits - and behaves as like it is cool.

This RFE is to dynamically increase the write rate when the current workload
isn't cached on the L2ARC (either because it is cold, or because the workload
has shifted); and to idle back the write rate when the L2ARC is warm."
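
To put the quoted 8 Mbytes per second default into perspective, here is a rough back-of-the-envelope sketch of how long a cold L2ARC device would take to fill at a constant write rate. The cache sizes and the doubled rate are my own illustrative assumptions, not values from the RFE, and the real feed logic is certainly more involved than a constant rate.

```python
# Rough fill-time arithmetic for an L2ARC device at a fixed write rate.
MB = 1024 ** 2
GB = 1024 ** 3

L2ARC_WRITE_MAX = 8 * MB  # default quoted in the RFE text, bytes per second


def warmup_hours(cache_bytes, write_rate=L2ARC_WRITE_MAX):
    """Hours needed to fill cache_bytes at a constant write_rate (bytes/s)."""
    return cache_bytes / write_rate / 3600


# Hypothetical cache sizes, chosen only for illustration.
for size_gb in (100, 600):
    base = warmup_hours(size_gb * GB)
    # Assumed 2x write rate while the device is cold (illustrative only).
    boosted = warmup_hours(size_gb * GB, 2 * L2ARC_WRITE_MAX)
    print(f"{size_gb} GB: {base:.1f} h at 8 MB/s, {boosted:.1f} h at 16 MB/s")
```

At the default rate a 100 GB device needs about 3.6 hours to fill, which is why writing faster while the device is cold is attractive.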

What's New in the Open Solaris 2009.06

Friday, May 15, 2009

Is it really a random load?

I'm running some benchmarks in the background and I wanted to verify that the workload filebench is generating is actually random within a large file. The file is 100GB in size, but the workload is supposed to do random reads to only the first 70GB of the file.

# dtrace -n io:::start'/args[2]->fi_name == "00000001"/ \
{@=lquantize(args[2]->fi_offset/(1024*1024*1024),0,100,10);}' \
-n tick-3s'{printa(@);}'
  0  49035                         :tick-3s

           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@                                   218788
              10 |@@@@@@                                   219156
              20 |@@@@@@                                   219233
              30 |@@@@@@                                   218420
              40 |@@@@@@                                   218628
              50 |@@@@@@                                   217932
              60 |@@@@@@                                   217572
              70 |                                         0

So it is true on both counts - access is evenly distributed across the first 70GB of the file, and nothing is read beyond it.
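
For anyone not familiar with lquantize(), the following is a minimal Python sketch of the same linear bucketing applied to simulated offsets. It is not how DTrace implements it; the function name, the random generator and the sample count are my own assumptions, used only to show what an evenly distributed workload over the first 70GB should look like in 10GB buckets.

```python
import random
from collections import Counter

GB = 1024 ** 3


def lquantize(values, low, high, step):
    """Count values into linear buckets, mimicking DTrace's lquantize()."""
    counts = Counter()
    for v in values:
        if v < low:
            counts[f"< {low}"] += 1
        elif v >= high:
            counts[f">= {high}"] += 1
        else:
            # Key each bucket by its lower bound, as in the DTrace output.
            counts[low + ((v - low) // step) * step] += 1
    return counts


# Simulated workload: uniform random byte offsets within the first 70 GB.
rng = random.Random(0)
offsets = [rng.randrange(70 * GB) for _ in range(70_000)]

# Same arguments as the one-liner: offset in GB, range 0..100, step 10.
hist = lquantize((off // GB for off in offsets), 0, 100, 10)
for bucket in range(0, 100, 10):
    print(f"{bucket:>4} | {hist.get(bucket, 0)}")
```

With a uniform workload the seven buckets from 0 to 60 end up with nearly equal counts and everything from 70 upwards stays empty, which is the same shape as the DTrace output.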

128Gflops per chip?

link 1

Thursday, May 14, 2009

Open Storage Wish List

1. L2ARC should survive reboots

There is already an RFE for this and AFAIK it's being worked on.

2. Ability to mirror ARC between cluster nodes

In some workloads it is important to be able to sustain a given level of performance. Warming up a cache can take hours, during which the delivered performance could be lower than usual. #1, once implemented, should partly fix the issue, but filling 128GB of cache could still take some time and negatively impact performance. I think the replication would not necessarily have to be synchronous. It would probably be hard to implement and maybe is not worth it...

3. L2ARC and SLOG SSDs shouldn't be included in disk drive IOPS stats.

While I haven't looked into it very closely it seems that when graphing IOPS for disk drives both L2ARC and SLOG numbers are included in totals. I think it is slightly misleading and a separate graph should be provided just for L2ARC and SLOG.

4. Ability to create a storage pool without L2ARC and/or SLOG devices even if they are present

While this is not necessarily important for production deployments it would help with testing/benchmarking so one doesn't have to physically remove SSDs in order to be able to build a pool without them.

Open Storage and Data Caching

I've been playing with an Open Storage 7410 recently. Although I've been using its GUI for quite some time thanks to the FishWorks beta program, it still amazes me how good it is, especially when you compare it to NetApp or Data Domain.

One of the really good things about Open Storage is that it allows for quite a lot of read/write cache (currently up to 128GB). If that is still not enough, you can add up to ~600GB of additional read cache in the form of SSDs. What it means in practice is that many real-life workloads will fit entirely into the cache, which in turn will provide excellent performance. In a way this is nothing new except for... economics! Try to find any other NAS product in the market where you can put in ~600GB of cache and stay within the same price range as Open Storage. You won't find anything like this.

I have created a disk pool out of 20x 1TB SATA disk drives which are protected with RAID-DP (aka RAIDZ2, which is an implementation of RAID-6). Now RAIDZ2 is known for very bad random read performance from multiple streams if data is not cached. Using filebench I ran a random read workload against a dataset of 10GB (let's say a small MySQL database) with 16 active streams. The 7410 appliance had been rebooted prior to the test so all caches were clean. As you can see in the screenshot below, at the beginning it was able to sustain ~400 NFSv3 operations per second. After about 50 minutes it delivered ~12,000 NFSv3 operations per second, which saturated my 1GbE link. Over the same period the average latency of NFS operations was getting smaller and smaller, as was the number of operations to physical disks. At some point all data was in cache and there were no operations to physical disks at all.

The appliance could certainly do much more if I used more GbE links or 10GbE links. Now remember that I used 20x 1TB SATA disk drives in a RAID-DP configuration to get this performance, and it could sustain it for workloads of up to ~600GB of working set size. To put these numbers into perspective: one 15K FC disk drive can deliver ~250 8KB random reads per second at most. You would need almost 100 such disk drives configured in RAID-10 to be able to match the performance, and you would still get less capacity (even assuming 300GB FC 15K drives).
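
The arithmetic behind these numbers can be sketched as follows. The 8KB size per NFS operation is my assumption (chosen to match the per-drive figure), and the "one drive's worth of reads per mirror pair" case is only my reading of how a conservative count approaches 100 drives; if both sides of every mirror serve reads, half as many spindles would do.

```python
import math

NFS_OPS_PER_SEC = 12_000      # observed once the working set was cached
DRIVE_IOPS = 250              # ~8KB random reads/s for one 15K FC drive
IO_SIZE = 8 * 1024            # assumed bytes per operation
FC_DRIVE_GB = 300             # assumed FC drive size, as in the text

# Payload moved over the wire, before any NFS/TCP/Ethernet overhead.
payload_mbit = NFS_OPS_PER_SEC * IO_SIZE * 8 / 1e6

# Spindles needed if every drive in the RAID-10 set serves reads.
drives_all_read = math.ceil(NFS_OPS_PER_SEC / DRIVE_IOPS)

# Conservative case: count only one drive's reads per mirror pair.
drives_per_pair = 2 * drives_all_read

# Usable RAID-10 capacity in the conservative case, in TB (decimal).
raid10_usable_tb = drives_per_pair * FC_DRIVE_GB / 2 / 1000

print(f"payload: {payload_mbit:.0f} Mbit/s")
print(f"drives needed: {drives_all_read} to {drives_per_pair}")
print(f"RAID-10 usable capacity: {raid10_usable_tb} TB vs 20 TB raw SATA")
```

Roughly 790 Mbit/s of payload is in line with a saturated 1GbE link once protocol overhead is added, and even the larger FC configuration ends up with less usable space than the SATA pool starts with.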

Open Storage is a game changer for a lot of workloads, both in terms of delivered performance and cost - currently there isn't really anything in the market which can match it.

Tuesday, May 05, 2009

GNU grep vs. grep

Open Solaris b111 x64
$ ptime /usr/gnu/bin/grep mysqld 1_4GB_txt_file

real 1:32.055472017
user 1:30.202692546
sys 0.907308690

$ ptime /usr/bin/grep mysqld 1_4GB_txt_file

real 8.725173958
user 7.621411130
sys 1.056151347

I guess it's due to the GNU version being compiled without optimizations... or maybe it is something else. Once I find some time I will try to investigate it.