Tuesday, March 31, 2009

ZFS Deduplication This Summer?

Jeff Bonwick wrote:
"Yes -- dedup is my (and Bill's) current project. Prototyped in December.
Integration this summer. I'll blog all the details when we integrate,
but it's what you'd expect of ZFS dedup -- synchronous, no limits, etc."

The CPU Overclocks itself

Joerg reports:
"With the announcement of Intel Nehalem support in Solaris, we pointed to some interesting features, but from my perspective the power-aware dispatcher is the most interesting one. I wrote a while ago about the turbo boost feature of the Nehalem processors. The processor overclocks itself, when there is still head room in the power and thermal budget. It can overclock a core even higher, when other cores are in deep sleep. Otherwise it can make sense not to use a core for a single process, when there is enough compute power available otherwise you could put this core into a deep sleep mode just to save power. The new power-aware dispatcher in Solaris is aware of this side conditions and can dispatch the processes in a System accordingly. You will find more informations at the projects website."

Thursday, March 26, 2009

Trying too hard

From time to time I see people trying to be too clever about a problem. Sometimes they try too hard to use the latest technologies when there is already a solution that does the job. Or, instead of taking a step back and a deep breath, they dive straight into problem solving and come up with crazy ways to accomplish something. I guess it happens to all of us from time to time. This time it happened to me :)

A colleague approached me with a problem he had on an old Solaris 7 server which is stripped down and customized, so there is no pargs command on it. He needed the full argument list of a running process, but ps truncates it to 80 characters. I thought a simple C program should be able to extract the information via /proc. So, trying to be helpful, I started writing it right away. After some time I came up with:

bash-2.05# cat pargs.c

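#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <procfs.h>
#include <sys/procfs.h>
#include <sys/prsystm.h>

#define MAX_ARGS 1024

int main(int argc, char *argv[])
{
    psinfo_t p;
    char *file;
    int fd;
    int fd_as;
    uintptr_t pargv[MAX_ARGS];
    char arg[1024];
    int nargs;
    ssize_t n;
    int i;

    if (argc != 3) {
        printf("Usage: %s /proc/PID/psinfo /proc/PID/as\n", argv[0]);
        return 1;
    }

    /* Read the process summary: it holds argc and the address of argv. */
    file = argv[1];
    fd = open(file, O_RDONLY);
    if (fd == -1) {
        printf("Can't open %s file\n", file);
        return 1;
    }
    if (read(fd, &p, sizeof(p)) != sizeof(p)) {
        printf("Can't read %s file\n", file);
        return 1;
    }
    close(fd);

    /* The address-space file gives access to the target's memory. */
    fd_as = open(argv[2], O_RDONLY);
    if (fd_as == -1) {
        printf("Can't open %s file\n", argv[2]);
        return 1;
    }

    printf("nlwp: %d\n", p.pr_nlwp);
    printf("exec: %s\n", p.pr_fname);
    printf("args: %s\n", p.pr_psargs);
    printf("argc: %d\n", p.pr_argc);

    /* Fetch the argv pointer array, then each argument string. */
    nargs = p.pr_argc > MAX_ARGS ? MAX_ARGS : p.pr_argc;
    n = pread(fd_as, pargv, nargs * sizeof (uintptr_t), p.pr_argv);
    if (n != (ssize_t)(nargs * sizeof (uintptr_t)))
        nargs = 0;
    for (i = 0; i < nargs; i++) {
        n = pread(fd_as, arg, 256, pargv[i]);
        if (n <= 0)
            break;
        arg[n] = '\0';    /* make sure the string is terminated */
        printf(" %s\n", arg);
    }

    close(fd_as);
    return 0;
}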


Job done.
Well, a couple of minutes later I realized that the UCB version of ps is able to show a long argument list...

bash-2.05# /usr/ucb/ps -axuww |grep "19179"
XXXX 19179 9.3 2.23998422056 ? S 11:02:30 0:02 /usr/java/bin/../bin/sparc/native_threads/java -classpath :./classes/packages/jakarta-regexp-1.3.jar:./classes/packages/classes12.zip:./classes/packages/mail.jar:./classes/packages/activation.jar:./classes/ MailSender

I had a good laugh at myself.

Tuesday, March 24, 2009

Library Interposer

Recently I used DTrace to change the output of the uname() syscall. But if one wants a more permanent and selective approach, it is easier to write a small library which interposes on uname() (well, actually on the uname() libc function and not the syscall itself). I slightly modified the malloc_interposer example.

After you have compiled the library, all you have to do is LD_PRELOAD it in your script so that everything started by that script will use it, or you can LD_PRELOAD it only for a given binary as shown below. Additionally, you have to set the environment variable uname_release to whatever string you like; otherwise the library won't change anything.

# uname -a
SunOS test-server 5.10 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V440
# uname_release="5.7" LD_PRELOAD=./uname_interposer.so uname -a
SunOS test-server 5.7 Generic_125100-10 sun4u sparc SUNW,Sun-Fire-V440

# cat uname_interposer.c
/* Based on http://developers.sun.com/solaris/articles/lib_interposers_code.html#malloc_interposer.c */
/* Example of a library interposer: interpose on
 * uname().
 * Build and use this interposer as follows:
 * cc -o uname_interposer.so -G -Kpic uname_interposer.c
 * setenv LD_PRELOAD $cwd/uname_interposer.so
 * run the app
 * unsetenv LD_PRELOAD
 */

#include <stdio.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

#include <sys/utsname.h>

int uname(struct utsname *name)
{
    int rc;
    char *release;

    static int (*uname_func)(struct utsname *) = NULL;

    /* Look up the real uname() in the next object, on the first call only. */
    if (uname_func == NULL)
        uname_func = (int (*)(struct utsname *)) dlsym(RTLD_NEXT, "uname");
    if (uname_func == NULL)
        return -1;

    rc = uname_func(name);

    /* Override the release field only when uname_release is set. */
    if ((release = getenv("uname_release")) != NULL)
        strlcpy(name->release, release, _SYS_NMLN);

    return rc;
}

# gcc -fPIC -g -o uname_interposer.so -G uname_interposer.c

Thursday, March 12, 2009

When Free is Too Expensive

I like Jonathan Schwartz's blog entries, and in his latest post he clarifies Sun's business model. I like the funny part about free software - how true it is.
"When Free is Too Expensive
One of my favorite customer stories relates to an American company that did nearly 30% of its yearly revenue on Christmas Day. They were a mobile phone company, whose handsets appeared under Christmas trees, opened en masse and provisioned on the internet within about a 48 hour period. When we won the bid to supply their datacenter, their CIO gave me the purchase order on the condition I gave him my home phone number. He said, "If I have any issues on Christmas, I want you on the phone making sure every resource available is solving the problem." I happily provided it (and then made sure I had my direct staff's home numbers). Christmas came and went, no problems at all.

A year later, he was issuing a purchase order to Sun for several of our software products. To have a little fun with him (and the Sun sales rep), I told him before he passed me the purchase order that the products were all open source, freely available for download.

He looked at me, then at his rep, and said "What? Then why am I paying you a million dollars?" I responded, "You can absolutely run it for free. You just can't call me on Christmas day, you'll be on your own." He gave me the PO. At the scale he was running, the cost of downtime dwarfed the cost of the license and support.

Numerically, most developers and technology users have more time than money. Most readers of this blog are happy to run unsupported software, and we are very happy to supply it. For a far smaller population, the price of downtime radically exceeds the price of a license or support - for some, the cost of downtime is measured in millions per minute. If you're tracking packages or fleets of aircraft, running an emergency response networking or a trading floor, you almost always have more money than time. And that's our business model, we offer utterly exceptional service, support and enterprise technologies to those that have more money than time. It's a good business."

Saturday, March 07, 2009

Open Storage - What's Next?

If you wonder what's coming in the storage area, and in ZFS in particular, watch the OpenSolaris Storage Summit. To get your attention, here is a list of some really exciting features coming to ZFS:

  • DeDuplication in ZFS
  • User Quotas in ZFS
  • Disk Eviction/Pool Shrinking
  • VSS Shadow Copies with ZFS Snapshots
  • Persistent L2ARC
  • ZFS Encryption
  • Lustre + ZFS
  • pNFS + ZFS

Wednesday, March 04, 2009

Oracle 8.0.6 on Solaris 10

I'm working on migrating Oracle 8.0.6 (32-bit) from Solaris 7 to Solaris 10. There is no branded zone for Solaris 7, so we decided to try running Oracle 8.0.6 directly on Solaris 10. Basically it just works. Basically... the problem was that some of the database files are larger than 2GB and Oracle fails to recover the database on these files. After checking some log files and a little bit of DTrace'ing I found out that it does a stat() syscall on each db file before recovery starts, and stat() fails with EOVERFLOW. So it uses the wrong API... but it seems to work fine on Solaris 7 with the same binaries. It turned out that while Oracle is starting it calls uname() to determine the OS version, and based on that information it can change its behavior (like not using the proper API to access large files). The easiest fix is to use DTrace to intercept the uname() syscall and put in fake output just before it returns. After that everything seems to be working fine.

The DTrace script below puts the string "5.7" into the uname() structure for every application calling uname() with uid=300 (oracle in my case). One might also write a small interposing library and LD_PRELOAD it while starting Oracle - that should also work.

#!/usr/sbin/dtrace -qs

#pragma D option destructive

syscall::uname:entry
/uid == 300/
{
    self->addr = arg0;
}
syscall::uname:return
/self->addr/
{
    copyoutstr("5.7", self->addr+(257*2), 257);
}