Thursday, January 21, 2010

The EC gives a green light

The European Commission clears Oracle's proposed acquisition of Sun Microsystems:
"The European Commission has approved under the EU Merger Regulation the proposed acquisition of US hardware and software vendor Sun Microsystems Inc. by Oracle Corporation, a US enterprise software company. After an in-depth examination, launched in September 2009 (see IP/09/1271 ), the Commission concluded that the transaction would not significantly impede effective competition in the European Economic Area (EEA) or any substantial part of it."

Friday, January 15, 2010

MySQL TOP

This blog entry has been updated:
  • added NCQRS, NCQRS/s columns
  • fixed issue with dtrace dropping variables if the script was running for extended time periods
  • cosmetic changes to the output

I need to observe MySQL load from time to time and DTrace is one of the tools I use. Usually I use one-liners or come up with a short script. This time I thought it would be nice to write a script that other people, like DBAs, could use without having to understand how it actually works. The script prints basic statistics for each client connecting to a database, giving a nice overview of all clients using it.

CLIENT IP          CONN CONN/s    QRS QRS/s  NCQRS NCQRS/s  TIME VTIME
10.10.10.35          10      0     61     0     32       0     0     0
10.10.10.30          17      0     73     0     73       0     0     0
10.10.10.100         52      0     90     0     90       0     0     0
xx-www-11.portal     92      0    249     0     48       0     0     0
xx-cms-1.portal      95      0   1795     5   1669       4    48    48
xx-www-9.portal     198      0    634     1    278       0     0     0
xx-www-13.portal    239      0    986     2    366       1     0     0
xx-www-3.portal     266      0   1028     2    455       1     1     0
xx-www-12.portal    266      0   1070     2    561       1     3     2
xx-www-5.portal     300      0   1431     3    593       1     2     2
xx-www-10.portal    333      0   1221     3    678       1     3     2
xx-www-6.portal     334      0    989     2    446       1     1     0
xx-www-8.portal     358      0   1271     3    497       1     1     0
xx-www-4.portal     395      1   1544     4    744       2     0     0
xx-www-2.portal     445      1   1729     4    764       2     3     2
xx-www-1.portal     962      2   3555     9   1670       4    22    21
xx-www-7.portal    1016      2   3107     8   1643       4   117   115
                 ====== ====== ====== ===== ====== ======= ===== =====
                   5378     14  20833    58  10607      29   207   199
Running for 359 seconds.

CONN      total number of connections
CONN/s    average number of connections per second
QRS       total number of queries
QRS/s     average number of queries per second
NCQRS     total number of executed queries
NCQRS/s   average number of executed queries per second
TIME      total clock time in seconds for all queries
VTIME     total CPU time in seconds for all queries

The NCQRS column represents the number of queries which were not served from the MySQL Query Cache, while QRS represents all queries issued to MySQL (cached, non-cached, or even invalid queries). If the values of VTIME are very close to the values of TIME, the queries are mostly CPU bound; the bigger the difference between them, the more time is spent on I/O. Another interesting thing to watch is how evenly the load comes in from different clients, especially in environments where the clients are identical www servers behind a load balancer and should be generating about the same traffic to the database.

All values are measured since the script was started. There might be small discrepancies between the per-client values and the summary line - this is due to rounding errors. The script should work for MySQL versions 5.0.x and 5.1.x, and perhaps for other versions as well. It doesn't take connections made over a socket file into account - only TCP/IP connections.

The script requires the PID of a mysql database as its first argument and the frequency at which the output should be refreshed as its second argument. For example, to monitor the mysql instance with PID 12345 and refresh the output every 10s:

./mysql_top.d 12345 10s



# cat mysql_top.d
#!/usr/sbin/dtrace -qCs

/*
Robert Milkowski
*/

/* map well-known client IPs to friendly names; other clients print as dotted-quad IPs */
#define CLIENTS self->client_ip == "10.10.10.11" ? "xx-www-1.portal" : \
self->client_ip == "10.10.10.12" ? "xx-www-2.portal" : \
self->client_ip == "10.10.10.13" ? "xx-www-3.portal" : \
self->client_ip == "10.10.10.14" ? "xx-www-4.portal" : \
self->client_ip == "10.10.10.15" ? "xx-www-5.portal" : \
self->client_ip == "10.10.10.16" ? "xx-www-6.portal" : \
self->client_ip == "10.10.10.17" ? "xx-www-7.portal" : \
self->client_ip == "10.10.10.18" ? "xx-www-8.portal" : \
self->client_ip == "10.10.10.19" ? "xx-www-9.portal" : \
self->client_ip == "10.10.10.20" ? "xx-www-10.portal" : \
self->client_ip == "10.10.10.21" ? "xx-www-11.portal" : \
self->client_ip == "10.10.10.22" ? "xx-www-12.portal" : \
self->client_ip == "10.10.10.23" ? "xx-www-13.portal" : \
self->client_ip == "10.10.10.29" ? "xx-cms-1.portal" : \
self->client_ip

#define LEGEND "\n \
CONN total number of connections \n \
CONN/s average number of connections per second \n \
QRS total number of queries \n \
QRS/s average number of queries per second \n \
NCQRS total number of executed queries \n \
NCQRS/s average number of executed queries per second \n \
TIME total clock time in seconds for all queries \n \
VTIME total CPU time in seconds for all queries\n"

BEGIN
{
start = timestamp;
total_queries = 0;
total_nc_queries = 0;
total_conn = 0;
total_time = 0;
total_vtime = 0;
}

syscall::getpeername:entry
/ pid == $1 /
{
self->in = 1;

self->arg0 = arg0; /* int s */
self->arg1 = arg1; /* struct sockaddr * */
self->arg2 = arg2; /* size_t len */
}

syscall::getpeername:return
/ self->in /
{
/* copy in the sockaddr filled in by getpeername() */
this->len = *(socklen_t *) copyin((uintptr_t)self->arg2, sizeof(socklen_t));
this->socks = (struct sockaddr *) copyin((uintptr_t)self->arg1, this->len);

/* sa_data[0-1] hold the peer's TCP port in network byte order;
   mask with 0xFF as sa_data is a signed char array */
this->hport = (uint_t)(this->socks->sa_data[0] & 0xFF);
this->lport = (uint_t)(this->socks->sa_data[1] & 0xFF);
this->hport <<= 8; this->port = this->hport + this->lport;

/* sa_data[2-5] hold the peer's IPv4 address, one octet each */
this->a1 = lltostr((uint_t)(this->socks->sa_data[2] & 0xFF));
this->a2 = lltostr((uint_t)(this->socks->sa_data[3] & 0xFF));
this->a3 = lltostr((uint_t)(this->socks->sa_data[4] & 0xFF));
this->a4 = lltostr((uint_t)(this->socks->sa_data[5] & 0xFF));

/* build the dotted-quad string */
this->s1 = strjoin(this->a1, ".");
this->s2 = strjoin(this->s1, this->a2);
this->s1 = strjoin(this->s2, ".");
this->s2 = strjoin(this->s1, this->a3);
this->s1 = strjoin(this->s2, ".");
self->client_ip = strjoin(this->s1, this->a4);

@conn[CLIENTS] = count();
@conn_ps[CLIENTS] = count();

total_conn++;

self->arg0 = 0;
self->arg1 = 0;
self->arg2 = 0;
}

pid$1::*close_connection*:entry
/ self->in /
{
self->in = 0;
self->client_ip = 0;
}

pid$1::*mysql_parse*:entry
/ self->in /
{
self->t = timestamp;
self->vt = vtimestamp;

@query[CLIENTS] = count();
@query_ps[CLIENTS] = count();

total_queries++;
}

pid$1::*mysql_parse*:return
/ self->in /
{
@time[CLIENTS] = sum(timestamp-self->t);
@vtime[CLIENTS] = sum(vtimestamp-self->vt);

total_time += (timestamp - self->t);
total_vtime += (vtimestamp - self->vt);

self->t = 0;
self->vt = 0;
}

pid$1::*mysql_execute_command*:entry
/ self->in /
{
@nc_query[CLIENTS] = count();
@nc_query_ps[CLIENTS] = count();

total_nc_queries++;
}

tick-$2
{
/* clear the screen and move cursor to top left corner */
printf("\033[H\033[J");

this->seconds = (timestamp - start) / 1000000000;

normalize(@conn_ps, this->seconds);
normalize(@query_ps, this->seconds);
normalize(@nc_query_ps, this->seconds);
normalize(@time, 1000000000);
normalize(@vtime, 1000000000);

printf("%-16s %s %s %s %s %s %s %s %s\n", \
"CLIENT IP", "CONN", "CONN/s", "QRS", "QRS/s", "NCQRS", "NCQRS/s", "TIME", "VTIME");
printa("%-16s %@6d %@5d %@6d %@5d %@6d %@5d %@5d %@5d\n", \
@conn, @conn_ps, @query, @query_ps, @nc_query, @nc_query_ps, @time, @vtime);
printf("%-16s %s %s %s %s %s %s %s %s\n", \
"", "======", "=====", "======", "=====", "======", "=====", "=====", "=====");
printf("%-16s %6d %5d %6d %5d %6d %5d %5d %5d\n", "", \
total_conn, total_conn/this->seconds, total_queries, total_queries/this->seconds, \
total_nc_queries, total_nc_queries/this->seconds, \
total_time/1000000000, total_vtime/1000000000);

/*
denormalize(@conn_ps);
denormalize(@query_ps);
denormalize(@nc_query_ps);
denormalize(@time);
denormalize(@vtime);
*/

printf("Running for %d seconds.\n", this->seconds);

printf(LEGEND);
}

Thursday, January 14, 2010

MySQL Query Time Histogram

When doing MySQL performance tuning on a live server it is often hard to tell what the impact will be on all queries: by increasing one of the MySQL caches you can make some queries execute faster while others actually get slower. Depending on your environment, that is not necessarily a bad thing. For example, in web serving, if most queries execute within 0.1s but some odd queries need 5s to complete, it is generally very bad, as a user would need to wait at least 5s to get a web page. If by some tuning you manage to get these long queries down to below 1s, at the cost of some sub-0.1s queries taking more time but still less than 1s, it would generally be a very good trade. Of course in other environments the time requirements might be different, but the principle is the same.

With DTrace it is actually very easy to get such a distribution - the number of queries executed by a given MySQL instance within each time slot.

  1s resolution
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4700573
               1 |                                         6366
               2 |                                         35
               3 |                                         23
               4 |                                         39
               5 |                                         8
               6 |                                         6
               7 |                                         5
               8 |                                         7
               9 |                                         4
           >= 10 |                                         9

Running for 73344 seconds.

The above histogram shows that 4.7 million queries executed in under 1s each, another 6366 queries took between 1s and 2s each, and so on. Now let's do some tuning and look at the results again (of course you want to measure for a similar amount of time during a similar period of activity - these are just examples):

  1s resolution
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4686051
               1 |                                         2972
               2 |                                         0

Running for 73024 seconds.

That is much better. It is of course very easy to change the resolution of the histogram - a quick sketch follows.
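For example, a variant with 100ms buckets might look like this (a minimal, untested sketch using the same probes as the full script below):

#!/usr/sbin/dtrace -qs

pid$1::*mysql_parse*:entry
{
self->t=timestamp;
}

pid$1::*mysql_parse*:return
/ self->t /
{
/* elapsed nanoseconds converted to 100ms units; buckets cover 0-5s */
@["100ms resolution"]=lquantize((timestamp-self->t)/100000000,0,50);

self->t=0;
}

tick-$2
{
printa(@);
}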

The script requires two arguments - the PID of a database and how often it should refresh its output. For example, to get output every 10s for a database running with PID 12345, run the script as:

./mysql_query_time_distribution.d 12345 10s

The script doesn't distinguish between cached and non-cached queries, nor does it detect bad (wrong syntax) queries - however, it is relatively easy to extend it to do so (maybe another blog entry one day; a rough sketch follows the script below). It should work fine with all MySQL 5.0.x and 5.1.x versions, possibly with other versions as well.


# cat mysql_query_time_distribution.d
#!/usr/sbin/dtrace -qs


BEGIN
{
start=timestamp;
}

pid$1::*mysql_parse*:entry
{
self->t=timestamp;
}

pid$1::*mysql_parse*:return
/ self->t /
{
@["1s resolution"]=lquantize((timestamp-self->t)/1000000000,0,10);

self->t=0;
}

tick-$2
{
printa(@);
printf("Running for %d seconds.\n", (timestamp-start)/1000000000);
}
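
As a rough sketch of such an extension (untested), one can reuse the observation from the MySQL TOP entry above: queries served from the MySQL Query Cache never reach mysql_execute_command(), so a flag set there splits the histogram into cached and non-cached queries:

#!/usr/sbin/dtrace -qs

pid$1::*mysql_parse*:entry
{
self->t=timestamp;
self->executed=0;
}

/* cached queries never reach mysql_execute_command() */
pid$1::*mysql_execute_command*:entry
/ self->t /
{
self->executed=1;
}

pid$1::*mysql_parse*:return
/ self->t /
{
@[self->executed ? "non-cached, 1s resolution" : "cached, 1s resolution"] = lquantize((timestamp-self->t)/1000000000,0,10);

self->t=0;
self->executed=0;
}

tick-$2
{
printa(@);
}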

Thursday, January 07, 2010

Identifying a Full Table Scan in MySQL with DTrace

Yesterday I was looking at some performance issues with a mysql database. The database is version 5.1.x, so there are no built-in DTrace SDT probes, but much can be done even without them. What I quickly noticed was that mysql was issuing several hundred thousand syscalls per second, most of them pread()s and read()s. The databases use the MyISAM engine, so mysql does not have a data buffer cache and leaves all the caching to the filesystem. I was interested in how many reads were performed per query, so I wrote a small dtrace script. The script takes as arguments a time after which it will exit and a threshold representing the minimum number of [p]read()s per query for that query to be printed.

So let's see example output where we are interested only in queries which cause at least 10000 reads to be issued:
# ./m2.d 60s 10000
### read() count: 64076 ###
SELECT * FROM clip WHERE parent_id=20967 AND type=4 ORDER BY quality ASC

### read() count: 64076 ###
SELECT * FROM clip WHERE parent_id=14319 AND type=4 ORDER BY quality ASC

### read() count: 64076 ###
SELECT * FROM clip WHERE parent_id=20968 AND type=4 ORDER BY quality ASC

There are about 60k entries for the parent_id column, which suggests that mysql is doing a full table scan when executing the above queries. A quick check within mysql revealed that there was no index on the parent_id column, so mysql was indeed doing full table scans.
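
The check and the fix might have looked roughly like this (a sketch; idx_parent_id is a made-up name):

mysql> EXPLAIN SELECT * FROM clip WHERE parent_id=20967 AND type=4 ORDER BY quality ASC\G
...
         type: ALL
possible_keys: NULL
...

mysql> CREATE INDEX idx_parent_id ON clip (parent_id);

After the index was created: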

# ./m2.d 60s 1
[filtered out all unrelated queries]
### read() count: 6 ###
SELECT * FROM clip WHERE parent_id=22220 AND type=4 ORDER BY quality ASC

### read() count: 8 ###
SELECT * FROM clip WHERE parent_id=8264 AND type=4 ORDER BY quality ASC

### read() count: 4 ###
SELECT * FROM clip WHERE parent_id=21686 AND type=4 ORDER BY quality ASC

### read() count: 4 ###
SELECT * FROM clip WHERE parent_id=21687 AND type=4 ORDER BY quality ASC

So now each query is issuing about four orders of magnitude fewer reads()!

Granted, all these reads were satisfied from the ZFS ARC, but it still saves hundreds of thousands of unnecessary context switches and memory copies, making the queries *much* quicker to execute and saving valuable CPU cycles. The real issue I was working on was a little more complicated, but you get the idea.

The point I'm trying to make here is that although MySQL lacks good tools to analyze its workload, you have a very powerful tool called dtrace which lets you relatively quickly identify which queries are causing an issue and why - all of that on a running live service, without having to reconfigure or restart mysql. I know there is the MySQL Query Analyzer (or whatever it is called), but it requires a mysql proxy to be deployed... In this case it was much quicker and easier to use dtrace.

Below you will find the script. Please note that I hard-coded the PID of the database and the script could be cleaned up, etc. - it is the working copy I used. The script can easily be modified to provide lots of additional useful information, or limited to only a specific myisam file, etc.

# cat m2.d
#!/usr/sbin/dtrace -qs

#pragma D option strsize=8192


/* a new query arrives; arg1 points to the query text */
pid13550::*mysql_parse*:entry
{
self->a=1;
self->query=copyinstr(arg1);
self->count=0;
}

/* query finished and issued more than $2 [p]read()s - print it */
pid13550::*mysql_parse*:return
/ self->a && self->count > $2 /
{
printf("### read() count: %d ###\n%s\n\n", self->count, self->query);

self->a=0;
self->query=0;
}

/* reset state for queries below the threshold */
pid13550::*mysql_parse*:return
/ self->a /
{
self->a=0;
self->query=0;
}

/* count every read()/pread() issued while a query is being processed */
syscall::*read*:entry
/ self->a /
{
self->count++;
}

tick-$1
{
exit(0);
}

Tuesday, January 05, 2010

zpool split

PSARC/2009/511 zpool split:
OVERVIEW:

Some practices in data centers are built around the use of a volume
manager's ability to clone data. An administrator will attach a set of
disks to mirror an existing configuration, wait for the resilver to
complete, and then physically detach and remove those disks to a new
location.

Currently in zfs, the only way to achieve this is by using zpool offline
to disable a set of disks, zpool detach to permanently remove them after
they've been offlined, move the disks over to a new host, zpool
force-import of the moved disks, and then zpool detach the disks that were
left behind.

This is cumbersome and prone to error, and even then the new pool
cannot be imported on the same host as the original.

PROPOSED SOLUTION:

Introduce a "zpool split" command. This will allow an administrator to
extract one disk from each mirrored top-level vdev and use them to create
a new pool with an exact copy of the data. The new pool can then be
imported on any machine that supports that pool's version.
The new feature should be available in build 131.
See implementation details.
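
Based on the proposal, the workflow might look something like this (hypothetical pool names; the final syntax may differ):

# zpool split tank tank2
# zpool import tank2

The split should leave the new pool in the exported state, so it can be re-imported under a new name on the same host, or the disks can be moved and the pool imported on another machine.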