Monday, May 15, 2006

T1000 arrived

Another T1000 arrived a few days ago for testing. I'm going to start testing it right after UNIX DAYS.

Wednesday, May 10, 2006

USDT enhancements

Adam Leventhal has added new features to USDT probes in DTrace. The is-enabled probes in particular are a great addition.

PSH + SMF = less downtime

Richard's Ranch has posted some info on Memory Page Retirement in Solaris, which is part of Predictive Self Healing. If you want more details, read: Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults.

What a coincidence, because just one day earlier one of our servers hit an uncorrectable memory error. Fortunately it happened in user space, so Solaris 10 just cleared that page and killed the affected application, and thanks to SMF the application was restarted automatically. It all happened not only automatically but also quickly enough that our monitoring detected the problem AFTER Solaris had already taken care of it and everything was working properly again.

Here is the report from /var/adm/messages about the memory problem.

May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 321281 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000c303b.ed832017
May 8 22:47:03 syrius.poczta.srv AFSR 0x00000000.00200000 AFAR 0x00000001.f0733b38
May 8 22:47:03 syrius.poczta.srv AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xffffffff7e7043c8
May 8 22:47:03 syrius.poczta.srv UDBH 0x00a0 UDBH.ESYND 0xa0 UDBL 0x02fc UDBL.ESYND 0xfc
May 8 22:47:03 syrius.poczta.srv UDBL Syndrome 0xfc Memory Module Board 6 J????
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 714160 kern.info] [AFT2] errID 0x000c303b.ed832017 PA=0x00000001.f0733b38
May 8 22:47:03 syrius.poczta.srv E$tag 0x00000000.18c03e0e E$State: Exclusive E$parity 0x0c
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x2d002d01.2d022d03
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x2d672d68.2d692d6a
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x2d6b2d09.2c912c92
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x2c932d0d.2d0e2d0f
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x2d102d11.2d122d13
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.09040000
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x000006ea.00002090
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x2091001c.2d1d2d1e *Bad* PSYND=0x00ff
May 8 22:47:03 syrius.poczta.srv unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000001.f0732000
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 863414 kern.info] [AFT3] errID 0x000c303b.ed832017 Above Error is in User Mode
May 8 22:47:03 syrius.poczta.srv and is fatal: will SIGKILL process and notify contract
May 8 22:47:20 syrius.poczta.srv unix: [ID 221039 kern.notice] NOTICE: Previously reported error on page 0x00000001.f0732000 cleared
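
If you want to check how long Solaris needed to clear a page, the two NOTICE lines in the report above are enough. Here is a minimal Python sketch that pairs them up. The regular expressions are my assumption based only on this one report, not on any documented message format, and the year is passed in by hand because syslog lines do not carry one. The names SAMPLE and clear_times are made up for this example.

```python
import re
from datetime import datetime

# Two lines copied from the report above.
SAMPLE = """\
May 8 22:47:03 syrius.poczta.srv unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000001.f0732000
May 8 22:47:20 syrius.poczta.srv unix: [ID 221039 kern.notice] NOTICE: Previously reported error on page 0x00000001.f0732000 cleared
"""

# Patterns guessed from the sample lines only (an assumption).
STAMP = r"(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)"
PAGE = r"(?P<page>0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})"
SCHEDULED = re.compile(STAMP + r".*Scheduling clearing of error on page " + PAGE)
CLEARED = re.compile(STAMP + r".*Previously reported error on page " + PAGE + r" cleared")


def parse_stamp(text, year):
    # syslog timestamps have no year, so the caller supplies it.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def clear_times(lines, year=2006):
    """Map each page address to the seconds between 'scheduled' and 'cleared'."""
    pending = {}
    result = {}
    for line in lines:
        m = SCHEDULED.search(line)
        if m:
            pending[m["page"]] = parse_stamp(m["stamp"], year)
            continue
        m = CLEARED.search(line)
        if m and m["page"] in pending:
            delta = parse_stamp(m["stamp"], year) - pending.pop(m["page"])
            result[m["page"]] = delta.total_seconds()
    return result


if __name__ == "__main__":
    for page, seconds in clear_times(SAMPLE.splitlines()).items():
        print(page, seconds)
```

For the report above this gives 17 seconds between scheduling the clear and the page being reported as cleared.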


Then, just by using 'svcs', I learned which application had been restarted and looked into that application's SMF log file, which contains (the application path is replaced with XXXXX):

[ May 8 22:47:03 Stopping because process killed due to uncorrectable hardware error. ]
[ May 8 22:47:03 Executing stop method ("XXXXXXXX stop") ]
[ May 8 22:47:04 Method "stop" exited with status 0 ]
bash: line 1: 22242 Killed LD_PRELOAD=libumem.so.1 XXXXXXXX
[ May 8 22:48:44 Executing start method ("XXXXXXXXX start") ]
[ May 8 22:48:46 Method "start" exited with status 0 ]
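
The SMF log above also tells you how long the application was actually down. A small Python sketch, assuming every SMF entry has the "[ timestamp message ]" shape seen in this excerpt (I only have this one log to go by), with the year supplied by hand. The names SAMPLE, smf_events and restart_seconds are hypothetical, chosen just for this example.

```python
import re
from datetime import datetime

# The log excerpt from above, application path still masked.
SAMPLE = """\
[ May 8 22:47:03 Stopping because process killed due to uncorrectable hardware error. ]
[ May 8 22:47:03 Executing stop method ("XXXXXXXX stop") ]
[ May 8 22:47:04 Method "stop" exited with status 0 ]
bash: line 1: 22242 Killed LD_PRELOAD=libumem.so.1 XXXXXXXX
[ May 8 22:48:44 Executing start method ("XXXXXXXXX start") ]
[ May 8 22:48:46 Method "start" exited with status 0 ]
"""

# Entry shape guessed from the excerpt (an assumption).
ENTRY = re.compile(r"^\[ (\w{3}\s+\d+\s+\d\d:\d\d:\d\d) (.*) \]$")


def smf_events(lines, year=2006):
    """Return (datetime, message) pairs for lines that look like SMF entries."""
    events = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((when, m.group(2)))
    return events


def restart_seconds(events):
    """Seconds from the first 'Stopping because' entry to the start method finishing."""
    stopped = next(t for t, msg in events if msg.startswith("Stopping because"))
    started = next(
        t for t, msg in events
        if t >= stopped and msg.startswith('Method "start" exited with status 0')
    )
    return (started - stopped).total_seconds()


if __name__ == "__main__":
    print(restart_seconds(smf_events(SAMPLE.splitlines())))
```

For this log it comes out at 103 seconds from the kill to the start method finishing. The output line from bash is skipped because it does not match the entry shape.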

UNIX DAYS - Gdansk 2006

UNIX DAYS - Gdansk 2006. This is the second edition of a conference about what's new in UNIX (in production). The conference will be held in Gdansk, Poland on May 18-19. This time we managed to get Wirtualna Polska, Sun, Symantec, EMC, Implix and the local university involved - thank you.

What is it about and how did it start?
In October 2005 I thought about creating a conference in Poland about new technologies in UNIX, made by sysadmins for sysadmins (or by geeks for geeks). The idea was to present new technologies without any marketing crap - just technical content. So I asked two of my friends to join me and help me make it real. Then I asked my company, Sun, Veritas and the local university to sponsor us - and they did. That way UNIX DAYS 2004 was born. The conference took two days and all the speakers were people who were actually using the technologies they talked about in production environments. I must say the conference was very well received. Unfortunately, due to lack of time there was no UNIX DAYS in 2005.

See you at UNIX DAYS!

Update: well, we had to close public registration after just one day - over two hundred people registered in one day and we have no free places left. It really surprised us.

PS. Our web pages are only in Polish - sorry about that.