VMware Cloud Community
paavo
Contributor

ESXi 4 with 3ware card lost access to volume due to connectivity issues.

Hello.

I have an ESXi 4 (Build 171294) server (without any updates; should I have installed some?) with an Intel Core i5 CPU, an Intel DP55WG motherboard and a 3ware 9650SE-12ML PCIe SATA RAID card with two RAID arrays.

RAID6: 10x 1.5TB Seagate Barracuda LP (ST31500541AS). RAID1: 2x 300GB WD VelociRaptor (WD3000HLFS).

I'm using 3ware's latest driver for VMware and the latest BIOS/firmware for both the motherboard and the 3ware card.

The problem is that I randomly get these events:

Lost access to volume 4ab5fa2b-86ce8030-8016-001b2144c3be (hba00ds0) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. info 5.10.2009 2:57:43

Successfully restored access to volume 4ac64ee9-59114344-5f4d-001b2144c3be (velo) following connectivity issues. info 5.10.2009 2:57:43

Lost access to volume 4ac64ee9-59114344-5f4d-001b2144c3be (velo) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. info 5.10.2009 2:57:35

Successfully restored access to volume 4ab5fa2b-86ce8030-8016-001b2144c3be (hba00ds0) following connectivity issues. info 5.10.2009 2:55:58

and the system log says:

3w-9xxx: scsi1: WARNING: (0x06:0x002c): Unit #0: Command (0x2a) timed out, resetting card.

and the VM's SCSI device times out.
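One thing I might try on the guest side as a workaround (just a sketch; /dev/sdb is only an example name for the data disk, and the default is usually 30 or 60 seconds) is raising the Linux SCSI disk timeout so a short controller reset doesn't immediately error out the filesystem:

  cat /sys/block/sdb/device/timeout
  echo 180 > /sys/block/sdb/device/timeout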

I first tested the array on CentOS 5.3 and ran some benchmarks without any problems.

Later I installed Debian 5.0 (lenny) and configured an 11TB LVM volume.

After the first timeout problems on the Debian box, my filesystem took a hit, was automatically remounted read-only and required an fsck.

I changed the virtual SCSI controller from LSI SAS to LSI Parallel (I tried it because the CentOS VM had used LSI Parallel without problems), generated a lot of disk I/O and didn't get that error again.
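For reference (a sketch of the setting, not a copy of my actual .vmx), the virtual controller type is picked by the scsi0.virtualDev line in the VM's .vmx file:

  LSI Logic Parallel:  scsi0.virtualDev = "lsilogic"
  LSI Logic SAS:       scsi0.virtualDev = "lsisas1068"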

The next night, when I transferred backups (over FTP, a few threads, about 30MB/s) from the old machine to this new box, I got those warnings again, but luckily the filesystem wasn't hit.

I'll paste more information if/when I get more errors...

--

Paavo Neuvonen

6 Replies
str1k3r
Enthusiast

Any solution?

paavo
Contributor

Still the same problem... so what has happened recently and what have I tried?

I installed the ESXi400-VEM-200907001 patch, tried changing the storpolicy settings, and disabled NCQ and autoverify. Still the same problem: timeouts on both the host and the guest machine. And once the whole box crashed. No purple screen, it just hung. At least that's what my friend told me; I don't have physical access to the server right now.
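For reference, these are roughly the tw_cli commands I mean (a sketch; it assumes the card is controller /c1 and the RAID6 is unit /u0, as in the output below, so double-check with tw_cli /c1 show first):

  tw_cli /c1 show                    (list units and ports on the controller)
  tw_cli /c1/u0 set qpolicy=off      (disable NCQ queuing for the unit)
  tw_cli /c1/u0 set autoverify=off   (disable scheduled auto-verify)
  tw_cli /c1/u0 set cache=off        (disable the unit's write cache)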

No BBU is installed; should we have one? I think I'll order it anyway.

I'll also try to contact 3ware's support, but here is some more information anyway.

Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
u0    RAID-6    OK      -       -       256K    11175.8   RiW    OFF
u1    RAID-1    OK      -       -       -       279.387   RiW    OFF

I'll attach a few log files here.

paavo_log_files.txt includes 4 log files:

  1. tw_cli_show_diag.txt, output from tw_cli /c1 show diag

  2. tw_cli_show_alarms.txt, output from tw_cli /c1 show alarms

  3. core_vmkernel_logs.txt from the ESXi host

  4. guest_kern.log from the Debian lenny 64-bit guest machine

paavo_log.jpg

  1. ESXi host disk usage; as you can see, not many transfers.

  2. vSphere logs; note that vSphere shows GMT+3 while the host is configured to use UTC.

  3. Debian guest configuration. And yes, VMware Tools is installed from Debian's own package.

There is one big LVM volume of about 10.5TB configured. Since the VMFS maximum file size is limited to 2TB, there are 5x 1.9TB virtual disks plus a sixth of about 1TB, combined into one big LVM volume with ext3 on top, roughly as sketched below.
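Roughly how that is put together inside the guest (a sketch only; the device names /dev/sdb through /dev/sdg and the bigvg/biglv names are just examples, check your own with fdisk -l):

  pvcreate /dev/sd[b-g]
  vgcreate bigvg /dev/sd[b-g]
  lvcreate -l 100%FREE -n biglv bigvg
  mkfs.ext3 /dev/bigvg/biglv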

DSTAVERT
Immortal

The BBU would certainly be a good choice. Write caching can improve performance, but it is only safe with a BBU. Whether that would cure your problem is doubtful, but who knows?

-- David -- VMware Communities Moderator
salmonj
Contributor

Paavo,

Welcome to the club; that is, we have the same problem here!

However, looking through your logs, I came across this:

E=1019 T=19:17:42 : Drive removed

task file written out : cd dh ch cl sn sc ft

: 00 00 00 00 00 00 00

E=1019 T=19:17:42 P=7h: Hard reset drive

P=7h: HardResetDriveWait

task file read back : st dh ch cl sn sc er

: 50 00 00 00 01 01 01

E=1019 T=19:17:42 P=7 : Soft reset drive

E=0207 T=19:17:42 P=7 : ResetDriveWait

E=1019 T=19:17:42 P=7 : Inserting Set UDMA command

E=1019 T=19:17:42 P=7 : Check power mode, active

E=1019 T=19:17:42 P=7 : Check drive swap, same drive

E=1019 T=19:17:42 P=7 : Check power cycles, initial=39, current=39, port=7

This basically means that the drive on port 7 experienced a momentary on/off/on transition on its interface at 19:17 on Oct 26. This can and will result in a card timeout; from what I have previously seen, hot-removing a drive makes the controller freeze for a short period of time.

So if I were you, I'd keep looking for this info to see if it happens on port 7 only (defective drive or cable) or on other ports randomly (drive firmware/controller compatibility issue).
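If it helps, these are roughly the tw_cli commands I use for that (a sketch, assuming your array is on controller /c1 as in the output you posted):

  tw_cli /c1/p7 show all    (status, model, firmware and error info for the port 7 drive)
  tw_cli /c1 show alarms    (the controller's AEN history)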

PZh

paavo
Contributor

Box crashed again.

I installed the BBU last Tuesday, and the box worked fine (no errors, but really slow anyway) with write cache disabled on the RAID6, but after 5 days it crashed. If I find something in the logs, I'll post it here. I don't have physical access to the box, so I have to wait for someone to go and reset it.

Salmonj, do you have write or read cache enabled? How about NCQ? Autoverify? How often do you get those errors? Any filesystem corruption?

The last time I checked, port 2 sometimes gave the same error too.

salmonj
Contributor

Paavo,

We've tried different combinations of NCQ/caching and line speed settings, and it did not help. The problem is not related to autoverify, as it runs weekly and completes OK. We do experience this problem under a specific load pattern (a specific snapshot of a filesystem inside one of the VMs), but it is hard to reproduce otherwise; e.g. introducing 350MB/sec r/w load and an insane IOPS rate does not reliably trigger the issue, while the filesystem snapshot does, quite reliably, under very low I/O, around 20-40MB/sec. We're going to try to upgrade the firmware on our SAS drives to see if it helps.

In your case, if you see drive-related issues in the logs, you should check the power supply, adapter cabling and drives. I strongly suspect we're talking about slightly different issues here, as we have no more than one reset per two weeks in production (and we get no resets at all if we skip the backup of the problematic VM); in your case it seems that you get a lot of resets. Look for drive errors in your controller logs after a card reset occurs: if it happens on specific ports, check the cables and drives on those; if on random ports, then look for firmware compatibility issues (drive firmware in the first place) or PSU-related problems. A quick way to gather the drive details is sketched below.
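One way to collect the drive models and firmware revisions in one go (a rough sketch from a Linux shell; adjust the controller number and port count, here 12 ports on /c1, and the exact output labels may vary between tw_cli versions):

  for p in $(seq 0 11); do
    tw_cli /c1/p$p show all | grep -iE 'model|firmware|status'
  done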

PZh
