VMware Cloud Community
tbsky
Contributor

Lost connectivity to storage device and all vm become unknown

hi:

We use an LSI 20320IE PCIe Ultra320 SCSI card to connect an external disk array. ESXi 4.0 and the VM images are both installed on the disk array. We split one channel of the array into two LUNs: ESXi 4.0 is installed on the first LUN, and the second LUN stores the VM images.

Everything was fine until last month, when one ESXi 4.0 host started complaining about "Lost connectivity to storage device".

It tried to reconnect every second but never succeeded, and all the running VMs became "unknown". Even after a reboot they are still "unknown". Only the VMs that were not running at the moment connectivity was lost survive. The other VMs on the LUN seem to be locked and cannot be accessed. I had to delete the LUN and format it again.

We have three physical machines attached to disk arrays, and this has happened 4 times in the last two months. It seems to have started after the patch "ESXi400-201002001" in March. We also applied the April patch "ESXi400-201003001", but it still happened.

We noticed that when the disk array has some delay (for example, a disk read error), the SCSI bus is reset and ESXi shows a "lost connection" warning. It may recover after several seconds, but if the first recovery attempt fails, later attempts never succeed, and all running VMs die and become "unknown".
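The lost/recovered pattern described above can be checked mechanically once the event messages are exported to text. The following is only a minimal sketch: it assumes one event per line and matches on the phrases quoted in this thread, the sample lines are made up for illustration, and the real ESXi log format may differ.

```python
import re

# Phrases taken from the messages quoted in this thread; real log wording may differ.
LOST = re.compile(r"lost connectivity", re.IGNORECASE)
RESTORED = re.compile(r"restored? connectivity", re.IGNORECASE)


def summarize_events(lines):
    """Count lost/restored events and flag a loss that was never followed by a restore."""
    lost = 0
    restored = 0
    outstanding = False  # True while the most recent loss has not been restored
    for line in lines:
        if LOST.search(line):
            lost += 1
            outstanding = True
        elif RESTORED.search(line):
            restored += 1
            outstanding = False
    return {"lost": lost, "restored": restored, "unrecovered": outstanding}


# Hypothetical sample lines, not copied from a real host.
sample = [
    "Lost connectivity to storage device",
    "Restored connectivity to storage device",
    "Lost connectivity to storage device",
]
print(summarize_events(sample))
```

An "unrecovered" result of True corresponds to the bad case here: the last loss was never followed by a restore event.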

When ESXi complained about the failed recovery, the root LUN on the same SCSI channel was fine, and so was the other channel, which runs Linux with the same type of SCSI card. (Our SCSI disk array is dual-channel, and we split it between two physical machines.) So there seems to be some problem in the "mptspi" SCSI driver in ESXi, but I don't know the real cause.

Thanks a lot for any hints and help!

16 Replies
tbsky
Contributor

hi:

I restored all the images just yesterday, but all of them were gone this morning. The situation is the same: one disk in the array had a "read error", which caused a delay or bus reset on the array, and then VMware lost the connection and could not recover.

I searched the internet and found other people using the LSI 20320IE with ESXi 4.0 who had the same problem. So I think the current driver for this SCSI card has a timing problem.

Is there any way to tune the "mptspi" driver so that it can survive a SCSI delay or reset?

And is there a way to "unlock" a VMFS volume that lost its connection, after a reboot? I have to reformat the VMFS and restore all the images every time, which is really painful.

Thanks a lot for any help!

tbsky
Contributor

hi:

I checked the driver I am currently using: "cat /proc/mpt/version" => mptlinux-4.00.37.00.27vmw

I didn't see any driver update information in the ESXi updates, but in the ESX U1 update I saw the following:

VMware ESX 4.0, Patch ESX400-200911216-UG: Updates mptspi driver to version 4.00.37.00.27vmw-2vmw

So I wonder what the situation is for ESXi 4.0. I don't know whether my problem is because my driver is too new or too old.

Is "4.00.37.00.27vmw" the same as "4.00.37.00.27vmw-2vmw"?
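To make the comparison concrete, here is a small sketch that splits each string into its dotted numeric part and whatever follows it. The helper name `split_driver_version` is made up for this example. It only compares the strings themselves and says nothing about whether the two builds actually contain the same driver code.

```python
import re


def split_driver_version(text):
    """Split a driver string into (numeric base as a tuple, trailing suffix)."""
    match = re.search(r"(\d+(?:\.\d+)+)(.*)$", text)
    if match is None:
        raise ValueError(f"no dotted version found in {text!r}")
    base = tuple(int(part) for part in match.group(1).split("."))
    return base, match.group(2)


# The two strings quoted in this post.
installed = split_driver_version("mptlinux-4.00.37.00.27vmw")
patched = split_driver_version("4.00.37.00.27vmw-2vmw")

print("same numeric base:", installed[0] == patched[0])
print("suffixes:", repr(installed[1]), "vs", repr(patched[1]))
```

The numeric base is identical and only the suffix differs ("vmw" versus "vmw-2vmw"), so the strings alone cannot settle whether the extra "-2vmw" marks a different build.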

Thanks a lot for any help!

tbsky
Contributor

hi:

I rechecked the log of our disk array. The SCSI disk array has two host channels (one connected to Linux, one connected to ESXi 4). I found that when a "disk read error" occurs, only the VMware channel issues a "SCSI reset"; the Linux channel is fine and sees no reset. So the "SCSI reset" is issued by VMware, and for some reason VMware then decides it can no longer read the SCSI disk. Strangely, I can in fact still read files on the "lost" storage, as long as they were not locked by VMware at that moment. There seems to be a bug on the VMware side.

Originally I presented the storage as SCSI ID 0 with LUN 0 (for root) and LUN 1 (for the VM images). Since LUN 0 was never lost, I am now trying SCSI ID 0 and ID 1 (both LUN 0) for root and images, hoping that VMware keeps every LUN 0 the next time it issues a SCSI reset.

I would still like to hear how to "unlock" the failed VMFS. VMware has crashed 5 times on different machines in the last two months. Fortunately these images are below 100 GB, so I can use Linux (with vmfs-fuse) to copy them out, reformat the VMFS, and copy them back. But I have other VMware hosts that handle several TB of data. If they fail, I have nowhere to copy that data out to and back from. I must learn how to fix the VMFS without reformatting. I hope someone can give me a hint about how to "fix" or "unlock" VMFS.
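For the copy-out step, a sketch like the one below can copy a directory tree and verify each file by checksum, so that unreadable or mismatched files are listed instead of aborting the whole run. It assumes the VMFS volume is already mounted read-only on Linux through vmfs-fuse; the function names are made up for this example, and the demo at the end uses temporary directories in place of a real mount point.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_root, target_root):
    """Copy all files under source_root to target_root.

    Returns the relative paths of files that could not be read or copied,
    or whose checksum differs after the copy.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    failed = []
    for source in sorted(source_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_root)
        target = target_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            shutil.copyfile(source, target)
            if sha256_of(source) != sha256_of(target):
                failed.append(str(relative))
        except OSError:
            # Read errors on a damaged volume end up here; keep going.
            failed.append(str(relative))
    return failed


# Demo with temporary directories standing in for the mount point and the backup disk.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "vm1").mkdir()
    (Path(src) / "vm1" / "disk.vmdk").write_bytes(b"example data")
    failed = copy_and_verify(src, dst)
    copied = (Path(dst) / "vm1" / "disk.vmdk").read_bytes()
    print("failed files:", failed)
```

Reading every source file twice doubles the load on an already unstable disk, so on a failing array it may be preferable to drop the verification pass and only record copy errors.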

Thanks again for your kind help!

tbsky
Contributor

hi:

The VMs died again today. Changing the SCSI ID didn't help. One disk in the array is unstable, and the "read error" happens every day, so this disk array is a good test device for demonstrating the ESXi problem.

I will try to revert ESXi to a version prior to 4.0 U1, and hope that the "mptspi" driver there is different and doesn't have the problem.

In the meantime, I am copying the VM images off the dead VMFS. I searched Google and found nothing about how to fix a dead VMFS. I think that's the fate of a closed proprietary system. I switched from Xen to ESXi after Citrix bought Xen. Now the next step seems to be moving to KVM someday...

tbsky
Contributor

hi:

The original mptspi driver version is "4.00.37.00.23vmw", and after 4.0 U1 the driver became "4.00.37.00.27vmw". I am now using the original "4.00.37.00.23vmw". The disk "read error" happened again yesterday, and ESXi issued a "SCSI reset" as usual, but there was no "lost connectivity to volume..." event, and of course no need for a "restore connectivity to volume..." event.

I will keep monitoring it, but it looks like the new mptspi driver is broken. I hope VMware can fix the driver. If not, I hope people who use the mptspi driver to access a datastore can find this thread one day when they hit a disk error. It seems very few people use this driver, since nobody has replied to this thread. Or they are all lucky guys :)

tbsky
Contributor

hi:

Thanks to my unstable hard disk, I had several read errors during the month, and one of them caused a SCSI bus reset. But VMware was happy with all of them: there was no lost connectivity event, so no more worrying about recovery.

I think the old driver is quite stable, so my ESXi is stable again, like last year.

tbsky
Contributor

hi:

Unfortunately, the system crashed again today. It survived 6 read errors in recent days, but it died on the 7th.

The old driver seems better, but not good enough.

So I am totally out of ideas now. The unstable SCSI driver has become a nightmare, and VMFS is the most terrible filesystem I have ever met. I need to spend another 6 hours copying files out and back in again... sigh.

tbsky
Contributor

Warning! ESXi 4.1 is the same.

We had another crash this week, and one big VM is so dead that we cannot use Linux to copy it out.

I think this bug may never be fixed...

Now we are trying to move important services to new hardware.

tbsky
Contributor

hi:

ESXi 4.1 crashed again today. I will migrate to KVM soon. I hope this is the last time I need to copy data from a VMware file system :)

tbsky
Contributor

hi:

Another machine crashed with ESXi 4.1 and the latest hotfix. It's sad that RHEV 3.0 is delayed, so I need to copy all my VMs again.

I hope this time is really the last time.

tbsky
Contributor

hi:

The problem still exists in the latest ESXi build. Just a record for other poor guys.

viraj201110141
Contributor

Upgrade to ESXi 4.1 or 5.0, the latest one.

tbsky
Contributor

I upgraded my last test machine to ESXi 5 today, just to check whether ESXi can survive in our environment at all.

I think I need to wait two months or more to see the result. I also found that, happily, ESXi 5 still lets you use VMFS-3.

That's very important, because no other tools can access VMFS-5 at the moment. So if you get corruption on VMFS-5, that's the end of the story.

ShadowLight
Contributor

Hi,

I would like to know whether, by any chance, you were having this issue on an HP server. I have a similar issue where my storage becomes inaccessible; it shows a lost connection and then restores. My issue looks very similar to yours.

Have you had any luck so far with ESXi 5?

Regards,

tbsky
Contributor

hi:

I have to wait for a "read error" or some other delay on my hard disk, and that takes several months to happen. No useful information on ESXi 5 so far.

ShadowLight
Contributor

In case it can be useful for someone:

I've been investigating my issue, and HP has an official fix for it on the Smart Array 410i. They fixed this issue in firmware version 2.50 and later.

Maybe you could compare the chipset in the HP Smart Array 410i with yours and see whether you can find a compatible firmware that fixes it as well.

Keep us updated if you have any useful information about ESXi 5 :)

Regards,
