VMware Cloud Community
omidkosari
Contributor

One failed HDD on VSAN caused a VMDK to be lost! Help

Hello,

I have 5 ESXi 5.5 servers with 20 TB of VSAN storage. Suddenly one of the VMs got a soft lockup error. I tried to reboot it, but it froze. After some investigation I found that one 1 TB HDD had failed. I was not worried because we have VSAN.

But now the VM cannot start. In the VSAN datastore we can see that the VM's 1 TB VMDK is lost!

Please take a look at the attachments.

ramakrishnak
VMware Employee

The problem seems to be that a double fault has occurred, and your policy can only tolerate one.

That is, both RAID-0 copies of the RAID-1 (with at least one failed component in each) are in a bad state. Your VM storage policy screenshot confirms this.

You should check why the other RAID-0 component went down. It is on a different spinning disk (the disk UUID can be found in the VM storage policy screenshot).
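
To illustrate why one failed disk is survivable but two are not, here is a deliberately simplified model in Python. It assumes a policy with one failure to tolerate laid out as a RAID-1 of two RAID-0 copies (each striped over two disks) plus one witness, and it assumes one vote per component. Real VSAN vote assignment and component states are more involved, and every class, function, and disk name below is hypothetical, not a VSAN API. The rule it encodes: an object stays accessible only if more than half of the votes are on active components AND at least one RAID-0 copy has all of its stripes present.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    ACTIVE = "active"
    ABSENT = "absent"  # e.g. the disk holding this component has failed


@dataclass
class Component:
    disk_uuid: str              # hypothetical label for the backing disk
    state: State = State.ACTIVE
    votes: int = 1              # assumption: one vote per component


@dataclass
class Raid0Replica:
    components: List[Component]

    def is_intact(self) -> bool:
        # A striped (RAID-0) copy is only readable if every stripe is present.
        return all(c.state is State.ACTIVE for c in self.components)


@dataclass
class Raid1Object:
    replicas: List[Raid0Replica]
    witnesses: List[Component] = field(default_factory=list)

    def _all_components(self) -> List[Component]:
        return [c for r in self.replicas for c in r.components] + self.witnesses

    def has_quorum(self) -> bool:
        # More than half of all votes must belong to active components.
        comps = self._all_components()
        total = sum(c.votes for c in comps)
        active = sum(c.votes for c in comps if c.state is State.ACTIVE)
        return 2 * active > total

    def is_accessible(self) -> bool:
        # Needs quorum AND at least one complete copy of the data.
        return self.has_quorum() and any(r.is_intact() for r in self.replicas)


def make_object() -> Raid1Object:
    # Two RAID-0 copies, each striped over two disks, plus one witness.
    a = Raid0Replica([Component("disk-a1"), Component("disk-a2")])
    b = Raid0Replica([Component("disk-b1"), Component("disk-b2")])
    return Raid1Object([a, b], [Component("disk-w")])


if __name__ == "__main__":
    single = make_object()
    single.replicas[0].components[0].state = State.ABSENT
    print("one failed disk:", single.is_accessible())   # True

    double = make_object()
    double.replicas[0].components[0].state = State.ABSENT
    double.replicas[1].components[1].state = State.ABSENT
    print("two failed disks:", double.is_accessible())  # False
```

Note that in the double-fault case the model still has quorum (3 of 5 votes), yet the object is inaccessible because neither RAID-0 copy is complete, which matches the situation described above.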

Thanks,

joergriether
Hot Shot

Yes, but look at the first RAID-0 object: there are 5(!) components (hard drives) missing there, at least as far as VSAN can "see". So I suggest opening a support ticket with VMware. I am pretty sure the first RAID-0 object is affected by some bug that could eventually be cleared out. I don't believe 5 hard drives failed at once...

omidkosari
Contributor

Yes, you found what I was trying to say.

Also, other VMs that use those hard drives do not have any problems.

Unfortunately we don't have a support contract, so we cannot open a ticket. Is there any other way?

joergriether
Hot Shot

You *need* to have a service contract, at least when running critical VMs on it.

Now, IF I had this in a test lab (and ONLY in a test lab), where everything was purely for fun and play, only THEN would I EVENTUALLY shut down all the running VMs residing on the VSAN and, after that, shut down all hosts participating in the VSAN cluster. Then I'd boot them up again and check whether the error persists. If it persisted, I would EVENTUALLY think about disabling VSAN and re-enabling it.

But this is only trial and error. This is NO advice to you. It is all hypothetical. So don't do this!

I'd *really* suggest getting a fresh contract or renewing yours, and then opening an official case with VMware.

ramakrishnak
VMware Employee

Yes, correct.

Sorry, you misunderstood what I meant.

What I am saying is that in this case there were two failures, leaving both copies unusable. We don't handle double faults, and the VM cannot be recovered in such cases.

Now, I agree with you on the other front: we need to know why and how this condition occurred in the first place, and if there is a defect here, we need to address it.

Thanks,
