VMware Cloud Community
spawnxx
Contributor

Error encountered while restarting virtual machine after taking snapshot

Configuration:

2 x ESX 3.5 U3 in a cluster (Enterprise licensing) + VirtualCenter 2.5 Update 3

18 guests running Win2k3 servers and 3 Oracle servers on RHEL

2 x Dell MD3000i SANs with dual controllers (active/active).

Issue: Last night, during the VCB backup, we lost SAN #1 completely (both controllers at the same time).

We have a few guests with Microsoft RAID 1 (1 disk on SAN 1 and 1 disk on SAN 2). The mirrored guests survived for 40 minutes after SAN #1 crashed.

After checking the log files, I found that ESX shut down the guests because of a problem with the snapshot.

I feel really bad this morning, because I told my boss we would be safe if we lost one SAN, since we mirror across two SANs.

Why does ESX shut down guests after having problems with snapshots? We also received a lot of alarms from VirtualCenter with false alerts (VM not responding).

ESX filled 800 MB of log files in a few minutes with the same message (failed path, trying another one, and so on).

What is the purpose of putting my guests in mirror mode if ESX shuts them down after losing a path during a backup?

10 Replies
spawnxx
Contributor

Is anyone using RAID 1 mirroring at the guest level for HA across 2 SANs?

Texiwill
Leadership

Hello,

The problem is not really with your mirror; it is that the snapshot could possibly be corrupt. You should either patch the VM by hand to remove the snapshot, or try using the various tools to revert to a previous snapshot level. You do not want to use the VIC 'delete', as that is really a 'commit' of the snapshot.

Try reverting from within the VIC first. If that does not work, you will have to patch the .vm* files by hand to remove the bad snapshot. This does mean you will lose any data that was written to the snapshot. This is a non-trivial task, but it is possible.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.

SearchVMware Blog: http://itknowledgeexchange.techtarget.com/virtualization-pro/

Blue Gears Blogs - http://www.itworld.com/ and http://www.networkworld.com/community/haletky

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

spawnxx
Contributor

Thanks Ed. After restarting my 2 ESX hosts I was able to start my guests. My concern is why ESX shuts down guests if you lose a SAN during a VCB backup.

We have 2 SANs with software mirroring at the guest level. The guests survived after we lost the first SAN, but a few minutes later ESX hard powered off all of them.

Check the hostd log below. My problem is easy to reproduce: create a Win2k3 guest with 2 disks (1 disk on SAN 1, 1 disk on SAN 2), activate MS RAID 1, and start a VCB backup. Unplug SAN #1 or SAN #2 and ESX will shut down your guest after a few minutes.

Event 884 : Message on BCQVGORA1P on bcq-vm-02.pac.bcq.noa.alcoa.com in ha-datacenter: Error encountered while saving snapshot file "(null)".

A needed file was not found.Cannot open the disk 'bcqvgora1p-000001.vmdk' or one of the snapshot disks it depends on.

Reason: Input/output error.

Received a duplicate transition from foundry: 1

Received a duplicate transition from foundry: 1

Disconnect check in progress: /vmfs/volumes/48d8f87e-2af619b7-070b-00145e5bda5d/bcqvgora1p/bcqvgora1p.vmx

Question info: Cannot open the disk 'bcqvgora1p-000001.vmdk' or one of the snapshot disks it depends on.

Reason: Input/output error., Id: 1 : Type : 2, Default: 0, Number of options: 1

Received a duplicate transition from foundry: 1

Disconnect check in progress: /vmfs/volumes/48d8f87e-2af619b7-070b-00145e5bda5d/bcqvgora1p/bcqvgora1p.vmx

Failed to find activation record, event user unknown.

Event 885 : Message on BCQVGORA1P on bcq-vm-02.pac.bcq.noa.alcoa.com in ha-datacenter: Cannot open the disk 'bcqvgora1p-000001.vmdk' or one of the snapshot disks it depends on.

Reason: Input/output error.

Received a duplicate transition from foundry: 1

Disconnect check in progress: /vmfs/volumes/48d8f87e-2af619b7-070b-00145e5bda5d/bcqvgora1p/bcqvgora1p.vmx

Question info: Failed to reopen disk '(null)'

, Id: 2 : Type : 2, Default: 0, Number of options: 1

Received a duplicate transition from foundry: 1

Disconnect check in progress: /vmfs/volumes/48d8f87e-2af619b7-070b-00145e5bda5d/bcqvgora1p/bcqvgora1p.vmx

Failed to find activation record, event user unknown.

Event 886 : Message on BCQVGORA1P on bcq-vm-02.pac.bcq.noa.alcoa.com in ha-datacenter: Failed to reopen disk '(null)'

Received a duplicate transition from foundry: 1

Disconnect check in progress: /vmfs/volumes/48d8f87e-2af619b7-070b-00145e5bda5d/bcqvgora1p/bcqvgora1p.vmx

Question info: Error encountered while restarting virtual machine after taking snapshot. The virtual machine will be powered off.
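
With 800 MB of log output, it can help to pull out only the numbered event lines. The following is a minimal Python sketch, assuming the events keep the "Event N : Message on VM on HOST in DATACENTER: text" shape seen in the excerpt above. The function name parse_events and the sample string are illustrative only and are not part of any VMware tool.

```python
import re

# Matches lines shaped like the "Event N : Message on ..." entries quoted above.
EVENT_RE = re.compile(
    r"^Event (?P<id>\d+) : Message on (?P<vm>\S+) on (?P<host>\S+) "
    r"in (?P<datacenter>\S+): (?P<text>.*)$"
)


def parse_events(log_text):
    """Return one dict per 'Event N' line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            event = match.groupdict()
            event["id"] = int(event["id"])
            events.append(event)
    return events


# Sample lines copied from the log excerpt in this post.
SAMPLE_LOG = """\
Event 885 : Message on BCQVGORA1P on bcq-vm-02.pac.bcq.noa.alcoa.com in ha-datacenter: Cannot open the disk 'bcqvgora1p-000001.vmdk' or one of the snapshot disks it depends on.
Reason: Input/output error.
Received a duplicate transition from foundry: 1
Event 886 : Message on BCQVGORA1P on bcq-vm-02.pac.bcq.noa.alcoa.com in ha-datacenter: Failed to reopen disk '(null)'
"""

for event in parse_events(SAMPLE_LOG):
    print(event["id"], event["vm"], event["text"])
```

Lines that do not start with "Event" (the "Reason:" and "Received a duplicate transition" lines) are simply skipped, so the result is a short timeline per VM.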

Texiwill
Leadership

Hello,

Sounds like your mirror did not recover fast enough for ESX. So you have the following:

LUN (mirror)
|- SAN1
|- SAN2

If SAN1 fails, then the mirror should recover from SAN2. Or do you have the following:

VMDK
|- LUN1 VMDK
|- LUN2 VMDK

If it is the latter, the VMDK may be mirrored within the VM, but the delta file is not within a mirror, so when the SAN died it disappeared as well.


Best regards,

Edward L. Haletky

spawnxx
Contributor

We have the following configuration

SAN1

LUN1-VMWARE ( preferred controller 0 )

LUN2-VMWARE ( preferred controller 1 )

SAN2

LUN1B-VMWARE ( preferred controller 0 )

LUN2B-VMWARE ( preferred controller 1 )

SAN2 is only for mirrored guests. So when I create a new guest, I configure 2 disks (e.g. 1 disk on LUN1, 1 disk on LUN1B).

I am also adding this value to the configuration parameters (scsi0.returnBusyOnNoConnectStatus = FALSE). Each ESX host has been configured with these commands:

esxcfg-advcfg -s 0 /Disk/UseDeviceReset

esxcfg-advcfg -s 1 /Disk/UseLunReset
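
To confirm that every mirrored guest really carries the scsi0.returnBusyOnNoConnectStatus parameter, one option is to scan a copy of each guest's configuration text. This is only a sketch in Python, assuming the configuration is made of plain key = value lines with optionally quoted values. The helper names and the sample text are made up for illustration.

```python
def parse_vmx(text):
    """Parse 'key = value' lines into a dict with lower-cased keys."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment-style lines, and anything without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def busy_on_no_connect_disabled(settings, controller="scsi0"):
    """True if <controller>.returnBusyOnNoConnectStatus is set to FALSE."""
    key = f"{controller}.returnBusyOnNoConnectStatus".lower()
    return settings.get(key, "").upper() == "FALSE"


# Hypothetical configuration text, not taken from a real guest.
SAMPLE_VMX = """\
displayName = "example-guest"
scsi0.returnBusyOnNoConnectStatus = "FALSE"
"""

print(busy_on_no_connect_disabled(parse_vmx(SAMPLE_VMX)))
```

Keys are compared case-insensitively so that a differently capitalised entry is still found.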

Texiwill
Leadership

Hello,

This is what you have then....

Software RAID within VM
|-VMDK on SAN1
|-VMDK on SAN2

If so, the software RAID does not apply to the delta (snapshot) file, so when the SAN failed the delta file was lost. Remember that the delta file is external to the VM, so software RAID within a VM does not apply to it.
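
One way to picture the difference is the small Python model below. It is a sketch of one reading of this thread, not a description of ESX internals: the guest mirror keeps running while one member is reachable, whereas a snapshot operation has to open every file in every disk chain. The file names and the placement of the delta files are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskFile:
    name: str  # hypothetical file name
    san: str   # SAN that holds the file


def guest_mirror_survives(mirror_members, failed_san):
    """Guest RAID 1 keeps running while at least one member is reachable."""
    return any(f.san != failed_san for f in mirror_members)


def snapshot_operation_succeeds(chain_files, failed_san):
    """A snapshot create or remove must open every file in every chain."""
    return all(f.san != failed_san for f in chain_files)


# Assumed layout: one base disk per SAN, each with a delta during backup.
disk_a = DiskFile("disk_a.vmdk", "SAN1")
disk_b = DiskFile("disk_b.vmdk", "SAN2")
delta_a = DiskFile("disk_a-000001.vmdk", "SAN1")
delta_b = DiskFile("disk_b-000001.vmdk", "SAN2")

print(guest_mirror_survives([disk_a, disk_b], "SAN1"))  # True
print(snapshot_operation_succeeds([disk_a, delta_a, disk_b, delta_b], "SAN1"))  # False
```

In this model the guest carries on after SAN1 fails, but the backup's snapshot step cannot complete, which matches the "Cannot open the disk ... or one of the snapshot disks it depends on" message followed by the power off.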


Best regards,

Edward L. Haletky

spawnxx
Contributor

Conclusion?

If SAN #1 crashes during a VCB backup, you are dead?

ESX will power off the guests because it cannot "create or delete the VCB snapshot".

Texiwill
Leadership

Hello,

Conclusion?

If SAN #1 crashes during a VCB backup, you are dead?

If you are using software RAID within the VM, then if SAN1 dies the VM will crash, because the backing store file (in this case a delta) is no longer available and there is no way to read or write data.

Software RAID does not apply to things outside the VM. A solution would be to have a device that can do RAID mirroring across SANs.


Best regards,

Edward L. Haletky

spawnxx
Contributor

Thanks Ed. Another option would be for VMware to support software mirroring across LUNs.

Texiwill
Leadership

Hello,

I am not sure that is going to happen, as software RAID is pretty slow. I use it myself, but I would rather see it done in hardware.


Best regards,

Edward L. Haletky
