Solved: Re: ESX 4 U1 Hosts Lock with SD 0:0:0:0 still retr...

btallon · ‎07-29-2010

We just upgraded from ESX 3.5 to ESX 4 U1 a few weeks back. We are running on an IBM BladeCenter S with HS21-7995 blades and using the SAS RAID Module for storage.

On two occasions all the ESX hosts stopped responding and the VMs went offline. The console of each host displayed: SD 0:0:0:0 still retrying after 360s.

After power cycling the Blades and BCS Chassis, everything came back online.

Each host has a 20GB boot volume on the SAS RAID module that is mapped to the blade.

This install was stable for over a year on 3.5 and started this behavior after the upgrade. After the first time this occurred, I reloaded all the hosts with a fresh install of ESX 4 U1 after wiping the volume. We experienced this again yesterday morning.

I will likely open a support case with VMware, and I am already talking to IBM as well. Any thoughts or suggestions are appreciated.

DSTAVERT · ‎07-29-2010

Logs will be the place to start. Unless it is a known issue to IBM and/or VMware, you will need to collect logs from the time of the problem. Do you have all the firmware up to date?

-- David -- VMware Communities Moderator

golddiggie · ‎07-29-2010

I would try update 2 first, as well as all the additional firmware/updates after it (prior to 4.1)... You could also go to ESX 4.1 and see if the issue remains. But, that would also mean you'd need to update the vCenter server (if you're running one) and a bit more work.

The issue could be caused by something wrong with the storage array you're using... Have you tried placing local drives into the blades and installing ESX 4 (update 2) there? That would eliminate any issues from the storage array bringing the host(s) down.

Personally, I'm not a fan of the older blade hardware (especially IBM blades)... The new/current models might be better, but I've yet to work with those. You could (most likely) replace the blade center chassis with current 1U or 2U servers, taking up the same total rack units, and end up spending less money, have more power, and use less electricity (and need less cooling too).

Network Administrator

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

f10 · ‎07-29-2010

Hi,

I had a similar issue and figured out that one of the virtual machines was accessing the physical CD-ROM. Check the configuration of the VMs and ensure that none of them are connected to the physical CD-ROM. Since this is a BladeCenter, also ensure that the shared CD-ROM is working fine.

These two steps helped me resolve my issue. I hope this helps.

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

f10

VCP3,VCP4,HP UX CSA

Regards, Arun Pandey VCP 3,4,5 | VCAP-DCA | NCDA | HPUX-CSA | http://highoncloud.blogspot.in/
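
For anyone who wants to script the CD-ROM check suggested above rather than click through each VM, here is a minimal sketch that scans the text of a .vmx file. It assumes that a virtual CD-ROM mapped to the host drive shows up with a `deviceType` of `atapi-cdrom` or `cdrom-raw`, and that a Client Device mapping carries `clientDevice = "TRUE"`. Those key names and values, the function names, and the sample file contents are assumptions for illustration, so check them against a real .vmx from your own hosts before relying on the result.

```python
import re

# Device types assumed to point a virtual CD-ROM at the host's physical drive.
# "cdrom-image" (an ISO file) is deliberately not in this set.
PHYSICAL_TYPES = {"atapi-cdrom", "cdrom-raw"}

# Matches lines of the form: key = "value"
LINE_RE = re.compile(r'^\s*([\w:.]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse key = "value" lines from .vmx text into a dict (keys lowercased)."""
    settings = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def physical_cdroms(text):
    """Return device prefixes (e.g. 'ide1:0') that look mapped to a host CD-ROM."""
    settings = parse_vmx(text)
    found = []
    for key, value in settings.items():
        if not key.endswith(".devicetype") or value.lower() not in PHYSICAL_TYPES:
            continue
        prefix = key[: -len(".devicetype")]
        present = settings.get(prefix + ".present", "FALSE").upper() == "TRUE"
        client = settings.get(prefix + ".clientdevice", "FALSE").upper() == "TRUE"
        if present and not client:
            found.append(prefix)
    return sorted(found)


# Invented sample contents, not taken from the poster's system.
SAMPLE_VMX = '''
ide1:0.present = "TRUE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.fileName = "/dev/cdrom"
ide1:0.startConnected = "TRUE"
ide0:0.present = "TRUE"
ide0:0.deviceType = "cdrom-image"
ide0:0.fileName = "/vmfs/volumes/datastore1/iso/install.iso"
'''

if __name__ == "__main__":
    print(physical_cdroms(SAMPLE_VMX))
```

Running `physical_cdroms` over the text of each VM's .vmx file would give a quick list of devices to switch to Client Device.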

btallon · ‎07-29-2010

Thanks for the quick replies, folks.

Unfortunately, this is my customer's system and they just purchased it last year (it was not my recommendation; I would have gone with servers and a SAN).

These HS21 blades do not have local drive bays on them, so that is not an option (I think they can get solid state drives for them, but I doubt the customer would be able to get them).

I did generate the diagnostic log bundle prior to reloading the hosts the first time this happened, and I am doing so for this incident as well.

Firmware could likely use an update. We did a full FW update last year at the time of install. I am sure that will be the first angle of attack for IBM support as well.

Again, thanks for the info.

golddiggie · ‎07-29-2010

The HS21 blades I've seen have internal drive connections (you have to fully remove the blade and open it up to get to them). Another knock against blades is if they don't allow you to (ever) install hard drives into them.

Did they get the configuration brand new?

If there are no VMs using the host(s) optical drive, and all the other VM settings are kosher, and you have either Update Manager or the Host Update Utility installed, check for host updates and at least get it up to the final version of ESX 4.0... If updates haven't been applied to the hosts in a year, then I would really start there. Otherwise, I think it really is a hardware issue that IBM will need to figure out/fix...

btallon · ‎07-29-2010

You are correct, the blades have an internal connection for solid state drives.

This was purchased new from IBM last year, and it's been running well up until now. I am fairly convinced that it is a HW or FW issue with the SAS RAID Module or Disk Storage Modules, because all the host blades lock up at the same time. There is another HS12 blade in the system that has its OS on local drives, and it stays online when this occurs.

I did verify that all the VMs are set to Client Device for CD-ROM; two of the hosts had VMs connected to the host CD-ROM.

golddiggie · ‎07-29-2010

The drive connection(s) should be standard SAS/SATA type connections, allowing you to install either a drive with platters, or SSD. Since SSD drives (2.5") use the exact same connections, I cannot see the hardware you're working on locking you to just SSD. They might have preferred you use SSD, or sold that blade with SSD drives as an option, but any 2.5" SAS drive should also fit. Unless it's set up to only accept 1.8" drives, where you'll be boned.

I really would get the hosts up to the latest/last build level for ESX 4.0 to see if that helps out. You can do one host and see how it behaves before applying to the second host. If you have Update Manager available, you can even keep the old configuration around for a time, before committing to the update.

DSTAVERT · ‎07-29-2010

I will guess that there have been a few updates to the firmware for both the blades and bladecenter over the past year.

A great tool for checking for things like the existence of snapshots or attached CDs is RVTools: http://www.robware.net/

btallon · ‎07-29-2010

I am sure that is the case; a FW update and the upgrade to U2 are my first steps. New local drives are not an option, unfortunately.

DSTAVERT · ‎07-29-2010

I wouldn't worry at all about local drives.

btallon · ‎07-29-2010

One additional thing I have noticed in the event logs of the VM that is running vCenter, at the time the issue began:

SYSTEM LOG

EVENT ID 11 – SOURCE SYMMPI

The device \Device\SCSI\SYMMPI is not ready for access yet

Followed by

EVENT ID 11 – SOURCE DISK

The driver detected a controller error on \Device\HardDisk0

These repeat about once a minute for several hours, up to the point that the hosts became unresponsive. None of the other VMs have this error in their logs.
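
To put a number on "about once a minute", a short sketch like the following could be run over an export of the System log. The input format (timestamp string, source, event ID) and the function name are assumptions, and the sample rows are invented for illustration; they are not the actual entries from this system.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def repeat_intervals(events):
    """Median seconds between consecutive occurrences, per (source, event ID).

    events: iterable of (timestamp_str, source, event_id) tuples, with
    timestamps formatted as YYYY-MM-DD HH:MM:SS (an assumed export format).
    """
    times = defaultdict(list)
    for stamp, source, event_id in events:
        parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        times[(source.upper(), int(event_id))].append(parsed)

    result = {}
    for key, stamps in times.items():
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if gaps:  # a single occurrence has no interval to report
            result[key] = median(gaps)
    return result


# Invented sample rows: Event ID 11 from SYMMPI followed by Event ID 11 from DISK.
SAMPLE_EVENTS = [
    ("2010-01-01 02:00:05", "symmpi", 11),
    ("2010-01-01 02:00:06", "disk", 11),
    ("2010-01-01 02:01:05", "symmpi", 11),
    ("2010-01-01 02:01:07", "disk", 11),
    ("2010-01-01 02:02:06", "symmpi", 11),
    ("2010-01-01 02:02:07", "disk", 11),
]

if __name__ == "__main__":
    for (source, event_id), seconds in sorted(repeat_intervals(SAMPLE_EVENTS).items()):
        print(f"{source} event {event_id}: median gap {seconds:.1f}s")
```

The first timestamp in each group also gives the moment the errors began, which is useful to line up against the host and storage module logs.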

I might build a new vCenter VM as well while I am at it.

golddiggie · ‎07-29-2010

How is the storage array connected to the blades/ESX hosts? If it's using a fiber connection, I'm wondering if there's a firmware update either for the fiber switch, storage array, or any of the interconnects... It could also be an issue addressed either in update 2, or one of the post update 2 ESX firmware updates... Would it be possible for you to install ESXi 4.1 onto a USB flash drive and boot a host from that? Just to see if updates there resolve the issue. Otherwise, you might want to test out ESXi 4.1 (if you don't require ESX) to see if that helps... If you want to continue to use vCenter, you'll need to update that to 4.1 as well (or build the new one with 4.1)... You shouldn't have any issue managing both ESX 4 U1 and ESX 4.1 hosts with vCenter 4.1...

DSTAVERT · ‎07-29-2010

http://kb.vmware.com/kb/1005204

btallon · ‎07-29-2010

That’s weird, they only have two data stores, each on a single LUN.

I am in the process of building a new 4.0 U2 vCenter and upgrading the hosts to 4.0 U2 at the moment.

btallon · ‎07-29-2010

It's a SAS RAID module that plugs into the back of the BCS. Each of the servers has an expansion card that, when installed, allows it to see the configured volumes as local logical drives (this is done by mapping them to a blade in the module config).

Unfortunately the USB access for these blades is provided by the switchable media tray component and not directly attached.

DSTAVERT · ‎07-29-2010

It was a reference to the event IDs and not necessarily your issue. Storage is still something to suspect. I would still look through the logs.

Blindly reinstalling things isn't going to identify the problem and may in fact mask it. If you haven't already, I would file an SR with VMware, and perhaps with your storage vendor.

btallon · ‎07-29-2010

OK, I appreciate all the info and advice!

sandtb · ‎08-01-2010

We are seeing this issue with a new Cisco UCS blade server deployment. We are running a new deployment of 4.1 and have seen this issue happen 3 times in the last week on 3 out of 4 servers. It appears that one server locks the LUN and then the other hosts cannot access it. The ESX host locks with the error and is never responsive for login once this happens. Our only option is to power cycle the server.

Storage environment has been the same for over a year without issue - we have used this storage platform with 3.5 and 4.0 with both FC and iSCSI connectivity.

SandyB · ‎08-04-2010

We have just had this happen to a production ESX 4 U1 server running on a Dell R710 attached to an EMC CX3-80 SAN. All VMs and the ESX host show as disconnected in vCenter (also 4 U1), and on the ESX console I get "sd 0:0:0:0: still retrying 0 after 360s".

The production VMs on the ESX host are still running and accessible via RDP, however I can vMotion them off. I don't want to force a reboot of the ESX host, as this could cause data loss.

Any ideas?
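
When going through collected logs or console captures for this, a small filter can pull out every occurrence of the retry message along with the device address and the elapsed time. The sketch below accepts both spellings quoted in this thread. The reading of the four numbers as host:channel:target:LUN follows the usual Linux SCSI convention and is an assumption here, as are the function name and the sample lines.

```python
import re

# Accepts both forms quoted in the thread:
#   "sd 0:0:0:0: still retrying 0 after 360s"
#   "SD 0:0:0:0 still retrying after 360s"
RETRY_RE = re.compile(
    r"sd\s+(\d+):(\d+):(\d+):(\d+):?\s+still retrying(?:\s+\d+)?\s+after\s+(\d+)s",
    re.IGNORECASE,
)


def find_retry_messages(lines):
    """Return (line_number, (host, channel, target, lun), seconds) per match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = RETRY_RE.search(line)
        if match:
            host, channel, target, lun, seconds = (int(g) for g in match.groups())
            hits.append((number, (host, channel, target, lun), seconds))
    return hits


# Sample input built from the two message forms quoted in the thread.
SAMPLE_LINES = [
    "sd 0:0:0:0: still retrying 0 after 360s",
    "some unrelated console line",
    "SD 0:0:0:0 still retrying after 360s",
]

if __name__ == "__main__":
    for number, address, seconds in find_retry_messages(SAMPLE_LINES):
        print(f"line {number}: device {address} retrying for {seconds}s")
```

Seeing the same address on every host at the same time would point toward the shared storage path rather than any single server.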
