VMware Cloud Community
dfosbenner
Enthusiast

VMs very slow or won't start after applying March 2017 ESXi patch to 6.0

I applied the patch for CVE-2017-4903 / ESXi600-201703401-SG to some ESXi 6.0 hosts today.  I got 4 hosts done with no issues, but then I hit trouble.  The patch installed (using VUM) and the reboot was clean.  However, after 20-30 minutes the VMs were still trying to start and appeared stuck.  Forcing a host reboot didn't help.  Finally another forced reboot, and Shift+R to revert to the previous build, which is Update 3.

There were no warnings or errors in vCenter.  Looking at real-time stats, CPU, memory, and disk were high but not off the charts.  I've never had an issue like this before and wouldn't know where to start.
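
In case it helps anyone comparing builds, here's a rough sketch (vCenter address and credentials are placeholders, not my real values) of pulling the version and build number each host reports through vCenter with pyVmomi, so I can confirm which hosts actually took the patch and which are still on the Update 3 build:

```python
# Rough sketch: report the ESXi version and build of every host in vCenter.
# Connection details below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()

# Walk every HostSystem in the inventory and print its product info.
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    if host.config is None:        # skip disconnected/unresponsive hosts
        continue
    product = host.config.product  # AboutInfo: fullName includes the build number
    print(f"{host.name}: {product.fullName} (build {product.build})")

view.Destroy()
Disconnect(si)
```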

Anyone else have problems like this?

galbitz_cv
Contributor

I had similar issues with ESXi 6.5 updates in March, running on Dell R730 servers with Intel 10Gb network cards. Mind sharing what hardware you have?

dfosbenner
Enthusiast

Dell PowerEdge T430

Dual Intel 2.6 GHz 10-core E5-2660 processors

128 GB RAM

Broadcom DP 5720 (onboard)

Broadcom QP 5719 (PCI)

600GB 10K SAS drives

PERC H730P

Only 3 Windows VMs, very underutilized at present.  The host is only 2 weeks old, and all Dell firmware updates were done 2 weeks ago.

Eric_Allione
Enthusiast

This happened to me once on Dell PowerEdge hypervisors, but they were a very old and slow model (R510). Out of many ESXi updates, it was just that one time it got hung up, and it happened while loading after an update; it took too long (I'm guessing) and something crashed. I simply tried again, and it was fine then and thereafter. Did you try again?

With software in general it's not uncommon for something to go wrong with installs and updates, in which case a second or third retry might be needed. However, my issue was on ESXi 5.5 which was much more prone to breaking during installs and upgrades (with vCenter at least).

dfosbenner
Enthusiast

I didn't retry yet.  I have a small maintenance window and needed to get things back online.  I deployed the patch successfully on 5 hosts and have 3 to go.  2 of them are the same hardware on which I had the issue, but the T430 has been around for a while and there are a lot of them out there, so if it were a hardware issue I think there'd be others talking about it.

I guess I'll give it another try on the weekend.

Thank you for the replies.

-Dave

learningvms
Contributor

I have the same problem on my whitebox ZBOX CI323 (N3150, not on the HCL).  However, I've had the issue with Update 3 itself.  The bare-metal box became extremely slow and nearly unresponsive, and over an SSH session I can see it behaving the same way; not ideal at all.

I ended up rolling the patch back to Update 2 (which is where I was originally).

Supi_du
Contributor

@dfosbenner

Dell released a new BIOS 2.4.2 and firmware 25.5.2.0001 for your RAID controller recently.

Also, there were a lot of firmware updates for hard drives.

Have you checked these updates?

We have 3 R430s with an additional Intel I350 NIC.  No problems with the latest ESXi-6.0.0-20170304001-standard (Build 5224934).

(We have a SAN, no RAID controller, just an SD card for ESXi.)

dfosbenner
Enthusiast

@supi_du

Thank you.  I have the BIOS and drives up to date, but the PERC firmware looks new.  I'll have to give that a try.

dfosbenner
Enthusiast

Well, it may be time for me to get some support.  I ran the Dell Lifecycle Controller and all firmware is current.  I rebooted without reapplying the patch, and what I saw was just like last night.  The host was in maintenance mode, so I took it out, and the VMs were stopped.  If I start more than one VM at a time, that's when things go south.  I pretty much have to let one VM fully boot and make sure I can connect to it before starting another.  Otherwise everything crawls, and the VMs I'm trying to start seem to power off and on again.

The host is very underutilized, so this makes no sense.  I have a virtually identical server with a lot more VMs and load, and they all start up almost simultaneously.

I don't see any performance issues on the charts of the problem host.  Something is definitely wacky.
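
In the meantime, here's a rough sketch of how I'm staggering the startups with pyVmomi instead of clicking through them one by one (vCenter address, credentials, and the VM names are placeholders): it powers on one VM, waits until VMware Tools reports running in the guest, then moves on to the next.

```python
# Rough sketch: power on VMs one at a time, waiting for VMware Tools in each
# guest before starting the next. All names and credentials are placeholders.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()

vms_to_start = ["vm-one", "vm-two", "vm-three"]  # placeholder VM names, in boot order

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
by_name = {vm.name: vm for vm in view.view}

for name in vms_to_start:
    vm = by_name[name]
    if vm.runtime.powerState != vim.VirtualMachinePowerState.poweredOn:
        vm.PowerOnVM_Task()
    # Wait for the guest to report VMware Tools running before the next power-on.
    while vm.guest.toolsRunningStatus != "guestToolsRunning":
        time.sleep(10)
    print(f"{name} is up")

view.Destroy()
Disconnect(si)
```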

Eric_Allione
Enthusiast

At the very least, to gather solid troubleshooting data (and to get the job done quicker), you might want to vMotion the VMs off the host in question and confirm that they don't have the issue elsewhere.

I'm not sure how many hosts you have, but if you can afford to move all the critical VMs off of it, and perhaps apply some host anti-affinity rules, then you could create some test VMs with minimal configurations and see if they have similar issues. That way you wouldn't be putting any critical applications at risk. This might help you zero in on a particular hardware fault such as RAM, CPU, or motherboard. Also, if you vMotion the critical VMs off of it, you could power off the host (even during production hours) and run hardware diagnostics on it from the boot menu.
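
If it helps, a rough pyVmomi sketch like this (vCenter address, credentials, and the host's DNS name are placeholders) will list exactly what is registered on the suspect host and its power state, so you know what would have to be evacuated before taking it down for diagnostics:

```python
# Rough sketch: list the VMs on one host and their power state before
# evacuating it for diagnostics. All names and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)

# Look the host up directly by its DNS name instead of walking the inventory.
host = si.content.searchIndex.FindByDnsName(dnsName="suspect-host.example.com",
                                            vmSearch=False)
if host is not None:
    for vm in host.vm:
        print(f"{vm.name}: {vm.runtime.powerState}")

Disconnect(si)
```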

dfosbenner
Enthusiast

This is a new host; I actually just migrated the VMs onto it.  They were running on an old PowerEdge T410 with much less hardware, and I never had performance issues there.  That box was running ESXi 5.5 because it wasn't certified for 6.

Unfortunately I don't have a SAN, so juggling stuff around isn't so easy.

dfosbenner
Enthusiast

I'm making some headway.  For about a 5-minute period after ESXi startup, the connection to the datastores is lost repeatedly.  Here's a snippet from the host event log.  I have 4 RAID-1 drive sets, each one a datastore, named NY-ESXi4-600GB-1, NY-ESXi4-600GB-2, NY-ESXi4-600GB-3, and NY-ESXi4-600GB-4.  It's looking like a hardware problem.

Successfully restored access to volume 58cfca26-426ed559-bb2f-000af7aa8b00 (NY-ESXi4-600GB-2) following connectivity issues. Information 4/4/2017 7:42:53 PM  ny-esxi4.easternalloys.com

Successfully restored access to volume 58cfca7a-faf88262-d358-000af7aa8b00 (NY-ESXi4-600GB-4) following connectivity issues. Information 4/4/2017 7:42:53 PM  ny-esxi4.easternalloys.com

Lost access to volume 58cfca26-426ed559-bb2f-000af7aa8b00 (NY-ESXi4-600GB-2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Information 4/4/2017 7:42:51 PM  ny-esxi4.easternalloys.com

Lost access to volume 58cfca7a-faf88262-d358-000af7aa8b00 (NY-ESXi4-600GB-4) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Information 4/4/2017 7:42:51 PM  ny-esxi4.easternalloys.com

Successfully restored access to volume 58cfca26-426ed559-bb2f-000af7aa8b00 (NY-ESXi4-600GB-2) following connectivity issues. Information 4/4/2017 7:42:39 PM  ny-esxi4.easternalloys.com

Successfully restored access to volume 58cfca7a-faf88262-d358-000af7aa8b00 (NY-ESXi4-600GB-4) following connectivity issues. Information 4/4/2017 7:42:39 PM  ny-esxi4.easternalloys.com

Lost access to volume 58cfc9ba-cccd2854-9c3a-000af7aa8b00 (NY-ESXi4-600GB-1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Information 4/4/2017 7:42:37 PM  ny-esxi4.easternalloys.com

Lost access to volume 58cfca26-426ed559-bb2f-000af7aa8b00 (NY-ESXi4-600GB-2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Information 4/4/2017 7:42:37 PM  ny-esxi4.easternalloys.com

Lost access to volume 58cfca69-1b7b83a7-9d1a-000af7aa8b00 (NY-ESXi4-600GB-3) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Information 4/4/2017 7:42:37 PM  ny-esxi4.easternalloys.com
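
To quantify how often each datastore is dropping, a quick sketch like this (the export file name is a placeholder) tallies the lost/restored events per volume from an exported copy of the host event log:

```python
# Rough sketch: count "Lost access" / "Successfully restored access" events
# per datastore in an exported host event log. "host-events.txt" is a
# placeholder for wherever the export is saved.
import re
from collections import Counter

lost = Counter()
restored = Counter()

# Event lines look like:
#   Lost access to volume <uuid> (NY-ESXi4-600GB-2) due to connectivity issues ...
#   Successfully restored access to volume <uuid> (NY-ESXi4-600GB-2) following ...
pattern = re.compile(r"(Lost access|Successfully restored access) to volume \S+ \(([^)]+)\)")

with open("host-events.txt") as f:
    for line in f:
        match = pattern.search(line)
        if not match:
            continue
        kind, datastore = match.groups()
        (lost if kind == "Lost access" else restored)[datastore] += 1

for ds in sorted(set(lost) | set(restored)):
    print(f"{ds}: lost {lost[ds]} time(s), restored {restored[ds]} time(s)")
```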

dfosbenner
Enthusiast

Thank you all for your help with this issue; it is resolved.  Dell had an updated driver for the PERC H730P controller, lsi_mr3, which was the culprit.  I'm all patched and online, whew.
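
In case it helps anyone else hitting this, here's a rough sketch (host name and credentials are placeholders) of checking over SSH which driver the controller is bound to and which lsi_mr3 VIB version is installed, before and after applying Dell's update:

```python
# Rough sketch: check which driver the RAID controller is using and which
# lsi_mr3 VIB is installed on an ESXi host. Host name and credentials are
# placeholders.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("esxi-host.example.com", username="root", password="********")

# Lists each HBA and the driver it is bound to (the H730P in this thread uses lsi_mr3).
_, out, _ = client.exec_command("esxcli storage core adapter list")
print(out.read().decode())

# Lists the installed lsi-mr3 VIB and its version, to confirm the updated driver took.
_, out, _ = client.exec_command("esxcli software vib list | grep -i lsi")
print(out.read().decode())

client.close()
```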
