VMware Cloud Community
bellocarico
Enthusiast

ESXi 6.5 stuck on Starting service vmtoolsd

As per the title, my ESXi host is not loading any more. I have been playing around with the Adaptec drivers because I recently got a few Adaptec-related purple screens.

However, I rebooted a few times (upgrading/downgrading the drivers, etc.) before this happened:

ScreenShot005.png

And no, the web client doesn't work. I can ping and even log into the box via SSH, but the VMs are not booting up. If I remember correctly, the host was in maintenance mode the last time I rebooted it.

I went as far as reinstalling (upgrading) ESXi with the "old" Adaptec drivers+CIM+arcconf, as that used to work, but even after the upgrade it still gets stuck here.

I'm lost. Any help please?

Thanks!

16 Replies
a_p_
Leadership

Please press Alt-F12 to switch to the vmkernel log. Maybe this will give you a hint about what's causing the issue.


André

vijayrana968
Virtuoso

Can you hit Alt-F1 to see what is going on in the back end, and post a screenshot of that screen?

bellocarico
Enthusiast

F1 gives

ScreenShot007.png

F12 gives

ScreenShot009.png

As far as I can see, it goes into a loop until I get the next purple screen:

ScreenShot006.png

It seems to be an Adaptec driver issue as far as I can tell. One additional thing: one array is degraded (I'm dealing with the vendor to have the faulty disk replaced).

However, this doesn't explain how this error happened both before and after the upgrade (with the old drivers).

vijayrana968
Virtuoso

As you can see in the third-to-last line, the issue is with arc-cim-provider. Which Adaptec controller model is in use, and which VIB version is installed? Switch to DCUI and check with esxcli software vib list | grep arcconf

What is the make and model of the system?
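
The grep above matches only arcconf. A minimal sketch that checks all three Adaptec packages named in this thread in one pass; `filter_adaptec_vibs` is a hypothetical helper name, and the pattern is an assumption based on the VIB names seen here:

```shell
#!/bin/sh
# Sketch: keep only Adaptec-related lines from a VIB listing.
# Reads the text output of `esxcli software vib list` on stdin.
# The three names are the ones mentioned in this thread; extend as needed.
filter_adaptec_vibs() {
    grep -i -E 'aacraid|arcconf|arc-cim-provider'
}
```

On the host this would be used as `esxcli software vib list | filter_adaptec_vibs`.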

bellocarico
Enthusiast

It's a home server built with parts from eBay; however, it has been working non-stop for 4 years now (Tyan motherboard and ECC RAM).

Incidentally, I have a similar server with a slightly different motherboard but an identical controller. I'm now checking which driver/CIM/arcconf versions are on that server to make sure the same versions are used on the troublesome one.

This is what the working server reports:

ScreenShot010.png

And this is my custom build:

Using Imageprofile ESXi-6.5.0-20181004002-standard ...

(dated 10/10/2018 09:59:40, AcceptanceLevel: PartnerSupported,

Updates ESXi 6.5 Image Profile-ESXi-6.5.0-20181004002-standard)

Loading Offline bundles and VIB files from .\vibs_old ...

   Loading C:\Users\Fefo\Desktop\vmware\vibs_old\Adaptec_Inc_bootbank_scsi-aacraid_6.0.6.2.1.52011-1OEM.600.0.0.2494585.vib ... [OK]

      Add VIB scsi-aacraid 6.0.6.2.1.52011-1OEM.600.0.0.2494585 [OK, replaced 1.1.5.1-9vmw.650.0.0.4564106]

   Loading C:\Users\Fefo\Desktop\vmware\vibs_old\vmware-esx-provider-arc-cim-provider.vib ... [OK]

      Add VIB arc-cim-provider 1.08-21375 [OK, added]

   Loading C:\Users\Fefo\Desktop\vmware\vibs_old\vmware-esx-provider-arcconf.vib ... [OK]

      Add VIB arcconf 1.08-21375 [OK, added]

   Loading C:\Users\Fefo\Desktop\vmware\vibs_old\vmware-esxi-drivers-scsi-aacraid-600.6.2.1.52011.-1.0.6.2494585.x86_64.vib ... [OK]

      Add VIB scsi-aacraid 6.0.6.2.1.52011-1OEM.600.0.0.2494585 [IGNORED, already added]

Exporting the Imageprofile to 'C:\Users\Fefo\Desktop\vmware\ESXi-6.5.0-20181004002-standard-customized.iso'. Please be patient ...


bellocarico
Enthusiast

This is odd, the problem still persists and the command suggested:

esxcli software vib list | grep arcconf

gives no output. Also if I reduce the command to:

[root@blue:~] esxcli software vib list

Connection failed

?-(
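
A note on "Connection failed": esxcli sends its requests through hostd, so this message usually means hostd is not answering. ESXi also ships localcli, which accepts the same namespaces but bypasses hostd (VMware positions it as a troubleshooting tool). A minimal sketch, assuming both commands are on the PATH; `list_vibs` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: list installed VIBs, falling back to localcli when esxcli
# cannot reach hostd (e.g. it prints "Connection failed" and exits non-zero).
# Assumption: localcli is available, as it is on a stock ESXi install.
list_vibs() {
    esxcli software vib list 2>/dev/null || localcli software vib list
}
```

Calling `list_vibs | grep -i arcconf` would then still produce a listing while hostd is hung.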

vijayrana968
Virtuoso

I see. Can you check /etc/init.d/hostd status? Start it if required.

bellocarico
Enthusiast

[root@blue:~] /etc/init.d/hostd status

hostd is running.

I just attempted two different types of upgrade:

1) Standard ESXi ISO -> it doesn't see the Adaptec RAID, as expected

2) Custom ISO with only the Adaptec drivers added (no CIM provider or arcconf)

The second one returned to the usual issue:

ScreenShot011.png

Gosh I'm lost.... I guess next would be to try to downgrade the Firmware... but why does it work on the other server?

vijayrana968
Virtuoso

If ESXi comes up with the standard image installation, you can try installing the Adaptec VIBs afterwards, on top of the ESXi installation, and see how it behaves:

vmware-esx-provider-arcconf.vib

vmware-esx-provider-arc-cim-provider.vib
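
A minimal sketch of that post-install step, assuming the VIB files have been copied to a directory on the host. It only prints one `esxcli software vib install -v` command per file so they can be reviewed before running; `print_vib_installs` is a hypothetical helper name, and esxcli expects an absolute path after `-v`:

```shell
#!/bin/sh
# Sketch: print one install command per .vib file found in a directory.
# Nothing is installed here; the commands are only written to stdout.
# Assumption: $1 is an absolute path, since esxcli requires one for -v.
print_vib_installs() {
    dir=$1
    for vib in "$dir"/*.vib; do
        # Skip the unexpanded pattern when the directory holds no VIBs.
        [ -e "$vib" ] || continue
        printf 'esxcli software vib install -v %s\n' "$vib"
    done
}
```

For example, `print_vib_installs /vmfs/volumes/datastore1/vibs` (the datastore path is an assumption) lists the commands, which can then be run one at a time while watching for the hang.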

gregsn
Enthusiast

This problem could potentially be caused by VT-d being enabled in the BIOS with Adaptec controllers.  Try disabling this first (I believe this issue was fixed in newer driver versions, but it wouldn't hurt to rule it out if you don't require hardware pass-through).

If you are still getting the same issue with only the driver installed (not vmware-esx-provider-arcconf.vib or vmware-esx-provider-arc-cim-provider.vib as they are not required for accessing the data, only management of the controller / array), then you could potentially be having a more serious problem with the array.

Controller hangs / continuously scrolling "Host adapter abort requests" can also occur if the array contains a "badly behaving drive" when ESXi attempts to load the data store.  By "badly behaving drive", I mean specifically the drive is not responding to commands from the controller and the controller is (for whatever reason) unable to drop the drive from the array.
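
To gauge how often those aborts occur, the vmkernel log can be searched from an SSH session. A minimal sketch; `count_abort_requests` is a hypothetical helper name, and the search string is taken from the wording above, so the exact driver message may differ between aacraid versions:

```shell
#!/bin/sh
# Sketch: count aacraid abort messages in vmkernel log text read from stdin.
# The match is case-insensitive and prints 0 when nothing is found.
count_abort_requests() {
    # grep -c exits non-zero on zero matches; `|| true` keeps the status clean.
    grep -c -i 'host adapter abort request' || true
}
```

On the host this would be run as `count_abort_requests < /var/log/vmkernel.log`; a count that keeps climbing between runs suggests the controller is still retrying against an unresponsive drive.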

If you are using a 5-series or 6-series controller without a SAS expander, you can enable hard drive activity lights from within the controller's BIOS settings during boot up of the server (CTRL-A).  It might be possible to identify the "badly behaving drive" by observing the hard drive activity lights (I believe they are green LEDs on the controller).  The bad drive will sometimes illuminate its activity light continuously while the others will flash as the drives are accessed.

If you have a 7-series or newer controller, I believe Adaptec had removed the activity LEDs from these controllers, so you would not be able to identify a drive in this fashion.  You would need some sort of HDD backplane which would illuminate a drive activity LED when the drive is being accessed by the controller.

If this is the case *AND* if you have some remaining redundancy within your array (e.g. a RAID-6 array with one failed disk and *ALL* remaining disks working flawlessly with no undiscovered bad blocks), you could try disconnecting the "badly behaving" drive.  *** USE CAUTION *** as if this does not solve the problem, you cannot simply re-add the disk to restore the previous level of redundancy without performing a rebuild.  If you have multiple failed/failing disks, you could experience a total loss of your array data by doing this.  Proceed at your own risk.

If the above is not the case, you could try booting the server using an Adaptec-supported Linux distro or Windows and see if the OS reports similar abort commands.  Linux might provide more detailed information as to what is going on with the controller/array.  If you are successful in booting Linux or Windows without having the system lock up, you could attempt to download the controller's firmware logs by creating a "support archive" and reviewing those logs.  The "support archive" logs often have more detail as to what is happening within the controller/array.  ***USE CAUTION*** to NOT write any data to the array while in Windows or Linux as it could permanently damage the VMFS filesystem. Proceed at your own risk.

*** Additional Caution ***

If you have a 5-series or 6-series controller in RAID-6 with >2TB disks, DO NOT attempt to rebuild the array.  There is an undocumented firmware bug in these controllers that *WILL* corrupt your data if you attempt a RAID-6 rebuild with >2TB disks while there is background write activity happening on the array (such as VMs running).  The issue may also apply to RAID-5 arrays, but has not been confirmed as far as I'm aware.  The same applies to older firmware versions of 7-series and 8-series controllers, though newer firmware resolved this issue for these controllers.

*** Additional Caution #2 ***

Don't change the firmware of the controller while the array is in a degraded state.  It is recommended to first get the array up to "Optimal" state before changing firmware levels on the controller.

bellocarico
Enthusiast

Thank you everybody for the great support. I'm so impressed to get so much attention on a Saturday!!

So, long story short: it is fixed.

All I had to do was remove the faulty disk from the RAID5 array, and now it boots properly. Note that this disk was part of a data array, not the boot array ESXi uses.

On one side I'm happy things eventually work because, I don't know about you, but after hours of trying to make things work I get bored and lose interest... so when it finally works and I can relax on the sofa, it does come with a sense of achievement 😄

On the other side (the predominant one right now) I just want to say "what the heck!" A RAID controller is made to provide resiliency, not to introduce extra issues. I don't have major concerns if one disk fails now and again, but why the OS would be affected by this, only God knows, and perhaps not even the Adaptec engineers.

Final note: the last installation/upgrade I made was with the Adaptec controller drivers only (no CIM provider or ARCCONF). This is good enough for my needs, as I just need to see the status of the array from Hardware/Storage. However, after a quick check under Manage/Packages, if I filter by Adaptec, this is what I can see:

arc-cim-provider | Adaptec CIM Provider for managing Adaptec RAID controllers on ESXi 5.x | 2.05-22932 | Adaptec | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
arcconf | Adaptec CLI Provider for supporting remote ARCCONF tool to manage Adaptec RAID controllers on ESXi 6.x | 2.05-22932 | Adaptec | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
scsi-aacraid | Adaptec HBA Driver | 6.0.6.2.1.52040-1OEM.600.0.0.2494585 | Adaptec_Inc | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
scsi-adp94xx | Adaptec ADP94xx | 1.0.8.12-6vmw.650.0.0.4564106 | VMW | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
scsi-ips | Adaptec IPS | 7.12.05-4vmw.650.0.0.4564106 | VMW | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)

I'm really puzzled trying to understand where the arc-cim-provider and the arcconf are coming from since this is my custom build process:

PS C:\Users\Fefo\Desktop\vmware> dir vibs3

    Directory: C:\Users\Fefo\Desktop\vmware\vibs3

Mode                LastWriteTime         Length Name

----                -------------         ------ ----

-a----       11/08/2017     04:45          86002 vmware-esxi-drivers-scsi-aacraid-600.6.2.1.52040.-1.0.6.2494585.x86_64.vib

#############################################################################################

PS C:\Users\Fefo\Desktop\vmware> .\ESXi-Customizer-PS-v2.6.0.ps1 -izip .\ESXi650-201810002.zip -pkgDir .\vibs3 -nsc

Security warning

Run only scripts that you trust. While scripts from the internet can be useful, this script can potentially harm your computer. If you trust this script, use the

Unblock-File cmdlet to allow the script to run without this warning message. Do you want to run C:\Users\Fefo\Desktop\vmware\ESXi-Customizer-PS-v2.6.0.ps1?

[D] Do not run  [R] Run once  [S] Suspend  [?] Help (default is "D"): R

This is ESXi-Customizer-PS Version 2.6.0 (visit https://ESXi-Customizer-PS.v-front.de for more information!)

(Call with -help for instructions)

Logging to C:\Users\Fefo\AppData\Local\Temp\ESXi-Customizer-PS-12704.log ...

Running with PowerShell version 5.1 and VMware PowerCLI version 11.0.0.10336080

Adding base Offline bundle .\ESXi650-201810002.zip ... [OK]

Getting Imageprofiles, please wait ... [OK]

Using Imageprofile ESXi-6.5.0-20181004002-standard ...

(dated 10/10/2018 09:59:40, AcceptanceLevel: PartnerSupported,

Updates ESXi 6.5 Image Profile-ESXi-6.5.0-20181004002-standard)

Loading Offline bundles and VIB files from .\vibs3 ...

   Loading C:\Users\Fefo\Desktop\vmware\vibs3\vmware-esxi-drivers-scsi-aacraid-600.6.2.1.52040.-1.0.6.2494585.x86_64.vib ... [OK]

      Add VIB scsi-aacraid 6.0.6.2.1.52040-1OEM.600.0.0.2494585 [OK, replaced 1.1.5.1-9vmw.650.0.0.4564106]

Exporting the Imageprofile to 'C:\Users\Fefo\Desktop\vmware\ESXi-6.5.0-20181004002-standard-customized.iso'. Please be patient ...

All done.

vijayrana968
Virtuoso

Good to hear your issue is fixed. Thanks for sharing details.

Dave_the_Wave
Hot Shot

That's Adaptec.

They add all sorts of BS that you won't see from AMCC/3ware/LSI.

To each their own.

bellocarico
Enthusiast

Out of curiosity, what is the "best" (or should I say least bad) SATA RAID controller as of today on 6.5?

Installation out of the box would be really nice (no custom build needed), along with drivers that actually work.

It seems like I'm asking a lot.

Dave_the_Wave
Hot Shot

I would stay away from Adaptec and all those Chinese software SATA cards on eBay that do nothing more than add ports to the mainboard. They barely work in Windows, let alone for hypervising.

Look at the RAID controllers that ship with HPE, Lenovo, and Dell servers, and you get an idea of what ESXi likes.

vijayrana968
Virtuoso

That’s true.
