As per the title, my ESXi host is no longer booting. I have been playing around with the Adaptec drivers because I recently got a few Adaptec-related purple screens.
I rebooted a few times (upgrading/downgrading the drivers, etc.) until this happened:
And no, the web client doesn't work. I can ping the box and even log into it via SSH, but the VMs are not booting. If I remember correctly, the host was in maintenance mode the last time I rebooted it.
I went as far as reinstalling (upgrading) ESXi with the "old" Adaptec drivers + CIM + arcconf, since that combination used to work, but even after the upgrade it still gets stuck at the same point.
I'm lost. Any help please?
Thanks!
Please press Alt-F12 to switch to the vmkernel log. Maybe that will give you a hint about what's causing the issue.
André
Can you hit Alt-F1 to see what is going on in the back end, and post a screenshot of that screen?
Alt-F1 gives:
Alt-F12 gives:
As far as I can see, it goes in a loop until I get the next purple screen.
It seems to be an Adaptec driver issue, as far as I can tell. One additional thing: one array is degraded (I'm dealing with the vendor to have the disk replaced).
However, this doesn't explain why the error happened both before and after the upgrade (with the old drivers).
As you can see on the third-to-last line, the issue is with arc-cim-provider. What Adaptec controller model is in use, and which VIB version is installed? Switch to the DCUI and check with: esxcli software vib list | grep arcconf
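If that grep returns nothing, a broader case-insensitive pattern catches any Adaptec-related package regardless of exact name. A minimal sketch, run against sample output (the sample lines below are illustrative, not from your host):

```shell
# Illustrative sample standing in for real 'esxcli software vib list' output.
sample='scsi-aacraid      6.0.6.2.1.52040-1OEM.600.0.0.2494585  Adaptec_Inc  VMwareCertified
arc-cim-provider  2.05-22932                            Adaptec      PartnerSupported
net-e1000e        1.1.0-1vmw.650.0.0.4564106            VMW          VMwareCertified'

# Case-insensitive match on any Adaptec-related package name or vendor:
printf '%s\n' "$sample" | grep -i -E 'aacraid|arcconf|arc-cim|adaptec'
```

On the host itself you would pipe the real `esxcli software vib list` output into the same grep.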
What is the make and model of the system?
It's a home server built with parts off eBay; however, it has been working non-stop for 4 years now (Tyan motherboard and ECC RAM).
Incidentally, I have a similar server with a slightly different motherboard but an identical controller. I'm now checking which driver/CIM/arcconf versions are on that server to make sure the same versions are used on the troublesome one.
This is what the working server reports:
And this is my custom build:
Using Imageprofile ESXi-6.5.0-20181004002-standard ...
(dated 10/10/2018 09:59:40, AcceptanceLevel: PartnerSupported,
Updates ESXi 6.5 Image Profile-ESXi-6.5.0-20181004002-standard)
Loading Offline bundles and VIB files from .\vibs_old ...
Loading C:\Users\Fefo\Desktop\vmware\vibs_old\Adaptec_Inc_bootbank_scsi-aacraid_6.0.6.2.1.52011-1OEM.600.0.0.2494585.vib ... [OK]
Add VIB scsi-aacraid 6.0.6.2.1.52011-1OEM.600.0.0.2494585 [OK, replaced 1.1.5.1-9vmw.650.0.0.4564106]
Loading C:\Users\Fefo\Desktop\vmware\vibs_old\vmware-esx-provider-arc-cim-provider.vib ... [OK]
Add VIB arc-cim-provider 1.08-21375 [OK, added]
Loading C:\Users\Fefo\Desktop\vmware\vibs_old\vmware-esx-provider-arcconf.vib ... [OK]
Add VIB arcconf 1.08-21375 [OK, added]
Loading C:\Users\Fefo\Desktop\vmware\vibs_old\vmware-esxi-drivers-scsi-aacraid-600.6.2.1.52011.-1.0.6.2494585.x86_64.vib ... [OK]
Add VIB scsi-aacraid 6.0.6.2.1.52011-1OEM.600.0.0.2494585 [IGNORED, already added]
Exporting the Imageprofile to 'C:\Users\Fefo\Desktop\vmware\ESXi-6.5.0-20181004002-standard-customized.iso'. Please be patient ...
This is odd: the problem still persists, and the suggested command:
esxcli software vib list | grep arcconf
gives no output. And if I reduce the command to:
[root@blue:~] esxcli software vib list
Connection failed
?-(
I see. Can you check /etc/init.d/hostd status, and start it if required?
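If hostd reports "running" but esxcli still fails to connect, restarting the management agents sometimes clears it. A minimal sketch (the `command -v esxcli` check is only a guard so nothing runs on a non-ESXi box):

```shell
# Restart the ESXi management agents, then retry the VIB query.
# Guarded so it only acts on an actual ESXi host.
if command -v esxcli >/dev/null 2>&1; then
    /etc/init.d/hostd restart
    /etc/init.d/vpxa restart
    esxcli software vib list | grep -i adaptec
else
    echo "esxcli not found - run this over SSH on the ESXi host itself"
fi
```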
[root@blue:~] /etc/init.d/hostd status
hostd is running.
I just attempted two different types of upgrade:
1) standard ESXi ISO -> it doesn't see the Adaptec RAID, as expected
2) custom ISO, adding only the Adaptec drivers (no CIM provider or arcconf)
The second one ran into the usual issue.
Gosh, I'm lost... I guess the next step would be to try downgrading the firmware... but why does it work on the other server?
If the standard image installation brings ESXi up, you can try installing the Adaptec VIBs on top of it after the ESXi installation and see how it behaves:
vmware-esx-provider-arcconf.vib
vmware-esx-provider-arc-cim-provider.vib
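Something like the following, run over SSH on the host. The datastore path is an assumption; adjust it to wherever you copied the .vib files. The loop only prints the install commands so you can review the paths before actually running them:

```shell
# Assumed location of the uploaded .vib files - adjust for your host.
vibdir="/vmfs/volumes/datastore1/vibs"

# Print the install commands for review; run them manually once the paths look right.
for vib in vmware-esx-provider-arcconf.vib vmware-esx-provider-arc-cim-provider.vib; do
    echo "esxcli software vib install -v $vibdir/$vib"
done
```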
This problem could potentially be caused by VT-d being enabled in the BIOS with Adaptec controllers. Try disabling it first (I believe this issue was fixed in newer driver versions, but it wouldn't hurt to rule it out if you don't require hardware pass-through).
If you are still getting the same issue with only the driver installed (not vmware-esx-provider-arcconf.vib or vmware-esx-provider-arc-cim-provider.vib; they are not required for accessing the data, only for management of the controller/array), then you could be facing a more serious problem with the array.
Controller hangs or continuously scrolling "Host adapter abort requests" messages can also occur if the array contains a "badly behaving drive" when ESXi attempts to load the datastore. By "badly behaving drive", I mean specifically that the drive is not responding to commands from the controller, and the controller is (for whatever reason) unable to drop the drive from the array.
If you are using a 5-series or 6-series controller without a SAS expander, you can enable hard drive activity lights from within the controller's BIOS settings during boot-up of the server (CTRL-A). It might be possible to identify the "badly behaving drive" by observing the hard drive activity lights (I believe they are green LEDs on the controller). The bad drive will sometimes illuminate the activity light continuously, while the others flash as the drives are accessed.
If you have a 7-series or newer controller, I believe Adaptec removed the activity LEDs from those controllers, so you would not be able to identify a drive in this fashion. You would need some sort of HDD backplane that illuminates a drive-activity LED when the drive is being accessed by the controller.
If this is the case *AND* you have some remaining redundancy within your array (e.g. a RAID-6 array with one failed disk and *ALL* remaining disks working flawlessly with no undiscovered bad blocks), you could try disconnecting the "badly behaving" drive. *** USE CAUTION ***: if this does not solve the problem, you cannot simply re-add the disk to restore the previous level of redundancy without performing a rebuild. If you have multiple failed/failing disks, you could suffer a total loss of your array data by doing this. Proceed at your own risk.
If the above is not the case, you could try booting the server into an Adaptec-supported Linux distro or Windows and see if the OS reports similar abort commands. Linux might provide more detailed information about what is going on with the controller/array. If you succeed in booting Linux or Windows without the system locking up, you could download the controller's firmware logs by creating a "support archive" and reviewing those logs. The "support archive" logs often have more detail about what is happening within the controller/array. *** USE CAUTION *** NOT to write any data to the array while in Windows or Linux, as doing so could permanently damage the VMFS filesystem. Proceed at your own risk.
*** Additional Caution ***
If you have a 5-series or 6-series controller in RAID-6 with >2TB disks, DO NOT attempt to rebuild the array. There is an undocumented firmware bug in these controllers that *WILL* corrupt your data if you attempt a RAID-6 rebuild with >2TB disks while there is background write activity happening on the array (such as VMs running). The issue may also apply to RAID-5 arrays, but has not been confirmed as far as I'm aware. The same applies to older firmware versions of 7-series and 8-series controllers, though newer firmware resolved this issue for these controllers.
*** Additional Caution #2 ***
Don't change the firmware of the controller while the array is in a degraded state. It is recommended to first get the array up to "Optimal" state before changing firmware levels on the controller.
Thank you everybody for the great support. I'm so impressed to get so much attention on a Saturday!!
Long story short: it's fixed.
All I had to do was remove the faulty disk from the RAID-5 array, and now it boots properly. Note that this disk was part of a data array, not the boot array ESXi uses.
On one side, I'm happy things eventually work because, I don't know about you, but after hours of trying to make things work I get bored and lose interest... so when it finally works and I can relax on the sofa, it does come with a sense of achievement 😄
On the other side (the predominant one right now), I just want to say "what the heck!" A RAID controller is meant to provide resiliency, not to introduce extra issues. I don't have major concerns if a disk fails now and again, but why the OS would be affected by that, only God (and perhaps not even the Adaptec engineers) knows.
Final note: the last installation/upgrade I made was with the Adaptec controller drivers only (no CIM provider or ARCCONF). That is good enough for my needs, as I just need to see the status of the array under Hardware/Storage. However, after a quick check under Manage/Packages, filtering by Adaptec, this is what I see:
Name | Description | Version | Vendor | Installation Date
arc-cim-provider | Adaptec CIM Provider for managing Adaptec RAID controllers on ESXi 5.x | 2.05-22932 | Adaptec | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
arcconf | Adaptec CLI Provider for supporting remote ARCCONF tool to manage Adaptec RAID controllers on ESXi 6.x | 2.05-22932 | Adaptec | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
scsi-aacraid | Adaptec HBA Driver | 6.0.6.2.1.52040-1OEM.600.0.0.2494585 | Adaptec_Inc | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
scsi-adp94xx | Adaptec ADP94xx | 1.0.8.12-6vmw.650.0.0.4564106 | VMW | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
scsi-ips | Adaptec IPS | 7.12.05-4vmw.650.0.0.4564106 | VMW | Sat Nov 03 2018 15:26:14 GMT+0000 (Greenwich Mean Time)
I'm really puzzled trying to understand where arc-cim-provider and arcconf are coming from, since this is my custom build process:
PS C:\Users\Fefo\Desktop\vmware> dir vibs3
Directory: C:\Users\Fefo\Desktop\vmware\vibs3
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 11/08/2017 04:45 86002 vmware-esxi-drivers-scsi-aacraid-600.6.2.1.52040.-1.0.6.2494585.x86_64.vib
#############################################################################################
PS C:\Users\Fefo\Desktop\vmware> .\ESXi-Customizer-PS-v2.6.0.ps1 -izip .\ESXi650-201810002.zip -pkgDir .\vibs3 -nsc
Security warning
Run only scripts that you trust. While scripts from the internet can be useful, this script can potentially harm your computer. If you trust this script, use the
Unblock-File cmdlet to allow the script to run without this warning message. Do you want to run C:\Users\Fefo\Desktop\vmware\ESXi-Customizer-PS-v2.6.0.ps1?
[D] Do not run [R] Run once [S] Suspend [?] Help (default is "D"): R
This is ESXi-Customizer-PS Version 2.6.0 (visit https://ESXi-Customizer-PS.v-front.de for more information!)
(Call with -help for instructions)
Logging to C:\Users\Fefo\AppData\Local\Temp\ESXi-Customizer-PS-12704.log ...
Running with PowerShell version 5.1 and VMware PowerCLI version 11.0.0.10336080
Adding base Offline bundle .\ESXi650-201810002.zip ... [OK]
Getting Imageprofiles, please wait ... [OK]
Using Imageprofile ESXi-6.5.0-20181004002-standard ...
(dated 10/10/2018 09:59:40, AcceptanceLevel: PartnerSupported,
Updates ESXi 6.5 Image Profile-ESXi-6.5.0-20181004002-standard)
Loading Offline bundles and VIB files from .\vibs3 ...
Loading C:\Users\Fefo\Desktop\vmware\vibs3\vmware-esxi-drivers-scsi-aacraid-600.6.2.1.52040.-1.0.6.2494585.x86_64.vib ... [OK]
Add VIB scsi-aacraid 6.0.6.2.1.52040-1OEM.600.0.0.2494585 [OK, replaced 1.1.5.1-9vmw.650.0.0.4564106]
Exporting the Imageprofile to 'C:\Users\Fefo\Desktop\vmware\ESXi-6.5.0-20181004002-standard-customized.iso'. Please be patient ...
All done.
Good to hear your issue is fixed. Thanks for sharing details.
That's Adaptec.
They add all sorts of BS that you won't see from AMCC/3ware/LSI.
To each their own.
Out of curiosity, what is the "best" (or should I say least bad) SATA RAID controller for 6.5 as of today?
Out-of-the-box installation (no custom build needed) would be really nice, along with drivers that actually work.
It seems like I'm asking a lot 😢
I would stay away from Adaptec and all those Chinese software SATA cards on eBay that do nothing more than add ports to the mainboard. They barely work in Windows, let alone under a hypervisor.
Look at the RAID controllers that ship with HPE, Lenovo, and Dell systems, and you get an idea of what ESXi likes.
That’s true.