I have two HP servers, each running vSphere (ESXi) 5.0.
To upgrade them I did the following:
1. I installed vCenter Server on a physical machine.
2. Moved all the VMs from the first ESXi host to the second
3. Installed vSphere 5.5 on the first server
4. Moved all the VMs back to the first host
5. Installed vSphere 5.5 on the second server
6. Moved the VMs back to the second host
For 2 days all was working well!
Then suddenly all the VMs on server 2 became inaccessible. vCenter reported a problem accessing one of the LUNs, and the VMs located on that LUN were marked "inaccessible", but I can't reach any of the VMs on that host. The console won't open even for the VMs that are supposedly unaffected. The only way to recover is a reboot. After the reboot, the problematic LUN is missing from the storage view in the host's configuration, and it takes many tries to add it back.
And then the next day the same thing happened to server 1.
After rebooting and re-adding the missing storage I thought all was well - end of story.
But it happened again and again: one day server 1, the next day server 2, and so forth (three times so far).
I don't have any vMotion or other advanced configuration applied.
I don't even know where to start troubleshooting this (where are the log files? I've never used the command line on vSphere). Today I can't even re-add the missing disk/LUN: I go through the same steps as before and select the storage the system finds, but it doesn't get added!
You're right to go for the logs first.
To find where the logs live on 5.5, the following KB should help.
You obviously want to be looking around the time you see the disconnects.
How are you connecting to your LUNs? Fibre channel? iSCSI? NFS?
/var/log/esxupdate.log: ESXi patch and update installation logs.
This log might show any issues that occurred during your upgrade (if you upgraded in place rather than doing a fresh install of 5.5).
The next two key logs for diagnosis would be:
/var/log/vmkernel.log: Core VMkernel logs, including device discovery, storage and networking device and driver events, and virtual machine startup.
/var/log/vmkwarning.log: A summary of Warning and Alert log messages excerpted from the VMkernel logs.
These will hopefully show you errors when the LUN "drops off".
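As a rough sketch of what to look for (the exact messages vary by storage driver, and the log line below is a fabricated example, not taken from your host), you can filter the VMkernel log from an SSH session with `grep -iE 'naa\.|SCSI|failed H:0x' /var/log/vmkernel.log`. Here the same filter is demonstrated on the fabricated sample line:

```shell
# Fabricated example of the kind of SCSI error line vmkernel.log may contain
sample='2014-01-01T00:00:00Z cpu0 WARNING: NMP: nmp_ThrottleLogForDevice: Cmd 0x28 to dev "naa.600508b1001c0000" failed H:0x5 D:0x0 P:0x0'
# Count lines matching a typical storage-failure pattern
echo "$sample" | grep -cE 'failed H:0x'
```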
Did you use the HP-customized 5.5 ISO to upgrade your servers? It contains HP-specific drivers for NICs, HBAs, etc., and a driver mismatch there may be where the instability was introduced.
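For the LUN that will not re-add through the client, a rescan from the command line is worth a try. These are standard ESXi commands (run via SSH on the affected host); this is only a sketch of the usual first steps, not a guaranteed fix:

```shell
# Rescan all HBAs for new or changed devices
esxcli storage core adapter rescan --all
# List detected storage devices to confirm the missing LUN is visible again
esxcli storage core device list
# List any VMFS volumes detected as unresolved (e.g. seen as snapshots)
esxcfg-volume -l
```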
I found out how to download all the log files from vCenter.
Now I'm waiting for one of them to fail again
I did a clean install. The first time I used the HP image; the second time (to rule out the HP image as the culprit) I used the generic image and rearranged the RAID configuration, but it didn't help.
My LUNs are all made from local disk.
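Since the LUNs are all local disks, I also want to check the disks themselves. ESXi can report basic SMART data per device; the device identifier below is a made-up placeholder, to be replaced with the real one from the device list:

```shell
# List local devices and their naa identifiers
esxcli storage core device list
# Query SMART health for one device (hypothetical naa ID shown)
esxcli storage core device smart get -d naa.600508b1001c0000
```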
Now I'm making some changes on the networking side: in my previous configuration the VMkernel port shared the same NIC as the Virtual Machine port group, so I created another standard switch containing only the VMkernel port.
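To double-check the new layout, the vSwitch and VMkernel assignments can be listed directly on the host (standard ESXi commands):

```shell
# Show all vSwitches, their uplink NICs, and port groups
esxcfg-vswitch -l
# Show VMkernel interfaces and which port group each one uses
esxcfg-vmknic -l
```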
Thanks for your input.
I just had another event.
All the VMs looked OK; only the host showed an issue.
Of course, I was not able to open the console from vCenter either: Unable to connect to the MKS: Could not connect to pipe \\.\pipe\vmware-authdpipe within retry period
Remote Desktop (TeamViewer) is not working either. Basically all the VMs are offline!
I tried to export the log files via vCenter but the process failed. After the reboot it seems that some of the log files get wiped.
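Since the vCenter export fails, and part of the ESXi logs live on a ramdisk (which would explain them being wiped by the reboot), two things may help: generating the support bundle directly on the host with the standard `vm-support` tool, and redirecting logs to a datastore so they survive reboots. The datastore name below is an example; substitute your own:

```shell
# Generate a full support bundle on the host itself (path is printed when done)
vm-support
# Point syslog at persistent storage ("datastore1" is a placeholder name)
esxcli system syslog config set --logdir=/vmfs/volumes/datastore1/logs
esxcli system syslog reload
```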
Here are some screenshots of vCenter and of ssh session with host.