VMware Cloud Community
JRink
Enthusiast

lost iSCSI after reboot - VMs down...

Bit of a problem here.

Properly shut down all of my VMs. Then did a shutdown -h now on the ESX server. Bringing the server back online ended up getting stuck during the boot process at Restoring S/W iSCSI Volumes....

As of now, all of my VMs that were attached to the iSCSI SAN are showing with names of "Unknown (inaccessible), Unknown 2 (inaccessible)", etc.

My iSCSI initiator is still there under Configuration | Storage Adapters, but it shows targets: 0.

Really needing some help on this one... If anyone is able to help, please let me know. Not sure where to begin. Thanks.

9 Replies
Rob_Bohmann1
Expert

Looks like you lost your storage. For some reason your host is not seeing it, hence the 0 targets. Go into the storage adapter on the affected host and check that your target info (IP, CHAP) is still there. If it isn't, and you have another host in the cluster, you should be able to copy the settings from that host, or from wherever you have the info documented. If the target info is there, check the logs on the host, starting with what you see in /var/log ....
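If you have service-console access, the software-initiator state on ESX 3.x can also be checked from the command line. A rough sketch, assuming the adapter is vmhba40 (the name that shows up later in this thread) and standard log locations on your build:

```sh
# Is the software iSCSI initiator enabled?
esxcfg-swiscsi -q

# List the Send Targets (Dynamic Discovery) addresses configured on the initiator
vmkiscsi-tool -D -l vmhba40

# List the targets the initiator actually discovered (should not be empty)
vmkiscsi-tool -T -l vmhba40

# Then look for iSCSI errors in the logs
grep -i iscsi /var/log/vmkernel
tail -50 /var/log/messages
```

If the discovery address is present but the target list is empty, the problem is between the initiator and the SAN rather than in the host's saved config.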

JRink
Enthusiast

I know the SAN is working. I have a non-production ESX server that I use for testing, and that ESX box is still able to see the LUNs on the SAN.

The production server is not able to see them, however. I was thinking about re-creating the iSCSI software initiator on the production server from scratch... ?

Can someone calm my fears that my data/VMs are still okay?

Per your suggestion: if I go into the storage adapter on the production server host, I see that the target info is still there (IP address; no CHAP is set up) and the software initiator properties show Enabled. The target discovery method is Send Targets, and I have the IP address specified in the Dynamic Discovery tab. The firewall port is still open on ESX as well; I checked that...

Ug. Not good.

JRink
Enthusiast

I'm not sure what log to look at in the /var/log directory...

This is what boot.log is showing... (the end of the file)

Aug 28 19:19:53 esx1 network: Setting network parameters: succeeded

Aug 28 19:19:53 esx1 network: Bringing up loopback interface: succeeded

Aug 28 19:19:55 esx1 network: Bringing up interface vswif0: succeeded

Aug 28 19:19:57 esx1 network: Bringing up interface vswif1: succeeded

Aug 28 19:20:02 esx1 esxcfg-swiscsi: Waiting for discovery to finish...

Aug 28 19:22:03 esx1 esxcfg-swiscsi: Scanning vmhba40...

Aug 28 19:22:03 esx1 esxcfg-swiscsi: Discovery timeout

Aug 28 19:22:03 esx1 esxcfg-swiscsi: Not all luns may be visible

Aug 28 19:22:03 esx1 esxcfg-swiscsi: Doing iSCSI discovery. This can take a few seconds ...

Aug 28 19:22:08 esx1 esxcfg-swiscsi: Rescanning vmhba40...

Aug 28 19:22:08 esx1 esxcfg-swiscsi: done.

Aug 28 19:22:08 esx1 esxcfg-swiscsi: On scsi0, removing:

Aug 28 19:22:08 esx1 esxcfg-swiscsi: .

Aug 28 19:22:08 esx1 esxcfg-swiscsi: On scsi0, adding:.

Aug 28 19:22:08 esx1 vmware-late: Restoring S/W iscsi volumes succeeded

Aug 28 19:22:08 esx1 xinetd: xinetd startup succeeded

Aug 28 19:22:08 esx1 ntpd: succeeded

Aug 28 19:22:08 esx1 ntpd: ntpd startup succeeded

Aug 28 19:22:08 esx1 gpm: gpm startup succeeded

Aug 28 19:22:08 esx1 vmware-webAccess: Starting VMware ESX Server webAccess:

Aug 28 19:22:08 esx1 vmware-webAccess: VMware ESX Server webAccess

Aug 28 19:22:11 esx1 vmware-webAccess: ^[[60G

Aug 28 19:22:11 esx1 vmware-webAccess:

Aug 28 19:22:11 esx1 rc: Starting vmware-webAccess: succeeded

Aug 28 19:22:12 esx1 crond: crond startup succeeded

Aug 28 19:22:13 esx1 vmware-vmkauthd: Starting VMware VMkernel authorization daemon succeeded

Aug 28 19:22:13 esx1 pegasus: succeeded

Rob_Bohmann1
Expert

Discovery timeout... From the service console, can you ping the SAN IP address (target) using ping or vmkping?
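The distinction matters here: ping goes out through the Service Console (vswif) interface, while vmkping uses the VMkernel network stack, which is the path the software iSCSI initiator actually rides on. A sketch, with the SAN target address left as a placeholder:

```sh
# Service Console path (vswif interfaces)
ping -c 3 <san-target-ip>

# VMkernel path -- the one the S/W iSCSI initiator actually uses
vmkping <san-target-ip>
```

A host where ping works but vmkping fails points at the VMkernel port group or its VLAN tag rather than at the SAN itself.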

So are you using a software initiator or an HBA?

Your data is still there; this host just does not seem to be able to see the storage.

If your other hosts have the capacity, and you have to get these servers up immediately, you can use the VI Client to browse through the datastore, navigate to each VM folder, and right-click the .vmx file to register the VM on the new host. You will want to remove the affected host from VirtualCenter first, so there is no conflict over which host the VMs are registered on, and then re-add it after you have re-registered the VMs on another host(s).
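The same re-registration can be done from the service console of the host that can still see the LUNs; a hedged sketch using vmware-cmd (datastore and VM names below are placeholders):

```sh
# Register an existing VM on this host from its .vmx file on the shared datastore
vmware-cmd -s register /vmfs/volumes/<datastore>/<vm-folder>/<vm-name>.vmx

# Verify it now shows up in this host's inventory
vmware-cmd -l
```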

If you have time, I would personally try to resolve the problem, especially if you have many VMs on this host.

Also, was any maintenance done on the network? Maybe you lost the VLAN info and cannot connect to the storage anymore? (Just a guess.)

Message was edited by:

Rob.Bohmann

JRink
Enthusiast

Yes, I connected to the console via PuTTY and I can ping the SAN.

This is a S/W initiator... no hardware initiator.

I have thought about doing what you said and getting these servers running on the non-production ESX box by using the "Browse Datastore" option, but I'd really like to get these things up on the proper server.

I thought about the VLAN stuff, but again, remember: everything was working perfectly until I rebooted the server to add memory to it. Turns out I couldn't add the memory anyway, because the memory needs to be installed in pairs and my vendor only shipped me one stick, heh. So before the reboot, everything was working great. That's what's really odd here.

I *DID* find that when I pulled the server out of the rack on the rails, one of the network cables came loose. However, it wasn't a cable on the iSCSI network... and I've since plugged that other cable back in as well. No change, obviously.

My iSCSI SAN is on its own VLAN (500). vSwitch2 (Service Console 2 and VMkernel) is also on that VLAN. But like I said, pinging works great, so I know the VLAN is still configured properly. It's just been flaky since the reboot.....

Appreciate any help. I hate being in a pickle here.

JR

Rob_Bohmann1
Expert

If you can ping the SAN, then try rescanning your adapters to see if that picks up the LUNs.
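Besides the VI Client rescan, this can be driven from the service console on ESX 3.x; roughly (vmhba40 taken from the boot.log earlier in this thread):

```sh
# Kick off a fresh iSCSI discovery/scan for the software initiator
esxcfg-swiscsi -s

# Rescan the software iSCSI adapter for new LUNs and VMFS volumes
esxcfg-rescan vmhba40

# List which LUNs the host can now see
esxcfg-vmhbadevs
```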

JRink
Enthusiast

Definitely tried that one already. ;)

No luck. I'm wondering if I should completely blow away the software iSCSI initiator. No idea.

Rob_Bohmann1
Expert

Yes, that's what I would do next. I would completely recreate the vSwitch (write down the info first... lol) with both port groups (Service Console 2 and the iSCSI VMkernel). I've seen this before, where I could not get a server to connect and had to recreate it again.
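If it comes to rebuilding the vSwitch, the service-console equivalent would look roughly like this. The vSwitch name, port group names, and the VLAN 500 tag are taken from this thread; the NIC, IP, and netmask are placeholders, so record your existing layout with esxcfg-vswitch -l before touching anything:

```sh
# Capture the current layout first
esxcfg-vswitch -l

# Recreate the vSwitch, attach its uplink, and add both port groups
esxcfg-vswitch -a vSwitch2
esxcfg-vswitch -L <vmnicX> vSwitch2
esxcfg-vswitch -A "Service Console 2" vSwitch2
esxcfg-vswitch -A "VMkernel" vSwitch2

# Tag both port groups for the iSCSI VLAN
esxcfg-vswitch -v 500 -p "Service Console 2" vSwitch2
esxcfg-vswitch -v 500 -p "VMkernel" vSwitch2

# Recreate the VMkernel interface the S/W iSCSI initiator uses
esxcfg-vmknic -a -i <vmkernel-ip> -n <netmask> "VMkernel"
```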

p.s. your Brewers are leading 1-0 ...

Message was edited by:

Rob.Bohmann

JRink
Enthusiast

Problem solved.

The issue was this...

I had (2) ESX hosts: a prod box and a dev box. Both were set up as iSCSI initiators for the same LUN(s), but only the prod box was actually accessing the LUNs. Turns out my StoreVault SAN doesn't play nice when two initiators are mapped to a single LUN. So when the prod box went down, the dev box took control of those LUNs. I never ran into this before since I hadn't rebooted the box since setting this up. The short of it is, I had to remove the dev box's initiator mapping to the LUNs. After that, the prod box was once again able to see the LUNs.

The reason I had pre-mapped the dev initiator to the LUNs was a proactive attempt to have everything configured and ready to go IF my prod box went down. All I'd theoretically have to do was a re-scan on the dev box, find my VMs, and start them up in an emergency. Good plan, but it caused me this particular problem... so the new plan is, if the prod box goes down, to map the dev box's initiator to the LUNs at THAT time, but not beforehand. That should solve my issue. That, or getting the HA add-on, heh.

Overall, while it sucks having worked a nice 18-hour day, I learned some pretty cool stuff tonight about how VMware communicates with the SAN over iSCSI. It was worth the long night.

Thanks for all the tips along the way.

JR
