VMware Cloud Community
RobJanssen86
Contributor

New ESX 4.1 server screws up storage

The situation is as follows:

We have two production ESX 3.5.2 hosts (in a cluster) running all the virtual machines.

Each server is connected with two HBAs to our IBM DS4700 Fibre Channel SAN via a switch. This all works well.

Now I have purchased two new servers (DL380 G7, ESX 4.1) with two (Enterprise) licenses. Both servers are connected to the vCenter Server, and I built a cluster with HA and DRS. These servers are also connected to our DS4700 SAN through Fibre Channel.

I connected the first new server to our DS4700 through a Fibre Channel switch; it can see the LUNs, and the vCenter Server is installed on the shared storage. But when I connect the second DL380 G7 server to the SAN, the VMs running on the production ESX 3.5 hosts described above log the following error in their Event Viewer:

dmio: Disk Harddisk1 block 118099311 (mountpoint D:): Uncorrectable read error

So it seems that the second ESX server is corrupting the storage in some way.

I’ve created a new ‘group’ in the IBM Storage Manager, so all HBAs are ‘known’ to the array.

It seems I am doing something wrong, but I don’t see what yet.

Does anyone have an idea?

idle-jam
Immortal

I would advise logging a call with VMware support to look into the issue. I certainly wouldn't use trial and error when storage is involved, but that's just me.

RobJanssen86
Contributor

I agree it isn't something to experiment with.

The strangest thing is that the first of the two new servers has no problems with the storage, nor does it cause any.

I just connected the HBAs, rebooted the server, rescanned for storage where needed, and that was it. The second server, however, is messing with the storage. I will wait a few hours before opening a ticket with VMware, so if anyone has another suggestion, be my guest.

RobJanssen86
Contributor

Could it have anything to do with the zoning on the Fibre Channel switch between the SAN and the ESX servers? There are several LUNs on the SAN: three are VMFS LUNs and the rest are RDM LUNs for some VMs. When I connect that particular server, it sees the VMFS volumes.

a_p_
Leadership

First of all, how did you configure the FC zones? Did you set up single-initiator zoning?

André

RobJanssen86
Contributor

After checking the zoning on the Fibre Channel switches, one thing is very clear: it's a mess.

The two switches have different zoning configurations.

I think the easiest step would be to remove the zoning configurations. The storage ports (port 0) are configured as Initiator+Target.

What's the risk of removing the zones?

a_p_
Leadership

What's the risk of removing the zones?

Never remove all the zones! With no zones configured, the FC switch would allow every system to access everything.

What I would probably do is configure additional new zones, one for each initiator (host) port. Once this is done and you are sure the zoning is correct, delete the old, messed-up zones from the configuration, then activate the cleaned-up configuration. If you have multiple fabrics, do this for one fabric first and, if everything looks good, repeat it for the second fabric.

I usually create an alias containing all the target ports and then create zones with names like Z_<initiator-name>_<storage-system>, where an initiator is a single HBA port.
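
To illustrate the naming scheme just described, here is a rough sketch (not switch CLI, just planning code) that turns a list of HBA port aliases into one zone per initiator, each containing that single initiator plus the alias grouping all storage target ports. The port aliases, the target alias name, and the `single_initiator_zones` helper are all hypothetical placeholders; with two fabrics you would run it once per fabric, passing only the HBA ports cabled to that fabric.

```python
"""Hypothetical sketch: plan single-initiator zones named
Z_<initiator-name>_<storage-system>. All aliases are made-up placeholders."""

from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    members: tuple[str, ...]  # aliases that belong to this zone


def single_initiator_zones(
    initiators: list[str],  # one alias per HBA port (each port is one initiator)
    target_alias: str,      # alias that already groups all storage target ports
    storage_name: str,
) -> list[Zone]:
    """Return one zone per initiator port: that initiator plus the target alias."""
    return [
        Zone(name=f"Z_{port}_{storage_name}", members=(port, target_alias))
        for port in sorted(set(initiators))  # sorted and de-duplicated for stable output
    ]


if __name__ == "__main__":
    # Hypothetical HBA port aliases for the ports on one fabric.
    fabric_a_ports = ["esx41a_hba0", "esx41b_hba0", "esx35a_hba0", "esx35b_hba0"]
    for zone in single_initiator_zones(fabric_a_ports, "A_DS4700_TARGETS", "DS4700"):
        print(zone.name, ";".join(zone.members))
```

Keeping exactly one initiator per zone means one HBA's fabric login or reset activity is not visible to the other hosts' HBAs, which is the reason behind the single-initiator recommendation.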

André

Saturnous
Enthusiast

Sounds to me like the wrong "host mode": the DS is not aware that the new host is a VMware server. I solved a similar issue once but can't remember the correct name (some A?? switch). Another idea would be to check the multipathing: it's active/passive storage, and you are mixing ALUA-aware ESX with non-ALUA-aware ESX, so check whether the active targets (esxcfg-mpath output) are the same on all hosts, and correct them if not. Google for path thrashing.
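
The consistency check suggested above can be sketched like this. It assumes you have already read the multipathing state on each host (for example from esxcfg-mpath) and reduced it yourself to a mapping of LUN to the controller currently carrying I/O; the parsing step is deliberately left out, so no particular output format is assumed. Host names, LUN names, controller labels and the `inconsistent_luns` helper are hypothetical. On an active/passive array, hosts that disagree about the active controller for the same LUN can cause the LUN to be moved back and forth between controllers, which is what path thrashing refers to.

```python
"""Hypothetical sketch: flag LUNs whose active controller differs between hosts.
Input is hand-prepared data, not parsed esxcfg-mpath output."""

from collections import defaultdict


def inconsistent_luns(active_paths: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Given {host: {lun: active_controller}}, return {lun: {host: controller}}
    for every LUN on which the hosts do not all use the same controller."""
    by_lun: dict[str, dict[str, str]] = defaultdict(dict)
    for host, luns in active_paths.items():
        for lun, controller in luns.items():
            by_lun[lun][host] = controller
    return {
        lun: hosts
        for lun, hosts in sorted(by_lun.items())
        if len(set(hosts.values())) > 1  # more than one controller in use
    }


if __name__ == "__main__":
    # Made-up example: controllers "A" and "B" of an active/passive array.
    observed = {
        "esx35-01": {"vmfs1": "A", "vmfs2": "B"},
        "esx35-02": {"vmfs1": "A", "vmfs2": "B"},
        "esx41-02": {"vmfs1": "B", "vmfs2": "B"},
    }
    for lun, hosts in inconsistent_luns(observed).items():
        print(f"{lun}: hosts disagree on the active controller: {hosts}")
```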

RobJanssen86
Contributor

The zoning on one switch looks like this:

"Each HBA can access certain storage device. And each storage device can be accessed by certain HBA.

Also there is no offline device found in current zoning database."

This can't be good...

Although this is the switch that is working 'correctly'.

I think the solution to my problem is simply to set up the zoning correctly.
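
Before cleaning up, it may help to audit the existing zones against the single-initiator rule discussed in this thread. The sketch below assumes you have copied the zone names and their members from the switch into a dictionary by hand, and that you know which aliases are host HBA ports and which are storage ports; all names and the `audit_zones` helper are hypothetical. Because the storage ports here are reported as Initiator+Target, list them only in the targets set, otherwise every zone would appear to contain an extra initiator.

```python
"""Hypothetical sketch: audit a hand-exported zone list.
Rule checked: exactly one initiator and at least one storage target per zone."""


def audit_zones(
    zones: dict[str, list[str]],  # zone name -> member aliases
    initiators: set[str],         # host HBA port aliases
    targets: set[str],            # storage port aliases (put storage ports only here)
) -> dict[str, list[str]]:
    """Return {zone_name: [problems]} for every zone that breaks the rule."""
    problems: dict[str, list[str]] = {}
    for name, members in sorted(zones.items()):
        found_initiators = [m for m in members if m in initiators]
        found_targets = [m for m in members if m in targets]
        unknown = [m for m in members if m not in initiators and m not in targets]
        issues = []
        if len(found_initiators) != 1:
            issues.append(f"{len(found_initiators)} initiators (expected 1)")
        if not found_targets:
            issues.append("no storage target")
        if unknown:
            issues.append(f"unknown members: {', '.join(unknown)}")
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    hbas = {"esx35a_hba0", "esx35b_hba0", "esx41a_hba0", "esx41b_hba0"}
    storage = {"ds4700_ctrlA_p0", "ds4700_ctrlB_p0"}
    current = {
        "old_zone_1": ["esx35a_hba0", "esx35b_hba0", "esx41b_hba0", "ds4700_ctrlA_p0"],
        "Z_esx41a_hba0_DS4700": ["esx41a_hba0", "ds4700_ctrlA_p0", "ds4700_ctrlB_p0"],
    }
    for zone, issues in audit_zones(current, hbas, storage).items():
        print(zone, "->", "; ".join(issues))
```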
