We have a new installation of two ESX 3.5 servers hosting Windows 2003 VMs. The VMs are stored on NetApp storage, via VMFS. We are trying to get SnapManager for Exchange running via the Microsoft software iSCSI initiator.
If you are familiar with SnapDrive and SME, we have installed all the necessary add-ons and hotfixes.
After SnapDrive is installed, the VM slows to a crawl.
One error we consistently get in the event log when this occurs is event ID 55.
We have this exact same setup running just fine on ESX 3.02.
We have had this setup for quite some time. We recently deployed a 3-node x64 Exchange 2007 cluster: 2 physical nodes and 1 passive VM node. The passive node is connected via iSCSI with the Microsoft initiator. SnapManager for Exchange 4.0 runs DB verifies against the passive VM node.
We're running SnapDrive 5.0 against a FAS6080A. The VM has 3 NICs: 1 public, 1 cluster, and 1 for the iSCSI VLAN.
I'm not familiar with error 55. Can you paste the error text?
Sorry, the event ID from the application log is 155.
"In-band SCSI command from LUN (FCP) to Storage System returned invalid data."
In the system log: event ID 15
"The device, \Device\Scsi\symmpi1, is not ready for access yet."
Just to add: it takes over half an hour to snap the Exchange DB and logs with NO verification. The Exchange DB is not in production and is very small.
Still need help!
We have 5 FC LUNs for our Exchange VM: one each for boot, pagefile, the EDB, logs, and the SnapInfo directory. Only the pagefile LUN is an RDM; the rest are VMFS.
We were running into major problems when we went into SnapDrive to configure the virtual disks. Any time we clicked on SnapDrive, it took about 20 minutes for it to come back and give us the dialog box on a right click. When it did finally come back, we were able to configure the disks.
The next thing we would do is go into SME to configure the server and move the datastores to the new LUNs. Once we started the configuration wizard, it would again take 20, sometimes 30, minutes for the MMC to become responsive. Once everything was finally configured, it took over half an hour to run an SME backup with no verification.
The error in the application log:
Event Id: 155
In-band SCSI command from LUN (FCP) to Storage System returned invalid data
The error in the system log:
Event Id: 15
The device, \Device\Scsi\symmpi1, is not ready for access yet.
We fought with this for several days, trying numerous things: installing pieces in a different order, using different ESX servers, etc.
Today we finally pinpointed the problem. It was the RDM pagefile LUN. We suspected as much, so we first moved the pagefile back to the boot LUN, but with the RDM still attached we still had the problem. Then we removed the RDM completely and everything worked as it should have.
This is not the only time we have seen this happen. At a recent customer install we saw the exact same symptoms.
This issue only seems to happen on ESX 3.5. We have RDMs on a 3.02 install running just fine.
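For anyone wanting to check whether their VM is logging the same two events quoted above (application log event 155 and system log event 15), here is a rough Python sketch that counts them in an event log exported to CSV. The column names `Log` and `EventID`, and the sample rows (including the unrelated 7036 entry), are made up for illustration; adjust them to match whatever your export actually contains.

```python
import csv
import io
from collections import Counter

# (log name, event ID) pairs reported in this thread.
SUSPECT_EVENTS = {("Application", 155), ("System", 15)}


def count_suspect_events(csv_text):
    """Count rows whose (Log, EventID) pair matches the events from this thread.

    Assumes a header row with 'Log' and 'EventID' columns (hypothetical names).
    Rows with missing or non-numeric values are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            key = (row["Log"].strip(), int(row["EventID"]))
        except (KeyError, ValueError, TypeError, AttributeError):
            continue
        if key in SUSPECT_EVENTS:
            counts[key] += 1
    return counts


# Made-up sample export; only the two message texts come from this thread.
sample = """Log,EventID,Message
Application,155,In-band SCSI command from LUN (FCP) to Storage System returned invalid data
System,15,"The device, \\Device\\Scsi\\symmpi1, is not ready for access yet."
System,7036,Some unrelated service event
"""

print(count_suspect_events(sample))
```

If both counts climb while SnapDrive or SME is hanging and drop to zero after the RDM is detached, that would line up with what we saw.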
Do you have your 'preferred storage system IP address' defined for your SnapDrive configuration?
For our FC-attached physical boxes, we have to define this, or SnapDrive takes forever while it waits on timeouts.
Your situation was likely caused by SnapDrive seeing the device IDs: since the pagefile was an RDM, it couldn't differentiate between that and a LUN presented via iSCSI or FC. It was trying to pass FC/SCSI commands over what it considered to be an HBA device, but which was in fact the LSI controller. The problem would be worse if the preferred storage system IP isn't defined.
I haven't run into this because we don't do anything with RDM files. NetApp suggests putting the pagefile on a different LUN/volume for snapshot purposes, but we've found that if you give your VMs enough RAM, they page VERY little and consequently don't create a very big snapshot.
Well, if I get the chance I will try some more tests... ON A FLEXCLONE! What is strange, though, is that this only happens on ESX 3.5. We have several installations of this running on 3.02.
I'd like to add that I experience the EXACT same symptoms with ESX 3.5, SME, and an RDM LUN for the pagefile. Once SME kicks off, the server becomes completely unresponsive and the event logs are filled with the two event IDs you describe. I disconnect the RDM and it's all better. Anyone know what gives?
Can someone explain how to use RDMs with SnapDrive? What does that even mean?
When you create a LUN with SnapDrive, it lets you connect to the NetApp filer. Where can you specify to connect to an RDM?