VMware Cloud Community
HarisB
Contributor

ESX 3.5i and iSCSI disk corruption case? Very weird :-(

Hi all,

I'm having a hard time finding the source of a strange problem, and I hope you guys can help. Here is the situation:

A few days ago I built a couple of ESX 3.5i servers (latest version available for download), PXE booted and configured, diskless and without USB sticks. I connected them to a fresh installation of fully patched Microsoft Windows Storage Server configured as an iSCSI target, with 2 LUNs of 900 GB each. I created a few test VMs with 10 GB VMDKs and, without powering them on or installing any OS in them, moved them from one LUN to another, between servers, etc. to test throughput over the dedicated gigabit iSCSI adapters and network. This worked well; I was satisfied with hitting 60% utilization of my gigabit network without going into jumbo frames.
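For reference, the 60% figure can be turned into a rough transfer rate. This is only a back-of-the-envelope sketch: the function names are made up for illustration, protocol overhead is ignored, and it assumes the 10 GB VMDK is fully allocated (10 GiB on disk), which the post does not state.

```python
def payload_bytes_per_second(link_bits_per_second, utilization_percent):
    """Approximate bytes/s moved at a given link utilization (overhead ignored)."""
    used_bits = link_bits_per_second * utilization_percent // 100
    return used_bits // 8


def transfer_seconds(size_bytes, bytes_per_second):
    """Time to move size_bytes at a constant rate."""
    return size_bytes / bytes_per_second


GIGABIT = 1_000_000_000          # 1 Gbit/s link
rate = payload_bytes_per_second(GIGABIT, 60)

# Assumption: a fully allocated 10 GiB VMDK.
vmdk_bytes = 10 * 2**30
print(f"{rate / 1_000_000:.0f} MB/s, "
      f"{transfer_seconds(vmdk_bytes, rate):.0f} s per 10 GiB VMDK")
```

So 60% of a gigabit link is about 75 MB/s, which is a plausible figure for a single iSCSI path without jumbo frames.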

Then I moved some non-critical production VMs from a few 3.5 servers with local storage to the new 3.5i servers and their iSCSI LUNs, using VMware Converter. The old servers were in a different subnet, so Converter seemed an easy way of moving the VMs around. The conversions completed without problems. Then I powered up those VMs, and they were all over the place: blue screens, ntoskrnl and similar boot files missing, VMs powering up with "One or more registry files were restored...", and a jumpy mouse indicating VMtools were messed up even though they were installed and visible in the system tray.

So I figured something had gone horribly wrong with Converter, and after some network reconfiguration I was able to VMotion the VMs (powered down) from the 3.5 servers to the 3.5i servers. Same problem again, VMs all over the place. At this point I figured something was wrong with disk access or data transfer, some kind of data corruption.

Next I tested a couple of VMs that came up "wounded" but functional, with registry files restored, a jumpy mouse and an older version of VMtools installed. Inside one VM I copied drivers.cab several times into a test folder (to reach 500 MB of files) and copied those files over the network to another VM running on another 3.5i server and a different LUN. This worked just fine, no problems there. Very weird.
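A copy that finishes without errors does not by itself prove the data arrived intact, so a checksum comparison of the source and destination folders would make this test stronger. Below is a minimal sketch, assuming Python is available in the guests; the function names and the example paths are hypothetical, not something from the original setup.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_dir, copy_dir):
    """Return relative paths whose copy is missing or has different content."""
    mismatches = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, source_dir)
            dst = os.path.join(copy_dir, rel)
            if not os.path.isfile(dst) or sha256_of(src) != sha256_of(dst):
                mismatches.append(rel)
    return sorted(mismatches)


if __name__ == "__main__":
    # Hypothetical paths: the local test folder and its copy on the other VM.
    for rel_path in compare_trees(r"C:\test", r"\\othervm\share\test"):
        print("MISMATCH:", rel_path)
```

Note that reading the files back immediately may be served from the guest's cache, so running the comparison again after a reboot of the VM is the more telling check.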

I then tried another test: creating a brand new VM on an iSCSI LUN and installing W2K3 on it. I thought that if the problem was in copying VMs from 3.5 local storage to 3.5i iSCSI, this should give me a "clean start". After all, I had created VMs and moved them powered down between LUNs and 3.5i servers without problems in the test phase, so this should work fine. So I created a default VM, didn't change any options, and kept clicking Next until the VM was created: 8 GB disk on LSI, 256 MB RAM, 1 vCPU, network, the usual. I powered up the VM and connected its CD/DVD to the client device, which was in turn connected to a known good W2K3 CD1 ISO that I have used a million times for installations or CD burning. Installation started, the VMDK was recognized, all 8 GB of it, and I selected it for installation. The full NTFS format went to 100%, then Windows setup reported "Setup was unable to format the partition. The disk may be damaged." I got the same result on another 3.5i server and another LUN.

I realized I needed to get back to basics to be able to troubleshoot this properly.

So I torched the entire setup and installed a single 3.5i on local storage rather than through PXE. I did this because I had been unable to initiate installation of VMtools on the migrated VMs, see for more info. I realized the PXE installation didn't copy or create windows.iso on the 3.5i server and that this is likely the cause of that trouble (I still have to confirm I'm right on this one). All went OK with this install. I connected to the (deleted and recreated) iSCSI LUNs, which were recognized as "hard disk is blank" and formatted with VMFS, confirming they were properly toasted after all the tests above. The connection is made using a crossover cable straight to the iSCSI storage server, to rule out the switch as the cause.

Now the tests. On local storage, I was able to create a brand new VM and install 2003 using the same process described above. It went through the first part of 2003 setup and after the reboot got to the second part. I left it sitting there since the test was accomplished, everything working just fine. Then I created another VM, this time on iSCSI, and tried the same thing. As I suspected, I hit exactly the same problem as above, "Setup was unable to format the partition...".

Another test - I have the Storage VMotion plugin installed on my VC (which is also the latest available for download), so I Storage VMotioned my partially installed 2003 server, while running, from local storage to iSCSI, and this completed without errors. The partial installation was sitting there, Windows setup waiting for me to insert the installation CD to continue, and the VM was alive on iSCSI. So I pulled the plug on it to see what would happen after a reboot. It came back with "ntoskrnl.exe is missing or corrupt, please reinstall the file", indicating disk corruption. I powered it off, migrated the VM back to local storage, powered it up, and got the same error message, confirming that powering up the VM on iSCSI had screwed it up for good.

From what I can tell, the ESX 3.5i servers (PXE or local install) have no problem accessing the LUNs, sharing them, moving files around, etc. The problem is that Windows cannot access its VMDK properly (I tried both 2003 and XP in both local and iSCSI scenarios; local works and iSCSI gives me the same problem). It can't format it, can't communicate with it, can't read or write data reliably or at all. The only consistent thing is that the problems always appear at the same spot or stage, so I know that once I figure out what is wrong there is a good chance of everything working fine.

I have worked with ESX 3.5 in an FC SAN environment with a similar setup: servers were booted from small LUNs, all VMs were on shared LUNs, and no VMs had direct access to LUNs. This iSCSI setup is very similar in that regard, yet something is very wrong.

Another observation - I Storage VMotioned one XP VM, while powered up, from local storage to iSCSI. Once it was there, I copied some files from C:\windows to C:\test to get 500 MB for a test. This worked fine. But then I couldn't open any network shares (getting "The specified server cannot perform the requested operation"), and attempting to start Task Manager, IE or any other application simply caused that application to disappear from the screen, if it ever got as far as showing anything (a splash screen, for example). Starting Task Manager actually leaves its icon in the system tray, but when I roll the mouse over it the icon disappears, as if the process was killed. Attempts to access this VM from the network didn't get anywhere, even though I could ping it and ping from it. You may guess what happened on restart of this VM: Windows could not start, "load needed DLLs for HAL".

I've spent days on this and am none the wiser. I hope you'll have a suggestion or two, because I'm out of options and have no idea where to look.

Thanks

1 Reply
stevieg
Enthusiast

Did you ever get to the bottom of this?

I've seen your other post regarding the NIC and wondered if you got a solution in the end. I'm experiencing a very similar issue, and until I read your post I thought I was going crazy!

Steve
