I am coming to the group for support as a last resort, since VMware support has been unable to solve this one and it is causing major issues for me. Our database server sits on a single RAID 10 LUN, by itself, with no other servers. About two days ago, performance dropped off to the point where the machine was almost unusable. Transferring a 1GB file from the internal C: drive to D: took over two hours, where on another server it takes less than 15 seconds. I tried a Storage VMotion to another datastore, but that process failed after vCenter lost connectivity with the database on the SQL server. After that, I could no longer VMotion the server off because it would throw errors.

At this point, all VMware support could recommend was a cold migration, which I am currently attempting, but it is moving at a snail's pace (only 7% in almost 9 hours, at a transfer rate of 6.12MB/s). Any ideas on what I can check, either on the SAN side or the VM side, to help speed this along? I was able to get a 24-hour downtime window, but I have to have this system back up before Monday. I appreciate any help you can provide!
6 x Dell 2950 III Servers w/32GB RAM, Dual Quad Core X5470 3.33GHz Processor, and Dual Emulex 4Gb Fiber HBAs
Dell AX4-5 Fiber SAN w/36TB Storage running Navisphere 6.28
Dual Brocade 300 Fiber Switches
ESXi 4 updated to the latest patches on all ESX servers
vCenter 4 Standard
I have eight datastores in total. All of them are on RAID 5 arrays and shared with other virtual servers, except for the one this machine is on: a RAID 10 array formatted with a 4MB block size. The machine ran perfectly for months until a few days ago, when I received a message from the SAN about soft media errors on one of the disks. I have since replaced that disk, but performance slowed noticeably after that. The array is made up of four 1TB 7200RPM disks.
Virtual Machine Details:
The VM has four vCPUs, 16GB RAM, and two virtual hard disks attached. The primary disk is a 40GB drive running Windows Server 2003 Enterprise (32-bit). The second is a 1TB drive where all of the databases are stored. The first disk has about 22GB of free space, while the second has 950GB free.
The only other option I can think of is a hot clone to another VM, but I'm not sure whether that would be any faster. Since I can do it while the machine is running, though, it might be my only option if I can't get this working over the weekend. Let me know if you need any further information about my environment. I appreciate any and all help you can provide!
It's Saturday night and Halloween, so I wanted to at least throw a few bones your way, since most people are probably not in front of their computers at the moment (stateside, anyway).
One thing I noticed about the info you provided is that you didn't mention whether you had completely cold-booted or restarted any devices. I have run into problems with vSphere and iSCSI SAN LUNs where an interruption was severe enough that only completely powering the affected units down and then rescanning the bus would help. Interestingly enough, I was working with a 1TB datastore as well, with the larger block size.
Yeah, I appreciate the help, since it is Halloween! Are you talking about shutting down an SP on the SAN or the ESX server? I haven't tried shutting down the ESX server, since I automatically assumed it was a problem on the SAN, but I guess it could be an issue with the server. Can I still reboot the ESX server and continue my cold migration? Would the cold migration still be slow if it were an ESX issue?
I would start with the SAN. If you can cold-boot that and then rescan the bus on the ESX server, that might correct the problem. If your cold migration is in progress, though, I don't know what you could do about that.
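For the rescan part, a minimal sketch from the ESX host's console (Tech Support Mode on ESXi 4); the adapter name `vmhba1` is a placeholder, so list your adapters first to find the real one:

```shell
# List storage adapters so you can find the Emulex HBA's vmhba name
esxcfg-scsidevs -a

# Rescan that adapter for LUN/path changes ("vmhba1" is a placeholder)
esxcfg-rescan vmhba1
```

The same rescan is available in the vSphere Client under Configuration > Storage Adapters > Rescan, if you'd rather not drop to the console.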
I haven't done the math, but have you figured out whether, at the rate it's currently transferring, it would even complete before Monday (assuming it was successful)? If not, you're better off canceling it if possible and trying cold boots on both the SAN and ESX, to at least take those approaches out of the equation.
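The back-of-the-envelope version, assuming the cold migration copies the full 1TB provisioned data disk at the observed 6.12MB/s, works out roughly like this (integer shell arithmetic, rate scaled by 100):

```shell
# Rough ETA for the cold migration at 6.12 MB/s over 1TB
total_mb=$(( 1024 * 1024 ))           # 1TB expressed in MB
seconds=$(( total_mb * 100 / 612 ))   # 6.12 MB/s, scaled by 100 for integer math
hours=$(( seconds / 3600 ))
echo "$hours"                         # prints 47
```

Roughly 47 hours for the data disk alone; starting Saturday night, that lands well past your Monday deadline, which argues for canceling and trying the cold boots instead.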
From there, if you can power on and operate other VMs on the ESX/SAN combo without any issues, then there is probably corruption in the VMDK file (or its associated files) within that particular VM. At that point it becomes lower-level troubleshooting, so I wouldn't be much help there.
Everything else on the SAN is working perfectly; it's just this one VM on this one LUN. But since I don't have any other VMs on that LUN, I don't have anything to compare it against. This is something I feel like I should know, but just to be sure: if I reboot one SP on my SAN, everything should fail over to the other SP, correct? This is a live environment, and I can't bring all of my VMs down to reboot the SAN without a bit more change-management fun.
Okay, try reversing the troubleshooting.
"just this one VM on this one LUN."
Is there a VM (test or otherwise) that you could transfer TO the troubled LUN and power on, to see if it functions okay?
If it's successful, that would further isolate the actual troubled VM as the problem, rather than the hardware itself.
Are the disks SATA? I would look at the performance graphs on the SAN for that RAID group: queue length, response time, and I/O throughput per second. That should help nail down the issue.
Hope this helps!
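You can cross-check those same numbers from the host side with esxtop (run from the ESXi Tech Support Mode console); the keys below are interactive keystrokes, not flags:

```shell
# Host-side latency view with esxtop; press 'u' for the disk-device view
esxtop
# In the 'u' view, the columns to watch per device are:
#   DAVG/cmd - time spent at the device/array; consistently high values
#              point at the SAN
#   KAVG/cmd - time spent in the VMkernel; high values point at the host
#   QUED     - commands queued, i.e. the queue length mentioned above
```

If DAVG is high while KAVG stays low, the SAN-side graphs and the host-side view should agree that the array is the bottleneck.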
I would advise the same thing. I am not that familiar with the AX4-5, but I do know the CLARiiON range quite well; it includes a piece of software called Navisphere Analyzer, which should let you see the performance of that particular LUN. Check whether the issue is on the SAN unit or in VMware, and also look at the performance graphs on your host to see if they match the ones on the SAN. How long ago did you change the drive? If the SAN is still rebuilding the RAID group, that will impact performance as well.
Make sure you also check the cache settings on both SPs. EMC will disable cache for a number of reasons (recharging batteries, failure of a disk in the "vault area", etc.), and that impacts performance the most. Last but not least, make sure all LUNs are distributed across the SPs; maybe you have another heavy workload running on the same SP, which means they are both hammering the same cache. You should be able to move a LUN to another SP in its properties, and this can be done online.
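If you have the Navisphere CLI installed somewhere, the cache check is quick; the IP below is a placeholder for one SP's management address:

```shell
# Check SP cache state with Navisphere CLI
# (10.0.0.1 is a placeholder for your SP's management IP)
naviseccli -h 10.0.0.1 getcache
# Look for read/write cache reported as disabled -- the array drops
# write cache during battery recharge or when a vault drive has failed,
# which is worth checking given the disk you just replaced.
```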
I looked at my virtual disks, and all of them are owned by SP B except for four. The only option I see to change LUN ownership is to change the default owner, but when I make that change, it doesn't change the current owner to the other SP. Any ideas?
Right-click the LUN and select failover; that should do the trick. I'm not saying it will fix the issue, but if the SP is very busy it could be hammering the cache.
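The CLI equivalent is a trespass, as I recall it from the Navisphere CLI; verify the exact syntax against your version, and note that the IP and LUN number here are placeholders:

```shell
# Show which SP currently owns LUN 5 (IP and LUN number are placeholders)
naviseccli -h 10.0.0.1 getlun 5 -owner

# Trespass LUN 5 to the peer SP (same effect as the failover click)
naviseccli -h 10.0.0.1 trespass lun 5
```

Changing the default owner only takes effect on the next trespass or SP reboot, which is why the current owner didn't move when you changed it.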
Did you ever discover the source of this problem?
I ask because we are evaluating an AX4-5i, and I have not seen many negative comments about them. I'm curious whether you eventually traced the fault here to the SAN or whether it was elsewhere.
In EMC's case, the block size shouldn't be over 1MB, since larger block sizes disable the cache on the controllers when using VMFS.
Check whether you gain performance for the DB on a VMFS datastore with a lower block size.
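A test datastore with a 1MB block size can be created with vmkfstools; the device path below is a placeholder for a free LUN's partition:

```shell
# Create a test VMFS3 datastore with a 1MB block size
# (the device path is a placeholder -- substitute your actual free LUN)
vmkfstools -C vmfs3 -b 1m -S TestDS /vmfs/devices/disks/naa.xxxxxxxx:1
```

One caveat: VMFS3 with a 1MB block size caps individual files at 256GB, so the 1TB database VMDK itself still needs the 4MB block size; the comparison would only work with a smaller test VMDK.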