urgrue
Contributor

What's with VMFS?

After experiencing lost LUNs, lost VMs, manually editing .vmx files to help a VM find its disks, manually setting iSCSI node names, enabling resignature (and having to do the remove-from-inventory/re-import dance to get VMs working again), fdisking LUNs to get ESX to see them again, etc. etc. etc., I've come to the conclusion that, in my experience, VMFS is by far the weakest link in the whole VI3 environment. ESX is rock solid. VC, though buggy, is mostly only buggy in non-production-critical things. But with such a flaky storage technology underneath it, the whole thing is undermined.

It seems to me that the vast majority of the problems I've encountered (and by the looks of these forums I'm not nearly the only one) would all be avoided if ESX did two things:

-use LUNs the way most good RAID arrays use disks: write metadata onto the LUN that positively identifies it, and use this (and only this) as the datastore identifier - don't give a hoot if the iSCSI target name changes, the LUN ID changes, etc.

-make VM-to-disk mapping relative, not absolute as it is now. For example, if ESX expects to find a VMDK on, say, /vmfs/volumes/123123123/, but 123123123 isn't there, don't just give up - give me the option of saying "my_datastore_1, which was previously on 123123123, is now on 456456456 - find/replace every occurrence of 123123123 with 456456456."
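The second point above amounts to a find/replace over every VM's config file. A minimal sketch of that idea, using the hypothetical datastore IDs from the example (this is not a VMware tool, just an illustration of remapping /vmfs/volumes paths inside a .vmx file):

```python
import os
import tempfile

def remap_vmx(path, old_id, new_id):
    """Rewrite every /vmfs/volumes/<old_id> reference in one .vmx
    file to /vmfs/volumes/<new_id>. Returns True if anything changed."""
    with open(path) as f:
        text = f.read()
    needle = "/vmfs/volumes/%s" % old_id
    if needle not in text:
        return False
    with open(path, "w") as f:
        f.write(text.replace(needle, "/vmfs/volumes/%s" % new_id))
    return True

# Demo on a throwaway file standing in for a VM's config.
with tempfile.TemporaryDirectory() as d:
    vmx = os.path.join(d, "vm01.vmx")
    with open(vmx, "w") as f:
        f.write('scsi0:0.fileName = "/vmfs/volumes/123123123/vm01.vmdk"\n')
    remap_vmx(vmx, "123123123", "456456456")
    with open(vmx) as f:
        print(f.read().strip())
    # -> scsi0:0.fileName = "/vmfs/volumes/456456456/vm01.vmdk"
```

Run once per .vmx on the host and the "remove/re-add/edit by hand" step for 100 VMs collapses into one loop.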

Excuse me if I'm venting, but I just spent about 15 hours straight manually trying to get ESX to see lost LUNs again, having to resignature/rescan/remove/edit .vmx/re-add manually for over 100 VMs... (and it's not the first time!)

Other than that, I really dig VMware ;)

6 Replies
uslacker99
Expert

Why are your LUN IDs changing? What kind of iSCSI SAN do you have?

dalepa
Enthusiast

I agree, VMFS is VMware's Achilles' heel... Try NetApp NFS instead of VMFS...

dp

Jae_Ellers
Virtuoso

Interesting. I've never had a disk issue that wasn't caused by human error (LUN ID mismatch, zoning, etc.) on the HP EVA 8000s or NetApps.

I did have HP stomp on my LUNs during an EVA firmware update once, which required me to migrate my VMs around, but I've never lost LUNs otherwise.

-=-=-=-=-=-=-=-=-=-=-=-
Check my blog: http://blog.mr-vm.com | http://www.vmprofessional.com
-=-=-=-=-=-=-=-=-=-=-=-
RParker
Immortal

Well, for starters, consider that VMware has probably 100,000 customers, perhaps close to a million, and we ALL use VMFS with very few problems.

Then consider that LUNs have nothing AT ALL to do with VMFS; that is your SAN configuration.

This is like blaming Windows when your CD drive changes to a different drive letter and you find out LATER that you installed Roxio or some other utility that bumped the CD-ROM drive as a result. NOT a Windows problem.

A LUN is nothing more than a pre-assigned disk drive on the network, which is managed completely separately from VMware.

Now think about HOW VMware manages its VMs, its file systems, and its utilities: it can't rely on some simple file system (which it didn't create) to do EVERYTHING it needs. Your complaint simply reflects a poor understanding of how VMware works.

From your rant, I can tell you have virtually ZERO formal training and have done almost NO reading on the use of VMware file systems, Linux, AND SAN management. You can rant all you want, but this is the same as a student yelling at the teacher because they didn't do their homework. If you want to be lazy, that's fine, but VMware is NOT going to do the work for you; they give you the tool, and it's up to YOU to learn how it works. 99.99% of the people on here will NOT have any issue like you experienced, and the other 0.01% may, but even those cases can be attributed to a hardware problem, NOT a VMware issue.

ESX and VMFS are very stable products, and the problems you describe are administrative problems, or your SAN is not configured properly, so don't go blaming VMware for your problems. You need to learn about the product and read/follow the documentation.

urgrue
Contributor

Well, for starters, consider that VMware has probably 100,000 customers, perhaps close to a million, and we ALL use VMFS with very few problems.

I'm glad you feel you can speak for 100,000-1,000,000 customers, you must have lots of friends.

Then consider that LUNs have nothing AT ALL to do with VMFS; that is your SAN configuration.

My LUNs have never been the problem (my inclusion of them in my rant was somewhat inaccurate). When I have lost them (on a couple of occasions), the cause has always been something "forgivable": once a true storage failure, another time human error ("oversight" might be a better term). The real problems have been getting ESX to realize a) that there is a VMFS on a certain LUN and b) that said VMFS is a certain configured datastore that contains certain VMs. In a nutshell, the ways in which VMFS/ESX react to and recover from problems, congestion, heavy load, etc. are the real problem.

ESX and VMFS are very stable products, and the problems you describe are administrative problems, or your SAN is not configured properly, so don't go blaming VMware for your problems. You need to learn about the product and read/follow the documentation.

I think you jump to conclusions. I never explained the specific technical details of my problems, and you're making blind assumptions that they were some kind of trivial misconfiguration issues. I am a VCP and use fully supported solutions every step of the way. I've done nothing in our setup that hasn't been overseen and/or configured by official NetApp and VMware support services. Every problem I've had I've sent off to support, even when I'd already solved it.

Like I said, our SANs, LUNs, etc. are not the problem and are properly configured. A few problems I've encountered:

My LUNs are SnapMirrored to our other NetApp filers. In a storage-system disaster, I cannot just activate the backup and have all be well. ESX can and does successfully connect to the backed-up LUNs over iSCSI, but it won't realize that the LUN contains the same VMs that were on the previous LUN and plod along; it keeps the VMs as "inaccessible" because the iSCSI target name is different. I have to go to advanced settings and enable resignature just to get the VMFS to appear. Then I have to manually "remove from inventory" on all my VMs and add them back again. THEN I have to manually edit the .vmx files of every VM that has VMDKs on different LUNs to replace the now-outdated paths with the new ones.

This huge waste of time is completely, entirely VMware's fault: it should realize this LUN is the same one that was previously elsewhere, or at least allow me the possibility to say so somehow. For instance, at the moment when it says I can't rename "name (1)" to "name" because "name" already exists, it could let me say "yes, because it's the same VMFS, thanks" (notice how the one that "already exists" doesn't exist at the moment?). Any good RAID array solves this by implicitly trusting the metadata on the disk; it doesn't matter which shelf/bay you physically shove the disk into. The same logic could and should apply to a VMFS partition. And to VMs, for that matter.
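The RAID analogy can be sketched in a few lines. Assuming a made-up identity header (this is NOT the real VMFS on-disk format, just an illustration), a host that trusts only the on-LUN UUID resolves the same datastore name no matter which target or LUN ID delivered the bytes:

```python
import uuid

MAGIC = b"VOLMETA1"  # made-up magic number; not the real VMFS header

def write_header(vol_id):
    """Stamp a volume with its identity at creation time."""
    return MAGIC + vol_id.bytes

def read_identity(header):
    """Return the volume UUID, or None if the header isn't ours.
    Which iSCSI target or LUN ID delivered these bytes is irrelevant."""
    if not header.startswith(MAGIC):
        return None
    return uuid.UUID(bytes=header[len(MAGIC):len(MAGIC) + 16])

# The host keeps a map of volume UUID -> datastore name, nothing else.
vol = uuid.uuid4()
datastores = {vol: "my_datastore_1"}

# The same header read back from a "different" path still resolves.
header = write_header(vol)
print(datastores[read_identity(header)])  # -> my_datastore_1
```

With identity anchored to the volume itself, a SnapMirrored copy presented under a new target name would resolve to the same datastore automatically instead of forcing a resignature.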

And it has happened not once, not twice, but three times that under heavy load my VMFS somehow gets corrupted and I have to resort to the fdisk trick mentioned quite a few times in these forums to get it working again. Nothing is wrong with the SAN or IP networks, nor with the storage devices - none is even under any load level worth mentioning, and we've had tens of terabytes of data running around on them without a hitch for years - but ESX/iSCSI/VMFS has been nothing but trouble and downtime since day one.

ESX's software iSCSI is absolutely unusable in anything more than extremely lightweight and/or small environments - even VMware representatives have admitted this to me - yet the manuals do not in any way stress this fact. Basically, under any vaguely heavy load, and/or in situations where you have lots of VMs on a single host, software iSCSI will crawl to a halt. I've literally watched simple mkfs operations take close to an HOUR when they should (and normally do) take a minute or two, while other systems are accessing the storage without any trouble and the storage system is yawning because it has so little work to do. Remove ten VMs from the ESX host (all of which were idle the whole time) and mkfs works just fine again.

Thankfully, good old SAN has been working far more reliably (though I haven't yet used it under very heavy loads), and I'm currently testing NFS, which is looking good so far.

wila
Immortal

RParker can be a little confrontational in his replies sometimes; don't take it personally.

He's a good guy really ;)

When I read your posts, they make a lot of sense to me as to how VMware can fine-tune their product and make it even nicer. So when you're done venting, maybe you can take the best parts and post them here:

http://communities.vmware.com/community/vmtn/suggest/product

--

Wil

| Author of Vimalin. The virtual machine Backup app for VMware Fusion, VMware Workstation and Player |
| More info at vimalin.com | Twitter @wilva