Hi,
I'm trying to get some feedback from others regarding the stability of the MSA1500cs controllers with MSA30's and MSA20's. We have quite a few of these and have had quite a few problems, even though we have the latest hardware revision (A) and the latest firmware (5.20b). We're in the process of moving to a fully-redundant configuration with the 7.x firmware, but with 5.20b we still occasionally have the MSA1500cs fiber channel connection stop responding. The controller responds on the console port and we can interact with it, but ESX shows the LUNs as unavailable. "show this_controller" reports that the controller is "not failed" while the problem is occurring, and there are no events in the event log other than the power-up entries. A power off/on is required to bring the units back online after this happens... not exactly a good situation. The ESX hosts re-connect to the LUNs after the controller finishes booting, and all is well afterwards.
Maybe we have a bad fiber channel switch? It's one of the typical HP StorageWorks 4/8 fiber channel switches (SilkWorm 200E?). The host connection profile on the MSA has always been set to Linux. The last controller failure happened after 23 days of almost continuous use.
Anybody else have continuous problems with these? Anybody have "great" success with them? I'd like to hear from everyone possible. I'm talking to HP about the problem, so I'd like to give them some feedback from others.
Thanks!!
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Just an update on this case... one of our MSA1500cs units hit the same issue; this one has 2 MSA20's connected and only a single VM on it, running an FTP server for backups. What's odd about this is that we thought the issue would be related to conflicts among our 4 hosts, but only a single host runs this single VM, so you wouldn't expect too many conflicts. At the time of the failure, the backup system was pruning the backup set to remove old files, not actually performing a backup, so the load was lots of small I/Os driven by FTP commands.
During the problem, the fiber channel switch reported "Loss of sync". Resetting the fiber channel switch port did no good, and neither did forcing the port to 2Gbps (it was set to auto). Only resetting the MSA1500cs controller fixed the problem (for now).
VMware is still looking at the issue as is HP.
Anybody have news about the MSA2000? One of our vendors indicated they had only just sold their first one the other day because of the lack of availability from HP. HP seems to love to release products without actually having any available for months. Not a great way to build customer confidence.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Someone on HP's forums recommended upgrading to firmware 7.00 (the active/active firmware), so we've tried this on an internal MSA1500cs with only a single VM on it, even though this MSA1500cs only has a single controller in it.
The upgrade worked; however, it caused volume resignaturing problems (a known issue with this upgrade). I'm not sure how best to deal with the resignaturing, since it's a real pain that all VMs on a resignatured volume need to be updated to point to the new LUN UUID. Since this is a single-VM scenario, it's quite simple to do here.
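For anyone following along, my understanding is that on ESX 3.x resignaturing is controlled through the LVM advanced settings from the service console; a rough sketch (the vmhba name is just an example, substitute your own FC HBA):

    esxcfg-advcfg -s 1 /LVM/EnableResignature     # allow the VMkernel to resignature mismatched VMFS volumes
    esxcfg-rescan vmhba1                          # rescan each FC HBA; the volume comes back with a "snap-..." label
    esxcfg-advcfg -s 0 /LVM/EnableResignature     # turn it back off so later rescans don't resignature again

After that, the VMs on the resignatured volume still have to be re-registered and repointed, which is the painful part I mentioned.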
We're going to throw everything we have at the new 7.00 install and see if we can break it. It's got 2 fully-populated MSA20's attached. If we can't break it, and this turns out to be the answer, maybe we can simply Storage vMotion away from the production 5.20 environment to a new 7.00 environment without having to deal with resignaturing (since all hosts are now 3.5 U1).
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Ok, so far so good. However, the problem doesn't appear to be related to firmware, but rather to the fiber channel switch port speed and type. All ports were set to auto-negotiate, which worked fine as far as we could tell, negotiating at 2Gbps (they are 4Gbps SFPs). These ports also auto-negotiate the port type... whether it is an F-Port, E-Port, or U-Port. We forced all ports on the switch, for all hosts and MSA1500cs', to 2Gbps and F-Port only. So far we have not seen one error in the /var/log/vmkwarning file on any of the ESX hosts, nor any errors on the fiber channel switch (previously we would see Loss of Sync errors).
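For anyone wanting to do the same on one of these Brocade-based HP 4/8 switches, the CLI equivalent should be roughly the following (this is a sketch from memory; port 0 is just an example, and the exact syntax can vary between Fabric OS versions):

    portcfgspeed 0 2      # lock the port to 2Gbps instead of auto-negotiate
    portcfggport 0 1      # lock it as a point-to-point G_Port (no loop)
    portcfgeport 0 0      # disable E_Port capability, leaving F-Port as the only role it can come up in
    portcfgshow           # verify the per-port settings
    porterrshow           # watch the error counters afterwards (this is where Loss of Sync shows up)

Repeat for each host port and MSA port on the switch.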
One MSA1500cs has been up for almost 8 days (5.20 firmware) and the other for about 3.5 days (7.00 firmware). The latter is the one being beaten up with continuous I/O from 4 VMs on separate hosts running Iometer against the same LUN at about 2,200 I/Os per second, and it has been doing this for over 3 days.
I hope this is the true solution to this problem. I'll post back with the news after a few more days of testing.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Eric,
Truly a great bit of work you're doing testing this and trying to get it to work properly. There definitely are people that have had excellent results with the MSA1x00, but also people that have had dismal failures. Any work that sheds light on why this is so is fantastic, in my opinion.
Thanks for posting your results back to the forum so far.
No problem at all. I hope it helps some desperate person out there (like us) fighting with this problem.
An update... we have had NO problems whatsoever since the port speed lock (to 2Gbps) and port type lock (to F-Port) on all host ports and MSA1500cs ports on the fiber channel switch. We still have auto-negotiate turned on on the hosts' HBAs, and the MSA1500cs, I believe, only supports auto-negotiate.
One controller has been up for 14 days and the other for 9.5 days since we made this change with no errors on the fiber channel switch and no errors in /var/log/vmkwarning on all ESX hosts.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
What HP told us is that on ESX 3.5 they support ONLY the active/active firmwares, on all storage types!
Michael
One more quick update.
We have still had no errors, and the controller uptimes of the 2 MSA1500cs' are 28.5 days and 24 days. So, still humming along even with a LOT of disk I/O. One drive failure occurred, but fail-over to a hot spare worked perfectly.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
I too have had major performance issues on my 1500cs. It's a new purchase (about 3 months old). Last I checked, the firmware was 5.2 (not sure about the "b" or not). It's got a single MSA20 disk shelf with 10 750GB SATA drives (again, new purchases).
The major problem I have appears to be some sort of contention. With 3-4 VMs everything is happy; get to about 15 and systems can take upwards of 15 minutes to boot. Restarting the machines often fixes this though, which is the weirdest part.
I made the same changes: forced the F-Port config, set the fixed path, and forced the data rate to 2Gbps. I'm watching the performance now to see if it helps.
I could also use some advice on configuration. These are all low-importance servers (test/dev) with mostly low I/O. There are 5 ESX 3.0 hosts, each seeing the LUNs. I currently have the LUNs defined as 2TB RAID 5 ADG, and I have 2 of them presented. I also have a 500GB RAID 1+0 LUN that I'm testing with as well to see if it helps the performance.
Hi ZKrieger,
I suspect the restarting of the VMs is faster due to caching, but that's just a guess.
If you are using ADG (Advanced Data Guarding, which is RAID 6), then I can imagine the contention of 15 VMs causing performance problems, simply because the MSA1500cs doesn't have super-speedy CPUs in it, so the processing power isn't really adequate to deal with 10 disks in RAID 6.
We've always used RAID 1+0 across many spindles since it is much less CPU intensive. Performance has always been very good in this configuration. Of course, it is more expensive from a "disk investment" perspective, but much better suited for any production system than RAID 5 or 6.
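As a rough back-of-envelope illustration (the ~80 I/Os per second per SATA spindle and the textbook write penalties are assumptions, not measurements from an MSA):

    10 spindles x ~80 IOPS           = ~800 IOPS raw
    RAID 6 (ADG), write penalty 6:   800 / 6 = ~130 small-write IOPS
    RAID 1+0, write penalty 2:       800 / 2 = ~400 small-write IOPS

and on top of that, the parity math for RAID 6 runs on the MSA1500cs controller itself, which is exactly where it is weakest.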
A quick note regarding disk speed... the first LUNs created on an array are the fastest because they are created on the outer edge of the disks. Not sure if this helps in your situation, but thought I'd mention it.
We never saw any performance degradation before forcing the F-Port and data rate, except relatively soon before the MSA1500cs stopped responding. Thankfully, we have yet to have a problem since forcing the F-Port configuration and data rate.
I was doing some tests the other day with one of our MSA20's that has 12 250GB disks in it, connected to a single-controller MSA1500cs running 7.00. The disk configuration was 10 disks in RAID 10 and 2 hot-spares. 5 LUNs were configured evenly (about 250GB per LUN). Storage vMotioning a number of VMs from our SCSI storage (also on an MSA1500cs) was writing data onto the MSA20 at 60MBytes/sec according to a "show perf" on the MSA1500cs! Not bad! It did this almost continuously for 30 minutes, so caching was not involved. So, the MSA20 can definitely perform well under the right circumstances.
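To put a rough number on the "caching was not involved" point: 60 MBytes/sec sustained for 30 minutes is about 60 x 1800 = ~108,000 MBytes, i.e. over 100GB written, which is far more than the few hundred MB of cache in an MSA1500cs controller could absorb, so that really was disk throughput.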
I'm about ready to perform some testing on a dual-controller MSA1500cs with 8 fully-populated MSA20's connected (250GB disks). I'll report on what type of performance I get. Testing will be done with ESX 3.5 Update 1 on 4 dual-proc quad-core Opteron machines.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Thanks for the feedback, Eric. I am attempting to migrate to the 1+0 LUN that I created as we speak (unfortunately, I'm constantly getting "vmdk is corrupted" errors though). I think that I can probably work this into a more usable state, but I think I've got myself too performance-bound at the moment. Hopefully I can make time this coming weekend to take the cluster down.
I have heard mention of the SAN guide that contains specific settings for storage arrays. Beyond the F-Port and fixed port speeds, do you know if there are any other settings specific to the MSA1500cs?
I have not seen any documents referring to anything more specific than using the latest firmware (5.20 or 7.00) and some minimum firmware on the fiber channel switches (which was relatively old compared to the most recent firmware available).
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Hi,
I have also been getting this issue at 2 deployments with an MSA1500. Uptime seems to vary, but it has been up to about 90 days.
I have taken your advice and upgraded the firmware from v5.0.1b to v6.1.0c, fixed the port speeds (switch only) at 2Gbps for the MSA link and 4Gbps for the HBAs, and upgraded the HBA firmware to the latest (Emulex cards). (The MSA and HBAs are still left on Auto, and the HBA boot BIOS is now disabled, something I read in another post.) Also ESX 3.5 with all updates on. Enough changes to do some good, I hope!!
I also had the setup configured badly:
Dual controller MSA with v7 A/A firmware
1 HBA per host
2 FC switches.
4 hosts and 1 MSA per FC switch, bad!
As the MSA is not true active/active, I think that would have induced LUN thrashing??? My vmkwarning logs are full of resync problems anyhow.
I'm ditching the 2nd FC switch until I can install dual HBAs into the hosts. I will then configure a preferred path for LUN 0 to MSA controller 1 and LUN 1 to MSA controller 2. My question now: once this is done, would you use MRU or Fixed within ESX? MRU seems to be what the HP and VMware manuals suggest for an A/A MSA.
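For what it's worth, the way I've been checking what each host currently uses is from the service console (just a quick sketch; the LUN names are whatever your own hosts show):

    esxcfg-mpath -l     # lists each LUN, its paths, and the current policy (Fixed/MRU)

The policy and preferred path themselves I change per LUN from the VI Client's Manage Paths dialog rather than from the command line.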
Has yours still been stable since your last post? And how often did it fail before?
Thanks,
Rich.
Just wanted to add a note: since converting the MSA to RAID 1+0 from RAID 5 ADG, the performance is a lot more acceptable.
I currently have 14 VMs working on this storage, including a complete MOSS install with 2 front-end systems and a SQL backend, and even on that resource pig, the performance is good.
MSA1500cs, single controller with 512MB cache, 10x 750GB SATA drives in RAID 0+1, five DL380 ESX 3.0 hosts.
Hi Rich,
We have had no failures since these changes, in a very heavily used environment. Uptimes on the 2 MSA1500cs' that I've been gauging this test against are 49 days 18 hours and 45 days 3 hours.
One has 4 MSA30's connected (the 49 days one) and the other has 2 MSA20's connected (the 45 days one).
We recently added another 4 hosts in another cluster and gave them access to these 2 MSA1500cs' for moving data between clusters, and there have been no problems at all with 8 hosts hitting everything.
Neither MSA1500cs has an active/active configuration yet. We will be Storage VMotioning all VMs off of these onto other SANs and will be upgrading to A/A afterwards.
Previously, we had our MSA1500cs units fail anywhere between 3 and 18 days of uptime, depending on how heavily they were utilized; not a good track record.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
That's great! I'm not too familiar with the performance of many of the other very high-end SAN solutions, including clustered storage solutions like LeftHand Networks or SAN aggregators like DataCore, but RAID 5 and 6 are definitely computationally "expensive", and the MSA1500cs just doesn't have the processing capacity to do well there relative to the performance of RAID 1+0.
I'd be interested in hearing from others who have used the MSA2000 to see how speedy it is compared to the MSA1500cs.
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
I can give you an example of "good" to "great".
The physical servers that I VMed were typically Compaq 3000-7000 series and older G1-G3 DL360s and DL380s, the best of those being the G3 380s, which were quad 3GHz machines with 4GB RAM.
For our first VM test we had an old 1TB EVA 3000 using 16 146GB 15K RPM FC disks. The 5 new hosts are G5 380s with 16GB RAM, as I mentioned. Moving a server to a VM actually increased most of its performance numbers. The 3000-7000 servers all showed amazing improvement, as is to be expected going from sub-GHz processors.
The original use of the MSA1500 with ADG was giving me worse performance than even the old Pentium III-based systems: boot times of up to 15 minutes once more than 5 or so VMs were on the MSA. Now that it's back to 0+1, the MSA-based machines are a little laggy in the interface, but still equal to G2-G3 server performance overall. Not bad considering I'm getting full VMotion and DR on those same servers.
If anyone asks: keep the SQL servers physical, and virtualise the IIS servers for MOSS. The SQL side is simply hit too regularly; there is constant disk activity at all times, even when the system is unused. For my SATA disks, that is unfortunately causing some contention. Thankfully, this is a temporary solution until our new SAN arrives in a few weeks.
Sounds promising then.
I'm now running all hosts and both SPs from the MSA on the 1 upgraded FC switch. Port speeds are locked at 4Gbps for the hosts and 2Gbps for the MSA. Looking at the switch stats, there has not been any loss of sync or any errors at all. Before, on the old firmware and auto port speeds, there were hundreds of errors per port and sync issues. I've checked a few hosts as well and vmkwarning doesn't show any SCSI or sync problems either now.
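For reference, the checks on the hosts are nothing fancier than grepping the log mentioned earlier in the thread (the path is the ESX 3.x default; adjust the pattern to whatever your warnings contain):

    grep -ic scsi /var/log/vmkwarning     # count SCSI-related warnings
    tail -f /var/log/vmkwarning           # watch for new ones while testing

plus the error and sync counters on the switch itself.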
(My boss has ordered HP in anyhow (that's trust and respect for you!) just to check over our setup.)
We have 7 hosts and 55 VMs (1 big SQL, 2 small SQL, 10 Citrix servers, etc.) across 2 RAID 5 LUNs on our MSA. I would say that on the whole, performance (disk access) seems fine. I'm going to try Iometer and see what it comes back with.
Cheers
Well, so much for stability... 64 days of uptime and one of our MSA1500cs units locked up. The other one is going strong, but the lock-up was typical... the unit came to a crawl; the CLI responded, but there was virtually no activity. However, "show tasks" indicated that there were connections from 5 hosts, yet no running tasks. I have a dump of "show tech_support" and a few other items, but I'm not sure if anything will show what's really going on.
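In case it helps anyone else logging a case, what I grab from the console port before power-cycling is simply the output of the CLI commands mentioned in this thread:

    show this_controller
    show tasks
    show perf
    show tech_support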
I started a forum here:
I'm trying to get as many people to post complaints and questions about the MSA1500cs as possible to get HP's attention. Would everyone mind signing up and posting their experiences?
Thanks!
Eric K. Miller, Genesis Hosting Solutions, LLC
- Lease part of our ESX cluster!
Eric,
I wanted to thank you for your diligence in sharing your experiences and updates. I'm also experiencing the same issue. The odd part is that this MSA1500 has been running fine for well over a year that I'm aware of, and likely much longer, with the problems starting about 2 months ago. You're saving me time in my attempts to track down the root of this issue.
Cheers.