VMware

jamieorth

jamieorth's Profile

  • Name: Jamie Orth 
  • Email: jamie.orth@pefcu.com
  • Member Since: Apr 28, 2005
  • Last Logged In: Nov 20, 2009 1:03 PM
  • Status Level: Expert (1,485 points)
  • Location: Bartow, FL
  • Occupation: Consultant / Sr. Systems Engineer Publix Employee's Federal Credit Union
  • Homepage: http://communities.vmware.com/blogs/jamieorth
  • Signature: Regards... Jamie If you found this information useful, please consider awarding points for "Correct" or "Helpful". Remember, if it's not one thing, it's your mother...

jamieorth's Latest Content


Wow, I haven't written anything in several months!!! To say we have been busy at work over the past few months is an understatement. Eight months into the new job, I think I have settled in by now, although there are still a few applications / systems I haven't gotten to touch just yet. One of them has been our SAN environment, which is an EMC DMX4-950. We are replicating to a DR site using SRDF/A.

After solving our backup issues I went straight into SRM. To test SRM we set up a completely separate lab environment at our main datacenter and at our DR facility as well. The lab has really been nice to have - it gives us the flexibility to test new installs such as SRM. We are using old hardware, so I am hoping we don't have any failures because there is no maintenance on any of this equipment, but with some spare servers lying around we should be ok. I have requested more RAM for the hosts and hopefully that will get approved soon.

SRM was pretty straightforward - following the guides from VMware and EMC, the install was easy. We created a few test VMs and placed them on a LUN that was being replicated just for the lab environment. Once we had everything configured we did some failover tests and then recreated the setup on the DR side to fail back. Since we are a small environment this was not too big of an issue for us, but I wouldn't want to do this with hundreds of VMs. I hope that future versions of SRM will offer automated failback as an option.

So once we knew it would work we set out to re-engineer the SAN LUN layout so we could maximize our flexibility in using SRM. Well, this is when we found out that a couple of factors were going to prevent us from doing what we wanted. One of the issues was the layout of our SAN. There was a lot of RAID6 storage, some RAID1, and the BCVs. The layout of the RAID6 ended up using too much cache and many of the LUNs would never replicate. Anything on the RAID1 was fine. So after a lot of calls with EMC we talked with Duane Olson from EMC. He is an absolute SRDF guru. Long story short, we are adding some spindles, doing a complete config change, and moving a lot of data around in the form of VMs (thank you, Storage VMotion). When all is said and done we should be able to replicate all of our critical VMs, and most of the rest if we need them.

We are also toying around with VMware View. Our plan is to use it first in our DR testing to bring up about 30 workstations if needed. Our prelim testing is going well. I see that SRM 4.0 was just released, but for now EMC has not released an SRA for our DMX. Like I said, it has been busy, but as you can see VMware is a big part of that!!

2 Comments Permalink


You know, I have used a lot of different software products over my 20-year career in IT. Some good, some not so good, and I am sure that you all have been there. I have always found that what makes a good product is the support and staff behind it. I have used VMware since the early 2.5 days and have never looked back. If you have read all of my blog posts then you know that I have been at 4 different financial institutions in the past year after being with the first for over 18 years. One common theme is that they were all VMware shops (with the exception of Colonial Bank - I was trying to get them there). Also, I have been a fan of Vizioncore for some time. I have had success with vRanger, was an early beta tester of vReplicator, and my latest deployment from them was vFoglight.

Now, when I arrived at Publix Credit Union they were having a horrible time with vRanger - close to a 50% failure rate. My past experience tells me that if the infrastructure is sound then the product just works. So I checked everything, and then checked it again. We changed things to isolate different parts of the infrastructure. Some of the changes made the issue less prevalent, but I was not going to be satisfied until we started seeing 100% success each and every night.

Our System Administrator already had tickets open with Vizioncore, VMware, and DataDomain and was not getting anywhere, and he was not taking an active approach to solving this problem. I dug deep and saw NFS errors and warnings in the vmkernel logs. Finally a breakthrough - perhaps. Our Cisco Engineer noticed something that he hadn't seen before. He could see a Pause Request coming from the interface connected to the DataDomain appliance. Were we perhaps sending too much data at one time? Perhaps, but we had seen failures even when only one backup job was running. I wondered if it was just that the NFS mounts were timing out when the Pause Request occurred. It sounded logical, and it matched up with the warnings in the vmkernel logs.
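For anyone doing the same kind of log digging, here is a minimal sketch of how lines that mention NFS together with a warning or error keyword could be pulled out of a log that has been copied off the host as plain text. The function name, the keyword patterns, and the sample lines are all made up for illustration; they are not the real vmkernel log format.

```python
import re

# Hypothetical keyword patterns; adjust them to whatever the real log lines contain.
NFS_PATTERN = re.compile(r"nfs", re.IGNORECASE)
SEVERITY_PATTERN = re.compile(r"warning|error", re.IGNORECASE)


def nfs_problem_lines(lines):
    """Return the lines that mention NFS and also contain a warning/error keyword."""
    return [
        line.rstrip("\n")
        for line in lines
        if NFS_PATTERN.search(line) and SEVERITY_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Placeholder lines only; these do not imitate real log output.
    sample = [
        "placeholder: NFS warning text\n",
        "placeholder: unrelated line\n",
        "placeholder: nfs ok\n",
    ]
    for hit in nfs_problem_lines(sample):
        print(hit)
```

Comparing the times of the matching lines against the times of the failed backup jobs is then a manual step, or a small extension once the real timestamp format is known.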

At the same time I have been anticipating the release of vRanger 4.0 DPP - totally new architecture and some features that may help us out, like the ability to restart a failed job. So I was reading Jason Mattox's blog about the upcoming features and I made a comment to him. I also had typed up from start to finish everything we had done and seen, along with what I thought could be the issue. I posted this out to several places, including Vizioncore's forums. Here is the link to that - http://supportforums.vizioncore.com/forums/thread/12245.aspx

Jason responded by having us try some NFS settings in the Advanced Settings of the ESX host. At this point we were ready to try anything. Well, after making those changes we have had 4 nights in a row of 100% successful backups. Now, I am not holding my breath; I would like to see about a month of this before I call it the fix, but it sure looks good for now. Also, we are not seeing the warnings in the logs any longer. That has to be a good sign, so could this be it?? Stay tuned.
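Since the goal is a long run of clean nights before calling it fixed, here is a small sketch of how the nightly success rate and the current streak of 100% nights could be tracked. The function names and the job counts in the example are assumptions made up for illustration, not numbers from our environment.

```python
def success_rate(succeeded, total):
    """Fraction of jobs that succeeded on one night (0.0 if no jobs ran)."""
    return succeeded / total if total else 0.0


def clean_streak(nights):
    """Count the most recent consecutive nights on which every job succeeded.

    `nights` is a list of (succeeded, total) tuples, oldest night first.
    """
    streak = 0
    for succeeded, total in reversed(nights):
        if total > 0 and succeeded == total:
            streak += 1
        else:
            break
    return streak


if __name__ == "__main__":
    # Made-up counts: one bad night followed by four clean ones.
    nights = [(5, 10), (10, 10), (10, 10), (10, 10), (10, 10)]
    print("current clean streak:", clean_streak(nights))
    print("first night success rate:", success_rate(*nights[0]))
```

A single failed job resets the streak to zero, which matches the "100% each and every night" bar rather than an average over the month.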

0 Comments Permalink


Ok, so our backup issues still exist. I am now convinced that this is in part due to the known memory leak in ESX 3.5 Update 2 and beyond. There are several threads in the forums with users having different and random issues - some as bad as the host rebooting. One thread in particular that I read had some of the same things going on that my systems had - http://communities.vmware.com/thread/187927?start=0&tstart=0

I have done some of the things mentioned in that thread and I still get failures, although the rate is now even lower, which is better but not solved. I have also migrated the physical connections back to our Cisco 6509s. I don't think the connection had anything to do with the problem. I now need to learn how to monitor hostd (and memory in general) in real time, as the log files from the VMware side of things don't show anything going on now. Also, last Friday we had two different hosts hit the issue 1 second apart. I know this is probably a random occurrence, but it is odd that they would be so close together. Could this issue be with eth3 of the DataDomain appliance? Another company had an issue with eth3 in particular on one of their DDR systems.... Next week I plan on changing the NFS traffic to a different interface on the DDR. If that doesn't help, my last shot is to rebuild one host at a time, not add it to the cluster, and monitor backup progress until this is resolved.

Speaking of the cluster, is anyone else having HA messages (or errors) about the PropertyProvider failing???
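On the two hosts failing one second apart mentioned above: if failure times are collected per host, a quick check for failures that land close together on different hosts could look like the sketch below. The host names, timestamps, and the 5-second window are assumptions made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import combinations


def close_failures(events, window=timedelta(seconds=5)):
    """Return pairs of failures on different hosts that fall within `window`.

    `events` is a list of (host_name, datetime) tuples.
    Each result is (host_a, host_b, time_gap).
    """
    pairs = []
    for (host_a, t_a), (host_b, t_b) in combinations(events, 2):
        gap = abs(t_a - t_b)
        if host_a != host_b and gap <= window:
            pairs.append((host_a, host_b, gap))
    return pairs


if __name__ == "__main__":
    # Hypothetical failure times; not taken from real logs.
    events = [
        ("host-a", datetime(2009, 1, 2, 1, 0, 0)),
        ("host-b", datetime(2009, 1, 2, 1, 0, 1)),
        ("host-a", datetime(2009, 1, 2, 3, 0, 0)),
    ]
    for host_a, host_b, gap in close_failures(events):
        print(host_a, host_b, gap)
```

If near-simultaneous failures across hosts keep showing up, that would point toward something shared, such as the appliance interface, rather than a problem on any one host.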

3 Comments Permalink
