
When using Update Manager to patch the vSphere hosts in my cluster I found that I had two hosts that would not update. The process would fail immediately with the following message:

 

Remediate Entry - There are errors during the remediation operation.

 

 

After some Google and Community searches I didn't find anything that would help. Being stubborn and not wanting to place a service call, I decided to figure this one out on my own. I looked at the logs, but they gave the same message. Then I noticed that when I clicked on the Hardware Status tab for the host(s) in question I would get a pop-up box with the message "Hardware monitoring service not responding, the host is not powered on". This was very strange, because my hosts were indeed powered on. There were some lingering posts about this second message. After working on my hosts I was able to resolve both issues. Since this was the first time this had happened in my environment I had to document the procedure to fix it, and here it is for you if you have the same issue:

 

 

 

 

 

 

Wow, I haven't written anything in several months!!! To say we have been busy at work over the past few months is an understatement. Eight months into the new job and I think I have settled in by now, although there are still a few applications / systems I haven't got to touch just yet. One of them has been our SAN environment, which is an EMC DMX4-950. We are replicating to a DR site using SRDF/A.

 

 

After solving our backup issues I went straight into SRM. To test SRM we set up a completely separate lab environment at our main datacenter and our DR facility as well. The lab has really been nice to have - it is allowing us the proper flexibility to test new installs such as SRM. We are using old hardware so I am hoping we don't have any failures because there is no maintenance on any of this equipment, but with some spare servers lying around we should be ok. I have requested to up the RAM in the hosts and hopefully that will get approved here soon.

 

 

SRM was pretty straightforward - following the guides from VMware and EMC alike the install was easy. We created a few test VM's and placed them on a LUN that was being replicated just for the lab environment. Once we had everything configured we did some failover tests and then recreated the setup on the DR side to fail back. Since we are a small environment this was not too big of an issue for us but I wouldn't want to do this with 100's of VM's. I hope that in future versions of SRM automated failback will be an option.

 

 

So once we knew it would work we set out to re-engineer the SAN LUN layout so we could maximize our flexibility in using SRM. Well, this is when we found out that a couple of factors were going to prohibit us from doing what we wanted. One of the issues was the layout of our SAN. There was a lot of RAID6 storage, some RAID1, and the BCV's. The layout of the RAID6 ended up using too much cache and many of the LUNs would never replicate. Anything on the RAID1 was fine. So after a lot of calls with EMC we talked with Duane Olson from EMC. He is an absolute SRDF guru. So long story short we are adding some spindles, doing a complete config change, and moving a lot of data in the form of VM's around (thank you storage vmotion). After all is said and done we should be able to replicate all of our critical VM's and most of the rest if we need them.

 

 

We are also toying around with VIEW.  Our plan is to use it first in our DR testing to bring up about 30 workstations if needed.  Our prelim testing is going well.  I see that SRM 4.0 was just released but for now EMC has not released an SRA for our DMX.  Like I said it has been busy but as you can see vmware is a big part of that!!

 

 

jamieorth Expert

Could this be it??

Posted by jamieorth May 9, 2009

 

You know, I have used a lot of different software products over my 20 year career in IT.  Some good, some not so good, and I am sure that you all have been there.   I have always found that what makes a good product is the support and staff that you have behind it.  I have used VMware since the early 2.5 days and have never looked back.  If you have read all of my blog posts then you know that I have been at 4 different financial institutions in the past year after being with the first for over 18 years.  One common theme they have all had was that they were a VMware shop (with the exception of Colonial Bank - I was trying to get them there).  Also, I have been a fan of Vizioncore for some time.  I have had success with vRanger, was an early beta tester of vReplicator, and my latest deployment from them was vFoglight.

 

 

Now, when I arrived at Publix Credit Union they were having a horrible time with vRanger - close to a 50% failure rate. My past experience tells me that if the infrastructure is sound then the product just works. So I checked everything, and then checked it again. We changed things to isolate different parts of the infrastructure. Some of the changes made the issue less prevalent; however, I was not going to be satisfied until we started seeing 100% success each and every night.

 

 

Our System Administrator already had tickets open with Vizioncore, VMware, and DataDomain and was not getting anywhere, but he was not taking an active approach to solving this problem. I dug deep and saw NFS errors and warnings in the vmkernel logs. Finally a breakthrough - perhaps. Our Cisco Engineer noticed something that he hadn't seen before. He could see a Pause Request coming from the interface connected to the DataDomain appliance. Were we perhaps sending too much data at one time? Perhaps, but we had seen failures even if only one backup job was running. I wondered if it was just that the NFS mounts were timing out when the Pause Request occurred. It sounded logical, and it matched up with the warnings in the vmkernel logs.

 

 

At the same time I have been anticipating the release of vRanger 4.0 DPP - totally new architecture and some features that may help us out, like the ability to restart a failed job.  So I was reading Jason Mattox's blog about the upcoming features and I made a comment to him.  I also had typed up from start to finish everything we had done and seen, along with what I thought could be the issue.  I posted this out to several places, including Vizioncore's forums.  Here is the link to that - http://supportforums.vizioncore.com/forums/thread/12245.aspx

 

 

Jason responded by having us try some NFS settings in the Advanced Settings of the ESX host.  At this point we were ready to try anything.  Well after making those changes we have had 4 nights in a row of 100% successful backups.  Now, I am not holding my breath, I would like to see about a month before I call this the fix, but it sure looks good for now.  Also, we are not seeing the warnings in the logs any longer.  That has to be a good sign, so could this be it??  Stay tuned. 

 

 


Close....but no cigar......

Posted by jamieorth Apr 26, 2009

 

Ok, so our backup issues still exist. I am now convinced that this is in part due to the known memory leak in 3.5 Update 2 and beyond. There are several threads in the forums with users having different and random issues - some as bad as the host rebooting. One thread in particular that I read had some of the same things going on that my systems had - http://communities.vmware.com/thread/187927?start=0&tstart=0

 

 

I have done some of the things mentioned in that thread and I still get failures, but the rate is now even lower, which again is better but not solved. I have also migrated the physical connections back to our Cisco 6509's. I don't think the connection had anything to do with the problem. I now need to learn how to monitor hostd (and memory in general) in real time, as the log files from the VMware side of things don't show anything going on now. Also, last Friday we had two different hosts have the issue 1 second apart. I know this is probably a random occurrence, but it is odd that they would be so close together. Could this issue be with eth3 of the DataDomain appliance? Another company had an issue with eth3 in particular on one of their DDR systems.... Next week I plan on changing the NFS traffic to a different interface on the DDR. If that doesn't help, my last shot is to rebuild one host at a time, not add it to the cluster, and monitor backup progress until this is resolved.

 

 

Speaking of the cluster, is anyone else having HA messages (or errors) about the PropertyProvider failing???

 

 

Well, as you may have read from my last post we have been having random failures in our VMware backups. Our setup is this -

 

  1. Vizioncore vRanger (latest build) currently configured to back up each VM to an NFS mount point.

  2. The NFS mount point is a DataDomain DD530 appliance. The NFS traffic has a dedicated port on the appliance.

  3. The NFS traffic from the ESX host is dedicated on a vSwitch with a GB Nic going to an independent GB switch. There is no other traffic on this switch. This was not the original setup but we have isolated everything we could while we were having the problems.

 

So, the problem was this - during a backup of a VM the NFS mount point would drop, which would kill the backup. If you use vRanger you know that in the current version there is no way to restart a failed backup, so it was a matter of monitoring the jobs and then restarting them manually. Unfortunately this was not being done proactively before my arrival. Logs were being checked the day after but that was about it. Calls to VMware and DataDomain were not making a lot of progress. The original network config had the NFS traffic, Console traffic, and vmotion traffic all on the same vSwitch but with 3 vmnics assigned, each dedicated to a particular port group for the type of traffic. These pNICs were then going to a Cisco 6509. Each type of traffic had its own VLAN.

 

It was a very odd problem because there were no obvious patterns, and a failed job might run successfully the very next time. Timeframes did not matter. It happened across all hosts, but if one failed, the others continued, which pointed us away from a failure at the DD530. The logs from the DD530 indicated that the host was dropping the connection. VMware pointed us to network problems. This is when we changed the vSwitch config and isolated the traffic to a separate switch. The failures continued, although not as often it seemed. In fact the first night after this change we had 100% success. We thought the problem was solved, until the next night we had two failures.

 

 

So I started looking at every log at the ESX hosts. The vmkernel log had some interesting items - we could see when the NFS mount would drop by a warning message, like the one below.

 

 

Apr 1 18:14:10 lkldesx2svr vmkernel: 3:23:06:21.862 cpu3:1032)WARNING: NFS: 257: Mount: (Backup_ESX_DDR) Server (10.1.42.10) 10.1.42.10 Volume: (/backup/lakeland/vrangerpro/nfs) not responding

 

 

Apr 1 18:14:19 lkldesx2svr vmkernel: 3:23:06:30.368 cpu3:1145)WARNING: NFS: 281: Mount: (Backup_ESX_DDR) Server (10.1.42.10) 10.1.42.10 Volume: (/backup/lakeland/vrangerpro/nfs) OK
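For anyone chasing the same thing, here is a rough sketch of how those two warnings could be paired up to see how long each drop lasted. It is only an illustration, not something I ran at the time: the function name `nfs_outages` is made up, the regular expression is fitted to the two lines above, the year is an assumption (syslog does not record one), and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Pattern fitted to the two vmkernel warnings quoted above (an assumption:
# other ESX builds may word these messages differently).
NFS_EVENT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:"
    r".*WARNING: NFS: \d+: Mount: \((?P<mount>[^)]+)\)"
    r".*\)\s+(?P<state>not responding|OK)\s*$"
)


def nfs_outages(lines, year=2009):
    """Pair each 'not responding' warning with the next 'OK' for the same host and mount.

    Returns a list of (host, mount, start_time, seconds_down) tuples.
    The year is supplied by the caller because syslog timestamps omit it.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = NFS_EVENT.match(line)
        if not m:
            continue
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["mount"])
        if m["state"] == "not responding":
            # Keep the first drop time if the warning repeats before recovery.
            down_since.setdefault(key, when)
        elif key in down_since:
            start = down_since.pop(key)
            outages.append((key[0], key[1], start, (when - start).total_seconds()))
    return outages


SAMPLE = [
    "Apr 1 18:14:10 lkldesx2svr vmkernel: 3:23:06:21.862 cpu3:1032)WARNING: NFS: 257: Mount: (Backup_ESX_DDR) Server (10.1.42.10) 10.1.42.10 Volume: (/backup/lakeland/vrangerpro/nfs) not responding",
    "Apr 1 18:14:19 lkldesx2svr vmkernel: 3:23:06:30.368 cpu3:1145)WARNING: NFS: 281: Mount: (Backup_ESX_DDR) Server (10.1.42.10) 10.1.42.10 Volume: (/backup/lakeland/vrangerpro/nfs) OK",
]

outages = nfs_outages(SAMPLE)
for host, mount, start, seconds in outages:
    print(f"{host} lost {mount} at {start} for {seconds:.0f}s")
```

Run against a whole night of vmkernel log, something like this would show whether the drops cluster around backup windows or are spread evenly.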

 

 

Now, if we saw this message during a backup there were messages like this:

 

 

Mar 25 22:41:27 lkldesx2svr vmkernel: 15:08:45:41.532 cpu7:1037)BC: 2672: Failed to flush 6 buffers of size 131072 each for object b00f 36 1 63f29ef0 cdcc0000 1 1cdcd 5f343da6 a95e2cb0 0 0 0 0 0: No connection
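To put a number on a message like that, a small helper can pull out the buffer count, the buffer size, and the reason at the end of the line. This is just a sketch: `unflushed_bytes` is a hypothetical name and the pattern is fitted only to the single line shown above.

```python
import re

# Fitted to the one "Failed to flush" line quoted above (an assumption:
# the wording may vary between ESX versions).
FLUSH_FAIL = re.compile(
    r"Failed to flush (?P<count>\d+) buffers of size (?P<size>\d+) each .*: (?P<reason>[^:]+)$"
)


def unflushed_bytes(line):
    """Return (total_bytes, reason) for a flush-failure line, or None if it does not match."""
    m = FLUSH_FAIL.search(line)
    if m is None:
        return None
    return int(m["count"]) * int(m["size"]), m["reason"].strip()


SAMPLE_LINE = (
    "Mar 25 22:41:27 lkldesx2svr vmkernel: 15:08:45:41.532 cpu7:1037)BC: 2672: "
    "Failed to flush 6 buffers of size 131072 each for object b00f 36 1 63f29ef0 "
    "cdcc0000 1 1cdcd 5f343da6 a95e2cb0 0 0 0 0 0: No connection"
)

result = unflushed_bytes(SAMPLE_LINE)
print(result)
```

For the line above that works out to 6 x 131072 = 786432 bytes (768 KB) of backup data that never made it to the mount, with "No connection" as the stated reason.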

 

 

So in my mind I was wondering - was the flush buffers the failure, or was that a side effect of the NFS mount point dropping? Did we really have bad network issues? I called upon others running the exact setup and they were not having any failures. Another note, if we changed the jobs to CIFS we never had a failure.

 

 

About the same time, looking at the logs, I found authentication errors that occurred every 5 minutes. There were 2 admin accounts that were attempting to log in every 5 minutes but would fail. The user names associated with the accounts were valid, but neither user knew of anything they were doing, so it looked like a common service was attempting something with both of their accounts. After a TCPDump I was able to trace the traffic back to a server that runs our EMC Control Center software. Now, I haven't been able to get to this in my two months here yet, so I was on unfamiliar ground. I was able to find that ECC does have agents on each ESX host. Do we use the functionality? Probably not, so I want to get them off the hosts. Now, under ECC these agents do different tasks, and one of them runs, you guessed it, every 5 minutes. Since it was failing anyway I disabled the task and the events went away. That was one week ago. The odd thing is that since then there has not been one backup failure. I can't imagine that these two things are related, but I can imagine that an agent loaded on each host could cause problems. I will monitor the backups for another week and if there are no failures I am going to say it was the agent. What would you say?

 

 

 

Well, it's been a while since my last post - which involved a job change and a fresh start. I was really looking forward to getting hands on with SRM and all the exciting things that were happening at MIDFLORIDA Federal Credit Union. Fast forward to today - I left MIDFLORIDA because I was basically handcuffed and not allowed to use the skillset that I bring to the table. I am now a Sr. Systems Engineer at the Publix Employee's Federal Credit Union - PEFCU for short. PEFCU is luckily a VMware shop. This is what we have going on - EMC DMX4(s) using SRDF to replicate to a DR center. We are also a DataDomain shop for backup to disk and deduplication. While this all sounds great there are some issues.

 

 

1) I am not sure I like the setup of the VI here - but I don't want to rock the boat too much at first.

 

 

2) The backups to the DataDomain appliance fail randomly through NFS... Trying to figure that one out.

 

 

3) SRM is purchased, but we are not using it yet... Why not? I need to dive into that subject and see what's the holdup.

 

 

Oh, and I am the backup to the Network Engineer which means Cisco training for me.  I know enough to be dangerous, so off to study......  I like the team and there is a lot of work to keep me busy....

 

 

STAY TUNED>>>

 

 


SRM and Update 2

Posted by jamieorth Aug 12, 2008

 

UGH!!! It's Tuesday afternoon and I hope your day is going better than mine. What in the world happened? First of all, SRM was ready to be tested when someone (who will not be mentioned, but it wasn't me) installed Update 2 for 3.5. Guess what, U2 breaks SRM 1.0. So far I don't have a commitment on when that will be updated, but that seems to be the least of the problems. As the date passed August 10th we noticed some strange events happening. VMotion is not working with HA. Strange, huh??? Is this something that was configured wrong? After all, we are working on 6 new hosts at the primary site and 6 new hosts at the secondary site. Also, the secondary site is not really the secondary site; it's the new data center that will become the primary, and the original site will be the secondary. So all of production is running at the primary (old primary)... When one of the VM's needed to be shut down and then restarted we received a General Fault issue. The VM wouldn't restart, no matter what we tried. A call to VMware indicated that our timezone file was corrupt. VMware was able to get the VM restarted and we went about our merry way. Today, same thing, different VM. I researched the log error and found the thread of threads of late in the forums. Seems U2 has a bug, and after August 10th the licensing is corrupt. All running VM's are OK, but don't vmotion or shut them down. VMware is working on the issue - 36 hours they say.... 36 hours?? Are you kidding me? Luckily we have a small environment, but what if there was a failure that needed HA? Financial institutions do not like to be down for any length of time.

 

 

So as we wait for a fix, cautiously watch our VM's, and stand ready to roll back time if needed, I think about other things to say in this blog. Anybody out there really love NetApp? If I hear of another product with "Snap" in front of the name I am going to puke. Also, did everyone know - NetApp is not a hardware company, they are a software company, as many of you already know based upon the outrageous fees for their different products.... So I rant....until lunch and next time......

 

 


Karma??

Posted by jamieorth Jun 9, 2008

Well, in my last post I was not looking forward to the entire process of looking for a job, especially in an economy that doesn't look good. However, as I write this blog, I find myself employed which is a great thing. MidFlorida Federal Credit Union had a posting for a Systems Engineer - the usual skills, plus VMware. I remembered that I had allowed two of their guys to tour the Citrus and Chemical datacenter when they were considering purchasing a new phone system. Our Siemens rep. had set up the meeting. Come to find out I knew one of the guys who had actually worked at Bennigans of all places back in the day. (I didn't work there, but I did have my name on one of the bar stools.....). I tried to call him when I saw the posting but all I had was a first name. I got his voicemail. I also learned that one of the ladies from marketing (C&C) was the ex-sister-in-law to the head of MIS. She said she would give him a call. Two hours later I had an interview. Well, the interview went well but I had the feeling that I may have been overqualified for what they were wanting. I got a call about 3 days later about coming in to meet the head of HR and the CFO. Seems that they have been pondering having a full time position for MIS that deals mainly with Business Continuity. So, we talked about what I knew about that - currently the head of HR was in that role. Not really what she wanted to do. So, here I am, the Assistant Vice President of MIS in charge of Business Continuity / Disaster Recovery. Also, since I didn't get the Systems Engineer position, they ended up hiring one of my engineers from C&C, Randy Adams.

 

Now, the cool side is that they are just in the early stages with VMware, so much so that they started with 3.5!! We are going to be setting up replication with 2 new NetApp SANs and be using Site Recovery Manager as well. I hope to stay involved with as much of that as I can, and I should be able to since that ties in with BC / DR... The NetApps are supposed to arrive today. Never seen them but I read good things in the forums about them.

 

 

Well, enough for now. Will keep you up to date.....

 

 


June 30th

Posted by jamieorth Apr 22, 2008

Well, it's official. My last day at 600 N. Broadway will be June 30th. I have worked in this same building now for over 18 years. That's 1/2 my life. 2 companies. One good, one not so good. Basically Colonial is not going to use the Bartow facility for DR as once was planned. It seems that just in the last week some floor space opened in the Orlando facility. As that facility is ready to go there is no need for Bartow. I thought that my two employees and I would at least be safe and could still perform a job function from this location, but I guess not. 2 positions will be moving to Montgomery, and 1 to Orlando. I have one employee that may consider the Orlando job, but nobody wants to move to Alabama. So, it looks like I will be looking again for a job. Not really a great time to do that, but I am young and can still rebound from this. It is unfortunate because I know there is nothing in the town where I live - so that means a commute that will hinder some of the family responsibilities that I have.


Bad news for Bartow

Posted by jamieorth Apr 21, 2008

Well, the IT Director is flying to little ole Bartow in the morning. Looks as if the DR Site Study doesn't bode well for the Bartow facility. Plans were to use our building for DR and Test/Dev. My feeling is that there are some that do not like something about the building, either its location or structure. Let's recap - a datacenter was run out of this building since 1976 that supported over 26 company payrolls, the processing for 4 community banks, and at one time the local municipality utility billing. So, does that mean that it qualifies as a state of the art facility?? NO...Can it serve as a backup site? Sure it could. I guess there are too many $$ involved, but of course this is from a company that neglects the possibility of saving more of those same $$ by investing in virtualization. Sure, they have a small farm of 50 or so VM's. That pales in comparison to the 500+ physical servers humming along at 5% to 10% utilization, with a power bill going through the roof, and the cost for data center real estate at an all time high... I guess that's where I come in...Let's see if they are ready..


To upgrade or rebuild....

Posted by jamieorth Apr 14, 2008

 

Well, for the past couple of weeks I have been working on a presentation that I will be giving to the IT Review Board at work. This is supposed to be the corporate virtual vision. Well, that's fine with me, I love the stuff. Convincing some, I am learning, is easier said than done. I also am looking at upgrading the corporate systems from 3.01 to 3.5. Now, I have never been with a company that was so segregated in duties and had so much change management that efficiency is stifled. So, what to do? I don't have any access to the SQL server that holds the VC database. So I have to involve the DBA's, and the company policy is that all SQL servers have the SQL lockdown tool on them. Is this going to break VC? How will it affect the upgrade? Well, the DBA points out that all other systems have identical test systems. Well, not VC. It manages both the production hosts as well as test/dev. So, I have to have a project plan to install the upgrade, and a rollback contingency. Now, they keep data for 30 days and purge. So, is any of the data really needed? Probably not. Should I just start fresh? Some would say yes, but what do we lose? VC settings, where are they stored? In the database or as part of the Program Files VMware DIR? Well, I go to Bama next week, so I guess I will have my answer after that. Until then,

 

 

Regards....

 

 

 

Ok, blog number one. ** Brief History ** Worked for Citrus and Chemical Bank, an $850 million community bank in Polk County, FL for 18 years. Started as mainframe operator, installed their first network in '94, Network Manager in '99, and Director of Network/Operations in '07. We were acquired by Colonial Bank at the end of 2007 and like 28 others in the I.T. department I thought I would need a job. Well, thanks to VMware (and maybe the fact that our building had the data-center that someone at Colonial decided to use for DR) I am now a Systems Engineer III with Colonial. Seems that they have VMware and the CIO and I.T. Director know that's the way to go as they have 500+ pServers in their datacenter and are running out of room. Sprawl here is an understatement.

So it looks like I have some work to do...First thing, a healthcheck.  I need to know how we have VI configured, and make sure we are set for growth.  Read about that in the next post....

 

 

Regards...