Manual Automation

There's nothing like going in to work Monday morning only to find that one of your ESX hosts is listed as "not responding" in VirtualCenter. Using HP's iLO, I tried restarting the management network. No change. The VMs were still running and functioning normally. The host was still running - there just seemed to be a communications problem between the host and VirtualCenter. After a quick call to VMware technical support, they had me restart the VirtualCenter server service and voila, communications were restored and the host's status in VirtualCenter returned to normal.

I didn't spend a lot of time doing a root-cause analysis as this environment was not in production yet. But I suspected there was a network interruption from which host-VC communications never recovered.

Now let me just say something about VMware technical support. I've worked support incidents with many hardware and software vendors over the years and have to say that VMware has their act together when it comes to product support. I'm not saying they're perfect, but I've received consistent quality support from these guys going back to my ESX Server 1.5 days. They're worth the money and I wouldn't run a Virtual Infrastructure environment without them.


So a week later, it happens again. I open another case with VMware, referencing the previous one. This time, neither restarting the management network nor restarting the VirtualCenter server service works. The support tech reviews some additional logs and is basically stumped. The only thing he has left for me to do is to manually shut down the VMs running on this host and reboot the host. This fixes the problem, but doesn't really explain why it happened in the first place. The tech is going to review a new set of logs I just uploaded and let me know if he finds anything. While there wasn't much more that could be done at this point, this always seems to me like a "don't call me, I'll call you" kind of resolution.


Before he has a chance to call me back, it happens a third time on the same host! Same symptoms, same results. The same tech doesn't find anything in the logs from the previous incident so he escalates to senior-level VirtualCenter support. We discovered a new symptom - the host seems to have lost connectivity with the storage, even though the VMs are still running fine (strange but true).


The senior tech said something that jogged my memory and I remembered that while this server survived our 4-day hardware burn-in test, we had problems connecting to the management console very early on to the point where we had to pull the USB key fob and reinstall it. (Keep in mind we're running ESXi.)


To be safe, I installed a new USB key fob and the problem has not occurred again. As of this writing, it's been about three weeks. Moral of the story: don't automatically rule out the hardware even when the problem appears to be with the software.


For my employer, this is the year of disaster recovery. Almost all of our major projects tie in to the goal of performing a successful DR test by the end of the year. Besides the standard IT things that have to get done on a regular basis (asset management, corporate application TLC, etc.), this goal is really driving the work we’re doing.

Sometime before I was hired, the company purchased two EMC Celerra NS352s for NAS/IP storage and two EMC Centeras for file and email archiving. We’re an HP shop, so we’re using DL380 G5s with the little USB key inside running ESX 3i or ESXi or whatever it’s called today. We mostly use Cisco gear for networking and have dedicated switches for iSCSI and VMotion traffic. We have two of everything – our company’s IT services are split across two datacenters. Each datacenter will be a hot recovery site for the other.

So what does our DR solution entail? Well, some fairly advanced technologies:
Virtualization: VMware VI3, release 3.5
DR Automation: VMware Site Recovery Manager
Replication: EMC Celerra Replicator V2
Snapshot consistency: EMC Replicator

Oddly enough, looking at the list, only one of the technologies is shipping and in my possession today (VMware VI3). Hmmm… can you say, “Project risk”?

I first saw VMware Site Recovery Manager at a VMworld 2007 presentation. If it works, it will be impressive. Automating the steps to configure and power on VMs, and having a central place to store the DR “run book”, will be sweet, to say the least.

One of the advantages of having EMC storage equipment is that EMC owns VMware (or at least the controlling interest). This means there’s a pretty good chance that EMC's storage platforms will be among the first certified to work with VMware products. Sure enough, my Celerras will work upon release of SRM with a firmware/code upgrade. The code is shipping on new product; however, EMC has a policy that delays certification by 90 days for installing/upgrading to current product. That puts us in the June time-frame.

The Celerra code upgrade provides a new version of Celerra Replicator that replicates iSCSI LUNs. To ensure application consistency for applications such as Exchange and SQL Server, EMC Replicator must be used.

This is the high-level plan: application-consistent snapshots, SAN/IP storage-based replication and SRM to run it all at the end. Yes, we have some physical servers (HP-UX, AIX, etc.), but too bad, their story won’t be as interesting.

So while I’m waiting for product to GA, I’m trying to get our VI3 platform up and stable. Stay tuned for progress on that front.


Who is John Galt?

Posted by Virtual_JTW May 8, 2008

OK, now that I have your attention (and no, I’m definitely not John Galt): since this is my first post, I think it appropriate to give you a little background about myself. After all, why read and/or care about anything I have to say? ;)

I was introduced to VMware like so many others, through Workstation back in 2002. At the time I worked for a 2.8 billion dollar company and had just completed a major Microsoft SMS 2.0 implementation. I was working with a consultant who had Workstation installed on his laptop PC and used it to demo the product developed by his employer.

So then it goes something like this:
That’s a cool product, let me show the boss!
Boss likes it and says, “Why don’t you do some more research on this VMware company and find out more?”
I do the research and discover VMware’s GSX and ESX product lines.
The rest is history. (Don’t you love that line?)

I procured a copy of GSX, set it up and created three virtual machines for one of our internal development teams.
Oh, and I never told anyone what I had done. (Bad boy!)

I used a whitepaper from VMware that described how to use the same base image for multiple VMs using redo logs. Worked great except when one needed to be rebooted = NTFS no likey.

The dev team never caught on and six months later I finally let them in on the secret. Virtual machine? What’s that? Etc, etc… (I hate giving end-users or customers reason, right or wrong, to blame all of their ills on me!)

Bottom line = pilot successful. From there I got ESX 1.52 approved (now end of 2003 or so), first one host, then two, then four - all with local disk.

Then VirtualCenter 1.0 and ESX 2.0 are released. Please Mr. IT Director, will you approve this requisition for a SAN? No. But please?
No.
Okay then how about you get fired and my team now reports to another director?
Will you, new Mr. IT Director, approve this SAN purchase?
Yes!

Hosts grow to 16 (mixture of IBM 440s, 445s, 3650s, BladeCenter blades); VMs to 500 (powered on), 600+ requisitioned; VMotions = 1000s; SAN (IBM DS4500) = 16TB; every VMworld = attended; VCP2 = achieved. Life is good.

It’s now 2006 and my employer gets acquired by another company. Oh yeah, and the new management doesn’t like any new technology, much less virtualization.

Time to move on. So here I am, embarking on a new implementation using EMC Celerra IP storage (iSCSI), HP ProLiant DL380s, VMware VirtualCenter 2.5 and ESX 3i.

I thought I would use this blog to document the good, the bad and the ugly of this next gen VI platform and maybe share some tips along the way. The VMTN forums have been good to me over the years so maybe I can add to the discussion.

That’s it for now. I already have a list of technical stuff to publish. Stay tuned!
