Manual Automation

VMworld must be my destiny! Let me explain...

I've been to every VMworld since the first one was held in 2004, and I really thought this would be the first year I'd miss the conference. The odds just weren't in my favor: the economy went bust right around last year's VMworld, my employer has since entered Chapter 11 (but should come back out of it relatively quickly), and I can't afford to cover the costs myself. I tried to reassure myself by remembering that the last conference held in San Francisco was my least favorite thus far (which didn't really work anyway).

Then in March, my local VMware Systems Engineer, Dave, asked if I wanted to present my Site Recovery Manager (SRM) experience at this year's conference. My first reaction was not only "no", but "hell no". After a little more thought, I quickly changed my mind. I think SRM is a great product that isn't getting the attention it deserves. I'm also proud of what our little team has accomplished here in such a short amount of time. And finally, I enjoy the occasional challenge and believe this helps one to keep growing professionally and keep life interesting.

In case you didn't know, VMware typically covers the cost of the conference for speakers. The only thing left was to convince management to cover the travel costs, which was not easy given the current financial circumstances. But in the end I was given approval and, just like that, I'm VMworld bound again this year!

If you're considering SRM or are just getting started, check out my session "BC2704: Site Recovery Manager, a real user experience". I promise it will be worth your time. I can talk about this stuff for hours, and Dave and I will answer all of your questions - from technical SRM product details to general disaster recovery and everything in between.

Here's the abstract:

"Learn from a customer in the Midwest, all of their experiences implementing, testing and running Site Recovery Manager in a production environment. Hear their challenges and how their SRM implementation has worked for them. Find the facts you need to know to maximize the success of your disaster recovery solution with SRM."

See you there!

0 Comments Permalink

Bye, Bye ESXi

Posted by Virtual_JTW May 20, 2009

What a long, frustrating trip it's been! Don't get me wrong, I really like the idea of ESXi: thin, fast install, small footprint, BIOS-like host configuration, no Console OS (COS) to patch or support, can run from embedded USB key, etc. But my experience in supporting and managing an ESXi-based VI production environment tells a different story.

I've decided to convert all of my hosts from ESXi to ESX "Classic". There are three primary reasons:

  1. Support
  2. Reliability
  3. Compatibility

Support

Without the COS it's difficult to execute commands and view log files in real time. I've had more than one VMware support engineer complain about this during a troubleshooting session (so it must be true!). There are alternatives: using the unsupported trick to get to the command line from the host's console, hacking SSH to open it up (which is also unsupported), capturing logs/diagnostic bundles via vCenter Server, RCLI, VIMA, etc. But none of the alternatives is as fast/clean/easy as SSH'ing right into the COS and working from there.

Reliability

I purchased eleven Hewlett-Packard USB keys w/unlicensed ESXi to embed in my ProLiant DL380 G5s. When I upgraded them via VMware Update Manager (VUM), the entire installation on the key became corrupted. I was not even able to revert to the previous ESXi image on some keys. HP has since issued a customer advisory and I have replaced all of the keys: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c01605187

Unfortunately, this experience still leaves me with a less-than-fuzzy feeling for running ESXi on said embedded USB keys in critical production environments.

Compatibility

ESXi seems to lag ESX Classic in updates - specifically when it comes to compatibility. This implies that ESX Classic is developed, tested and certified against first. I manage two VI environments located in different datacenters. I use SRM at the primary for DR/fail-over to the secondary (see previous articles). It took around a month for VMware to release a patch for ESXi Update 3 that made it compatible with SRM 1.0 Update 1. Read your compatibility guides! More on this in a future article.

Many third-party tools and scripts require the COS. There are many examples of this: Snap Hunter, Vizioncore vOptimizer Pro, etc.

Unlike ESX Classic, HA in ESXi requires a ScratchConfig folder created on separate VMFS datastores for each host. This may not be a big deal for smaller clusters, but for a cluster with many servers, many datastores will be required.

Finally, for those of us that run HP servers to host ESXi, we have a specific firmware/ISO that contains the HP management providers. Unfortunately, even with the built-in providers you still can't monitor disk status - which is, of course, the one hardware component that fails the most often(!). As of this writing, here's what I've been able to determine:

Licensed ESXi

  1. HP only supports ESXi with the proper "management providers" on Update 2 and Update 3 (as evidenced by VMware's downloads section of their web site). ESXi 3.5 Update 4 is not yet supported.
  2. Using VMware Update Manager to upgrade ESXi instances to Update 4 effectively breaks SIM manageability.

Free ESXi

  1. A new installable image is available for Update 4 with the management providers.
  2. Upgrading existing hosts via the VMware Infrastructure Update tool effectively breaks SIM manageability. These hosts will have to be reinstalled from the ISO.

Once you're sure you have the right ESXi firmware image installed, it's time to add the host to HP Systems Insight Manager (SIM) for hardware monitoring. I was able to add only 3 out of about 20 of my ESXi hosts successfully. HP support wasn't able to help me out. The main suggestions I got were to reinstall(?) and to call VMware. With ESX Classic you install the Insight agent in the COS, add the host to SIM, and you're done. It just works.

Conclusion

As I mentioned previously, I still really like the idea of ESXi. Once I saw that Hitachi was embedding a virtualization solution in their servers, I knew it was only a matter of time before VMware came out with something similar.

I have many free ESXi installable instances. This is a great solution in cases where the budget is tight or non-existent. Utilizing the free ESXi still gives you many of the benefits of virtualization, making it a better way to go than bare metal OS installation in most cases.

I think embedded/thin is the future. I hope vSphere 4 embedded improves on the issues described above.


5-20-2009 UPDATE:
Not being one to spread any FUD, I would like to add to my comment "Using VMware Update Manager to upgrade ESXi instances to Update 4 effectively breaks SIM manageability". According to this article on the Yellow Bricks blog (http://www.yellow-bricks.com/2009/05/14/updating-an-esxi-server-with-vendor-agents/), VUM has the intelligence to download and apply the correct ESXi firmware image (i.e. the one with whatever OEM's management providers are pre-installed). And I believe it because Duncan is "the man". Note that I used the Virtual Infrastructure Update Tool, not VUM, so that may have made a difference. Regardless, I do know that after updating the image, HP Insight Manager was no longer seeing all of the hardware components and in some cases failed to communicate with the server completely. This hasn't happened to me on my ESX "classic" hosts.

0 Comments Permalink

By the time I was twelve I knew I wanted to be in the IT business. I loved my Commodore 64, my 1541 disk drive and 300 baud modem! I loved dialing into bulletin board systems and typing in games with all of those peeks and pokes. I didn't know it at the time, but I was really learning a lot about computer hardware, storage media, telecommunications and software programming.

Responsibilities at my first corporate job included managing and maintaining LANtastic on a 10base2 network. Okay, I was mostly maintaining it - it was not very reliable, as it ran on top of MS-DOS and the transceivers failed all of the time, but hey, this was before the whole dot com boom era so you couldn't expect much back then. I also had a Novell NetWare 3.x NOS running for one department (remember NetWare Loadable Modules, NLMs?). Once NetWare 4.11 came out I convinced management to dump LANtastic and 10baseT and upgrade to NetWare 4.11 and Ethernet (remember NetWare Directory Services, NDS - can you say "AD"?). I didn't realize it at the time, but I was really learning a lot about network operating systems (NOS), networking hardware and standards.

It was around this time that I started to realize that I wanted to work with multiple technologies, especially newer technologies that were interesting because they solved a complex problem or saved businesses money. I didn't want to become a walking product manual for one piece of software or hardware component and focus on that for the rest of my life (or until I retired, whichever came first)!

My next opportunity involved leading a team in the design, implementation and management of MS Systems Management Server (SMS) 2.0 to potentially hundreds of locations around the US. While the tie-in may not be obvious, working with SMS allowed me to learn more about MS SQL Server 2000, Windows management technologies (CIM, WBEM, etc.) and data synchronization across slow links (anyone heard of Starburst?). Not to mention it was a great opportunity to work for a 3 billion dollar publicly traded company - a different culture than my previous employer, to say the least.

Unfortunately, management made a short-sighted, mostly political decision and shut down the SMS implementation after it was successfully deployed to the first remote site. However, this turned out to be good news: I discovered virtualization around this same time, in the 2001/2002 timeframe, via VMware Workstation. Shortly thereafter I set up a production GSX server hosting three VMs that shared the same base image, with each VM having a unique redo log (remember that VMware whitepaper?). The users couldn't tell the difference and were a bit surprised when I finally let them in on the secret! And as they say: the rest is history.

To expand on a theme, virtualization has allowed me to work with and learn more about operating systems, enterprise-level server hardware (such as how CPUs work), enterprise-level storage and all of its related technologies (SAN hardware, iSCSI, fiber switches, etc).

I don't pretend to have the most unique career path in the world - I know there are many other Systems Administrators and Engineers that have lived similar experiences. And for that reason I'm identifying a new breed of IT professional. Systems Analyst, Systems Administrator and Systems Engineer titles are all less meaningful in this context. I've had all of these titles at some point in my career.

Regardless of title, we're IT professionals that stand apart based on our past experiences and constant passion to always be working with current and newer technologies that ultimately allow businesses to operate faster, smarter and more efficiently. Virtualization, especially VMware Virtual Infrastructure and the coming vSphere products, has allowed us to take businesses to that next level.

And we won't stop there. We're constantly keeping an eye on cloud-based technologies, standards and initiatives. We'll be beta testing these products - and not just download, install, use for five minutes and throw away. We're excited about the product and want to see it succeed, so we'll provide feedback. We'll keep an eye on the bleeding edge and maybe sometimes participate in a limited way - such as a product install in a lab for evaluation - but we won't bet the company's business on it (at least those of us that have been around for a while and have made that mistake before).

To bring us full circle I have to ask, what is the next NetWare? The next SMS? The next VMware VI? Is it more VMware? Quite possibly. Or maybe it's something we haven't thought of yet. Maybe it's mainframe 2.0 with pervasive high-speed wireless connectivity brought about by a technology such as WiMAX? It's hard, if not impossible, to predict. But one thing is certain: it will be "cool", we'll be among the first of the light bulbs popping on throughout the IT industry, and the businesses we work for will thank us for it.

0 Comments Permalink

I just read Eric Siebert’s Open Letter to VMware, which inspired me to write this article. I have to disagree with his last suggestion: relaxing VCP requirements. His argument is that many admins can’t afford to take the training, which is a prerequisite to taking the exam. I have to wonder why? I know it is not cheap, costing around $3000, but most IT professionals can come up with the money. There are several ways to accomplish this:

1. Paid for by employer
2. As a bonus for hiring on with a different employer
3. Use savings: set aside as much as you can from each paycheck and pay for it yourself
4. Use debt: get a credit card/get a loan


I’m sure there are others, but these are methods I’ve used over the years for every certification exam I’ve completed. Obviously, the first choice is ideal. Consider this: if your employer won’t pay for the training and the exam (if you pass), doesn’t that say something about them? I consider it one of those life lessons I’ve learned along the way: ask about the support you’ll get to continue your education during the interview! My experience is that businesses that are managed well and/or have a more mature IT organization know that it’s in their best interest to encourage the life-long learning of their employees. If your current or potential employer has no desire to do this, that should throw a mental red flag you should carefully consider. Note that I do not recommend immediately quitting your job(!), but this is one more piece of information that should be considered within the whole.

I know many will shudder at the thought of going into debt to pay for a training course. But one must consider the full scope of the situation. Why do most IT professionals want to achieve certification? To have some acronyms listed after their names on a business card? No, it’s to convey the value they bring to an organization based on their skill-sets. It may help hiring managers narrow down potential candidates to interview for a position. It may help Joe Admin get recognized by his peers and earn him a promotion. The point is that the training should be worth $3000 to the student in that it helps him further his goals at some point in the near future. Financially, it should also help him pay back the loan or pay off the credit card much faster than would be otherwise possible.

An administrator should only get the training and take the exam if it is required for them to accomplish their long-term goals. This at least implies that the exam need not be made any easier to take, since not every admin needs or wants to take it.

Which brings me to my next point: I think the exam should be harder, not easier! I enjoy the exclusivity that passing the exam brings. With the “multiple answers” format, it’s too easy for dishonest people to publish “brain dumps” with exact questions and answers they saw on the test. It’s even worse that thugs sell PDFs with many of these questions to losers that use these exclusively to pass the exam. It really brings down the value of the certification for everyone. Ever heard of “paper MCSE”? I worked with one many years ago and I can tell you the guy didn’t know how to do his job – even when I asked him to carry out the simplest of tasks. And there are many more of those folks out there today. What a waste.

I can’t say I have a solution to this problem. I dream of a day when testing methods become more robust in that they are better able to signify that the certificate holder knows their stuff. The VCDX may go a long way towards achieving this dream. In the meantime, I can highly recommend taking the course and the exam as the VCP is probably the hottest certification in the IT industry.

My VCP number is 001711.

0 Comments Permalink

In a previous blog post, Virtualizing Virtual Center, I discussed the benefits of virtualizing Virtual Center and why I had done this in my environment. I recently heard another argument against this decision from my VMware SE. The argument is that if all of the ESX hosts in the cluster where the Virtual Center VM lives crash, then you have to log on to each host to find where the VM is in order to power it back up. The recommendation is then to configure DRS such that it keeps the VM on the same host so you know where it is. Let’s consider this a little further…

For those environments that have multiple ESX hosts and are using HA, the VC VM should be powered-on to another host if there are enough hosts left standing. So this argument really only applies to a scenario where all hosts have crashed, in which case you may have a bigger problem on your hands anyway(!).

But let’s say this does happen. How can I find out where any particular VM was when my DRS/HA cluster crashed? Well, besides logging on to every host, you could just query the Virtual Center database. Here’s a quick little query that will give you a list of VMs and the host they’re currently on.

SELECT VPX_VM.DNS_NAME AS VirtualMachine, VPX_HOST.DNS_NAME AS Host
FROM VPX_VM
INNER JOIN VPX_HOST ON VPX_VM.HOST_ID = VPX_HOST.ID

You could set this up as a scheduled task and save the results to a text file (or better yet, a SharePoint server if your organization uses that for document sharing/management – this is a little further down my task list). Of course you should save this information on a system “outside” your virtual infrastructure, such as a NAS-based CIFS share.
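As a rough sketch of that scheduled export, the Python below runs the same query and writes the result as CSV. It is only an illustration under stated assumptions: an in-memory SQLite database with made-up host and VM names stands in for the real Virtual Center database, and the helper names `export_vm_locations` and `build_demo_db` are hypothetical. Against a real deployment you would pass in a connection from whatever database driver your Virtual Center database uses, and open the output file on a share outside the virtual infrastructure.

```python
import csv
import sqlite3
import sys

# The query from the post: one row per VM with the host it is currently on.
VM_HOST_QUERY = """
SELECT VPX_VM.DNS_NAME AS VirtualMachine, VPX_HOST.DNS_NAME AS Host
FROM VPX_VM
INNER JOIN VPX_HOST ON VPX_VM.HOST_ID = VPX_HOST.ID
"""


def export_vm_locations(conn, out_file):
    """Run the query on a DB-API connection and write the rows as CSV.

    Returns the number of VM rows written (header excluded).
    """
    cur = conn.cursor()
    cur.execute(VM_HOST_QUERY)
    rows = cur.fetchall()
    writer = csv.writer(out_file)
    writer.writerow(["VirtualMachine", "Host"])
    writer.writerows(rows)
    return len(rows)


def build_demo_db():
    """Hypothetical stand-in for the Virtual Center database (assumption:
    only the three columns used by the query are modelled here)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE VPX_HOST (ID INTEGER PRIMARY KEY, DNS_NAME TEXT);
        CREATE TABLE VPX_VM (ID INTEGER PRIMARY KEY, DNS_NAME TEXT, HOST_ID INTEGER);
        INSERT INTO VPX_HOST VALUES (1, 'esx01.example.local');
        INSERT INTO VPX_HOST VALUES (2, 'esx02.example.local');
        INSERT INTO VPX_VM VALUES (10, 'vc01.example.local', 2);
        INSERT INTO VPX_VM VALUES (11, 'mail01.example.local', 1);
        """
    )
    return conn


if __name__ == "__main__":
    # Demo: print the CSV to the console instead of writing to a share.
    export_vm_locations(build_demo_db(), sys.stdout)
```

Taking an open file object rather than a path keeps the function easy to point at a text file, a network share, or the console without changing the query logic.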

I’m sure you could do something similar with PowerShell and the “get-vm” command but I haven’t really looked into it. There are other tools that can help you track VMs as well such as Veeam Reporter.

The bottom line is that if you remove your VC VM from DRS it will not enjoy the load-balancing benefits that DRS brings to the table in the first place. I’m not running the Distributed Power Management (DPM) feature but I’d also have to wonder how this might impact environments where this feature is enabled.

As with many technical decisions, I think this is largely a matter of personal preference and what your comfort zone will allow. We VI admins have notoriously large comfort zones so I’m guessing many are virtualizing their Virtual Center instances! If you’re okay with the downsides previously mentioned, by all means assign the VM to a specific host. I’ve lived through enough hardware and power outages that crashed my VI over the years to have learned the hard way: track your VMs regularly regardless of how you’ve implemented Virtual Center. You’ll thank me for it someday.

2 Comments 0 References Permalink

This thing must be beta! It inserts additional lines, doesn't convert Word-formatted documents very well, etc.

I hope they keep improving this resource.

0 Comments 0 References Permalink

To give you a little background, I now have 6 ESX hosts with 58 VMs. Each host has dual-iSCSI HBAs with 1GbE connections. All Exchange 2007 roles have been virtualized, however we currently only have 1 out of 5 mailbox servers running as a virtual machine. We have a number of other workload types virtualized including file, print, SQL, web servers, etc.

Management has decided to stop virtualizing Exchange servers. Why? Fear generated by the FUD that surrounds the performance characteristics of various storage transports - in this case iSCSI via GbE. The only way to fight FUD is with facts. Towards this effort I have performed some calculations in an attempt to answer 2 questions:

1. How well is our storage transport performing given current virtualized workloads?
2. How much "performance capacity" do we have remaining?

I added up the average bandwidth utilization of all 6 of my ESX hosts which totaled 11008KBps. This converts to 0.09Gbps out of 2Gbps or 4.5% bandwidth utilization. I then added up the maximum utilization of all 6 ESX hosts. This would be the high-point of the peaks or bursts in utilization. The result was 0.48Gbps.

Assuming we can get 800Mbps of actual bandwidth per connection, we have 1.6Gbps of usable bandwidth in total. Note that based on VMware's testing we should be able to reach near wire-speed (2Gbps) if the environment is configured correctly, making 1.6Gbps a conservative assumption.

So even if I use the maximum bandwidth measurement of 0.48Gbps, that leaves 1.1Gbps usable. Another way to state it is that my environment is reaching a max of 30% utilization of that usable bandwidth.
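The arithmetic above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: decimal units (1 Gbps = 1,000,000 Kbps), two 1GbE links, and 80% of each link treated as usable. The function name `bandwidth_summary` and its parameters are illustrative. Note that the unrounded average works out to roughly 0.088Gbps, or about 4.4% of 2Gbps; the 4.5% quoted earlier comes from rounding to 0.09Gbps first.

```python
def bandwidth_summary(avg_kBps, peak_gbps, links=2, link_gbps=1.0,
                      usable_fraction=0.8):
    """Back-of-the-envelope iSCSI bandwidth figures.

    avg_kBps  -- summed average utilization in kilobytes per second
    peak_gbps -- summed peak utilization in gigabits per second
    """
    raw_gbps = links * link_gbps               # 2 x 1GbE = 2Gbps
    usable_gbps = raw_gbps * usable_fraction   # 800Mbps per link assumed usable
    avg_gbps = avg_kBps * 8 / 1_000_000        # KBps -> Gbps (decimal units)
    return {
        "avg_gbps": avg_gbps,
        "avg_pct_of_raw": 100 * avg_gbps / raw_gbps,
        "usable_gbps": usable_gbps,
        "headroom_gbps": usable_gbps - peak_gbps,
        "peak_pct_of_usable": 100 * peak_gbps / usable_gbps,
    }


if __name__ == "__main__":
    # Figures taken from the post: 11008KBps average, 0.48Gbps peak.
    summary = bandwidth_summary(avg_kBps=11008, peak_gbps=0.48)
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

Changing `usable_fraction` to 1.0 shows the same numbers against the near wire-speed case mentioned above.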

The results seemed unbelievable to me at first so I dug a little deeper:

  1. I found this in an EqualLogic presentation from 2005: "With 2 iSCSI connections and free NIC teaming, payload equals approx. 234 MB/s (1.96Gb/s) or 823GB/Hour. We found 2Gb FC delivers 196 MB/s which equals approx. 689GB/Hour payload." http://communities.vmware.com/servlet/JiveServlet/downloadBody/1806-102-1-1554/VMUG.ppt
  2. I found this in an iSCSI Virtualization whitepaper from 2007: "For high-performance, mission-critical servers, the cost of Fibre Channel is often justified, because Fibre Channel provides higher bandwidth (4 Gbps vs. 1 Gbps) and lower latency than IP networks. However, many environments are over-served by 4Gbps Fibre Channel links. This is particularly true for hosts running applications characterized by random traffic, such as database applications and Exchange."
    http://www.dell.com/downloads/global/products/pvaul/en/iscsi_virtualization.pdf
  3. And here's one from Netapp: "...based on deployments, Netapp has proven over the past 3 years that a scalable, simple to use array with enterprise class reliability can safely be the iSCSI platform for mission-critical applications. Exchange is a perfect example of a mission critical application that is routinely deployed over iSCSI these days."
    http://storagefoo.blogspot.com/2006/05/iscsi-performance-and-deployment.html
  4. Finally, VMware's own testing of storage protocols and their corresponding physical medium from this year: "This paper demonstrates that the four network storage connection options available to ESX Server are all capable of reaching a level of performance limited only by the media and storage devices."
    http://www.vmware.com/files/pdf/storage_protocol_perf.pdf
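The EqualLogic figures quoted in item 1 can be sanity-checked with a short conversion. The numbers line up if "MB" and "GB" are read as binary units (2^20 and 2^30 bytes) and "Gb/s" as decimal gigabits; that reading is an assumption made here, not something the presentation states, and the function name `payload_rates` is illustrative.

```python
MIB = 2 ** 20  # assumption: "MB" in the quote means 2**20 bytes
GIB = 2 ** 30  # assumption: "GB" in the quote means 2**30 bytes


def payload_rates(mb_per_s):
    """Convert a payload rate in MB/s to (Gb/s, GB/hour)."""
    bytes_per_s = mb_per_s * MIB
    gbit_per_s = bytes_per_s * 8 / 1e9      # decimal gigabits per second
    gb_per_hour = bytes_per_s * 3600 / GIB  # binary gigabytes per hour
    return gbit_per_s, gb_per_hour


if __name__ == "__main__":
    for label, rate in (("2 x iSCSI", 234), ("2Gb FC", 196)):
        gbit, gb_hour = payload_rates(rate)
        print(f"{label}: {rate} MB/s = {gbit:.2f} Gb/s = {gb_hour:.0f} GB/hour")
```

Under those assumptions, 234 MB/s comes out at about 1.96 Gb/s and 823 GB/hour, and 196 MB/s at about 689 GB/hour, matching the quoted values.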

It's important to note that I'm leaving 2 things out of this consideration:

1. I typically read how FC has lower latency than IP. My somewhat empirical belief is that IP's additional latency will not be a big factor when added to the equation.
2. I've read different sources that state disk IOPS are more important with regards to system performance than storage transport bandwidth utilization.

I'm still looking for a way to quantify these factors to better predict the performance characteristics of our IP storage implementation. This is the first part of what I'm sure will be an on-going investigation. It sure would be nice to have a tool that did all of this for me! I have yet to find something that's comprehensive enough on any given storage platform I've managed (IBM DS, EMC Celerra, et al).

Also note that I've been monitoring my bandwidth utilization more closely using Vkernel's Capacity Analyzer and can safely say that 11008KBps is high. It's dropped 30-40% over the last two months for various reasons.

Next month I hope to enable jumbo frames in this environment and expect to see some additional performance gain at some level. I'm considering capturing before/after snapshots of various performance metrics and posting the results in a future blog.

In conclusion, this analysis makes me even more confident about the performance of our ESX hosts and virtual infrastructure backend storage transport even if/when I get to virtualize the remaining Exchange mailbox servers.

4 Comments Permalink

A Little History

In a previous article (http://communities.vmware.com/blogs/ManualAutomation/2008/05/15/the-big-plan-business-continuity) I discussed why I was looking closely at SRM and what I needed to get done before I could implement the product. Now that I've successfully tested the product I'd like to give an update.

The Celerra Code Upgrade
The code of both of my Celerras was upgraded to 5.6 in mid July. It wasn't pretty - no fun being in the data center until 3:00am. To EMC's credit, their CE hung in there with me, got the problems escalated and ultimately we got the VMware data stores working again. We were bitten by the LUN resignaturing "bug". EMC knows the code upgrade causes this, but for some reason we were surprised and found out the hard way at about 12:00am.

It took another month to recover other services such as CIFS and iSCSI replication. When I was young, my father insisted that when I handled someone else's property, I should always return it in the same or better condition than when I first received it. My main problem with EMC in this respect is that they left me with a system that didn't work like it did before they upgraded it. I'm past the CIFS and iSCSI replication problems now, but I'm still experiencing problems with CAVA that didn't exist before. Luckily, I don't think it's anything too difficult or serious and I will be calling EMC support to get this last problem resolved.

While I've given feedback on this event to EMC support, note that I still am a fan of their unified storage product. It's not right for all companies or all situations but it is for my environment. Also, to be fair, many Celerra customers may never need to experience a code upgrade event. The only reason to do this is if you need some feature or improved capability that the upgrade provides. I've had an EMC CE tell me that they have retired EMC hardware that had the original code installed making it over three years old! This says volumes about the code's stability and reliability. In my case, I needed the expanded functionality of iSCSI LUN replication and compatibility with VMware Site Recovery Manager.

The Evaluation
Anyone semi-familiar with installing VMware products will have no problem getting SRM installed. Note that you'll need to obtain the Storage Replication Adapter (SRA) from your storage OEM and install it in the proper sequence per the documentation. In my case I used documentation from EMC and VMware to install and configure the product. See the "Additional Resources" section at the end of this article.

One of the things that's awesome about VMware is the amount of attention they've given me regardless of whether I was working for a large $3 billion enterprise or a mid-sized $500 million company. In this case, my sales rep offered to have a local VMware systems engineer (we'll call him "Dave") come out on-site and work with me to complete a proof-of-concept.

I had SRM and the SRA components installed. I wanted a technical resource in case I needed it while performing that first test. Well, I needed it and got it. Keep in mind I hadn't purchased the product yet(!). Dave was able to help me work through a couple of issues we ran into during that first session such as file system sizing and licensing issues. It only took 2-3 hours but when finished, I had 4 VMs running in my remote data center 325 miles away! (Thanks to Dave and Ken!)

Another tip I learned during this session: review the SRA log. In the case of the Celerra's SRA, it documents every command it executes and the results. It's a great way to learn what SRM is really doing behind the scenes with your storage in order to get the LUN(s) set up and ready to be used as a data store by ESX.

Subsequent Test Results
I have more testing to do but can report that I'm starting 4 VMs from a single replicated LUN in 8 minutes. And I'm not talking about from the time of just powering on, I'm talking about pressing the "big red (test) button" - powering-up the VMs - starting the Windows services - and the recovery plan completion. Try that using physical servers! Sorry, but even restoring servers from a B2D solution that's replicated to your DR site won't be as fast.

I demonstrated SRM for the DR team and initially got a "that's all?" kind of reaction. I quickly realized that SRM, with the combination of array-based replication, worked too well! Meaning, it did such a good job of hiding the complexity and number of steps required to get from A to Z that my non-technical DR teammates didn't understand what SRM was really bringing to the table. If there's only one thing you take away from this article, make sure it's that you're better off explaining in simple terms the steps SRM is executing in the background before running a demonstration.

Talking about the virtues of SRM is one thing (the recovery run book, the steps it automates, the testing capabilities (which are awesome by-the-way), etc.), demonstrating these product features for your DR team is another. If your experience is like mine, you'll find it dramatically influences the discussions on the project plan. In my case, we will be significantly changing the testing phases - actually streamlining those thanks to SRM.

I wouldn't declare SRM to be a perfect specimen of engineering excellence; I reserve that title for Windows ME (yes, that's a joke). There are a couple of things that could be improved. I would like finer-grained control over when my VMs are powered on - I'd like to be able to specify dependencies between VMs. It seems like VMware is bent on specifying everything as "High", "Medium" and "Low". What if I want six groupings instead of just three? There are also a number of folks complaining about the lack of fail-back. Yes, there's no "big red button" to press to perform a fail-back, but most storage OEMs including EMC are providing documentation describing how to get this done. Finally, I'd like VMware to consider non-array-based replication capabilities. I don't think you'll replicate 20 VMs this way, but it sure would be nice for those one or two one-offs for which you don't want to replicate an entire LUN. I can also imagine customers with smaller implementations or those with non-supported back-end storage using this feature.

Because the POC exercise was a success it was easy to convince management to purchase the product. I think purchasing Site Recovery Manager is the best endorsement I can give it and VMware. Now I can't wait to see what the next version brings!

Additional Resources
SRM Product Site: http://www.vmware.com/products/srm
SRM Product Documentation: http://www.vmware.com/support/pubs/srm_pubs.html (The Getting Started PDF is particularly useful and pay attention to the compatibility matrix.)
SRM VMTN Forum: http://communities.vmware.com/community/vmtn/mgmt/srm
SRM Book: http://www.rtfm-ed.co.uk/?p=584 (Mike's blog is also a good one to watch.)
Storage OEM Docs: The EMC documentation can be obtained by registering on their Powerlink (http://powerlink.emc.com/) site and searching for "Site Recovery Manager". For other OEMs, contact your sales representative, search their web site or call support.

3 Comments Permalink

Home Lab Build – Part 1

Posted by Virtual_JTW Sep 8, 2008

My home lab has changed dramatically over the years – driven mostly by what I was working on at the time and the availability of hardware. I hadn’t updated my lab in quite a while so I decided it was time. I was also inspired by Chad’s post as to how cheaply I could build a server or two: http://virtualgeek.typepad.com/virtual_geek/2008/06/building-a-home.html

The Hardware

Motherboard: $30; ECS NFORCE6M-A rev 3.0 (http://www.newegg.com/Product/Product.aspx?Item=N82E16813135083). This thing is sweet – capable of 32GB! Not sure I’ll ever need that much but, wait… what am I saying; of course I’ll need that extra memory some day. ;)
CPU: $64; AMD Athlon 64 X2 4800+ Dual-Core 2.5GHz AM2 purchased from local computer shop.
RAM: $94.50; 4GB = 2x2GB DDR2-800 (PC2-6400) Supertalent also purchased from local computer shop. I prefer to buy locally when the price is within a few bucks of NewEgg.
Case w/PS: $39.50; ATX RAIDMAX Elite Black ATX/Micro ATX Case 380 watt power supply also purchased from local shop. Cheap case, thin metal – you get what you pay for especially when it comes to computer cases.
Video: $0; I had 2 cheapo SiS PCI cards lying around.
HDD: $0; I’m using ESXi so local storage is not necessary.
NIC: $0; I had 4 Intel 1000MT Server NICs I repurposed from other systems I’m not using. I put 2 in each server.
USB Key: $0; I had 2x2GBers I wasn’t using.
Total Spent = $228 per server. Not bad!

I decided to go with AMD to keep the costs down. Note that the motherboard has since been delisted at NewEgg. ECS has a similar model but it’s more expensive. Now to be fair, this means that all of your storage is going to have to be on a third server. I already had a storage server in my home lab but it needed some updating:

Motherboard: $0; P4 2.6GHz – repurposed from an older PC I wasn’t using.
CPU Heatsink: $14; purchased from local computer shop. I needed to replace the original because its fan exhaust, designed for the original case, pointed in the wrong direction.
Power Supply: $67; 580 watt from local shop.
HDD: $64; purchased another 250GB SATA3 drive to fill out my SATA RAID 4 port PCI adapter with 3 other drives.
Total Spent = $145

Hey, this is getting expensive! I sold some older systems and parts I wasn’t using on eBay to help cover some of the costs. The dominoes finally stopped falling.
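For anyone who wants to check the math, here is a quick tally of the numbers above. The prices come straight from the two parts lists; the combined lab total is my own derivation (two ESX servers plus the storage server upgrade) and does not subtract whatever the eBay sales brought in.

```python
# Prices as listed in the post, in US dollars ($0 items omitted).
esx_server_parts = {
    "motherboard": 30.00,
    "cpu": 64.00,
    "ram": 94.50,
    "case_and_power_supply": 39.50,
}
storage_upgrade_parts = {
    "cpu_heatsink": 14.00,
    "power_supply": 67.00,
    "hdd": 64.00,
}

ESX_SERVER_COUNT = 2  # the lab has two ESX servers

per_server = sum(esx_server_parts.values())
storage_total = sum(storage_upgrade_parts.values())
lab_total = ESX_SERVER_COUNT * per_server + storage_total

print(f"Per ESX server:  ${per_server:.2f}")
print(f"Storage upgrade: ${storage_total:.2f}")
print(f"Whole lab:       ${lab_total:.2f}")
```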

The Install

I installed Windows Server 2008 as the storage server OS on 2 RAID1/mirrored 160GB IDE drives. The 4 250GB HDDs are set up as a RAID5 logical drive.

For ESX, I used ESXi installed on a USB key per these instructions: http://communities.vmware.com/blogs/Knorrhane/2008/01/21/installing-esx-3i-on-usb-stick

Okay great, so now I have two ESX servers up and running and the backend storage running. Now I just need to create a datastore on the first ESX server. ESX supports Fibre Channel, iSCSI and NFS storage types. Microsoft provides good NFS support in Windows so that seems like the easiest way to go. I installed the File Services role in Windows Server 2008 and included the Services for Network File System role service, which provides Server for NFS.

Snag! Microsoft no longer provides User Name Mapping for NFS – you basically need to install the Unix integration component for Active Directory. Well, my domain controller was going to be installed in a VM. I can’t create a VM w/o storage, so now what?

Stay tuned for the answer in Part 2!

0 Comments Permalink

Break Like the Wind

Posted by Virtual_JTW Aug 26, 2008

References to Spinal Tap's great album aside, it's ironic: I'm working on VMware Site Recovery Manager product setup and configuration and the day I'm scheduled to fly out to Las Vegas for VMworld a mini-disaster strikes! It's Sunday, September 14th around noon and all is normal. However, the remnants of Hurricane Ike are heading this way. No big deal - a little rain, maybe a strong thunderstorm but nothing we haven't seen before.

I'm packing for a week at VMworld and need to hit the road by 3:30PM. Around 2:00PM we start to hear the whirling sound of wind racing across the roof. At about 3:15PM I'm packing up the car and debris is getting blown down the street. Before I leave I have to remove a large piece of cardboard from the front of my car. I've never seen anything like this!

Despite the high winds, I make it to the airport safely and notice planes are still taking off and landing. Listening to the radio on the way there I learned winds were reaching in excess of 80MPH and knocking down trees and power lines all across the state of Ohio. Dayton was impacted especially hard. I'm not sure how or why, but my plane took off successfully and it was a smooth ride once we were above the storm.

My house was without power for 4 days. Others had it worse with the outage lasting over 9 days. This kind of weather event hasn't happened in 200 years. Some very special, one-in-a-million chance conditions came together thanks, in part, to Hurricane Ike to cause extraordinarily high winds in our region that none of us had seen before.

So something that will never happen happened - a disaster occurred to our data center causing a multi-day loss in power. We have a natural-gas generator to cover a power outage. It kicked in and life is good, right? Wrong! We also have redundant AC units but only one works with the generator and the automatic fail-over didn't work due to a bug in the system (which has since been corrected). The room starts heating up and servers start shutting off as the temperature reached 90 degrees Fahrenheit. We reached 95-96F before a co-worker showed up and manually switched the AC units over (I can't do it - I'm on a plane, remember?). It took him twice as long to get there because of downed power lines and trees that closed roads.

He then starts powering up servers again. Luckily the outage for most systems is an hour or less on a Sunday when most of our users don't care or are being distracted by the tree that's landed in their living room. The ESX hosts and virtual machines all power-up successfully thanks in part to the hardware sensors on the servers that powered them off before the CPU, memory or I/O components fried in the heat.

While the outage was bad, it brought to light several interesting points:

  1. Test the equipment, but also test the fail-over of the equipment.
    Testing the actual fail-over is the hardest part of disaster recovery because it impacts production. However, regardless of whether it's AC units or virtual machines, this is the only way to be 100% certain your DR plan will work as designed and implemented.
  2. The quality of built-in server hardware sensors has increased dramatically in the last 7 years.
    This is the third time I've had servers in a room that overheated due to an AC outage. The previous two events were lab servers that did not recover very well. The hardware didn't shut down cleanly. Many systems were blue-screened if they were still running. When AC service was restored, some servers wouldn't power back up; others threw strange hardware-related errors months after the fact. Heat does bad things to electronics and I've seen too much of this first hand.
  3. Additional data center environmental monitoring and sensor devices are critically important.
    I have the fortune of working for a data center manager that had the foresight to install a Sensaphone remote monitoring device (http://www.sensaphone.com/). I'm sure there are other products on the market but this one works very well for us. It can call a list of numbers and speak the alert condition over the phone. The admin can then enter a code to stop it from calling the next number. It can monitor various conditions but in this case it called us to warn about the temperature. We also have an ADT monitoring unit but it doesn't seem to work as well.
  4. Data center protection is important in a disaster but also consider supporting non-data center work-related processes.
    This "mini-disaster" put us without power for days, yet the business needed to continue to function. We needed to process sales orders, purchase raw materials, process payroll, etc. Have you ever worked for a company that couldn't meet payroll for any reason? To say that employees get upset is an understatement. So when no-one has power, where does the accounting staff go to get their job done? Plan to provide facilities for personnel to process these kinds of essential functions. After-all, what good is making sure the payroll system is running when nobody can access it anyway?
  5. Consider specific disaster scenarios and plan accordingly.
    This may be the hardest thing to accomplish when planning for a disaster. Put two people in a room and they will have very different opinions on which scenario is more important than the other. The bottom line is you'll have to choose some number, say the top three, and plan for those. You should plan for something - define it but don't let it stall the progress of the project.
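The call-list behavior described in point 3 (call each number in turn until someone enters a code) can be sketched in a few lines. This is only an illustration of the escalation idea, not how the Sensaphone works internally; the 80F threshold, the phone numbers and the simulated phone system are all made-up assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Alert:
    sensor: str
    reading_f: float
    threshold_f: float


def check_temperature(reading_f: float, threshold_f: float = 80.0) -> Optional[Alert]:
    """Return an Alert when the reading meets or exceeds the threshold."""
    if reading_f >= threshold_f:
        return Alert("server-room-temp", reading_f, threshold_f)
    return None


def escalate(
    alert: Alert,
    call_list: Sequence[str],
    place_call: Callable[[str, Alert], bool],
) -> Optional[str]:
    """Call each number in order and stop at the first acknowledgement.

    place_call returns True when the person answering enters the code.
    Returns the acknowledging number, or None if nobody acknowledged.
    """
    for number in call_list:
        if place_call(number, alert):
            return number
    return None


# Simulated phone system (hypothetical): only the second contact picks up.
def fake_place_call(number: str, alert: Alert) -> bool:
    return number == "555-0102"


call_list = ["555-0101", "555-0102", "555-0103"]
alert = check_temperature(90.0)
acknowledged_by = escalate(alert, call_list, fake_place_call) if alert else None
print(f"Alert acknowledged by: {acknowledged_by}")
```

The useful property is the None case: if nobody on the list acknowledges, something else has to happen, which ties back to point 1 about testing the fail-over path itself.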

The power outage lasted around 72 hours, the service outage lasted less than an hour. Not bad overall! Now I'd better get VMware Site Recovery Manager working - had that generator stopped running...

0 Comments Permalink

Well, nothing much really, but I'll make a connection. Just bear with me...

I was walking through the toys section with my kids at Target yesterday when one of my sons spotted a toy he really wanted - a set of four trucks (they love trucks!). On the front of the package it read, "for ages 5 to 95". Now really, so a 96 year-old shouldn't play with these trucks?

I tend to find discussions on virtualization candidates just about as rational and definitely as funny. The debate on whether application XYZ can/should be virtualized is over. Sure there are still exceptions (unique hardware requirements, for example). And yes it depends on your environment (I wouldn't virtualize 3 Exchange 2003 mailbox servers across 2 ESX hosts sporting Pentium 4 CPUs with 1GB of RAM each). But for Virtual Infrastructure (VI) environments running on modern servers and back-end storage systems, there are very few physical servers that can't be virtualized.

If you buy into this "virtualize your datacenter" principle like I do, then are there really no applications off-limits? What about VMware's own products such as VirtualCenter? I know there are VI administrators out there that still refuse to virtualize the VirtualCenter Management Server (VCMS). I usually hear one of two reasons:

  1. "I'm freeing-up all of these physical servers and have one or two that I have to use for something."
  2. "VirtualCenter is becoming so critical that I can't afford it to go down or lose access."
But that's all wrong - 96 year olds can play with trucks! You virtualize the VCMS for the very same reasons you virtualize all of the other physical servers in your datacenter: to realize all of the benefits of VI. You know what they are but if you're not sure, please go to vmware.com to find out more.

To answer the above concerns: deploying a physical server to host a VI component sort of defeats the purpose, doesn't it? Won't deploying yet another physical server increase cooling cost? Power consumption? System maintenance? Etc, etc. And what about availability? I sometimes wonder if these administrators really understand VMware HA or the power of VMotion - virtualizing the VCMS should increase its availability compared to hosting it on a physical server.

Once VMware announced they fully supported running VirtualCenter in a virtual machine with the release of 2.5, I haven't looked back. I've implemented and supported VI environments for two different companies now with the VCMS running in a virtual machine. It's been two years and I have not heard any of what I would call "deal-killers" to this design decision. However, there is a short list of things that you should be aware of:

  • If you need to shut down the entire VI environment, you'll need to save the ESX host(s) that VC and its database server are running on for last. Then you'll need to log on to the hosts directly to complete the shutdown. This doesn't happen too often, but I've had to do this 2 or 3 times, usually due to a storage-related outage.
  • I've experienced brief 1-2 second pauses in the VMware Infrastructure Client (VIC) when the VCMS VM gets VMotioned from one host to another. Again, this rarely happens.
  • And here's a new one: As of Update 2, there's a new feature called Enhanced VMotion Compatibility (EVC). To enable this in my environment, VC requires all virtual machines in the cluster be powered-off. It might be hard to enable this feature in VC if the VCMS is powered-off(!). The solution to this isn't too painful, however: temporarily move the ESX server that hosts the VCMS VM out of the cluster, enable the feature then move it back.
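The first item in that list boils down to an ordering rule, which can be sketched as below. This only plans the order on paper; it does not talk to VirtualCenter or ESX, and the host and VM names are hypothetical.

```python
def shutdown_order(hosts, critical_vms):
    """Order hosts so those running VirtualCenter or its database go last.

    hosts maps a host name to the list of VMs running on it.
    critical_vms is the set of VM names that must stay up the longest.
    """
    ordinary = [name for name, vms in hosts.items() if not critical_vms & set(vms)]
    save_for_last = [name for name, vms in hosts.items() if critical_vms & set(vms)]
    return ordinary + save_for_last


# Hypothetical inventory: VC and its database live on different hosts.
hosts = {
    "esx01": ["vcms", "file-server"],
    "esx02": ["web-01", "web-02"],
    "esx03": ["vc-database", "print-server"],
    "esx04": ["test-01"],
}
critical_vms = {"vcms", "vc-database"}

order = shutdown_order(hosts, critical_vms)
print(" -> ".join(order))
```

The hosts at the end of the list are the ones you would log on to directly to finish the shutdown, since VirtualCenter is no longer available at that point.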

What if your VCMS VM does crash? If VirtualCenter does become unavailable, your VMs will continue to run. HA runs as an agent on each host, so that service will continue to run. Since you're probably running the FlexNet licensing service on the same VM as the VCMS, you'll have a grace period of 14 days to get the VM back up and running. If it takes you more than 14 days to get that VM back up and running, it's not very critical in your environment anyway.

For more information on this topic straight from the horse's mouth, please see: http://www.vmware.com/pdf/vi3_vc_in_vm.pdf

Still not convinced?
Leave a comment and let me know why.

0 Comments Permalink

There's nothing like going in to work Monday morning only to find that one of your ESX hosts is listed as "not responding" in VirtualCenter. Using HP's iLO, I tried restarting the management network. No change. The VMs were still running and functioning normally. The host was still running - there just seemed to be a communications problem between the host and VirtualCenter. After a quick call to VMware technical support, they had me restart the VirtualCenter server service and voila, communications were restored and the host's status in VirtualCenter returned to normal.

I didn't spend a lot of time doing a root-cause analysis as this environment was not in production yet. But I suspected there was a network interruption from which host-VC communications never recovered.

Now let me just say something about VMware technical support. I've worked support incidents with many hardware and software vendors over the years and have to say that VMware has their act together when it comes to product support. I'm not saying they're perfect, but I've received consistent quality support from these guys going back to my ESX Server 1.5 days. They're worth the money and I wouldn't run a Virtual Infrastructure environment without them.

So a week later, it happens again. I open another case with VMware referencing the previous. This time, restarting the management network or the VirtualCenter server service doesn't work. The support tech reviews some additional logs and is basically stumped. The only thing he had left for me to do was to manually shut down the VMs running on this host and reboot the host. This fixes the problem, but doesn't really explain why it happened in the first place. The tech is going to review a new set of logs I just uploaded and let me know if he finds anything. While there wasn't much more that could be done at this point, this always seems to me like a "don't call me, I'll call you" kind of resolution.

Before he has a chance to call me back, it happens a third time on the same host! Same symptoms, same results. The same tech doesn't find anything in the logs from the previous incident so he escalates to senior-level VirtualCenter support. We discovered a new symptom - the host seems to have lost connectivity with the storage, even though the VMs are still running fine (strange but true).

The senior tech said something that jogged my memory and I remembered that while this server survived our 4-day hardware burn-in test, we had problems connecting to the management console very early on to the point where we had to pull the USB key fob and reinstall it. (Keep in mind we're running ESXi.)

To be safe, I installed a new USB key fob and the problem has not occurred again in the roughly three weeks leading up to this writing. Moral of the story: don't automatically rule out the hardware even when the problem appears to be with the software.

0 Comments Permalink

For my employer, this is the year of disaster recovery. Almost all of our major projects tie-in to the goal of performing a successful DR test by the end of the year. Besides the standard IT things that have to get done on a regular basis (asset management, corporate application TLC, etc.), this goal is really driving the work we’re doing.

Sometime before I was hired, the company purchased two EMC Celerra NS352s for NAS/IP storage and two EMC Centeras for file and email archiving. We’re an HP shop so we’re using DL380 G5s with the little USB key inside running ESX 3i or ESXi or whatever it’s called today. We mostly use Cisco gear for networking and have dedicated switches for iSCSI and VMotion traffic. We have two of everything – our company’s IT services are split across two datacenters. Each datacenter will be a hot recovery site for the other.

So what does our DR solution entail? Well, some fairly advanced technologies:
Virtualization: VMware VI3, release 3.5
DR Automation: VMware Site Recovery Manager
Replication: EMC Celerra Replicator V2
Snapshot consistency: EMC Replicator

Oddly enough, looking at the list, only one of the technologies is shipping and in my possession today (VMware VI3). Hmmm… can you say, “Project risk”?

I first saw VMware Site Recovery Manager at a VMworld 2007 presentation. If it works, it will be impressive. Automating the steps to configure and power on VMs and a central place to store the DR “run book” will be sweet, to say the least.

One of the advantages of having EMC storage equipment is that they own VMware (or at least the controlling interest). This means there’s a pretty good chance that their storage platforms will be among the first certified to work with VMware products. Sure enough, my Celerras will work upon release of SRM with a firmware/code upgrade. The code is shipping on new product; however, EMC has a policy that delays certification by 90 days for installing/upgrading to current product. That puts us in the June time-frame.

The Celerra code upgrade provides a new version of Celerra Replicator that replicates iSCSI LUNs. To ensure application consistency for applications such as Exchange and SQL Server, EMC Replicator must be used.

This is the high-level plan: application-consistent snapshots, SAN/IP storage-based replication and SRM to run it all at the end. Yes, we have some physical servers too (HP-UX, AIX, etc.); too bad their story won’t be as interesting.

So while I’m waiting for product to GA, I’m trying to get our VI3 platform up and stable. Stay tuned for progress on that front.

11 Comments Permalink

Who is John Galt?

Posted by Virtual_JTW May 8, 2008

OK, now that I have your attention (and no, I’m definitely not John Galt): since this is my first post, I think it’s appropriate to give you a little background about myself. After all, why read and/or care about anything I have to say? ;)

I was introduced to VMware like so many others, through Workstation back in 2002. At the time I worked for a 2.8 billion dollar company and had just completed a major Microsoft SMS 2.0 implementation. I was working with a consultant that had it installed on his laptop PC and used it to demo the product developed by his employer.

So then it goes something like this:
That’s a cool product, let me show the boss!
Boss likes it and says, “Why don’t you do some more research on this VMware company and find out more?”
I do the research and discover VMware’s GSX and ESX product lines.
The rest is history. (Don’t you love that line?)

I procured a copy of GSX, set it up and created three virtual machines for one of our internal development teams.
Oh, and I never told anyone what I had done. (Bad boy!)

I used a whitepaper from VMware that described how to use the same base image for multiple VMs using redo logs. Worked great except when one needed to be rebooted = NTFS no likey.
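If redo logs are new to you, the idea is copy-on-write: every VM reads from one shared, read-only base image, and each VM's own writes land in its private redo log. The toy model below shows just that idea. It is a conceptual sketch with made-up class and block names, not VMware's actual on-disk format, and it doesn't try to reproduce the NTFS reboot problem.

```python
class RedoLogDisk:
    """Toy copy-on-write disk: a shared base image plus a private redo log."""

    def __init__(self, base_image):
        self._base = base_image  # shared by all VMs, never modified
        self._redo = {}          # block number -> data written by this VM

    def read(self, block):
        # Prefer this VM's own write; otherwise fall through to the base.
        return self._redo.get(block, self._base[block])

    def write(self, block, data):
        # Writes never touch the base image, only the redo log.
        self._redo[block] = data

    def discard_changes(self):
        # Throwing away the redo log returns the VM to the base image.
        self._redo.clear()


base_image = ("boot", "os", "apps")  # a tuple, so it cannot be altered

vm1 = RedoLogDisk(base_image)
vm2 = RedoLogDisk(base_image)

vm1.write(2, "apps + dev tools")
print(vm1.read(2))  # vm1 sees its own change
print(vm2.read(2))  # vm2 still sees the base image
```

Three VMs then cost one base image plus three small logs instead of three full disks, which is what made the approach attractive for a quiet pilot.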

The dev team never caught on and six months later I finally let them in on the secret. Virtual machine? What’s that? Etc, etc… (I hate giving end-users or customers reason, right or wrong, to blame all of their ills on me!)

Bottom line = pilot successful. From there I got ESX 1.52 approved (now end of 2003 or so), first one host, then two, then four - all with local disk.

Then VirtualCenter 1.0 and ESX 2.0 are released. Please Mr. IT Director, will you approve this requisition for a SAN? No. But please?
No.
Okay then how about you get fired and my team now reports to another director?
Will you, new Mr. IT Director, approve this SAN purchase?
Yes!

Hosts grow to 16 (mixture of IBM 440s, 445s, 3650s, BladeCenter blades); VMs to 500 (powered on) 600+ requisitioned; VMotions = 1000s, SAN (IBM DS4500) = 16TB; every VMworld = attended; VCP2 = achieved. Life is good.

It’s now 2006 and my employer gets acquired by another company. Oh yeah, and the new management doesn’t like any new technology much less virtualization.

Time to move on. So here I am, embarking on a new implementation using EMC Celerra IP storage (iSCSI), HP ProLiant DL380s, VMware VirtualCenter 2.5 and ESX 3i.

I thought I would use this blog to document the good, the bad and the ugly of this next gen VI platform and maybe share some tips along the way. The VMTN forums have been good to me over the years so maybe I can add to the discussion.

That’s it for now. I already have a list of technical stuff to publish. Stay tuned!

3 Comments Permalink