
Manual Automation


To give you a little background, I now have 6 ESX hosts running 58 VMs. Each host has dual iSCSI HBAs with 1GbE connections. All Exchange 2007 roles have been virtualized; however, only 1 of our 5 mailbox servers currently runs as a virtual machine. We have also virtualized a number of other workload types, including file, print, SQL and web servers.

Management has decided to stop virtualizing Exchange servers. Why? Fear generated by the FUD surrounding the performance characteristics of various storage transports, in this case iSCSI over GbE. The only way to fight FUD is with facts. To that end, I have performed some calculations in an attempt to answer 2 questions:

1. How well is our storage transport performing given current virtualized workloads?

2. How much "performance capacity" do we have remaining?


I added up the average bandwidth utilization of all 6 of my ESX hosts, which totaled 11008KBps. This converts to roughly 0.09Gbps out of 2Gbps, or 4.5% bandwidth utilization. I then added up the maximum utilization of all 6 ESX hosts, i.e. the high point of the peaks or bursts in utilization. The result was 0.48Gbps.


Assuming we can get 800Mbps of actual bandwidth per connection, we have 1.6Gbps of usable bandwidth. Note that based on VMware's testing we should be able to reach near wire speed (2Gbps) if the environment is configured correctly, which makes 1.6Gbps a conservative assumption.


So even if I use the maximum bandwidth measurement of 0.48Gbps, that leaves 1.12Gbps usable. Another way to state it is that my environment is reaching a maximum of 30% bandwidth utilization (0.48Gbps out of 1.6Gbps).
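To make the arithmetic above easy to re-run as the numbers change, here is a minimal sketch using the post's own figures. It assumes the reported "KBps" means kilobytes per second with 1 KB = 1024 bytes and that Gbps is decimal (10^9 bits per second); the function names are just for illustration.

```python
# A sketch of the bandwidth arithmetic above, using the post's own figures.
# Assumptions: the reported "KBps" means kilobytes per second with
# 1 KB = 1024 bytes, and Gbps is decimal (10**9 bits per second).

LINKS = 2                   # two 1GbE iSCSI connections
LINK_GBPS = 1.0             # nominal wire speed per connection
USABLE_PER_LINK_GBPS = 0.8  # conservative 800Mbps of real throughput per link


def kbytes_per_sec_to_gbps(kbytes_per_sec: float) -> float:
    """Convert kilobytes/s (1024-byte KB) to decimal gigabits/s."""
    return kbytes_per_sec * 1024 * 8 / 1e9


def utilization_report(avg_kbytes_per_sec: float, peak_gbps: float) -> dict:
    """Summarize average and peak load against wire speed and usable bandwidth."""
    wire_gbps = LINKS * LINK_GBPS
    usable_gbps = LINKS * USABLE_PER_LINK_GBPS
    avg_gbps = kbytes_per_sec_to_gbps(avg_kbytes_per_sec)
    return {
        "avg_gbps": avg_gbps,
        "avg_pct_of_wire": 100 * avg_gbps / wire_gbps,
        "usable_gbps": usable_gbps,
        "peak_headroom_gbps": usable_gbps - peak_gbps,
        "peak_pct_of_usable": 100 * peak_gbps / usable_gbps,
    }


if __name__ == "__main__":
    # Summed average (11008KBps) and summed peak (0.48Gbps) across all 6 hosts
    report = utilization_report(avg_kbytes_per_sec=11008, peak_gbps=0.48)
    for name, value in report.items():
        print(f"{name}: {value:.2f}")
```

With these inputs it reports about 0.09Gbps average (4.51% of the 2Gbps wire speed), 1.12Gbps of headroom at peak and 30% peak utilization of the usable 1.6Gbps. If the tool actually reports decimal kilobytes, the average drops slightly (about 0.088Gbps), which doesn't change the conclusion.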


The results seemed unbelievable to me at first, so I dug a little deeper:


  1. I found this in an EqualLogic presentation from 2005: "With 2 iSCSI connections and free NIC teaming, payload equals approx. 234 MB/s (1.96Gb/s) or 823GB/Hour. We found 2Gb FC delivers 196 MB/s which equals approx. 689GB/Hour payload." http://communities.vmware.com/servlet/JiveServlet/downloadBody/1806-102-1-1554/VMUG.ppt
  2. I found this in an iSCSI Virtualization whitepaper from 2007: "For high-performance, mission-critical servers, the cost of Fibre Channel is often justified, because Fibre Channel provides higher bandwidth (4 Gbps vs. 1 Gbps) and lower latency than IP networks. However, many environments are over-served by 4Gbps Fibre Channel links. This is particularly true for hosts running applications characterized by random traffic, such as database applications and Exchange."
    http://www.dell.com/downloads/global/products/pvaul/en/iscsi_virtualization.pdf
  3. And here's one from NetApp: "...based on deployments, Netapp has proven over the past 3 years that a scalable, simple to use array with enterprise class reliability can safely be the iSCSI platform for mission-critical applications. Exchange is a perfect example of a mission critical application that is routinely deployed over iSCSI these days."
    http://storagefoo.blogspot.com/2006/05/iscsi-performance-and-deployment.html
  4. Finally, VMware's own testing of storage protocols and their corresponding physical medium from this year: "This paper demonstrates that the four network storage connection options available to ESX Server are all capable of reaching a level of performance limited only by the media and storage devices."
    http://www.vmware.com/files/pdf/storage_protocol_perf.pdf
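As a quick sanity check on the EqualLogic numbers: they are internally consistent if the slide used binary megabytes (2^20 bytes), 1024MB per GB and decimal gigabits. That interpretation is my assumption; the presentation doesn't state its units.

```python
# Check the unit conversions in the EqualLogic quote.
# Assumption: binary megabytes (2**20 bytes), 1024 MB per GB, decimal gigabits.
MIB = 2**20
SECONDS_PER_HOUR = 3600


def mb_per_sec_to_gbps(mb_per_sec: float) -> float:
    """Binary MB/s -> decimal Gb/s."""
    return mb_per_sec * MIB * 8 / 1e9


def mb_per_sec_to_gb_per_hour(mb_per_sec: float) -> float:
    """Binary MB/s -> GB/hour, with 1024 MB per GB."""
    return mb_per_sec * SECONDS_PER_HOUR / 1024


for label, rate in [("2x GbE iSCSI", 234), ("2Gb FC", 196)]:
    print(f"{label}: {mb_per_sec_to_gbps(rate):.2f} Gb/s, "
          f"{mb_per_sec_to_gb_per_hour(rate):.0f} GB/hour")
```

Under that assumption, 234 MB/s works out to 1.96Gb/s and 823GB/hour, and 196 MB/s to 689GB/hour, matching the quoted figures.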
It's important to note that I'm leaving 2 things out of this analysis:
1. I often read that FC has lower latency than IP. My belief, based on experience rather than hard measurements, is that IP's additional latency will not be a big factor in the equation.
2. I've read several sources stating that disk IOPS matter more to system performance than storage transport bandwidth utilization.
I'm still looking for a way to quantify these factors to better predict the performance characteristics of our IP storage implementation. This is the first part of what I'm sure will be an ongoing investigation. It sure would be nice to have a tool that did all of this for me! I have yet to find one that's comprehensive enough for any of the storage platforms I've managed (IBM DS, EMC Celerra, et al.).

Also note that I've been monitoring my bandwidth utilization more closely using VKernel's Capacity Analyzer and can safely say that 11008KBps is on the high side: it's dropped 30-40% over the last two months for various reasons.


Next month I hope to enable jumbo frames in this environment and expect to see some additional performance gain. I'm considering capturing before/after snapshots of various performance metrics and posting the results in a future blog post.


In conclusion, this analysis makes me even more confident in the performance of our ESX hosts and our virtual infrastructure's back-end storage transport, even if/when I get to virtualize the remaining Exchange mailbox servers.


Break Like the Wind

Posted by Virtual_JTW Oct 6, 2008


References to Spinal Tap's great album aside, it's ironic: I'm working on setting up and configuring VMware Site Recovery Manager, and on the day I'm scheduled to fly out to Las Vegas for VMworld, a mini-disaster strikes! It's Sunday, September 14th, around noon and all is normal. However, the remnants of Hurricane Ike are heading this way. No big deal - a little rain, maybe a strong thunderstorm, but nothing we haven't seen before.

I'm packing for a week at VMworld and need to hit the road by 3:30PM. Around 2:00PM we start to hear the whirling sound of wind racing across the roof. At about 3:15PM I'm packing up the car and debris is getting blown down the street. Before I leave I have to remove a large piece of cardboard from the front of my car. I've never seen anything like this!

Despite the high winds, I make it to the airport safely and notice planes are still taking off and landing. Listening to the radio on the way there, I learned winds were exceeding 80MPH and knocking down trees and power lines all across the state of Ohio. Dayton was hit especially hard. I'm not sure how or why, but my plane took off successfully and it was a smooth ride once we were above the storm.

My house was without power for 4 days; others had it worse, with outages lasting over 9 days. This kind of weather event hasn't happened in 200 years. Some very special, one-in-a-million conditions came together, thanks in part to Hurricane Ike, to cause extraordinarily high winds in our region that none of us had seen before.

So something that will never happen happened: a disaster struck our data center, causing a multi-day loss of power. We have a natural-gas generator to cover a power outage. It kicked in, so life is good, right? Wrong! We also have redundant AC units, but only one works with the generator, and the automatic fail-over didn't work due to a bug in the system (which has since been corrected). The room starts heating up and servers start shutting off as the temperature reaches 90 degrees Fahrenheit. It reached 95-96F before a co-worker showed up and manually switched the AC units over (I can't do it - I'm on a plane, remember?). It took him twice as long as usual to get there because downed power lines and trees had closed roads.

He then starts powering servers up again. Luckily, the outage for most systems is an hour or less, on a Sunday when most of our users don't care or are distracted by the tree that's landed in their living room. The ESX hosts and virtual machines all power up successfully, thanks in part to the servers' hardware sensors, which powered them off before the CPU, memory or I/O components fried in the heat.

While the outage was bad, it brought to light several interesting points:

  1. Test the equipment, but also test the fail-over of the equipment.
    Testing the actual fail-over is the hardest part of disaster recovery because it impacts production. However, whether it's AC units or virtual machines, this is the only way to be 100% certain your DR plan will work as designed and implemented.
  2. The quality of built-in server hardware sensors has increased dramatically in the last 7 years.
    This is the third time I've had servers in a room that overheated due to an AC outage. In the previous two events, lab servers did not recover very well. The hardware didn't shut down cleanly, and systems that were still running blue-screened. When AC service was restored, some servers wouldn't power back up; others threw strange hardware-related errors months after the fact. Heat does bad things to electronics and I've seen too much of this first hand.
  3. Additional data center environmental monitoring and sensor devices are critically important.
    I have the good fortune of working for a data center manager who had the foresight to install a Sensaphone remote monitoring device (http://www.sensaphone.com/). I'm sure there are other products on the market, but this one works very well for us. It can call a list of numbers and speak the alert condition over the phone; the admin can then enter a code to stop it from calling the next number. It can monitor various conditions, but in this case it called us to warn about the temperature. We also have an ADT monitoring unit, but it doesn't seem to work as well.
  4. Data center protection is important in a disaster but also consider supporting non-data center work-related processes.
    This "mini-disaster" left us without power for days, yet the business needed to continue to function. We needed to process sales orders, purchase raw materials, process payroll, etc. Have you ever worked for a company that couldn't meet payroll for any reason? To say that employees get upset is an understatement. So when no one has power, where does the accounting staff go to get their job done? Plan to provide facilities where personnel can carry out these essential functions. After all, what good is making sure the payroll system is running when nobody can access it anyway?
  5. Consider specific disaster scenarios and plan accordingly.
    This may be the hardest thing to accomplish when planning for a disaster. Put two people in a room and they will have very different opinions on which scenario is most important. The bottom line is you'll have to choose some number, say the top three, and plan for those. You should plan for something: define it, but don't let it stall the progress of the project.
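The call-list behavior described in item 3 is a simple escalation loop. Here is a minimal sketch of that logic; `place_call` is a hypothetical stand-in for the dialer, not Sensaphone's actual interface, and the sketch makes a single pass through the list without modeling retries.

```python
from typing import Callable, List, Optional


def escalate(contacts: List[str],
             place_call: Callable[[str, str], bool],
             alert: str) -> Optional[str]:
    """Call contacts in order, speaking the alert, until one acknowledges.

    `place_call` is a hypothetical stand-in for the dialer: it returns True
    when the person who answered entered the acknowledgment code.
    Returns the contact who acknowledged, or None if nobody did.
    """
    for contact in contacts:
        if place_call(contact, alert):
            return contact  # acknowledged: don't call anyone further down the list
    return None  # single pass only; retry/repeat behavior is not modeled


if __name__ == "__main__":
    # Hypothetical contacts and answers, for illustration only
    answers = {"on-call admin": False, "backup admin": True, "DC manager": True}

    def fake_call(contact: str, message: str) -> bool:
        print(f"Calling {contact}: {message}")
        return answers[contact]

    who = escalate(list(answers), fake_call, "Server room temperature high")
    print(f"Acknowledged by: {who}")
```

In the example run, the on-call admin doesn't acknowledge, the backup admin does, and the DC manager is never called.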

The power outage lasted around 72 hours, the service outage lasted less than an hour. Not bad overall! Now I'd better get VMware Site Recovery Manager working - had that generator stopped running...


Home Lab Build – Part 1

Posted by Virtual_JTW Sep 8, 2008

My home lab has changed dramatically over the years – driven mostly by what I was working on at the time and the availability of hardware. I hadn’t updated my lab in quite a while so I decided it was time. I was also inspired by Chad’s post as to how cheaply I could build a server or two: http://virtualgeek.typepad.com/virtual_geek/2008/06/building-a-home.html

The Hardware

Motherboard: $30; ECS NFORCE6M-A rev 3.0 (http://www.newegg.com/Product/Product.aspx?Item=N82E16813135083). This thing is sweet – capable of 32GB! Not sure I’ll ever need that much but, wait… what am I saying; of course I’ll need that extra memory some day. ;)
CPU: $64; AMD Athlon 64 X2 4800+ Dual-Core 2.5GHz AM2, purchased from a local computer shop.
RAM: $94.50; 4GB (2x2GB Super Talent DDR2-800/PC2-6400), also purchased from the local computer shop. I prefer to buy locally when the price is within a few bucks of NewEgg’s.
Case w/PS: $39.50; RAIDMAX Elite Black ATX/Micro ATX case with a 380 watt power supply, also purchased from the local shop. Cheap case, thin metal – you get what you pay for, especially when it comes to computer cases.
Video: $0; I had 2 cheapo SiS PCI cards lying around.
HDD: $0; I’m using ESXi so local storage is not necessary.
NIC: $0; I had 4 Intel 1000MT Server NICs I repurposed from other systems I’m not using. I put 2 in each server.
USB Key: $0; I had 2x2GBers I wasn’t using.
Total Spent = $228 per server. Not bad!

I decided to go with AMD to keep costs down. Note that the motherboard has since been delisted at NewEgg; ECS has a similar model, but it’s more expensive. Now to be fair, since these servers have no local disks, all of your storage is going to have to be on a third server. I already had a storage server in my home lab, but it needed some updating:

Motherboard: $0; P4 2.6GHz – repurposed from an older PC I wasn’t using.
CPU Heatsink: $14; purchased from the local computer shop. I needed to replace the original because its fan exhaust pointed in the wrong direction for the design of the original case.
Power Supply: $67; 580 watt from local shop.
HDD: $64; purchased another 250GB SATA3 drive to fill the last port of my 4-port SATA RAID PCI adapter (the other 3 ports already had drives).
Total Spent = $145
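For anyone pricing out a similar lab, here is a quick tally of the two parts lists above. It uses `Decimal` to avoid floating-point rounding on the cents; the dictionary keys are just labels I chose, and the $0 items are the reused parts.

```python
from decimal import Decimal

# Parts lists from this post, in USD; $0 items were reused hardware.
ESXI_SERVER = {
    "motherboard": Decimal("30"),
    "cpu": Decimal("64"),
    "ram": Decimal("94.50"),
    "case_psu": Decimal("39.50"),
    "video": Decimal("0"),
    "hdd": Decimal("0"),
    "nics": Decimal("0"),
    "usb_key": Decimal("0"),
}
STORAGE_UPGRADE = {
    "motherboard": Decimal("0"),
    "cpu_heatsink": Decimal("14"),
    "power_supply": Decimal("67"),
    "hdd": Decimal("64"),
}

per_server = sum(ESXI_SERVER.values())
storage = sum(STORAGE_UPGRADE.values())
lab_total = 2 * per_server + storage  # two ESXi servers plus the storage refresh

print(f"Per ESXi server: ${per_server}")
print(f"Storage server:  ${storage}")
print(f"Lab total:       ${lab_total}")
```

Both per-list totals match the ones above ($228 per server, $145 for storage), and the whole refresh comes to $601 before any eBay sales.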

Hey, this is getting expensive! I sold some older systems and parts I wasn’t using on eBay to help cover some of the costs. The dominoes finally stopped falling.

The Install

I installed Windows Server 2008 as the storage server OS on 2 mirrored (RAID 1) 160GB IDE drives. The 4 250GB HDDs are set up as a RAID 5 logical drive.
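A minimal sketch of the usable capacity this layout gives, assuming equal-size drives and ignoring formatting overhead and the GB/GiB difference: RAID 1 keeps one drive's worth of space, and RAID 5 gives up one drive's worth to parity.

```python
def raid1_usable_gb(drive_gb: float, drives: int = 2) -> float:
    """RAID 1 mirrors every drive, so usable space is one drive's worth."""
    if drives < 2:
        raise ValueError("RAID 1 needs at least 2 drives")
    return drive_gb


def raid5_usable_gb(drive_gb: float, drives: int) -> float:
    """RAID 5 spends one drive's worth of space on distributed parity."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (drives - 1) * drive_gb


if __name__ == "__main__":
    print(f"OS mirror: {raid1_usable_gb(160)}GB")       # 2 x 160GB in RAID 1
    print(f"Data array: {raid5_usable_gb(250, 4)}GB")   # 4 x 250GB in RAID 5
```

That works out to 160GB for the OS mirror and roughly 750GB for the RAID 5 data volume, before filesystem overhead.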

For ESX, I used ESXi installed on a USB key per these instructions: http://communities.vmware.com/blogs/Knorrhane/2008/01/21/installing-esx-3i-on-usb-stick

Okay, great: now I have two ESX servers and the back-end storage up and running. Now I just need to create a datastore on the first ESX server. ESX supports Fibre Channel, iSCSI and NFS storage. Microsoft provides good NFS support in Windows, so that seems like the easiest way to go. I installed the File Services role in Windows Server 2008 and included the Server for NFS component.

Snag! Microsoft no longer provides User Name Mapping for NFS; you basically need to install the Unix integration component for Active Directory. Well, my domain controller was going to be installed in a VM. I can’t create a VM without storage, so now what?

Stay tuned for the answer in Part 2!



Well, nothing much really, but I'll make a connection. Just bear with me...

I was walking through the toy section with my kids at Target yesterday when one of my sons spotted a toy he really wanted: a set of four trucks (they love trucks!). On the front of the package it read, "for ages 5 to 95". Now really, so a 96-year-old shouldn't play with these trucks?

I tend to find discussions of virtualization candidates just about as rational and definitely as funny. The debate on whether application XYZ can/should be virtualized is over. Sure, there are still exceptions (unique hardware requirements, for example). And yes, it depends on your environment (I wouldn't virtualize 3 Exchange 2003 mailbox servers across 2 ESX hosts sporting Pentium 4 CPUs with 1GB of RAM each). But for Virtual Infrastructure (VI) environments running on modern servers and back-end storage systems, there are very few physical servers that can't be virtualized.

If you buy into this "virtualize your datacenter" principle like I do, then are there really no applications off-limits? What about VMware's own products such as VirtualCenter? I know there are VI administrators out there that still refuse to virtualize the VirtualCenter Management Server (VCMS). I usually hear one of two reasons:

  1. "I'm freeing-up all of these physical servers and have one or two that I have to use for something."
  2. "VirtualCenter is becoming so critical that I can't afford it to go down or lose access."
But that's all wrong - 96-year-olds can play with trucks! You virtualize the VCMS for the very same reasons you virtualize all of the other physical servers in your datacenter: to realize all of the benefits of VI. You know what they are, but if you're not sure, please go to vmware.com to find out more.

To answer the above concerns: deploying a physical server to host a VI component sort of defeats the purpose, doesn't it? Won't deploying yet another physical server increase cooling costs? Power consumption? System maintenance? Etc., etc. And what about availability? I sometimes wonder if these administrators really understand VMware HA or the power of VMotion; virtualizing the VCMS should increase its availability compared to hosting it on a physical server.

Since VMware announced full support for running VirtualCenter in a virtual machine with the release of 2.5, I haven't looked back. I've implemented and supported VI environments for two different companies now with the VCMS running in a virtual machine. It's been two years and I have not heard of any what I would call "deal-killers" to this design decision. However, there is a short list of things you should be aware of:

  • If you need to shut down the entire VI environment, you'll need to leave the ESX host(s) that VC and its database server run on for last. Then you'll need to log on to those hosts directly to complete the shutdown. This doesn't happen too often, but I've had to do it 2 or 3 times, usually due to a storage-related outage.
  • I've experienced brief 1-2 second pauses in the VMware Infrastructure Client (VIC) when the VCMS VM gets VMotioned from one host to another. Again, this rarely happens.
  • And here's a new one: as of Update 2, there's a new feature called Enhanced VMotion Compatibility (EVC). To enable it in my environment, VC requires all virtual machines in the cluster to be powered off. It might be hard to enable this feature in VC if the VCMS is powered off(!). The solution isn't too painful, however: temporarily move the ESX server that hosts the VCMS VM out of the cluster, enable the feature, then move the server back.
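The first bullet boils down to an ordering rule. Here is a minimal sketch that works from a hypothetical host-to-VM inventory; the names and the dictionary structure are mine for illustration, not anything VirtualCenter exposes. The hosts at the end of the list are the ones you'd finish by logging on to them directly.

```python
from typing import Dict, List, Set


def shutdown_order(inventory: Dict[str, List[str]],
                   critical_vms: Set[str]) -> List[str]:
    """Order hosts so any host running a critical VM (e.g. VirtualCenter or
    its database server) comes last."""
    def runs_critical_vm(host: str) -> bool:
        return any(vm in critical_vms for vm in inventory[host])

    # sorted() is stable and False sorts before True, so non-critical hosts
    # come first and each group keeps its original order.
    return sorted(inventory, key=runs_critical_vm)


if __name__ == "__main__":
    # Hypothetical host and VM names, for illustration only
    inventory = {
        "esx01": ["file01", "vc01"],
        "esx02": ["web01"],
        "esx03": ["vcdb01"],
        "esx04": ["print01"],
    }
    print(shutdown_order(inventory, {"vc01", "vcdb01"}))
```

Here esx02 and esx04 go down first (through VC), while esx01 and esx03, which host VC and its database, are left for last.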

What if your VCMS VM does crash? If VirtualCenter becomes unavailable, your VMs will continue to run. HA runs as an agent on each host, so that service will continue to run too. Since you're probably running the FlexNet licensing service on the same VM as the VCMS, you'll have a 14-day grace period to get the VM back up and running. If it takes you more than 14 days to get that VM back up, it's not very critical in your environment anyway.

For more information on this topic straight from the horse's mouth, please see: http://www.vmware.com/pdf/vi3_vc_in_vm.pdf

Still not convinced?
Leave a comment and let me know why.



There's nothing like going in to work Monday morning only to find that one of your ESX hosts is listed as "not responding" in VirtualCenter. Using HP's iLO, I tried restarting the management network. No change. The VMs were still running and functioning normally. The host was still running - there just seemed to be a communications problem between the host and VirtualCenter. After a quick call to VMware technical support, they had me restart the VirtualCenter server service and voila, communications were restored and the host's status in VirtualCenter returned to normal.

I didn't spend a lot of time doing a root-cause analysis as this environment was not in production yet. But I suspected there was a network interruption from which host-VC communications never recovered.

Now let me just say something about VMware technical support. I've worked support incidents with many hardware and software vendors over the years and have to say that VMware has their act together when it comes to product support. I'm not saying they're perfect, but I've received consistent quality support from these guys going back to my ESX Server 1.5 days. They're worth the money and I wouldn't run a Virtual Infrastructure environment without them.


So a week later, it happens again. I open another case with VMware, referencing the previous one. This time, restarting the management network or the VirtualCenter server service doesn't work. The support tech reviews some additional logs and is basically stumped. The only thing he had left for me to try was to manually shut down the VMs running on this host and reboot the host. This fixes the problem, but doesn't really explain why it happened in the first place. The tech is going to review a new set of logs I just uploaded and let me know if he finds anything. While there wasn't much more that could be done at this point, this always seems to me like a "don't call me, I'll call you" kind of resolution.


Before he has a chance to call me back, it happens a third time on the same host! Same symptoms, same results. The same tech doesn't find anything in the logs from the previous incident so he escalates to senior-level VirtualCenter support. We discovered a new symptom - the host seems to have lost connectivity with the storage, even though the VMs are still running fine (strange but true).


The senior tech said something that jogged my memory: while this server survived our 4-day hardware burn-in test, we had problems connecting to its management console very early on, to the point where we had to pull the USB key fob and reinstall it. (Keep in mind we're running ESXi.)


To be safe, I installed a new USB key fob, and the problem has not recurred; as of this writing, it's been about three weeks. Moral of the story: don't automatically rule out the hardware, even when the problem appears to be with the software.


Who is John Galt?

Posted by Virtual_JTW May 8, 2008

Okay, now that I have your attention (and no, I’m definitely not John Galt): since this is my first post, I think it’s appropriate to give you a little background about myself. After all, why read and/or care about anything I have to say? ;)

I was introduced to VMware like so many others: through Workstation, back in 2002. At the time I worked for a $2.8 billion company and had just completed a major Microsoft SMS 2.0 implementation. I was working with a consultant who had Workstation installed on his laptop and used it to demo the product his employer developed.

So then it goes something like this:
That’s a cool product, let me show the boss!
Boss likes it and says, “Why don’t you do some more research on this VMware company and find out more?”
I do the research and discover VMware’s GSX and ESX product lines.
The rest is history. (Don’t you love that line?)

I procured a copy of GSX, set it up and created three virtual machines for one of our internal development teams.
Oh, and I never told anyone what I had done. (Bad boy!)

I used a whitepaper from VMware that describes how to use the same base image for multiple VMs using redo logs. Worked great except when one needed to be rebooted = NTFS no likey.

The dev team never caught on and six months later I finally let them in on the secret. Virtual machine? What’s that? Etc, etc… (I hate giving end-users or customers reason, right or wrong, to blame all of their ills on me!)

Bottom line = pilot successful. From there I got ESX 1.52 approved (now end of 2003 or so), first one host, then two, then four - all with local disk.

Then VirtualCenter 1.0 and ESX 2.0 are released. Please Mr. IT Director, will you approve this requisition for a SAN? No. But please?
No.
Okay then how about you get fired and my team now reports to another director?
Will you, new Mr. IT Director, approve this SAN purchase?
Yes!

Hosts grow to 16 (mixture of IBM 440s, 445s, 3650s, BladeCenter blades); VMs to 500 (powered on) 600+ requisitioned; VMotions = 1000s, SAN (IBM DS4500) = 16TB; every VMworld = attended; VCP2 = achieved. Life is good.

It’s now 2006 and my employer gets acquired by another company. Oh yeah, and the new management doesn’t like any new technology much less virtualization.

Time to move on. So here I am, embarking on a new implementation using EMC Celerra IP storage (iSCSI), HP ProLiant DL380s, VMware VirtualCenter 2.5 and ESX 3i.

I thought I would use this blog to document the good, the bad and the ugly of this next gen VI platform and maybe share some tips along the way. The VMTN forums have been good to me over the years so maybe I can add to the discussion.

That’s it for now. I already have a list of technical stuff to publish. Stay tuned!
