
     In vSphere 6, the Platform Services Controller (PSC) contains a number of pieces of functionality including the SSO components, certificates, licensing, and other core parts that make vCenter function. Because of its importance, it is possible to join external PSCs that share the same SSO domain in a replication topology whereby their states are kept synchronized. When a change is made to, for example, a vSphere tag, that change is asynchronously replicated to the other member(s). In the instance where a single PSC failed, vCenter could be manually repointed to another PSC that was an identical copy of the one that failed. Use of a load balancer in front of replicating PSCs can even provide automatic PSC switching so a manual repointing is unnecessary. This is all well and good and tends to work just swimmingly; however, what happens when PSC replication fails? Bad things happen, that’s what. The next logical question then becomes, “how can I know and be informed when my PSCs have stopped replicating?” To this, unfortunately, there doesn’t seem to be an out-of-the-box answer displayed in any GUI present in vSphere 6.5. Although you can set up replication through the GUI-driven installer when deploying PSCs, that’s basically the end of the insight into how replication is actually functioning. And when vCenter is pointed at a PSC with a replica partner, the additional PSC shows up under System Configuration in the web client yet does not inform you of replication success or failure.

 

 

Clearly, there is room for improvement here, and like the majority of my articles I want to try and find a solution where one currently doesn’t exist. In this one, I’ll show you how you can use a combination of Log Insight and vRealize Operations Manager to be informed when your PSCs stop replicating.

 

            In my lab I’ve set up two PSCs in the same site and same SSO domain (vsphere.local) which are replica partners. I have a single vCenter currently pointed at PSC-01. When I make a change to any of the components managed by the PSC through vCenter (or even PSC-01’s UI at /psc), that change is propagated to PSC-02. This replication happens about every 30 seconds. By logging into PSC-01, we can interrogate the current node for its replication status using the vdcrepadmin tool. Since it is not in $PATH, it has to be called by its full path.

 

root@psc-01 [ / ]# /usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w VMware1!

Partner: psc-02.zoller.com

Host available: Yes

Status available: Yes

My last change number:             1632

Partner has seen my change number: 1632

Partner is 0 changes behind.

 

From this message we can see its replica partner (psc-02.zoller.com), if it’s currently available, the last numbered change seen by the source PSC, the last numbered change seen by the replica partner, and then how many changes behind that represents. When everything is functioning properly, you should see something like above. If you were to run the same command on PSC-02, you’d get very much the same response minus the change numbers.
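
As an aside, if you want to spot-check replication health from a management workstation rather than logging into each node, the same vdcrepadmin call can be wrapped in a bit of PowerShell over SSH. The sketch below is only an illustration: it assumes the community Posh-SSH module, root SSH access to the PSC, and placeholder hostnames and credentials.

# Minimal sketch: run vdcrepadmin on a PSC over SSH and flag any unavailable partner.
# Assumes the Posh-SSH module (Install-Module Posh-SSH) and root SSH enabled on the PSC.
Import-Module Posh-SSH

$psc  = 'psc-01.zoller.com'   # PSC to interrogate (placeholder)
$cred = Get-Credential -UserName 'root' -Message "Root password for $psc"

$session = New-SSHSession -ComputerName $psc -Credential $cred -AcceptKey
$cmd     = '/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w VMware1!'
$result  = Invoke-SSHCommand -SSHSession $session -Command $cmd
Remove-SSHSession -SSHSession $session | Out-Null

# Print the raw status and warn if any partner is reported as unavailable.
$result.Output
if ($result.Output -match 'Host available:\s+No') {
    Write-Warning "$psc reports a replication partner as unavailable!"
}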

 

Now, if I go over to PSC-02 and stop vmdird, the main service responsible for replication, and re-run the vdcrepadmin command on PSC-01, the message would look like the following.

 

Partner: psc-02.zoller.com

Host available: No

 

And if we look at the log at /var/log/vmware/vmdird/vmdird-syslog.log, we see corresponding failure messages.

 

17-11-29T14:48:00.420592+00:00 err vmdird  t@140610837108480: VmDirSafeLDAPBind to (ldap://psc-02.zoller.com:389) failed. SRP(9127)

 

Yet despite replication obviously failing, vCenter shows no change in status. Although there is a pre-built alarm in vCenter called “PSC Service Health Alarm”, it only applies to the source PSC (where vCenter is currently pointed) and has no knowledge of replication status. Total bummer as you’d really hope to see something trigger inside vCenter. Maybe one day. (Cue sad face and violins.)

 

Anyhow, if vCenter won’t do it for you, we’ll do it ourselves. Since we know the logged messages that occur when failure is upon us, we can use Log Insight to notify us. And, furthermore, if we integrate Log Insight with vROps, we can send that alert to vROps and have it associated to the correct PSC virtual machine. For this to work, we obviously need both applications, and we also need to integrate them. See some of my other articles that cover this as I won’t spend time on it here, but it’s a pretty simple process.

 

After they’re stitched together, we need to get logs from our PSCs over to Log Insight. Log into the VAMI for each PSC on port 5480. Plug in your vRLI host in the Syslog Configuration portion as shown below.

 

 

Do this for both PSCs.

 

NOTE: Although the Log Insight agent can also be installed on the PSCs to provide better logging, it is not required to get visibility into the replication process; plain syslog forwarding is enough.

 

Verify from your vRLI server that logs are indeed being received from those sources. They should show up in Interactive Analytics and under Administration -> Hosts.

 

 

Also confirm that the vmdird log stream is coming in. Since Log Insight already extracts the proper fields, we have the convenience of just asking for that stream directly in a query.

 

 

Change the resolution to a longer period to ensure logs are there. And if you want to watch how logs change based on activities you perform on the PSC, try to create a new user in the internal SSO domain, add a license, or create a vSphere tag. Update the query you’ve built and see what you get.

 

 

 

Reading from the bottom up, you can see I added a new user, and then about 15 seconds later that change was replicated to the peer PSC, as evidenced by the top line.

 

Once we have verified good logs, we can create an alert based on them. Stop vmdird on the PSC replica and watch the failure logs come rolling in.

 

root@psc-02 [ ~ ]# service-control --status

Running:

applmgmt lwsmd pschealth vmafdd vmcad vmdird vmdnsd vmonapi vmware-cis-license vmware-cm vmware-psc-client vmware-rhttpproxy vmware-sca vmware-statsmonitor vmware-sts-idmd vmware-stsd vmware-vapi-endpoint vmware-vmon

root@psc-02 [ ~ ]# service-control --stop vmdird

Perform stop operation. vmon_profile=None, svc_names=['vmdird'], include_coreossvcs=False, include_leafossvcs=False

2017-11-29T15:12:57.617Z   Service vmdird does not seem to be registered with vMon. If this is unexpected please make sure your service config is a valid json. Also check vmon logs for warnings.

2017-11-29T15:12:57.617Z   Running command: ['/sbin/service', u'vmdird', 'status']

2017-11-29T15:12:57.655Z   Done running command

Successfully stopped service vmdird

 

 

We can see the failure messages now, and from these we can create a new alert. Highlight the “VmDirSafeLDAPBind” part and add it to the query.

 

 

Now, highlight “failed” and do the same. Put them on two separate lines because entries on the same line are effectively an OR statement. Your built query should look like the following.

 

Remove the hostname portion so as to make this alert more general. Now, on the right-hand side, create an alert from this query.

 

Complete the alert definition including description and recommendation as these are fields that will show up when we forward it to vROps.

 

 

 

Check the box to Send to vRealize Operations Manager and specify a fallback object. This is the object with which vROps will associate the alert if it cannot find the source. Set the criticality to your liking. Finally, on the Raise an alert panel at the bottom, select the last radio button and choose “more than” with a number less than 20 in the last 5 minutes. Since the PSCs replicate approximately every 30 seconds and two messages match the query per cycle, a sustained failure produces about 20 matching entries in a 5-minute window (10 cycles x 2 messages), so you want a threshold under that if your timeline is 5 minutes. Save the alert and be sure to enable it.

 

 

 

With the alert configured, saved, and enabled, let’s give it a try and see what happens. On your PSC replica, again stop vmdird with service-control --stop vmdird and wait a couple minutes. Flip over to vROps and go to Alerts.

 

 

Great! The alert fired, got sent to vROps, and was even correctly associated with PSC-01, the PSC from which the errors were observed. The description and recommendations that were configured for the alert also show up.

 

 

Now you’ll have some peace of mind knowing whether your replication is working properly and, if it isn’t, when it broke. I’m also providing my Log Insight alert definition that you can easily import if you’d prefer not to create your own. So hopefully this makes you sleep a little bit better knowing that you won’t be caught off guard if you need to repoint your PSC only to find out it’s broken.

Introduction

 

     The topic of vSphere upgrades is a hot one with every new release of VMware’s flagship platform. Architectures change, new features are introduced, old ones are removed, and so everyone is scrambling to figure out how to move to the next version and what process they should use. There are generally two approaches when it comes to vSphere upgrades: in-place upgrade or migrate. In the in-place upgrade process, the existing vCenter Server is preserved and transformed into the new version while in the migration method, new resources are provisioned using the new version which then take over from the old resources. Primarily the new resources consist of the vCenter Server and its accouterments while ESXi hosts are simply moved over to it and then upgraded. Therefore, both strategies see ESXi hosts being upgraded in-place. While there are pros and cons to each approach, I want to explore the migration method in particular since this is a question I often get from customers and the community at large. In addition, the in-place upgrade approach is fairly well documented with steps and procedures from VMware while the migration method receives little, if any, attention. Let’s go through the process of the migration method and discuss how it works, what’s involved, and the gotchas of which to be cognizant.

 

Why Migrate?

 

     Upgrading vSphere is no simple task regardless how you go about it. Although VMware has done a good job of making this process easier and more reliable, there are still a number of things you as an engineer or administrator are responsible for doing to ensure it ultimately succeeds. Before deciding if you want to go straight to a migration rather than in-place upgrade, we need to lay out the pros and cons of each. Here’s a table which has the most salient points.

 

In-place Upgrade vs Migrate Pros and Cons

In-place Upgrade
  Pros:
  • Preserves config and data
  • Can be quicker
  • Some solutions carry over
  • No reconfig of external apps
  • Convenient utility for vCSA
  Cons:
  • Preserves unoptimal config
  • Config has legacy settings
  • Higher risk of failure
  • Future risk of breaking
  • Can’t change architecture

Migrate
  Pros:
  • New config from scratch
  • Best practice settings default
  • More controlled process
  • Less risk of failure
  • High chance of future upgrade readiness
  • Ability to change architecture
  Cons:
  • More planning time
  • Manual work moving items
  • Lose historical data
  • Must reconfigure apps

 

The in-place upgrade has advantages like preserving performance data because the vCenter database is kept intact. Since it’s the same vCenter, the identity is carried forward as are all the settings. It can sometimes be quicker to upgrade since you’re not standing up a new vCenter, and if you’re moving from Windows to the appliance there’s a handy migration utility that streamlines this process. Lastly, any solutions or other third-party applications you have which rely on vCenter continue to work (if they’re compatible).

 

However, there are some serious drawbacks to consider as well. Going with an in-place upgrade means settings which may not be optimal on the new version are carried forward rather than altered. In preserving the configurations, you may also be carrying along things which were mistakes or not according to best practice to begin with. There’s a much higher risk of failure due to things like database issues (which are rampant), underlying OS issues, and the fact that in any enterprise software development effort, the majority of QA is focused on testing net-new deployments. It’s understandable that vendors focus on predictable deployment patterns rather than trying to model millions of possible permutations of different versions crossed with different settings; it’s a matrix from hell. An in-place upgrade also carries a higher risk of breaking as future patches and updates are layered on top. That combination of legacy settings and non-optimal configuration creates something of a ticking time bomb for further updates, again because those permutations are hard to cover in development and test. And last, an in-place upgrade won’t allow you to change your vCenter architecture. It’s very common to see vCenter deployments that, due to time, budget, personnel, or other constraints, were slung together and not well planned and thought out. Perhaps the architecture was wrong on day one, or maybe your company has simply grown organically or through acquisitions and you now find the need for multiple vCenters and a more resilient architecture. In-place upgrades don’t allow you to change what you have, merely stand pat and bump up to the latest release.

 

When it comes to the migration path, you still have some negatives that should be understood. In a migration, since this is a new vCenter, there’s more planning involved as you map out dependencies and port elements over. This translates to more time spent on the overall upgrade process. And because this is a lift-and-shift operation, you’ll lose historical data in the vCenter database as well as be required to repoint any external applications that talk to vCenter. More on all this in the Moving to Migrate section.

 

The positive aspects of a migration as opposed to an in-place upgrade are extremely compelling, however. This is a fresh, clean slate, so you have the opportunity to right past wrongs, fix non-optimal settings, and conform to best practices without having to worry about transporting and then readopting a bunch of junk from prior versions. If your present vCenter environment has already been upgraded across at least one major version in the past (for example, from vSphere 4.1 to vSphere 5.0), that’s usually a clear signal to break with in-place upgrades and opt for a migration. The migration process is much more controlled, so you can take the time to be thorough and fix issues as they arise without worrying about downtime. The risk of failure is very slight because everything is new and fresh, so there’s no worrying about database corruption or rogue tables killing your upgrade. Since this is essentially a new environment, future patches and upgrades are much more likely to go without incident because you are on a common, known-good platform. And, lastly, you can learn from prior mistakes, assess the needs of your company, and correct upon earlier architectures by designing a new one and putting it into play. When the time comes and you’re satisfied, you can then begin to bring things over piece by piece until the legacy environment is entirely vacant and deprecated, then dispose of it forever.

 

  Weigh each option carefully to determine if the pros column outweighs the cons column in your case. And for some, an in-place upgrade is the only possibility due to a variety of reasons. However, keep in mind the ultimate goal with any upgrade is not only to satisfy the primary objective of moving to the later version, but to ensure the platform remains stable, reliable, secure, and performant. Pursuant to those goals, it has been my experience that a lift-and-shift migrate, while having some leg work involved, ultimately produces the best result in the long run and sets you up for a more stable vSphere.

 

Moving to Migrate

 

     In a vSphere migration process, there are three large steps that occur which, while they sound simple, carry complex implications.

 

  1. Stand up new vCenter on new version
  2. Move ESXi hosts to new vCenter
  3. Upgrade ESXi hosts

 

Leading up to these steps is much planning in figuring out how exactly to do this. The devil, as they say, is in the details. Because this is essentially a new vCenter infrastructure design, we have the opportunity to adjust what might not have worked so well in the past and adopt a clean and new architecture that better suits our needs. Some questions to ask and then answer include:

 

  1. What type of vCenter platform will I use?
  2. What will the size of my inventory be?
  3. How will this grow in the foreseeable future?
  4. Will I use an external PSC?
  5. Do I need to link additional vCenters?

 

Obviously, the answers to these questions will be specific to your needs and those of the business and so are out of scope for this particular article, with one exception being the vCenter platform. Because Windows-based vCenters are going away, the appliance should be the only thing on your radar. The point is that you are planning for a greenfield deployment as if your existing datacenter were a new one entirely. Once you’ve settled on a vCenter architecture, we have to get from the current state to the new state. This is where the next batch of planning comes in. Because of the complexities of vCenter and the various features it enables (which you may be using), there are a whole host of things that must be moved and due diligence to be done before swinging hosts. An exhaustive list is not possible, but here are the 10 major things you should check and plan to either move or recreate. Keep in mind that although this list is tailored towards a migration, several items are universal irrespective of which upgrade method you elect.

 

10 Things to Check Before Migrating

 

1. Custom roles and permissions

Any roles you’ve cloned and customized in your existing vCenter will not be moved with hosts and so must be recreated. Also, if you’ve applied those custom roles to specific objects in the vCenter hierarchy, those will need to be documented and recreated. Even if not using custom roles, existing out-of-the-box roles that are applied at granular levels inside vCenter will need to be recreated.

 

2. Distributed Switch

The vDS is a vCenter-only construct and will have to be dealt with first. While you can back up and restore the vDS via the web client, hosts will have to be migrated to a vSS first before vCenter will allow you to disconnect them. This is a topic unto itself, but you will need a minimum of two uplinks to perform such a migration as well as some careful planning. It can be done with VMs online, but the point is you have to get to a vSS first, then reverse the process later.
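
As a hedged aside, the vDS backup itself can also be taken from PowerCLI rather than the web client; a minimal sketch, assuming the PowerCLI vDS cmdlets, with a placeholder switch name and destination path:

# Sketch: export the vDS configuration (including port groups) before the migration.
Get-VDSwitch -Name 'dvSwitch01' | Export-VDSwitch -Destination 'C:\Backups\dvSwitch01.zip'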

 

3. Folders, Resource Pools, Compute/Datastore Clusters

Once again, these are all vCenter constructs and will not follow the hosts. Any vSphere folders, resource pools, compute or datastore clusters will need to be recreated on the destination. Other vCenter-specific resources include storage policies, customization specs, host profiles, vSphere tags, DRS rules, and licenses. While some of these objects have native, GUI-driven export abilities, like host profiles as shown below, others like vSphere folders will require you to drop down to PowerCLI and do some scripting. In most cases, there are existing PowerShell scripts you can leverage to help, but you’ll need to consider these before swinging hosts.
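
To give a rough idea of the kind of scripting involved (this is only a sketch, not a complete inventory migration), top-level VM folder names could be read from the source vCenter and recreated on the destination with PowerCLI along these lines; the server, datacenter, and folder names are placeholders:

# Sketch: recreate top-level VM folders from a source vCenter on a destination vCenter.
# Assumes PowerCLI and appropriate permissions on both vCenters; names are placeholders.
$src = Connect-VIServer -Server 'vcenter-old.domain.com'
$dst = Connect-VIServer -Server 'vcenter-new.domain.com'

# The hidden "vm" folder is the root for VM folders under each datacenter.
$srcRoot = Get-Folder -Name 'vm' -Location (Get-Datacenter -Name 'Datacenter' -Server $src) -Server $src
$dstRoot = Get-Folder -Name 'vm' -Location (Get-Datacenter -Name 'Datacenter' -Server $dst) -Server $dst

foreach ($folder in Get-Folder -Location $srcRoot -Type VM -NoRecursion -Server $src) {
    # Only create the folder if it doesn't already exist at the destination.
    if (-not (Get-Folder -Name $folder.Name -Location $dstRoot -Server $dst -ErrorAction SilentlyContinue)) {
        New-Folder -Name $folder.Name -Location $dstRoot -Server $dst | Out-Null
    }
}

Nested folders would need a recursive version of the same idea, and community scripts exist that handle the full hierarchy.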

4. ESXi version compatibility

In vSphere 6.5, for example, vCenter 6.5 cannot manage hosts below 5.5 and so before committing to this process, you need to ensure the existing ESXi hosts will support being connected to the next version of vCenter prior to them being upgraded.

 

5. Hardware support (compute, storage, network)

Further to #4, you must check your hosts, storage, and network against the HCL to ensure they will support being upgraded to the target new version. This is something that is overlooked far too often and leads to major issues. Vendors are the ones who usually do compatibility testing on their platforms, and so not all servers will support the latest version. In order to be in a safe place if you need support, all hardware must be validated against the HCL. Also, don’t forget about your physical network and storage equipment. These must be validated every bit as much as your ESXi hosts.

 

6. Firmware updates

And further to #5 is the matter of firmware updates for said physical equipment. Although you may have validated that your servers and storage are indeed supported with the latest version of vSphere, they may not be running a compatible or supported version of the underlying firmware. This can be critically important if you wish to avoid outages and instability in your vSphere platform. Every piece of hardware on the HCL has corresponding validated firmware that forms the support statement.

 

For example, in the image above you can see that the HP Lefthand storage array must have at least SANiQ 12.5 if using the be2iscsi driver to be compatible with ESXi 6.5 U1. Other drivers, which depend on the network adapter in use, may have higher requirements. You must take care to ensure that all combinations of hardware have been validated against the HCL and work with various teams internally to come to an understanding on what, if any, upgrades are necessary prior to upgrading ESXi.

 

7. vSAN, NSX, and other VMware solutions

This is a very broad topic, but if you’re running vSAN or NSX then there are specific validations that must take place there. Any other VMware solutions you may have such as vROps, vRA, SRM, Log Insight, Infrastructure Navigator, Horizon View, etc. must all be checked for their individual levels of support and interoperation with the new version. Use the Interoperability Matrix to check these solutions, and then use the KB for proper upgrade order in the case of vSphere 6.5. For example, if you are using NSX, you may need to upgrade it before you perform the migration. Update Manager is another consideration, though less of one now that vCenter 6.5 has it baked in. Some shops are very particular about their VUM installations; this is something else you must leave behind, so make preparations to migrate any builds, patches, and baselines to the new vCenter. Lastly, if using Auto Deploy then you’ll want to take that into consideration as well since it has some special requirements.

 

8. Plug-ins

Also a broad topic, but any third-party plug-ins you might have, for example with your storage vendor, will need to be validated, possibly upgraded, then migrated or reregistered against the new vCenter. Check vCenter for a list of these under the Administration -> Solutions heading at Client Plug-Ins and vCenter Server Extensions. For deeper insight into what is registered and where it is, see William Lam’s article on using the vCenter MOB. Check with each respective vendor to figure out what that process may be and if you’ll need to perform any sort of backup or restore procedure for the data that may have been created or managed by those plug-ins.

 

9. SEAT data

Stats, Events, Alarms, and Tasks (SEAT) data will be left behind in your existing vCenter because this is all stored in the database and does not travel with the hosts. Stats are the performance statistics when you open the performance charts on an object. Events are any event on any object accessible from the Tasks and Events pane. Alarms are any existing, active alarms as well as those you have customized plus those created automatically by other solutions or plug-ins. Tasks are any records of activities performed manually or programmatically and serve as an audit log. If you’re using something like vROps, most of this information will be preserved there, but if not, be cognizant that you must give this up once hosts are swung.

 

10. Backup, replication, and monitoring

Very important and often overlooked. Special applications such as backup, replication, and monitoring will need to be validated for support and functionality, but will also need to be reconfigured or updated once the resources for which they are responsible are moved elsewhere. vCenter tracks objects by several internal IDs, the main one being the MoRef ID (Managed Object Reference). This tracking system assigns a unique ID to each VM, host, folder, etc., and it is very often this ID that such applications key off of when associating their inventories. For example, in the case of Veeam Backup & Replication, when swinging hosts and their VMs over to a new vCenter, each object will have a new MoRef generated for it. If you merely reconfigure the jobs to point to the new vCenter, Veeam will see new IDs and therefore think they are brand new VMs even though they’re actually the same. Veeam has addressed this challenge specifically in a KB, but you’ll want to understand what will happen in this case and how your monitoring or replication applications will behave. Points #6 through #10 here are the biggest and most complex things to investigate and can make or break whether a migration is right for you. Anything and everything that talks to or through vCenter Server must be accounted for, documented, and investigated.
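
For reference, you can check the MoRef ID vCenter has assigned to an object with a quick bit of PowerCLI; the VM name below is just a placeholder:

# Show the MoRef ID vCenter uses to track a VM (for example, "vm-123").
$vm = Get-VM -Name 'MyVMName'
$vm.ExtensionData.MoRef.Value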

 

Resources and Links

 

     I’ve covered lots of different material and provided several links, but I want to list the most important ones you can use as reference material when deciding on an upgrade path. Let these links be your guiding star and read them thoroughly and carefully. While several are for vSphere 6.5, they are generic documents that are updated with each major release.

 

Also included are release notes to the latest versions of vSphere as of the time of writing. Something that people rarely do is read release notes and instead plunge head first into an upgrade/migration. I can’t stress enough the importance of reading and then re-reading release notes. Bookmark them and check back frequently when planning your path because VMware always updates them as new issues are discovered and workarounds found.

 

VMware Compatibility Guide

VMware Product Interoperability Matrices

Update sequence for vSphere 6.5

vSphere 6.5 Upgrade Documentation

Best practices for upgrading to vCenter Server 6.5

vCenter 6.5 U1 Release Notes

ESXi 6.5 U1 Release Notes

vSphere 6.5 was released at the end of 2016 and so, at this point, has been on the market for about a year. VMware introduced several new features in vSphere 6.5, and several of them are very useful; however, people sometimes don’t take the time to really read and understand these new features, which can solve problems that already exist. One such feature that I’d like to focus on today is the new HA feature called Orchestrated Restarts. In prior releases, vSphere High Availability (HA) has served to reliably restart VMs on available hosts should one host fail. It does this by building a manifest of protected VMs and, through a master-slave relationship structure, makes those manifests known to other cluster members. A fun fact I’ve used in interviews when assessing someone’s VMware skill set: HA does not require vCenter for its operation, although it does for the setup. In other words, HA is able to restart VMs from a failed host even if vCenter is unavailable for any reason.

The gap with HA, until vSphere 6.5 that is, was that it had no knowledge of the interdependencies among the VMs it restarts. So, in some situations, HA may restart a VM that has a dependency upon another VM, which results in application unavailability when all return to service. In vSphere 6.5, VMware addressed this need with a new enhancement to HA called Orchestrated Restarts in which you can declare those dependencies and their order so HA restarts VMs in the necessary sequence. This feature is eminently useful in multi-tier applications, and one such application that can benefit tremendously is vRealize Automation. In this article, I’ll walk through this feature and illustrate how you can leverage it to increase availability of vRA in the face of disaster, in addition to covering a couple other best practices with vSphere features when dealing with similar stacks.

 

              In prior versions of HA, there was no dependency awareness—HA just restarted any and all VMs it knew about in any order. The focus was on making them power on and that’s it. There were (and still are) restart priorities which can be set, but not a chain. In vSphere 6.5, this changed with Orchestrated Restarts.

 

 

With special rules set in the web client, we can determine the order in which power-ons should occur. First, let’s look at a common vRA architecture. These are the machines present.

 

 

We’ve got a couple front-end servers (App), manager and web roles (IaaS), a vSphere Agent server (Agent), and a couple of DEM workers (DEM). The front-end servers have to be available before anything else is, followed by IaaS, and then the rest. So, effectively, we have a 3-tier structure.

 

 

And the dependencies are in this order, so therefore App must be available before IaaS, and IaaS must be available before Agent or DEM.

 

Going back over to vCenter, we have to first create our groups or tiers. From the Hosts and Clusters view, click the cluster object, then Configure, and go down to VM/Host Groups.

 

 

We’ll add a new group and put the App servers in it.

 

 

And do the same for the other tiers with the third tier having three VMs. It should end up looking like the following.

 

 

Now that you have those tiers, go down to VM/Host Rules beneath it. Here is where the new feature resides. In the past, there was just affinity, anti-affinity, and host pinning. In 6.5, there is an additional option now called “Virtual Machines to Virtual Machines.”

 

 

This is the rule type we want to leverage, so we’ll create a new rule based on this and select the first two tiers.

 

 

This rule says anything in vRA-Tier1 must be restarted before anything in vRA-Tier2 in the case where a host failure takes out members from both groups. Now we repeat the process for tiers 2 and 3. Once complete, you should have at least two rules in place, possibly more if you’re following these instructions for another application.

 

After they’ve been saved, you should see tasks that kick off that indicate these rules are being populated on the underlying ESXi hosts.

 

In my case, I’m running vSAN and, since vSAN and HA are very closely coupled, the HA rules serve as vSAN cluster updates as well. By the way, here is another opportunity to exercise best practice with a distributed or enterprise vRealize Automation stack: we need to ensure machines of the same tier are separated to increase availability. This is also done here, so we specify some anti-affinity rules to keep the App servers apart, as well as the IaaS servers and others. My full complement of rules, both group-dependency and anti-affinity, looks like this.
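
If you prefer scripting to clicking, the VM groups and the anti-affinity rules can also be created with PowerCLI; the VM-to-VM restart dependency rules themselves I built in the web client as shown above. This is only a sketch, and the cluster, group, and VM names are placeholders to adjust for your environment.

# Sketch: create the vRA tier VM groups and anti-affinity (keep-apart) rules with PowerCLI.
# Assumes PowerCLI 6.5+ connected to vCenter; all names below are placeholders.
$cluster = Get-Cluster -Name 'Cluster01'

# VM groups representing each tier.
New-DrsClusterGroup -Name 'vRA-Tier1' -Cluster $cluster -VM (Get-VM 'App01','App02')
New-DrsClusterGroup -Name 'vRA-Tier2' -Cluster $cluster -VM (Get-VM 'IaaS01','IaaS02')
New-DrsClusterGroup -Name 'vRA-Tier3' -Cluster $cluster -VM (Get-VM 'Agent01','DEM01','DEM02')

# Anti-affinity rules to keep like members on separate hosts.
New-DrsRule -Cluster $cluster -Name 'Separate-App'  -KeepTogether $false -VM (Get-VM 'App01','App02')
New-DrsRule -Cluster $cluster -Name 'Separate-IaaS' -KeepTogether $false -VM (Get-VM 'IaaS01','IaaS02')
New-DrsRule -Cluster $cluster -Name 'Separate-DEM'  -KeepTogether $false -VM (Get-VM 'DEM01','DEM02')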

 

 

Now that we have the VM groups and the orchestration rules, let’s configure a couple other important points to make this stack function better. In vRA, the front-end (café) appliance(s) usually take some time to boot up because of the number of services that are involved. This process, even with a well-performing infrastructure, can still take several minutes to complete, so we should complement these orchestrated restart rules with a delay that’ll properly allow the front end to start up before attempting to start other tiers. After all, there’s no point starting other tiers if they have to be restarted manually later because the first tier isn’t yet ready for action.

 

Let’s go down to VM Overrides and add a couple rules. This is something else that’s great about vSphere 6.5, the ability to fine-tune how HA restarts VMs based on conditions. Add a new rule and put both App servers in there.

 

 

There are three key things we want to change. First, the VM restart priority. By default, an HA cluster has a Medium restart priority where everything is of equal weight. We want to change the front-end appliances to be a bit higher than that because they serve as the login portal, so HA needs to make haste when prioritizing resources to start those VMs elsewhere. Next, the “start next priority VMs when” setting allows us to instruct HA when to begin starting VMs in the next rule. There are a few options here.

 

 

The default in the cluster, unless it’s overridden, is “Resources allocated,” which simply means as soon as the scheduler has powered it on, basically immediately. “Powered On” waits for confirmation that the VM was actually powered on rather than just attempted. But the one that’s extremely helpful here, and the one I’d suggest setting, is “Guest Heartbeats detected.” This setting allows ESXi to listen for heartbeats from VMware Tools, which is usually a good indicator that the VM in question has reached a suitable run state for its applications to initialize.

 

Back in the list, an additional delay of 120 seconds gives the front end further time to let its services start before HA attempts to start any IaaS servers. If guest heartbeats are still not detected after this custom value, a timeout occurs and the remaining VMs are started anyway. That’s extremely helpful in situations where you need all members to come up, even at the expense of some pieces possibly needing another reboot. Rinse and repeat for your second tier containing the IaaS layer. Using the same settings as the front-end tier is just fine.

 

Great! Now the only thing left is to test. I’ll kill a host in my lab to see what happens. Obviously, you may not want to do this in a production situation.

 

I killed the first host (10.10.40.246), which was running App01, IaaS01, and Agent01, at 7:55pm. Here’s the state before.

 

 

Now after the host has died and vCenter acknowledges that, the membership looks like the following.

 

 

Those VMs show disconnected with an unknown status. Let’s see how HA behaves.

 

 

Ok, good, so at 8:00pm it started up App01 as it should have once vSAN updated its cluster status. Normally this failover is a bit quicker when not using vSAN.

 

Next, when guest heartbeats were detected, the 2-minute countdown started.

 

 

So at 8:04, it then started up IaaS01 followed by Agent01 similarly. After a few minutes, the stack is back up and functional.

 

 

Pretty great enhancements in vSphere 6.5 related to availability if you ask me, and all these features are extremely handy when running vRA on top.

 

I hope this has been a useful illustration on a couple of the new features in vSphere 6.5 and how you can leverage those to provide even greater availability to vRealize Automation. Go forward and use these features anywhere you have application dependencies, and if you aren’t on vSphere 6.5 yet, start planning for it now!

For those who use vRealize Automation (vRA), you’re probably all too familiar with vSphere templates and how they are the crux of your service catalog. Unless you’re creating machine deployments anew from another build tool, you’re probably using them, and it’s likely you have at least two, sometimes many more. Using the Clone workflow, those vSphere templates become the new VMs in your deployments. That part is all well and good, but do you fall into the category of having several templates? How about multiple data centers across multiple geos? It becomes a real chore when patching time comes around. It’s bad enough having to convert, power on, update, power down, and convert back two templates, let alone a dozen, only to then multiply that work by the number of sites you have. So, in the words of the dozens of infomercial actors out there hawking questionably useful gadgets and gizmos…

I’m glad to share with you in this new article that there is indeed a better way if you happen to be using Veeam Backup & Replication. And the best part is this won’t even cost you so much as one easy payment of $19.95. Read on for the best thing since the Slumber Sleeve.

 

     Let’s start off with a basic scenario: You have two different data centers, each managed by a different vCenter, with Veeam Backup & Replication able to reach both of them. There is at least one Veeam proxy per site. One site is considered the “primary” site while the other is “secondary.” Your templates are updated only on the primary site but you wish them to be available at the secondary site as well. vRA has endpoints set up for both vCenters with reservations and blueprints created for both locations. This is a very common scenario and is exactly what we’ll achieve here, and the same approach extends to replication to any number of other sites.

 

As the scenario was described, make sure you do have those prerequisites met. If you don’t yet have a second endpoint in vRA with blueprints for the second location, that’s not a problem. But do ensure Veeam is operational, has proxies at both sites, and that those vCenters have been added to Veeam’s Managed Servers inventory list. Also, PowerCLI will be necessary on the Veeam management server. This process as well as the scripts I’ll provide were developed and tested with PowerCLI 6.5.2 against vCenter 6.5 U1 and Veeam 9.5 U2, but presumably they should work on earlier versions of each as well.

 

     To start this process off, we need to make sure there is some level of organization in vCenter on both sites. Place all your vRA templates into a dedicated folder within the source vCenter. That is to say, any template you want replicated needs to go in one specific folder. I just called mine “vRA Templates” and it is a subfolder of “Templates” because I have others that are not for vRA’s use.

Pretty simple with one Windows and one Linux. Also, keep in mind that these templates have the guest/software agent installed on them, and it can only be pointed to one vRA environment, so if you’re thinking of replicating them to another data center for another vRA’s use, you might need to use another method. In the second site, create a folder of the same name.

 

Now, in that second site and within that folder, create new VMs but give them the same configuration as your templates. For example, my CentOS 7.2-vRA template you see there is a 1 x 2 x 6 configuration. Create the same configuration in a new VM in the second location. Feel free to give it another name, or append a portion to the name for uniqueness, however this is not required. Join it to a portgroup of your choice, keeping in mind vRA will have the ability to change this when deployed. Here’s the thing, though: you do *NOT* need to power it on or load any sort of OS. Just leave it there as a shell of a VM in a powered-off state. Why is this necessary? Because although Veeam is perfectly capable of creating replicas at the destination, we can’t have that in this case due to the instanceUUID value, also known as the vc.uuid. This is the ID by which vCenter tracks different VMs, and each is based off of the vCenter unique ID. The instanceUUID is consequently how vRA tracks VMs, and so all must be unique even across sites. If we let Veeam create the VM replica at the destination, those IDs would not be unique, and thus vRA would not recognize the two templates as different objects. By manually going through this creation process once, we can ensure those IDs are unique and Veeam will honor them going forward with replication. With some simple PowerShell, we can verify those instanceUUIDs and keep track of them for later.

 

$vm = Get-VM -Name "MyVMName"

$vm.extensiondata.config.InstanceUUID

 

With the template shells created at the destination, make sure you convert the templates at the source side to VMs. It’s time to set up the Veeam replication job. Again, before doing so, make sure both vCenters are in Veeam’s inventory and you have proxies available in both locations.

 

1.) Create a new replication job. Before navigating away from that first screen, there are two very important check boxes we must have in place.

Check “Low connection bandwidth” and “Separate virtual networks”.

 

2.) Select the VM folder at the source containing your vRA templates.

 

3.) At the Destination step, select the necessary resources including the folder at the destination containing those shell VMs.

 

4.) On the Network selection, select the portgroup on the source side where those templates are joined, and then the portgroup on the destination side where your shells are joined.

 

5.) On Job Settings, select the repository of your choice. Clear the selection for the replica suffix as we won’t need that (or assign something else). Change restore points to 1 since you must have at least one. In the Advanced Settings option, leave the defaults set ensuring the two options under the Traffic tab are enabled.

6.) Data Transfer, select options to fit your infrastructure.

 

7.) Seeding. This is the key. We’re going to map our replicas to those shells we created earlier at the destination.

We won’t use the initial seeding functionality, only the replica mapping. Click each template and manually map it to those new shells. Don’t rely on the detect functionality as it probably won’t come up with the right systems.

 

8.) Guest Processing. Don’t need it since these will be powered off.

 

9.) Schedule is up to you. Since Veeam won’t replicate anything if no data has changed on these templates, it’s safe to set it to a more frequent interval than that in which you’ll be updating them.

 

10.) Click Finish to save the job.

 

Now, if you were to run the job, it wouldn’t process anything if you had converted those VMs back to templates at the source. The reason is that Veeam can’t replicate templates natively, so they must be converted. A couple of scripts I’ll provide here handle the pre- and post-job processing of the job. The pre-job script automatically converts that folder of templates into VMs. Once that’s done and the Veeam job started, it processes them normally. The post-job script reverses the process. In addition to templates needing to be VMs for replication to function, the added benefit is that CBT can be leveraged to find and move only the blocks of data that are different. If you’re familiar with backup jobs, you may know that they are capable of backing up templates, but only in NBD mode and without CBT. So this process of template-to-VM conversion and back actually saves time if that data needs to be moved over a WAN to remote sites.
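
To give a rough idea of what that pre-job conversion amounts to (this is only an illustrative sketch, not the actual script provided), converting a folder of templates into VMs in PowerCLI looks something like this; the vCenter and folder names are placeholders, and $username/$password are assumed to be set as described below:

# Sketch of the pre-job step: convert every template in the designated folder to a VM
# so Veeam can replicate it. Server and folder names are placeholders.
$vc = Connect-VIServer -Server 'source_vC.domain.com' -User $username -Password $password

$folder = Get-Folder -Name 'vRA Templates' -Type VM -Server $vc
Get-Template -Location $folder -Server $vc | ForEach-Object {
    Set-Template -Template $_ -ToVM -Confirm:$false | Out-Null
}

Disconnect-VIServer -Server $vc -Confirm:$false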

 

     The last thing to do here is to download the pre- and post-job scripts I’ve provided, modify them slightly, and configure them in the advanced job settings options. Both scripts have been made with ease of use and portability in mind. The only things that need setting by you are the username and password options.

 

Edit the PRE script to change the $username and $password variables, and do the same for the POST script. Since this is obviously sensitive information, you’ll want to keep access controlled to these scripts somewhere local to your Veeam server. Once this is done, edit your replication job and go to Job Settings -> Advanced (button) -> Scripts. Check both boxes to run the script at every 1 backup session.

For the path, specify them as the following:

 

C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -File "D:\Replicate vRA Templates PRE.ps1" -SourcevCenter "source_vC.domain.com" -DestinationvCenter "dest_vC.domain.com" -FolderName "My vRA Templates"

 

Since the scripts accept parameters for source, destination, and folder, we can just pass these in as arguments in the program path.

 

So after getting that plumbed up, you should be ready to run your replication job. The overall process that happens will be something like this.

 

  1. Pre-job script runs. Converts folder of templates at source and destination to virtual machines.
  2. Veeam replication begins. Replicas are mapped. Disk hashing of destination VMs begins.
  3. Data is replicated from source to destination using available proxies.
  4. Replication complete. Retention applied by merging snapshots (if applicable).
  5. Post-job script runs. Checks destination VMs to see if vCenter reports VMware tools as installed. If no, starts VM, waits for tools status, then shuts down. If yes, converts VMs at destination to templates. Converts VMs at source to templates.

 

Step 5 may look a little strange. Why does it care about VMware tools status on a powered-off VM? This is because customizing a VM using a customization spec uses the VMware tools channel to push the configuration. If VMware tools are not installed, customization cannot happen, and even if tools really are installed but vCenter sees they are not, it will still fail. If customization fails, vRA fails, too, since vSphere machines require vCenter customization specs for things like static IPs and domain joins. So the workaround here is to power on the VM until tools starts. At that time, vCenter will pick up on it and change what it has recorded in its configuration to reflect that tools are installed. This is what will allow customization to succeed. Once this status is updated on the VM, the script will then perform a guest OS shutdown (not a power off) followed by a conversion to template.
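
Purely as a hedged sketch of that logic (the provided post-job script remains the authoritative version), the tools check and conversion back to template could look something like this in PowerCLI, where $destVC is an existing connection to the destination vCenter and the folder name is a placeholder:

# Sketch of the post-job step: make sure vCenter records VMware Tools as installed,
# then convert the destination VMs back to templates. Names are placeholders.
$folder = Get-Folder -Name 'vRA Templates' -Type VM -Server $destVC

foreach ($vm in Get-VM -Location $folder -Server $destVC) {
    if ($vm.ExtensionData.Guest.ToolsStatus -eq 'toolsNotInstalled') {
        # Power on just long enough for vCenter to register the Tools status, then shut down cleanly.
        Start-VM -VM $vm -Confirm:$false | Out-Null
        Wait-Tools -VM $vm -TimeoutSeconds 600 | Out-Null
        Shutdown-VMGuest -VM $vm -Confirm:$false | Out-Null
        while ((Get-VM -Id $vm.Id -Server $destVC).PowerState -ne 'PoweredOff') { Start-Sleep -Seconds 5 }
    }
    # Convert the now powered-off VM back to a template.
    Set-VM -VM (Get-VM -Id $vm.Id -Server $destVC) -ToTemplate -Confirm:$false | Out-Null
}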

 

One possible workaround, if you would prefer not to have this script power on your destination VM/template every time, is to, after the initial replication completes, power on the VM yourself, wait for VMware Tools, shut it down, then merge the snapshot that Veeam just created. Doing so will retain the tools status, but the next run of the replication job will trigger another disk-hashing pass. This disk hashing will have to check all the blocks on the destination VM to ensure the data is as it left it before proceeding with another replication cycle. But it is one possibility if, for some reason, you cannot have your templates being brought up due to run-once scripts or configurations getting updated, etc.

 

     There you go. You now have your templates from your source site at your destination site ready to consume with vRA. Because we performed replica mapping, they should be discrete instances of templates, even if the names are the same, due to different instanceUUIDs. One last thing is to validate from vRA’s side that these templates are, in fact, separate and that we can consume them independently. With IaaS Administrator permissions, log in to vRA and go to Infrastructure -> Compute Resources -> Compute Resources. Hover over the compute resource corresponding to your remote site and perform a data collection.

 

 

Request an Inventory data collection.

 

 

Once successful, either create a new blueprint or edit an existing one. Click on your vSphere machine object on the canvas, go to the Build Information tab, and click the button next to the Clone from field. You should now be able to see templates from both sites in the list and eligible for selection on your blueprints.

 

 

Now all that’s left for you to do is consume them in vRA any way you wish!

 

What if you have multiple sites and want to do a one-to-many replication? That’s no problem either. Simply duplicate the process for the second vCenter in a new replication job, and for the scheduling portion, select the “After this job” option and pick the first replica job you created. Also be sure to edit the arguments of the pre- and post-job script configurations so it reflects the correct destination vCenter. Remember, you’ll still need to create one-time “shell” VMs at your other sites and store them in a consistent folder. You can chain as many of these together as you want and vRA will be able to see them independently.

 

     As you can see, replication of your vRA templates can be done fairly easily with the provided scripts and allows great benefits in the form of time savings and consistency. No more having to manually patch and update the same template in multiple sites. No more user error involved in forgetting to do something on one site that you did in another. And no more failing audits because your corporate, hardened template in your other site didn’t get the new security settings. Now you too can be like one of the millions of satisfied customers with your handy-dandy vRA Template Replicator!

 

 

I wish to give special thanks to Luc Dekens for his scripting expertise and assistance in this project. He was very generous in providing some code snippets and reviewing the precursors to the scripts provided here, and he frequents the PowerCLI forum on the VMTN Community, helping others with their PowerShell scripting challenges. Thank you, Luc!