
amin masoudifard's Blog


Many people are confused about how a host really assigns CPU resources to its virtual machines; more precisely, how the processing of a VM is actually executed on the physical CPU resources. In Intel terminology the physical processor is a CPU socket, but in this post I treat a pCPU as a physical core within the server's sockets.

By default, each vCPU added to a VM is scheduled onto one of the existing pCPUs. So if we configure 8 vCPUs for a VM, the host must have at least 8 pCPUs; in other words, if there are not enough pCPUs for the VM, it cannot be powered on.

By design, VMware ESXi can handle CPU oversubscription (more vCPUs requested than existing pCPUs), which means the pCPU-to-vCPU ratio is no longer one to one (1:1). In a vSphere environment the ESXi host executes the processing requests of every VM, so it has to schedule processing time for each of them. The question is: what ratio should be configured as the best setting? The answer depends on whether you favor capacity or performance, and it can vary widely with the requirements of the virtualized applications ...

Every VM needs pCPU resources, so deploying many VMs, especially heavily used, resource-hungry ones, demands more CPU cycles. If you provision more VMs and raise the pCPU-to-vCPU ratio (1:2, 1:4 or higher), the performance of the ESXi host will be affected.

As VMware mentions, the ESXi scheduling mechanism prefers to keep the same vCPU-to-pCPU mapping to boost performance through CPU caching on the socket. If there is no specific documentation for the application's CPU sizing, start the VM with a single vCPU and scale up as required; that way oversubscription will not have a serious negative impact.

We must also watch CPU Ready Time, which is as important a metric as CPU utilization (a quick way to check it is sketched right after the list below). Generally, the vCPU-to-pCPU ratio depends on many factors, such as the following:

  1. The ESXi host version; each newer version supports a higher ratio.
  2. The features and technologies supported by the physical processors.
  3. The workload rates of the critical applications running in the virtual environment.
  4. The processor capacity of the other cluster members and their current load, especially when we require a higher level of host fault tolerance in the virtualization infrastructure. The resources available in the cluster determine on which host each VM can be placed in case of a host failure.
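A quick way to check both numbers from the host shell is sketched below; esxtop is interactive, so the column names are what to look for, and the 5% figure is only a common rule of thumb, not an official VMware threshold:

     # esxcli hardware cpu global get     (reports CPU packages, cores and threads, i.e. your pCPU count)
     # esxtop                             (press 'c' for the CPU view and watch %USED and %RDY per VM)

Sustained %RDY above roughly 5% per vCPU usually means the vCPU-to-pCPU ratio on that host deserves a second look.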

 

Should we use Hyperthreading or not ?!

Hyperthreading is a great technology that makes a single pCPU appear as two logical processors. When the ESXi host is not heavily loaded, each physical core can run two independent threads at the same time. So if you have 16 physical cores in the ESXi host, after enabling HT (in both the BIOS configuration and the ESXi advanced settings) you will see the host present 32 logical processors. But using HT does not always increase performance; it is highly dependent on the application architecture, and in some cases you may even see performance degradation with HT enabled. Before enabling HT on your ESXi hosts, review the critical virtualized applications deployed on their VMs.
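To confirm that both the BIOS and ESXi report HT, a small sketch from the host shell (the field names are the ones printed by this esxcli namespace):

     # esxcli hardware cpu global get

Hyperthreading Supported, Hyperthreading Enabled and Hyperthreading Active should all show true once HT is turned on in the BIOS and in the host settings.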

Source of original post in my personal blog: Undercity of Virtualization: Virtualization Tip1: Relation between physical CPU & virtual CPU

In this post and the other parts of this series, I will review some great hints for a good datacenter virtualization design. But before anything else, I want to ask you some major questions:

  1. What are the key components of an ideal virtual infrastructure for different IT environments?
  2. How will you set up the virtual infrastructure?
  3. And which elements need attention before and after the deployment and implementation phases?

In this post and the other parts of this series, I want to dive deep into the details of a good design for a virtual infrastructure based on VMware products.

In this first part, I investigate the basic requirements and prerequisites for migrating an IT infrastructure to virtualization. In the other parts, I will review VMware's primary services and their impact on achieving this goal.

 

1. Physical to Virtual

The first step is estimating the physical resources each service really needs. Processor clock rate (GHz), memory and disk usage (GB) and network transmission rate (Gbps) must be calculated separately for each existing service before we can talk about the resources required for server virtualization. We should also account for the hypervisor (ESXi host) overhead and add it to the total estimate.

P2V migration always impacts service availability and usually requires operational downtime for the migrated service/OS. There are also some complexities in this area, including:

  1. The OS type and whether it is supported by the converter application.
  2. Application dependencies on specific hardware (hardware-locked software).
  3. Software licensing problems.
  4. SID/GUID change issues for services like Active Directory.

So below I have provided a short questionnaire about the P2V operation; answer each question carefully before executing the real migration:

  1. Is it necessary to virtualize everything? And are you really sure about your answer? Why or why not; what is the reason for keeping a system in the physical area, or for migrating it to the virtual world? The answer depends on your infrastructure requirements, and you should answer it carefully for each important component and server in your infrastructure.
  2. Have you organized and prioritized the physical servers? Which ones must be at the top of the list, and which are good candidates for the pilot and test phase? I think low-risk servers with non-critical workloads are a good choice for that stage.

Finally, you should build a checklist like the following to specify the servers' priority order:

  1. Application servers with low storage requirements and simpler network and OS configuration
  2. Web servers with a normal request-handling rate and fewer dependencies to/from other servers
  3. Network infrastructure services like VPN, DHCP, NPS
  4. Mission-critical and organizational application servers
  5. Database servers such as SQL Server, Oracle and so on
  6. Unified communication services like mailbox, VoIP and IM servers
  7. The most important services in the IT infrastructure, like directory services

 

2. Storage resources… How to provision?

If the physical server is attached to a storage device/LUN/volume, two difficulties may exist:

  1. Lack of space, if all of the used storage must be migrated along with the server to the new space provided by the hypervisor's local storage
  2. Access to the storage management system for zoning reconfiguration and for providing storage access to the newly deployed VM

On the other side, for services with critical transaction log files such as Exchange Server, migrating mailbox databases requires considering how suddenly the log space can grow. In short, every kind of P2V migration needs extra attention to both temporary and permanent storage space.

 

3. Security considerations, as in the physical and traditional deployment

When choosing a virtualization platform, the selected solution must provide every security technology already deployed in the physical network. It is recommended that every physical switch security feature, such as MAC learning, Private VLANs and so on, be supported by the virtual switches. The Distributed vSwitch technology used in the VMware vSphere platform is an ideal virtual networking solution, supporting many advanced security concepts like port mirroring and NetFlow. Besides the VMware Distributed Switch (VDS), products from vendors such as Cisco, HP and IBM are supported by the vSphere networking platform; for example, the Cisco Nexus 1000v is designed as an integrated distributed vSwitch for the VMware platform. Of course, VDS design and the migration from the vSphere Standard Switch (VSS) to the VDS come with their own implementation considerations (which I reviewed in this video playlist on my YouTube channel).

 

4. Provide suitable physical resources for virtual infrastructure

One of the important advantages of server virtualization over traditional server provisioning is the increased service availability, and this requires building a VMware cluster. As a result, you must comply with the deployment prerequisites, such as using the same CPU generation and technologies across the ESXi members of the cluster.

It is also recommended to use more, similar physical servers rather than fewer servers with more physical resources; blade servers are therefore a better choice for the hypervisor hardware than other form factors such as tower servers.

 

5. Do not forget cleanup operation

After the migration has completed successfully, you should start the post-migration tasks, including checking the virtual hardware devices detected in the VM and removing everything that is no longer required on the newly converted VM. For example, in a Windows guest OS you can set devmgr_show_nonpresent_devices=1, then run devmgmt.msc, go to View > Show hidden devices and finally remove the unnecessary or hidden items.
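A minimal sketch of that cleanup, run from an elevated Command Prompt inside the converted Windows guest (these are exactly the variable and console named above):

     set devmgr_show_nonpresent_devices=1
     devmgmt.msc

Then enable View > Show hidden devices in Device Manager and uninstall the greyed-out devices left over from the old physical hardware; hidden NICs are the usual suspects, since they can still hold the original IP configuration.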

In the next part, I will talk about the power supply used for the computing and storage racks and how to calculate it.

Source of original post in my personal blog: Undercity of Virtualization: Best practice for a good Virtualized Datacenter Design - Part 1

Every part of the virtual infrastructure needs a channel for communication, and a safe and secure channel always requires a certificate. ESXi hosts, vCenter Server, NSX Manager, Horizon Connection Server and so on each have at least a machine certificate or a web-access management portal with a self-signed SSL certificate. Since the introduction of vSphere 6.0, the Platform Services Controller (PSC) handles the vSphere-generated certificates through the VMware Certificate Authority (VMCA). But in this post I want to introduce some CLIs for managing VMware certificates:

  1. VECS-CLI: a useful CLI to manage (create, get, list, delete) certificate stores and private keys. VECS (VMware Endpoint Certificate Store) is the VMware SSL certificate repository. Pic1 shows some of its syntax, and a few common commands are also sketched after this list:

    vecs-cli-exmp1.png

  2. DIR-CLI: manages (create, list, update, delete) everything inside the VMware Directory Service (vmdir): solution user accounts, certificates and passwords.

    dircli-p1.png

    dircli-p2.png

  3. Certool: View, Generate and revoke certificates.

    certool.png
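A hedged sketch of the kind of commands shown in the pictures above, run from the appliance shell (the store name MACHINE_SSL_CERT and the SSO account are common defaults; adjust them to your own environment):

     /usr/lib/vmware-vmafd/bin/vecs-cli store list
     /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text
     /usr/lib/vmware-vmafd/bin/dir-cli service list --login administrator@vsphere.local
     /usr/lib/vmware-vmca/bin/certool --getrootca --server=localhost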

There are many types of stores inside the VECS:

  1. Trusted Root: Includes all of the default or added trusted root certificates.
  2. Machine SSL: since the release of vSphere 6.0, all communication of the VC and PSC services goes through a reverse proxy, so they need a machine SSL certificate, which is also backward compatible (version 5.x). An embedded PSC also requires the machine certificate for its vmdir management tasks.
  3. Solution users: VECS stores a separate certificate with a unique subject for each solution user, such as vpxd. These certificates are used for authentication against vCenter SSO.
  4. Backup: provides a revert action to restore (only) the most recent state of the certificates.
  5. Others: Contains VMware or some Third-party solution certificates.

Now, what are the roles of the solution users? There are five of them:

 

  1. machine: the license server and the logging service are its main roles. It is important to know that the machine solution user certificate is totally different from the machine SSL certificate, which is required for the secure connections (LDAP for vmdir, HTTPS for web access) on each VI node (VC/PSC instance).
  2. SMS: Storage Monitoring Service.
  3. vpxd: the vCenter daemon (manages vpxa, the ESXi host agents).
  4. vpxd-extensions: extensions such as Auto Deploy and the Inventory Service.
  5. vsphere-webclient: the Web Client, obviously, plus some additional services such as the performance charts.

The default paths of certificate management utilities are down below:

     /usr/lib/vmware-vmafd/bin/vecs-cli

     /usr/lib/vmware-vmafd/bin/dir-cli

     /usr/lib/vmware-vmca/bin/certool

 

And for the Windows version of vCenter Server, the default path is:

    %programfiles%\vmware\vcenter server\vmafdd

 

I will surely talk about what vmafd itself is, and about another useful CLI in this path, vdcpromo, in another post. I will also provide a video about how to work with certificate-manager.

As a last note, always remember that deleting the trusted roots is not permitted; doing so can cause some subtle, hard-to-diagnose problems in your VMware certificate infrastructure.

Link of content inside my personal blog: Undercity of Virtualization: Manage VCSA Certificates - Chapter I

In the third part of the VDI troubleshooting series, unlike the previous two parts, I want to talk about client-side connection problems. For instance, if there is a dedicated IP subnet for zero client devices, an incorrect setup or misconfiguration of routing can be the reason for connection problems between VDI clients and servers. In the same way, wrong VLAN configs (ID, subnet, inter-VLAN routing) can be the main cause of trouble. So I have provided a checklist of "what to do if you have a problem with your Horizon Connection Servers?"

 

1. Check the correctness of the zero/thin clients' communication path (routing, switching, etc.) to the VDI servers (Connection Server, Security Server).

2. Check the network connection between the Connection Server subnet and the virtual machines deployed in the desktop pools, if they are separated. Logically there is no need to connect their dedicated hosts/clusters to each other, so you can have separate ESXi clusters, one for the desktop pools and another for the VDI servers.

3. Verify that the vCenter Server is accessible from the Connection Server and that its stored credentials are still valid.

4. If you have a Composer Server, check its services. Many times I have seen the Composer service fail to start after a server reboot, even though it is set to automatic and no warning or error event is reported. Also check the ODBC connection between the Composer Server and its database.

5. Check the state of the View Agent installed inside the desktop pool's VMs. If you need to provide direct client connections to the desktops (without going through the Connection Server), the View Agent Direct-Connection plug-in is needed too.

6. A TCP connection on port 4001 (non-SSL) or 4002 (SSL) between the desktop's View Agent and the Connection Server must be established. It is required for the connection, and you can check it by running netstat -ano | findstr "4001".

7. Review the user entitlements of the desktop pools; there may be a mistake, especially when AD groups are added instead of AD users. (Also check whether the desktops are still available or already assigned to other users.)

8. The type of virtual desktop provisioning is also important. Other than full clones, for linked-clone and instant-clone models you need to check the status of the virtual desktops under Inventory\Resources\Machines on the View Administrator web page.

9. If connected sessions are being interrupted, review their state under Inventory\Monitoring on the View Administrator web page.

10. A last note: DO NOT FORGET TO CONFIGURE THE EVENT DATABASE! I have encountered too many Horizon View deployments without any event database configured, so in troubleshooting situations we had NOTHING to tell us what really happened.

I hope this is helpful for you all...

Link to the original post on my personal blog: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part III

 

Memory provisioning is one of the biggest challenges for IT administrators and virtualization designers. Although CPU provisioning is a major factor in the virtual infrastructure, it often gets less attention, because many applications deployed on virtualization need more memory rather than more processor; today's CPU technologies are very powerful, while applications like SQL Server and Oracle need plenty of memory for some of their processes. A lack of memory resources may restrict the growth of many virtual infrastructures, so what should we do in this situation before buying more physical resources for our hypervisors?

There are several technologies for handling memory over-commitment, such as swapping and ballooning, and VMware ESXi uses them to confront these issues. In this post I will review how both mechanisms work and why they matter:

1. Virtual memory ballooning is a memory-reclamation technique that lets the VMkernel retrieve idle memory pages. When the ESXi host drops below roughly 6% free memory, ballooning comes into the ring to handle the memory shortage. If a VM has many idle pages, the host borrows them to use temporarily for VMs with higher memory demand, which probably have memory-intensive processes running.

When a virtual machine releases pages it no longer uses, the guest does not really wipe them; it simply moves the pointers for that address space from the allocated list to the free list. So a VM with the balloon driver (vmmemctl.sys) can decide which pages (idle pages) can be reclaimed back to the host (up to 65% of the guest's memory) and which ones it still needs (in-use pages), without the host being involved in that decision; the driver then inflates to pin those idle pages and hand them back. If you disable the ballooning driver inside the guest OS, the VM is not aware of the host's memory state and the hypervisor cannot tell how much memory it can take back to satisfy other VMs' memory requests.
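A quick, hedged way to see the balloon from inside a Linux guest that has VMware Tools (or open-vm-tools) installed:

     vmware-toolbox-cmd stat balloon      (amount of guest memory currently reclaimed by the balloon driver)

A value above 0 MB means the host has been under memory pressure at some point, not necessarily that the guest is suffering.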

2. Host swapping is another mechanism used when memory is low, but unlike ballooning it does not involve the VM's guest OS. Host swapping kicks in if the host has less than about 2% free memory and still needs to provide memory for memory-intensive VMs, so the hypervisor must have enough swap space. Although swapping is a reliable defence against memory over-commitment, it can degrade overall performance, because the efficiency of the swap files depends heavily on the IOPS of the datastore chosen for swapping. If we do not provision suitable disks (such as SSDs) for the swap files, it can lead to poor performance.
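On the host side, a hedged sketch of where both mechanisms show up in esxtop (these are the standard memory-view counters):

     # esxtop                             (press 'm' for the memory view)

MCTLSZ is the current balloon size per VM, SWCUR is the memory currently swapped out by the host, and SWR/s and SWW/s are the swap read/write rates; sustained non-zero swap rates indicate real memory pressure.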

One crucial point: if memory compression can satisfy the demand, the hypervisor does not need to execute the swap procedure at all. (I will discuss it in more detail in another post.)

Remember, the ESXi host's memory stays in a shared, healthy state until the total memory requested by the VMs exceeds the available physical memory (over-commitment). If we want to prevent that situation before ballooning is ever triggered, we need to do the following:

1. Estimate the overall required memory carefully and provide at least 25% headroom in the design phase.

2. Use VMware DRS at the cluster level to distribute VMs between hosts and balance resource usage, so there is less need to worry about low memory. It is recommended to provision equal memory across all ESXi hosts.

3. Use resource pools, especially at the cluster level, and reserve the estimated memory for mission-critical applications and high-demand VMs to ensure they get enough resources. Reserved memory is also protected against reclamation, even when contention happens.

4. Do NOT disable the balloon driver; it is the smartest memory-reclamation technique. But if you really must, you can do it per VM in the advanced settings (set the sched.mem.maxmemctl parameter to 0 in the .vmx file, as sketched below) or in the guest's Windows registry (change the Start value of HKLM\SYSTEM\CurrentControlSet\services\VMMEMCTL from 2 to 4).
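A minimal sketch of that .vmx approach, assuming the VM is powered off while you edit its configuration (a value of 0 caps the balloon at zero, which effectively disables it for that VM):

     sched.mem.maxmemctl = "0"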

 

Link of post on my personal blog: Undercity of Virtualization: Memory Virtualization: Management & Reclaiming - Section I

 

 

 

The VMware vCenter Server Appliance Management Interface (VAMI) is a very useful web console for VCSA management and maintenance operations, available at https://VCSA_URL:5480. One of its tools is Backup, with which you can take a backup of vCenter data consisting of Inventory & Configuration (required) and Stats, Events & Tasks (optional).

VMware shipped VCSA 6.7 (version 6.7.0.14) with these backup protocols: FTPS, HTTPS, SCP, FTP and HTTP, and announced that NFS and SMB would be supported after VCSA 6.7 U2 (version 6.7.0.3). We had two big problems with this useful tool; one of them is related to the vSAN Health Service.

 

Whenever the backup task was started, it stopped immediately and generated a warning about the vSAN Health Service, which seemed to have crashed (the VCSA management GUI tells us exactly that). Sadly, trying to start the service (even with the --force option) leads to another failed attempt, with a result something like this:

 

So after many retries at starting this service, I decided to check its file structure under /etc/VMware-vsan-health and compare it with a freshly installed vCenter Server.

vSAN-files.png

There are two files that could be related to the cause of this issue. The first is logger.conf, which was absolutely empty on the troubled vCenter Server (opening it with vi showed nothing), whereas on the healthy VCSA you can see something like the results below:

VSAN-logger.png

When I checked vsanhealth.properties, it showed that this service communicates over HTTPS, so its connections need a working SSL setup. Then I found the second file, fewsecrets, which contains what look like two hash strings. So I decided to take the risk and remove this file, and logger.conf too (after taking a backup, of course). At last, a few minutes later, the next attempt to start the service succeeded.
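A minimal sketch of that workaround from the VCSA shell, using the path and files described above (take copies first, and verify the service name with service-control --list on your own appliance):

     cp /etc/VMware-vsan-health/logger.conf /tmp/logger.conf.bak
     cp /etc/VMware-vsan-health/fewsecrets /tmp/fewsecrets.bak
     rm /etc/VMware-vsan-health/logger.conf /etc/VMware-vsan-health/fewsecrets
     service-control --start vmware-vsan-health
     service-control --status vmware-vsan-health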

vSAN-Started.png

 

Remember that you always need to check DNS (forward and reverse), NTP, certificates and the firewall, especially if you set up the vSphere environment with an external Platform Services Controller. I will explain the second problem in another post.

Link to my personal blog: Undercity of Virtualization: VCSA Backup Failed because of VSAN

Now it is time to dig more deeply into troubleshooting VMware Horizon View. In this post I want to continue with the LDAP structure and data of the VDI server. In the first post of this series I talked about how to connect to the View LDAP with the Windows MMC snap-in ADSI Edit. Now I will show you which VDI objects belong to which OUs in the directory service hierarchy of VMware View:

 

1. OU=Server Groups contains the list of desktop pools in the Horizon environment.

2. OU=Servers contains all the VMs (desktops) that have been deployed by each desktop pool.

3. OU=Data Disk lists all of the generated virtual disks belonging to each desktop.

4. OU=Groups contains all of the predefined admin groups and the roles manually added in the Horizon administration console, with their allowed permissions stored in the pae-AdminRolePermissions attribute of the corresponding object.

5. OU=Applications covers all virtual apps added to the Horizon environment, for example by an application pool of an RDS farm; each created app is listed here.

Now let's review the sub-OUs of OU=Properties:

 

 

 

1. If you configured the View event database, you can see the related object in the sub-OU OU=Database as a pae-EventDatabase class. The database server type and instance name, configured TCP port, database name and event retention period are the main attributes of this object class.

2. OU=Server holds the Horizon View servers as objects of the pae-VDMProperties class. OU=Server,OU=LVM contains the VDI servers (same object class) that are related to linked-mode desktop pools.

3. OU=VirtualCenter lists the configured vCenter servers (VC) and Composer servers (SVI) with the object class pae-VirtualCenter. You can also check the configured connection credentials and URL of each server: https://VC:443/sdk and https://SVI:18443.

4. OU=Global contains some important objects such as:

4-1 CN=Common, with some important attributes for VDI management, like the pod name (or cluster name, generated from the computer name of the first/primary installed Connection Server), the console session and connected desktop timeouts, the maximum session duration, syslog configuration, the pre-forced-logoff message for Horizon endpoint users, IPsec mode and so on.

4-2 CN=License, holding the hashed form of the imported VMware Horizon View license key.

4-3 CN=Keys, containing RADIUS configs, some session timeouts (RDP, VDM gateway and Security Servers), Security Server pairing settings and so on.

I have tried to cover the most useful and critical OUs of the VMware Horizon View LDAP structure in this post; if you think I forgot an important View LDAP object, I would appreciate you telling me about it.

Link to my personal blog's post: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part II

Usually we think it is easy to change a server's name. But a small configuration change like editing the FQDN can destroy your whole configuration and settings, and it can turn into a disaster if you change the computer name/account of a server without any planning or checklist for how to do it. Sometimes, though, the initial server name is simply wrong, and only after the service is set up and running do you realize what happened (you forgot to change the default name, or to choose a suitable name based on your design worksheets). Now let's look at this matter for a VMware Horizon View Connection Server. First of all, you should answer some questions like:

    1. What should we do if we need to change the computer account?

    2. What will happen if we change the computer account?

    3. What post-rename steps do we need to perform after renaming the server?

As a first step, you have to review the service documentation, especially the troubleshooting documents. Then investigate the side effects on your virtual desktop infrastructure objects, such as desktop pools and the provisioned, in-use virtual desktops. Naturally, none of them can connect to the server anymore, whether you change the primary Connection Server or any of the replica servers. The best and safest way to reconfigure the server after renaming it is to uninstall the VMware Horizon 7 Connection Server component (and HTML Access) and install it again, without any concern about losing the VDI data and structure. That is because there is another important application on the provisioned Connection Server: the AD LDS instance VMware VDMDS, which, as its name suggests, is the directory service of the VMware VDI suite and is a separate component from the Horizon View Connection Server.

So let me explain the structure of Horizon View. Its foundation is the Lightweight Directory Access Protocol (which is why you cannot install this service on a domain controller). The View LDAP is a data repository containing all of the configuration information; it is created when you install the first View Connection Server and is replicated to the other View replica servers. Like any other LDAP service it has partitions, objects and object attributes, and it can be edited with ADSI Edit. Just remember, if you want to do that, to enter the distinguished name like this:

dc=vdi, dc=vmware, dc=int

You can then find the Connection Server objects under these sub-OUs: 'Properties\Server' and 'Properties\LVM\Server'.

The second way to check and change the VDI configuration on a Connection Server is through the Windows registry editor. You can see the related Horizon View path (HKLM\Software\VMware, Inc.\VMware VDM) in the second picture:

But setting aside these two rough and dangerous methods, VMware recommends the vdmadmin CLI for Horizon View troubleshooting (using regedit is not a suitable way). If you look in the following path you will also find other useful CLIs such as vdmexport and vdmimport:

%ProgramFiles%\ VMware\ VMware View\ Server\ tools\ bin\
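For example, a hedged sketch of exporting and re-importing the View LDAP configuration with those tools (the file name is only an example, and vdmimport should be run only on a Connection Server you intend to overwrite):

     vdmexport -f C:\Backup\vdi-config.LDF
     vdmimport -f C:\Backup\vdi-config.LDF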

Of course, the information gathered from each of these troubleshooting methods must be identical; for example, if you check the system GUID, every method must return the same value:

Reinstalling the Connection Server component is the fastest and easiest way, but if your organization's risk assessment and information security policies prevent that, which method would you choose to reconfigure your virtual desktop infrastructure servers? We will review more Horizon View troubleshooting in other parts of this series.

  Source of content on my personal blog: https://virtualundercity.blogspot.com/2019/02/vmware-vdi-horizon-view-troubleshooting.html

I recorded a video series illustrating vSphere Distributed Switch design and configuration; I hope it is useful for all of you.

Undercity of Virtualization: vSphere Distributed Switch Design & Configuration - Part I: Create & Basic Setup

Yes, exactly: another post about the NTP service and the important role of time synchronization between virtual infrastructure components. In another post I described a problem with the ESXi 6.7 time settings and covered some useful CLIs for time configuration, both manual and automated. But in a lab scenario with many ESXi versions (because of the server hardware, some of them cannot be upgraded to a newer ESXi release), we planned to configure an NTP server as the time source of the whole virtual environment (PSC, VC, ESXi hosts and so on).

But our first NTP server was a Microsoft Windows Server 2012 machine, and there was a deceptive issue: although the time configuration was done correctly and synchronization initially succeeded, while monitoring the NTP packets with tcpdump I suddenly saw the time shift to another timestamp.

   ntp-problem.PNG    ntpconf.PNG

As the first troubleshooting step, I thought it might be caused by the time zone of the vCenter Server (but that was correct) or by a mismatch between the NTP client and NTP server versions. (To check the NTP version on ESXi, use the NTP query utility: ntpq --version; you can also edit ntp.conf to pin the exact NTP version: vi /etc/ntp.conf and append "version #" to the end of the server line.) But NTP is a backward-compatible service, so I did not think that was the reason.

After more and more investigation into the cause of the problem, we decided to change our NTP server, for example to a MikroTik router appliance. After the initial setup and NTP configuration on the MikroTik OVF, we switched our time source. Then, after setting the time manually again with "esxcli hardware clock" and "esxcli system time", we configured host time synchronization with NTP. The initial manual setting is needed because the time delta between the host and the NTP server must be less than about one minute.
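A hedged sketch of that manual pre-sync from the ESXi shell; the date and time values are placeholders, and the esxcli namespaces and ntpd init script are the ones already mentioned in this post:

     # esxcli hardware clock get
     # esxcli hardware clock set -y 2019 -M 03 -d 10 -H 12 -m 00 -s 00
     # esxcli system time set -y 2019 -M 03 -d 10 -H 12 -m 00 -s 00
     # /etc/init.d/ntpd restart
     # ntpq -p                             (verify the host is now tracking the new time source)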

  ntpdsvc.PNG

Then, after restarting the NTP service on the host (/etc/init.d/ntpd restart), I checked again to make sure the problem was resolved.

ntp-check2.PNG

link of post in my personal blog: Undercity of Virtualization: Time differentiate between ESXi host & NTP Server

In one of my projects I had a bad problem with the vSphere environment. The issue occurred in the following situation:

In the first episode, the VCSA encountered a low disk space problem and suddenly crashed. After increasing the size of its VMDK files and fixing that first problem, I saw that one of the ESXi hosts belonging to the cluster was unreachable (disconnected, and vCenter could not connect to it), although both were reachable from my client system. Over SSH the ESXi host was accessible, but the vCenter Server could not connect to this one host.

All network parameters, storage zoning, time settings and service configuration were the same on every host. Sadly, syslog was not configured and we did not have access to the scratch logs for the period when the issue occurred (I don't know why). Restarting all of the host's management agents hung: the services.sh restart process got stuck and nothing really happened, and restarting vpxa and hostd separately did not fix the issue either.
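For reference, the usual agent-restart commands from the ESXi shell (the same ones tried here):

     # /etc/init.d/hostd restart
     # /etc/init.d/vpxa restart
     # services.sh restart                 (restarts all management agents; it can hang when the host is starved of resources)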

There was only one error in the summary tab of the disconnected host, saying that vSphere HA was not configured and asking to remove and re-add the host to vCenter. But I could not reconnect it. My only guess is that it was related to the startup sequence of the ESXi hosts and the storage systems, because the technical support unit had restarted some of them after hitting the problem, so HA automatically tried to migrate the VMs of the offline host to other online hosts; this is the moment I would call a "complex disaster". Stuck, I decided to disable HA and DRS in the cluster settings, but nothing changed and the problem remained. After fixing the VCSA problem I knew that rebooting that host would probably solve the second problem, but because of a VM operation we could not do it; migration did not work and we were confused.

Then I tried shutting down some non-essential VMs belonging to the disconnected host. After releasing some CPU/RAM resources, this time the management agent restart (services.sh restart) completed successfully.

After that, connecting the VCSA to the problematic ESXi host became possible and the problem was gone for good!

Afterwards I wrote a procedure for that company's IT department as a virtualization checklist:

1. Pay attention to the logs of your VI assets. Don't forget to keep them locally in a safe repository and also on a syslog server.

2. Always monitor the used and free CPU/memory resources of the cluster. Never exceed their thresholds, because one host failure may then cause consecutive failures.

3. Keep an eye on the status of the virtual infrastructure management services, including vCenter Server and NSX Manager, and on their disk usage. Run "df -h" in the CLI or check the status of their VMDKs in the GUI. (I explained how to do this in this post.)

4. In critical situations, and even in maintenance operations, always shut down your ESXi hosts first and then the storage systems; when bringing everything back, start the storage first, then the hosts.

5. Finally, please DO NOT disconnect the VCSA's vNIC from its associated port group if it is part of a distributed vSwitch. They did, and it made me suffer a lot to reconnect the VCSA. Even if you restore a new backup of the VCSA, do not remove the network connectivity of the failed VCSA until the problem is solved.

Link to my personal blog: Undercity of Virtualization: An Example of Importance of Management and Controlling Virtual Infrastructure Resources

 

In the third part of the SDDC design series (based on the VMware Validated Design Reference Architecture Guide), we will review one of the major steps of SDDC design: physical design and availability. Before taking any step in datacenter physical design, we should consider an important classification based on availability: regions and zones. Multiple availability zones form a region, but what is an A-Zone?

Unfortunately, disastrous events like earthquakes, massive floods and large power fluctuations can interrupt IT communication, causing service failures or unavailability of network components. So you need to segregate the total DC infrastructure into regions and zones. An SDDC zone, referred to here as an A-Zone, is an independent, physically distinct and isolated area of infrastructure. A-Zones improve the SLA and the redundancy factor, and they must be highly reliable, because containing infrastructure failure boundaries is the main reason for their existence. Interruptions may also have internal causes, such as power outages, cooling problems and generator failures, so each zone should have its own safety teams (HSE and fire departments).

There are two main factors that distinguish an A-Zone from a region: the distance between the two sites (primary/recovery) and the network bandwidth of the fiber connections between them. Basically, A-Zones are at metro distances (less than 50 km / 30 miles), usually connected to each other with dark fiber, with single-digit latency and high network bandwidth, so they can act as active-active or active-passive sites for each other. Beyond that distance range it is highly recommended to place the sites in different regions, while related workloads should be spread across multiple A-Zones belonging to the same region.

 

SDDC business continuity can be improved by using various technologies and replication techniques, such as:

  1. VMware vSphere FT for VM-level fault tolerance (a continuously replicated secondary VM).
  2. VMware vSphere HA to provide VM availability at the host and cluster level.
  3. VMware vSphere DRS to distribute VMs and prevent load piling up on a single host of a cluster.
  4. VMware vSAN as the software-defined storage solution for better availability in environments without a physical storage system.
  5. vSphere Replication as an integrated, appliance-based replication solution inside or outside the site or zone.
  6. Storage-vendor replication solutions as third-party options, such as Dell EMC RecoverPoint, NetApp SnapMirror and HPE 3PAR.
  7. Software replication solutions such as Zerto Virtual Replication and Veeam Backup & Replication.
  8. VMware SRM as one of the best options for site replication and recovery.

 

Link of post on my personal blog: Undercity of Virtualization: VMware SDDC Design Considerations - PART Three: SDDC Availability

 

 

Generally, a core dump is generated whenever the OS kernel sends certain signals to a process, especially when the process tries to access memory outside its address space. The system often crashes in this situation, and the generated errors give us information about hardware faults or application bugs.

Sometimes you may encounter an ESXi host that has crashed; it will try to write diagnostic information to a file called the "VMkernel core dump". This file contains information about the host's halt, the purple screen state, and it is highly important because in that situation you no longer have access to your system data and logs. So it is necessary to collect and analyze the core dump files from all ESXi hosts in one or more repositories.

There are two mechanisms for collecting core dump files: DiskDump, which saves to a designated disk, and NetDump, which sends the core dump information over the network. If ESXi cannot save the core dump to its disk, there may be an issue with the storage devices or their connection to the host (failed array controller, RAID problem, broken physical path to storage, FC/SCSI connectivity problem, SAN switch failure and so on). So you should configure at least one alternative target for saving core dump information.

But before that, let's see what NetDump actually is.

NetDump is a protocol for sending core dump information from a failed ESXi host to the Dump Collector service, and it has these characteristics:

1. Listens on UDP port 6500.

2. Supports only IPv4.

3. Sends clear-text network traffic.

4. Provides no authentication/authorization.

To retrieve the current configuration of the core dump save location:

# esxcli system coredump partition get

# esxcli system coredump network get  (the check command can also be used to verify the collector)

 

If the service is not enabled:

# esxcli system coredump network set --enable true

# esxcli system coredump partition set --enable true --smart

 

To set new configuration for coredump:

# esxcli system coredump partition set --partition="mpx.vmhba2:C0:T0:L0"

# esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.10.10.10 --server-port 6500

 

To find out which storage devices we have on the host:

# esxcli storage core path list

 

For the older version of VMware ESXi:

# esxcfg-dumppart --list

# esxcfg-dumppart --get-active

# esxcfg-dumppart --smart-activate

 

The Network Dump Collector is a built-in service within vCenter Server that provides a way to gather host core dump information. But remember that NetDump does not work if an aggregation protocol such as LACP or EtherChannel is configured for the VMkernel traffic. VMware recommends segregating the VMkernel networking used for NetDump, by VLAN or physical LAN separation, to prevent traffic interception. (In ESXi 5.0, VLAN tags configured at the vSwitch level are ignored during network core dump transmission.)

The name format of a received core dump file is something like this: yyyy-mm-dd-hh_mm-N.zdump.

The default maximum size of a zdump file is 2 GB, and older dump files are deleted automatically. (The Dump Collector service has a non-configurable 60-second timeout; if no information is received in that period, the partial file is deleted.)

Source of content inside my personal blog: Undercity of Virtualization: What is VMKernel Core Dump - Part I

 

Many times I have heard my students ask what VCHA really is and how this feature differs from vSphere HA.

vSphere HA is a cluster-level feature that can be enabled to increase the overall availability of the VMs inside the cluster. Whenever an ESXi host crashes, HA moves the VMs of the failed host to other available resources inside the cluster and reboots them on the new hosts. HA interacts directly with the ESXi HA agent and monitors the status of each host in the cluster by examining its heartbeats; if network segmentation, partitioning or downtime occurs and the ESXi host also cannot provide its heartbeat to the shared datastores, HA considers the host failed and restarts its VMs elsewhere.

vCenter HA, on the other hand, is a feature introduced with vSphere 6.5 and relates directly to the vCenter Server Appliance. It creates a cluster of VCSA VMs in a three-node structure: an Active node (the primary vCenter Server), a Passive node (the secondary vCenter that takes over after a disaster) and a Witness (acting as a quorum). It is only about the availability of the VCSA itself. vCenter HA can be enabled only for the VCSA (because of the native PostgreSQL replication mechanism) and provides extra availability for this mission-critical service inside the virtualization infrastructure.

According to VMware, with VCHA enabled, in case of a vCenter failure operation resumes after roughly 2 to 4 minutes, depending on the vCenter configuration and inventory size. The VCHA activation process itself can be completed in less than 10 minutes.

Now I want to compare these two features with respect to several IT infrastructure concepts:

1. Network Complexity:

The vCenter HA configuration needs a dedicated network to work and is totally separated from the vCenter management network. To run the VCHA cluster successfully you only need three static IPs (or dedicated FQDNs) to assign to the cluster nodes (I always prefer to use a /29 subnet for them). After an Active node failure, the Passive node automatically takes over the vCenter management traffic, and users just need to log in again to vCenter (vpxd, through the API or the Web Client).

A good vSphere HA operation, by contrast, depends mainly on the cluster settings, so you do not need extra network configuration specifically for HA. (In some situations you may just need to separate the host management and vMotion port groups based on network throughput.)

2. Network Isolation:

When there is a partition between the hosts of a cluster, if a host also cannot send any heartbeat to the shared datastores it is considered failed, so HA tries to restart all of its running VMs on other healthy hosts. I want to emphasize that, with respect to the availability of the VMs in the cluster, there are two mechanisms for detecting failures: the network connections (between the hosts and vCenter) and the storage communication (the datastore heartbeats in the SAN area).

But if there is network segmentation between the vCenter HA nodes, we must look carefully at what is really going on, that is, between which nodes of the cluster the separation happened. If the Active-Passive or even the Active-Witness pair is still connected, there is no need to worry, because the Active node remains responsible for VI management. But what happens if the Active node is the isolated one?! Operationally it drops out of the VCHA cluster and stops servicing requests, and the Passive node continues the job.

3. Multiple failures:

In the case of consecutive failures, if there are enough resources (RAM and CPU) inside the cluster, vSphere HA can handle it, because it keeps restarting VMs on whatever ESXi hosts are still available. Just remember to check the Admission Control policy settings so they account for multiple ESXi host failures.

With vCenter HA, however, you should know that VCHA is not designed for multiple failures, so after a second failure the VCHA cluster is no longer available or functional.

4. Utilization, Performance and Overhead:

There is a little overhead on the primary vCenter when VCHA is enabled, especially whenever the vCenter Server has a lot of tasks to process.

The Witness needs the least CPU, because it runs only the VCHA service; the Passive node is almost the same, running just VCHA and PostgreSQL replication. There is no particular concern about memory usage.

If you want vSphere HA to work at its best, you must pay attention to the remaining resources in the cluster, because a bad HA configuration can make the cluster unstable. For the best performance across the whole cluster you need to calculate the availability level based on the used and remaining physical resources; specifying at least two dedicated failover ESXi hosts can be a suitable HA configuration.

 

Source of content inside my personal blog: Undercity of Virtualization: vSphere HA vs vCenter HA

If you want to check hardware information or more details about your servers before an operation like changing or adding physical resources, you normally need to power off or reboot the server to read the POST information. However, because the server is operational, you may not be able to do that. There is a good command to help you in this situation: smbiosDump.

For instance, to check your CPU, memory, NIC and power supply configuration:

  1. smbiosDump | grep -A 4 'Physical Memory Array'   #Total slots and max size of memory
  2. smbiosDump | grep -A 12 'Memory Device'          #Type and size of each slot
  3. smbiosDump | grep -A 12 'CPU'                    #Processor details: voltage, clock & cache
  4. smbiosDump | grep -A 4 'NIC'                     #Network adapters and iLO details (HP)
  5. smbiosDump | grep -A 3 'Power'                   #Power supply and part number

 

Source of content inside my personal blog: Undercity of Virtualization: Check VMware ESXi Hardware information by smbiosDump