
amin masoudifard's Blog


I have prepared a document about how to deploy vRealize Network Insight (vRNI), a great solution for managing the networking components of vSphere and NSX and their traffic flows in the virtualization infrastructure. The document also includes some command lines for the vRNI Proxy (Collector) appliance, used to manage and troubleshoot its connectivity to the vRNI primary server. I hope it can be helpful for you all.

Storage infrastructure is one of the main parts of any IT environment, so a good design and principled configuration make troubleshooting every possible issue in this area easier. One of the primary components of the storage infrastructure is the HBA, the connector between servers and the storage area. We can therefore trace many of the most likely storage-related problems back to the Host Bus Adapter installed in the ESXi host and its physical connections to the SAN storage or SAN switches. So let's begin with step-by-step storage troubleshooting inside a VMware infrastructure.

The first situation may occur with a local array of disks that is not detected as a local datastore. You can check the status of the internal disk controller (for example in an HP ProLiant server) by running the following command:

hpsa.png

 

cat /proc/driver/hpsa/hpsa0


The result will look like the following (please note carefully where I use the lowercase hba and where the capitalized HBA form):


vmkmgmt.png

But if the considered datastore is not local and is instead a shared volume on a SAN storage array in our infrastructure, then we must check the HBA status:

/usr/lib/vmware/vmkmgmt_keyval -a | less

 

The last mentioned command is available in ESXi 5.5 and higher, so for older versions you must check the following folders for the two most popular HBA vendors:

  •    Qlogic:   /proc/scsi/qla2xxxx
  •    Emulex: /proc/scsi/lpfc

 

 

Also, if you don't find the related vmhba adapter in the output of the following commands, it means the ESXi host has not detected your HBA yet:

  • vmkchdev -l | grep hba
  • esxcfg-info | grep HBA

vmkchdev.png

swfw.png

 

 

You can also run the swfw.sh script and combine it with grep to find information about the HBA devices connected to the ESXi host, including the device model, driver, firmware, and the WWNN for an FC HBA (the InstanceID value):

/usr/lib/vmware/vm-support/bin/swfw.sh | grep HBA

 

 

 

core-device.png

 

 

In another situation, imagine you have deployed a new SAN storage array in a vSphere cluster, but you are not sure whether the HBA has detected the presented LUN or not. As the first step, run the ESXCLI command below:

esxcli storage core device list

 

In the output, check the important fields such as Display Name, Device Type, Devfs Path, Vendor, and Model.

Next you can run the following command, which gives you more information about the HBA adapters and the state of each one:

esxcli storage core adapter list

 

VMware Definition Tip1: NAA (Network Addressing Authority) or EUI (Extended Unique Identifier)  is the preferred method of identifying LUNs and the number that follows is generated by the storage device itself. Since the NAA or EUI is unique to the LUN, if the LUN is presented the same way across all ESXi hosts, the NAA or EUI identifier remains the same.
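Since the NAA identifier should be identical on every host that sees the LUN, you can also filter the device list down to a single LUN when comparing hosts. A minimal sketch (the identifier below is hypothetical):

     # show the details of one LUN by its NAA identifier (hypothetical ID)
     esxcli storage core device list -d naa.600508b1001c7d2a1b2c3d4e5f600708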

core-adapter.png

partition.png

 

This command will also show you the list of partitions detected by the ESXi host:

esxcli storage core device partition list

 

VMware Definition Tip2: You will see two partition types, fb and fc: fb is the system ID for VMFS and fc is the VMkernel core dump partition (vmkcore).

There are more useful storage commands, like the old-school CLI esxcfg-scsidevs (-a shows HBA devices, -m shows mapped VMFS volumes and -l lists all known logical devices):
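For example, the three variants described above can be run as follows:

     esxcfg-scsidevs -a     # list the HBA (vmhba) adapters
     esxcfg-scsidevs -m     # list the mapped VMFS volumes
     esxcfg-scsidevs -l     # list all known logical devices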

esxcfg-scsi.png

So finally, as the conclusion of this first part of troubleshooting storage-side problems in a vSphere environment, we learned that we need to check the status of the HBAs, how they are performing, and the disk devices, LUNs and volumes connected through each of them. I hope it can be helpful for you all.

Link to my personal blog: vSphere Storage Troubleshooting - Part 1: HBA & Connectivity

Many people are confused about how a host really assigns CPU resources to the virtual machines; more precisely, how the processing operations of a VM are executed on the physical CPU resources. In Intel terminology the physical processor is a CPU socket, but in this post I consider a pCPU to be a physical core in the sockets installed in the server.

By default, each vCPU added to a VM is assigned to one of the existing pCPUs. So if we configure 8 vCPUs for a VM, at least 8 pCPUs must exist in the host. In other words, if there are not enough pCPUs for the VM, it cannot be started.

By design, VMware ESXi can handle CPU oversubscription (more requested vCPUs than existing processors/pCPUs). It means the pCPU:vCPU ratio is not one to one (1:1) anymore. In a vSphere environment the ESXi host handles the processing operations to execute the requests of every VM, so the host needs to schedule processing time for each of them. But the question here is: what ratio should be configured as the best setting? The answer depends on whether you favor capacity or performance, and it can be very different based on the virtualized applications' requirements...

Each VM needs pCPU resources, so running many VMs, especially highly loaded and resource-consuming virtual machines, demands more CPU cycles. If you provision more VMs and also increase the pCPU:vCPU ratio (1:2, 1:4 or greater), the performance of the ESXi host will be affected.

As VMware mentions, the vSphere ESXi scheduling mechanism prefers to reuse the same vCPU-to-pCPU mapping to boost performance through CPU caching on the socket. If there is no specific documentation for the CPU design of the application, you can start it with a single vCPU and then scale up as required, so oversubscription will not have a serious negative impact.

We must also consider CPU Ready Time, which is as important a metric as CPU utilization (a quick way to check it is sketched after the list below). Generally, the vCPU:pCPU ratio depends on many factors, such as the following:

  1. The version of the ESXi host; each newer version supports a higher ratio.
  2. The features and technologies supported by the physical processor.
  3. The workload rates of the critical applications that run in the virtual environment.
  4. The capacity of the processor resources in the other members of the cluster and their current performance, especially when we require a higher level of host fault tolerance in the virtualization infrastructure. The available resources in the cluster determine on which host each VM can be placed in the event of a host failure.
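As a quick way to review CPU Ready alongside utilization, esxtop can be run in batch mode and analyzed later; a minimal sketch (the 5-second interval and 12 iterations are arbitrary example values):

     # capture 12 samples, 5 seconds apart, into a CSV and review the %RDY column per VM afterwards
     esxtop -b -d 5 -n 12 > /tmp/esxtop-cpu.csv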

 

Should we use Hyperthreading or not ?!

Hyperthreading is a great technology that makes a single pCPU act as two logical processors. When the ESXi host is not heavily loaded, each physical core can handle two independent threads (one per logical processor) at the same time. So if you have 16 physical cores in the ESXi host, after enabling HT (in both the BIOS configuration and the ESXi advanced settings) you will see 32 logical processors on the host. But using HT does not always mean increased performance; it is highly dependent on the application architecture, so in some cases you may even encounter performance degradation with HT enabled. Before enabling HT on your ESXi hosts, review the critical virtualized applications deployed on their VMs.
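To verify whether Hyperthreading is supported, enabled and active on a host before and after changing the BIOS/ESXi settings, you can check it from the ESXi shell, for example:

     # shows the CPU package/core/thread counts plus the Hyperthreading Supported/Enabled/Active flags
     esxcli hardware cpu global get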

Source of original post in my personal blog: Undercity of Virtualization: Virtualization Tip1: Relation between physical CPU & virtual CPU

In this post and the other parts of this series, I will review some important hints for a good datacenter virtualization design. But before anything else, I want to ask you some major questions:

  1. What are the key components of an ideal virtual infrastructure for different IT environments?
  2. How will you set up the virtual infrastructure?
  3. And what elements require attention before and after the deployment and implementation phases?

In this post and other parts of this series, I want to deep dive into the details of good design for the virtual infrastructure based on VMware products.

In this first part, I investigate the basic requirements and prerequisites for IT infrastructures to migrate to virtualization. In the other parts, I will review VMware's primary services and their impact on achieving this goal.

 

1. Physical to Virtual

The first step is estimating the real physical resource needs of the services being provided. Processor clock rate (GHz), memory and disk usage (GB), and network transmission rate (Gbps) must be calculated separately for each existing service; only then can we talk about the resources required for server virtualization. We should also consider the hypervisor (ESXi host) overhead and add this amount to the total estimate.

P2V migration always impacts service availability and usually requires operational downtime for the migrated service/OS. There are also some complexities in this area, including:

  1. The type of OS and whether it is supported by the converter application.
  2. Specific application dependencies on locked hardware.
  3. Software licensing problems.
  4. SID/GUID change issues for services like Active Directory.

So below I have provided a questionnaire about the P2V operation; you must answer each question carefully before executing the real migration:

  1. Is it necessary to virtualize everything? And are you really sure about your answer? Why or why not? What is the reason for keeping a system in the physical area, or for migrating it to the virtual world? The answer to these questions depends on your infrastructure requirements, and you should answer them correctly for each of the important components and servers in your infrastructure.
  2. Have you organized and prioritized each of the physical servers? Which ones must be at the top of this list, and which ones are good candidates for the pilot and test phase? I think selecting low-risk servers with non-critical workloads is a good option for this stage.

Finally, you should prepare a checklist like the following to specify the servers' priority order:

  1. Application servers with low storage requirements and simpler network and OS configuration
  2. Web servers with normal demand/request handling rate and also fewer dependencies to/from other servers
  3. Network infrastructure services like VPN, DHCP, NPS
  4. Mission-critical and organizational Application servers
  5. Database servers based on SQL, Oracle and so on
  6. Unified communication services like Mailbox, VoIP, IM servers
  7. Most important services in IT infrastructure like Directory services

 

2. Storage resources… How to provision?

If the physical server is attached to a storage device/LUN/volume, two difficulties may exist:

  1. Lack of enough space, if all the used space of the mentioned storage must be migrated with the server to the new space provided by the hypervisor's local storage
  2. Access to the storage management system for zoning re-configuration and providing storage accessibility for the newly deployed VM

On the other side, for services with highly critical transaction log files, like Exchange Server, migration of the mailbox databases requires considering the rate at which the log space can suddenly grow. Finally, in every kind of P2V migration we need to pay close attention to temporary and permanent storage space.

 

3. Security considerations, as in the physical and traditional deployment

When choosing the virtualization platform, the selected solution must supply every security technology already deployed in the physical network. It is recommended that every aspect of physical switch security, features like MAC learning, Private VLANs and so on, be supported by the virtual switches. The distributed vSwitch technology used in the VMware vSphere platform is an ideal virtual networking solution, supporting many advanced concepts like port mirroring and NetFlow. Besides the VMware distributed switch (VDS), products from vendors like Cisco, HP and IBM are supported by the vSphere networking platform; for example, the Cisco Nexus 1000v is designed as an integrated distributed vSwitch for the VMware platform. Of course, VDS design and migration from the vSphere standard switch (VSS) to the VDS has its own implementation considerations (which I reviewed in this video playlist on my YouTube channel).

 

4. Provide suitable physical resources for virtual infrastructure

One of the important characteristics of server virtualization compared with traditional server provisioning is the increased rate of service availability, and this requires building a VMware cluster. As a result, complying with the deployment prerequisites, such as using the same CPU generation and technologies in the ESXi members of the cluster, is required.

It is also recommended to use more, similar physical servers instead of fewer servers with more physical resources. For that reason, blade servers are a better choice as the hypervisor hardware than other types of servers, like tower servers.

 

5. Do not forget cleanup operation

After the migration has completed successfully, you should start the post-migration operations, including checking the virtual hardware devices detected inside the VM and removing everything that is no longer required on the newly converted VM. For example, in a Windows guest OS you can run set devmgr_show_nonpresent_devices=1, then run devmgmt.msc, go to View > Show hidden devices, and finally remove the unnecessary or hidden items.

In the next part, I will talk about the power supply used for the computing and storage racks and how to calculate it.

Source of original post in my personal blog: Undercity of Virtualization: Best practice for a good Virtualized Datacenter Design - Part 1

Every part of the virtual infrastructure needs a channel to communicate, and a safe and secure channel always requires a certificate. ESXi hosts, vCenter Server, NSX Manager, Horizon Connection Server and so on: each one of them has at least a machine certificate or a web-access management portal with a self-signed SSL certificate. Since the introduction of vSphere 6.0, the Platform Services Controller (PSC) handles the vSphere-generated certificates through its VMware Certificate Authority (VMCA). But in this post I want to introduce some CLIs to manage VMware certificates:

  1. VECS-CLI: This is a useful CLI to manage (create, get, list, delete) certificate stores and private keys. VECS (VMware Endpoint Certificate Store) is the VMware SSL certificate repository. Pic1 shows the usage of some of its syntax, and a short example follows this list:

    vecs-cli-exmp1.png

  2. DIR-CLI: Manage (create, list, update, delete) everything inside the VMware Directory Service (vmdir): solution user accounts, certificates, and passwords.

    dircli-p1.png

    dircli-p2.png

  3. Certool: View, Generate and revoke certificates.

    certool.png
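As a quick illustration of the vecs-cli syntax shown in the pictures above, here is a minimal sketch that lists the VECS stores on a VCSA and dumps the Machine SSL certificate entry in readable text (MACHINE_SSL_CERT is the default store name):

     /usr/lib/vmware-vmafd/bin/vecs-cli store list                                   # list all certificate stores kept in VECS
     /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text   # show the machine SSL certificate entry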

There are many types of stores inside the VECS:

  1. Trusted Root: Includes all of the default or added trusted root certificates.
  2. Machine SSL: With the release of vSphere 6.0, all communication of the VC & PSC services goes through a reverse proxy, so they need a machine SSL certificate, which is also backward compatible (v5.x). An embedded PSC also requires the machine certificate for its vmdir management tasks.
  3. Solution users: VECS stores a separate certificate with a unique subject for each of the solution users, like VPXD. These user certificates are used for authentication with the vCenter SSO.
  4. Backup: Provides a revert action to restore (only) the last state of the certificates.
  5. Others: Contains VMware or some Third-party solution certificates.

Now let me ask: what are the roles of the solution users? There are five solution users:

 

  1. machine: The license server and logging service are its main activities. It is important to know that the machine solution user certificate is totally different from the machine SSL certificate, which is required for the secure connections (like LDAP for vmdir and HTTPS for web access) of each node of the VI (VC / PSC instance).
  2. SMS: Storage Monitoring Service.
  3. vpxd: The vCenter daemon activity (managing VPXA, the ESXi host agents).
  4. vpxd-extensions: Extensions like Auto Deploy and the Inventory Service.
  5. vsphere-WebClient: Certainly the Web Client, plus some additional services like the performance charts.

The default paths of certificate management utilities are down below:

     /usr/lib/vmware-vmafd/bin/vecs-cli

     /usr/lib/vmware-vmafd/bin/dir-cli

     /usr/lib/vmware-vmca/bin/certool

 

And for the Windows type of vCenter Server, you can go to the default path:

     %programfiles%\vmware\vcenter server\vmafdd

 

Surely I will talk about what vmafd itself is, and about another useful CLI in this path, vdcpromo, in another post. Also, I will provide a video about how to work with certificate-manager.

As a last note, always remember that deleting the Trusted Roots is not permitted, because doing so can cause some complicated problems in your VMware certificate infrastructure.

Link of content inside my personal blog: Undercity of Virtualization: Manage VCSA Certificates - Chapter I

In the third part of the VDI troubleshooting series, unlike the last two parts, I want to talk about client-side connection problems. For instance, if there is a dedicated subnet of IP addresses for zero client devices, an incorrect setup or misconfiguration of the routing settings can be the reason for connection problems between VDI clients and servers. In the same way, wrong VLAN configuration (ID, subnet, inter-VLAN routing) can be the main cause of the trouble. So I prepared a checklist of "What to do if you have a problem with your Horizon Connection Servers?"

 

1. Check the correctness of Zero/Thin client's communication infrastructure (routing, switching, etc) to the VDI servers (Connection Server, Security Server)

2. Check network connection between Connection Server subnet and deployed Virtual Machines of Desktop Pool, if they are separated. Of course, logically there is no need to connect their dedicated Hosts/Clusters to each other, so you can have separate ESXi Clusters, one for Desktop pools and another for VDI Servers.

3. Verify that the vCenter Server is accessible from the Connection Server, and also check its related credential.

4. If you have a Composer Server, check its services. Many times I have seen that the Composer Server service does not start after a server reboot, even though it is set to automatic and no warning/error event has been reported. You also need to check the ODBC connection between the Composer Server and its database.

5. Check the state of the View Agent installed inside the desktop pool's VMs. If you need to provide client redirection to the desktop (without the presence of a Connection Server), the View Agent Direct-Connection component is needed too.

6. A TCP connection on port 4001 (non-SSL) / 4002 (SSL-based) between the desktop's View Agent and the Connection Server must be established; it is required for the connection, and you can check it by running netstat -ano | findstr "4001".

7. Review the user entitlements of the provided desktop pools; maybe there is a mistake, especially when you add AD groups instead of AD users. (Also check them: are they still available, or assigned to other users?)

8. Type of Virtual Desktop provisioning is also important. Except for Full Clone, on Linked Clone and Instant Clone models, you need to check the status of Virtual Desktops in Inventory\Resources\Machines of the View Admin web page.

9. If there is an interruption in connected sessions, you need to review their states in Inventory\monitoring of the View Admin web page.

10. As a last note: DO NOT FORGET TO CONFIGURE THE EVENT DATABASE! I have encountered too many Horizon View deployments with no event database configured, so in troubleshooting situations we had NOTHING to tell us what really happened.

I hope it can be helpful for you all buddy...

Link to the original post on my personal blog: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part III

 

Memory resource provisioning is one of the biggest challenges for IT administrators and virtualization designers. Although CPU provisioning is a major factor in the virtual infrastructure, most of the time it gets less attention, because many applications deployed on virtualization need more memory, not more processor. Today's CPU technologies are very powerful, but applications like SQL Server and Oracle need more memory for some of their processes. A lack of memory resources may restrict the growth of many virtual infrastructures, so what should we do in this situation before providing more physical resources for our hypervisors?

There are several technologies to handle memory over-commitment, such as swapping and ballooning, and VMware ESXi uses them to confront these issues. In this post I will review the role and importance of both mechanisms:

1. Virtual memory ballooning is a memory reclamation technique that lets the VMkernel retrieve idle memory pages. When the ESXi host has less than 6% free memory, ballooning comes into the ring to handle the out-of-memory problem! If a VM has many idle pages, its host borrows them to use as temporary overhead for the VMs with higher memory demand, because those probably have memory-intensive processes.

When a virtual machine wants to release some of its previously used pages backed by host physical memory, it does not actually remove them; it just moves the address-space pointer from the allocated memory list to the free memory list. So the VM, via the balloon driver (vmmemctl.sys), can decide which pages (idle pages) can be reclaimed back to the host (up to 65% of the VM's guest memory) and which ones it still needs (already used pages), without involving the host in this decision procedure; this is when the inflating step happens. If you disable the memory ballooning driver inside the guest OS, the VM will not be aware of the host memory state and the amount of available or unused physical memory, and the hypervisor cannot understand how much memory it can take back for other VMs' memory requests.

2. Host swapping is another mechanism used in a low-memory situation, but unlike ballooning it does not involve the VM guest OS. Host swapping occurs if the host has less than 2% free memory and needs to provide more memory resources to memory-intensive VMs, so the hypervisor must have enough swap space. Although swapping is a reliable way to deal with memory over-commitment, it may degrade overall performance, because the efficiency of the generated swap files highly depends on the IOPS rate of the datastore chosen for swapping. So if we don't provision suitable disks (like SSDs) for the swap files, it can lead to low performance.

There is one crucial point: if memory compression can happen, there is no need to execute the hypervisor swap procedure. (I will discuss this in more detail in another post.)

Remember, the ESXi host's memory stays in the shared state until the total memory required by the VMs (memory requests) is greater than the available physical memory (over-commitment). So if we want to prevent this situation before ballooning activation, we need to do the following:

1. Carefully estimate the overall required memory, and provide at least 25% more memory in the design phase.

ballooning.png

2. Use VMware DRS at the cluster level to distribute VMs between hosts and balance the resource usage; then there is no need to worry about low memory. It is recommended to provision equal memory for all of the ESXi hosts.

3. Use resource pools, especially at the cluster level, and reserve the estimated memory for mission-critical applications and high-resource-demand VMs to ensure they have access to enough resources. Reserved memory is also protected against memory reclamation, even when contention happens.

4. Do NOT disable the balloon driver, because it is the most intelligent memory reclamation technique. But if you really want to do that, you can do it via the VM's advanced configuration (set the sched.mem.maxmemctl entry of the .vmx file to 0) or in the guest OS Windows registry (set the Start value of HKLM\SYSTEM\CurrentControlSet\services\VMMEMCTL from 2 to 4). A sketch of the .vmx approach follows.
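A minimal sketch of the .vmx approach, assuming the VM is powered off and its directory path below is hypothetical; setting the value to 0 prevents the balloon driver from reclaiming any memory from that VM:

     # append the advanced parameter to the VM's configuration file (path is hypothetical)
     echo 'sched.mem.maxmemctl = "0"' >> /vmfs/volumes/datastore1/MyVM/MyVM.vmx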

 

Link of post on my personal blog: Undercity of Virtualization: Memory Virtualization: Management & Reclaiming - Section I


The VMware vCenter Server Appliance Management Interface (VAMI) is a very useful web console for VCSA management and maintenance operations, available at HTTPS://VCSA_URL:5480. One of its tools is Backup, with which you can take a backup of the vCenter data, consisting of Inventory & Configuration (required) and Stats, Events & Tasks (optional).

VMware published VCSA 6.7 (version 6.7.0.14) with these protocols for backup: FTPS, HTTPS, SCP, FTP and HTTP, and announced that NFS & SMB would be supported after VCSA 6.7 U2 (version 6.7.0.3). We had two big problems with this useful tool; one of them is related to the vSAN Health Service.

 

Whenever the backup task was started, it stopped immediately and generated a warning about the vSAN Health Service, because the service seemed to have crashed (the VCSA management GUI tells us exactly that this happened). Sadly, if you try to start this service (even with the --force option) it leads to another failed attempt, and the result is something like this:

vSAN-Error.png

 

So after many retries to start this service, I decided to check the file structure of this service in the path /etc/vmware-vsan-health and compare it with a freshly installed vCenter Server.

vSAN-files.png

There are two files that could be related to the cause of this issue: the logger.conf file, which was absolutely empty in the troubled vCenter Server (vi shows nothing), whereas in a healthy VCSA you can see something like the results below:

VSAN-logger.png

When I checked vsanhealth.properties, it showed that this service communicates over HTTPS, so its connections need an SSL structure. Then I found the second file, fewsecrets, which contains something like two hash strings. So I decided to take the risk and remove this file, and the logger.conf file too (of course after taking a backup of them). At last, after a few minutes, the next attempt to start the service was successful.
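For reference, a sketch of the workaround described above, assuming the files live under /etc/vmware-vsan-health and the service is listed as vsan-health by service-control:

     # keep a backup of the two suspect files before touching them
     cp /etc/vmware-vsan-health/fewsecrets /tmp/fewsecrets.bak
     cp /etc/vmware-vsan-health/logger.conf /tmp/logger.conf.bak
     # remove them and try to start the vSAN Health Service again
     rm /etc/vmware-vsan-health/fewsecrets /etc/vmware-vsan-health/logger.conf
     service-control --start vsan-health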

vSAN-Started.png

 

Remember that you always need to check DNS (forward and reverse resolution), NTP, certificates and the firewall, especially if you set up the vSphere environment with an external Platform Services Controller. I will explain the second problem in another post.

Link to my personal blog: Undercity of Virtualization: VCSA Backup Failed because of VSAN

Now it's time to investigate the troubleshooting of VMware Horizon View more deeply. In this post I want to continue talking about the LDAP structure and data of the VDI server. In the first post of this series, I talked about how to connect to the View LDAP with the Windows MMC snap-in ADSI Edit. Now I will show you which VDI objects belong to which OU in the Directory Service hierarchy of VMware View:

 

vdi1.PNG

1. OU=Server Groups specifies the list of desktop pools in the Horizon environment.

2. OU=Servers contains all the VMs (desktops) that have been deployed by every desktop pool.

3. OU=Data Disk lists all the generated virtual disks belonging to each desktop.

4. OU=Groups contains all the predefined admin groups and manually added roles in the Horizon administration console, with their allowed permissions mentioned in the pae-AdminRolePermissions attribute of the defined object.

5. OU=Applications covers all the virtual apps added to the Horizon environment, for example by an application pool of an RDS farm. Each of the created apps is listed here.

Now let's review sub_OUs of OU=Properties:


1. If you configured the View event database, you can see the related object in the sub-OU OU=Database as a pae-EventDatabase class. The database server type and instance name, the configured TCP port, the database name and also the event longevity are the main attributes of this object class.

vdi2.PNG

2. OU=Server is about the Horizon View servers, with the pae-VDMProperties object class. OU=Server,OU=LVM contains the VDI servers (the same object class) that are related to Linked-Mode desktop pools.

3. OU=VirtualCenter lists the configured vCenter Servers (VC) and Composer servers (SVI) with the object class type pae-VirtualCenter. You can also check the specified connection credential and the URL address of each server: https://VC:443/sdk and https://SVI:18443

4. OU=Global contains some important objects such as:

4-1 CN=Common, with some important attributes about VDI management, like the Pod name (or cluster name, which is generated from the computer name of the first/primary installed Connection Server), the timeout of the console session and the connected desktop, the maximum session duration, the syslog-related configuration, the pre-forced-logoff message for Horizon endpoint users, the IPsec mode, and so on.

4-2 CN=License, with the hashed form of the imported license key for VMware Horizon View.

4-3 CN=Keys contains the RADIUS configs, some session timeouts (like RDP), the VDM gateway and security server settings, the security server pairing settings, and so on.

I tried to mention some useful and critical OUs of the VMware Horizon View LDAP structure in this post; if you think I forgot to review another important object of the View LDAP, I would appreciate it if you told me about it.

Link to my personal blog's post: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part II

Usually we think it's easy to change most servers' names! But a small configuration change like editing the FQDN value may destroy all of your configuration and settings. It can turn into a disaster if you change the computer name/account of a server without any consideration or checklist about how to do it. Sometimes, however, you may make a mistake in the initial server name configuration and only realize it after the service is set up and started (you forgot to change the default name, or to choose a suitable name based on your design worksheets). Now let's examine this matter for a VMware Horizon View Connection Server. First of all, you should answer some questions, such as:

   1. What should we do if we need to change the computer account?

   2. What will happen if we change the computer account?

   3. What should we do as post-execution steps after renaming the server?

As the first step, you have to review the service documentation, especially the troubleshooting documents. Then investigate the side effects on your virtual desktop infrastructure objects, like desktop pools or provisioned and in-use virtual desktops. Naturally, none of them can connect to the server anymore, especially if you change the primary connection server or any of the replica servers. As the best and safest way to reconfigure the server after renaming it, you can uninstall the VMware Horizon 7 Connection Server component (and also HTML Access) and install it again, without any concern about losing the VDI data and structure. That is because there is another important application on the provisioned connection server: the AD LDS instance "VMware VDMDS", which, as its name demonstrates, is the directory service for the VMware VDI suite and is a component separate from the Horizon View Connection Server.

So let me explain the structure of Horizon View. Its foundation is based on the Lightweight Directory Access Protocol (so you cannot install this service on a domain controller). The View LDAP is a data repository that contains all of the View configuration information, and it is created when you install the first View Connection Server. This repository is then replicated to the other View replica servers. Like other LDAP services it has partitions, objects and object attributes, and it can be edited with ADSI Edit; just remember, if you want to do that, to type the distinguished name like:

dc=vdi, dc=vmware, dc=int

Then you can find the connection server object in these sub-OUs: 'Properties\Server' and 'Properties\LVM\Server'.

The second way to check and change the VDI configuration on the connection servers is with the Windows registry editor. You can see the related path (HKLM\Software\VMware, Inc.\VMware VDM) for Horizon View in the second picture:

But regardless of these two rough and dangerous methods, VMware recommends the vdmadmin CLI for troubleshooting Horizon View (note that using regedit is not a suitable way). If you go to the following path, you can also see other useful CLIs like vdmexport and vdmimport:

%ProgramFiles%\ VMware\ VMware View\ Server\ tools\ bin\

We know that the information gathered from each of these troubleshooting methods must be the same; for example, if you check the system GUID, all of the ways must return the same value:

Reinstalling the Connection Server component is the fastest and easiest way, but if the risk assessments and information security policies of your organization prevent you from doing that, which method will you choose to reconfigure your virtual desktop infrastructure servers? We will review more about Horizon View troubleshooting in other parts of this series.

  Source of content on my personal blog: https://virtualundercity.blogspot.com/2019/02/vmware-vdi-horizon-view-troubleshooting.html

I recorded a video series to illustrate vSphere Distributed Switch design and configuration; I hope it can be useful for all of you.

Undercity of Virtualization: vSphere Distributed Switch Design & Configuration - Part I: Create & Basic Setup

Yes, exactly: another post about the NTP service and the important role of time synchronization between virtual infrastructure components. In another post I described a problem with ESXi 6.7 time settings and also talked about some useful CLIs for time configuration, both manual and automated. But in a lab scenario with many versions of ESXi hypervisors (because of the server types we cannot upgrade some of them to a higher version of ESXi), we planned to configure an NTP server as the "time source" of the whole virtual environment (PSC/VC/ESXi hosts and so on).

But our first deployed NTP server was a Microsoft Windows Server 2012, and there was a deceptive issue. Although the time configuration had been done correctly and time synchronization occurred successfully, when I was monitoring the NTP packets with tcpdump I suddenly saw that the time had shifted to another timestamp.

ntp-problem.PNG

ntpconf.PNG

In the first step of troubleshooting, I thought it might have happened because of the time zone of the vCenter Server (but that was configured correctly), or because the NTP client and NTP server versions were not the same (to check the NTP version on ESXi, use the NTP query utility: ntpq --version; to set an exact NTP version, edit the ntp.conf file with vi /etc/ntp.conf and add "version #" to the end of the server line). But NTP is a backward-compatible service, so I thought that was not the reason for this issue.

So after more and more investigation into the cause of the problem, we decided to change our NTP server, in this case to a MikroTik router appliance. After the initial setup and NTP configuration on the MikroTik OVF, we changed our time source. Then, after setting the time manually again with "esxcli hardware clock" and "esxcli system time", we configured host time synchronization with NTP. The initial manual setting must be done because the time delta between your host and the NTP server must be less than one minute.
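A minimal sketch of those manual steps on the ESXi host (the date and time values are only placeholders):

     esxcli hardware clock set -y 2019 -M 5 -d 20 -H 10 -m 0 -s 0   # set the hardware clock close to the NTP source
     esxcli system time set -y 2019 -M 5 -d 20 -H 10 -m 0 -s 0      # set the system time to match
     /etc/init.d/ntpd restart                                       # restart the NTP daemon to pick up the new source
     ntpq -p                                                        # verify the peer, offset and reachability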

  ntpdsvc.PNG

Then, after restarting the NTP service on the host (/etc/init.d/ntpd restart), I checked it again to make sure the problem had been resolved.

ntp-check2.PNG

link of post in my personal blog: Undercity of Virtualization: Time differentiate between ESXi host & NTP Server

In one of my projects I had a bad problem with the vSphere environment. The issue occurred in the following situation:

In the first episode, the VCSA server encountered a low disk space problem and suddenly crashed. After increasing the size of the VMDK files and fixing that first problem, I saw that one of the ESXi hosts belonging to the cluster was unreachable (disconnected, and vCenter could not connect to it, although both of them were reachable from my client system). Over SSH I verified that the ESXi host was accessible, but the vCenter Server could not connect to this one host.

All network parameters, storage zoning settings, time settings and service configurations were the same on every host. Sadly, syslog had not been configured and we did not have access to the scratch logs for the period in which the issue occurred (I don't know why). Trying to restart all the management agents of the host hung: running services.sh restart got stuck and nothing really happened, and restarting vpxa and hostd individually didn't fix the issue either (see the commands below).
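For reference, these are the standard agent-restart commands mentioned above, run from the ESXi shell:

     /etc/init.d/hostd restart    # restart the host daemon
     /etc/init.d/vpxa restart     # restart the vCenter agent on the host
     services.sh restart          # restart all management agents (can hang on a resource-starved host)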

There was only one error in the Summary tab of the disconnected host, describing that vSphere HA was not configured and asking to remove and add the host to vCenter again. But I couldn't reconnect it. My only guess is that it was related to the startup sequence of the ESXi hosts and the storage systems, because the tech support unit had restarted some of them after running into the problem, so HA automatically tried to migrate the VMs of that offline host to the other online hosts; this is the moment I want to call a "complex disaster". Stuck, I decided to disable HA and DRS in the cluster settings: nothing changed, the problem still existed. After fixing the VCSA problem I knew that if we restarted that host, the second problem would probably be solved, but because of a VM operation we couldn't do it. Migration did not work and we were confused.

Then I tried to shut down some non-essential VMs belonging to the disconnected host. After releasing some CPU/RAM resources, this time the management agent restart (services.sh restart) completed successfully.

So connecting the VCSA to that problematic ESXi host became possible, and the problem was gone for good!

After that, I wrote a procedure for that company's IT department as a virtualization checklist:

1. Pay attention to your VI assets' logs. Don't forget to keep them locally in a safe repository and also on a syslog server.

2. Always monitor the used and free processing/memory resources of the cluster. Never exceed their thresholds, because one host failure may cause consecutive failures.

3. Control the status of the virtual infrastructure management services, including vCenter Server and NSX Manager, and also their disk usage. Execute "df -h" in the CLI or check the status of their VMDKs in the GUI. (I explained how to do this in this post.)

4. In critical situations, or even during maintenance operations, always shut down your ESXi hosts first and then the storage systems; when bringing the environment back up, first start the storage and then the hosts.

5. In the end, please DO NOT disconnect the vNIC of the VCSA from its associated port group if it is part of a distributed vSwitch. They did that, and it made me suffer a lot to reconnect the VCSA. Even if you restore a new backup of the VCSA, don't remove the network connectivity of the failed VCSA until the problem is solved.

Link to my personal blog: Undercity of Virtualization: An Example of Importance of Management and Controlling Virtual Infrastructure Resources

 

In the third part of the SDDC design series (based on the VMware Validated Design Reference Architecture Guide), we will review one of the major steps of SDDC design: physical design and availability. Before any step of the datacenter physical design, we should consider an important classification from the availability point of view: regions and zones. Multiple availability zones form a region, but what is an A-Zone?

Unfortunately, many disastrous events like earthquakes, massive floods and large power fluctuations may cause interruptions of IT communication in the form of service failures or unavailability of network components. So you need to segregate the total DC infrastructure into regions and zones. A zone of the SDDC, referred to here as an A-Zone, is an independent area of infrastructure that is physically isolated and distinct. A-Zones improve the SLA and the redundancy factor, and they must be highly reliable, because controlling the failure boundaries of the network infrastructure is the main reason for their existence. Interruptions may have internal causes, such as power outages, cooling problems and generator failures, so each of the zones should have its own safety teams (HSE and fire departments).

There are two main factors that distinguish an A-Zone from a Region: the distance between the two sites (primary/recovery) and the network bandwidth of the fiber connections between them. Basically, A-Zones are at metro distances (less than 50 km / 30 miles) and are usually connected to each other with dark fiber; latency must be low (single-digit) and network bandwidth high, so they can act as active-active or active-passive sites for each other. For anything beyond that distance range, it is highly recommended to put each A-Zone in a different Region, but related workloads must be spread across multiple A-Zones belonging to the same Region.

 

SDDC business continuity can be improved by employing many technologies and replication techniques, such as:

  1. VMware vSphere FT for VM-level replication.
  2. VMware vSphere HA to provide VM availability at the host and cluster level.
  3. VMware vSphere DRS to act as a VM distributor and prevent load/VM aggregation on one host of a cluster.
  4. VMware vSAN as the software-defined storage solution for better availability in environments without a physical storage system.
  5. VMware vSphere Replication as an integrated, appliance-based replication solution inside or outside the site or zone.
  6. Storage-vendor replication as a third-party replication solution, such as Dell EMC RecoverPoint, NetApp SnapMirror and HPE 3PAR.
  7. Software replication solutions such as Zerto Virtual Replication and Veeam Backup & Replication.
  8. VMware SRM as one of the best options for a site replication & recovery solution.

 

Link of post on my personal blog: Undercity of Virtualization: VMware SDDC Design Considerations - PART Three: SDDC Availability