
Now it's time to dig deeper into troubleshooting VMware Horizon View. In this post I want to continue talking about the LDAP structure and data of the VDI server. In the first post of this series, I talked about how to connect to View LDAP with the Windows MMC snap-in ADSI Edit (the connection settings are repeated after the list below). Now I will show you which VDI objects belong to which OUs in the directory service hierarchy of VMware View:

 

1. OU=Server Groups specifies the list of desktop pools in the Horizon environment.

2. OU=Servers contains all of the VMs (desktops) that have been deployed by each desktop pool.

3. OU=Data Disk lists all of the generated virtual disks that belong to each desktop.

4. OU=Groups contains all of the predefined admin groups and manually added roles in the Horizon administration console; their allowed permissions are stored in the pae-AdminRolePermissions attribute of each object.

5. OU=Applications covers all of the virtual apps added to the Horizon environment, for example by an application pool of an RDS farm. Each created app is listed here.
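As a reminder from the first post, you can browse all of these OUs by opening ADSI Edit directly on a connection server and creating a connection with the View LDAP defaults:

Connection Point (Distinguished Name): dc=vdi,dc=vmware,dc=int

Computer: localhost:389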

Now let's review the sub-OUs of OU=Properties:


1. If you configured the View event database, you can see the related object in a sub-OU of OU=Database as a pae-EventDatabase class. The database server type and instance name, the configured TCP port, the database name, and the event longevity are the main attributes of this object class.

2. OU=Server holds the Horizon View server objects with the pae-VDMProperties class. OU=Server under OU=LVM contains the VDI servers (with the same object class) that are related to Linked-Mode desktop pools.

3. OU=VirtualCenter lists the configured vCenter Servers (VC) and Composer servers (SVI) with the pae-VirtualCenter object class. You can also check the specified connection credentials and the URL of each server: https://VC:443/sdk and https://SVI:18443

4. OU=Global contains some important objects such as:

4-1 CN=Common, with some important attributes about VDI management, such as the pod name (the cluster name, generated from the computer name of the first/primary installed connection server), console session and connected desktop timeouts, the maximum session duration, syslog configuration, the pre-forced-logoff message for Horizon endpoint users, the IPsec mode, and so on.

4-2 CN=License, with the hashed form of the imported VMware Horizon View license key.

4-3 CN=Keys, containing RADIUS configuration, some session timeouts (RDP, VDM gateway, and security servers), security server pairing settings, and so on.

I have tried to cover some of the useful and critical OUs of the VMware Horizon View LDAP structure in this post. If you think I have missed another important object of View LDAP, I would appreciate hearing about it.

Link to my personal blog's post: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part II

We usually think it's easy to change a server's name! But even a small configuration change like editing the FQDN value may destroy all of your configuration and settings. It sounds like a disaster if you change the computer name/account of a server without any consideration or checklist about how to do it. Sometimes, though, the initial server name configuration is wrong, and only after service setup and startup do you understand what happened (you forgot to change the default name or to choose a suitable name based on your design worksheets). Now let's examine this matter for a VMware Horizon View Connection Server. First of all, you should answer some questions like:

1. What should we do if we need to change the computer account?

2. What will happen if we change the computer account?

3. What should we do after renaming the server?

As the first step, you have to review the product documentation, especially the troubleshooting documents. Then investigate the side effects on your virtual desktop infrastructure objects, such as desktop pools or provisioned and in-use virtual desktops. Naturally, none of them can connect to the server anymore, especially if you change the primary connection server or any of the replica servers. The best and safest way to reconfigure the server after renaming it is to uninstall the VMware Horizon 7 Connection Server component (and also HTML Access) and install it again, without any concern about losing VDI data and structure. This is because another important application remains on the provisioned connection server: the AD LDS instance VMwareVDMDS, which, as its name indicates, is the directory service for the VMware VDI suite and is a separate component from Horizon View Connection Server.

So let me explain the structure of Horizon View. Its foundation is the Lightweight Directory Access Protocol (so you cannot install this service on a domain controller). View LDAP is a data repository that holds all of the View configuration information, and it is created when you install the first View Connection Server. This repository is replicated to the other View replica servers. Like other LDAP services it has partitions, objects, and object attributes, and it can be edited with ADSI Edit; just remember, if you want to do that, to type the distinguished name like:

dc=vdi, dc=vmware, dc=int

And then you can find the connection server objects in these sub-OUs: 'Properties\Server' and 'Properties\LVM\Server'.
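If you prefer a scripted check, here is a minimal sketch using the ActiveDirectory PowerShell module (assuming the RSAT AD PowerShell feature is installed on the connection server; the search base is the sub-OU mentioned above):

Get-ADObject -Server localhost:389 -SearchBase "ou=Server,ou=Properties,dc=vdi,dc=vmware,dc=int" -Filter * -Properties *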

The second way to check and change VDI configurations on connection servers is through the Windows Registry Editor. You can see the Horizon View related path (HKLM\Software\VMware, Inc.\VMware VDM) in the second picture.
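You can also dump that key from an elevated command prompt and compare the values with what is stored in View LDAP:

reg query "HKLM\SOFTWARE\VMware, Inc.\VMware VDM"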

But leaving aside these two rough and dangerous methods, VMware recommends the vdmadmin CLI for troubleshooting Horizon View (note that regedit is not a suitable way). In the following path you can also find other useful CLIs such as vdmexport and vdmimport:

%ProgramFiles%\VMware\VMware View\Server\tools\bin\
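For example, vdmexport and vdmimport let you back up and restore the whole View LDAP repository, which is a sensible precaution before touching it with ADSI Edit or regedit (the file name here is just an example):

vdmexport -f Myexport.LDF
vdmimport -f Myexport.LDF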

Of course, the information gathered from each of these troubleshooting methods must be the same; for example, if you check the system GUID, all of them must return the same value:

Reinstalling the Connection Server component is the fastest and easiest way, but if the risk assessment and information security policies of your organization prevent you from doing that, which method will you choose to reconfigure your virtual desktop infrastructure servers? We will review more Horizon View troubleshooting in other parts of this series.

  Source of content on my personal blog: https://virtualundercity.blogspot.com/2019/02/vmware-vdi-horizon-view-troubleshooting.html

I recorded a video series to illustrate vSphere Distributed Switch design and configuration; I hope it can be useful for all of you.

Undercity of Virtualization: vSphere Distributed Switch Design & Configuration - Part I: Create & Basic Setup

Yes, exactly: another post about the NTP service and the important role of time synchronization between virtual infrastructure components. In another post I described a problem with ESXi 6.7 time settings and also talked about some useful CLIs for time configuration, both manual and automated. But in a lab scenario with many versions of ESXi hypervisors (because of the server hardware we cannot upgrade some of them to a higher ESXi version), we planned to configure an NTP server as the "time source" of the whole virtual environment (PSC/VC/ESXi hosts and so on).

But our first deployed NTP server was a Microsoft Windows Server 2012 machine, and there was a deceptive issue. Although the time configuration had been done correctly and time synchronization occurred successfully, when I was monitoring the NTP packets with tcpdump I suddenly saw the time shift to another timestamp.


In the first step of troubleshooting, I thought it might have happened because of the time zone of the vCenter Server (but that was set correctly), or because the NTP client and NTP server versions were not the same. (To check the NTP version on ESXi, use the NTP query utility: ntpq --version. You can also edit the ntp.conf file to pin an exact NTP version: run vi /etc/ntp.conf and add "version #" to the end of the server line, as shown below.) But NTP is a backward-compatible service, so I thought this was not the reason for the issue.
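For example, with a hypothetical NTP server at 192.168.1.10, the edited server line in /etc/ntp.conf would look like this:

server 192.168.1.10 version 3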

So after more investigation into the cause of the problem, we decided to change our NTP server, in this case to a MikroTik router appliance. After the initial setup and NTP configuration on the MikroTik OVF, we changed our time source. Then we set the time manually again with "esxcli hardware clock" and "esxcli system time" and configured host time synchronization with NTP. The initial manual setting must be done because the time delta between the host and the NTP server must be less than one minute.
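As a minimal sketch of that manual step (the date and time values are only an example), it looks like this:

# esxcli system time set -y 2019 -M 2 -d 20 -H 10 -m 30 -s 0
# esxcli hardware clock set -y 2019 -M 2 -d 20 -H 10 -m 30 -s 0
# esxcli system time get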


Then, after restarting the NTP service on the host (/etc/init.d/ntpd restart), I checked again to make sure the problem had been resolved.
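A quick way to verify the synchronization from the ESXi shell is the NTP peers query; the offset column should stay small and stable:

# ntpq -p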


Link to the post on my personal blog: Undercity of Virtualization: Time differentiate between ESXi host & NTP Server

In one of my projects I had a bad problem with a vSphere environment. The issue occurred in the following situation:

In the first episode, the VCSA server encountered a low disk space problem and suddenly crashed. After increasing the size of the VMDK files and fixing that first problem, I saw that one of the ESXi hosts belonging to the cluster was unreachable (disconnected, and vCenter could not connect to it), although both were reachable from my client system. Over SSH I verified that the ESXi host was accessible, but the vCenter Server could not connect to this one host.

All network parameters, storage zoning settings, time settings, and service configurations were the same on every host. Sadly, syslog was not configured, and we did not have access to the scratch logs from the period when the issue occurred (I don't know why). Trying to restart all management agents of the host hung: the services.sh restart process got stuck and nothing really happened, and restarting vpxa and hostd individually did not fix the issue either.
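For reference, these are the restart commands we were running from the ESXi shell:

# services.sh restart
# /etc/init.d/hostd restart
# /etc/init.d/vpxa restart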

There was only one error in the Summary tab of the disconnected host, saying that vSphere HA was not configured and asking us to remove the host and add it to vCenter again. But I couldn't reconnect it. My only guess is that it was related to the startup sequence of the ESXi hosts and storage systems, because the tech support unit had restarted some of them after running into the problem, so HA automatically tried to migrate the VMs of the offline host to other online hosts; this is the moment I want to call a "complex disaster". Stuck, we decided to disable HA and DRS in the cluster settings, but nothing changed; the problem persisted. After fixing the VCSA problem I knew that restarting the host would probably solve the second problem, but because of a VM operation we couldn't do it. Migration did not work and we were confused.

Then I tried to shut down some non-essential VMs belonging to the disconnected host. After releasing some CPU/RAM resources, this time the management agent restart completed successfully (the services.sh restart operation).

After that, connecting the VCSA to the problematic ESXi host became possible and the problem was gone for good!

Afterwards, I wrote a procedure for that company's IT department as a virtualization checklist:

1. Pay attention to the logs of your virtual infrastructure assets. Don't forget to keep them locally in a safe repository and also on a syslog server.

2. Always monitor the used and free CPU/memory resources of the cluster. Never exceed their thresholds, because one host failure may cause consecutive failures.

3. Check the status of virtual infrastructure management services, including vCenter Server and NSX Manager, and also their disk usage. Execute "df -h" in the CLI or check the status of their VMDKs in the GUI. (I explained how to do this in this post.)

4. In critical situations, and even in maintenance operations, always shut down your ESXi hosts first and then the storage systems; to bring the system back up, start the storage first, then the hosts.

5. Finally, please DO NOT disconnect the vNIC of the VCSA from its associated port group if it is part of a distributed vSwitch. They did that, and it made me suffer a lot to reconnect the VCSA. Even if you restore a new backup of the VCSA, don't remove the network connectivity of the failed VCSA until the problem is solved.

Link to my personal blog: Undercity of Virtualization: An Example of Importance of Management and Controlling Virtual Infrastructure Resources

 

In the third part of the SDDC design series (based on the VMware Validated Design Reference Architecture Guide) we will review one of the major steps in SDDC design: physical design and availability. Before any steps in data center physical design, we should consider an important classification based on availability: regions and zones. Multiple availability zones form a region, but what is an availability zone (A-Zone)?

Unfortunately, many disastrous events such as earthquakes, massive floods, and large power fluctuations may interrupt IT communications, causing service failures or unavailability of network components. So you need to segregate the total data center infrastructure into regions and zones. A zone of the SDDC, referred to here as an A-Zone, is an independent, physically distinct and isolated area of infrastructure. A-Zones improve the SLA and redundancy factor and must be highly reliable, because containing network infrastructure failure boundaries is the main reason for their existence. Interruptions may have internal causes, such as power outages, cooling problems, and generator failures, so each zone should have its own safety teams (HSE and fire departments).

There are two main factors that distinguish an A-Zone from a region: the distance between the two sites (primary/recovery) and the network bandwidth of the fiber connections between them. Basically, A-Zones are at metro distances (less than 50 km / 30 miles), usually connected to each other with dark fiber, and there must be single-digit latency and high network bandwidth between them. So they can act as active-active or active-passive sites for each other. Beyond that distance range, it is highly recommended to place the A-Zones in different regions, while related workloads should be spread across multiple A-Zones belonging to the same region.

 

SDDC business continuity can be improved by employing several technologies and replication techniques, such as:

  1. VMware vSphere FT for VM-level replication.
  2. VMware vSphere HA to provide VM availability at the host and cluster level.
  3. VMware vSphere DRS to act as a VM distributor, preventing load/VM aggregation on the hosts of a cluster.
  4. VMware vSAN as the software-defined storage solution for better availability in environments without a physical storage system.
  5. VMware vSphere Replication as an integrated appliance-based replication solution for inside or outside the site or zone.
  6. Storage-vendor replication solutions as third-party options, such as Dell EMC RecoverPoint, NetApp SnapMirror, and HPE 3PAR.
  7. Software replication solutions such as Zerto Virtual Replication and Veeam Backup & Replication.
  8. VMware SRM as one of the best options for site replication and recovery.

 

Link to the post on my personal blog: Undercity of Virtualization: VMware SDDC Design Considerations - PART Three: SDDC Availability

 

 

Generally, a core dump is generated whenever the OS kernel sends certain signals to a process, especially when the process tries to access memory outside its address space. Often the system crashes in this situation, and the generated errors give us information about hardware faults or application bugs.

Sometimes you may encounter an ESXi host that has crashed; it will try to write diagnostic information to a file named the "VMkernel core dump". This file contains information about the halt experience of the host, known as the purple screen state, and it is highly important because in this situation you don't have access to your system data and logs. So it is necessary to gather coredump files from all ESXi hosts into one or more repositories and analyze them.

There are two mechanisms for collecting coredump files: DiskDump, which saves to a specified permitted disk, and NetDump, which sends the coredump information over the network. If ESXi can't save coredump information to its disk, there may be an issue with the storage devices or their connection to the host (a failed array controller, a RAID problem, a broken physical path to storage, an FC/SCSI connectivity problem, a SAN switch failure, and so on). So you should configure at least one alternative target for saving coredump information.

But before that, let's see what NetDump is exactly.

NetDump is a protocol for sending coredump information from a failed ESXi host to the Dump Collector service, and it has these characteristics:

1. Listens on UDP port 6500.

2. Supports only IPv4.

3. Network traffic is clear text.

4. No authentication/authorization.

To retrieve the current configuration of the coredump save location:

# esxcli system coredump partition get

# esxcli system coredump network get   (there is also a check option; see below)
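The check option verifies that the configured Dump Collector is actually reachable from the host:

# esxcli system coredump network check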

 

If the service is not enabled:

# esxcli system coredump network set --enable true

# esxcli system coredump partition set --enable true --smart

 

To set a new coredump configuration:

# esxcli system coredump partition set --partition="mpx.vmhba2:C0:T0:L0"

# esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.10.10.10 --server-port 6500

 

To find out which storage devices we have on the host:

# esxcli storage core path list

 

For older versions of VMware ESXi:

# esxcfg-dumppart --list

# esxcfg-dumppart --get-active

# esxcfg-dumppart --smart-activate

 

The Network Dump Collector is a built-in service within vCenter Server that provides a way of gathering host coredump information. But remember that NetDump does not work if aggregation protocols such as LACP or EtherChannel have been configured for the VMkernel traffic. VMware recommends segregating the VMkernel networking used for NetDump by VLAN or physical LAN separation to prevent traffic interception. (In ESXi 5.0, VLAN tagging configured at the vSwitch level is ignored during network core dump transmission.)
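For example, on a standard vSwitch you could put the VMkernel port group used for NetDump on a dedicated VLAN (the port group name and VLAN ID here are hypothetical):

# esxcli network vswitch standard portgroup set --portgroup-name "NetDump-vmk" --vlan-id 100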

Also, the name structure and format of a received coredump file looks like this: yyyy-mm-dd-hh_mm-N.zdump.

The maximum default size of a zdump file is 2 GB, and older dump files are deleted automatically. (The Dump Collector service has a non-configurable 60-second timeout; if no information is received within this period, the partial file is deleted.)

Source of the content on my personal blog: Undercity of Virtualization: What is VMKernel Core Dump - Part I