These are my video links covering the steps to migrate from the vSphere Standard Switch (VSS) to the vSphere Distributed Switch (VDS). I hope you find them helpful:
Storage infrastructure is one of the main parts of any IT environment, so a good design and principled configuration make troubleshooting every possible issue in this area easier. One of the primary components of storage infrastructure is the HBA, which connects servers to the storage area network. Many storage-related problems can therefore be traced back to the Host Bus Adapter installed in the ESXi host and its physical connections to the SAN storage or SAN switches. So let's walk through storage troubleshooting inside a VMware infrastructure step by step.
The first situation may occur with a local disk array that is not detected as a local datastore. You can check the status of the internal disk controller (for example in an HP ProLiant server) by running the following command:
The result will be shown like this:
(please note where I use the lowercase hba and where I use the capitalized form)
But if the considered datastore is not local and is instead a shared volume on an existing SAN storage in our infrastructure, then we must check the HBA status:
/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | less
The command above applies to ESXi 5.5 and later, so for older versions you must check the following folder for each of the two most popular HBA vendors:
Also, if you don't find the related vmhba adapter in the result of the following command, it means the ESXi host has not detected your HBA yet:
You can also run the swfw.sh command and combine it with grep to find information about the HBA devices connected to the ESXi host, including device model, driver, firmware, and the WWNN for FC HBAs (the InstanceID value):
/usr/lib/vmware/vm-support/bin/swfw.sh | grep HBA
In another situation, imagine you have deployed a new SAN storage inside a vSphere cluster, but you are not sure whether the HBA has detected the provided LUN. As the first step, run the following ESXCLI command:
esxcli storage core device list
In the output, check the important fields such as Display Name, Device Type, Devfs Path, Vendor and Model.
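Checking those fields across many devices by eye gets tedious, so here is a minimal sketch of parsing the indented "Key: Value" blocks that this command prints. The sample text and device name below are illustrative placeholders, not real output, and the exact layout can vary between ESXi versions.

```python
def parse_device_list(text):
    """Parse blocks of a device-id line followed by indented 'Key: Value' lines."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):          # unindented line -> a new device id
            current = line.strip()
            devices[current] = {}
        elif current and ":" in line:         # indented line -> a field of that device
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices

# Illustrative sample only (placeholder NAA id and vendor strings):
sample = """naa.0000000000000000000000000000
   Display Name: Example SAS Disk (naa.0000000000000000000000000000)
   Device Type: Direct-Access
   Devfs Path: /vmfs/devices/disks/naa.0000000000000000000000000000
   Vendor: EXAMPLE
   Model: EXAMPLE-DISK
"""

devices = parse_device_list(sample)
for dev_id, fields in devices.items():
    print(dev_id, "->", fields.get("Device Type"), "/", fields.get("Vendor"))
```

In practice you would feed it the captured output of `esxcli storage core device list` instead of the sample string.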
Next you can run the following command, which gives you back more information about the HBA adapters and the state of each of them:
esxcli storage core adapter list
VMware Definition Tip1: NAA (Network Addressing Authority) or EUI (Extended Unique Identifier) is the preferred method of identifying LUNs and the number that follows is generated by the storage device itself. Since the NAA or EUI is unique to the LUN, if the LUN is presented the same way across all ESXi hosts, the NAA or EUI identifier remains the same.
This command will also show you the list of partitions detected by the ESXi host:
esxcli storage core device partition list
VMware Definition Tip2: You may see two partition types, fb and fc: fb is the system ID for VMFS and fc is the VMkernel core dump partition (vmkcore).
There are more useful storage commands, like the old-school esxcfg-scsidevs CLI (-a shows HBA devices, -m shows mapped VMFS volumes and -l lists all known logical devices).
So, to conclude this first part of troubleshooting storage-side problems in a vSphere environment: we need to check the status of the HBAs, how they are performing, and the disk devices, LUNs and volumes connected through each of them. I hope it is helpful for you all.
Link to my personal blog: vSphere Storage Troubleshooting - Part 1: HBA & Connectivity
Many people are confused about how a host really assigns CPU resources to virtual machines; more precisely, how the processing operations of a VM are executed on the physical CPU resources. In Intel terminology, the physical processor is a CPU socket, but in this post I consider a pCPU to be a physical core in the existing sockets of the server.
By default, each vCPU added to a VM is assigned to one of the existing pCPUs. So if we configure 8 vCPUs for a VM, there must be at least 8 pCPUs in the host; in other words, if there are not enough pCPUs for the VM, it cannot be started.
By design, VMware ESXi can handle CPU oversubscription (requests for more vCPUs than there are existing processors/pCPUs). This means the pCPU-to-vCPU ratio is no longer one to one (1:1). In a vSphere environment, the ESXi host handles the processing operations to execute the requests of every VM, so the host needs to schedule processing time for each of them. But the question here is: what ratio should be configured as the best setting? The answer depends on whether you prioritize capacity or performance; it really can vary widely based on the requirements of the virtualized applications.
Each VM needs pCPU resources, so deploying many VMs, especially heavily used and resource-consuming virtual machines, demands more CPU cycles. If you provision more VMs and also increase the pCPU-to-vCPU ratio (1:2, 1:4 or greater), the performance of the ESXi host will be affected.
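The ratio arithmetic above can be sketched in a few lines. The host size and vCPU counts below are made up for illustration; real sizing depends entirely on the workload.

```python
def vcpu_ratio(total_vcpus, physical_cores):
    """Return the vCPU:pCPU ratio, e.g. 2.0 means 1:2 oversubscription."""
    return total_vcpus / physical_cores

def is_oversubscribed(total_vcpus, physical_cores):
    """True once the cluster asks for more vCPUs than there are cores."""
    return vcpu_ratio(total_vcpus, physical_cores) > 1.0

# Example: a host with 16 physical cores running VMs that total 32 vCPUs
ratio = vcpu_ratio(32, 16)
print(f"ratio 1:{ratio:g}, oversubscribed: {is_oversubscribed(32, 16)}")
```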
As VMware mentions, the vSphere ESXi scheduling mechanism prefers to reuse the same vCPU-to-pCPU mapping to boost performance through CPU caching on the socket. If there is no specific documentation on the CPU design of the application, you can start it with a single vCPU and then scale up based on its requirements, so oversubscription will not have a serious negative impact.
We must also consider CPU Ready Time, which is as important a metric as CPU utilization. Generally, the vCPU-to-pCPU ratio depends on many factors, like the following:
Should we use Hyperthreading or not ?!
Hyper-Threading is a great technology that makes a single pCPU act as two logical processors. When the ESXi host's load is low, each physical core can handle two independent threads at the same time. So if you have 16 physical cores in the ESXi host, after enabling HT (in both the BIOS config and the ESXi advanced settings) you will see the host has 32 logical processors. But using HT does not always increase performance; it is highly dependent on the application architecture, so in some cases you may encounter performance degradation with HT enabled. Before enabling HT on your ESXi hosts, review the critical virtualized applications deployed in their VMs.
Source of original post in my personal blog: Undercity of Virtualization: Virtualization Tip1: Relation between physical CPU & virtual CPU
In this post and the other parts of this series, I will review some great hints for a good datacenter virtualization design. But before anything, I want to ask you some major questions:
In this post and the other parts of this series, I want to dive deep into the details of a good design for a virtual infrastructure based on VMware products.
In the first part, I investigate the basic requirements and prerequisites for migrating IT infrastructures to virtualization. In the other parts, I will review VMware's primary services and their impact on achieving this goal.
1. Physical to Virtual
The first step is estimating the real physical resource needs of the services being provided. Processor clock rate (GHz), memory and disk usage (GB) and network transmission rate (Gbps) must be calculated separately for each existing service; only then can we talk about the resources required for server virtualization. We should also consider the hypervisor (ESXi host) overhead and add this measure to the total estimate.
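The per-service estimation plus hypervisor overhead can be sketched as below. The service names, their figures and the 10% overhead factor are assumptions for illustration, not measured values.

```python
services = [
    # (name, cpu_ghz, memory_gb, disk_gb, network_gbps) -- illustrative numbers
    ("web",      4.0,  8,  100, 1.0),
    ("database", 8.0, 32,  500, 2.0),
    ("mail",     6.0, 16, 1000, 1.0),
]

HYPERVISOR_OVERHEAD = 0.10  # assumed ESXi overhead margin

def estimate_totals(services, overhead=HYPERVISOR_OVERHEAD):
    """Sum each service's measured needs, then add the hypervisor margin."""
    totals = {"cpu_ghz": 0.0, "memory_gb": 0.0, "disk_gb": 0.0, "network_gbps": 0.0}
    for _, cpu, mem, disk, net in services:
        totals["cpu_ghz"] += cpu
        totals["memory_gb"] += mem
        totals["disk_gb"] += disk
        totals["network_gbps"] += net
    return {k: v * (1 + overhead) for k, v in totals.items()}

totals = estimate_totals(services)
print(totals)
```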
P2V migration always impacts service availability and usually requires operational downtime for the migrated service/OS. There are also some complexities in this matter, including:
So below I provide a questionnaire about the P2V operation; you should answer each item carefully before executing the real migration:
Finally, you should prepare a checklist like the following to specify the servers' priority order:
2. Storage resources… How to provision?
If the physical server is attached to a storage device/LUN/volume, two difficulties may exist:
On the other side, for services with highly critical transaction log files, like an Exchange server, migrating mailbox databases requires considering how suddenly the log space can grow. Finally, in every kind of P2V migration, we need to pay close attention to temporary and permanent storage space.
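A back-of-the-envelope sketch of that log-growth point: during a long migration window, the log volume keeps growing, so the reserved space must cover both today's logs and the growth. All figures and the 1.5x safety factor are illustrative assumptions.

```python
def required_log_space(current_log_gb, growth_gb_per_hour, migration_hours,
                       safety_factor=1.5):
    """Space to reserve for transaction logs during migration, with a margin."""
    return (current_log_gb + growth_gb_per_hour * migration_hours) * safety_factor

# e.g. 50 GB of logs today, growing 10 GB/hour, over an 8-hour migration window:
print(f"{required_log_space(50, 10, 8):.0f} GB")
```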
3. Security considerations on par with the physical, traditional deployment
When choosing the virtualization platform, the selected solution must provide every security technology deployed in the physical network. It is recommended that every physical switch security feature, like MAC learning, Private VLANs and so on, be supported by the virtual switches. The Distributed vSwitch technology used in the VMware vSphere platform is an ideal virtual networking solution, supporting many advanced security concepts like port mirroring and NetFlow. Besides VMware Distributed Switches (VDS), products from many vendors like Cisco, HP and IBM are supported by the vSphere networking platform; for example, the Cisco Nexus 1000v is designed as an integrated distributed vSwitch for the VMware platform. Of course, VDS design and migration from the vSphere Standard Switch (VSS) to the VDS come with their own implementation considerations (which I reviewed in this video playlist on my YouTube channel).
4. Provide suitable physical resources for virtual infrastructure
One of the important characteristics of server virtualization compared to traditional server provisioning is the increased rate of service availability, and this requires building VMware clusters. As a result, you must comply with the deployment prerequisites, such as using the same CPU generation and technologies across the ESXi members of the cluster.
It is also recommended to use a larger number of similar physical servers instead of fewer servers with more physical resources. For this reason, blade servers are a better choice for hypervisor hardware than other server types, such as tower servers.
5. Do not forget the cleanup operation
After the migration has completed successfully, you should start the post-migration operations, including checking the virtual hardware devices detected in the VM and removing everything that is no longer required on the newly converted VM. For example, in a Windows guest OS you can run set devmgr_show_nonpresent_devices=1, then run devmgmt.msc, go to View > Show hidden devices, and finally remove the unnecessary or hidden items.
In the next part, I will talk about the power supplies used for the computing and storage racks and how to size them.
Source of original post in my personal blog: Undercity of Virtualization: Best practice for a good Virtualized Datacenter Design - Part 1
Every part of the virtual infrastructure environment needs a channel for communication, and a safe, secure channel always requires a certificate. ESXi hosts, vCenter Server, NSX Manager, Horizon Connection Server and so on each have at least a machine certificate or a web-access management portal with a self-signed SSL certificate. Since the introduction of vSphere 6.0, the Platform Services Controller (PSC) handles the vSphere-generated certificates through the VMware Certificate Authority (VMCA). But in this post I want to introduce some CLIs to manage VMware certificates:
There are many types of stores inside the VECS:
Now let me ask: what are the roles of the solution users? There are five solution users:
The default paths of the certificate management utilities are shown below:
And for the Windows type of vCenter Server you can go to:
Surely I will talk about what vmafd itself is, and about another useful CLI in this path, vdcpromo, in another post. I will also provide a video about how to work with certificate-manager.
As a last note, always remember that deleting Trusted Roots is not permitted; doing so can cause some subtle problems in your VMware certificate infrastructure.
Link of content inside my personal blog: Undercity of Virtualization: Manage VCSA Certificates - Chapter I
In the third part of the VDI troubleshooting series, unlike the last two parts, I want to talk about client-side connection problems. For instance, if there is a dedicated IP subnet for zero client devices, an incorrect setup or misconfiguration of the routing settings can be the reason for the connection problem between VDI clients and servers. In the same way, wrong VLAN configs (ID, subnet, inter-VLAN routing) can be the main cause of the trouble. So I have provided a checklist of "What to do if you have a problem with your Horizon connection servers?"
1. Check the correctness of the zero/thin clients' communication infrastructure (routing, switching, etc.) to the VDI servers (Connection Server, Security Server).
2. Check the network connection between the Connection Server subnet and the deployed virtual machines of the desktop pool, if they are separated. Logically there is no need to connect their dedicated hosts/clusters to each other, so you can have separate ESXi clusters, one for desktop pools and another for the VDI servers.
3. Verify that the vCenter Server is accessible from the Connection Server, and also check its related credential.
4. If you have a Composer Server, check its services. Many times I have seen the Composer Server service fail to start after a server reboot, even though it is set to automatic and no warning/error event is reported. You also need to check the ODBC connection between the Composer Server and its database.
5. Check the state of the View Agent installed inside the desktop pool's VMs. If you need to provide client redirection to the desktop (without the presence of the Connection Server), View Agent Direct-Connection is needed too.
6. A TCP connection on port 4001 (non-SSL) / 4002 (SSL) between the desktop's View Agent and the Connection Server must be established. It is required for the connection, and you can check it by running netstat -ano | findstr "4001".
7. Review the user entitlements of the provided desktop pools; maybe there is a mistake, especially when you add AD groups instead of AD users. (Also check them: are they still available, or assigned to other users?)
8. The type of virtual desktop provisioning is also important. Except for Full Clone, in the Linked Clone and Instant Clone models you need to check the status of the virtual desktops under Inventory\Resources\Machines on the View Admin web page.
9. If there is an interruption in connected sessions, you need to review their states under Inventory\Monitoring on the View Admin web page.
10. As a last note: DO NOT FORGET TO CONFIGURE THE EVENT DATABASE! I have encountered too many Horizon View deployments without any event database configured, so in troubleshooting situations we had NOTHING to tell us what really happened.
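The port check in item 6 can also be scripted from the other direction, by probing whether the agent ports accept TCP connections at all. This is a generic TCP probe sketch, not a Horizon-specific API; the hostname in the commented example is a placeholder.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. probing a desktop's View Agent ports from the Connection Server side:
# for port in (4001, 4002):
#     print(port, "open" if tcp_port_open("desktop01.example.local", port) else "closed")
```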
I hope this is helpful for you all...
Link to the original post on my personal blog: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part III
Memory resource provisioning is one of the biggest challenges for IT administrators and virtualization designers. Although CPU provisioning is a major factor in the virtual infrastructure, most of the time it receives less attention, because many applications deployed on virtualization need more memory, not more processing power. Today's CPU technologies are very powerful, but applications like SQL Server and Oracle need more memory for some of their processes. A lack of sufficient memory resources may restrict the growth of many virtual infrastructures, so what should we do in this situation before buying more physical resources for our hypervisors?
There are several technologies for handling memory over-commitment, such as swapping and ballooning, and VMware ESXi uses some of them to confront these issues. In this post I will review the operation and importance of both of the mentioned mechanisms:
1. Virtual memory ballooning is a memory reclamation technique that lets the VMkernel retrieve idle memory pages. When the ESXi host has less than 6% free memory, ballooning comes into the ring to handle the out-of-memory problem! If a VM has many idle pages, its host borrows them to use as temporary overhead for the VMs with higher memory demand, because those probably have some memory-intensive processes.
When a virtual machine wants to release some of its previously used pages allocated from host physical memory, the host does not remove them exactly; it just moves the address-space pointer from the allocated memory list to the free memory list. A VM with the balloon driver (vmmemctl.sys) can decide which pages (idle pages) can be reclaimed back by the host (up to 65% of the VM's guest memory) and which ones it still needs (pages in active use), without involving the host in this decision; then the inflating step happens. If you disable the memory ballooning driver inside the guest OS, the VM will not be aware of the host memory state and the amount of available or unused physical memory, and the hypervisor cannot determine how much memory it can take back to serve other VMs' memory requests.
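The trigger described above can be illustrated with a toy model of the free-memory threshold. This is just the 6%-free rule from the text expressed as code, not how the VMkernel actually implements its memory states; the host sizes are made up.

```python
BALLOON_THRESHOLD = 0.06  # ~6% free, per the description above (an approximation)

def free_ratio(total_mb, used_mb):
    """Fraction of host physical memory still free."""
    return (total_mb - used_mb) / total_mb

def ballooning_expected(total_mb, used_mb, threshold=BALLOON_THRESHOLD):
    """True once free memory drops below the ballooning threshold."""
    return free_ratio(total_mb, used_mb) < threshold

# Example: a 256 GB host with 246 GB in use -> about 3.9% free, under the 6% line
print(ballooning_expected(262144, 251904))
```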
There is one crucial point: if memory compression can happen, there is no need to execute the hypervisor swap procedure. (I will discuss it in more detail in another post.)
Remember, the ESXi host's memory will stay in the shared state until the total memory required by the VMs (memory requests) becomes greater than the available physical memory (over-commitment). So if we want to prevent this situation before ballooning activates, we need to do the following:
2. Use VMware DRS at the cluster level to distribute VMs between hosts and balance resource usage; then there is no need to worry about low memory. It is recommended to provision equal memory for all ESXi hosts.
3. Use resource pools, especially at the cluster level, and reserve the estimated memory for mission-critical applications and high-resource-demand VMs to ensure they have access to enough resources. Reserved memory is also protected against memory reclamation, even when contention happens.
4. Do not disable the balloon driver, because it is the most intelligent memory reclamation technique. But if you really want to do that, you can do it via the VM's settings (remove the sched.mem.maxmemctl entry from the .vmx file) or the guest OS Windows registry (set the value of HKLM\SYSTEM\CurrentControlSet\services\VMMEMCTL from 2 to 4).
Link of post on my personal blog: Undercity of Virtualization: Memory Virtualization: Management & Reclaiming - Section I
The VMware vCenter Server Appliance Management Interface (VAMI) is a very useful web console for VCSA management and maintenance operations, available at https://VCSA_URL:5480. One of its tools is Backup, with which you can take a backup of specific vCenter data consisting of Inventory & Configuration (required) and Stats, Events & Tasks (optional).
VMware shipped VCSA 6.7 with these protocols for backup: FTPS, HTTPS, SCP, FTP and HTTP, and announced that NFS and SMB would be supported from VCSA 6.7 U2 onward. We had two big problems with this useful tool; one of them is related to the vSAN Health Service.
Whenever the backup task started, it stopped immediately and generated a warning about the vSAN Health Service, because the service seemed to have crashed (the VCSA management GUI tells us exactly this). Sadly, if you try to start this service (even with the --force option) it leads to another failed attempt, and the result is something like this:
So after many retries at starting this service, I decided to check the file structure of this service in the path /etc/vmware-vsan-health and compare it with a fresh-installed vCenter Server.
There are two files that could be related to the cause of this issue. The first is logger.conf, which was absolutely empty in the troubled vCenter Server (opening it with vi shows nothing), whereas in the healthy VCSA you can see something like the results below:
When I checked vsanhealth.properties, it showed that this service communicates over HTTPS, so its connections need an SSL structure. Then I found the second file, fewsecrets, which contains what looks like two hash strings. So I decided to take the risk and remove this file along with logger.conf (after taking a backup of them, of course). At last, after some minutes, the next attempt to start the service was successful.
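The comparison against a fresh install can be done with a plain diff of the two files. Here is a sketch using Python's difflib; the file contents below are placeholders standing in for the real config files, since the post does not show them.

```python
import difflib

def config_diff(healthy_text, broken_text):
    """Return unified-diff lines between two config file contents."""
    return list(difflib.unified_diff(
        healthy_text.splitlines(),
        broken_text.splitlines(),
        fromfile="healthy/logger.conf",
        tofile="broken/logger.conf",
        lineterm="",
    ))

# Placeholder contents: a healthy file vs. the troubled server's empty file
healthy = "level=INFO\nhandler=file\n"
broken = ""

for line in config_diff(healthy, broken):
    print(line)
```

In practice you would read both files from disk (or via scp from the healthy appliance) instead of the placeholder strings.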
Remember that you always need to check DNS (forward and reverse), NTP, certificates and the firewall, especially if you set up the vSphere environment with an external Platform Services Controller. I will explain the second problem in another post.
Link to my personal blog: Undercity of Virtualization: VCSA Backup Failed because of VSAN
Now it's time to investigate the troubleshooting of VMware Horizon View more deeply. In this post, I want to continue talking about the LDAP structure and data of the VDI server. If you look at the first post of this series, I talked about how to connect to the View LDAP with the Windows MMC snap-in ADSI Edit. Now I will show you which VDI objects belong to which OUs in the Directory Service hierarchy of VMware View:
2. OU=Servers contains all the VMs (desktops) that have been deployed by every desktop pool.
3. OU=Data Disk lists all of the generated virtual disks belonging to each desktop.
4. OU=Groups contains all of the predefined admin groups and the manually added roles in the Horizon administration console, with their allowed permissions listed in the pae-AdminRolePermissions attribute of each defined object.
5. OU=Applications covers all virtual apps added to the Horizon environment, for example by an application pool of an RDS farm. Each of the created apps is listed here.
Now let's review the sub-OUs of OU=Properties:
1. If you have configured the View event database, you can see the related object in the sub-OU OU=Database as a pae-EventDatabase class. The database server type and instance name, the configured TCP port, the database name and the event longevity are the main attributes of this object class.
2. OU=Server holds the Horizon View servers as the pae-VDMProperties class. OU=Server,OU=LVM contains the VDI servers (with the same object class) that are related to Linked-Mode desktop pools.
3. OU=VirtualCenter lists the configured vCenter servers (VC) and Composer servers (SVI) with the object class pae-VirtualCenter. You can also check the specified connection credential and URL address of each server: https://VC:443/sdk and https://SVI:18443
4. OU=Global contains some important objects, such as:
4-1 CN=Common, with some important attributes for VDI management, like the pod name (or cluster name, generated from the computer name of the first/primary installed Connection Server), the console session and connected desktop timeouts, the maximum session duration, the syslog-related configuration, the pre-forced-logoff message for Horizon endpoint users, the IPsec mode, etc.
4-2 CN=License, with the hashed form of the imported license key for VMware Horizon View.
4-3 CN=Keys contains the RADIUS configs, some session timeouts (such as for RDP, the VDM gateway and security servers), the security server pairing settings, etc.
I have tried to cover some useful and critical OUs of the VMware Horizon View LDAP structure in this post; if you think I forgot to review another important object of the View LDAP, I would appreciate it if you told me about it.
Link to my personal blog's post: Undercity of Virtualization: VMware VDI (Horizon View) Troubleshooting - Part II
We usually think it's easy to change most servers' names! But just a small configuration change, like editing the FQDN value, may destroy all of your configuration and settings. It can turn into a disaster if you change the computer name/account of a server without any consideration or checklist for how to do it. But sometimes you may make a mistake in the initial server name configuration and only realize it after service setup and startup (you forgot to change the default name, or to choose a suitable name based on your design worksheets). Now let's examine this matter for a VMware Horizon View Connection Server. First of all, you should answer some questions like:
1. What should we do if we need to change the computer account?
2. What will happen if we change the computer account?
3. What should we do as post-execution steps after renaming the server?
As the first step, you have to review the service documentation, especially the troubleshooting documents. Then investigate the side effects on your virtual desktop infrastructure objects, like desktop pools and the provisioned, in-use virtual desktops. Naturally, none of them can connect to the server anymore, whether you change the primary Connection Server or any of the replica servers. As the best and safest way to configure the server after renaming it, you can uninstall the VMware Horizon 7 Connection Server component (and also HTML Access) and install it again, without any concern about losing VDI data and structure. That is because there is another important application on the provisioned Connection Server: the AD LDS instance VMwareVDMDS, which, as its name demonstrates, is the directory service for the VMware VDI suite and is a separate component from the Horizon View Connection Server.
So let me explain the structure of Horizon View. Its foundation is based on the Lightweight Directory Access Protocol (so you cannot install this service on a domain controller). The View LDAP is a data repository consisting of all of its configuration information, and it is created when you install the first View Connection Server. This repository is then replicated to the other View replica servers. Like other LDAP services it has partitions, objects and object attributes, and it can be edited with ADSI Edit; just remember, if you want to do so, to type the distinguished name like this:
dc=vdi, dc=vmware, dc=int
Then you can find the Connection Server objects in these sub-OUs: 'Properties\Server' and 'Properties\LVM\Server'.
The second way to check and change the VDI configuration on Connection Servers is through the Windows registry editor. You can see the related path for Horizon View (HKLM\Software\VMware, Inc.\VMware VDM) in the second picture:
But apart from these two rough and dangerous methods, VMware recommends the vdmadmin CLI for troubleshooting Horizon View (using regedit is not a suitable way). If you go to the following path, you can also see other useful CLIs like vdmexport and vdmimport:
%ProgramFiles%\VMware\VMware View\Server\tools\bin\
Of course, the information gathered from each of these troubleshooting methods must be the same; for example, if you check the system GUID, all of the methods must return the same value:
Reinstalling the Connection Server component is the fastest and easiest way, but if the risk assessments and information security policies of your organization prevent you from doing that, which method will you choose to reconfigure your virtual desktop infrastructure servers? We will review more about Horizon View troubleshooting in other parts of this series.
Source of content on my personal blog: https://virtualundercity.blogspot.com/2019/02/vmware-vdi-horizon-view-troubleshooting.html
I recorded a video series illustrating vSphere Distributed Switch design and configuration; I hope it can be useful for all of you.
Yes, exactly: another post about the NTP service and the important role of time synchronization between virtual infrastructure components. In another post I described a problem with the ESXi 6.7 time settings and also talked about some useful CLIs for time configuration, both manual and automated. But in a lab scenario with many versions of ESXi hypervisors (because of the server types, we cannot upgrade some of them to a higher version of ESXi), we planned to configure an NTP server as the "time source" of the whole virtual environment (PSC/VC/ESXi hosts and so on).
But our first deployed NTP server was a Microsoft Windows Server 2012, and there was a deceptive issue. Although the time configuration had been done correctly and time synchronization occurred successfully, when I monitored the NTP packets with tcpdump, I suddenly saw that the time had shifted to another timestamp.
In the first troubleshooting step, I thought it might have happened because of the time zone of the vCenter Server (but that was correct) or because the NTP client and NTP server versions were not the same. (To check the NTP version on ESXi, use the NTP query utility: ntpq --version. You can also edit the ntp.conf file to set the exact NTP version: run vi /etc/ntp.conf and add "version #" to the end of the server line.) But NTP is a backward-compatible service, so I thought this was not the reason for the matter.
So after more and more investigation into the cause of the problem, we decided to change our NTP server, in this case to a MikroTik router appliance. After the initial setup and NTP config on the MikroTik OVF, we changed our time source. Then, after setting the time manually again with "esxcli hardware clock" and "esxcli system time", we configured host time synchronization with NTP. The initial manual setting must be done because your time delta with the NTP server must be less than one minute.
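The one-minute-delta rule can be checked in a script before pointing the host at NTP. Below is a sketch comparing two timestamps; the times are illustrative and this does not query a real NTP server.

```python
from datetime import datetime, timedelta

MAX_DELTA = timedelta(minutes=1)  # the sync limit mentioned above

def needs_manual_set(host_time, ntp_time, max_delta=MAX_DELTA):
    """True if the host clock is too far from the NTP source to sync."""
    return abs(host_time - ntp_time) > max_delta

ntp_time = datetime(2019, 6, 1, 12, 0, 0)
print(needs_manual_set(datetime(2019, 6, 1, 12, 5, 0), ntp_time))   # 5 minutes off
print(needs_manual_set(datetime(2019, 6, 1, 12, 0, 30), ntp_time))  # 30 seconds off
```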
Then, after restarting the NTP service on the host (/etc/init.d/ntpd restart), I checked again to make sure the problem had been resolved.
link of post in my personal blog: Undercity of Virtualization: Time differentiate between ESXi host & NTP Server
In one of my projects I had a bad problem with the vSphere environment. The issue occurred in the following situation:
In the first episode, the VCSA server encountered a low disk space problem and suddenly crashed. After increasing the size of the VMDK files and fixing the first problem, I saw that one of the ESXi hosts belonging to the cluster was unreachable (disconnected, and vCenter could not connect to it, although both of them were reachable from my client system). Over SSH I verified that the ESXi host was accessible, but the vCenter Server could not connect to this one host.
All network parameters, storage zone settings, time settings and service configurations were the same on every host. Sadly, syslog was not configured, and we did not have access to the scratch logs from the period when the issue occurred (I don't know why). Trying to restart all the management agents of the host hung: the services.sh restart process got stuck and nothing really happened, and restarting vpxa and hostd didn't fix the issue either.
There was only one error in the summary tab of the disconnected host, describing that vSphere HA was not configured and asking to remove and re-add the host to vCenter. But I could not reconnect it. My only guess is that it was related to the startup sequence of the ESXi hosts and storage systems, because the tech support unit had restarted some of them after confronting the problem, so HA automatically tried to migrate the VMs of the offline host to other online hosts; this is the moment I want to call a "complex disaster". Stuck, I decided to disable HA and DRS in the cluster settings: nothing changed, the problem still existed. After fixing the VCSA problem, I knew that if we restarted that host, the second problem would probably be solved, but because of a VM operation we couldn't do it. Migration did not work and we were confused.
Then I tried to shut down some non-essential VMs belonging to the disconnected host. After releasing some CPU/RAM resources, this time the management agent restart completed successfully (the services.sh restart operation).
So connecting the VCSA to that problematic ESXi host became possible, and the problem was gone forever!
After that, I wrote a procedure for that company's IT department as a virtualization checklist:
1. Pay attention to the logs of your virtual infrastructure assets. Don't forget to keep them locally in a safe repository and also on a syslog server.
2. Always monitor the used and free CPU/memory resources of the cluster. Never exceed their thresholds, because a single host failure may then cause consecutive failures.
3. Monitor the status of the virtual infrastructure management services, including vCenter Server and NSX Manager, and also their disk usage: execute "df -h" in the CLI or check the status of their VMDKs in the GUI. (I explained how to do this in this post.)
4. In critical situations, and even during maintenance operations, always shut down your ESXi hosts first and then the storage systems; when bringing everything back up, start the storage first and then the hosts.
5. Finally, please DO NOT disconnect the vNIC of the VCSA from its associated port group if it is part of a distributed vSwitch. They did this, and it made me suffer a lot to reconnect the VCSA. Even if you restore a new backup of the VCSA, don't remove the network connectivity of the failed VCSA until the problem is solved.
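For item 3, the "df -h" check can easily be scripted. Below is a minimal sketch that flags any filesystem at or above a usage threshold; the function name and the 90% threshold are my own choices, and it reads `df -h`-style output on stdin so the logic is reusable:

```shell
# Print every mounted filesystem at or above a usage threshold (default 90%).
# Reads `df -h`-style output on stdin.
flag_full_filesystems() {
  threshold=${1:-90}
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the % sign from the Use% column
    if (use + 0 >= t) print $6, "at", $5
  }'
}

# Typical invocation on the appliance shell:
df -h | flag_full_filesystems 90
```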
In the third part of SDDC Design (based on the VMware Validated Design Reference Architecture Guide) we will review one of the major steps of SDDC design: physical design and availability. Before any step of data center physical design, we should consider an important classification based on the availability aspect: regions and zones. Multiple availability zones form a region, but what is an A-Zone?
Unfortunately, disastrous events like earthquakes, massive floods, and large power fluctuations may interrupt IT communication, appearing as service failures or unavailability of network components. So you need to segregate the data center's total infrastructure into regions and zones. A zone of the SDDC, from here on called an A-Zone, is an independent area of infrastructure that is isolated as a physically distinct unit. A-Zones improve the SLA and the redundancy factor, and they must be highly reliable, because containing the failure boundaries of the network infrastructure is the main reason for their existence. Interruptions may also have internal causes, such as power outages, cooling problems, and generator failures, so each zone should have its own safety teams (HSE and fire departments).
There are two main factors that distinguish an A-Zone from a region: the distance between the two sites (primary/recovery) and the network bandwidth of the fiber connections between them. Basically, A-Zones sit at metro distances (less than 50 km / 30 miles), are usually connected to each other with dark fiber, and must have single-digit-millisecond latency and high network bandwidth between them. So they can act as active-active or active-passive sites for each other. Beyond that distance range it is highly recommended to place the A-Zones in different regions, but related workloads should be spread across multiple A-Zones belonging to the same region.
SDDC business continuity can be improved by employing various technologies and replication techniques, such as:
Link of post on my personal blog: Undercity of Virtualization: VMware SDDC Design Considerations - PART Three: SDDC Availability
Generally, a coredump is generated whenever the OS kernel sends certain signals to a specific process, especially when the process requests access to memory outside its address space. Often the system crashes in this situation, and the generated errors give us information about hardware faults or application bugs.
Sometimes you may encounter an ESXi host that has crashed; it will try to write diagnostic information to a file named the "VMkernel core dump". This file contains information about the halt experience of the host, known as the purple screen state, and it is highly important, because in this situation you don't have access to your system's data and logs. So it's necessary to gather and analyze the coredump files from all ESXi hosts into one or more repositories.
There are two mechanisms for collecting coredump files: DiskDump, which saves to a specified permitted disk, and NetDump, which sends the coredump information over the network. If ESXi can't save coredump information on its disk, there may be an issue with the storage devices or their connection to the host (failed array controller, RAID problem, broken physical path to storage, FC/SCSI connectivity problem, SAN switch failure, and so on). So you should configure at least one alternative target for saving coredump information.
But before that, let's look at what netdump actually is.
Netdump is a protocol for sending coredump information from a failed ESXi host to the Dump Collector service, and it has these characteristics:
1. Listens on UDP port 6500.
2. Supports only IPv4.
3. Sends clear-text network traffic.
4. Provides no authentication or authorization.
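Because netdump is clear-text UDP with no authentication, it is easy to probe from another machine. As a quick smoke test (the collector IP below is just an example), you can push a datagram at the collector port with netcat; UDP gives no delivery confirmation, so confirm arrival in the Dump Collector logs on the vCenter side:

```shell
# Send one test UDP datagram to the Dump Collector (example IP).
# -u = UDP, -w1 = give up after 1 second of inactivity.
echo "netdump-probe" | nc -u -w1 10.10.10.10 6500
```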
To retrieve current configuration for coredump saving location:
# esxcli system coredump partition get
# esxcli system coredump network get (the check option can also be used)
If the service is not enabled:
# esxcli system coredump network set --enable true
# esxcli system coredump partition set --enable true --smart
To set new configuration for coredump:
# esxcli system coredump partition set --partition="mpx.vmhba2:C0:T0:L0"
# esxcli system coredump network set --interface-name vmk0 --server-ipv4 10.10.10.10 --server-port 6500
To find out which storage devices we have on the host:
# esxcli storage core path list
For older versions of VMware ESXi:
# esxcfg-dumppart --list
# esxcfg-dumppart --get-active
# esxcfg-dumppart --smart-activate
Network Dump Collector is a built-in service within vCenter Server that provides a way of gathering host coredump information. But remember that NetDump does not work if link aggregation protocols such as LACP or EtherChannel have been configured for the VMkernel traffic. VMware recommends segregating the VMkernel networking used for NetDump, by VLAN or physical LAN separation, to prevent traffic interception. (In ESXi 5.0, VLAN tags configured at the vSwitch level are ignored during network core dump transmission.)
Also, the name of a received coredump file has a format like this: yyyy-mm-dd-hh_mm-N.zdump .
The maximum default size of a zdump file is 2 GB, and older dump files are deleted automatically. (The Dump Collector service has a non-configurable 60-second timeout; if no information is received within this period, the partial file is deleted.)
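Because of that timestamped naming scheme, zdump filenames sort chronologically, so a simple retention script is possible for the collector repository. A minimal sketch, assuming a hypothetical repository path and retention count (both are my own placeholders):

```shell
# Keep only the newest $KEEP *.zdump files in the collector repository.
# DUMP_DIR and KEEP are assumed values; point them at your real setup.
DUMP_DIR=${DUMP_DIR:-/var/core/netdumps}
KEEP=${KEEP:-5}

# The yyyy-mm-dd-hh_mm prefix makes a reverse lexical sort list the
# newest files first; everything past the first $KEEP entries is removed.
ls "$DUMP_DIR"/*.zdump 2>/dev/null | sort -r | tail -n +$((KEEP + 1)) |
while read -r old; do
  rm -f -- "$old"
done
```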
Source of content inside my personal blog: Undercity of Virtualization: What is VMKernel Core Dump - Part I