kmcd03's Posts

Thank you for documenting this. I was leaning towards creating a new VLAN TZ and applying it to the nodes using the Transport Node Profile that VCF created. Is attaching, or detaching, the TNP to a Baseline WLD disruptive? I assume in this case it's only adding an additional TZ to the nodes, so it won't take VMs or nodes offline.
I would like to start using the NSX-T distributed firewall (dFW) in my VCF 4.5 domains, but won't be using logical routing at this time. VCF has prepared the nodes, e.g. created the transport zones, the uplink and transport node profiles, configured NSX on the nodes, etc.

If I want to start using dFW but don't care about overlay, do I simply create a segment with a VLAN in NSX Manager and associate it with the transport zone created by VCF, then bind the VMs' vNICs to the resulting VDS portgroup so dFW policies and rules are applied? Or should I create a new VLAN transport zone, associate it with the hosts in the domain, and then create the segment and bind the VM vNICs? Thanks!
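In case it's useful, here is roughly what the first option looks like through the NSX-T Policy API (a sketch only; the manager hostname, segment name, VLAN ID, and transport zone UUID are placeholders, not values VCF creates):

  curl -k -u admin -X PATCH \
    https://nsx-mgr.example.com/policy/api/v1/infra/segments/dfw-vlan-100 \
    -H 'Content-Type: application/json' \
    -d '{"display_name": "dfw-vlan-100", "vlan_ids": ["100"], "transport_zone_path": "/infra/sites/default/enforcement-points/default/transport-zones/<vlan-tz-uuid>"}'

The same thing can be done in NSX Manager under Networking > Segments, and the segment then appears as an NSX-managed distributed portgroup on the VDS.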
I upgraded our 16-node vSAN stretched cluster from 6.7 to 7.0U3. After updating the disk format I am seeing the warning for vSAN object format health. I did find Cormac Hogan's blog and believe this is the cause of the warning: https://cormachogan.com/2021/02/09/vsan-7-0u1-object-format-health-warning-after-disk-format-v13-upgrade/

The check is showing 130 objects and 215 TB that need reformatting. The problem is the hosts in our primary fault domain have <20% free space. We don't have a network overlay, like Geneve or OTV, so our stretched cluster is more active-passive. So the Catch-22 here is there might not be enough slack space to convert the objects >255 GB, which means we can't upgrade the objects to get the new format that lessens the slack-space requirement.

This cluster is scheduled to be decommissioned, with the VMs migrated to a new VCF cluster, in the next 90 days. Is there any harm in ignoring the warning? Or does anyone know if there are safeguards to prevent the change-object-format task from using all the free space? Will this task check for enough slack space before it runs? Is it smart enough to convert only a few objects at a time and queue up the other objects? Thanks.
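For anyone else sizing this up before kicking off the conversion, the detail behind the warning can be queried from any host in the cluster (a sketch; the exact test name string can vary by build, so list the tests first to confirm it):

  esxcli vsan health cluster list
  esxcli vsan health cluster get -t "Object format health"

Free capacity per site is easiest to watch in the cluster's Monitor > vSAN > Capacity view while the conversion runs.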
For the isolation address, I referenced Duncan Epping's blog (vSphere HA heartbeat datastores, the isolation address and vSAN | Yellow Bricks). We created a Switch Virtual Interface (SVI) on the physical switches, with an IP in the same subnet as vSAN and one for each site, and then configured the advanced option das.isolationaddress0.
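A quick sanity check from each data node is to ping the SVI over the vSAN vmknic itself, since with vSAN enabled, HA heartbeats and isolation pings use the vSAN network (a sketch; vmk1 and the address are placeholders for your vSAN interface and per-site SVI IP):

  vmkping -I vmk1 192.168.10.1

Worth confirming every host can reach the SVI for its own site before relying on it as das.isolationaddress0.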
If only the Witness host loses connectivity to both sites, VMs will stay online. Last week our 14+1 stretched cluster (between two data centers) lost connectivity to the third data center where the witness was located. (We have redundant network paths to all sites, but the outage was caused by a firewall misconfig.) There was no effect on the VMs at either data center (preferred and secondary fault domain). vSAN health checks alerted on multiple errors, like the connection to the Witness host and that a network partition had occurred, but there was no effect on guest VMs.

However, when connectivity was restored several hours later, the witness would not rejoin the cluster. We also have two 2-node clusters with witnesses at the third site that wouldn't reconnect to their Witnesses either. I confirmed I could ping between the vSAN hosts and the witness across the appropriate interface (vmk) for all clusters.

I opened a ticket with GSS and the only solution was to disable the stretch configuration, then create the stretch configuration again, putting the hosts in the correct fault domains and choosing the Witness host. Once the Witness was back online and participating in the cluster, we could see in the health check that objects were rebuilding on the Witness.
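For anyone troubleshooting something similar, the partition state can be checked directly from a data node and from the witness (a sketch; not the exact output we captured):

  esxcli vsan cluster get      # Sub-Cluster Member Count and Master/Backup UUIDs
  esxcli vsan network list     # which vmk is tagged for vSAN/witness traffic

A witness that is partitioned off reports itself as master of its own one-node sub-cluster, even when plain pings to the data nodes succeed.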
We're experiencing this same problem, but with FC640 blades using FD332 storage sleds with the FD332-PERC (dual ROC) controller. In the last two weeks vSAN has marked SSDs as permanently failed (PDL) on two different hosts. I had the same problem eight months ago (May): three different hosts hit the same problem, and I opened tickets with GSS and Dell support. The recommendation was to update the controller firmware from version 25.5.5.0004 to 25.5.5.0005, and the lsi_mr3 driver from version 7.703.18.00 to 7.703.20. Both version combinations of firmware and driver are on the VCG for vSAN.

I have new tickets open with both GSS and Dell Support, going on ten days now. I was asked to upgrade the firmware to 25.5.6.0009, which was released in Sept-2019. I think there continues to be a problem with the H730 family of controllers and vSAN 6.x, and that replacing the current controllers with HBA330 cards is the fix.

How much effort is needed to replace/change controllers with vSAN (configured for encryption and de-dupe)? Is it as simple as putting the host in maintenance mode with the Ensure Accessibility option and then replacing the controller? Will vSAN see the SSDs and disk groups as unchanged? The controller is in pass-through mode, so I'm wondering whether, as long as the drivers load, there's no change to the SSDs, disk IDs/signatures, and disk groups. (It's a 14+1 stretched cluster, so there is capacity for the applied storage policies.) Or do I have to evacuate all stored components from the host and delete the disk groups, then swap the controller and recreate the disk groups?
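If we do swap a controller, my plan for verifying that nothing changed from vSAN's point of view would be roughly this from the host after it boots (a sketch; output fields vary by build):

  esxcli vsan storage list         # same disks and disk group UUIDs still claimed?
  esxcli vsan cluster get          # host rejoined the cluster?
  esxcli vsan health cluster list  # then re-run the health checks

With pass-through the vSAN metadata lives on the disks themselves, so in theory the disk groups should be re-recognized as-is, but I'd like confirmation from anyone who has actually done the swap.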
Our vSAN 6.6.1 stretched cluster (12+1) had an error that it lost connectivity to the witness ESXi host. We're using the virtual appliance as the witness at a third site. I confirmed I could ping between the hosts and the witness using the vmk interface used by vSAN.

At the same time I noticed the CPU utilization on the witness VM was at 100%. I can see in vROps that the CPU utilization of the witness VM jumped from ~8% to ~65% for eight hours. It then jumped to 100%.

I was going to reboot the witness ESXi host, and ran a task to generate a support bundle beforehand. Generating the support bundle caused the VM's CPU to drop back to normal, and the host disconnect errors on the cluster are gone. All health checks are green. I'll open a ticket with GSS, but asking if anyone else has seen this? Thanks!
Follow-up if anyone else encounters a similar problem of the Affinity rule for Preferred/Secondary in a storage policy not following the Preferred fault domain designation in a stretched cluster. With the help of GSS I was able to identify a duplicate entry for the preferred fault domain in CMMDS. The object was removed and Affinity now aligns with the cluster setting for Preferred/Secondary.

In my case the duplicate entry referenced a non-existent witness host that was used for the initial, failed configuration of the stretched cluster. That witness was successful in communicating to the ESXi hosts at layer 3, but the ESXi hosts failed to communicate back to the witness because of an asymmetric route. The routing problem was corrected and the stretch was configured with a new witness. I suspect removing the witness used in the first attempt did not successfully remove its object from CMMDS, so the Affinity rule in the Storage Policy was detached from the cluster's designation for Preferred.
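For anyone hitting this, the stale entry should be visible from a host shell before calling GSS (a rough sketch; I'd still let GSS do the actual removal, since editing CMMDS by hand can damage the cluster):

  cmmds-tool find -f json | grep -i <fault-domain-name>

Look for two entries claiming the preferred fault domain, one owned by a UUID that no longer maps to a live witness (esxcli vsan cluster get shows the current member UUIDs).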
Sorry, I'm probably over-describing the problem. The Affinity setting in the Storage Policy for the Preferred/Secondary fault domain is the opposite of the designated Preferred fault domain setting configured in the cluster's Configure | "Fault Domain & Stretched Cluster" section of the vCenter Server Web Client.

I could leave it as-is and accept that the site in the Affinity rule is simply reversed. But my concern is the risk that a future change, like a patch or upgrade, corrects this problem and the VMs' objects are moved unintentionally. I'm guessing I will have to disable the stretched cluster setting and re-configure it to designate the fault domain I want as Preferred.

The goal was to have the ability to pin VMs to specific sites using Storage Policies and DRS VM-to-Host rules. There would be a storage policy with PFTT=0 and Affinity=Preferred for Site-A, and a second storage policy with PFTT=0 and Affinity=Secondary for Site-B. The storage policies would allow me to keep a VM (all of its objects) local to a site, and the DRS rules would pin the VMs to hosts at that site.

I will be migrating existing production VMs into this new vSAN cluster. These VMs do not have stretched layer-3 networking, so they must be pinned to a specific site. And there are Test/Dev VMs that do not require replication between sites. The next phase will be to deploy NSX to stretch the network for VMs; we can then fully take advantage of a storage policy with PFTT=1 (where Affinity is not relevant).

Eventually there is a use case in our environment for wanting to change the Preferred fault domain designation. We are required by a vendor to demonstrate DR every six months, and having the workloads move between sites will satisfy that requirement. Being able to change the Preferred designation would help us balance the workload across sites and still maintain continuity in a network outage. Thanks.
Thanks for the reply. Unfortunately the workaround didn't fix the problem. I changed the storage policy by adding IOPS limit = 0 and did an update, but the VM objects' location does not move/change.

If I change the Affinity Preferred/Secondary back and forth, the objects move between hosts in the fault domains. But the Preferred/Secondary designation in the Storage Policy Affinity is the opposite of the setting observed in vCenter. I've also run the command "esxcli vsan cluster preferredfaultdomain get" on the hosts and witness, and all report the same preferred fault domain as the vCenter Server Web Client.

This is a new environment with just a few test and non-prod VMs running, so I do have an opportunity to try things. I've been doing some failure testing of this new cluster prior to moving legacy production VMs. We simulated a cut of the 10Gb connection between the data centers used by vSAN. This did work previously, but last week I changed the Preferred Fault Domain designation and discovered this problem. Thanks!
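To see exactly where a given VM's components landed, the debug namespace added in 6.6 can help (a sketch; the output is verbose, so page or grep it by object UUID):

  esxcli vsan debug object list | less

It shows per-component placement, which should confirm whether the objects really sit in the opposite fault domain from what the policy's Affinity rule says.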
Has anyone seen a problem where changing the Preferred Fault Domain designation in a stretched vSAN cluster causes the Affinity rule for Preferred/Secondary in a Storage Policy to become out of sync?

I have a new vSAN 6.6 stretched cluster with 6 hosts in one data center (DC-North) and 6 hosts in another (DC-South). During the initial install I designated DC-South as the Preferred Fault Domain. I created a storage policy for each DC with Primary Failures To Tolerate = 0. The DC-South policy had PFTT=0 and the Affinity rule "Preferred Fault Domain", and another storage policy for DC-North had PFTT=0 and Affinity=Secondary.

Last week I changed the preferred designation in the "Fault Domain & Stretched Cluster" section so the hosts in the DC-North fault domain are now Preferred. I also updated the Affinity rules in the storage policies, so the DC-North storage policy is Affinity=Preferred and DC-South is Affinity=Secondary.

I re-applied the policies to the VMs, but the VMs are storing their objects on hosts in the wrong fault domain. E.g. a VM is running on hosts in the Preferred fault domain (DC-North), but its objects are stored on disks at the Secondary (DC-South), and vice versa for VMs running at Secondary but stored at Preferred. I've created new storage policies, cloned and updated the originals, built new VMs with both new and existing policies, etc. But the Affinity rule for Preferred/Secondary in the Storage Policy does not match the vSAN cluster setting for Preferred/Secondary.

I have a ticket open with GSS. Thought I would ask the communities if this has been seen before. Thanks!
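In case it helps with diagnosis, each host's own fault domain assignment can also be confirmed from the shell (a sketch; run on each host and on the witness):

  esxcli vsan faultdomain get                   # fault domain this host belongs to
  esxcli vsan cluster preferredfaultdomain get  # which FD the node believes is preferred

If those agree with what vCenter shows, the mismatch is likely purely on the policy/Affinity side.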
We have a pair of switches for data node/ESXi host management and VM traffic. A second pair of switches was added for vSAN traffic to the ESXi hosts. A 10Gb circuit between the data centers is also on this second stack. The data nodes are using L2 for vSAN. We're using a /23 for the vSAN vmk interfaces, with the bottom /24 at one DC and the top /24 at the other. A route was configured on a switch at the primary data center. Hosts at one site could connect to the witness, but the witness couldn't connect to the hosts. We had to create a route for the other data center and also add static host routes (/32) on each host for traffic to work.

I opened a ticket with GSS to confirm whether, on the Witness host, we could consolidate Management and Witness traffic onto vmk0. The Witness added with no errors and also passed the health checks. However, we discovered an unintended side effect: access to the KMS (coincidentally at the same site and on the same VLAN/IP subnet as the Witness) broke. It looks like the traffic for the encryption KMS is traversing the vSAN vmk of the data hosts.
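For reference, the per-host static routes were added with something like this (a sketch; the witness address and gateway are placeholders for our real addressing):

  esxcli network ip route ipv4 add -n 192.168.20.10/32 -g 192.168.10.1
  esxcli network ip route ipv4 list   # verify the /32 took effect

The KMS side effect makes sense in hindsight: a route that covers the witness's whole subnet, rather than just its /32, would also pull KMS traffic onto the vSAN vmk, which would explain what we saw.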
Thanks for the reply. For me, some of the confusion is that the documentation, like StorageHub or the config guide, isn't always clear when it changes context between 2 Node and Stretched Cluster.
I've been trying to deploy the witness for a 12-node stretched vSAN 6.6 cluster and having some L3 problems getting the vSAN vmk interfaces used by the data nodes to connect to the witness host's vSAN vmk interface. The data nodes are at two different data centers connected by stretched L2, and the witness is at a third, separate location, using the VM appliance as the witness host.

I was able to get the witness to work by having vmk0 on the witness ESXi host carry both Management and vSAN traffic (after I unchecked the vSAN box on vmk1 of the witness ESXi host). I added a static route on my data nodes so their vSAN vmk interfaces (vmk1) can ping the witness IP for Management and now vSAN, and from the witness I can ping the IPs of the data nodes' vSAN vmk interfaces. I was able to successfully configure the stretched cluster and add the witness, and the health checks are green.

My question is whether having both Management *and* vSAN traffic on the witness vmk0 is a supported configuration. There is a link on StorageHub for Witness Traffic Separation (WTS) that describes this configuration, but it isn't clear to me if this is supported for Stretched Cluster or only for 2 Node Direct Connect clusters. Thanks!
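For completeness, the data-node side of WTS is done by tagging a vmk for witness traffic, which would let the data nodes reach the witness over their management network instead (a sketch; vmk0 here is whichever interface should carry witness traffic):

  esxcli vsan network ip add -i vmk0 -T=witness
  esxcli vsan network list    # shows the traffic type (vsan vs witness) per vmk

That's the configuration the StorageHub page describes; my open question stands on whether it's supported for stretched clusters or only for 2 Node Direct Connect.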