Tibmeister's Posts

I do experience some latency in getting the baselines sometimes, but eventually they get there.  Is this a 2-node cluster?  Just wondering why you don't switch to Image Based Updates.
You will definitely interrupt vSAN networking, which can cause bad things to happen. What I would do is create new vmkernel interfaces on the vSwitch with the new IP config. Then enable vSAN on the new vmk’s and let things sit. Then you should be able to disable vSAN on the old vmk’s, and as long as there’s IP connectivity on the new vmk’s you should be golden without issues.  Now, I’ve never done this, but in theory it should work. If this is production, open a ticket with support and run it by them to ensure no oddities. At the very least, they can be ready in case something goes sideways and can help get things going again.
Being on the same subnet, ESXi (well, any OS) will use the lowest-numbered interface first to send the data, in this case vmk0. Put vmk2 on a separate subnet and you’re golden.
Are vmk2 and vmk0 on the same subnet?
The balloon driver can cause some interesting results, and all it really does is create a virtual page in one guest's RAM for another guest to use.  Not something I want in a secure environment. Plus, it makes the VM "donating" the RAM seem much more heavily utilized; unless the monitoring tool is balloon-aware, which most aren't, that will lead to some false conclusions.
vSAN really changes the entire conversation, because while core vSAN services may only use 50GB on each host, during operation it may use more due to network congestion, controller saturation, etc.  If it's on a 10Gb network, pretty much expect 40% of your host RAM to be consumed by vSAN, and yes, it will take RAM from VMs and force ballooning to occur, because if it doesn't, the VMs will be IO starved and not function anyway. You also have the memory used by other ESXi services as well, and having 15 VMs with 128GB RAM each is a hefty load on the host to try to schedule and keep up with the NUMA node configuration if you have more than one socket in your host.

When I do capacity planning, I shave off the host requirements and such before doing any VM calculations.  So, you have 768GB in each host (765GB reported).  Using the reported size, take 40% of that off the top, then I take another 10% off the top to cover ancillary services.  So, 765 - 40% = 459GB, then subtracting the additional 10%, rounding down, gives you 413GB of usable RAM on each host.  Multiply that by 3 and you get 1,239GB of usable RAM in the cluster.

Now, these calculations have never failed me, and the highest I've pushed a cluster, using no overcommit of RAM, is 85% of RAM.  I adjust alarms to 90% WARN and 95% ERROR, which is more than reasonable on hosts this large.

So, as you can see, you don't have enough available RAM to run the given workload, and the fact that ballooning is occurring reinforces that. Now, do you really need 128GB of RAM on those VMs?  Probably not, and if the 95th percentile utilization over 45 or 90 days is under 70%, your VM has more RAM than it can use and you need to right-size.  A VM that is running at 75% ~ 85% utilization on its RAM in the given timeframe and statistic is a perfectly sized VM from a RAM perspective.
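That back-of-the-napkin math is easy to script. A minimal sketch (the 40% vSAN reserve and 10% ancillary reserve are my rules of thumb from above, not fixed VMware numbers):

```python
import math

def usable_cluster_ram(reported_gb_per_host, hosts,
                       vsan_reserve=0.40, ancillary_reserve=0.10):
    """Estimate usable cluster RAM using the rule-of-thumb reserves above."""
    after_vsan = reported_gb_per_host * (1 - vsan_reserve)       # 765 -> 459
    per_host = math.floor(after_vsan * (1 - ancillary_reserve))  # round down -> 413
    return per_host * hosts

print(usable_cluster_ram(765, 3))  # -> 1239
```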
Yes, you will receive pushback on reducing a VM from "vendor recommendations", but the vendor isn't paying for your hardware, or the waste of it, and you need to make a financial case to your business that sizing the VM for the actual workload, and not a theoretical test case, is more fiscally responsible.
Is the disk Thin or Thick?  If it's Thin, then nothing needs to be done, because the space will never be used since it's not allocated in the VM guest.  If it's Thick, then you would need to Storage vMotion or use VMware Converter to convert the disk to Thin.  You can't deallocate blocks from a virtual disk once they've been allocated. I suppose you could add a new disk, clone the unexpanded disk over, and remove the old disk.
I've seen a lot of ARP duplication when in this default state.
What does the rest of the environment look like?  vSAN, NFS, iSCSI?  NSX?  I've seen a standalone ESXi host fresh installed take up to 15% of available RAM to run just basic services.  In your case, you're looking at close to 50%, so there's something else in the environment that is requiring this amount of RAM.
Not really using STP, but you can have a lot of ARP reflection and duplication of packets if you have the same VLAN going over multiple interfaces without some type of aggregation protocol running.  If you don't want to run an aggregation protocol, you would need to only have a single link as Active and the rest as Standby to act as failover.  You can do this on a per-PortGroup/VLAN basis, so you can have multiple VLANs, each going over their own link, to manually balance out the bandwidth consumption of the links.
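The manual-balancing idea (one uplink Active per PortGroup, the rest Standby) can be sketched as a simple round-robin assignment. The portgroup and vmnic names here are illustrative only:

```python
def assign_active_uplinks(portgroups, uplinks):
    """Round-robin each portgroup's Active uplink; all others become Standby.

    Spreads VLANs across links without an aggregation protocol, so each
    VLAN only ever egresses one link and ARP isn't reflected back."""
    plan = {}
    for i, pg in enumerate(portgroups):
        active = uplinks[i % len(uplinks)]
        standby = [u for u in uplinks if u != active]
        plan[pg] = {"active": [active], "standby": standby}
    return plan

plan = assign_active_uplinks(["VLAN10", "VLAN20", "VLAN30"], ["vmnic0", "vmnic1"])
print(plan["VLAN10"])  # -> {'active': ['vmnic0'], 'standby': ['vmnic1']}
print(plan["VLAN20"])  # -> {'active': ['vmnic1'], 'standby': ['vmnic0']}
```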
You do realize ESXi 6.7 is not supported, which is probably the root cause of your issues.
Adding a new vmkernel works very nicely, I would still provision a new subnet that’s non-routed for iSCSI and avoid messing with routes on the ESXi hosts. 
I'm assuming you want to use the vmkernel that you are using for vSAN to also transport iSCSI traffic from the backend storage to the host.  vSAN should already be on its own non-routed subnet, but since you have iSCSI traffic going out the gateway, I'm thinking that you have the vmkernel interfaces all in the same subnet/VLAN, which is the crux of the issue. Create a new subnet/VLAN and do not route it at all.  Then assign an interface on the iSCSI storage to that subnet, and also the vmkernel you want to use for storage.  Since it's not routed, it won't be in the route table for vMotion or Management, and since it's in the same subnet as the vmkernel you are using for storage, ESXi will use that interface to move the traffic.
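The interface-selection behavior here is just ordinary IP routing: a directly connected subnet always wins over the default gateway. A rough sketch of that decision, with made-up example addresses and vmk names:

```python
import ipaddress

# Hypothetical vmkernel interfaces: (name, directly connected subnet)
vmks = [
    ("vmk0", ipaddress.ip_network("192.168.10.0/24")),  # Management, routed
    ("vmk3", ipaddress.ip_network("172.16.50.0/24")),   # iSCSI, non-routed
]

def pick_interface(dest):
    """Prefer a directly connected subnet; otherwise fall back to the
    default gateway on the management interface."""
    dest = ipaddress.ip_address(dest)
    for name, net in vmks:
        if dest in net:
            return name
    return "vmk0 (via default gateway)"

print(pick_interface("172.16.50.20"))  # -> vmk3 (same subnet as iSCSI vmk)
print(pick_interface("10.0.0.5"))      # -> vmk0 (via default gateway)
```

Because the iSCSI subnet has no gateway, storage traffic can never leak out the management path; it either goes over the storage vmk or not at all.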
I haven't seen that doc in years!  I do have to wonder, how much has changed?  I think the general picture is unchanged though, so should be good for OP's needs. BTW, why are you needing this level of info, if you don't mind me asking?
I think you just described the issue, high IOWAIT, which I will bet when the context of the snapshot is taken into account you will find your storage is the main issue.
So thanks to someone running their car into the power transformer, we had to go through a shutdown over the weekend.  What I'm noticing is that when vCenter performs the "Execute power off logic on orchestration host" step on the second host, it just hangs until it times out.  The command stays at 79%, but once it times out I can "Resume Shutdown", and then the "Execute power off logic on orchestration host" step goes through pretty quickly; the host shuts down, then the cluster shuts down.  This is after the first host shuts down without issue. There's about 5ms latency between the cluster and the shared witness, so that shouldn't be an issue, and the step does go through fast on the first host, and on the second host once we do the resume.
BusyBox is a small binary that provides the standard shell utilities such as ping, vi, nslookup, etc., in the ESXi shell.  The VMkernel is not just an application; it is the hypervisor's own operating system kernel, loaded at startup.
What is the overall goal in asking this question?  ESXi boots much like a Linux system does: the bootloader loads first, then hands control over to the kernel (the VMkernel).
So you're saying the only indication of throttling you have is that the guest numbers don't match the host numbers?  Well, they won't; they rarely ever do.  The reason is that different measurement methods are used to come up with the numbers.

For instance, most OSes (Linux and Windows) use a method called a watchdog timer to determine CPU usage.  It does this by starting a low-priority thread on the CPU(s) and waiting to see how long it takes for that thread to complete.  That timing becomes the measurement of CPU usage.  The logic is that the low-priority thread will not complete until all other threads have completed, so in theory this is a viable ballpark measurement.  This happens regardless of the hardware, which is why it works: it is hardware agnostic.

So, you have a VM that is reporting high CPU usage because its timer threads are taking a while to return.  Taking into account that the vCPUs are scheduled across a finite number of pCPUs, and that other VM workloads can impact this, you often will see a VM guest reporting higher CPU usage than what the host itself reports for the same VM.  In this case, the host is actually correct, because it knows about the scheduling and can take that into account, as well as all the other VMs running on it.  The VM guest has no knowledge of this, so it is blind and thinks things are more utilized than they are.

Now, there's a whole slew of other factors in this: %RDY, IOWAIT, CO-STOP, etc.  One thing to keep in mind is that for the most part, the vCPU of the VM will be used to process data that normally would be handled by a storage controller and NIC, which is the IOWAIT measurement.
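The timing-thread idea reduces to a simple calculation: if the low-priority probe takes longer than it would on an idle CPU, the guest infers the difference as "usage". The function name and the millisecond figures below are purely illustrative:

```python
def inferred_cpu_usage(observed_ms, idle_baseline_ms):
    """Estimate CPU 'usage' the way a slowdown-based probe would:
    the fraction of the probe's wall time stolen by other work."""
    stolen = max(observed_ms - idle_baseline_ms, 0.0)
    return min(stolen / observed_ms, 1.0)

# Probe completes in 10ms on an idle CPU. Inside a VM whose vCPU is
# being descheduled by the host, the same probe takes 40ms wall time:
print(inferred_cpu_usage(40.0, 10.0))  # -> 0.75
# The guest reports 75% "usage" even though its own workload is unchanged,
# while the host correctly accounts the time to scheduling, not to the VM.
```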
If this is high, then the VM is waiting for the vCPU to process IO from either the storage or network stack, which causes the watchdog threads to take much longer to complete; therefore the VM thinks its CPUs are heavily utilized when in fact that is not the case: you have an IO bottleneck somewhere.  Often, if the storage doesn't have high latency, this will be something in the network stack, like a long-running SQL query or a large single-threaded data transfer.

Now, one may think to just throw more vCPUs at the problem, but that only makes the situation worse, not only for the VM in question but for all VMs on the host.  This is why the term "right sizing" is so heavily stressed and used: you have to properly size the VM's resources to the actual workload and observe.  Often, VMs are given resources just because, or "because the vendor says so", and people then wonder why this situation occurs.

Also, hyperthreading is not your friend, because despite popular belief it is not a full added thread; in reality it is at most about a 50% increase in performance.  So having 8 cores and 16 threads does not equal having 16 cores.  Sometimes you can get lucky, but most times you will see your VM report high CPU utilization that is not actually true.

You need to look at the VM counters on the host for %RDY, CO-STOP (%CSTP), and IOWAIT.  This is a good start to determine what is going on with your VM.  Also, do not override vNUMA by changing the default cores per socket from 1.  Leave that alone unless you have some software that still thinks it's a great idea to license that way.  You don't actually gain any real benefit from messing with this setting, and it can cause more harm than good.  Also, disable Hot-Add for both memory and CPU; it's another performance killer.

Lastly, right-size the VMs.  Do you actually need 8 vCPUs assigned to the VM?  Most folks think that if the CPU utilization of a VM is > 50%, then more CPUs need to be added.
That's absolutely wrong; in a VM, if you run between 70% and 80% normally, then you are right-sized for sure.  Measure this by taking 1-minute samples over 90 days and then using only the 95th percentile; you don't care about spikes, only plateaus.  I ran a very large infrastructure on that basic principle, and not only did things perform better with fewer vCPUs, several million dollars' worth of equipment purchases were avoided.  It works.

Also, not every VM is made the same, even if the same software is installed on each one.  Small variations in workload, how the workload is used, and how the IO stack is used will cause massive variations in how each one behaves.  You must treat each VM as its own container for fine-grained tuning; t-shirt sizes are a great starting point but not the end of the conversation.

Look up VM right sizing on this forum; you will find a lot of good discussions, possibly even some of my past ones, that will explain this in far more depth than I have here.
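The 95th-percentile rule above takes only a few lines to apply. A minimal sketch, with illustrative sample data and the 70%/85% thresholds from the guidance above:

```python
def p95(samples):
    """95th percentile of 1-minute CPU utilization samples (0-100),
    using the nearest-rank method."""
    ordered = sorted(samples)
    idx = max(int(len(ordered) * 0.95) - 1, 0)
    return ordered[idx]

def rightsizing_verdict(samples, low=70, high=85):
    p = p95(samples)
    if p < low:
        return "oversized: reclaim vCPU/RAM"
    if p <= high:
        return "right-sized"
    return "undersized: consider adding resources"

# Illustrative: a VM that idles around 40% with a few short spikes to 95%.
# The percentile ignores the spikes, so the verdict is "oversized".
samples = [40] * 95 + [95] * 5
print(p95(samples), "->", rightsizing_verdict(samples))
```

The point of the percentile is exactly what the post says: spikes don't count, only plateaus do, so brief bursts to 95% never justify adding vCPUs.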
Interesting, I am going to have to play with this to see what's going on in my environment that is causing me fits.  Thanks for the confirmation on the expected behavior, that helps a lot.