Good afternoon. We are running a stretched cluster with 16 nodes and 10 Gbps uplinks, on version 6.5 Update 3. An alarm was raised because some of the VMs experienced read/write latencies of about 800 ms. I think I have traced the issue back to a disk group: all of the disks in the disk group have been showing results like those below. I am not seeing any issues with cache destage rates. Does anyone know why there would be high physical/firmware-layer latency on all disks in the group? Thanks in advance.
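For reference, this is the kind of per-device check I have been running to pull the physical-layer stats for each disk in the group (the device name is just an example; substitute the disks from your own disk group):

# List the disks claimed by vSAN to get their device names
esxcli vsan storage list

# Pull the core device statistics for one of the capacity disks
esxcli storage core device stats get -d mpx.vmhba1:C2:T2:L0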
It might be worth trawling through all the design guides on StorageHub, as throughput can be impacted even by the type of switches used; even a small buffer size on the switch port can impact performance. We had the MTU set to 9000 on our stretched clusters, but we discovered that the network admin had left it at 1500 on the core, so performance on our stretched clusters was impacted.

We have about 26 stretched clusters with 12 hosts in each site. Each host has 4 disk groups (all flash) with five capacity drives in each. Storage policy settings depend on the needs of the application, so for Postgres boxes we would use RAID 1 for the database and logs. I would strongly recommend that you use HCIBench and look at the performance you get with different disk policies.
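A quick way to rule out the same MTU mismatch end to end is a jumbo-frame vmkping across the inter-site link (a sketch, assuming vmk1 is the vSAN vmkernel port and 192.168.1.20 is a vSAN IP on the far site; adjust both for your environment):

# 8972-byte payload + IP/ICMP headers = 9000; -d sets do-not-fragment
vmkping -I vmk1 -d -s 8972 192.168.1.20

If that fails while a plain vmkping works, something in the path is still at 1500.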
Interesting, so are you saying that if you go from 6.5 U3 to 6.7 U3 you need to deploy a new witness? I have upgraded a few stretched clusters using virtual witnesses with Update Manager, and they upgraded without any issues. I can understand 6.0 to 6.7, though.
Thanks for replying, TheBobkin. I just chose that drive as we were seeing alerts in vSAN. My suspicion is that they are related to driver issues, as the environment has not been patched for well over a year, but we cannot patch it due to a Covid change freeze and our call centre runs on it. It is running 6.5 build 7388607, and the firmware is well out of date, so I was going to wait until the change freeze was over and get it updated to 6.7. We are seeing the error below, but it clears:

LSOMEventNotify:6956: Virtual SAN device 52dabb19-aa43-e70d-3aab-0e4f63bb13c7 has gone offline.
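A rough way I have been trying to confirm the driver angle before the freeze lifts (the module name below is only an example; check what your controller actually loads):

# Find the driver your storage controller is using
esxcli storage core adapter list

# Show the loaded driver version (substitute your module, e.g. lsi_mr3)
vmkload_mod -s lsi_mr3 | grep -i version

# See how often the device is flapping offline
grep "has gone offline" /var/log/vmkernel.log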
Quite a lot of info, but I will do my best. I assume you have tried PFTT=1 with RAID 1; I would also try it with SFTT=0. Erasure coding has roughly a 40% increase in read/write amplification, depending on your circumstances. Also, what stats are you getting on your vmnic? Run esxcli network nic stats get -n vmnicX and see if you are getting dropped packets or receive packet errors:

NIC statistics for vmnic2
   Packets received: 0
   Packets sent: 0
   Bytes received: 0
   Bytes sent: 0
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 0
   Broadcast packets received: 0
   Multicast packets sent: 0
   Broadcast packets sent: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
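A single snapshot of those counters does not tell you much; what matters is whether the error counts are climbing under load. A quick sketch to watch them from the ESXi shell (vmnic2 as an example):

# Print the error/drop counters every 10 seconds
while true; do
    date
    esxcli network nic stats get -n vmnic2 | grep -Ei 'error|dropped'
    sleep 10
done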
Good morning. When I run esxcli storage core device stats get, I get the following information for mpx.vmhba1:C2:T2:L0:

Device: mpx.vmhba1:C2:T2:L0
   Successful Commands: 1879424953
   Blocks Read: 8194120138
   Blocks Written: 7770849709
   Read Operations: 946209367
   Write Operations: 933147753
   Reserve Operations: 3
   Reservation Conflicts: 0
   Failed Commands: 631
   Failed Blocks Read: 6534
   Failed Blocks Written: 0
   Failed Read Operations: 484
   Failed Write Operations: 0
   Failed Reserve Operations: 0

There are indications of failed commands and failed blocks read; however, when I run esxcli storage core device smart get on the same device, it says the health status is OK. I am just wondering if this is an indication of drive failure, as the iLO reports the drive as healthy.

esxcli storage core device smart get -d mpx.vmhba1:C2:T2:L0
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              130    39         130
Power-on Hours                100    0          100
Power Cycle Count             N/A    N/A        N/A
Reallocated Sector Count      100    1          100
Raw Read Error Rate           130    39         130
Drive Temperature             100    1          100
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A

Thanks in advance.
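One thing I have been doing in the meantime is checking whether the failed-command counters are static or still growing, and whether the failures line up with anything in the vmkernel log. A rough sketch (same device as above):

# Take two readings a few minutes apart and compare the Failed counters
esxcli storage core device stats get -d mpx.vmhba1:C2:T2:L0 | grep Failed

# Look for I/O errors logged against the device
grep mpx.vmhba1:C2:T2:L0 /var/log/vmkernel.log | tail -20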
We have a six-node Log Insight 8 cluster. All the nodes are configured as large, and the ingestion rate is below 15,000 events per second, which is within spec for these nodes. I have also run some checks on the nodes for corrupt buckets, and I have switched off the one alert we have, just to see what the issue is. The problem is that simple queries grind to a halt to the point that the cluster becomes unusable. I have checked the cluster nodes in vROps and cannot see evidence of hardware issues. Does anyone know of ways to improve query performance in Log Insight? I have logged a call with GSS as well. Interestingly enough, we did not seem to have these issues with 4.8. Any help would be greatly appreciated.
Thanks to all those who replied, and thanks to TheBobkin, who has been really helpful in this post and in others. I managed to reach out to GSS. We rebooted the host in question with the vSAN services disabled on boot; the engineer then logged back into the host and removed the disk without any of the vSAN services running. Thanks again to the community here.
When trying to remove a disk group using esxcli vsan storage remove -u 5265713e-a712-9d31-4a8b-b77b5e1bb39e, I get the error below:

Unable to remove device: Failed to write partition

I have managed to dismount all of the SSD capacity disks in the disk group but cannot dismount the cache disk. The cache disk is faulty, but in vCenter it is reported as mounted with a permanent disk failure, and vCenter will not let me delete the disk or the disk group while the cache disk is still mounted. Any help would be gratefully received.
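For completeness, there is also a disk-group-level variant of the command that targets the cache-tier SSD with -s, which should take the whole group with it (the device name below is only an example; the real one comes from esxcli vsan storage list):

# Identify the cache disk backing the disk group
esxcli vsan storage list

# Remove the entire disk group by its cache-tier SSD
esxcli vsan storage remove -s mpx.vmhba1:C0:T0:L0

Given the cache disk has a permanent failure, I am not sure it will behave any better, but I mention it in case someone suggests it.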
Thanks for your reply. Yes, sort of: we have not isolated vSAN traffic, so it is sharing uplinks with other traffic types, but there is reluctance to use Network I/O Control (NIOC). When you look at the stats for the NIC there are quite a few total receive errors, which would indicate we may need to increase the RX buffer size on the NIC. The cards are 25 Gbps. For some of the VMs we have pinned the data to the preferred site, as the application is clustered, and changed the storage policy to mirroring. I am just wondering if there is any formula that would show how much reduction in write amplification I would get if I changed the stripe width from 4 to 1.
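On the RX buffer point, this is roughly how I was planning to check and raise the ring size (vmnic2 and 4096 are examples; the esxcli network nic ring namespace needs a reasonably recent 6.x build, and older ones use ethtool -G instead):

# Current and maximum supported ring sizes
esxcli network nic ring current get -n vmnic2
esxcli network nic ring preset get -n vmnic2

# Raise the RX ring (must not exceed the preset maximum)
esxcli network nic ring current set -n vmnic2 -r 4096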
We have a number of all-flash stretched clusters. Most of our VMs have a stripe width of 4, which I know is bad, but it was done before I joined the company. My question is: if we reduce the stripe width from 4 to 1 (the VMware recommendation), will we see a reduction in read/write I/O amplification? Our number one goal is to reduce the load on the back end. Thanks in advance.
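To make the question concrete, my rough understanding of the component layout for a single VMDK under mirroring (PFTT=1), which is what I am trying to confirm:

stripe width 4: 2 mirror copies, each split across 4 capacity disks = 8 data components per object
stripe width 1: 2 mirror copies, each on 1 capacity disk = 2 data components per object

So the write amplification per I/O (two mirror writes) would stay the same, but the back-end fan-out across disks should drop sharply.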
Hi, just wondering if somebody could help. We have three vCenters, all in linked mode, running vCenter 6.5 U3. We have vSAN stretched clusters which all seem to be working OK. However, when I log into vCenter, all the vSAN configuration options have disappeared, including the monitoring options, so we cannot see the health tests. I have logged in with both my own admin account and the admin SSO account, and the same thing is happening for my colleague. All the appropriate services are running on the vCenters, including the vSAN health service, and all the licensing and time sources appear OK.
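For reference, this is roughly how I have been verifying the services on the VCSAs (the service name may vary slightly by build, so I list first):

# On each vCenter appliance, find and check the vSAN health service
service-control --list | grep -i vsan
service-control --status vsan-health

# Bounce it if it is stopped or wedged
service-control --stop vsan-health
service-control --start vsan-health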
Really simple question, I know. We have two NICs in a team configured with "Route based on originating virtual port". Both NICs are set to active/active, and we also have failback enabled in the failover order settings. My question is: does the failover order configuration (including failback) apply when the NICs are set to active/active, or does it only apply when the team is set to active/standby? Thanks in advance.
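For context, this is how I have been pulling the current teaming settings on a standard switch (vSwitch0 as an example; for a distributed switch the equivalent lives in the vCenter UI):

# Shows load balancing, failback, and active/standby uplink order
esxcli network vswitch standard policy failover get -v vSwitch0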
We have had major issues with VMs not being in compliance with their SPBM disk policies. I have had a look in vROps 6.7, and there does not seem to be an easy way of exposing this information. We have some very large vSAN stretched clusters, and we would like to create a dashboard that lists the VMs that are not in compliance with their disk policies. Is this something that can be done? Thanks in advance.
Also, when running ntpq -p we get the output below; the reach figure of 377 indicates successful communication with the NTP server, but the host and the vSphere Web Client both show the time out by an hour.
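For reference, the quick comparison I have been running on the host to see where the hour goes (ESXi keeps everything in UTC, so all three of these should agree):

esxcli system time get      # system (kernel) time
esxcli hardware clock get   # BIOS/hardware clock
date -u                     # shell view of system time, in UTC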
We use our own internal NTP servers, which we know are reporting the correct time, as we can see it in vCenter. The problem we have is that the time is not being picked up by this host; we use the same NTP servers on about 2,500 of our hypervisors.
Yes, I set the time using esxcli hardware clock set, but as soon as you stop and start the NTP service the time reverts back to what it was, i.e. an hour earlier. I confirmed this by running esxcli hardware clock get.
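The sequence I have been using, in case I am doing it in the wrong order (the timestamp values are just an example):

# Stop NTP so it does not fight the manual change
/etc/init.d/ntpd stop

# Set both the system time and the hardware clock, in UTC
esxcli system time set -y 2020 -M 6 -d 15 -H 14 -m 0 -s 0
esxcli hardware clock set -y 2020 -M 6 -d 15 -H 14 -m 0 -s 0

# Restart NTP and re-check
/etc/init.d/ntpd start
esxcli hardware clock get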