VMware Cloud Community
KKSAdmin
Enthusiast

VSAN Metrics not being collected in vROPS

We are running the current build of vROPS 6.7 on an all-flash vSphere/VSAN 6.7 cluster.

We've had some issues with bad SSDs as of late and are trying to increase our visibility in this area.

While the VSAN dashboards have been working well for us, the following sections of the "Troubleshooting VSAN" dashboard appear to have no metrics:

15)  Are we reading from cache?

18)  Is Congestion High?

19)  How often does congestion happen

21)  Is Disk Group running low on capacity (shows 100% which is not true)

22)  Is the disk groups usage balanced (capacity metric is present, but % used metric shows zero/green which is not true)

23)  Any Errors on the Disk Group (always zero)

24)  Any dropped packets on VSAN network? (always zero)

25)  Cache Disks:  Any hardware issues? (both metrics always zero)

26)  Capacity Disks:  Any hardware issues (both metrics are always zero)

The rest of the metrics in the dashboard are fine, but we'd like to see the others work too!

Most important to us right now are 25 and 26, which help to identify bad drives.  I ran some PowerCLI to identify the most vulnerable drives but we'd much rather see this in vROPS!
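
For illustration only, the kind of ranking described above could be sketched as follows. This is not the PowerCLI that was actually run; it is a minimal Python sketch in which the `DriveStats` fields, the scoring weights, and the function names are all hypothetical assumptions, and the counters would have to be exported from wherever SMART data is available.

```python
from dataclasses import dataclass


@dataclass
class DriveStats:
    """Hypothetical per-drive counters; field names are assumptions."""
    name: str
    reallocated_sectors: int
    media_errors: int
    wear_percent_used: int  # 0-100, higher means more worn


def risk_score(drive: DriveStats) -> int:
    """Combine counters into one number. The weights are arbitrary."""
    wear_penalty = max(0, drive.wear_percent_used - 80)
    return drive.reallocated_sectors + 10 * drive.media_errors + wear_penalty


def rank_drives(drives: list[DriveStats]) -> list[DriveStats]:
    """Return drives ordered from most to least vulnerable."""
    return sorted(drives, key=risk_score, reverse=True)
```

The weights only express that media errors should count for more than reallocated sectors; they would need tuning against drives that are known to be bad.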

I modified the VSAN adapter to start collecting SMART metrics but this had no effect.

Curious to see if others have had similar experiences and what the resolution may have been.  Thanks!

6 Replies
TheBobkin
Champion

Hello KKSAdmin,

Are you able to see the graph-data from the vSAN Performance graphs (which tell things such as congestion) under Cluster/Host > Monitor > Performance?

If these and the Health checks for things such as Disk Balance (Cluster > Monitor > vSAN > Health) are functional then it is likely that the problem is on the vROPS side and you should consider moving this question to the vROPS sub-community (or asking a Mod to do so).

If these do not show the expected data then there are a few low-hanging-fruit troubleshooting steps you can take, such as restarting vsanmgmtd on the nodes, restarting the vSAN Health and Performance services on the vCenter, and ensuring you have any vendor-specific plug-ins/VIBs required for monitoring drives and other hardware components.

KKSAdmin
Enthusiast

Thanks.  The VSAN Health service appears to be fine, other than that it doesn't seem to be alarming on bad drives.  These symptoms persist after a reboot.

I will move this to the vROPS area.  Thanks. 

GayathriS
Expert

Could you please share a screenshot of the dashboard that is problematic or shows nothing?

Also, please confirm a few things:

-->Was this working earlier, or has it never worked since the VSAN adapter was configured?

-->Version of the VSAN management pack

-->Versions of your vCenter and vSphere

-->vROPS version

regards

Gayathri

KKSAdmin
Enthusiast

Hi Gayathri,

Thanks for the reply, and apologies for the belated response.

VSAN adapter in general was always working. vROPS was deployed greenfield as 6.7 (no migration, fresh install).  vCenter is 6.7.0d, and ESXi is 6.7 but missing August patches.

Some background:

* Most VSAN metrics for vROPS are fine and always have been.  Just a few are suspect, and always have been.

* We had a recent incident where VSAN drives were added and one drive was bad.  The I/O degradation from the one bad drive brought the entire VSAN volume offline.

* We have had other incidents of bad drives which were neither alarmed on nor mitigated by VSAN/vROPS.

We are stable now, but looking at metrics in vROPS, these are the graphs that seem suspect:

15 and 18 are always 0.  Array is all-flash.

pastedImage_12.png

21) This always shows 100% free.  We run close to 80% capacity.

pastedImage_13.png

23/24)  Always no errors on disk group when we know otherwise due to bad disks.  Same with packet drops -- we can see a non-zero value on the VSAN dedicated switch ports.

pastedImage_14.png

22)  Disks are not close to being balanced.  If I hover over any block it shows disk utilization as zero -- which is 100% false.

pastedImage_15.png

26)  This shows zero values 100% of the time -- including when we have bad drives (and outages from bad drives).  Same for cache tier.

pastedImage_16.png

The vast majority of VSAN metrics are working fine.  Disk latency, write buffers and many more all work fine.  But the ones noted above do not seem to have valid metrics, and it has always been like this since vROPS was deployed new as 6.7.
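
As a rough way to tell working metrics from the suspect ones, exported samples could be checked for series that never change. The following is only a sketch under assumptions: the metric names in the usage example, the export format (a mapping of metric name to a list of numeric samples), and both function names are hypothetical and not part of vROPS.

```python
def is_flatlined(samples: list[float], tolerance: float = 0.0) -> bool:
    """True if the series has no data or never moves by more than tolerance."""
    if not samples:
        return True  # no samples at all is also treated as suspect
    return max(samples) - min(samples) <= tolerance


def suspect_metrics(series_by_name: dict[str, list[float]]) -> list[str]:
    """Return the names of metrics whose samples are constant or missing."""
    return sorted(
        name for name, samples in series_by_name.items() if is_flatlined(samples)
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    exported = {
        "disk_latency_ms": [1.2, 0.9, 1.4],
        "congestion": [0, 0, 0],
        "capacity_disk_errors": [],
    }
    print(suspect_metrics(exported))
```

A constant series is not proof of a collection problem (a healthy disk group may genuinely report zero errors), so this only narrows down which metrics to compare against the vSAN Performance graphs in vCenter.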

RickVerstegen
Expert

Did you turn on the vSAN Performance Service?

Blog: https://rickverstegen84.wordpress.com/
Twitter: https://twitter.com/verstegenrick
wreedMH
Hot Shot

Did you ever get this fixed?
