Hi guys, thought I would put this out there in case anyone else has seen this.
A couple of weeks ago we upgraded from vSphere 6.5 to 6.7. The system seemed fine and worked as expected for two days. On the evening of the second day, Storage DRS seemed to go crazy and moved around 500 VMDKs. This caused our tiered storage to fill up and stop working correctly.
We only had the "Space balance automation level" turned on, which up until then moved 3 or 4 a night; if we had worked on some machines it could be up to 40, but those cases were rare.
Has anyone else seen this kind of behaviour? I was thinking it could be a bad upgrade, but then why did it take two days to show itself?
The datastore cluster has approximately 35 2 TB LUNs.
I know there was an issue with 6.5 U2 and SDRS being overly aggressive when moving VMs around to create space when multiple datastores are close to full. Not sure whether that is still the case with 6.7, though. I suggest doing a log dump and phoning support!
Hi Depping, I think you are spot on here. I did what you suggested last Tuesday (not this Tuesday) and am still waiting. Do you think they will see the reason in the logs, or just that it did a lot of moves?
My money is on "yep, I can see it did lots".
Also, I'm trying to set up an alert for this in vROps but can't seem to find a good counter to trigger on for Storage DRS (I can see the standard DRS ones). Any suggestions? At the moment I'm going to fall back to vRealize Log Insight.
I haven't looked at vROps triggers for Storage DRS myself. Let me check whether the issue still exists in 6.7 for you; I will reach out to one of the developers.
What is your Support Request number?
I asked one of the support engineers, and there is a known issue with Linked Clones and SDRS, so if you are using Linked Clones that could be the reason. It is solved in 6.7 U3.
Anyway, if you provide the SR one of our SDRS experts can have a look.
No Linked Clones, Depping. I've IM'd you the SR number.
SR provided to internal SDRS specialist. Will be looked at.
Depping, I'm not getting any real answers to the questions in the SR, e.g.
and I'm thinking about closing the ticket, as the tech is just not answering these questions. Is your SDRS specialist able to help out here?
ta
He is going to look at it now; he just replied to my request. I should have an update today on whether this is a known issue and what can be done to avoid it.
Thanks, Depping
If I were a guessing man, I'd say SDRS ran, made suggestions, and started moving the VMs with multiple large drives. Before those moves finished, it ran again without taking into account the drives not yet moved (or not yet completed) and suggested moving other VMs' drives to the same LUNs. Then the original moves completed, which blew out all the stats, and the process just ran again.
This is only a guess from what I saw on the night; I'm unsure which exact logs would confirm all this.
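The feedback loop guessed at above can be sketched with a toy model. This is purely illustrative (the datastore names, sizes, and selection logic are made up, and this is not VMware's actual SDRS algorithm): if a balancing pass does not account for in-flight migrations, every pass keeps picking the same "least used" target and double-books its free space.

```python
# Toy model of the suspected SDRS feedback loop: the balancer re-runs
# while earlier migrations are still in flight and double-books space.
# All names and numbers are hypothetical; this is not the real algorithm.

def recommend_moves(datastores, vmdk_gb):
    """Pick a target datastore for each VMDK based on *current* usage only."""
    moves = []
    for size in vmdk_gb:
        # Choose the least-utilized datastore right now...
        target = min(datastores, key=lambda d: d["used"] / d["cap"])
        moves.append((size, target["name"]))
        # Bug being modelled: 'used' is NOT updated for in-flight moves,
        # so the next recommendation (and the next pass) picks it again.
    return moves

datastores = [
    {"name": "lun-01", "cap": 2048, "used": 1900},
    {"name": "lun-02", "cap": 2048, "used": 1200},
]

# Two passes before any move completes: everything lands on lun-02.
pass1 = recommend_moves(datastores, [400, 400])
pass2 = recommend_moves(datastores, [400])
targets = [t for _, t in pass1 + pass2]
total_incoming = sum(s for s, _ in pass1 + pass2)

print(targets)         # every recommendation targets lun-02
print(total_incoming)  # 1200 GB inbound: 1200 used + 1200 > 2048 cap
```

Once the 1,200 GB of queued moves complete, lun-02 is overfilled and the next pass has to shuffle everything again, which matches the "blew out all the stats and ran again" behaviour described above.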
It seems the problem is known by VMware; a patch will be available in a few days (info from our TAM).
And I can confirm: we went from 6.0 to 6.7 and have had a lot of problems since. The balancing is not working at all.
We got some tricks from support, like setting "PercentIdleMBinSpaceDemand=0" in the advanced options, but with no results.
We experience the same problems with SDRS set to automated on v6.7 U3g. It goes totally crazy, overfilling datastores and causing VMs to crash. It gives us hundreds of recommendations even though SDRS is set to 100% conservative, and it wants to move VMs where it says utilization source before: 93, after: 92.9; utilization target before: 53.3, after: 53.3. That does not seem a very reasonable move to me. It is also crazy that it produces loads of recommendations even though we run it every day. And it seems we don't experience crashes when we start it manually, but we do when we let it run fully automated with the same settings.
We are on vSphere 6.7.0.44000 build 16046470 and hope that the latest update we are installing this week will fix something.
This is a really serious problem, as it leads to production outages. We will open an SR if it is not better after updating to the most recent version.
Before going from 6.5 to 6.7 we never had these problems.