We have a VMware cluster of 4 ESXi hosts running on HPE DL380 Gen9 hardware with NVIDIA GRID M60 graphics cards (vGPU profile M60-2Q).
Total VMs: 94, with 93 powered on and 1 powered off. All VMs run Windows 10.
Problem: The hosts show degraded GPU performance and cannot support the expanding user base, which currently stands at 609. It appears to me that scalability was not taken into consideration when the cluster was designed, and hence this problem has cropped up.
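For sizing, a rough back-of-the-envelope sketch may help. It assumes the M60-2Q profile reserves 2 GB of frame buffer per VM and that one Tesla M60 board carries 16 GB in total (two GPUs with 8 GB each). The number of boards per host is not stated in this thread, so `cards_per_host` below is a hypothetical value; the function names are illustrative only.

```python
import math


def vgpus_per_card(card_framebuffer_gb: int, profile_gb: int) -> int:
    """Number of vGPU instances of one profile that fit on one board."""
    return card_framebuffer_gb // profile_gb


def hosts_needed(vm_count: int, cards_per_host: int, per_card: int,
                 spare_hosts: int = 1) -> int:
    """Hosts required to run vm_count vGPU VMs, plus spare hosts for failover."""
    per_host = cards_per_host * per_card
    return math.ceil(vm_count / per_host) + spare_hosts


M60_FRAMEBUFFER_GB = 16   # two GPUs x 8 GB per M60 board
PROFILE_2Q_GB = 2         # frame buffer reserved by one M60-2Q vGPU
cards_per_host = 2        # ASSUMPTION: not stated in the thread, adjust to the real value

per_card = vgpus_per_card(M60_FRAMEBUFFER_GB, PROFILE_2Q_GB)
print("vGPUs per card:", per_card)
print("hosts needed for 94 VMs (N+1):", hosts_needed(94, cards_per_host, per_card))
```

Frame buffer is only one limit; CPU, RAM, and encoder load per user are not modelled here.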
Solution 1: Replace the existing NVIDIA cards with newer, more powerful GPU cards.
The NVIDIA GRID M60 is EOA and EOL. Which GPU cards are compatible with the DL380 Gen9, then?
Solution 2: Expand the existing VMware cluster by adding 2 more ESXi hosts.
The DL380 Gen9 is EOA and EOL. Support is available only until EOS in 2026.
So adding 2 DL380 Gen10 hosts to the existing cluster of DL380 Gen9 hosts is a likely possibility.
vMotion will work provided we enable EVC.
In the resulting cluster of 6 hosts, the 4 existing hosts and the 2 new hosts will have different GPU cards. In such a scenario, will failover work if one of the hosts malfunctions?
Solution 3: We have deployed VMware Cloud Foundation 4.2 at location B on HPE Synergy 12000 frames with Synergy DL480 Gen10 blade servers.
Can I create a new segment on VCF at location B and attach all 94 VMs currently running at location A to it?
I can use backup and restore to move the VMs from location A to location B.
Will this work? Or do you have any better option?
Out of the 3 likely solutions mentioned above, which one is the best?
Needless to say, any help will be highly appreciated.
Thanks in advance!
Avoid solution 2: you cannot fail over the VMs to hosts with a different GPU, because the VMs are tied to a specific GPU driver and to the vGPU profiles assigned to them.
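The restriction above can be illustrated with a small model. This is only a sketch, not anything vSphere runs: the host names are made up, and it assumes vGPU profile names of the form `grid_m60-2q` or `grid_t4-2q`, from which the GPU model can be read off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    gpu_model: str      # e.g. "M60" or "T4"


@dataclass(frozen=True)
class VM:
    name: str
    vgpu_profile: str   # e.g. "grid_m60-2q"


def profile_gpu(profile: str) -> str:
    """Return the GPU model encoded in a profile name, e.g. 'grid_m60-2q' -> 'M60'."""
    model = profile.split("_", 1)[-1].split("-", 1)[0]
    return model.upper()


def failover_targets(vm: VM, hosts: list[Host], failed: str) -> list[str]:
    """Surviving hosts whose GPU model matches the VM's vGPU profile."""
    wanted = profile_gpu(vm.vgpu_profile)
    return [h.name for h in hosts if h.name != failed and h.gpu_model == wanted]


# Hypothetical mixed cluster: 4 existing M60 hosts plus 2 new T4 hosts.
hosts = [Host(f"gen9-{i}", "M60") for i in range(1, 5)]
hosts += [Host(f"gen10-{i}", "T4") for i in range(1, 3)]

vm = VM("win10-001", "grid_m60-2q")
print(failover_targets(vm, hosts, failed="gen9-1"))
```

In this model the two new hosts never appear as targets for an M60-profile VM, so they add no failover capacity for the existing desktops.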
I would go with solution 1, which carries the least risk. You can replace the existing GPUs with the NVIDIA Tesla T4 (R0W29A). This GPU is only listed for Gen10 and above, but it will work for as long as the Gen9 servers are still in your lifecycle. After that, you can replace the Gen9 servers with Gen10 Plus or Gen11.
The best solution would be to replace the Gen9 servers with Gen10 Plus. I think you could be very happy with the new DL345 Gen10 Plus and the Tesla T4, which offer a very good price-performance ratio. The DL345 Gen10 Plus is a 2U, AMD-powered system with only one CPU socket.
I am working with a 20-node cluster where 14 hosts have P4 cards and 6 hosts have T4 cards.
To avoid starting the VMs on the wrong hosts, I have created DRS groups. An HA failover will also start the VMs on the correct hosts.
With DRS groups you do not have to replace all the GPU cards.
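The DRS-group approach described above can be modelled roughly as follows. This is a plain-Python sketch of the idea, not the vSphere API: the host names, the group names (`vms-p4`, `vms-t4`), and the helper functions are all hypothetical, and it assumes a "must run on hosts in group" style rule per VM group.

```python
from collections import defaultdict


def build_host_groups(host_gpus: dict[str, str]) -> dict[str, set[str]]:
    """Group hosts by installed GPU model (one host group per GPU type)."""
    groups: dict[str, set[str]] = defaultdict(set)
    for host, gpu in host_gpus.items():
        groups[gpu].add(host)
    return dict(groups)


def restart_candidates(vm_group: str, rules: dict[str, str],
                       host_groups: dict[str, set[str]],
                       failed_hosts: set[str]) -> set[str]:
    """Hosts on which a VM from vm_group may be restarted after a failure."""
    allowed = host_groups[rules[vm_group]]
    return allowed - failed_hosts


# Hypothetical inventory mirroring the 20-node cluster: 14 x P4, 6 x T4.
host_gpus = {f"esx{i:02d}": "P4" for i in range(1, 15)}
host_gpus.update({f"esx{i:02d}": "T4" for i in range(15, 21)})

host_groups = build_host_groups(host_gpus)
rules = {"vms-p4": "P4", "vms-t4": "T4"}   # VM group -> required host group

candidates = restart_candidates("vms-t4", rules, host_groups, {"esx15"})
print(sorted(candidates))
```

Each GPU type then needs its own spare capacity, since VMs of one group can only be restarted inside their own host group.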
Thank you so much for your reply! According to HPE, the NVIDIA Tesla T4 and A10 graphics cards are no longer supported on DL380 Gen9 servers. That leaves me with only one option: getting the Tesla T4 directly from NVIDIA, bypassing HPE, and installing it in the 4 DL380 Gen9 servers in place of the existing M60 cards. This way I could use them until July 2025.
These are not my servers; they belong to one of our esteemed clients. Will approaching NVIDIA directly help me resolve the issue amicably?
Looking forward to hearing from you soon,
With warm regards,
You can get the Tesla T4 directly from HPE; however, it is not officially supported on Gen9. I saw these cards running on Gen9 hardware a few days ago. I would recommend ordering one card and testing it in your deployment.
If this helps, kudos are welcome 🙂