Cores per socket:
As already mentioned, one reason, and the initial reason, for this feature was licensing (introduced in 4.0, supported in 4.1). That is far from the only impact though. The CPU topology (independent from the NUMA topology) provides information about which cores share a cache, i.e. all cores in a socket share a last level cache (LLC), which translates into locality benefits for threads that could hit the same cache line.
In 5.0, the design decision was made to "couple" cpuid.coresPerSocket with numa.vcpu.maxPerVirtualNode, i.e. setting a CPU topology would also influence the NUMA topology (which are, in theory, separate concepts), as that matched the common physical reality at the time. Since then, multiple NUMA nodes per socket, multiple LLCs per NUMA node etc. have become a lot more common, and it turned out that misconfiguration of cores per socket was so common that we split the "coupling" of those two options, so that the CPU topology no longer influences the vNUMA topology.
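For reference, both knobs live in the VM's advanced configuration (.vmx). A minimal sketch of how they relate (values are illustrative, not a recommendation):

```
# Illustrative .vmx excerpt: an 8-vCPU VM presented as 2 sockets x 4 cores.
numvcpus = "8"
cpuid.coresPerSocket = "4"
# The setting the CPU topology used to drag along before the decoupling:
# caps how many vCPUs share one virtual NUMA node (here matching the socket).
numa.vcpu.maxPerVirtualNode = "4"
```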
There is still a very good reason to configure it correctly though: OSs and workloads can rely on CPU / cache topology information to make scheduling decisions, and showing the guest the actual underlying physical constraints can be beneficial.
Best Practices are usually separated into two distinct versions:
1. The option least likely to make things worse
2. The best performing option for some (even many) workloads, but at the cost of operational complexity and the risk of misconfiguring it and making things worse than the default.
Not touching cores per socket is 1; adjusting it is 2.
You'd usually want all vCPUs to fit into a single socket if the vCPU count stays at or below the core count of a physical socket. There is a lot more detail to the actual sizing and whether you want to enforce HT, but most of the logic is very well represented in this fling: https://flings.vmware.com/virtual-machine-compute-optimizer
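A sketch of that rule of thumb (assuming a host with at least 16 cores per physical socket; values are illustrative):

```
# Illustrative .vmx excerpt: a 16-vCPU VM presented as one 16-core socket,
# so the guest sees a topology that can actually fit one physical socket.
numvcpus = "16"
cpuid.coresPerSocket = "16"
```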
On a side note, some applications can actually perform better with 1 core per socket. Those are usually sleep/wake-heavy, communicating threads that _shouldn't_ be separated onto different cores (as those IPIs incur an overhead) but will be, due to the "intra socket is cheap" assumption. Having just 1 core per socket will keep them on the same vCPU, eliminating the need for IPIs. That is more or less a corner case, but not as rare as one would think, esp. for "latency sensitive" / synchronous IO "benchmarks".
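That corner case as a sketch (again, values illustrative):

```
# Illustrative .vmx excerpt: every vCPU is its own single-core socket, so the
# guest scheduler treats cross-vCPU migration as expensive and keeps chatty
# threads co-located instead of spreading them.
numvcpus = "8"
cpuid.coresPerSocket = "1"
```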
Controller vs. Disk
Queues were already mentioned and are rarely an issue for anything but the largest IO workloads. The other important part is that the interrupt coalescing rate is determined on a per-controller basis. For DBs you want synchronous IO (latency sensitive) issued and completed as fast as possible; if you also have a data disk on that controller, the flood of async IO would widen the interrupt window for the synchronous IO. Again, mostly relevant in performance-critical workloads.
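A sketch of the resulting layout (hypothetical disk names, PVSCSI assumed):

```
# Illustrative .vmx excerpt: latency-sensitive disk (e.g. DB log) and
# throughput-heavy data disk on separate PVSCSI controllers, so the async
# flood on scsi1 can't widen the interrupt window for scsi0.
scsi0.present = "TRUE"
scsi0.virtualDev = "pvscsi"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "db-log.vmdk"
scsi1.present = "TRUE"
scsi1.virtualDev = "pvscsi"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "db-data.vmdk"
```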