I profiled your benchmark and found that it spends most of its time in these three Windows HAL functions:
39.83% hal!KfLowerIrql
19.82% hal!KeRaiseIrqlToDpcLevel
19.07% hal!KeRaiseIrqlToSynchLevel
The hot spots in each function are TPR accesses (0FFFE0080h is the address of the TPR in the local APIC):
hal!KfLowerIrql:
807168e4 890d8000feff mov dword ptr ds:[0FFFE0080h],ecx
807168ea a18000feff mov eax,dword ptr ds:[FFFE0080h]
hal!KeRaiseIrqlToDpcLevel:
807168a0 8b158000feff mov edx,dword ptr ds:[0FFFE0080h]
807168a6 c7058000feff41000000 mov dword ptr ds:[0FFFE0080h],41h
hal!KeRaiseIrqlToSynchLevel:
807168bc 8b158000feff mov edx,dword ptr ds:[0FFFE0080h]
807168c2 c7058000feff41000000 mov dword ptr ds:[0FFFE0080h],41h
Since the local APIC is virtualized, a TPR access typically causes a VM-Exit under hardware virtualization. However, Intel has introduced FlexPriority, which avoids the VM-Exit for all TPR reads and for some TPR writes. Because of this, ESX 4.0 defaults to VT-x for 32-bit Windows 2003 on Intel chips with FlexPriority. Unfortunately, FlexPriority is not a panacea. On native hardware, TPR accesses generally take only a few cycles. With FlexPriority, TPR accesses that do not cause a VM-Exit may still take several hundred cycles. TPR accesses that do cause VM-Exits take several thousand cycles. Fortunately, we still have the option of using binary translation. Under binary translation, TPR accesses generally take tens of cycles.
For this particular workload, you should configure your guest to use binary translation. On my Penryn system, the benchmark runs in 22 seconds using VT-x (with FlexPriority), but takes only 13 seconds using binary translation. (For completeness, it takes 90 seconds using VT-x without FlexPriority.)
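To put those three timings side by side, here is a small sketch that expresses each mode's runtime as a multiple of the binary translation runtime. The only inputs are the wall-clock numbers quoted above from my Penryn system; the function and variable names are just illustrative.

```python
# Wall-clock times for the benchmark, in seconds, as measured above.
runtimes = {
    "VT-x without FlexPriority": 90.0,
    "VT-x with FlexPriority": 22.0,
    "binary translation": 13.0,
}


def slowdown_vs(baseline, runtimes):
    """Return each mode's runtime as a multiple of the baseline mode's runtime."""
    base = runtimes[baseline]
    return {mode: seconds / base for mode, seconds in runtimes.items()}


ratios = slowdown_vs("binary translation", runtimes)

# Print the modes from fastest to slowest.
for mode, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{mode}: {ratio:.2f}x")
```

On these numbers, VT-x with FlexPriority comes out at roughly 1.7 times the binary translation runtime, and VT-x without FlexPriority at roughly 6.9 times.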
Your client's situation is different. AMD has never introduced a technology equivalent to FlexPriority, so if your client has configured their VM to use hardware MMU support, the VM will be using AMD-V, which suffers from the same problem as VT-x without FlexPriority. Make sure that they have configured the VM to use software MMU support so that it will execute using binary translation. (The default execution mode for this guest under ESX 3.5 is binary translation.)
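If they prefer to check or force this in the VM's .vmx file rather than through the client UI, the fragment below is a sketch of what I would expect the relevant entries to look like. I am assuming the usual monitor.virtual_exec and monitor.virtual_mmu options apply to their ESX build, so please verify the option names against the documentation for that release before relying on them, and make the change with the VM powered off.

```
# Assumed option names; confirm against the docs for the ESX release in use.
# Run guest code under binary translation instead of VT-x/AMD-V.
monitor.virtual_exec = "software"
# Use software MMU virtualization (shadow page tables) instead of hardware MMU support.
monitor.virtual_mmu = "software"
```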