When running a CPU-intensive utility in a Windows VM, XenServer appears to move the vCPUs around across all available pCPUs rather than keeping them on one NUMA node. This has a significant impact on performance.
When I manually pin the vCPUs to pCPUs within a single NUMA node, the application processes the test data in about 470 seconds. Without the manual pinning, it takes over 700 seconds, i.e. roughly 50% longer. (Both figures are from our test system, obviously. YMMV.)
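For anyone who wants to reproduce the pinning, here is a minimal sketch of how the command can be put together. It assumes the pinning is done through the `VCPUs-params:mask` VM parameter of the `xe` CLI. The VM UUID and the pCPU range for the NUMA node are placeholders, not values from our system, and `build_pin_command` is just a hypothetical helper name. The script only builds and prints the command; it does not run it.

```python
def build_pin_command(vm_uuid, pcpus):
    """Build the xe command that restricts a VM's vCPUs to the given pCPUs.

    Assumption: pinning is done via the VCPUs-params:mask VM parameter,
    which takes a comma-separated list of pCPU numbers.
    """
    if not pcpus:
        raise ValueError("need at least one pCPU to pin to")
    # De-duplicate and sort so the mask is stable and readable.
    mask = ",".join(str(cpu) for cpu in sorted(set(pcpus)))
    return ["xe", "vm-param-set", f"uuid={vm_uuid}", f"VCPUs-params:mask={mask}"]


# Hypothetical layout: NUMA node 0 owns pCPUs 0-5 on this host.
# Check the real topology on your own host before using any numbers.
node0_pcpus = range(0, 6)

cmd = build_pin_command("<vm-uuid>", node0_pcpus)
print(" ".join(cmd))
```

As far as I know, the mask is only picked up when the VM is next started, so the VM needs a shutdown and start after setting it. Which pCPUs belong to which NUMA node depends on the host, so the range above has to be replaced with whatever your host's CPU topology reports.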
Xen appears to support NUMA-aware scheduling (unfortunately I don't have a Xen install to test on to confirm this), but XenServer does not seem to.
The application we're using is freely available and comes with a test data set (which we're using for our testing):
I'm not sure what else to include, but I'm happy to answer any questions you might have.