I'm working on a NUMA-related benchmark and have been stuck on an issue for a week.
I use numactl to pin a matrix multiplication workload to node0's CPUs and node1's memory, like this:
However, I observe that a lot of memory accesses still go to node0 (throughput there is high), whether I measure using pcm or...