Why is memory RSS low but throughput high?

daniellee343

I'm working on a NUMA-related benchmark and have an issue that has been troubling me for a week.

I use numactl to pin a matrix multiplication workload to node0's CPUs and node1's memory, like this:
[screenshot: the numactl invocation pinning the workload]

However, I observe that a lot of memory traffic still goes to node0 (the throughput there is high), whether I measure with pcm or with numatop:
[screenshots: pcm-memory and numatop output showing heavy memory traffic on node0]

Then I track the workload's page mappings by observing /proc/<pid>/numa_maps and dumping the number of pages on both NUMA nodes throughout the run:
[plot: number of pages on node0 and node1 over the course of the run]
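(The counting itself is nothing fancy; roughly, it boils down to summing the N<node>=<pages> fields that the kernel prints for every mapping in numa_maps. A simplified sketch of that kind of script, not necessarily the exact one I ran:)
Code:
#!/usr/bin/env python3
# Simplified sketch: sum the N<node>=<pages> fields in /proc/<pid>/numa_maps
# to get the number of pages the process currently has on each NUMA node.
import sys
from collections import defaultdict

def pages_per_node(pid):
    counts = defaultdict(int)
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for field in line.split():
                # per-node fields look like "N0=123", "N1=456"
                if field.startswith("N") and "=" in field:
                    node, pages = field[1:].split("=", 1)
                    if node.isdigit():
                        counts[int(node)] += int(pages)
    return counts

if __name__ == "__main__":
    for node, pages in sorted(pages_per_node(sys.argv[1]).items()):
        print(f"node{node}: {pages} pages")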

The funny thing is, node0's number of active pages stays low the whole time.
I then assumed that some dynamic libraries might reside on node0 at runtime, so I tracked those libraries and evicted them from memory before the run. However, the result stayed the same.

The question is: I see that NumPy uses a well-optimized library called OpenBLAS, so why does OpenBLAS still have tons of small objects going to node0's memory rather than node1's? In contrast, a Go version that doesn't use OpenBLAS sends (almost) all of its memory accesses to node1; see the figure below. Is the Linux kernel doing something behind the scenes here?
[figure: per-node memory traffic for the Go version, almost entirely on node1]
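(Side note, in case it helps: one way I can think of to double-check, per array, where the backing pages actually end up is to query the kernel with move_pages(2) through libnuma, passing NULL for the nodes argument so it only reports each page's current node and moves nothing. This is just a rough sketch of that idea, not the script behind the plots above:)
Code:
# Rough sketch: ask the kernel which NUMA node each page of a NumPy array sits
# on, using the move_pages() wrapper from libnuma in query-only mode (nodes=NULL).
import ctypes
import resource
import numpy as np

PAGE = resource.getpagesize()
MPOL_MF_MOVE = 2  # from <numaif.h>; harmless here because nodes=NULL means query-only

libnuma = ctypes.CDLL("libnuma.so.1", use_errno=True)
libnuma.move_pages.argtypes = [ctypes.c_int, ctypes.c_ulong,
                               ctypes.POINTER(ctypes.c_void_p),
                               ctypes.POINTER(ctypes.c_int),
                               ctypes.POINTER(ctypes.c_int),
                               ctypes.c_int]
libnuma.move_pages.restype = ctypes.c_long

def page_nodes(arr):
    """Return the NUMA node of each page backing arr (negative values are errnos)."""
    start = arr.ctypes.data & ~(PAGE - 1)          # align down to a page boundary
    end = arr.ctypes.data + arr.nbytes
    n = (end - start + PAGE - 1) // PAGE
    pages = (ctypes.c_void_p * n)(*[start + i * PAGE for i in range(n)])
    status = (ctypes.c_int * n)()
    # pid=0 -> calling process; nodes=NULL -> only report where each page is
    if libnuma.move_pages(0, n, pages, None, status, MPOL_MF_MOVE) != 0:
        raise OSError(ctypes.get_errno(), "move_pages failed")
    return list(status)

a = np.random.rand(2000, 2000)   # the pages get written here, so they are allocated
nodes = page_nodes(a)
print({node: nodes.count(node) for node in set(nodes)})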
 


I like to think I'm pretty tech savvy. But, I'm at a bit of a loss.

I recognize a few words! I also don't actually understand what it is you're trying to do.

Don't let this discourage you too much, as we have some pretty smart (and experienced) people here.

I'm not sure if it'll help, but try elaborating a bit more on what you're using, maybe link to the projects involved, that sort of stuff.

It can't hurt...
 
@KGIII ....I am not sure I like your chances of getting a reply now.
I am active.
"I also don't actually understand what it is you're trying to do."
I am trying to move all memory accesses to the remote NUMA node (i.e., node1) and confirm that with numatop or other tools, if possible. However, it seems that NumPy, which uses OpenBLAS internally, does not move its accesses to node1.
This is the workload I am testing, just a simple matrix multiplication. I run it on a modern 2-NUMA-socket server and pin its memory to node1 like this:
Code:
sudo numactl --cpunodebind 0 --membind 1 python matmul.py 10000
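For reference, matmul.py boils down to something like this (a simplified sketch: multiply two NxN random matrices with NumPy, with N taken from the command line):
Code:
# matmul.py -- simplified sketch of the workload: C = A @ B for two NxN random
# matrices. With NumPy built against OpenBLAS, the multiply runs in dgemm.
import sys
import numpy as np

n = int(sys.argv[1])        # e.g. 10000 in the run above
a = np.random.rand(n, n)
b = np.random.rand(n, n)
c = a @ b
print(c[0, 0])              # use the result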
Then I observe the memory access pattern with
Code:
numatop
or with Intel PCM:
Code:
pcm-memory
Let me know if you have other questions. Appreciate all your help!
 
