Why is memory RSS low but throughput high?

daniellee343

I'm working on a NUMA-related benchmark and have an issue that has been troubling me for a week.

I use numactl to pin a matrix multiplication workload to node0's CPUs and node1's memory, like this:
[screenshot: the numactl invocation pinning the workload]

However, I observe that a lot of memory traffic still goes to node0 (the throughput there is high), whether I measure with pcm or with numatop:
[screenshots: pcm-memory and numatop output showing heavy memory traffic on node0]

Then I track the workload's page mappings by observing /proc/<pid>/numa_maps and dumping the number of pages on both NUMA nodes throughout the run:
[plot: number of pages on node0 and node1 over the course of the run]
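(The counting itself is nothing fancy; roughly, it boils down to summing the N<node>=<pages> fields that the kernel prints for every mapping in numa_maps. A simplified sketch of that kind of script, not necessarily the exact one I ran:)
Code:
#!/usr/bin/env python3
# Simplified sketch: sum the N<node>=<pages> fields in /proc/<pid>/numa_maps
# to get the number of pages the process currently has on each NUMA node.
import sys
from collections import defaultdict

def pages_per_node(pid):
    counts = defaultdict(int)
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for field in line.split():
                # per-node fields look like "N0=123", "N1=456"
                if field.startswith("N") and "=" in field:
                    node, pages = field[1:].split("=", 1)
                    if node.isdigit():
                        counts[int(node)] += int(pages)
    return counts

if __name__ == "__main__":
    for node, pages in sorted(pages_per_node(sys.argv[1]).items()):
        print(f"node{node}: {pages} pages")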

The funny thing is, node0's number of active pages stays low the whole time.
I then assumed that some dynamic libraries might reside on node0 at runtime, so I tracked those libraries and evicted them from memory before the run. However, the result stayed the same.

The question is: I see that NumPy uses a well-optimized library called OpenBLAS, so why does OpenBLAS still have tons of small objects going to node0's memory rather than node1's? In contrast, a Go version that doesn't use OpenBLAS sends (almost) all of its memory accesses to node1; see the figure below. Is the Linux kernel doing something behind the scenes here?
[figure: per-node memory traffic for the Go version, almost entirely on node1]
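(Side note, in case it helps: one way I can think of to double-check, per array, where the backing pages actually end up is to query the kernel with move_pages(2) through libnuma, passing NULL for the nodes argument so it only reports each page's current node and moves nothing. This is just a rough sketch of that idea, not the script behind the plots above:)
Code:
# Rough sketch: ask the kernel which NUMA node each page of a NumPy array sits
# on, using the move_pages() wrapper from libnuma in query-only mode (nodes=NULL).
import ctypes
import resource
import numpy as np

PAGE = resource.getpagesize()
MPOL_MF_MOVE = 2  # from <numaif.h>; harmless here because nodes=NULL means query-only

libnuma = ctypes.CDLL("libnuma.so.1", use_errno=True)
libnuma.move_pages.argtypes = [ctypes.c_int, ctypes.c_ulong,
                               ctypes.POINTER(ctypes.c_void_p),
                               ctypes.POINTER(ctypes.c_int),
                               ctypes.POINTER(ctypes.c_int),
                               ctypes.c_int]
libnuma.move_pages.restype = ctypes.c_long

def page_nodes(arr):
    """Return the NUMA node of each page backing arr (negative values are errnos)."""
    start = arr.ctypes.data & ~(PAGE - 1)          # align down to a page boundary
    end = arr.ctypes.data + arr.nbytes
    n = (end - start + PAGE - 1) // PAGE
    pages = (ctypes.c_void_p * n)(*[start + i * PAGE for i in range(n)])
    status = (ctypes.c_int * n)()
    # pid=0 -> calling process; nodes=NULL -> only report where each page is
    if libnuma.move_pages(0, n, pages, None, status, MPOL_MF_MOVE) != 0:
        raise OSError(ctypes.get_errno(), "move_pages failed")
    return list(status)

a = np.random.rand(2000, 2000)   # the pages get written here, so they are allocated
nodes = page_nodes(a)
print({node: nodes.count(node) for node in set(nodes)})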
 


I like to think I'm pretty tech savvy. But, I'm at a bit of a loss.

I recognize a few words! I also don't actually understand what it is you're trying to do.

Don't let this discourage you too much, as we have some pretty smart (and experienced) people here.

I'm not sure if it'll help, but try elaborating a bit more on what you're using, maybe link to the projects involved, that sort of stuff.

It can't hurt...
 
@KGIII ....I am not sure I like your chances of getting a reply now.
I am active.
"I also don't actually understand what it is you're trying to do."
I am trying to move all memory accesses to the remote NUMA node (i.e., node1) and confirm that with numatop or other tools, if possible. However, it seems that NumPy, which uses OpenBLAS internally, does not move its accesses to node1.
This is the workload I am testing, just a simple matrix multiplication. I run it on a modern 2-NUMA-socket server and pin its memory to node1 like this:
Code:
sudo numactl --cpunodebind 0 --membind 1 python matmul.py 10000
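For reference, matmul.py boils down to something like this (a simplified sketch: multiply two NxN random matrices with NumPy, with N taken from the command line):
Code:
# matmul.py -- simplified sketch of the workload: C = A @ B for two NxN random
# matrices. With NumPy built against OpenBLAS, the multiply runs in dgemm.
import sys
import numpy as np

n = int(sys.argv[1])        # e.g. 10000 in the run above
a = np.random.rand(n, n)
b = np.random.rand(n, n)
c = a @ b
print(c[0, 0])              # use the result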
Then I observe the memory access pattern with
Code:
numatop
or with Intel PCM:
Code:
pcm-memory
Let me know if you have other questions. Appreciate all your help!
 
