WHY check sock[i]->sk_state in reuseport_select_sock?

ZZZZZHB · Feb 18, 2020

While reading how the Linux kernel (version 5.5.4) finds a TCP listener,
I found that when SO_REUSEPORT is enabled, the kernel searches reuseport_cb for a sk whose sk->sk_state is not equal to TCP_ESTABLISHED.

The code is as follows, in linux-5.5.4\net\core\sock_reuseport.c:302

if (!sk2) {
int i, j;
i = j = reciprocal_scale(hash, socks);
while (reuse->socks->sk_state == TCP_ESTABLISHED) {
i++;
if (i >= reuse->num_socks)
i = 0;
if (i == j)
goto out;
}
sk2 = reuse->socks;
}

Why check sk_state here?
Can sk_state be changed somewhere else?

Any help will be appreciated.

JasKinasis · Feb 18, 2020

The code in the post above has lost a few bits (the [i] array indices) thanks to your failure to include code-tags.

I've just tracked down the routine in question and copy/pasted the whole function for reference, and put it in a spoiler tag to keep it out of the way.

C:

/**
 * reuseport_select_sock - Select a socket from an SO_REUSEPORT group.
 * @sk: First socket in the group.
 * @hash: When no BPF filter is available, use this hash to select.
 * @skb: skb to run through BPF filter.
 * @hdr_len: BPF filter expects skb data pointer at payload data. If
 * the skb does not yet point at the payload, this parameter represents
 * how far the pointer needs to advance to reach the payload.
 * Returns a socket that should receive the packet (or NULL on error).
 */
struct sock *reuseport_select_sock(struct sock *sk,
				   u32 hash,
				   struct sk_buff *skb,
				   int hdr_len)
{
	struct sock_reuseport *reuse;
	struct bpf_prog *prog;
	struct sock *sk2 = NULL;
	u16 socks;

	rcu_read_lock();
	reuse = rcu_dereference(sk->sk_reuseport_cb);

	/* if memory allocation failed or add call is not yet complete */
	if (!reuse)
		goto out;

	prog = rcu_dereference(reuse->prog);
	socks = READ_ONCE(reuse->num_socks);
	if (likely(socks)) {
		/* paired with smp_wmb() in reuseport_add_sock() */
		smp_rmb();

		if (!prog || !skb)
			goto select_by_hash;

		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
			sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash);
		else
			sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len);

select_by_hash:
		/* no bpf or invalid bpf result: fall back to hash usage */
		if (!sk2) {
			int i, j;

			i = j = reciprocal_scale(hash, socks);
			while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
				i++;
				if (i >= reuse->num_socks)
					i = 0;
				if (i == j)
					goto out;
			}
			sk2 = reuse->socks[i];
		}
	}

out:
	rcu_read_unlock();
	return sk2;
}
EXPORT_SYMBOL(reuseport_select_sock);

And we're interested in this little snippet of it here:

C:

			i = j = reciprocal_scale(hash, socks);
			while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
				i++;
				if (i >= reuse->num_socks)
					i = 0;
				if (i == j)
					goto out;
			}
			sk2 = reuse->socks[i];
		}
	}

out:
	rcu_read_unlock();
	return sk2;
}

Unless I'm reading any of this incorrectly, it looks to me as if it's going through an array of sockets in the reuse struct that are all bound to the same port. It loops through them and checks each socket's state.

i and j are set to the same value based on the result of the reciprocal_scale function.
Not sure exactly what that does, because I haven't looked at it. But i and j are going to be used in the while loop that we're looking at. i for indexing the array of sockets and j as a limit. Both are being set to the same value initially. So the reciprocal_scale function must be determining which one of the socket connections in the array to start with - based on some criteria or other.
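For reference, here is a minimal userspace sketch of the arithmetic behind reciprocal_scale as I understand it from the kernel headers: it maps a 32-bit value into the range [0, ep_ro) with a multiply and a shift instead of a division. The kernel uses its own u32/u64 types; the stdint.h types here are my substitution so the sketch compiles outside the kernel.

```c
#include <stdint.h>

/*
 * Userspace copy of the reciprocal_scale() arithmetic (assumption: this
 * mirrors the kernel helper; uint32_t/uint64_t stand in for u32/u64).
 * Scales val, which can be anywhere in the 32-bit range, down to an
 * index in [0, ep_ro) without using a division.
 */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}
```

So with 4 sockets in the group, a hash of 0 starts the search at index 0, a hash of 0x80000000 starts it at index 2, and a hash of 0xFFFFFFFF starts it at index 3. The result can never reach ep_ro itself, which is why it is safe to use as an array index.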

In the while loop - If the routine finds a socket that is NOT in the TCP_ESTABLISHED state it will set sk2 to that socket and that will be returned.

Whilst looping - if the condition i>=reuse->num_socks is met - i is reset to zero.
OK, that seems logical.

Whilst looping, if the condition i==j is met and all sockets are still in the TCP_ESTABLISHED state, the code jumps to the out label via goto out;
Which makes sense. When we entered the loop i and j were the same - the state was checked and i was incremented.
So if we reach the state where i==j again, it means we're back at the socket we started with.
And since sk2 is initialised to NULL and never assigned in that case, a NULL sk2 is returned - which I'm guessing indicates that there are no sockets available for re-use.

That's what I think is going on!

So the snippet we're looking at takes a pointer to a struct containing a pointer to an array of structs relating to socket connections. It runs a routine to determine which socket to look at first.

It then loops through all of the socket connections until it either:
A. Finds a socket that is not in the TCP_ESTABLISHED state
or
B. Gets back to the first socket it looked at.

If it finds a socket that is NOT in the TCP_ESTABLISHED state - then that socket is identified as available for re-use and is returned in the return value sk2.
If it gets back to the first socket it looked at and they are ALL in the TCP_ESTABLISHED state - its return value sk2 is NULL, indicating that there are currently no sockets available for reuse.
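The two outcomes above can be tried out with a small userspace sketch of the same wrap-around search. Everything here is a stand-in of my own and not kernel code: struct fake_sock, the ST_* constants and pick_index are hypothetical names, and the function returns an index (or -1) where the kernel returns a socket pointer (or NULL).

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's socket states. */
enum { ST_ESTABLISHED = 1, ST_OTHER = 2 };

/* Hypothetical stand-in for struct sock: only the state matters here. */
struct fake_sock {
	int state;
};

/*
 * Same loop shape as the snippet from reuseport_select_sock():
 * start at index `start`, skip sockets in ST_ESTABLISHED, wrap at the
 * end of the array, and give up once we are back where we began.
 * Assumes num_socks > 0 and 0 <= start < num_socks.
 * Returns the chosen index, or -1 if every socket is established.
 */
static int pick_index(const struct fake_sock *socks, int num_socks, int start)
{
	int i = start;
	int j = start;

	while (socks[i].state == ST_ESTABLISHED) {
		i++;
		if (i >= num_socks)
			i = 0;		/* wrap around to the first socket */
		if (i == j)
			return -1;	/* full circle: nothing selectable */
	}
	return i;
}
```

With four sockets where only the one at index 2 is not established, pick_index returns 2 whether the search starts at index 0 or at index 3 (the latter wraps past the end of the array). If all four are established it returns -1, which corresponds to the NULL return in the kernel routine.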

So, yes - if you ask me - it seems that check for the TCP_ESTABLISHED state is a logical and valid thing to do.

Can the state change elsewhere? - Yes - I assume it can because we're dealing with pointers to data-structures that are in memory.
But this routine never modifies the state itself. It merely reads the state of each socket at that particular time to see if any are NOT in the TCP_ESTABLISHED state and therefore available for reuse.

Does that make sense?
