weird high CPU usage

Diputs

Running "top" on our Linux servers (RedHat 7), I am seeing this behaviour:
Our monitoring tool reports high CPU usage over a long time, for just 1 machine (out of several tens of machines).
I look in top to find the process or processes using lots of CPU ...

... and there are none. Why? Because most of the CPU percentage is taken by "sy",
a value which is usually 0.0 or 0.4.
But now it ranges all over, up to 80.
So 80% of the CPU power is used by this "component"
and no process is taking lots of CPU, basically because this "sy" thing is not just processes, but something else.

Bounced the machine: same issue.
Stopped all of OUR software, that is, the software we maintain (which is non-OS related): same behaviour. None of our processes running, CPU usage still high.

How do I find what is going on here ? I have root access, but I am NOT the Linux admin.


As an illustration (not MY screenshot, because I am not legally allowed to share one anyway),
just an example of such:


Line number 3 of TOP displays :
" us " user CPU which is sometimes high
" sy " system CPU which is almost always low but now in our case very high
" id " which is usually pretty high but just a bit lower when " us " is being used
 
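For reference, top derives that third line from the counters in /proc/stat, so the same split can be recomputed by hand. This is only a rough sketch: the function name cpu_split is made up for illustration, and it assumes the usual field order of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# cpu_split BEFORE AFTER   (hypothetical helper, not part of any package)
# BEFORE and AFTER are two aggregate "cpu ..." lines taken from /proc/stat.
# Prints the share of user, system and idle time between the two samples.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) before[i] = $i }
        NR == 2 {
            total = 0
            # fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9 && i <= NF; i++) {
                d[i] = $i - before[i]
                total += d[i]
            }
            if (total == 0) { print "no ticks elapsed"; exit 1 }
            printf "us=%.1f sy=%.1f id=%.1f\n", 100 * d[2] / total, 100 * d[4] / total, 100 * d[5] / total
        }'
}

# On a live machine, take two samples one second apart:
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   cpu_split "$a" "$b"
```

If "sy" comes out high here as well, the time really is being spent in kernel code rather than in any single user process that top could list.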


View Kernel logs with
Code:
dmesg | less
or
Code:
tail -f /var/log/messages

You can check for interrupts with
Code:
cat /proc/interrupts
 
If sy is the time spent running kernel code, and you have "Stopped all of OUR software", then one way of testing is to boot the machine on another kernel, which may help if it is a kernel problem.
 

I can't detail too much on the service itself which I named 'xxx' here, but it's some kind of security related thing

dmesg has many lines like:
xxx [3879]: segfault at 0 ip 0000... sp 0000.... error 4 in libensfwnetfilterengine.so [3f5999...]

/var/log/messages has many lines like:
systemd: xxx.service holdoff time over, scheduling restart.
Then it starts service xxx,
and 20 seconds later it repeats the same.
It seems to constantly stop and start service 'xxx'.
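
To put a number on that restart loop, the "holdoff time over" messages can be counted per unit. A rough sketch, assuming the log lines look like the systemd line quoted above; restart_counts is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# restart_counts: read syslog-style lines on stdin and print how many times
# systemd scheduled a restart for each unit, busiest unit first.
restart_counts() {
    grep 'holdoff time over, scheduling restart' |
        sed -n 's/.*systemd[^:]*: \([^ ]*\) holdoff time over.*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $2, $1 }'
}

# On the affected machine (needs read access to the log):
#   restart_counts < /var/log/messages
```

A unit that shows up hundreds of times is a good candidate to look at with systemctl status before touching anything else.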

/proc/interrupts also showed some weird differences (compared with a normal server),
but I don't have a copy of that output at hand right now.
There was one interrupt that was very active on the problem server.
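
One way to pin down which interrupt is the busy one is to save /proc/interrupts twice, a few seconds apart, and rank the lines by how much their counters grew. A minimal sketch, assuming the usual layout (a header row with one column per CPU, then one row per interrupt); irq_growth is a name invented here.

```shell
#!/bin/sh
# irq_growth FILE1 FILE2   (hypothetical helper)
# FILE1 and FILE2 are two saved copies of /proc/interrupts.
# Prints "growth label" per interrupt, fastest growing first.
irq_growth() {
    awk '
        FNR == 1 { ncpu = NF; next }        # header row: one column per CPU
        {
            label = $1
            sub(/:$/, "", label)
            sum = 0
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) sum += $i
            if (FNR == NR) first[label] = sum    # still reading the first file
            else print sum - first[label], label # second file: print the growth
        }' "$1" "$2" | sort -rn
}

# On a live machine:
#   cat /proc/interrupts > /tmp/irq.1; sleep 5; cat /proc/interrupts > /tmp/irq.2
#   irq_growth /tmp/irq.1 /tmp/irq.2 | head
```

Running the same thing on a healthy server gives a baseline to compare against.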
 

The CPU is still being used when our software isn't running, so that excludes a lot, but there's also other software running, security type software. I can't really stop that.
I also can't do anything with the kernel; we're not supposed to mess with that.
Basically I'm a user with root privileges, so technically I can, but I'm not expected to do so.
 
It seems solved. I can't be sure, because I'm not the only one working on this machine, and the others aren't the kind of people who share a lot, so anything is possible. They also don't care, but that should be to my advantage: if they don't care, they don't solve issues that they think aren't issues. All while an independent monitoring tool reports this issue. But they know everything, of course.

So I stopped a service manually. Actually, it is the same service that caused functional problems on OTHER machines, but ONLY on this server did it seem to cause some kind of CPU leak, or whatever makes a core process use way too much CPU.
 
