Linux Troubleshooting

Mark Taguiad Aug 13, 2025 · 25 min read

CPU and Processes

Understanding top command

$ top

top - 18:47:04 up 22:45,  2 users,  load average: 0.46, 0.74, 0.85
Tasks: 410 total,   1 running, 407 sleeping,   0 stopped,   2 zombie
%Cpu(s):  7.1 us,  2.4 sy,  4.1 ni, 86.2 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  47961.0 total,  32654.2 free,  11221.4 used,   7702.5 buff/cache
MiB Swap:  20479.5 total,  20479.4 free,      0.1 used.  36739.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  22671 mcbtagu+  14  -6   12.2g 968172 437988 S  47.2   2.0 171:20.29 firefox-bin

Header line (system status)

top - 18:47:04 up 22:45,  2 users,  load average: 0.46, 0.74, 0.85

18:47:04 - current time
up 22:45 - machine has been running for 22 hours 45 minutes
2 users - two logged-in users
load average: 0.46, 0.74, 0.85
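- Average number of runnable processes (on Linux this also counts processes in uninterruptible I/O wait) over:
  - last 1 min
  - last 5 min
  - last 15 min
- Compare these against the core count: on a multi-core system these numbers are low, so the system is not stressed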
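
As a rough sketch of that comparison, the snippet below divides the 1-minute load average by the number of online CPUs. The helper name load_per_core is made up for illustration, and treating a ratio near or above 1.0 as "busy" is a rule of thumb, not a hard limit.

```shell
#!/usr/bin/env bash
# Compare the 1-minute load average with the number of online CPUs.
# load_per_core is a hypothetical helper name used for this sketch.

load_per_core() {
    local load1 cores
    # /proc/loadavg fields: 1-min, 5-min, 15-min, running/total, last PID
    read -r load1 _ < /proc/loadavg
    cores=$(nproc)
    # awk does the floating-point division that bash cannot do natively
    awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

echo "1-min load per core: $(load_per_core)"
```

A value well below 1.00 means the CPUs have headroom; a value that stays above 1.00 suggests tasks are queuing.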

Tasks (process summary)

Tasks: 410 total,   1 running, 407 sleeping,   0 stopped,   2 zombie

410 total - total processes/threads
1 running - only one actively using CPU right now
407 sleeping - normal; most processes wait for work
2 zombie - processes that exited but haven’t been cleaned up

CPU usage

%Cpu(s):  7.1 us,  2.4 sy,  4.1 ni, 86.2 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st

us (7.1%) - user programs
sy (2.4%) - kernel/system work
ni (4.1%) - low-priority (nice) processes
id (86.2%) - idle CPU
wa (0.1%) - waiting on disk I/O
hi / si - hardware/software interrupts
st - stolen by hypervisor (VMs)

Memory usage

MiB Mem :  47961.0 total,  32654.2 free,  11221.4 used,   7702.5 buff/cache

Total: ~48 GB
Free: ~32 GB (very high)
Used: ~11 GB (actual app usage)
buff/cache: ~7.7 GB (filesystem cache, reclaimable)

MiB Swap:  20479.5 total,  20479.4 free,      0.1 used.  36739.6 avail Mem

Swap almost unused - excellent
avail Mem ~36 GB - memory pressure is basically zero

Process list

PID     USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  COMMAND
22671   mcbtag+  14  -6   12.2g  968172 437988 S  47.2   2.0 171:20.29

Identity

PID 22671 - process ID
USER mcbtag+ - user
COMMAND - (cut off, but this is the process name)

Priority

PR 14 - scheduler priority (20 + NI for normal tasks)
NI -6 - higher priority than normal
Negative nice = process is favored by scheduler

Memory

VIRT 12.2g - virtual memory (address space)
RES 968172 KiB (~945 MiB) - actual physical RAM used
SHR 437988 KiB (~428 MiB) - shared memory (counts once globally)

State

S - sleeping (can still show CPU usage due to sampling)

Usage

%CPU 47.2
- About half of one CPU core
- Totally fine on a multi-core system
%MEM 2.0
TIME+ 171:20
- Total CPU time used since start

Some top commands while running

Command	Description
`P`	sort by CPU
`M`	sort by memory
`1`	show per-CPU usage
`k`	kill a process
`H`	show threads
`q`	quit

Process States

ps -eo pid,stat,comm | head
PID STAT COMMAND
1 Ss   systemd
2 S    kthreadd
3 S    pool_workqueue_release
4 I<   kworker/R-rcu_gp
5 I<   kworker/R-sync_wq
6 I<   kworker/R-kvfree_rcu_reclaim
7 I<   kworker/R-slub_flushwq
8 I<   kworker/R-netns
10 I<   kworker/0:0H-events_highpri

Primary States

Symbol	Type	Meaning / Description	Notes / Examples
R	Primary	Running	Actively using CPU or ready to run. Check `%CPU` to see load.
S	Primary	Sleeping (interruptible)	Waiting for an event or input. Normal for most processes.
D	Primary	Uninterruptible sleep	Waiting on I/O (usually disk or a network filesystem). Signals, including SIGKILL, are not delivered until the wait ends. High numbers = I/O bottleneck.
Z	Primary	Zombie	Process finished but parent hasn’t reaped it. No CPU usage, but occupies PID.
T	Primary	Stopped	Paused by signal or debugger (e.g., `SIGSTOP`).
X	Primary	Dead	Shouldn’t appear normally. Process is terminated.
I	Primary	Idle kernel thread	Seen on kernel workers such as the `kworker` entries above.

Common Modifiers

Symbol	Type	Meaning / Description	Notes
s	Modifier	Session leader	Process started a session (usually shells, daemons).
l	Modifier	Multi-threaded	Process has multiple threads (POSIX threads).
+	Modifier	Foreground process group	Attached to terminal and can receive input signals.
<	Modifier	High priority	Negative nice value (not nice to other users).
N	Modifier	Low priority / nice	Positive nice value (lower scheduler priority).
L	Modifier	Has pages locked in memory	Used in real-time processes.
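
To make the two tables easier to apply, here is a small sketch that expands a STAT string character by character. decode_stat is a hypothetical helper written for this article, not part of procps, and the wording of each label is my own shorthand for the table entries.

```shell
#!/usr/bin/env bash
# Decode a ps STAT string such as "S<l+" into plain words.
# decode_stat is a hypothetical helper, not a procps command.

decode_stat() {
    local stat=$1 out="" i ch
    for ((i = 0; i < ${#stat}; i++)); do
        ch=${stat:i:1}
        case $ch in
            R) out+="running " ;;
            S) out+="sleeping " ;;
            D) out+="uninterruptible-sleep " ;;
            Z) out+="zombie " ;;
            T) out+="stopped " ;;
            I) out+="idle-kernel-thread " ;;
            X) out+="dead " ;;
            s) out+="session-leader " ;;
            l) out+="multi-threaded " ;;
            '+') out+="foreground " ;;
            '<') out+="high-priority " ;;
            N) out+="low-priority " ;;
            L) out+="locked-pages " ;;
            *) out+="unknown($ch) " ;;
        esac
    done
    # Drop the trailing space left by the last label
    echo "${out% }"
}

decode_stat "S<l+"   # prints: sleeping high-priority multi-threaded foreground
```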

Top CPU User

Status

ps aux --sort=-%cpu | head
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mcbtagu+   22671 17.4  2.0 12941056 1023280 ?    S<l  Feb04 239:29 /usr/lib/firefox/firefox
mcbtagu+    3742  8.6  0.5 1755524 286288 tty1   S<l+ Feb04 132:35 cosmic-comp

If one process is always on top, investigate it.

On a multi-core system, also check whether a single CPU is at or near 100% utilization; the application may be single-threaded. To check, press 1 inside top to display the per-CPU load.

top - 21:45:59 up 11 days, 10:13,  2 users,  load average: 0.34, 0.35, 0.32
Tasks: 173 total,   2 running, 171 sleeping,   0 stopped,   0 zombie
%Cpu0  :  7.1 us,  5.1 sy,  0.0 ni, 86.9 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu1  : 12.9 us,  5.1 sy,  0.0 ni, 81.4 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
MiB Mem :   3792.2 total,    148.5 free,   1690.5 used,   2513.0 buff/cache
MiB Swap:   3792.0 total,   3058.0 free,    734.0 used.   2101.7 avail Mem

Multi-threaded

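To check whether an application is multi-threaded, list its threads. Each output line is one thread (LWP, fourth column), and the NLWP column (10 in this sample) is the total thread count of the process.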

ps -eLf

root     4047471 4047447 4047471  0   10 Feb04 ?        00:00:06 /beszel serve --http=0.0.0.0:8090
root     4047471 4047447 4047523  0   10 Feb04 ?        00:00:07 /beszel serve --http=0.0.0.0:8090
root     4047471 4047447 4047524  0   10 Feb04 ?        00:00:00 /beszel serve --http=0.0.0.0:8090
root     4047471 4047447 4047525  0   10 Feb04 ?        00:00:00 /beszel serve --http=0.0.0.0:8090
root     4047471 4047447 4047627  0   10 Feb04 ?        00:00:07 /beszel serve --http=0.0.0.0:8090
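
If only the thread count matters, a shorter route is the Threads line in /proc/<PID>/status. The sketch below assumes a Linux /proc filesystem; thread_count is a made-up helper name.

```shell
#!/usr/bin/env bash
# Count the threads of a process from /proc/<PID>/status.
# thread_count is a hypothetical helper name.

thread_count() {
    local pid=$1
    # The "Threads:" line holds the number of threads in the process
    awk '/^Threads:/ { print $2 }' "/proc/$pid/status"
}

# $$ is the PID of the current shell, used here only as a safe demo target
echo "threads in this shell: $(thread_count $$)"
```

A result greater than 1 means the process is multi-threaded.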

PIDSTAT

pidstat is another tool for checking CPU usage; it monitors individual tasks (processes).

pidstat

Average:        USER       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:        root        17    0.00    0.17    0.00    0.17    0.17     -  rcu_preempt
Average:        root      1049    2.00    0.67    0.00    0.00    2.66     -  dockerd

USER – The owner of the process.
PID – The process ID.
%usr – The percentage of CPU time spent in user mode.
%system – The percentage of CPU time spent in kernel mode.
%guest – The percentage of CPU time spent running virtual CPUs (guest OS).
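%wait – The percentage of CPU time the task spent waiting to run (runnable but queued for a CPU). This is not I/O wait.
%CPU – The total CPU usage of the process, roughly %usr + %system + %guest. %wait is not included: for dockerd above, 2.00 + 0.67 ≈ 2.66.
CPU – The processor the task is attached to (shown as - in the Average lines).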

To show the I/O statistics.

pidstat -d

10:30:14 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
10:30:14 PM     0         1     29.68     52.61      0.88       0  systemd
10:30:14 PM     0       258      0.00      2.78      0.00       0  jbd2/dm-0-8

Time (10:30:14 PM) – Timestamp of the measurement.
UID (0) – User ID of the process owner.
PID (1) – Process ID.
kB_rd/s (29.68) – Kilobytes read per second from disk by the process.
kB_wr/s (52.61) – Kilobytes written per second to disk by the process.
kB_ccwr/s (0.88) – Kilobytes per second whose write to disk was cancelled by the task, for example when it truncates dirty page cache.
iodelay (0) – Block I/O delay of the task, measured in clock ticks. It includes time spent waiting for synchronous block I/O and for swap-in to complete.
Command (systemd) – The process name.

Other useful commands.

# show a specific application, by command name or by PID
pidstat -C <application_name>
pidstat -p <PID>

perf

perf is a powerful tool for profiling user-space applications and kernel code by sampling hardware and software performance counters.

perf stat -p 704 sleep 5

 Performance counter stats for process id '704':

                23      context-switches                 #      4.6 cs/sec  cs_per_second
                 0      cpu-migrations                   #      0.0 migrations/sec  migrations_per_second
                 0      page-faults                      #      0.0 faults/sec  page_faults_per_second
          5,010.95 msec task-clock                       #      1.0 CPUs  CPUs_utilized
        94,030,265      branch-misses                    #      2.9 %  branch_miss_rate         (66.55%)
     3,218,655,134      branches                         #    642.3 M/sec  branch_frequency     (66.69%)
    16,847,526,866      cpu-cycles                       #      3.4 GHz  cycles_frequency       (66.66%)
    16,123,382,223      instructions                     #      1.0 instructions  insn_per_cycle  (66.57%)

       5.011242868 seconds time elapsed

task-clock: CPU time the task actually spent running (here 5,010.95 ms over about 5 s, i.e. 1.0 CPUs utilized)
context-switches: number of times the operating system switched execution from this task to another
cpu-migrations: number of times the process was moved from one CPU core to another
page-faults: number of times the program accessed a memory page that was not mapped yet; minor faults are resolved in RAM, major faults require reading from disk

To record perf performance data.

perf record -g -p 704 sleep 5
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.962 MB perf.data (19944 samples) ]

To view the saved data.

perf report

Samples: 19K of event 'cpu/cycles/P', Event count (approx.): 16434851149
  Children      Self  Command  Shared Object      Symbol
+  100.00%     0.00%  yes      yes                [.] 0x000055f8fe8958f5
+  100.00%     0.00%  yes      libc.so.6          [.] __libc_start_main_impl (inlined)
+  100.00%     0.00%  yes      libc.so.6          [.] call_init (inlined)
+  100.00%     0.00%  yes      libc.so.6          [.] 0x00007faa3c427741
+   99.24%     0.00%  yes      libc.so.6          [.] 0x00007faa3c494b04
+   99.18%     0.00%  yes      libc.so.6          [.] 0x00007faa3c494ade
+   70.93%     6.43%  yes      [kernel.kallsyms]  [.] entry_SYSCALL_64_after_hwframe
+   64.50%     1.49%  yes      [kernel.kallsyms]  [.] do_syscall_64
+   53.81%     0.08%  yes      libc.so.6          [.] vmsplice
+   53.78%     0.00%  yes      yes                [.] 0x000055f8fe89579c
+   45.81%     0.04%  yes      libc.so.6          [.] splice
+   45.79%     0.00%  yes      yes                [.] 0x000055f8fe8957db

Signals

Signals are used to communicate with running processes. They instruct a process to stop, pause, resume, or clean up.

The kernel can send signals, for instance, when a process attempts to divide by zero it receives the SIGFPE signal.

SIGSTOP

To pause (stop) a process.

[arch@mrlgarchforge tmp]$ yes > /dev/null &
[1] 1445
[arch@mrlgarchforge tmp]$ kill -SIGSTOP 1445

SIGCONT

To resume.

[arch@mrlgarchforge tmp]$ kill -SIGCONT 1445

SIGINT

This is the signal sent when Ctrl+C is pressed.

SIGTERM and SIGQUIT

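SIGTERM is the polite way to shut down a process (graceful kill): the process can catch it and clean up before exiting. SIGQUIT (Ctrl+\) also asks the process to terminate, but by default it produces a core dump as well, which is useful when the process is misbehaving.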

SIGKILL

This is the true force kill. SIGKILL cannot be caught, blocked, or ignored; the kernel terminates the process without giving it a chance to clean up.
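
The sketch below walks one throwaway sleep process through SIGSTOP, SIGCONT, and SIGTERM and reads its state from /proc after each step. The helper names proc_state and signal_demo are made up for this example, and the short sleeps are only there to give the kernel time to apply each signal.

```shell
#!/usr/bin/env bash
# Walk one throwaway process through SIGSTOP, SIGCONT and SIGTERM.
# proc_state and signal_demo are hypothetical helper names.

proc_state() {
    # Second field of the "State:" line, e.g. S, T, R
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

signal_demo() {
    local pid status=0
    sleep 60 > /dev/null &
    pid=$!

    kill -STOP "$pid"
    sleep 0.2
    echo "after SIGSTOP: $(proc_state "$pid")"

    kill -CONT "$pid"
    sleep 0.2
    echo "after SIGCONT: $(proc_state "$pid")"

    kill -TERM "$pid"
    # wait returns 128 + signal number when the child died from a signal
    wait "$pid" || status=$?
    echo "exit status: $status"
}

signal_demo
```

The stopped process shows state T, and the exit status after SIGTERM is 143 (128 + 15), which is a quick way to tell from a script that a child was terminated by a signal.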

CPU & I/O

Another common performance killer is I/O (input/output), which can range from disk to network. A common tool for checking it is `vmstat`.

vmstat

vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 7  0     52 28477860 916668 7946588    0    0   113   886 3986   15  7  2 88  2  0  0

r (run queue) - Number of runnable processes (running or waiting for a CPU). A problem exists if r is consistently greater than the number of CPU cores.
b (blocked) - Processes waiting on I/O. High b = disk or network I/O bottleneck.
swpd - Amount of swap currently used (KB).
free - Completely unused RAM.
buff - Buffer cache (metadata, block I/O buffers).
cache - Page cache (file contents cached in RAM). Cache shrinking + swap growing → memory trouble.
si (swap in) - KB/s swapped from disk into RAM.
so (swap out) - KB/s swapped from RAM to disk. Any sustained non-zero values → severe memory pressure.
bi / bo (blocks in/out) - Disk read/write activity. High values ≠ bad by default. High with high wa = I/O problem.
in (interrupts/sec) - Hardware + software interrupts. Network traffic increases this a lot.
cs (context switches/sec) - How often CPU switches between processes/threads. Extremely high values → too many threads, locks, or I/O wakeups.
us - user CPU (apps)
sy - kernel CPU (syscalls, I/O handling). Network-heavy systems often show high sy.
id - idle CPU
wa - CPU waiting on I/O
st (steal) - Time stolen by hypervisor (VMs only)
gu (guest) - Time spent running KVM guest code (virtual CPUs)

Symptom	Likely Cause
High `r`, high `us`	CPU-bound
High `b`, high `wa`	Disk I/O bottleneck
High `sy`, high `in`	Network I/O
Non-zero `si/so`	Memory pressure
High `st`	VM host contention

Disk level I/O

iostat

iostat -xz
Linux 6.17.9-76061709-generic (tags-p51) 	02/06/2026 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.95    0.70    1.89    2.19    0.00   88.27

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0             0.00      0.03     0.00   0.00    0.21    21.98    0.00      0.00     0.00   0.00    0.00     0.44    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme0n1          0.98     30.17     0.24  19.51    0.20    30.77    6.52    306.79     5.36  45.12    4.69    47.07    0.00      0.00     0.00   0.00    0.00     0.00    0.38    0.71    0.03   0.40
nvme1n1          0.03      2.38     0.01  16.48    0.30    71.60    0.19      4.13     0.02  10.80    3.53    22.11    0.00      0.00     0.00   0.00    0.00     0.00    0.03    0.75    0.00   0.03
sda              0.55     74.06     0.40  42.18   18.06   134.38    0.58    540.84     0.93  61.83 1037.08   937.41    0.00      0.00     0.00   0.00    0.00     0.00    0.01  701.83    0.61   3.93
zram0            0.00      0.01     0.00   0.00    0.02    20.34    0.00      0.00     0.00   0.00    0.00     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

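Critical iostat columns.

Device  r/s  w/s  rkB/s  wkB/s  r_await  w_await  aqu-sz  %util

%iowait - Percentage of CPU time spent idle while waiting on outstanding disk I/O.
%util - Percentage of elapsed time the device was busy servicing requests.
r_await / w_await - Average time (ms) a read or write request spends waiting in the queue plus being serviced. In the sample above, the w_await of 1037.08 ms on sda stands out.
aqu-sz - Average queue depth; shows how many I/Os are waiting.
wareq-sz - Average write request size (KB); shows whether writes are big or small.
r/s / w/s - IOPS (operations per second).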

Identify Processes Stuck in I/O

In top, look for processes in state D, or run this command.

ps aux | awk '$8 ~ /^D/ {print}'

Priority and Nice Values

When a process is starving others, or background jobs steal CPU, it is good practice to lower the priority of the offending application.

Status

To check priority.

ps -o pid,ni,comm -p PID

Lower priority.

renice 10 -p PID

Raise priority (a negative nice value requires root).

sudo renice -5 -p PID
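
To see the effect without touching a real workload, the following sketch lowers the priority of a throwaway sleep process and reads the nice value back with ps. renice_demo is a made-up name, and the sketch assumes the procps ps and a renice that accepts the same "renice 10 -p PID" form used above.

```shell
#!/usr/bin/env bash
# Lower the priority of a throwaway process and confirm the new nice value.
# renice_demo is a hypothetical helper name.

renice_demo() {
    local pid before after
    sleep 30 > /dev/null &
    pid=$!

    before=$(ps -o ni= -p "$pid" | tr -d ' ')
    renice 10 -p "$pid" > /dev/null      # lowering priority needs no root
    after=$(ps -o ni= -p "$pid" | tr -d ' ')

    # Clean up the demo process quietly
    kill "$pid"
    wait "$pid" 2> /dev/null

    echo "nice before: $before, after: $after"
}

renice_demo
```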

Zombie Processes

A zombie process is a child process that has completed its execution and terminated but still has an entry in the system’s process table.

Find zombies and parent process

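ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
ps -o ppid= -p ZOMBIE_PID

A zombie is already dead and cannot be killed. Fix (or restart) the parent process so that it reaps the child; do not target the zombie itself.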
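
As an alternative that shows each zombie next to its parent in one pass, the sketch below scans /proc directly. list_zombies is a hypothetical helper, and the approach assumes a Linux /proc layout with Name, State, Pid, and PPid lines in each status file.

```shell
#!/usr/bin/env bash
# List zombie processes with their parent, reading /proc directly.
# list_zombies is a hypothetical helper name.

list_zombies() {
    local f
    for f in /proc/[0-9]*/status; do
        # Processes can exit mid-scan, so read errors are ignored
        awk '
            /^Name:/  { name = $2 }
            /^State:/ { state = $2 }
            /^Pid:/   { pid = $2 }
            /^PPid:/  { ppid = $2 }
            END { if (state == "Z") print pid, ppid, name }
        ' "$f" 2> /dev/null
    done
}

# Prints "PID PPID NAME" for each zombie; prints nothing on a clean system
list_zombies
```

The second column is the process to investigate, since that is the parent that has not reaped its child.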

ulimit

ulimit is used to view and set limits on the system resources that a user's processes can consume.

Soft Limits

These are the values the kernel actually enforces; the user can still adjust them. The hard limit acts as the ceiling for the soft limit.

[arch@mrlgarchforge ~]$ ulimit -Sa
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15544
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15544
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

Hard Limits

The maximum value for the soft limit, and therefore the maximum amount of a resource a user can consume. Any user can lower a hard limit, but only root can raise it.

[arch@mrlgarchforge ~]$ ulimit -Ha
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15544
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 524288
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15544
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

Process Number

Limit the maximum number of processes the user can run.

ulimit -u 100

File Size

Limit the maximum size of files the user can create.

# limit to 100 blocks (100 KB in bash, where a block is 1024 bytes)
ulimit -f 100

Virtual Memory

Limit the maximum virtual memory available to a process.

# virtual memory: 1000 KB
ulimit -v 1000

Opened Files

Limit the number of simultaneously open files (file descriptors).

ulimit -n 10
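
Limits set with ulimit apply to the current shell and everything it starts afterwards, so experimenting in your login shell can be awkward to undo. A minimal sketch of a safer way to try a value is to set it inside a subshell; limit_in_subshell is a made-up name and 64 is an arbitrary demo value, assumed to be below the hard limit.

```shell
#!/usr/bin/env bash
# Lower the soft open-files limit inside a subshell so the parent shell
# keeps its original limits. limit_in_subshell is a hypothetical helper.

limit_in_subshell() {
    (
        ulimit -S -n 64      # soft limit only; the hard limit is untouched
        ulimit -S -n         # print the new soft limit
    )
}

echo "parent soft limit:            $(ulimit -S -n)"
echo "subshell soft limit:          $(limit_in_subshell)"
echo "parent soft limit afterwards: $(ulimit -S -n)"
```

The first and last lines print the same value, showing that the change never left the subshell.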

Other limits

Flag	Meaning	Short Description
`-a`	all	Show all current limits
`-H`	hard	Show/set hard limits
`-S`	soft	Show/set soft limits
`-c`	core file size	Max size of core dump files
`-d`	data seg size	Max process data segment size
`-e`	scheduling priority	Max scheduling priority (`nice`)
`-f`	file size	Max size of files created
`-i`	pending signals	Max pending signals
`-l`	locked memory	Max locked-in-memory size
`-m`	resident set size	Max resident memory size
`-n`	open files	Max number of open file descriptors
`-p`	pipe size	Pipe buffer size
`-q`	message queues	Max bytes in POSIX message queues
`-r`	realtime priority	Max real-time scheduling priority
`-s`	stack size	Max stack size
`-t`	CPU time	Max CPU time per process
`-u`	user processes	Max number of user processes
`-v`	virtual memory	Max virtual memory available
`-x`	file locks	Max number of file locks
`-T`	threads	Max number of threads
`-P`	pseudoterminals	Max PTYs
`-b`	socket buffers	Max socket buffer size
`-k`	kqueues	Max kqueues allocated
`-w`	swaps	Max swap space

Finding PID using Port

netstat

netstat -ltnup | grep 22
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      578/sshd: /usr/bin/
tcp6       0      0 :::22                   :::*                    LISTEN      578/sshd: /usr/bin/

ss

ss -ltnup 'sport = :22'
Netid   State    Recv-Q   Send-Q     Local Address:Port      Peer Address:Port   Process
tcp     LISTEN   0        128              0.0.0.0:22             0.0.0.0:*       users:(("sshd",pid=578,fd=6))
tcp     LISTEN   0        128                 [::]:22                [::]:*       users:(("sshd",pid=578,fd=7))

lsof

lsof -i :22
COMMAND   PID USER FD   TYPE DEVICE SIZE/OFF NODE NAME
sshd      578 root 6u  IPv4   8277      0t0  TCP *:ssh (LISTEN)
sshd      578 root 7u  IPv6   8279      0t0  TCP *:ssh (LISTEN)
sshd-sess 592 root 7u  IPv4   9233      0t0  TCP mrlgarchforge:ssh->192.168.254.14:38796 (ESTABLISHED)
sshd-sess 607 arch 7u  IPv4   9233      0t0  TCP mrlgarchforge:ssh->192.168.254.14:38796 (ESTABLISHED)

Get Process ID

Show Background Process

jobs
[1]+  Running                    yes > /dev/null &

By Application Name

ps aux | grep firefox

By File

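fuser -v text.txt    # list processes using the file
fuser -k text.txt    # kill every process using the file
fuser -cv text.txt   # list processes using the filesystem the file is on

lsof | { head -1 ; grep text.txt ; }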

Killing a running application

Kill by application name

pkill firefox
killall firefox
kill $(pgrep firefox)   # kill takes PIDs, not names

Terminate the process using its PID

pgrep firefox
22671
# or
ps aux | grep firefox
mcbtagu+   22671 13.6  1.9 12791920 936552 ?     R<l  Feb04 161:06 /usr/lib/firefox/firefox

# kill takes a PID; pkill and killall match process names instead
kill 22671

Forceful Termination

# by PID
kill -9 22671

# by name
pkill -9 firefox
killall -9 firefox

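Kill all applications run by a user

# list the user's processes with their process group IDs (PGID)
ps -o pid,pgid,sess,cmd -U your_username
# signal a whole process group; repeat per PGID
kill -SIGTERM -- -<PGID>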

Kill process tree.

# get pgid
ps -ejf | grep <application_name>
# or
pgrep -g [pgid_number]

# kill tree
kill -SIGTERM -- -<PGID>

Kill by Port

lsof -t -i tcp:80 | xargs kill
ss -tlnp | grep -Po ':80\s.*pid=\K\d+(?=,)' | xargs kill
netstat -tlnp | grep -Po ':80\s.*LISTEN.*?\K\d+(?=/)' | xargs kill

Finding who killed the process

Kernel

Review dmesg

dmesg | tail -10

OOMKILL

journalctl --list-boots | \
    awk '{ print $1 }' | \
    xargs -I{} journalctl --utc --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE

find /var/log -name 'kern*' -exec grep -PnHe 'Killed process' {} + 2>

Memory

Usage

Checking memory usage with free, vmstat, and /proc/meminfo.

free
               total        used        free      shared  buff/cache   available
Mem:        49112048    16058028    25626304      285292    11305296    33054020

vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st gu
 3  0     52 25628464 1058104 10247340    0    0    79   690 3693   16  8  2 88  2  0  0

cat /proc/meminfo
MemTotal:       49112048 kB
MemFree:        25607800 kB
MemAvailable:   33035832 kB
Buffers:         1058168 kB
Cached:          9118284 kB
SwapCached:            0 kB
Active:         10680872 kB
Inactive:        9254080 kB
Active(anon):    9869064 kB
Inactive(anon):   179956 kB
Active(file):     811808 kB
Inactive(file):  9074124 kB
Unevictable:        6856 kB
Mlocked:            6856 kB
SwapTotal:      20970996 kB
SwapFree:       20970944 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:              3916 kB
Writeback:             0 kB
AnonPages:       9765028 kB
Mapped:          1826376 kB
Shmem:            286036 kB
KReclaimable:    1129924 kB
Slab:            1769516 kB
SReclaimable:    1129924 kB
SUnreclaim:       639592 kB
KernelStack:       40272 kB
PageTables:       124312 kB
SecPageTables:      7180 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    45527020 kB
Committed_AS:   28825424 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      355244 kB
VmallocChunk:          0 kB
Percpu:             9440 kB
HardwareCorrupted:     0 kB
AnonHugePages:    237568 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:    202752 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
Balloon:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     2512584 kB
DirectMap2M:    38162432 kB
DirectMap1G:     9437184 kB
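
Of all these fields, MemTotal and MemAvailable are usually enough for a quick health check. The sketch below turns them into a single "unavailable memory" percentage, computed as (MemTotal - MemAvailable) / MemTotal. mem_used_percent is a made-up helper name, and the optional file argument exists only so the function can be pointed at a saved copy of meminfo.

```shell
#!/usr/bin/env bash
# Percentage of RAM that is not available to new workloads,
# computed as (MemTotal - MemAvailable) / MemTotal.
# mem_used_percent is a hypothetical helper name.

mem_used_percent() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { printf "%.1f\n", (total - avail) * 100 / total }
    ' "${1:-/proc/meminfo}"
}

mem_used_percent
```

With the figures from the sample above (49112048 kB total, 33035832 kB available) this works out to 32.7.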

Hogs

Finding memory hogs. First, you can use top or htop; in top, press M to sort by memory.

smem

smem reports more realistic memory usage than top because it splits shared memory proportionally between the processes that use it.

smem

  PID User     Command                         Swap      USS      PSS      RSS
37901 user     /usr/lib/speech-dispatcher-        0      220      241     2832
 4142 user     dbus-broker --log 4 --contr        0      228      315     2844

# Sort by unique memory usage
smem -s uss

# Show totals
smem -t

# Show system-wide summary
smem -w

# Human readable (unit suffixes)
smem -k

swap - Memory pages swapped to disk.
USS (Unique Set Size) - Memory used only by this process.
PSS (Proportional Set Size) - Shared memory is evenly divided among processes. Most accurate way to calculate total memory usage.
RSS (Resident Set Size) - Total physical memory used by the process, includes shared memory. This is what top and ps mostly show.

If the system is slow or OOMing, monitor RSS. Some useful commands for monitoring memory by PID:

ps -eo pid,comm,rss,vsz --sort=-rss | head
# watch
watch -n 1 'ps -o pid,comm,rss,vsz -p <PID>'

Leak

If an application is hogging memory, you might need a tool specific to the programming language it is written in. An example is valgrind for checking C/C++ applications. Here, let’s focus on applications currently running on the system.

The first indication is a memory hog, which, as mentioned above, you can spot with top. You can also check journalctl for “out of memory”, or run dmesg and grep for oom. If the OOM killer has fired, an application was killed due to memory pressure.

Look for the process name, oom_score and cgroup (container).

dmesg -T | grep -i oom
journalctl -k | grep -i "out of memory"

OOM Score

The OOM score is the primary metric the kernel uses to decide which process to kill first when the system is under RAM pressure. /proc/[PID]/oom_score is computed by the kernel: the higher the score, the more likely the process is to be killed. The adjustment in /proc/[PID]/oom_score_adj ranges from -1000 to 1000, and -1000 exempts the process from the OOM killer entirely. An application that is OOM-killed frequently may have a memory leak. Fix the leak first, then adjust the OOM score if needed.

View OOM Score.

cat /proc/[PID]/oom_score

To make sure your application is not killed (production environment), you can adjust the OOM score. To temporarily adjust the score:

echo 'New_Score' | sudo tee /proc/[PID]/oom_score_adj

To make it persistent, create a systemd service for your application and set OOMScoreAdjust= in its [Service] section.
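
Before changing anything, it helps to see the current score and adjustment side by side. The sketch below reads both files for a given PID; oom_info is a hypothetical helper name, and the current shell is used only as a harmless demo target.

```shell
#!/usr/bin/env bash
# Show the kernel's OOM score and the user adjustment for a PID.
# oom_info is a hypothetical helper name.

oom_info() {
    local pid=$1
    printf 'pid=%s oom_score=%s oom_score_adj=%s\n' \
        "$pid" \
        "$(cat "/proc/$pid/oom_score")" \
        "$(cat "/proc/$pid/oom_score_adj")"
}

# $$ is the PID of the current shell
oom_info $$
```

Run it before and after writing to oom_score_adj to confirm that the adjustment took effect and how much it moved the score.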

Swap

Swap acts as an extension of memory when physical RAM is exhausted. It can be a dedicated partition or a file. An SSD is recommended for swap because it is faster than an HDD.

As mentioned in the CPU section, vmstat can be used to check pressure on swap.

si (swap in) - KB/s swapped from disk into RAM.
so (swap out) - KB/s swapped from RAM to disk. Any sustained non-zero values → severe memory pressure.

Find the PIDs hogging the swap.

grep VmSwap /proc/*/status | sort -k2 -n | tail

/proc/722/status:VmSwap:	   14160 kB
/proc/3294346/status:VmSwap:	   14592 kB
/proc/2438154/status:VmSwap:	   17664 kB
/proc/2438006/status:VmSwap:	   18480 kB
/proc/2390869/status:VmSwap:	   18560 kB
/proc/2438011/status:VmSwap:	   18576 kB
/proc/2437850/status:VmSwap:	   18596 kB
/proc/3187376/status:VmSwap:	   53776 kB
/proc/3294357/status:VmSwap:	   57092 kB
/proc/2438293/status:VmSwap:	  194800 kB
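
The one-liner shows only PIDs, so each one still has to be looked up. A possible variant, sketched below, prints the process name next to each value. swap_by_process is a made-up helper, and the optional directory argument (defaulting to /proc) is there only to make the function easy to try against test data.

```shell
#!/usr/bin/env bash
# List processes by swap usage, largest last, with the process name.
# swap_by_process is a hypothetical helper name.

swap_by_process() {
    local root=${1:-/proc} f
    for f in "$root"/[0-9]*/status; do
        # Kernel threads have no VmSwap line; exited processes give read errors
        awk '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid = $2 }
            /^VmSwap:/ { swap = $2 }
            END { if (swap > 0) printf "%d kB\t%s\t%s\n", swap, pid, name }
        ' "$f" 2> /dev/null
    done | sort -n | tail
}

swap_by_process
```

Each output line is "SIZE kB", PID, and name, separated by tabs; on a system with no swap in use it prints nothing.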

Disk

Status

Check what’s full.

df -h

Find space hogs.

du -h --max-depth=1 / | sort -hr
du -h --max-depth=1 /var | sort -hr

Find the biggest files and directories in the current directory.

du -sh * | sort -rh | head -n 10
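
du summarizes directories, so the single file responsible can still be buried a few levels down. The sketch below lists individual files by size instead. largest_files is a hypothetical helper; it relies on GNU find's -printf, and -xdev is my addition to keep the scan on one filesystem.

```shell
#!/usr/bin/env bash
# List the N largest regular files under a directory, biggest first.
# largest_files is a hypothetical helper; it needs GNU find for -printf.

largest_files() {
    local dir=${1:-.} count=${2:-10}
    # %s is the size in bytes, %p the path; -xdev stays on one filesystem
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2> /dev/null \
        | sort -nr \
        | head -n "$count"
}

largest_files . 5
```

For example, largest_files /var 20 prints the twenty largest files under /var with their sizes in bytes.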

Disk I/O slowness / system freezing

Check real-time disk usage.

iotop

Using iostat (see the iostat section above), look for:

High %util
Long r_await / w_await times

To find the offending PID, monitor iodelay.

pidstat -d

Disk health

smartctl

SMART status.

smartctl -x -a /dev/sdX

Here are the factors to look at in a smartctl scan. Always look for errors in the smartctl report.

Device Error Count
CRC errors
Non-CRC FIS errors
SMART overall health
Reallocated sectors
Pending sectors
Offline uncorrectable
UDMA CRC errors
Temperature
Remaining lifetime
Program_Fail_Count_Chip
Erase_Fail_Count_Chip
Wear_Leveling_Count

fsck

In some cases your disk might not be broken; the file system just has inconsistencies or errors (bad blocks, etc.) caused by improper shutdowns. fsck scans the file system and attempts to fix it. Unmount the device before scanning, unless you know what you are doing.

The unmount can fail, or fsck can report that the device is in use, as in the example below. For a device mounted at boot, root, or recovery, I always use a recovery drive or boot into recovery mode to scan it. There are methods that temporarily mount it to another directory, but we will not cover them.

In my case the device is mounted in /srv/ssd.

# Check disk mounts or just use df -h
lsblk -f
NAME          FSTYPE FSVER LABEL     UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1        ext4   1.0   data      289429cd-429f-43b2-b6c6-f5455c6dda6c
zram0                                                                                    [SWAP]
nvme0n1
├─nvme0n1p1   vfat   FAT32           7862-B414                               424M    58% /boot/efi
├─nvme0n1p2   vfat   FAT32           7862-B38D                               1.1G    71% /recovery
├─nvme0n1p3   ext4   1.0             50fcc7ca-aedb-4f2f-9f1e-397dfeb09027  177.3G    16% /
└─nvme0n1p4   swap   1               093c81f3-30bf-4688-9612-1a41f8955898
  └─cryptswap swap   1     cryptswap eb66fb4b-9c2e-4825-a386-e1e512223b97                [SWAP]

# try to unmount
umount -f /srv/ssd # got an error
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
/dev/sda1 is in use.
e2fsck: Cannot continue, aborting.

# to force it: detach the SCSI device from the kernel, then rescan the bus
echo 1 | sudo tee /sys/block/sda/device/delete
echo "- - -" | sudo tee /sys/class/scsi_host/host*/scan

# check disk, notice that it is now detected again as /dev/sdb
sudo fdisk -l

Disk /dev/sdb: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: Ramsta SSD S800
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5BFD6FBA-560B-4166-A769-E873998917F9

Device     Start       End   Sectors   Size Type
/dev/sdb1   2048 234440703 234438656 111.8G Linux filesystem

Run fsck and try to fix errors.

fsck -fy /dev/sdb1
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
data: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (8050503, counted=27795691).
Fix? yes

Free inodes count wrong (7241024, counted=7325485).
Fix? yes


data: ***** FILE SYSTEM WAS MODIFIED *****
data: 6355/7331840 files (0.3% non-contiguous), 1509141/29304832 blocks

trim

TRIM helps maintain SSD performance and longevity by telling the drive which data blocks are no longer in use, so it can discard them.

Most distributions schedule this automatically. To make sure, check that the timer is active.

systemctl status fstrim.timer

When mounting a disk through /etc/fstab, you can add discard to the mount options so blocks are trimmed as soon as they are freed. This is optional if fstrim.timer is already running periodic TRIM.

UUID="289429cd-429f-43b2-b6c6-f5455c6dda6c"   /srv/ssd   ext4  defaults,noatime,discard  0  2

Manually run TRIM.

sudo fstrim -av
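
To see which /etc/fstab entries already carry the discard option, the options field can be checked directly. A rough sketch, using a hypothetical function name fstab_discard_mounts; it matches the exact option discard only, so variants such as discard=async would need their own case.

```shell
# Print the mount point of every fstab entry whose options include "discard".
# $1 = fstab file to read (defaults to /etc/fstab)
fstab_discard_mounts() {
  local fstab="${1:-/etc/fstab}"
  awk '
    /^[[:space:]]*#/ || NF < 4 { next }   # skip comments and incomplete lines
    {
      n = split($4, opts, ",")            # field 4 holds the mount options
      for (i = 1; i <= n; i++)
        if (opts[i] == "discard") { print $2; break }
    }
  ' "$fstab"
}
```

Separately, `lsblk --discard` shows whether a device supports TRIM at all: non-zero DISC-GRAN and DISC-MAX values mean it does.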

Logs

logrotate

logrotate is enabled by default on most systems. It archives old logs and eventually deletes them.

/etc/logrotate.conf

# see "man logrotate" for details
# global options do not affect preceding include directives
# rotate log files weekly
weekly
# use the adm group by default, since this is the owning group
# of /var/log/.
su root adm
# keep 1 weeks worth of backlogs
rotate 1
# create new (empty) log files after rotating old ones
create
# use date as a suffix of the rotated file
#dateext
# uncomment this if you want your log files compressed
#compress
# packages drop log rotation information into this directory
include /etc/logrotate.d
# system-specific logs may also be configured here.

To rotate logs in a custom directory, add a config file under /etc/logrotate.d/. For example, for maddy:

cat /etc/logrotate.d/maddy
/srv/volume/maddy/log/maddy.log {
    weekly
    rotate 2
    compress
    missingok
    notifempty
}
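
A new rule can be checked without touching any logs by running logrotate in debug mode (-d), which only reports what it would do. A small sketch, assuming a hypothetical wrapper named check_logrotate_conf; depending on your setup it may need to run as root.

```shell
# Hypothetical wrapper: dry-run a logrotate config file.
# Returns 2 if the file is unreadable, 3 if logrotate is not installed.
check_logrotate_conf() {
  local conf="$1"
  if [ ! -r "$conf" ]; then
    echo "cannot read $conf" >&2
    return 2
  fi
  if ! command -v logrotate >/dev/null 2>&1; then
    echo "logrotate is not installed" >&2
    return 3
  fi
  # -d = debug mode: print the planned actions, change nothing.
  logrotate -d "$conf"
}
```

For example, `check_logrotate_conf /etc/logrotate.d/maddy` reports how the maddy rule would be handled. Once the output looks right, `sudo logrotate -f /etc/logrotate.d/maddy` forces a real rotation.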

journald

journald can take up a lot of disk space when left at its default configuration.

Reduce the journal to a specific size (e.g., 100 MB):

sudo journalctl --vacuum-size=100M

Or remove entries older than a given age (e.g., two weeks):

sudo journalctl --vacuum-time=2weeks

Check current disk usage:

journalctl --disk-usage

Configure a permanent limit, then restart systemd-journald to apply it.

/etc/systemd/journald.conf

[Journal]
SystemMaxUse=200M
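
To confirm which limit is actually set, the config file can be read back. A rough sketch with a hypothetical helper journald_max_use; it looks only at the single file given and ignores drop-ins under journald.conf.d, so treat the result as a hint rather than the final merged value.

```shell
# Print the active (uncommented) SystemMaxUse value, or "unset".
# $1 = config file to read (defaults to /etc/systemd/journald.conf)
journald_max_use() {
  local conf="${1:-/etc/systemd/journald.conf}"
  awk -F= '
    /^[[:space:]]*SystemMaxUse[[:space:]]*=/ {
      v = $2                                       # a later assignment replaces an earlier one
      gsub(/^[[:space:]]+|[[:space:]]+$/, "", v)   # trim surrounding blanks
    }
    END { print (v != "" ? v : "unset") }
  ' "$conf"
}
```

After restarting the service, `journalctl --disk-usage` should settle at or below the printed value.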

Docker System Usage

image/builder/volume

If Docker is eating up your disk, start with an overview of what it is using. In the output below, images (7.491GB) and the build cache (4.213GB) account for nearly all of the space.

docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          7         6         7.491GB   7.242GB (96%)
Containers      5         2         938kB     159.7kB (17%)
Local Volumes   0         0         0B        0B
Build Cache     47        0         4.213GB   1.332GB

These commands prune the build cache and dangling images. Most of the time this is enough to free up some space.

docker builder prune
docker image prune

If you are brave enough to do this in production, the commands below also remove every unused image and stopped container and, with --volumes, unused volumes.

docker system prune -a
docker system prune -a --volumes
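
A middle ground between the two is to prune only objects that have been unused for a while, using the until filter that the prune commands accept. A hedged sketch: the wrapper name prune_older_than and the DRY_RUN variable are inventions for this example, and it defaults to printing the commands instead of running them.

```shell
# Hypothetical wrapper: prune Docker objects older than N hours.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
prune_older_than() {
  local hours="$1" run=""
  case "$hours" in
    ''|*[!0-9]*) echo "usage: prune_older_than HOURS" >&2; return 2 ;;
  esac
  # In dry-run mode, prefix every command with echo.
  if [ "${DRY_RUN:-1}" = "1" ]; then run="echo"; fi
  $run docker container prune -f --filter "until=${hours}h"
  $run docker image prune -a -f --filter "until=${hours}h"
  $run docker builder prune -f --filter "until=${hours}h"
}
```

`prune_older_than 168` shows what would be run for anything older than a week; `DRY_RUN=0 prune_older_than 168` actually runs it.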

logs

Check the log output of each container. As written, --tail 1 counts only the bytes of the latest log line; drop it to count each container's full log output.

docker ps --format '{{.Names}}' | while read -r c; do
  echo "$c: $(docker logs --tail 1 "$c" 2>&1 | wc -c) bytes"
done
beszel-agent: 61 bytes
autodiscover: 0 bytes
traefik: 311 bytes
gerbil: 67 bytes
pangolin: 120 bytes
traefik-certs-dumper: 0 bytes
cloudflared: 169 bytes
newt: 59 bytes
dockge: 84 bytes
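
For the space the logs really occupy on disk, one option is to measure the log file that docker inspect reports as LogPath (set when the json-file driver is in use). A minimal sketch, with a hypothetical helper log_sizes that reads "name path" pairs on stdin:

```shell
# Read "name path" pairs on stdin and print "bytes name", largest first.
log_sizes() {
  while read -r name path; do
    [ -f "$path" ] || continue   # skip containers without a log file
    printf '%s %s\n' "$(wc -c < "$path" | tr -d ' ')" "$name"
  done | sort -rn
}
```

Run it from a root shell, since the files under /var/lib/docker are normally not world-readable: `docker inspect --format '{{.Name}} {{.LogPath}}' $(docker ps -q) | log_sizes`.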

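Limit the size of container log files by editing the Docker daemon config. With the values below, each container keeps at most three log files of about 10 MB each. The change needs a Docker restart and applies only to containers created afterwards.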

/etc/docker/daemon.json

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Alternatively, define the limits per service in the compose file.

services:
  app:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

Docker

Container overloading the system

Suppose you deployed an application that needs more resources than your server can provide. First, try to stop and remove the container.

docker stop <container_name>
docker rm <container_name>

If the stop command is not doing anything, force-remove the container.

docker rm -f <container_name>

If all else fails, restart the docker and containerd services, or kill them outright.

systemctl restart containerd
systemctl restart docker

# or
systemctl kill docker
systemctl kill containerd

# do this only if you know what you are doing
pkill -9 docker
pkill -9 containerd
pkill -9 containerd-shim
pkill -9 runc

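If the container still cannot be killed, look for processes stuck in uninterruptible sleep (state D). A process in this state does not act on signals, including SIGKILL, until the I/O it is waiting on returns.

# list the header plus processes whose STAT column starts with D
ps -eo pid,stat,cmd | awk 'NR == 1 || $2 ~ /^D/'
kill -9 <PID>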

Monitor the system load and restart the services once it has settled.

systemctl restart containerd
systemctl restart docker
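
The wait can be scripted by polling the 1-minute load average from /proc/loadavg. A small sketch under stated assumptions: load_below is a hypothetical helper, and the threshold of 4.0 in the example is arbitrary (a value near the number of CPU cores reported by nproc is a common starting point).

```shell
# Succeed if the 1-minute load average is below the given threshold.
# $1 = threshold (e.g. 4.0)
# $2 = loadavg file to read (defaults to /proc/loadavg)
load_below() {
  # The first field of /proc/loadavg is the 1-minute load average.
  awk -v max="$1" '{ exit !(($1 + 0) < (max + 0)) }' "${2:-/proc/loadavg}"
}

# Example (not run here): poll every 10 seconds, then restart the services.
#   until load_below 4.0; do sleep 10; done
#   sudo systemctl restart containerd docker
```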
