Linux Troubleshooting
CPU and Processes
Understanding top command
$ top

top - 18:47:04 up 22:45, 2 users, load average: 0.46, 0.74, 0.85
Tasks: 410 total, 1 running, 407 sleeping, 0 stopped, 2 zombie
%Cpu(s): 7.1 us, 2.4 sy, 4.1 ni, 86.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 47961.0 total, 32654.2 free, 11221.4 used, 7702.5 buff/cache
MiB Swap: 20479.5 total, 20479.4 free, 0.1 used. 36739.6 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22671 mcbtagu+ 14 -6 12.2g 968172 437988 S 47.2 2.0 171:20.29 firefox-bin
Header line (system status)
top - 18:47:04 up 22:45, 2 users, load average: 0.46, 0.74, 0.85
- 18:47:04 - current time
- up 22:45 - machine has been running for 22 hours 45 minutes
- 2 users - two logged-in users
- load average: 0.46, 0.74, 0.85
  - Average number of runnable processes (on Linux this also counts processes in uninterruptible sleep) over:
    - last 1 min
    - last 5 min
    - last 15 min
  - Compare these numbers with the number of CPU cores. On a multi-core system these values are low, so the system is not stressed
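The comparison of load average against core count can be scripted. The following is a minimal sketch, assuming a Linux `/proc/loadavg` and the `nproc` utility; `load_per_core` is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: divide a load average by the number of cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# /proc/loadavg starts with the 1, 5, and 15 minute averages.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen _ < /proc/loadavg
    cores=$(nproc)
    echo "1-min load per core: $(load_per_core "$one" "$cores")"
fi
```

As a rough rule of thumb, a per-core value that stays near or above 1.00 suggests the CPUs are saturated.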
Tasks (process summary)
Tasks: 410 total, 1 running, 407 sleeping, 0 stopped, 2 zombie
- 410 total - total processes/threads
- 1 running - only one actively using CPU right now
- 407 sleeping - normal; most processes wait for work
- 2 zombie - processes that exited but haven’t been cleaned up
CPU usage
%Cpu(s): 7.1 us, 2.4 sy, 4.1 ni, 86.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
- us (7.1%) - user programs
- sy (2.4%) - kernel/system work
- ni (4.1%) - low-priority (nice) processes
- id (86.2%) - idle CPU
- wa (0.1%) - waiting on disk I/O
- hi / si - hardware/software interrupts
- st - stolen by hypervisor (VMs)
Memory usage
MiB Mem : 47961.0 total, 32654.2 free, 11221.4 used, 7702.5 buff/cache
- Total: ~48 GB
- Free: ~32 GB (very high)
- Used: ~11 GB (actual app usage)
- buff/cache: ~7.7 GB (filesystem cache, reclaimable)
MiB Swap: 20479.5 total, 20479.4 free, 0.1 used. 36739.6 avail Mem
- Swap almost unused - excellent
- avail Mem ~36 GB - memory pressure is basically zero
Process list
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22671 mcbtag+ 14 -6 12.2g 968172 437988 S 47.2 2.0 171:20.29
Identity
- PID 22671 - process ID
- USER mcbtag+ - owner of the process (the trailing + means the name was truncated)
- COMMAND - process name (cut off in this sample)
Priority
- PR 14 - scheduler priority
- NI -6 - nice value; higher priority than normal
  - Negative nice = process is favored by the scheduler
Memory
- VIRT 12.2g - virtual memory (address space)
- RES 968172 KiB (~945 MiB) - actual physical RAM used
- SHR 437988 KiB (~428 MiB) - shared memory (counts once globally)
State
- S - sleeping (can still show CPU usage due to sampling)
Usage
- %CPU 47.2
  - About half of one CPU core
  - Totally fine on a multi-core system
- %MEM 2.0
- TIME+ 171:20
  - Total CPU time used since start
Useful top keys (while top is running)
| Command | Description |
|---|---|
| P | sort by CPU |
| M | sort by memory |
| 1 | show per-CPU usage |
| k | kill a process |
| H | show threads |
| q | quit |
Process States
ps -eo pid,stat,comm | head
PID STAT COMMAND
1 Ss systemd
2 S kthreadd
3 S pool_workqueue_release
4 I< kworker/R-rcu_gp
5 I< kworker/R-sync_wq
6 I< kworker/R-kvfree_rcu_reclaim
7 I< kworker/R-slub_flushwq
8 I< kworker/R-netns
10 I< kworker/0:0H-events_highpri
Primary States
| Symbol | Type | Meaning / Description | Notes / Examples |
|---|---|---|---|
| R | Primary | Running | Actively using CPU or ready to run. Check %CPU to see load. |
| S | Primary | Sleeping (interruptible) | Waiting for an event or input. Normal for most processes. |
| D | Primary | Uninterruptible sleep | Waiting on I/O (disk, network). Does not respond to signals, including SIGKILL, until the I/O completes. High numbers = I/O bottleneck. |
| Z | Primary | Zombie | Process finished but parent hasn’t reaped it. No CPU usage, but occupies PID. |
| T | Primary | Stopped | Paused by signal or debugger (e.g., SIGSTOP). |
| X | Primary | Dead | Shouldn’t appear normally. Process is terminated. |
| I | Primary | Idle | Idle kernel thread. Common for kworker threads, as in the ps output above. |
Common Modifiers
| Symbol | Type | Meaning / Description | Notes |
|---|---|---|---|
| s | Modifier | Session leader | Process started a session (usually shells, daemons). |
| l | Modifier | Multi-threaded | Process has multiple threads (POSIX threads). |
| + | Modifier | Foreground process group | Attached to terminal and can receive input signals. |
| < | Modifier | High priority | Real-time or kernel-specified priority. |
| N | Modifier | Low priority / nice | Positive nice value (lower scheduler priority). |
| L | Modifier | Has pages locked in memory | Used in real-time processes. |
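To get a quick overview of how many processes are in each primary state, the STAT column can be tallied. This is a minimal sketch; `count_states` is a hypothetical helper name, and it only looks at the first character of each STAT value, ignoring the modifiers.

```shell
# Hypothetical helper: read STAT values (one per line) and count
# processes by their primary state, i.e. the first character.
count_states() {
    awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Live usage: "stat=" prints the STAT column without a header.
command -v ps >/dev/null && ps -eo stat= | count_states
```

A growing count of D or Z entries is usually the first thing worth investigating.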
Top CPU User
Status
ps aux --sort=-%cpu | head
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mcbtagu+ 22671 17.4 2.0 12941056 1023280 ? S<l Feb04 239:29 /usr/lib/firefox/firefox
mcbtagu+ 3742 8.6 0.5 1755524 286288 tty1 S<l+ Feb04 132:35 cosmic-comp
If one process is always at the top, investigate it.
On a multi-core system, also check whether a single CPU sits at or near 100% while the others are mostly idle; this usually means the application is single-threaded. To check, press 1 inside top to display the load per CPU.
top - 21:45:59 up 11 days, 10:13, 2 users, load average: 0.34, 0.35, 0.32
Tasks: 173 total, 2 running, 171 sleeping, 0 stopped, 0 zombie
%Cpu0 : 7.1 us, 5.1 sy, 0.0 ni, 86.9 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu1 : 12.9 us, 5.1 sy, 0.0 ni, 81.4 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
MiB Mem : 3792.2 total, 148.5 free, 1690.5 used, 2513.0 buff/cache
MiB Swap: 3792.0 total, 3058.0 free, 734.0 used. 2101.7 avail Mem
Multi-threaded
To check if an application is multi-threaded, list its threads. Each row is one thread: the fourth column (LWP) is the thread ID and the sixth (NLWP) is the number of threads in the process, 10 in this example.
ps -eLf

root 4047471 4047447 4047471 0 10 Feb04 ? 00:00:06 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047523 0 10 Feb04 ? 00:00:07 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047524 0 10 Feb04 ? 00:00:00 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047525 0 10 Feb04 ? 00:00:00 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047627 0 10 Feb04 ? 00:00:07 /beszel serve --http=0.0.0.0:8090
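The thread count can also be read directly from `/proc`, which is handy in scripts. A minimal sketch, assuming a Linux `/proc` filesystem; `thread_count` is a hypothetical helper name.

```shell
# Hypothetical helper: print the thread count of a PID, taken from the
# "Threads:" field of /proc/<PID>/status.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Example: thread count of the current shell.
echo "shell $$ has $(thread_count $$) thread(s)"
```

A value of 1 means the process is single-threaded and cannot use more than one core at a time.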
PIDSTAT
pidstat is another tool for checking CPU usage; it monitors individual tasks (processes).
pidstat

Average: USER PID %usr %system %guest %wait %CPU CPU Command
Average: root 17 0.00 0.17 0.00 0.17 0.17 - rcu_preempt
Average: root 1049 2.00 0.67 0.00 0.00 2.66 - dockerd
- USER – The owner of the process.
- PID – The process ID.
- %usr – The percentage of CPU time spent in user mode.
- %system – The percentage of CPU time spent in kernel mode.
- %guest – The percentage of CPU time spent running virtual CPUs (guest OS).
- %wait – The percentage of CPU time the task spent waiting to run, that is, runnable but waiting for a free CPU. This is not I/O wait.
- %CPU – The total CPU usage of the process, roughly %usr + %system + %guest. %wait is not included, as the rcu_preempt row shows.
To show I/O statistics.
pidstat -d

10:30:14 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
10:30:14 PM 0 1 29.68 52.61 0.88 0 systemd
10:30:14 PM 0 258 0.00 2.78 0.00 0 jbd2/dm-0-8
- Time (10:30:14 PM) – Timestamp of the measurement.
- UID (0) – User ID of the process owner.
- PID (1) – Process ID.
- kB_rd/s (29.68) – Kilobytes read per second from disk by the process.
- kB_wr/s (52.61) – Kilobytes written per second to disk by the process.
- kB_ccwr/s (0.88) – Kilobytes per second whose write to disk was cancelled by the process, for example when it truncates data that is still dirty in the page cache.
- iodelay (0) – Block I/O delay of the process (time spent waiting for I/O to complete), measured in clock ticks.
- Command (systemd) – The process name.
Other useful commands.
# show a specific application, by command name or by PID
pidstat -C <application_name>
pidstat -p <PID>
perf
perf is a powerful analysis tool for profiling user-space applications and kernel code by sampling hardware and software performance counters.
perf stat -p 704 sleep 5

 Performance counter stats for process id '704':

 23 context-switches # 4.6 cs/sec cs_per_second
 0 cpu-migrations # 0.0 migrations/sec migrations_per_second
 0 page-faults # 0.0 faults/sec page_faults_per_second
 5,010.95 msec task-clock # 1.0 CPUs CPUs_utilized
 94,030,265 branch-misses # 2.9 % branch_miss_rate (66.55%)
 3,218,655,134 branches # 642.3 M/sec branch_frequency (66.69%)
 16,847,526,866 cpu-cycles # 3.4 GHz cycles_frequency (66.66%)
 16,123,382,223 instructions # 1.0 instructions insn_per_cycle (66.57%)

 5.011242868 seconds time elapsed
- task-clock: total time the CPU spent actively executing the task
- context-switches: number of times the operating system switched execution from one task to another
- cpu-migrations: number of times the process was moved from one CPU core to another
- page-faults: number of times the program accessed a memory page that was not yet mapped into its address space. Only major faults have to be read from disk; minor faults are resolved from memory
To record perf performance data.
perf record -g -p 704 sleep 5
[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 2.962 MB perf.data (19944 samples) ]
To view the saved data.
perf report

Samples: 19K of event 'cpu/cycles/P', Event count (approx.): 16434851149
 Children Self Command Shared Object Symbol
+ 100.00% 0.00% yes yes [.] 0x000055f8fe8958f5
+ 100.00% 0.00% yes libc.so.6 [.] __libc_start_main_impl (inlined)
+ 100.00% 0.00% yes libc.so.6 [.] call_init (inlined)
+ 100.00% 0.00% yes libc.so.6 [.] 0x00007faa3c427741
+ 99.24% 0.00% yes libc.so.6 [.] 0x00007faa3c494b04
+ 99.18% 0.00% yes libc.so.6 [.] 0x00007faa3c494ade
+ 70.93% 6.43% yes [kernel.kallsyms] [.] entry_SYSCALL_64_after_hwframe
+ 64.50% 1.49% yes [kernel.kallsyms] [.] do_syscall_64
+ 53.81% 0.08% yes libc.so.6 [.] vmsplice
+ 53.78% 0.00% yes yes [.] 0x000055f8fe89579c
+ 45.81% 0.04% yes libc.so.6 [.] splice
+ 45.79% 0.00% yes yes [.] 0x000055f8fe8957db
Signals
Signals are used to communicate with running processes, for example to tell a process to stop, pause, resume, or clean up.
The kernel can also send signals; for instance, a process that attempts an integer division by zero receives the SIGFPE signal.
SIGSTOP
To stop (pause) a process.
[arch@mrlgarchforge tmp]$ yes > /dev/null &
[1] 1445
[arch@mrlgarchforge tmp]$ kill -SIGSTOP 1445
SIGCONT
To resume.
[arch@mrlgarchforge tmp]$ kill -SIGCONT 1445
SIGINT
This is the signal sent when Ctrl+C is pressed.
SIGTERM and SIGQUIT
SIGTERM is the polite way to shut down a process (graceful kill): the process can catch it and clean up before exiting. SIGQUIT (Ctrl+\) also asks the process to terminate, but by default it produces a core dump as well, which is useful when the process is misbehaving.
SIGKILL
This is the true force kill. The signal cannot be caught, blocked, or ignored, and the process is terminated without any chance to clean up.
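The difference between a catchable signal and SIGKILL is easiest to see with a handler. The following is a minimal sketch; `worker` and `demo_sigterm` are hypothetical names, and fractional `sleep` values are assumed to be supported (as in GNU coreutils).

```shell
# Worker that installs a SIGTERM handler and then loops until signalled.
worker() {
    trap 'echo "cleaning up"; exit 0' TERM
    while :; do sleep 0.1; done
}

demo_sigterm() {
    worker &
    worker_pid=$!
    sleep 0.5                   # give the worker time to install its trap
    kill -TERM "$worker_pid"    # polite request: the handler runs, then exits 0
    wait "$worker_pid"
    echo "worker exit status: $?"
}

demo_sigterm
```

Replacing `-TERM` with `-KILL` would skip the handler entirely, so the cleanup message would never be printed.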
CPU & I/O
Another common performance killer is I/O (input/output), which ranges from disk to network. A common tool for checking it is vmstat.
vmstat
vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r b swpd free buff cache si so bi bo in cs us sy id wa st gu
 7 0 52 28477860 916668 7946588 0 0 113 886 3986 15 7 2 88 2 0 0
- r (run queue) - Number of runnable processes. A problem exists if r is consistently greater than the number of CPU cores.
- b (blocked) - Processes waiting on I/O. High b = disk or network I/O bottleneck.
- swpd - Amount of swap currently used (KB).
- free - Completely unused RAM.
- buff - Buffer cache (metadata, block I/O buffers).
- cache - Page cache (file contents cached in RAM). Cache shrinking + swap growing → memory trouble.
- si (swap in) - KB/s swapped from disk into RAM.
- so (swap out) - KB/s swapped from RAM to disk. Any sustained non-zero values → severe memory pressure.
- bi / bo (blocks in/out) - Disk read/write activity. High values ≠ bad by default. High with high wa = I/O problem.
- in (interrupts/sec) - Hardware + software interrupts. Network traffic increases this a lot.
- cs (context switches/sec) - How often CPU switches between processes/threads. Extremely high values → too many threads, locks, or I/O wakeups.
- us - user CPU (apps)
- sy - kernel CPU (syscalls, I/O handling). Network-heavy systems often show high sy.
- id - idle CPU
- wa - CPU waiting on I/O
- st (steal) - Time stolen by hypervisor (VMs only)
| Symptom | Likely Cause |
|---|---|
| High r, high us | CPU-bound |
| High b, high wa | Disk I/O bottleneck |
| High sy, high in | Network I/O |
| Non-zero si/so | Memory pressure |
| High st | VM host contention |
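The symptom table can be turned into a rough first-pass check. This is a sketch only: `classify_vmstat` is a hypothetical helper, the thresholds (70% us, 20% wa, 10% st) are illustrative assumptions rather than established limits, and the network row is left out because a meaningful threshold for `in` depends on the hardware.

```shell
# Hypothetical helper: classify one vmstat data line read from stdin.
# $1 = number of CPU cores. Thresholds are illustrative assumptions.
classify_vmstat() {
    awk -v cores="$1" '{
        r = $1; b = $2; si = $7; so = $8; us = $13; wa = $16; st = $17
        found = 0
        if (r > cores && us >= 70) { print "CPU-bound"; found = 1 }
        if (b > 0 && wa >= 20)     { print "Disk I/O bottleneck"; found = 1 }
        if (si + so > 0)           { print "Memory pressure"; found = 1 }
        if (st >= 10)              { print "VM host contention"; found = 1 }
        if (!found) print "no obvious bottleneck"
    }'
}

# Example with the sample vmstat line shown earlier, on an 8-core machine.
echo "7 0 52 28477860 916668 7946588 0 0 113 886 3986 15 7 2 88 2 0 0" | classify_vmstat 8
```

For live data, feed it the last line of a sampled run, for example `vmstat 1 5 | tail -n 1 | classify_vmstat "$(nproc)"`, since the first vmstat line reports averages since boot.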
Disk level I/O
iostat
iostat -xz
Linux 6.17.9-76061709-generic (tags-p51) 02/06/2026 _x86_64_ (8 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
 6.95 0.70 1.89 2.19 0.00 88.27

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.03 0.00 0.00 0.21 21.98 0.00 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 0.98 30.17 0.24 19.51 0.20 30.77 6.52 306.79 5.36 45.12 4.69 47.07 0.00 0.00 0.00 0.00 0.00 0.00 0.38 0.71 0.03 0.40
nvme1n1 0.03 2.38 0.01 16.48 0.30 71.60 0.19 4.13 0.02 10.80 3.53 22.11 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.75 0.00 0.03
sda 0.55 74.06 0.40 42.18 18.06 134.38 0.58 540.84 0.93 61.83 1037.08 937.41 0.00 0.00 0.00 0.00 0.00 0.00 0.01 701.83 0.61 3.93
zram0 0.00 0.01 0.00 0.00 0.02 20.34 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Critical iostat columns.
Device r/s w/s rkB/s wkB/s r_await w_await wareq-sz aqu-sz %util
- %iowait - Percentage of time the CPUs were idle while waiting on outstanding disk I/O.
- %util - Percentage of time the device was busy servicing requests.
- r_await / w_await - Average time (ms) a read or write spends waiting in the queue plus being serviced.
- aqu-sz - Average queue depth; shows how many I/Os are waiting.
- wareq-sz - Average write request size (kB); shows whether writes are big or small.
- r/s / w/s - IOPS (operations per second).
Identify Processes Stuck in I/O
In top, look for processes in state D, or run this command.
ps aux | awk '$8 ~ /^D/ {print}'
Priority and Nice Values
When a process is starving others, or background jobs steal CPU, it's good practice to lower the priority of the offending application.
Status
To check priority.
ps -o pid,ni,comm -p PID
Lower priority.
renice -n 10 -p PID
Raise priority (setting a lower nice value normally requires root).
sudo renice -n -5 -p PID
Zombie Processes
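A zombie process is a child process that has completed its execution and terminated but still has an entry in the system's process table, because its parent has not yet read its exit status.
Find zombies and parent process
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
ps -o ppid= -p ZOMBIE_PID
A zombie cannot be killed because it is already dead. You fix (or restart) the parent process, not the zombie, so that the zombie gets reaped.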
ulimit
ulimit is used to view and set limits on the system resources that a user's processes can consume.
Soft Limits
These are the values the kernel actually enforces, and they can still be adjusted by the user. The hard limit acts as the ceiling for the soft limit.
[arch@mrlgarchforge ~]$ ulimit -Sa
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 15544
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 15544
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Hard Limits
The hard limit is the maximum value for the soft limit and therefore the maximum amount of a resource a user can consume. Any user can lower a hard limit, but only root can raise it.
[arch@mrlgarchforge ~]$ ulimit -Ha
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 15544
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 15544
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Process Number
Limit the maximum number of processes the user can run.
ulimit -u 100
File Size
Limit the maximum size of files the user can create.
# limit to 100 KB (bash counts -f in 1024-byte blocks)
ulimit -f 100
Virtual Memory
Limit the maximum virtual memory available to a process.
# virtual memory 1000 KB
ulimit -v 1000
Opened Files
Limit the number of simultaneously open files (file descriptors).
ulimit -n 10
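Limits set with ulimit apply to the current shell and everything it starts afterwards, so it is often safer to scope a change to a single command. A minimal sketch of that idea; `with_nofile_limit` is a hypothetical helper name.

```shell
# Hypothetical helper: run a command with a different soft open-files
# limit. The subshell keeps the change away from the calling shell.
with_nofile_limit() {
    nofile_limit=$1
    shift
    ( ulimit -S -n "$nofile_limit" && "$@" )
}

# Example: the inner shell reports the reduced limit,
# while the calling shell keeps its own value.
with_nofile_limit 64 sh -c 'ulimit -n'
ulimit -n
```

Only the soft limit is changed here (`-S`), so the command can still raise it again up to the hard limit.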
Other limits
| Flag | Meaning | Short Description |
|---|---|---|
| -a | all | Show all current limits |
| -H | hard | Show/set hard limits |
| -S | soft | Show/set soft limits |
| -c | core file size | Max size of core dump files |
| -d | data seg size | Max process data segment size |
| -e | scheduling priority | Max scheduling priority (nice) |
| -f | file size | Max size of files created |
| -i | pending signals | Max pending signals |
| -l | locked memory | Max locked-in-memory size |
| -m | resident set size | Max resident memory size |
| -n | open files | Max number of open file descriptors |
| -p | pipe size | Pipe buffer size |
| -q | message queues | Max bytes in POSIX message queues |
| -r | realtime priority | Max real-time scheduling priority |
| -s | stack size | Max stack size |
| -t | CPU time | Max CPU time per process |
| -u | user processes | Max number of user processes |
| -v | virtual memory | Max virtual memory available |
| -x | file locks | Max number of file locks |
| -T | threads | Max number of threads |
| -P | pseudoterminals | Max PTYs |
| -b | socket buffers | Max socket buffer size |
| -k | kqueues | Max kqueues allocated |
| -w | swaps | Max swap space |
Finding PID using Port
netstat
netstat -ltnup | grep ':22 '
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 578/sshd: /usr/bin/
tcp6 0 0 :::22 :::* LISTEN 578/sshd: /usr/bin/
ss
ss -ltnup 'sport = :22'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=578,fd=6))
tcp LISTEN 0 128 [::]:22 [::]:* users:(("sshd",pid=578,fd=7))
lsof
lsof -i :22
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sshd 578 root 6u IPv4 8277 0t0 TCP *:ssh (LISTEN)
sshd 578 root 7u IPv6 8279 0t0 TCP *:ssh (LISTEN)
sshd-sess 592 root 7u IPv4 9233 0t0 TCP mrlgarchforge:ssh->192.168.254.14:38796 (ESTABLISHED)
sshd-sess 607 arch 7u IPv4 9233 0t0 TCP mrlgarchforge:ssh->192.168.254.14:38796 (ESTABLISHED)
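When the PID is needed in a script, it can be extracted from the `users:(...)` field that ss prints. A minimal sketch; `pids_from_ss` is a hypothetical helper name, and it assumes the `pid=N` format shown in the ss output above.

```shell
# Hypothetical helper: pull the pid=N values out of `ss -p` output
# and print each PID once.
pids_from_ss() {
    grep -o 'pid=[0-9]*' | cut -d= -f2 | sort -un
}

# Example with a line copied from the ss output above.
printf '%s\n' 'tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=578,fd=6))' | pids_from_ss
```

With live data this would be used as `ss -ltnp 'sport = :22' | pids_from_ss`; ss usually needs root to show processes owned by other users.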
Get Process ID
Show Background Process
jobs
[1]+ Running yes > /dev/null &
By Application Name
ps aux | grep firefox
By File
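# show which processes have the file open
fuser -v text.txt
# show processes using the filesystem that contains the file
fuser -cv text.txt
# careful: -k kills every process that has the file open
fuser -k text.txt
# the same lookup with lsof, keeping the header line
lsof | { head -1 ; grep text.txt ; }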
Killing a running application
Kill by application name
# kill only accepts PIDs, so use pkill or killall for names
pkill firefox
killall firefox
Terminate the process using its PID
pgrep firefox
22671
# or
ps aux | grep firefox
mcbtagu+ 22671 13.6 1.9 12791920 936552 ? R<l Feb04 161:06 /usr/lib/firefox/firefox

# kill takes a PID; pkill and killall match process names instead
kill 22671
Forceful Termination
# by PID
kill -9 22671

# by name
pkill -9 firefox
killall -9 firefox
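In practice it is better to ask politely first and only escalate to SIGKILL when the process does not exit in time. A minimal sketch of that pattern; `terminate` is a hypothetical helper name and the default grace period of 5 seconds is an arbitrary choice.

```shell
# Hypothetical helper: send SIGTERM, wait up to N seconds, then SIGKILL.
# $1 = PID, $2 = grace period in seconds (default 5).
terminate() {
    term_pid=$1
    term_wait=${2:-5}
    kill -TERM "$term_pid" 2>/dev/null || return 0    # already gone
    term_i=0
    while [ "$term_i" -lt "$term_wait" ]; do
        kill -0 "$term_pid" 2>/dev/null || return 0   # exited gracefully
        sleep 1
        term_i=$((term_i + 1))
    done
    kill -KILL "$term_pid" 2>/dev/null                # last resort
    return 0
}

# Example: start a throwaway process and terminate it.
sleep 30 &
terminate "$!" 3
```

`kill -0` sends no signal; it only checks whether the PID still exists.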
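Kill all applications run by a user
# list the user's processes with their process group IDs (PGID)
ps -o pid,pgid,sess,cmd -U your_username
# send SIGTERM to every process in a group (note the minus before the PGID)
kill -SIGTERM -- -<PGID>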
Kill process tree.
# get pgid
ps -ejf | grep <application_name>
# or
pgrep -g [pgid_number]

# kill tree
kill -SIGTERM -- -<PGID>
Kill by Port
# lsof -t prints only the PIDs
lsof -t -i udp:80 | xargs kill
# -S selects SCTP sockets; use -t for TCP or -u for UDP
ss -Slp | grep -Po ':80\s.*pid=\K\d+(?=,)' | xargs kill
netstat -Slp | grep -Po ':80\s.*LISTEN.*?\K\d+(?=/)' | xargs kill
Finding who killed the process
Kernel
Review dmesg
dmesg | tail -10
OOMKILL
journalctl --list-boots | \
 awk '{ print $1 }' | \
 xargs -I{} journalctl --utc --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE

find /var/log -name 'kern*' -exec grep -PnHe 'Killed process' {} + 2>
Memory
Usage
Checking memory usage with free, vmstat, and /proc/meminfo.
free
 total used free shared buff/cache available
Mem: 49112048 16058028 25626304 285292 11305296 33054020

vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r b swpd free buff cache si so bi bo in cs us sy id wa st gu
 3 0 52 25628464 1058104 10247340 0 0 79 690 3693 16 8 2 88 2 0 0

cat /proc/meminfo
MemTotal: 49112048 kB
MemFree: 25607800 kB
MemAvailable: 33035832 kB
Buffers: 1058168 kB
Cached: 9118284 kB
SwapCached: 0 kB
Active: 10680872 kB
Inactive: 9254080 kB
Active(anon): 9869064 kB
Inactive(anon): 179956 kB
Active(file): 811808 kB
Inactive(file): 9074124 kB
Unevictable: 6856 kB
Mlocked: 6856 kB
SwapTotal: 20970996 kB
SwapFree: 20970944 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 3916 kB
Writeback: 0 kB
AnonPages: 9765028 kB
Mapped: 1826376 kB
Shmem: 286036 kB
KReclaimable: 1129924 kB
Slab: 1769516 kB
SReclaimable: 1129924 kB
SUnreclaim: 639592 kB
KernelStack: 40272 kB
PageTables: 124312 kB
SecPageTables: 7180 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 45527020 kB
Committed_AS: 28825424 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 355244 kB
VmallocChunk: 0 kB
Percpu: 9440 kB
HardwareCorrupted: 0 kB
AnonHugePages: 237568 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 202752 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
Balloon: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 2512584 kB
DirectMap2M: 38162432 kB
DirectMap1G: 9437184 kB
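Of all the fields in /proc/meminfo, MemAvailable relative to MemTotal is usually the quickest indicator of memory pressure. A minimal sketch that computes it; `mem_available_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: read /proc/meminfo text on stdin and print
# MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/ { total = $2 } /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

if [ -r /proc/meminfo ]; then
    echo "available memory: $(mem_available_pct < /proc/meminfo)%"
fi
```

For the sample output above this works out to roughly 67% of RAM still available.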
Hogs
Finding memory hogs. First, you can use top or htop; in top, press M to sort by memory.
smem
smem reports memory usage more accurately than top because it accounts for shared memory proportionally.
smem

  PID User Command Swap USS PSS RSS
37901 user /usr/lib/speech-dispatcher- 0 220 241 2832
 4142 user dbus-broker --log 4 --contr 0 228 315 2844
# Sort by unique memory usage
smem -s uss

# Show totals
smem -t

# Show system-wide summary
smem -w

# Human readable
smem -k
- swap - Memory pages swapped to disk.
- USS (Unique Set Size) - Memory used only by this process.
- PSS (Proportional Set Size) - Shared memory is evenly divided among processes. Most accurate way to calculate total memory usage.
- RSS (Resident Set Size) - Total physical memory used by the process, includes shared memory. This is what top and ps mostly show.
If the system is slow or running out of memory, monitor RSS. Some useful commands to monitor memory by PID.
ps -eo pid,comm,rss,vsz --sort=-rss | head
# watch
watch -n 1 'ps -o pid,comm,rss,vsz -p <PID>'
Leak
If an application is hogging the system, you might need a tool specific to its programming language to find the leak; for example, valgrind can check C/C++ applications. Here, let's focus on applications currently running on the system.
The first indication is a memory hog, which you can spot with top as mentioned above. You can also search journalctl for "out of memory", or grep dmesg for oom. If the OOM killer has fired, an application was killed due to memory pressure.
Look for the process name, oom_score and cgroup (container).
dmesg -T | grep -i oom
journalctl -k | grep -i "out of memory"
OOM Score
This is the primary metric that determines which process is killed first when the system is under RAM pressure. oom_score normally ranges from 0 to 1000; the higher the score, the more likely the process is to be killed. The score can be shifted with oom_score_adj, which ranges from -1000 (never kill) to 1000. An application that gets OOM-killed frequently probably has a memory leak: fix the leak first, then adjust the OOM score if needed.
View OOM Score.
cat /proc/[PID]/oom_score
To make sure your application is not killed (production environment), you can adjust the OOM score. To adjust the score temporarily:
echo 'New_Score' | sudo tee /proc/[PID]/oom_score_adj
To make it persistent, create a systemd service for your application and set OOMScoreAdjust in it.
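Before adjusting anything, it helps to see which processes the OOM killer currently ranks highest. A minimal sketch, assuming a Linux `/proc` filesystem; `oom_info` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<oom_score> <oom_score_adj>" for one PID.
oom_info() {
    printf '%s %s\n' "$(cat "/proc/$1/oom_score")" "$(cat "/proc/$1/oom_score_adj")"
}

# List the five processes with the highest oom_score.
for dir in /proc/[0-9]*; do
    pid=${dir#/proc/}
    # Processes can exit while we iterate, so skip unreadable entries.
    score=$(cat "$dir/oom_score" 2>/dev/null) || continue
    name=$(cat "$dir/comm" 2>/dev/null) || continue
    echo "$score $pid $name"
done | sort -rn | head -n 5

# Score and adjustment of the current shell.
oom_info $$
```

The output columns of the loop are score, PID, and command name.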
Swap
Swap acts as an extension of memory when physical RAM is exhausted. It can live on a dedicated partition or in a file. An SSD is recommended for swap because it is much faster than an HDD.
As mentioned in the CPU section, vmstat can be used to check pressure on swap.
- si (swap in) - KB/s swapped from disk into RAM.
- so (swap out) - KB/s swapped from RAM to disk. Any sustained non-zero values → severe memory pressure.
Find the PIDs using the most swap.
grep VmSwap /proc/*/status | sort -k2 -n | tail

/proc/722/status:VmSwap: 14160 kB
/proc/3294346/status:VmSwap: 14592 kB
/proc/2438154/status:VmSwap: 17664 kB
/proc/2438006/status:VmSwap: 18480 kB
/proc/2390869/status:VmSwap: 18560 kB
/proc/2438011/status:VmSwap: 18576 kB
/proc/2437850/status:VmSwap: 18596 kB
/proc/3187376/status:VmSwap: 53776 kB
/proc/3294357/status:VmSwap: 57092 kB
/proc/2438293/status:VmSwap: 194800 kB
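The grep output only shows PIDs, so a second lookup is needed to find out what each process is. A minimal sketch that prints the process name alongside; `swap_of_status` is a hypothetical helper name, and only the first word of the process name is kept.

```shell
# Hypothetical helper: from /proc/<PID>/status text on stdin, print
# "<swap_kB> <name>" when the process has swapped-out pages.
swap_of_status() {
    awk '/^Name:/ { name = $2 } /^VmSwap:/ { swap = $2 }
         END { if (swap > 0) print swap, name }'
}

# Ten biggest swap users: swap in kB, name, PID.
for f in /proc/[0-9]*/status; do
    [ -r "$f" ] || continue            # process may have exited already
    pid=${f#/proc/}
    pid=${pid%/status}
    line=$(swap_of_status < "$f")
    [ -n "$line" ] && echo "$line $pid"
done | sort -n | tail
```

Kernel threads have no VmSwap field and are skipped automatically.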
Disk
Status
Check what's full.
df -h
Find space hogs.
du -h --max-depth=1 / | sort -hr
du -h --max-depth=1 /var | sort -hr
Find the largest files and directories in the current directory.
du -sh * | sort -rh | head -n 10
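For monitoring or cron jobs, the df check can be reduced to "which filesystems are above a threshold". A minimal sketch; `full_filesystems` is a hypothetical helper name, the 90% threshold is an arbitrary example, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount
# points whose usage is at or above the given percentage.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example: report filesystems that are at least 90% full.
df -P | full_filesystems 90
```

`df -P` is used instead of `df -h` because the POSIX format keeps every filesystem on a single line, which makes it safe to parse.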
Disk I/O slowness / system freezing
Check real-time disk usage.
iotop
Use iostat as described in the iostat section above. Look for:
- High %util
- Long await times
To find the offending PID, monitor iodelay.
pidstat -d
Disk health
smartctl
SMART status.
smartctl -x -a /dev/sdX
Here are the attributes to look at in a smartctl report. Always check the report for errors.
- Device Error Count
- CRC errors
- Non-CRC FIS errors
- SMART overall health
- Reallocated sectors
- Pending sectors
- Offline uncorrectable
- UDMA CRC errors
- Temperature
- Remaining lifetime
- Program_Fail_Count_Chip
- Erase_Fail_Count_Chip
- Wear_Leveling_Count
fsck
In some cases your disk is not broken; the file system just has inconsistencies or errors (bad blocks etc.) due to improper shutdowns. fsck scans the file system and attempts to fix it. Unmount the device before scanning, unless you know what you are doing.
Sometimes fsck still reports the device as in use even after it has been unmounted, as in the example below. For devices mounted at /boot, / or /recovery, I always boot from a recovery drive or into recovery mode to run the scan. There are also methods that temporarily mount the device to another directory, but they are not covered here.
In my case the device is mounted at /srv/ssd.
# Check disk mounts or just use df -h
lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1 ext4 1.0 data 289429cd-429f-43b2-b6c6-f5455c6dda6c
zram0 [SWAP]
nvme0n1
├─nvme0n1p1 vfat FAT32 7862-B414 424M 58% /boot/efi
├─nvme0n1p2 vfat FAT32 7862-B38D 1.1G 71% /recovery
├─nvme0n1p3 ext4 1.0 50fcc7ca-aedb-4f2f-9f1e-397dfeb09027 177.3G 16% /
└─nvme0n1p4 swap 1 093c81f3-30bf-4688-9612-1a41f8955898
 └─cryptswap swap 1 cryptswap eb66fb4b-9c2e-4825-a386-e1e512223b97 [SWAP]

# try to unmount
umount -f /srv/ssd
# fsck on the device still failed with this error
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
/dev/sda1 is in use.
e2fsck: Cannot continue, aborting.

# to force unmount: remove the disk from the SCSI layer, then rescan the bus
echo 1 | sudo tee /sys/block/sda/device/delete
echo "- - -" | sudo tee /sys/class/scsi_host/host*/scan

# check disk, notice that it has been re-detected as /dev/sdb
sudo fdisk -l

Disk /dev/sdb: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: Ramsta SSD S800
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5BFD6FBA-560B-4166-A769-E873998917F9

Device Start End Sectors Size Type
/dev/sdb1 2048 234440703 234438656 111.8G Linux filesystem
Run fsck and try to fix errors.
fsck -fy /dev/sdb1
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
data: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (8050503, counted=27795691).
Fix? yes

Free inodes count wrong (7241024, counted=7325485).
Fix? yes


data: ***** FILE SYSTEM WAS MODIFIED *****
data: 6355/7331840 files (0.3% non-contiguous), 1509141/29304832 blocks
trim
TRIM is used to optimize performance and increase the longevity of an SSD by telling the drive which data blocks are no longer in use so that it can discard them.
This is normally started automatically by the system; to make sure, check that the timer is running.
systemctl status fstrim.timer
When mounting the disk through /etc/fstab, add discard to the options so that blocks are trimmed automatically as files are deleted.
UUID="289429cd-429f-43b2-b6c6-f5455c6dda6c" /srv/ssd ext4 defaults,noatime,discard 0 2
Manually run TRIM.
fstrim -av
Logs
logrotate
logrotate is enabled automatically on your system. It archives your old logs and eventually deletes them. The main configuration is in /etc/logrotate.conf
# see "man logrotate" for details
# global options do not affect preceding include directives
# rotate log files weekly
weekly
# use the adm group by default, since this is the owning group
# of /var/log/.
su root adm
# keep 1 weeks worth of backlogs
rotate 1
# create new (empty) log files after rotating old ones
create
# use date as a suffix of the rotated file
#dateext
# uncomment this if you want your log files compressed
#compress
# packages drop log rotation information into this directory
include /etc/logrotate.d
# system-specific logs may also be configured here.
To rotate logs in a custom directory, add a config file to /etc/logrotate.d/. For example, for maddy:
cat /etc/logrotate.d/maddy
/srv/volume/maddy/log/maddy.log {
    weekly
    rotate 2
    compress
    missingok
    notifempty
}
journald
journald usually hogs disk space if left at its default configuration.
Reduce the journal to a specific size (e.g., 100 MB):
sudo journalctl --vacuum-size=100M
Or remove entries older than a given age (e.g., two weeks):
sudo journalctl --vacuum-time=2weeks
Check current disk usage:
journalctl --disk-usage
Configure a permanent limit in /etc/systemd/journald.conf, then restart systemd-journald to apply it.
[Journal]
SystemMaxUse=200M
Docker System Usage
image/builder/volume
If Docker is eating up your disk, start with an overview of your Docker system. You can see below that the build cache is filling up the disk.
docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 7 6 7.491GB 7.242GB (96%)
Containers 5 2 938kB 159.7kB (17%)
Local Volumes 0 0 0B 0B
Build Cache 47 0 4.213GB 1.332GB
Some commands to prune Docker resources. Most of the time these are enough to free up some space.
docker builder prune
docker image prune
If you are brave enough (to do this in production): the commands below remove all unused images, stopped containers, and, with --volumes, unused volumes as well.
docker system prune -a
docker system prune -a --volumes
logs
Find space used by container logs. Note that with --tail 1 the loop below only measures the most recent log line of each running container; drop --tail 1 to count the whole log output.
docker ps --format '{{.Names}}' | while read -r c; do
  echo "$c: $(docker logs --tail 1 "$c" 2>&1 | wc -c) bytes"
done
beszel-agent: 61 bytes
autodiscover: 0 bytes
traefik: 311 bytes
gerbil: 67 bytes
pangolin: 120 bytes
traefik-certs-dumper: 0 bytes
cloudflared: 169 bytes
newt: 59 bytes
dockge: 84 bytes
Limiting the size of container log files, by editing the Docker daemon config.
/etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Defining the limit in a compose file.
services:
  app:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
Docker
Container overloading the system
If you deployed an application that needs more than your server can handle, first try to stop and remove the container.
docker stop <container_name>
docker rm <container_name>
If the stop command is not doing anything, try to force remove the container.
docker rm -f <container_name>
If all else fails, restart or kill the docker and containerd services.
systemctl restart containerd
systemctl restart docker

# or
systemctl kill docker
systemctl kill containerd

# do this if you know what you are doing
pkill -9 docker
pkill -9 containerd
pkill -9 containerd-shim
pkill -9 runc
If the container is still not being killed, look for processes stuck in the uninterruptible (D) state. Such a process only reacts to kill -9 once its pending I/O completes.
ps -eo pid,stat,cmd | awk '$2 ~ /^D/'
kill -9 <PID>
Monitor system load and restart the services.
systemctl restart containerd
systemctl restart docker