Linux Troubleshooting
CPU and Processes
Understanding top command
$ top

top - 18:47:04 up 22:45, 2 users, load average: 0.46, 0.74, 0.85
Tasks: 410 total, 1 running, 407 sleeping, 0 stopped, 2 zombie
%Cpu(s): 7.1 us, 2.4 sy, 4.1 ni, 86.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 47961.0 total, 32654.2 free, 11221.4 used, 7702.5 buff/cache
MiB Swap: 20479.5 total, 20479.4 free, 0.1 used. 36739.6 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22671 mcbtagu+ 14 -6 12.2g 968172 437988 S 47.2 2.0 171:20.29 firefox-bin
Header line (system status)
top - 18:47:04 up 22:45, 2 users, load average: 0.46, 0.74, 0.85
- 18:47:04 - current time
- up 22:45 - machine has been running for 22 hours 45 minutes
- 2 users - two logged-in users
- load average: 0.46, 0.74, 0.85
  - Average number of runnable processes (on Linux, this also counts processes in uninterruptible sleep) over:
    - last 1 min
    - last 5 min
    - last 15 min
  - Compare these numbers with the core count: on a multi-core system, values this low mean the system is not stressed
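To put a load average in context, it helps to divide it by the number of CPU cores. The following is a minimal sketch, assuming a Linux system with /proc/loadavg and the coreutils nproc command; the function names load_per_core and current_load_per_core are made up for this example.

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# current_load_per_core: 1-minute load average divided by the number of
# online CPUs, read from /proc/loadavg and nproc.
current_load_per_core() {
    load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
}
```

Running current_load_per_core prints a single number. As a rough rule of thumb, a value that stays well below 1.00 suggests spare CPU capacity, while a sustained value above 1.00 suggests more runnable work than cores.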
Tasks (process summary)
Tasks: 410 total, 1 running, 407 sleeping, 0 stopped, 2 zombie
- 410 total - total processes/threads
- 1 running - only one actively using CPU right now
- 407 sleeping - normal; most processes wait for work
- 2 zombie - processes that exited but haven’t been cleaned up
CPU usage
%Cpu(s): 7.1 us, 2.4 sy, 4.1 ni, 86.2 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
- us (7.1%) - user programs
- sy (2.4%) - kernel/system work
- ni (4.1%) - low-priority (nice) processes
- id (86.2%) - idle CPU
- wa (0.1%) - waiting on disk I/O
- hi / si - hardware/software interrupts
- st - stolen by hypervisor (VMs)
Memory usage
MiB Mem : 47961.0 total, 32654.2 free, 11221.4 used, 7702.5 buff/cache
- Total: ~48 GB
- Free: ~32 GB (very high)
- Used: ~11 GB (actual app usage)
- buff/cache: ~7.7 GB (filesystem cache, reclaimable)
MiB Swap: 20479.5 total, 20479.4 free, 0.1 used. 36739.6 avail Mem
- Swap almost unused - excellent
- avail Mem ~36 GB - memory pressure is basically zero
Process list
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22671 mcbtag+ 14 -6 12.2g 968172 437988 S 47.2 2.0 171:20.29
Identity
- PID 22671 - process ID
- USER mcbtag+ - user
- COMMAND - (cut off, but this is the process name)
Priority
- PR 14 - scheduler priority
- NI -6 - higher priority than normal
  - Negative nice = process is favored by scheduler
Memory
- VIRT 12.2g - virtual memory (address space)
- RES 968172 KiB (~945 MiB) - actual physical RAM used
- SHR 437988 KiB (~428 MiB) - shared memory (counts once globally)
State
- S - sleeping (can still show CPU usage due to sampling)
Usage
- %CPU 47.2
  - About half of one CPU core
  - Totally fine on a multi-core system
- %MEM 2.0
- TIME+ 171:20
  - Total CPU time used since start
Useful top commands while running
| Command | Description |
|---|---|
| P | sort by CPU |
| M | sort by memory |
| 1 | show per-CPU usage |
| k | kill a process |
| H | show threads |
| q | quit |
Process States
ps -eo pid,stat,comm | head
PID STAT COMMAND
1 Ss systemd
2 S kthreadd
3 S pool_workqueue_release
4 I< kworker/R-rcu_gp
5 I< kworker/R-sync_wq
6 I< kworker/R-kvfree_rcu_reclaim
7 I< kworker/R-slub_flushwq
8 I< kworker/R-netns
10 I< kworker/0:0H-events_highpri
Primary States
| Symbol | Type | Meaning / Description | Notes / Examples |
|---|---|---|---|
| R | Primary | Running | Actively using CPU or ready to run. Check %CPU to see load. |
| S | Primary | Sleeping (interruptible) | Waiting for an event or input. Normal for most processes. |
| D | Primary | Uninterruptible sleep | Waiting on I/O (disk, network). Cannot be killed with SIGKILL. High numbers = I/O bottleneck. |
| Z | Primary | Zombie | Process finished but parent hasn’t reaped it. No CPU usage, but occupies PID. |
| T | Primary | Stopped | Paused by signal or debugger (e.g., SIGSTOP). |
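| X | Primary | Dead | Shouldn’t appear normally. Process is terminated. |
| I | Primary | Idle kernel thread | Normal for kernel worker threads, such as the kworker entries in the output above. |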
Common Modifiers
| Symbol | Type | Meaning / Description | Notes |
|---|---|---|---|
| s | Modifier | Session leader | Process started a session (usually shells, daemons). |
| l | Modifier | Multi-threaded | Process has multiple threads (POSIX threads). |
| + | Modifier | Foreground process group | Attached to terminal and can receive input signals. |
| < | Modifier | High priority | Real-time or kernel-specified priority. |
| N | Modifier | Low priority / nice | Positive nice value (lower scheduler priority). |
| L | Modifier | Has pages locked in memory | Used in real-time processes. |
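A quick way to see how many processes are in each primary state is to count the first letter of the STAT column. This is a minimal sketch; the helper name count_states is made up for this example, and it only looks at the first character, ignoring the modifiers.

```shell
# count_states: read STAT values (one per line) on stdin and print
# "<count> <state letter>" for each primary state, most frequent first.
count_states() {
    awk '{ c[substr($1, 1, 1)]++ } END { for (s in c) print c[s], s }' |
        sort -rn
}
```

On a live system it can be fed with ps -eo stat= | count_states. A growing count for D or Z is usually the thing worth investigating.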
Top CPU User
Status
ps aux --sort=-%cpu | head

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mcbtagu+ 22671 17.4 2.0 12941056 1023280 ? S<l Feb04 239:29 /usr/lib/firefox/firefox
mcbtagu+ 3742 8.6 0.5 1755524 286288 tty1 S<l+ Feb04 132:35 cosmic-comp
If one process is always on top, investigate it.
On a multi-core system, also check whether a single CPU is at or near 100% utilization, because the application may be single-threaded. In top, press 1 to display the per-CPU load.
top - 21:45:59 up 11 days, 10:13, 2 users, load average: 0.34, 0.35, 0.32
Tasks: 173 total, 2 running, 171 sleeping, 0 stopped, 0 zombie
%Cpu0 : 7.1 us, 5.1 sy, 0.0 ni, 86.9 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu1 : 12.9 us, 5.1 sy, 0.0 ni, 81.4 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
MiB Mem : 3792.2 total, 148.5 free, 1690.5 used, 2513.0 buff/cache
MiB Swap: 3792.0 total, 3058.0 free, 734.0 used. 2101.7 avail Mem
Multi-threaded
To check whether an application is multi-threaded, list its threads. Rows that share a PID (second column) but have different LWP values (fourth column) are threads of the same process; the sixth column (NLWP) is the thread count.
ps -eLf

root 4047471 4047447 4047471 0 10 Feb04 ? 00:00:06 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047523 0 10 Feb04 ? 00:00:07 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047524 0 10 Feb04 ? 00:00:00 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047525 0 10 Feb04 ? 00:00:00 /beszel serve --http=0.0.0.0:8090
root 4047471 4047447 4047627 0 10 Feb04 ? 00:00:07 /beszel serve --http=0.0.0.0:8090
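When only the thread count of one process is needed, it can be read directly from /proc instead of scanning the full thread list. This is a minimal sketch, assuming a Linux /proc filesystem; the helper name thread_count is made up for this example.

```shell
# thread_count PID: print the number of threads of a process, taken from
# the "Threads:" field of /proc/PID/status. Returns non-zero if the
# process does not exist.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}
```

For the process shown above, thread_count 4047471 would be the call. A result of 1 means the process is single-threaded. The same number is also available from ps -o nlwp= -p PID.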
PIDSTAT
pidstat is another tool for checking CPU and processes; it monitors individual tasks.
pidstat

Average: USER PID %usr %system %guest %wait %CPU CPU Command
Average: root 17 0.00 0.17 0.00 0.17 0.17 - rcu_preempt
Average: root 1049 2.00 0.67 0.00 0.00 2.66 - dockerd
- USER – The owner of the process.
- PID – The process ID.
- %usr – The percentage of CPU time spent in user mode.
- %system – The percentage of CPU time spent in kernel mode.
- %guest – The percentage of CPU time spent running virtual CPUs (guest OS).
- %wait – The percentage of CPU time the task spent waiting to run (ready but not scheduled). This is not I/O wait.
- %CPU – The total CPU usage of the process. Roughly %usr + %system + %guest; %wait is not included, as the rcu_preempt row above shows.
To show the I/O statistics.
pidstat -d

10:30:14 PM UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
10:30:14 PM 0 1 29.68 52.61 0.88 0 systemd
10:30:14 PM 0 258 0.00 2.78 0.00 0 jbd2/dm-0-8
- Time (10:30:14 PM) – Timestamp of the measurement.
- UID (0) – User ID of the process owner.
- PID (1) – Process ID.
- kB_rd/s (29.68) – Kilobytes read per second from disk by the process.
- kB_wr/s (52.61) – Kilobytes written per second to disk by the process.
- kB_ccwr/s (0.88) – Kilobytes per second whose write to disk was cancelled by the process, for example when it truncates a file that still has dirty pages in the page cache.
- iodelay (0) – Block I/O delay of the process for this interval, measured in clock ticks.
- Command (systemd) – The process name.
Other useful commands.
# show specific application
pidstat -C <application_name>
pidstat -p <PID>
CPU & I/O
Another common performance killer is I/O (input/output), which can range from disk to network. A common tool for checking it is vmstat.
vmstat
vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r b swpd free buff cache si so bi bo in cs us sy id wa st gu
 7 0 52 28477860 916668 7946588 0 0 113 886 3986 15 7 2 88 2 0 0
- r (run queue) - Number of runnable processes. There is a problem if r is consistently greater than the number of CPU cores.
- b (blocked) - Processes waiting on I/O. High b = disk or network I/O bottleneck.
- swpd - Amount of swap currently used (KB).
- free - Completely unused RAM.
- buff - Buffer cache (metadata, block I/O buffers).
- cache - Page cache (file contents cached in RAM). Cache shrinking + swap growing → memory trouble.
- si (swap in) - KB/s swapped from disk into RAM.
- so (swap out) - KB/s swapped from RAM to disk. Any sustained non-zero values → severe memory pressure.
- bi / bo (blocks in/out) - Disk read/write activity. High values ≠ bad by default. High with high wa = I/O problem.
- in (interrupts/sec) - Hardware + software interrupts. Network traffic increases this a lot.
- cs (context switches/sec) - How often CPU switches between processes/threads. Extremely high values → too many threads, locks, or I/O wakeups.
- us - user CPU (apps)
- sy - kernel CPU (syscalls, I/O handling). Network-heavy systems often show high sy.
- id - idle CPU
- wa - CPU waiting on I/O
- st (steal) - Time stolen by hypervisor (VMs only)
| Symptom | Likely Cause |
|---|---|
| High r, high us | CPU-bound |
| High b, high wa | Disk I/O bottleneck |
| High sy, high in | Network I/O |
| Non-zero si/so | Memory pressure |
| High st | VM host contention |
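The symptom table can be turned into a rough automatic check on vmstat output. This is a minimal sketch: the function name vmstat_hints and every threshold in it (wa of 20 or more, us of 70 or more, st of 10 or more) are illustrative assumptions, not standard values, and the "High sy, high in" row is left out because no sensible cut-off for interrupts is obvious.

```shell
# vmstat_hints CORES: read vmstat output on stdin and print one line of
# hints per data line. Column positions follow the layout shown above:
# r=1 b=2 si=7 so=8 us=13 wa=16 st=17. Thresholds are assumptions.
vmstat_hints() {
    awk -v cores="$1" '
        $1 !~ /^[0-9]+$/ { next }   # skip the two header lines
        {
            hint = ""
            if ($7 + $8 > 0)             hint = hint " memory-pressure"
            if ($2 > 0 && $16 >= 20)     hint = hint " disk-io"
            if ($1 > cores && $13 >= 70) hint = hint " cpu-bound"
            if ($17 >= 10)               hint = hint " vm-host-contention"
            print (hint == "" ? "no obvious bottleneck" : substr(hint, 2))
        }'
}
```

A typical call is vmstat 1 5 | vmstat_hints "$(nproc)". The first data line of vmstat reports averages since boot, so the later lines are the ones that reflect the current state.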
Disk level I/O
iostat
iostat -xz
Linux 6.17.9-76061709-generic (tags-p51) 02/06/2026 _x86_64_ (8 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
 6.95 0.70 1.89 2.19 0.00 88.27

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.03 0.00 0.00 0.21 21.98 0.00 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 0.98 30.17 0.24 19.51 0.20 30.77 6.52 306.79 5.36 45.12 4.69 47.07 0.00 0.00 0.00 0.00 0.00 0.00 0.38 0.71 0.03 0.40
nvme1n1 0.03 2.38 0.01 16.48 0.30 71.60 0.19 4.13 0.02 10.80 3.53 22.11 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.75 0.00 0.03
sda 0.55 74.06 0.40 42.18 18.06 134.38 0.58 540.84 0.93 61.83 1037.08 937.41 0.00 0.00 0.00 0.00 0.00 0.00 0.01 701.83 0.61 3.93
zram0 0.00 0.01 0.00 0.00 0.02 20.34 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
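Critical iostat columns.
Device r/s w/s rkB/s wkB/s await svctm %util
- %iowait - Share of CPU time spent idle while waiting on outstanding disk I/O.
- %util - Percentage of elapsed time during which the device was busy with I/O requests.
- r_await / w_await - Average time (ms) a read or write request spends waiting in the queue plus being serviced. In the output above, sda has a w_await of about 1037 ms, far higher than the NVMe devices.
- aqu-sz - Average queue depth; shows how many I/Os are waiting.
- wareq-sz - Average write request size (kB); shows whether writes are big or small.
- r/s / w/s - IOPS (operations per second).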
Identify Processes Stuck in I/O
In top, look for processes in state D, or run this command.
ps aux | awk '$8 ~ /^D/ {print}'
Priority and Nice Values
When a process is starving others, or background jobs steal CPU, it is good practice to lower the priority of the application.
Status
To check priority.
ps -o pid,ni,comm -p PID
Lower priority.
renice 10 -p PID
Raise priority (negative nice values require root).
renice -5 -p PID
Zombie Processes
A zombie process is a child process that has completed its execution and terminated but still has an entry in the system’s process table.
Find zombies and their parent process
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
ps -o ppid= -p ZOMBIE_PID
A zombie is already dead and cannot be killed, so you fix (or restart) the parent process, not the zombie process.
Killing a running application
Kill by application name
pkill firefox
killall firefox
# kill accepts PIDs, not names, so resolve the name first
kill $(pgrep firefox)
Terminate the process using its PID
pgrep firefox
22671
# or
ps aux | grep '[f]irefox'
mcbtagu+ 22671 13.6 1.9 12791920 936552 ? R<l Feb04 161:06 /usr/lib/firefox/firefox

# pkill and killall match process names, so use kill with a PID
kill 22671
Forceful Termination
# by PID
kill -9 22671

# by name
pkill -9 firefox
killall -9 firefox
Kill all applications run by a user
ps -o pid,pgid,sess,cmd -U your_username
# kill one process group
kill -SIGTERM -- -<PGID>
# or signal every process owned by the user
pkill -u your_username
Kill process tree.
# get pgid
ps -ejf | grep <application_name>
# or
pgrep -g [pgid_number]

# kill tree
kill -SIGTERM -- -<PGID>
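In practice, the polite and the forceful termination are often combined: send SIGTERM, give the process some time to clean up, and only then fall back to SIGKILL. This is a minimal sketch, assuming a Linux /proc filesystem; the helper names is_running and graceful_kill and the default 10-second timeout are made up for this example.

```shell
# is_running PID: true if the process exists and is not a zombie.
is_running() {
    state=$(awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# graceful_kill PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 10), then send SIGKILL if the process is still running.
# Returns 1 if the first signal cannot be sent (no such process or no
# permission).
graceful_kill() {
    pid=$1
    timeout=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 1
    while [ "$timeout" -gt 0 ] && is_running "$pid"; do
        sleep 1
        timeout=$((timeout - 1))
    done
    if is_running "$pid"; then
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}
```

For the Firefox example above, the call would be graceful_kill 22671 15. Processes stuck in state D will not react to either signal until their I/O completes.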
Memory
Usage
Checking memory usage with free, vmstat, and /proc/meminfo.
free
     total used free shared buff/cache available
Mem: 49112048 16058028 25626304 285292 11305296 33054020

vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r b swpd free buff cache si so bi bo in cs us sy id wa st gu
 3 0 52 25628464 1058104 10247340 0 0 79 690 3693 16 8 2 88 2 0 0

cat /proc/meminfo
MemTotal: 49112048 kB
MemFree: 25607800 kB
MemAvailable: 33035832 kB
Buffers: 1058168 kB
Cached: 9118284 kB
SwapCached: 0 kB
Active: 10680872 kB
Inactive: 9254080 kB
Active(anon): 9869064 kB
Inactive(anon): 179956 kB
Active(file): 811808 kB
Inactive(file): 9074124 kB
Unevictable: 6856 kB
Mlocked: 6856 kB
SwapTotal: 20970996 kB
SwapFree: 20970944 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 3916 kB
Writeback: 0 kB
AnonPages: 9765028 kB
Mapped: 1826376 kB
Shmem: 286036 kB
KReclaimable: 1129924 kB
Slab: 1769516 kB
SReclaimable: 1129924 kB
SUnreclaim: 639592 kB
KernelStack: 40272 kB
PageTables: 124312 kB
SecPageTables: 7180 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 45527020 kB
Committed_AS: 28825424 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 355244 kB
VmallocChunk: 0 kB
Percpu: 9440 kB
HardwareCorrupted: 0 kB
AnonHugePages: 237568 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 202752 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
Balloon: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 2512584 kB
DirectMap2M: 38162432 kB
DirectMap1G: 9437184 kB
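Of all these fields, MemAvailable relative to MemTotal is usually the quickest indicator of memory pressure. This is a minimal sketch that reduces /proc/meminfo to that one percentage; the helper name mem_available_pct is made up for this example.

```shell
# mem_available_pct [FILE]: print MemAvailable as a whole-number
# percentage of MemTotal (rounded down). FILE defaults to /proc/meminfo.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%d\n", avail * 100 / total }
    ' "${1:-/proc/meminfo}"
}
```

With the values shown above (33035832 kB available out of 49112048 kB), the result is about 67. What counts as too low depends on the workload, so any alert threshold built on this is a judgment call.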
Hogs
Finding memory hogs. First, you can use top or htop; in top, press M to sort by memory.
smem
smem reports memory usage more realistically than top because it splits shared memory between the processes that use it.
smem

 PID User Command Swap USS PSS RSS
37901 user /usr/lib/speech-dispatcher- 0 220 241 2832
 4142 user dbus-broker --log 4 --contr 0 228 315 2844
# Sort by unique memory usage
smem -s uss

# Show totals
smem -t

# Show system-wide summary
smem -w

# Reverse the sort order
smem -r

# Human readable
smem -k
- swap - Memory pages swapped to disk.
- USS (Unique Set Size) - Memory used only by this process.
- PSS (Proportional Set Size) - Shared memory is evenly divided among processes. Most accurate way to calculate total memory usage.
- RSS (Resident Set Size) - Total physical memory used by the process, includes shared memory. This is what top and ps mostly show.
If the system is slow or hitting OOM, monitor RSS. Some useful commands for monitoring the memory of a PID:
ps -eo pid,comm,rss,vsz --sort=-rss | head
# watch
watch -n 1 'ps -o pid,comm,rss,vsz -p <PID>'
Leak
If an application is hogging the system, you might need to run a specific tool, depending on the programming language used. An example would be valgrind to monitor and check a C/C++ application. Let’s focus on applications currently running on the system.
The first indication is a memory hog, which you can check with top as mentioned above. You can also check journalctl and look for “out of memory”. Another option is to use dmesg and grep for oom; if the OOM killer has fired, an application was killed due to memory pressure.
Look for the process name, oom_score and cgroup (container).
dmesg -T | grep -i oom
journalctl -k | grep -i "out of memory"
OOM Score
This is the primary metric that determines which process is killed first when the system is under RAM pressure: the higher the score, the more likely the process is to be killed. The score can be shifted with oom_score_adj, which ranges from -1000 (never kill) to 1000 (kill first). If an application gets OOM-killed frequently, that is also a good indication that it has a memory leak. Fix the leak and adjust the OOM score.
View OOM Score.
cat /proc/[PID]/oom_score
To make sure your application is not killed (production environment), you can adjust the OOM score. To adjust the score temporarily:
echo 'New_Score' | sudo tee /proc/[PID]/oom_score_adj
To make it persistent, create a systemd service for your application and add OOMScoreAdjust= to its [Service] section.
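For a service that already exists, the setting can go into a systemd drop-in file instead of editing the unit itself. This is a minimal sketch; the helper name write_oom_dropin, the file name oom.conf, and the unit name myapp.service used in the usage note are all hypothetical.

```shell
# write_oom_dropin DROPIN_DIR SCORE: create DROPIN_DIR/oom.conf with an
# OOMScoreAdjust= line. DROPIN_DIR is the unit's drop-in directory,
# e.g. /etc/systemd/system/<unit>.service.d
write_oom_dropin() {
    dir=$1
    score=$2
    # systemd accepts values from -1000 (never kill) to 1000 (kill first).
    if [ "$score" -lt -1000 ] || [ "$score" -gt 1000 ]; then
        echo "score must be between -1000 and 1000" >&2
        return 1
    fi
    mkdir -p "$dir" &&
        printf '[Service]\nOOMScoreAdjust=%s\n' "$score" > "$dir/oom.conf"
}
```

As root, the call would look like write_oom_dropin /etc/systemd/system/myapp.service.d -500, followed by systemctl daemon-reload and a restart of the service so the new value takes effect.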
Swap
Swap acts as an extension of memory when physical RAM is exhausted. It can live on a dedicated partition or in a file. An SSD is recommended for swap because it is faster than an HDD.
Like mentioned in CPU section, vmstat can be used to check pressure on swap.
- si (swap in) - KB/s swapped from disk into RAM.
- so (swap out) - KB/s swapped from RAM to disk. Any sustained non-zero values → severe memory pressure.
Find PID hogging the swap.
grep VmSwap /proc/*/status | sort -k2 -n | tail

/proc/722/status:VmSwap: 14160 kB
/proc/3294346/status:VmSwap: 14592 kB
/proc/2438154/status:VmSwap: 17664 kB
/proc/2438006/status:VmSwap: 18480 kB
/proc/2390869/status:VmSwap: 18560 kB
/proc/2438011/status:VmSwap: 18576 kB
/proc/2437850/status:VmSwap: 18596 kB
/proc/3187376/status:VmSwap: 53776 kB
/proc/3294357/status:VmSwap: 57092 kB
/proc/2438293/status:VmSwap: 194800 kB
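The output above shows only PIDs, so each one still has to be looked up by hand. This is a minimal sketch that prints the process name next to the swap usage; the helper name swap_by_process is made up for this example, and only the first word of a process name is shown.

```shell
# swap_by_process [PROC_ROOT]: print "<kB> <pid> <name>" for every
# process with non-zero swap usage, largest last. PROC_ROOT defaults
# to /proc.
swap_by_process() {
    root=${1:-/proc}
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue
        # Name: and Pid: appear before VmSwap: in the status file.
        awk '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid = $2 }
            /^VmSwap:/ { if ($2 > 0) print $2, pid, name }
        ' "$f" 2>/dev/null
    done | sort -n
}
```

Running swap_by_process | tail gives the same ranking as the grep command, with names included. Kernel threads have no VmSwap line and are skipped automatically.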
Disk
Status
Check what’s full.
df -h
Find space hogs.
du -h --max-depth=1 / | sort -hr
du -h --max-depth=1 /var | sort -hr
Find the biggest files and directories in the current directory.
du -sh * | sort -rh | head -n 10
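The du commands rank directories; to find individual large files anywhere below a directory, find can be used instead. This is a minimal sketch, assuming GNU find (for -printf); the helper name largest_files is made up for this example.

```shell
# largest_files DIR [N]: print the N largest regular files under DIR
# (default 10) as "<bytes><TAB><path>", without crossing into other
# filesystems. Relies on GNU find's -printf.
largest_files() {
    find "$1" -xdev -type f -printf '%s\t%p\n' 2>/dev/null |
        sort -rn | head -n "${2:-10}"
}
```

For example, largest_files /var 5 lists the five biggest files under /var. The -xdev option keeps the search on the filesystem that df reported as full.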
Disk I/O slowness / system freezing
Check real-time disk usage.
iotop
Using iostat (see the iostat section above), look for:
- High %util
- Long await times
To find the offending PID, monitor iodelay.
pidstat -d
Disk health
smartctl
SMART status.
smartctl -x -a /dev/sdX
These are the factors to look at when performing a smartctl scan. Always look for errors in the smartctl report.
- Device Error Count
- CRC errors
- Non-CRC FIS errors
- SMART overall health
- Reallocated sectors
- Pending sectors
- Offline uncorrectable
- UDMA CRC errors
- Temperature
- Remaining lifetime
- Program_Fail_Count_Chip
- Erase_Fail_Count_Chip
- Wear_Leveling_Count
fsck
In some cases the disk itself is not broken; the file system just has inconsistencies or errors (bad blocks, etc.) caused by improper shutdowns. fsck scans the file system and attempts to fix it. Unmount the device before scanning, unless you know what you are doing.
Sometimes the device is unmounted but fsck still throws an error like the one in the example below. For a device mounted at boot, root, or recovery, I always use a recovery drive or boot into recovery mode to scan it. There are methods that temporarily mount it to a directory, but we will not cover them here.
In my case the device is mounted in /srv/ssd.
# Check disk mounts or just use df -h
lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1 ext4 1.0 data 289429cd-429f-43b2-b6c6-f5455c6dda6c
zram0 [SWAP]
nvme0n1
├─nvme0n1p1 vfat FAT32 7862-B414 424M 58% /boot/efi
├─nvme0n1p2 vfat FAT32 7862-B38D 1.1G 71% /recovery
├─nvme0n1p3 ext4 1.0 50fcc7ca-aedb-4f2f-9f1e-397dfeb09027 177.3G 16% /
└─nvme0n1p4 swap 1 093c81f3-30bf-4688-9612-1a41f8955898
  └─cryptswap swap 1 cryptswap eb66fb4b-9c2e-4825-a386-e1e512223b97 [SWAP]

# try to unmount
umount -f /srv/ssd # got an error
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
/dev/sda1 is in use.
e2fsck: Cannot continue, aborting.

# to force unmount: remove the SCSI device from the kernel, then rescan the bus to detect it again
echo 1 | sudo tee /sys/block/sda/device/delete
echo "- - -" | sudo tee /sys/class/scsi_host/host*/scan

# check disk, notice that it is now detected as /dev/sdb
sudo fdisk -l

Disk /dev/sdb: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: Ramsta SSD S800
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5BFD6FBA-560B-4166-A769-E873998917F9

Device Start End Sectors Size Type
/dev/sdb1 2048 234440703 234438656 111.8G Linux filesystem
Run fsck and try to fix errors.
fsck -fy /dev/sdb1
fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
data: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (8050503, counted=27795691).
Fix? yes

Free inodes count wrong (7241024, counted=7325485).
Fix? yes


data: ***** FILE SYSTEM WAS MODIFIED *****
data: 6355/7331840 files (0.3% non-contiguous), 1509141/29304832 blocks
trim
TRIM optimizes an SSD and extends its lifespan by telling the drive which data blocks are no longer in use so it can discard them.
Most systems run it automatically on a schedule; to make sure, check that the timer is active.
systemctl status fstrim.timer
Alternatively, when mounting the disk in /etc/fstab, add discard to the options so blocks are trimmed automatically as files are deleted.
UUID="289429cd-429f-43b2-b6c6-f5455c6dda6c" /srv/ssd ext4 defaults,noatime,discard 0 2
Manually run TRIM.
fstrim -av
Logs
logrotate
logrotate is enabled by default on most systems; it archives and eventually deletes your old logs. The global configuration is in /etc/logrotate.conf:
# see "man logrotate" for details
# global options do not affect preceding include directives
# rotate log files weekly
weekly
# use the adm group by default, since this is the owning group
# of /var/log/.
su root adm
# keep 1 weeks worth of backlogs
rotate 1
# create new (empty) log files after rotating old ones
create
# use date as a suffix of the rotated file
#dateext
# uncomment this if you want your log files compressed
#compress
# packages drop log rotation information into this directory
include /etc/logrotate.d
# system-specific logs may also be configured here.
To rotate logs in a custom directory, add a config file under /etc/logrotate.d/, for example /etc/logrotate.d/maddy:
cat /etc/logrotate.d/maddy
/srv/volume/maddy/log/maddy.log {
    weekly
    rotate 2
    compress
    missingok
    notifempty
}
journald
journald usually hogs disk space if left at its default configuration.
Reduce the journal to a specific size (e.g., 100 MB):
sudo journalctl --vacuum-size=100M
Or remove entries older than a given age:
sudo journalctl --vacuum-time=2weeks
Check current disk usage:
journalctl --disk-usage
Configure a permanent limit in /etc/systemd/journald.conf, then restart systemd-journald to apply it:
[Journal]
SystemMaxUse=200M
Docker System Usage
image/builder/volume
If Docker is eating up your disk, start with an overview of your Docker system. You can see below that images and the build cache are filling up the disk.
docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 7 6 7.491GB 7.242GB (96%)
Containers 5 2 938kB 159.7kB (17%)
Local Volumes 0 0 0B 0B
Build Cache 47 0 4.213GB 1.332GB
Some commands to prune Docker resources. Most of the time these are enough to free up some space.
docker builder prune
docker image prune
If you are brave enough (to do this in production), the commands below also remove all unused images and, with --volumes, unused volumes.
docker system prune -a
docker system prune -a --volumes
logs
Check the log output of each running container. Note that --tail 1 measures only the size of the most recent log line, not the total space used by the log file.
docker ps --format '{{.Names}}' | while read -r c; do
  echo "$c: $(docker logs --tail 1 "$c" 2>&1 | wc -c) bytes"
done

beszel-agent: 61 bytes
autodiscover: 0 bytes
traefik: 311 bytes
gerbil: 67 bytes
pangolin: 120 bytes
traefik-certs-dumper: 0 bytes
cloudflared: 169 bytes
newt: 59 bytes
dockge: 84 bytes
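To measure the space that container log files occupy on disk, one option is to ask Docker for the path of each log file and measure that file. This is a minimal sketch, assuming the json-file logging driver, where docker inspect exposes the path as .LogPath; the helper name log_sizes is made up for this example.

```shell
# log_sizes: read "<name> <path>" pairs on stdin and print
# "<bytes><TAB><name>" for each readable file, largest first.
# Pairs whose file is missing or unreadable are skipped.
log_sizes() {
    while read -r name path; do
        [ -r "$path" ] || continue
        printf '%s\t%s\n' "$(wc -c < "$path")" "$name"
    done | sort -rn
}

# Intended use (as root, since the log files are usually root-only):
#   docker ps -q | xargs -r docker inspect \
#       --format '{{.Name}} {{.LogPath}}' | log_sizes
```

Containers that use a different logging driver may report an empty LogPath and are then simply skipped.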
Limit the size of container log files by editing the Docker daemon config. The setting applies to containers created after the Docker daemon is restarted.
/etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
Or define it per service in a compose file.
services:
  app:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
Docker
Container overloading the system
If you deployed an application that needs more resources than your server can handle, first try to stop and remove the container.
docker stop <container_name>
docker rm <container_name>
If the stop command is not doing anything, try to force remove the container.
docker rm -f <container_name>
If all else fails, restart or kill the docker and containerd services.
systemctl restart containerd
systemctl restart docker

# or
systemctl kill docker
systemctl kill containerd

# do this if you know what you are doing
pkill -9 docker
pkill -9 containerd
pkill -9 containerd-shim
pkill -9 runc
If the container is still not being killed, find the PIDs stuck in the unkillable D state.
ps -eo pid,stat,cmd | awk '$2 ~ /^D/'
# SIGKILL only takes effect once the process leaves uninterruptible sleep
kill -9 <PID>
Monitor system load and restart the services.
systemctl restart containerd
systemctl restart docker