Lessons for Production Engineering: A Linux Debugging Field Guide
Over the years of responding to production incidents and performance issues, I've accumulated a collection of commands and techniques that have proven invaluable when things go sideways. This isn't meant to be an exhaustive reference. Think of it more as a field guide compiled from battles in the trenches of production systems.
When You First Land on a Troubled Server
Picture this: you SSH into a server that's experiencing issues. Where do you start?
Know Your Context
Before diving into diagnostics, orient yourself. Who are you, and what permissions do you have?
whoami # your current username
id # your user ID, group ID, and group memberships
groups # list all groups you belong to
See who else is on the system:
who # currently logged in users
w # who is logged in and what they're doing
last # history of logins
This context matters more than you might think. If you're not root or don't have sudo access, you'll need to request elevated permissions. And if you see multiple engineers logged in during an incident, coordinate to avoid stepping on each other's troubleshooting efforts.
Check your current privileges:
sudo -l # list what sudo commands you can run
If you need to switch users or elevate privileges:
sudo -i # become root with root's environment
Figure Out What's Running
Now that you know who you are and what you can do, the first technical goal is simple: figure out what's actually running on this box.
Start by looking at the heaviest resource consumers. I typically reach for a combination of ps and top:
ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -20
This gives you the top processes sorted by CPU usage. The core application is usually the long-running process consuming the most resources, often owned by a dedicated service account rather than root, though that heuristic doesn't always hold.
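To sanity-check a candidate process, a quick owner-and-age lookup helps; in this sketch, $$ (the current shell's own PID) stands in for whatever PID you found with ps:

```shell
# Show owner, elapsed run time, and full command line for one PID.
# $$ is a stand-in here; substitute a real candidate PID.
line=$(ps -o user=,etime=,args= -p $$)
echo "$line"
```

A months-old elapsed time under a service account is a strong hint you've found the main application.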
For a hierarchical view, pstree can be illuminating:
pstree -ap
Example output:
systemd(1) --system
├─sshd(742) -D
│ └─sshd(1331) -D
│ └─bash(1340)
│ ├─vim(1402) /etc/ssh/sshd_config
│ ├─python3(1455) script.py --input data.csv
│ └─pstree(1499) -ap
├─cron(601)
├─dbus-daemon(512) --system --address=systemd: --nofork --nopidfile
├─nginx(812) -g daemon off;
│ ├─nginx(813) -g daemon off;
│ └─nginx(814) -g daemon off;
└─docker-containerd(900) --config /etc/docker/containerd/config.toml
└─containerd-shim(921) -namespace moby -id abc123
└─python(950) app.py --port 8080
This shows parent-child relationships between processes. You might see something like a supervisor process managing multiple workers, which immediately tells you about the application architecture.
Don't forget to check what services systemd is managing:
systemctl list-units --type=service --state=running
And see what's listening on the network:
ss -lntp
That last command often reveals the primary application: if you see Java listening on port 8080 or Python on 5000, you've probably found your app.
Understanding CPU Pressure
When CPU becomes the bottleneck, you need a systematic approach. I always start with load averages from uptime:
$ uptime
9:04pm up 268 day(s), 10:16, 2 users, load average: 7.76, 8.32, 8.60
Those three numbers (1-minute, 5-minute, and 15-minute averages) tell a story. If they're increasing, your problem is getting worse. If they're decreasing, you might have already missed the peak of the incident.
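Load averages only mean something relative to the core count. A minimal sketch (Linux-only, reading /proc/loadavg) that puts the numbers in context:

```shell
# Compare the 1-minute load average against the number of CPU cores;
# sustained load above the core count means runnable work is queuing.
read one five fifteen rest < /proc/loadavg
cores=$(nproc)
echo "1-min load: $one across $cores cores"
# Integer comparison: strip the fractional part of the load average
if [ "${one%.*}" -ge "$cores" ]; then
    echo "load exceeds core count: CPU is likely saturated"
fi
```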
Next, I might run vmstat with a one-second interval to watch system-wide CPU utilization:
vmstat 1
The key columns are us (user time), sy (system/kernel time), id (idle), and wa (waiting on I/O). When us + sy approaches 100%, the CPUs are saturated. The r column shows runnable threads; a value consistently above your CPU core count means processes are queuing for CPU time, which shows up as scheduler latency.
For more granular analysis, mpstat breaks things down per-CPU,
helping you spot thread scalability problems where a single hot CPU indicates work that can't be parallelized effectively.
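A quick per-CPU look might be the following (mpstat ships with the sysstat package, so this sketch falls back to the raw counters in /proc/stat if it's missing):

```shell
# Five one-second samples for every CPU. One core pegged near 100%
# while the rest sit idle points to a serialized hot path.
if command -v mpstat >/dev/null 2>&1; then
    mpstat -P ALL 1 5
else
    echo "mpstat not installed; raw per-CPU jiffy counters instead:"
    grep '^cpu[0-9]' /proc/stat
fi
ok=$?
```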
Network Debugging
Network issues can be subtle. Sometimes connectivity looks fine at first glance but breaks down under specific conditions. Here's my standard toolkit:
For DNS resolution, I prefer dig over nslookup for its cleaner output:
dig +short example.fsx.us-east-1.aws.internal A
dig +short example.fsx.us-east-1.aws.internal AAAA
To test port connectivity, netcat is your friend:
# IPv4
nc -4zv target-host.example.com 2049
# IPv6
nc -6zv target-host.example.com 2049
When you need to see all active connections, ss is preferred to the older netstat:
ss -tulnp
Or filter for a specific port:
ss -ltnp 'sport = :2049'
Pay attention to the state of connections. Multiple ESTABLISHED connections to the same database port might indicate connection pooling issues or saturation. A large number of TIME_WAIT states could suggest connection churn.
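A quick way to see the state distribution at a glance:

```shell
# Count TCP sockets by state; spikes in TIME_WAIT (connection churn)
# or CLOSE_WAIT (the app isn't closing sockets) stand out immediately.
states=$(ss -tan | awk 'NR > 1 {count[$1]++} END {for (s in count) print count[s], s}' | sort -rn)
echo "$states"
```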
For real-time bandwidth monitoring between specific hosts, iftop shows you exactly which connections are consuming bandwidth.
And when things get really mysterious, it's time to capture packets:
tcpdump -i any -nn -w capture.pcap \
'tcp and ((src port 761 and dst port 2049) or (src port 2049 and dst port 761))'
You can then analyze the resulting .pcap file in Wireshark to see exactly what's happening on the wire.
Memory Management and the OOM Killer
Few things are more frustrating than processes being killed by the Out-Of-Memory killer. When you suspect this is happening, check the kernel logs:
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"
View current memory state with:
free -m
In a pinch, you can add temporary swap space, though this is a band-aid, not a solution:
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Be warned: adding swap often just masks underlying memory leaks. If memory consumption continues to grow, remove the swap and address the root cause.
Filesystem Mysteries
Disk space issues can manifest in confusing ways. The df command gives you a quick overview:
df -h # disk usage
df -i # inode usage
Note that df and du can disagree: df reports usage at the filesystem level, while du sums the files it can see in the directory tree. A file that was deleted but is still held open by a process counts toward df but not du (lsof +L1 will find those). To see where space is going within a tree, use du:
du -sh /var/*
A classic scenario: you get an error about /var being full. Run df to confirm, then use du to drill down through subdirectories and find the culprit; often log files weren't rotated properly or grew faster than expected.
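My usual drill-down loop goes one level at a time:

```shell
# Largest entries directly under /var; -x keeps du on this filesystem
# so bind mounts and other filesystems don't pollute the numbers.
biggest=$(du -xh --max-depth=1 /var 2>/dev/null | sort -rh | head -10)
echo "$biggest"
```

Repeat on the biggest subdirectory until you hit the offending files.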
Understanding the difference between symbolic and hard links can save you from mistakes. Symbolic links point to filenames and can cross filesystem boundaries:
ln -s /path/to/original /path/to/symlink
Hard links create multiple directory entries pointing to the same inode. They can't cross filesystems, but the file persists until all hard links are deleted:
ln /path/to/file /path/to/hardlink
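A scratch-directory demo makes the distinction concrete:

```shell
# Hard link vs. symlink: the hard link shares the original's inode,
# while the symlink is a separate inode that stores the target name.
cd "$(mktemp -d)"
echo "hello" > original
ln original hardlink
ln -s original symlink
ls -li original hardlink symlink   # first column: inode numbers
inode_orig=$(stat -c %i original)
inode_hard=$(stat -c %i hardlink)
rm original          # data survives: hardlink still references the inode
cat hardlink         # prints "hello"; symlink now dangles
```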
Working with NFS
When debugging NFS issues, you often need to verify both connectivity and protocol specifics. Mount with explicit parameters:
# IPv4
mount -t nfs -o proto=tcp,nfsvers=4.1,rsize=1048576,wsize=1048576,timeo=600 \
nfs-server.example.com:/export /mnt/nfs
# IPv6
mount -t nfs -o proto=tcp6,nfsvers=4.1,rsize=1048576,wsize=1048576,timeo=600 \
nfs-server.example.com:/export /mnt/nfs
Check NFS statistics to see what's actually happening:
nfsstat -c # client stats
nfsstat -s # server stats
And don't forget rpcinfo to verify the RPC services are responding:
rpcinfo -p nfs-server.example.com
Process Debugging
When a process misbehaves, you need to see what it's doing. For kernel stack traces of all its threads (reading /proc/<pid>/task/<tid>/stack requires root):
PID=2038
for tid in $(ls /proc/$PID/task); do
echo "=== Thread $tid ==="
cat /proc/$PID/task/$tid/stack
done
See what files a process has open:
lsof -p <pid>
Attach strace to watch system calls in real-time:
strace -p <pid> -s 100 -f
And to find which process is using a specific port:
lsof -i :8080
Advanced Techniques: CPU Pinning
On NUMA systems or when you need predictable performance, CPU pinning can make a significant difference. The idea is to bind processes to specific CPUs, improving cache locality and reducing context switches.
Use taskset for quick experiments:
# Run a new process on CPU 0
taskset -c 0 ./my_application
# Move existing process to CPUs 0 and 1
taskset -cp 0,1 1234
For more sophisticated control, use cgroups to create dedicated CPU sets (the paths below are for the cgroup-v1 cpuset controller; on cgroup-v2 systems the hierarchy differs):
mkdir -p /sys/fs/cgroup/cpuset/my_app
cd /sys/fs/cgroup/cpuset/my_app
echo 2-3 > cpuset.cpus
echo 0 > cpuset.mems
echo 1234 > cgroup.procs
The process is now restricted to CPUs 2 and 3. Note that other tasks can still be scheduled on those CPUs unless you also fence them off (for example with cpuset.cpu_exclusive or the isolcpus boot parameter). This is particularly valuable for latency-sensitive workloads like databases or real-time applications.
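To confirm where a process is actually allowed to run, check its affinity mask; $$ (the current shell) stands in for a real PID here:

```shell
# The kernel tracks an allowed-CPU mask per task; both views below
# should agree with whatever taskset/cpuset configuration you applied.
command -v taskset >/dev/null 2>&1 && taskset -cp $$
grep Cpus_allowed_list /proc/$$/status
```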
Some Useful Shortcuts
A few commands I use constantly that don't fit neatly into categories:
Elevate to root and stay there:
sudo su -    # full root login shell (sudo -i is equivalent)
Delete every file under the current directory except one specific file (note this recurses into subdirectories):
find . -type f ! -name "file_to_keep" -delete
Start the SSH agent if it's not running:
eval "$(ssh-agent -s)"
Check firewall rules:
iptables -S
ip6tables -S
Scenario: Out-Of-Memory (OOM)
Cheat-line: Check logs for OOM → Restart service → Temporarily add swap → Increase memory limits → Escalate
What's happening: Your server has exhausted all available RAM, forcing Linux's OOM killer to terminate processes. Critical services may crash or enter a restart loop.
Red Flags:
- Sudden service crash without application errors
- Kernel logs showing "Out of memory" or "Killed process" messages
- Services repeatedly restarting
Quick Confirm:
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"
free -m # shows memory in MB
top # press M to sort by memory usage
Immediate Mitigation:
# Restart affected service
sudo systemctl restart <service-name>
# Add temporary swap (fast local SSD only)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
If you suspect a recent deployment is at fault, consider rolling back and redeploying the previous version of the code.
Warning: Creating swap may hide underlying memory leaks. Remove it as soon as possible:
sudo swapoff /swapfile && sudo rm /swapfile
Long-Term Fix:
- Diagnose memory leaks by sharing memory graphs with developers
- For C++: provide core dumps for analysis with gdb/Valgrind
- For Python: use tracemalloc in test environments
- Implement cache size limits and TTL settings
- Configure proper memory limits in container orchestration
Beginner Tip: If you see "OOM" in kernel logs, restart the killed process and monitor with top to check if memory consumption climbs again.
Scenario: Detecting Memory Leaks
Memory leaks occur when programs continuously allocate memory without releasing it, eventually consuming all available RAM.
Step-by-Step Detection:
# 1. SSH into server
ssh user@your-server-ip
# 2. Check system-wide memory usage
free -m
# 3. Find memory-consuming processes
top -o %MEM # Press Shift+M to sort by memory
# Or:
ps aux --sort=-%mem | head
# 4. Confirm memory is not being freed
watch -n 2 'ps -o pid,cmd,%mem,%cpu --sort=-%mem | head'
# Let this run for several minutes—if %MEM keeps increasing, you have a leak
# 5. Check file descriptor leaks
sudo lsof -p <PID> | wc -l
# Run every 10-30 seconds—if count keeps rising, you have an FD leak
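The FD check above can be wrapped in a tiny sampling loop; $$ (this shell) stands in for the suspect PID:

```shell
# Sample the open file descriptor count a few times; a count that only
# ever rises across many samples is the signature of an FD leak.
PID=$$
for i in 1 2 3; do
    count=$(ls /proc/$PID/fd | wc -l)
    echo "sample $i: $count open fds"
    sleep 1
done
```

In a real investigation you'd sample for minutes, not seconds, and correlate the rise with application activity.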
For C/C++ Applications:
# Stop the server and run under Valgrind
valgrind --leak-check=full ./myapp
Sample output showing a leak:
==1234== 10,240 bytes in 2 blocks are definitely lost in loss record 42 of 78
==1234== at 0x4C2FB55: malloc (vg_replace_malloc.c:299)
==1234== by 0x401234: some_func (main.c:25)
==1234== by 0x401567: main (main.c:45)
For Python Applications:
import tracemalloc
tracemalloc.start()
# ... run your code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
print(stat)
Heap Profiling:
- Use gperftools (tcmalloc) for live heap profiling
- Use jemalloc for heap introspection and runtime stats
Scenario: Debugging Failed Mount Points
When ls fails on a mount point, you need to understand what went wrong with the filesystem attachment.
Understanding Mounting:
- Mounting makes a filesystem accessible at a directory (mount point)
- Common types: local (ext4, xfs), network (NFS, CIFS), virtual (tmpfs, proc)
Debugging Strategy:
# 1. Check current mounts
mount | grep /mnt/myshare
# Or:
findmnt /mnt/myshare
# 2. Verify mount point directory
ls -ld /mnt/myshare
# 3. Check kernel logs
dmesg | tail -50
# Sample problematic output:
# [223.938439] nfs: server 192.168.1.10 not responding, still trying
# [224.942345] nfs: mount option 'vers=3' not supported
# 4. Check syslog
grep -i mount /var/log/syslog
# 5. Check what's using the mount
lsof +D /mnt/myshare
# 6. Verify network connectivity (for NFS/CIFS)
ping 192.168.1.10
nc -zv 192.168.1.10 2049 # NFS port (telnet works too if installed)
# 7. Use strace to debug mount syscalls
strace -f mount -t nfs 192.168.1.10:/export /mnt/myshare
Common Failure Scenarios:
Scenario: ls /mnt/myshare hangs
- Root cause: NFS server unreachable or hard mount hanging
- Solution: Use a soft mount with a short timeout (timeo is in tenths of a second; soft mounts can return I/O errors to applications on timeout, so prefer them for read-mostly data)
mount -t nfs -o soft,timeo=2 192.168.1.10:/export /mnt/myshare
Scenario: "mount: wrong fs type, bad option, bad superblock"
# Debug:
blkid /dev/sdb1
file -s /dev/sdb1
fsck /dev/sdb1
Scenario: Mount succeeds but ls shows I/O error
# Check for disk failure or corruption
dmesg | grep -i error
Advanced Concepts:
Lazy Unmounts:
# Useful when mount point is busy
umount -l /mnt/myshare
Remounting:
# Change mount options without unmounting
mount -o remount,ro /mnt/myshare
Mount Namespaces (for containers):
# List mount namespaces
lsns -t mnt
# Enter a namespace
nsenter -t <PID> -m
Essential Concepts & Glossary
Key Technical Terms
Autoscaling – Automatically adjusting the number of running instances based on demand metrics (CPU, request volume, etc.)
Circuit Breaker – A resilience pattern that stops requests to a failing service to prevent cascading failures
Connection Pool – A cache of reusable database connections to avoid connection establishment overhead
Core Dump – A snapshot of a program's memory when it crashes, used for debugging
Deadlock – When processes wait on each other's resources indefinitely, preventing progress
Inode – Filesystem metadata structure; you can run out of inodes even with disk space available
JWT (JSON Web Token) – Stateless authentication token with embedded timestamps
Load Average – Number of runnable processes plus, on Linux, tasks in uninterruptible (disk) sleep; sustained values above the CPU core count indicate overload
Memory Leak – When a process allocates memory but never releases it
OOM Killer – Linux kernel mechanism that forcefully terminates processes when RAM is exhausted
P95/P99 Latency – The latency threshold below which 95%/99% of requests fall (tail latency)
TTL (Time To Live) – Duration before a cache entry or DNS record expires
%wa (I/O Wait) – CPU time spent waiting for disk I/O; high values (≥10%) indicate disk bottlenecks
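The P95 definition above can be computed with nothing more than sort and awk (nearest-rank method; the latency values are made up for illustration):

```shell
# Nearest-rank p95: sort ascending, take the ceil(0.95 * N)-th value.
latencies="120 85 430 95 110 70 100 90 105 2000"
p95=$(printf '%s\n' $latencies | sort -n | awk '
    {v[NR] = $0}
    END {idx = int(NR * 0.95); if (idx < NR * 0.95) idx++; print v[idx]}')
echo "p95 = ${p95}ms"   # prints "p95 = 2000ms"
```

Note how a single 2000ms outlier dominates the tail even though the median is near 100ms; that is exactly why tail latency gets its own metric.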
Critical Command Reference
# Memory & Process Analysis
free -m # Memory usage in MB
top # Real-time process monitor (M=sort by memory, P=CPU)
ps aux --sort=-%mem # List processes by memory usage
dmesg | grep -i oom # Check for out-of-memory events
# Disk & I/O
df -h # Disk usage (human-readable)
df -i # Inode usage
du -x -m /path | sort -n # Find largest directories
iostat -x 1 5 # I/O statistics
# System Performance
uptime # Load averages
vmstat 1 5 # Virtual memory statistics
lsof +D /path # List open files in directory
# Network & Services
curl -I http://host/health # Check HTTP health endpoint
systemctl status service # Check service status
journalctl -u service # View service logs
Conclusion: Building Your Intuition
Operating production systems requires equal parts knowledge, intuition, and systematic thinking. The incidents you respond to today build the pattern recognition that will make you faster tomorrow.
These commands are tools, but knowing when to use them comes from experience. The key is developing a methodology: start broad, narrow down based on what you find, and always verify your hypotheses.
Keep this guide handy. Bookmark it, print it, adapt it to your environment. Most importantly, when you discover a new technique that saves you during an incident, add it to your own collection. The best debugging toolkit is the one you build yourself, refined through many late-night incidents and triumphant resolutions.
Stay curious, stay systematic, and may your production systems remain stable.