Troubleshooting Overview
Use a layer-by-layer process to avoid random fixes.
Standard Workflow
- Define symptom precisely (timeout, reset, DNS failure, packet loss).
- Reproduce with the smallest possible command.
- Locate failing layer (DNS, TCP, TLS, app).
- Capture evidence (ss, tcpdump, dig, logs).
- Change one variable and re-test.
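The evidence-capture step can be sketched as a small shell function that saves each command's output to its own file before anything is changed. This is a minimal sketch, not a definitive tool: `collect_evidence`, the file names, and the temp-directory location are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: collect_evidence is a hypothetical helper, and the file
# names and temp-directory location are assumptions.
collect_evidence() {
  local host="$1" dir
  dir="$(mktemp -d "${TMPDIR:-/tmp}/evidence.XXXXXX")" || return 1

  # Record when and for which host the evidence was taken.
  { date -u +%Y-%m-%dT%H:%M:%SZ; echo "host=$host"; } >"$dir/context.txt"

  # One file per command; a failing command still leaves its error text behind.
  dig "$host" >"$dir/dig.txt" 2>&1 || true
  ss -tan     >"$dir/ss.txt"  2>&1 || true

  # Print the directory so the caller can attach it to a ticket.
  echo "$dir"
}
```

Running `collect_evidence <host>` before and after a single change gives two directories that can be compared with `diff -r`.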
Fast Triage Matrix
| Symptom | Likely Layer | First Commands |
|---|---|---|
| Name does not resolve | DNS/Application | dig, nslookup |
| Timeout before connect | Network/Transport | traceroute, nc, ss |
| TLS handshake failed | Application/TLS | openssl s_client, curl -v |
| Slow but successful | Transport/App | ss -ti, request trace |
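For the "Slow but successful" row, one possible form of a request trace is curl's write-out timing, which splits a single request into per-layer phases. The helper name `time_request` and the 30-second cap are assumptions; the `%{time_*}` variables are standard curl write-out fields.

```shell
#!/usr/bin/env bash
# Sketch only: time_request is a hypothetical helper name.
time_request() {
  # curl reports each value as cumulative seconds since the request started.
  curl -sS -o /dev/null --max-time 30 \
    -w 'dns=%{time_namelookup} tcp=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
    "$1"
}
```

Because the values are cumulative, the gap between neighbours is what matters: a large jump from `tcp` to `tls` points at the handshake, and a large jump from `tls` to `ttfb` points at the application.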
Core Commands
Connectivity
ping -c 4 <host>
traceroute <host>
mtr -rw <host>
DNS
dig <host>
nslookup <host>
TCP and Port Reachability
ss -tulpen
nc -vz <host> <port>
Packet Capture
tcpdump -i any host <host> -nn
tcpdump -i any tcp port 443 -nn
Decision Trees
Cannot connect to service
- DNS resolves?
- Port reachable?
- TLS handshake succeeds?
- App returns expected response?
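The four questions above can be walked in order, stopping at the first "no". The following is a rough sketch under stated assumptions: `check_service` is a hypothetical helper, the service is assumed to speak HTTPS, the request path `/` is a placeholder, and treating any 2xx or 3xx status as the "expected response" is an assumption to replace with the real health criterion.

```shell
#!/usr/bin/env bash
# Sketch of the "Cannot connect to service" tree: run the checks in order
# and report the first layer that fails. check_service is a hypothetical
# helper; the "/" path and the 2xx/3xx success rule are assumptions.
check_service() {
  local host="$1" port="${2:-443}" status

  # 1. DNS resolves? dig +short prints nothing when no record comes back.
  if [ -z "$(dig +short "$host")" ]; then
    echo "FAIL dns"; return 1
  fi

  # 2. Port reachable? -z connects without sending data, -w caps the wait.
  if ! nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "FAIL port"; return 1
  fi

  # 3. TLS handshake succeeds? Closed stdin makes s_client exit once connected.
  if ! openssl s_client -connect "$host:$port" -servername "$host" \
       </dev/null >/dev/null 2>&1; then
    echo "FAIL tls"; return 1
  fi

  # 4. App returns expected response? Keep only the HTTP status code.
  status="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
            "https://$host:$port/" 2>/dev/null)"
  case "$status" in
    2??|3??) echo "OK $status" ;;
    *)       echo "FAIL app (status ${status:-none})"; return 1 ;;
  esac
}
```

Calling `check_service <host> <port>` prints a single line naming the first failing layer, which is the layer to capture evidence for next. Note that `openssl s_client` does not fail on an untrusted certificate by default, so step 3 only confirms that a handshake completed.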
Intermittent failure
- Compare failure rates by AZ, region, and network path.
- Check MTU and retransmissions.
- Correlate with deployments and traffic spikes.
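The MTU and retransmission checks might look like the sketch below. It assumes IPv4 and the Linux iputils `ping` (where `-M do` sets "don't fragment"); `icmp_payload`, `probe_mtu`, and `show_retrans` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Largest ICMP payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() { echo $(( $1 - 28 )); }

# Probe a path with "don't fragment" set (assumes Linux iputils ping).
# If the probe fails at 1500 but passes at a smaller MTU, suspect the path MTU.
probe_mtu() {
  local host="$1" mtu="${2:-1500}"
  ping -c 3 -M do -s "$(icmp_payload "$mtu")" "$host"
}

# Show per-connection TCP internals for sockets whose peer is the given IP
# address; when segments were retransmitted, ss -ti includes a retrans: field.
show_retrans() {
  ss -ti dst "$1"
}
```

For example, `probe_mtu <host> 1500` followed by `probe_mtu <host> 1400` narrows down whether large packets are being dropped on the path.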