ENGINEERING NOTE
Investigating Kubernetes deployment failures before blaming CI
A decision flow for narrowing down a failed Kubernetes deployment using cluster-level evidence first, and turning to CI orchestration once the cluster-side picture is clear.
- Kubernetes
- Debugging
- CI/CD
Introduction
This note describes the order in which I investigate a failed Kubernetes deployment. In the deployments I observed, starting from Kubernetes events and pod descriptions identified the cause faster than starting from the CI pipeline log, because the pipeline log usually shows the symptom rather than the underlying cluster behaviour.
The recurring problem
Deployments can fail intermittently across environments with a mix of pod-level, configuration, and pipeline-stage causes. The CI log often shows only that the deployment did not become healthy in time, which is not enough information to choose where to look next.
Why it is difficult
A pipeline failure message and an application failure message can look similar at a glance. A re-run sometimes passes even though the underlying fault is real and will recur, which encourages misclassifying genuine issues as flaky.
Practical approach
Use cluster-level evidence first: events, pod description, current and previous container logs, and configuration comparison. Only inspect CI orchestration after the cluster picture is clear, so the pipeline log is read as confirmation rather than as the primary signal.
Decision flow
- 01 Deployment failed. Start from cluster-level evidence rather than the CI log.
- 02 Check Kubernetes events. Scheduling, image-pull, mounting, probe, or resource issues often appear here first.
- 03 Describe the affected pod. Container state, restart count, conditions, events, image, mounted configuration, and probe settings.
- 04 Check image, configuration, probe, resource, and scheduling errors. Map each event or condition to one of these categories before going further.
- 05 Read current and previous container logs. Previous-container logs are essential after a restart - the current log may be empty or misleading.
- 06 Compare environment-specific configuration. ConfigMaps, Secrets, image tags, environment variables, resource requests, manifests.
- 07 Inspect CI orchestration. Only after the cluster-level evidence is understood, confirm the symptom matches the cause.
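The flow above can be sketched as an ordered list of kubectl invocations. This is a minimal sketch rather than the tooling used in these investigations: the function name `investigation_commands` and its arguments are hypothetical, and the commands are only built here, not executed.

```python
from typing import List


def investigation_commands(namespace: str, pod: str) -> List[List[str]]:
    """Return kubectl argument lists in the order of the decision flow.

    Nothing is executed; each list can be passed to subprocess.run if desired.
    """
    ns = ["-n", namespace]
    return [
        # Step 02: events in the namespace, oldest first.
        ["kubectl", "get", "events", *ns, "--sort-by=.metadata.creationTimestamp"],
        # Steps 03-04: container state, restart count, conditions, probe settings.
        ["kubectl", "describe", "pod", pod, *ns],
        # Step 05: current logs, then logs from before the last restart.
        ["kubectl", "logs", pod, *ns],
        ["kubectl", "logs", pod, *ns, "--previous"],
        # Step 06: dump ConfigMaps for comparison with a known-good environment.
        ["kubectl", "get", "configmap", *ns, "-o", "yaml"],
    ]


if __name__ == "__main__":
    # "staging" and "web-0" are placeholder values.
    for command in investigation_commands("staging", "web-0"):
        print(" ".join(command))
```

Step 07 is deliberately absent: the CI log is opened only once the output of these commands has been read.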
Start with events
Kubernetes events often identify scheduling, image-pull, volume-mounting, probe, or resource issues earlier than a generic pipeline failure message. They are a useful first stop because they describe what the cluster tried to do and where it stopped.
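As an illustration of sorting events into the categories named above, the sketch below groups Warning events from the JSON that `kubectl get events -o json` emits. The reason-to-category mapping is an assumption for illustration and is not exhaustive; the reasons that appear in practice vary with the Kubernetes version and the components involved, and the function names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Illustrative mapping only; extend it with the reasons seen in your cluster.
REASON_TO_CATEGORY = {
    "FailedScheduling": "scheduling",
    "FailedMount": "mounting",
    "FailedAttachVolume": "mounting",
    "Unhealthy": "probe",
    "Evicted": "resource",
    "OOMKilling": "resource",
}


def categorise(event: dict) -> str:
    """Map one event to a coarse category using its reason, then its message."""
    reason = event.get("reason", "")
    if reason in REASON_TO_CATEGORY:
        return REASON_TO_CATEGORY[reason]
    # Image-pull failures often carry generic reasons such as "Failed" or
    # "BackOff", so fall back to inspecting the message text.
    message = event.get("message", "").lower()
    if "image" in message and "pull" in message:
        return "image-pull"
    return "other"


def group_warnings(events_json: str) -> Dict[str, List[str]]:
    """Group the messages of Warning events by category."""
    grouped = defaultdict(list)
    for event in json.loads(events_json).get("items", []):
        if event.get("type") != "Warning":
            continue  # Normal events are progress reports, not failures.
        grouped[categorise(event)].append(event.get("message", ""))
    return dict(grouped)
```

Anything that lands in "other" is worth reading by hand; the point of the grouping is only to show quickly which category dominates.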
Describe the pod
A pod description surfaces container state, restart count, conditions, recent events, image information, mounted configuration, and readiness and liveness probe settings. Together these usually narrow the cause to a small number of categories.
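The same fields can be read programmatically from `kubectl get pod <name> -o json`. The sketch below is one possible reduction, assuming the standard `status.containerStatuses` and `status.conditions` layout of the Pod API; the names `ContainerSummary`, `summarise_pod`, and `failed_conditions` are hypothetical. It does not cover probe settings or mounted configuration, which live under `spec`.

```python
import json
from typing import List, NamedTuple, Optional


class ContainerSummary(NamedTuple):
    name: str
    image: str
    restart_count: int
    state: str  # "waiting", "running", "terminated", or "unknown"
    reason: Optional[str]  # reason attached to the current state, if any
    last_terminated_reason: Optional[str]  # why the previous instance ended


def summarise_pod(pod_json: str) -> List[ContainerSummary]:
    """Summarise each container in a pod's JSON representation."""
    status = json.loads(pod_json).get("status", {})
    summaries = []
    for cs in status.get("containerStatuses", []):
        state = cs.get("state") or {}
        # A populated state object has a single key naming the state.
        state_name = next(iter(state), "unknown")
        current = state.get(state_name) or {}
        previous = (cs.get("lastState") or {}).get("terminated") or {}
        summaries.append(
            ContainerSummary(
                name=cs.get("name", ""),
                image=cs.get("image", ""),
                restart_count=cs.get("restartCount", 0),
                state=state_name,
                reason=current.get("reason"),
                last_terminated_reason=previous.get("reason"),
            )
        )
    return summaries


def failed_conditions(pod_json: str) -> List[str]:
    """Return the pod condition types whose status is not "True"."""
    conditions = json.loads(pod_json).get("status", {}).get("conditions", [])
    return [c.get("type", "") for c in conditions if c.get("status") != "True"]
```

A container that is waiting with a non-zero restart count and a populated last-terminated reason is usually the one to follow into the logs.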
Check logs (current and previous)
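Current container logs show what the running process is saying now. Previous-container logs show what the process said before the last restart. After a CrashLoopBackOff or an OOMKilled termination, the previous logs are usually the ones that explain the failure, although an OOM-killed process may not have had the chance to log anything, in which case the termination reason in the pod description is the evidence.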
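A small sketch of that ordering, assuming the restart count has already been read from the pod status: when the container has restarted, the `--previous` logs are requested first. The function name `log_commands` is hypothetical, and the commands are built but not run.

```python
from typing import List


def log_commands(
    namespace: str, pod: str, container: str, restart_count: int
) -> List[List[str]]:
    """Return kubectl log commands, most informative first."""
    current = ["kubectl", "logs", pod, "-n", namespace, "-c", container]
    if restart_count > 0:
        # --previous prints the logs of the prior instance of the container,
        # which is where the reason for the restart is most likely recorded.
        return [current + ["--previous"], current]
    # No restart yet, so there is no previous instance to read.
    return [current]
```

Asking for `--previous` when the container has never restarted returns an error rather than an empty log, which is why the restart count is checked first.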
Compare configuration
Differences in ConfigMaps, Secrets, image tags, environment variables, resource requests, or deployment manifests can make an issue appear environment-specific. A structured comparison against a known-good environment often turns 'flaky in staging' into a specific configuration delta.
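One way to make the comparison structured is a key-by-key delta between two flat mappings, for example the `data` field of the same ConfigMap fetched from two environments. This is a minimal sketch; `config_delta`, the `ABSENT` marker, and the `redact` option (intended for Secrets, so values are not printed) are assumptions introduced for illustration.

```python
from typing import Any, Dict, Tuple

ABSENT = "<absent>"  # marker for a key missing from one side


def _mask(value: Any) -> Any:
    """Hide a value while still showing whether the key was present."""
    return value if value == ABSENT else "<redacted>"


def config_delta(
    known_good: Dict[str, Any], failing: Dict[str, Any], redact: bool = False
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (known_good_value, failing_value)} for keys that differ."""
    delta = {}
    for key in sorted(set(known_good) | set(failing)):
        good = known_good.get(key, ABSENT)
        bad = failing.get(key, ABSENT)
        if good != bad:
            delta[key] = (_mask(good), _mask(bad)) if redact else (good, bad)
    return delta
```

An empty result means the two inputs match on every key, which rules out that object as the source of the environment-specific behaviour and moves attention to the next one (image tags, resource requests, manifests).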
Inspect CI after Kubernetes evidence
Pipeline orchestration should be investigated after determining whether the cluster rejected, failed, or started the workload incorrectly. By that point the CI log usually confirms the cluster-side cause rather than introducing a new theory.
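The three cluster-side outcomes can be made explicit before the CI log is opened. The sketch below is one interpretation, not a definitive rule: the function name, the outcome labels, and the set of waiting reasons treated as "never started" are assumptions, and the inputs are values already read from the pod status.

```python
from typing import Optional

# Waiting reasons suggesting the container never started.
# Illustrative and not exhaustive.
NEVER_STARTED = {
    "ErrImagePull",
    "ImagePullBackOff",
    "CreateContainerConfigError",
    "InvalidImageName",
}


def cluster_outcome(
    scheduled: bool, waiting_reason: Optional[str], restart_count: int, ready: bool
) -> str:
    """Classify what the cluster did with the workload."""
    if not scheduled:
        return "rejected"  # the pod was never placed on a node
    if waiting_reason in NEVER_STARTED:
        return "failed-to-start"  # placed, but the container could not start
    if restart_count > 0 or not ready:
        return "started-incorrectly"  # started, but crashed or never became ready
    return "healthy-per-cluster"  # nothing wrong from the cluster's point of view
```

Only the last outcome points towards the pipeline itself as the place to keep looking; the other three already name a cluster-side cause that the CI log should merely confirm.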
One important decision: cluster evidence first, pipeline log second
Reading Kubernetes events and pod descriptions before opening the CI log added a few minutes at the start of an investigation. In the failures I investigated, it still reached the actual cause faster overall, because it avoided time spent on pipeline theories that the cluster could already disprove.
Limitations
- Some failures originate in external dependencies (registries, network, cloud provider) and need evidence from outside the cluster
- Some errors disappear before investigation begins, especially after a retry
- Kubernetes events have limited retention (the kube-apiserver --event-ttl flag defaults to one hour; managed clusters may differ)
- A pod reporting healthy does not guarantee the application is behaving correctly
When this approach does not apply
Networking, cloud-provider, storage, or purely application-level failures may require investigation beyond the pod and the pipeline. If the failure is clearly outside the cluster - for example a build step that never produced an image - the CI log is the right starting point instead.
Conclusion
In the observed deployments, the single most useful habit was reading Kubernetes events and the pod description before opening the CI log. The decision flow above is the shape that habit ended up taking.