ENGINEERING NOTE · 3 min read

Investigating Kubernetes deployment failures before blaming CI

A decision flow for narrowing down a failed Kubernetes deployment using cluster-level evidence first, and turning to CI orchestration once the cluster-side picture is clear.

Kubernetes
Debugging
CI/CD

Introduction

This note describes the order in which I investigate a failed Kubernetes deployment. In the observed deployments, starting from Kubernetes events and pod descriptions identified the cause faster than starting from the CI pipeline log, because the pipeline log usually shows the symptom rather than the underlying cluster behaviour.

The recurring problem

Deployments can fail intermittently across environments with a mix of pod-level, configuration, and pipeline-stage causes. The CI log often shows only that the deployment did not become healthy in time, which is not enough information to choose where to look next.

Why it is difficult

A pipeline failure message and an application failure message can look similar at a glance. Re-running the pipeline sometimes succeeds even though the underlying fault is still present, which encourages classifying real issues as flaky.

Practical approach

Use cluster-level evidence first: events, pod description, current and previous container logs, and configuration comparison. Inspect CI orchestration only after the cluster picture is clear, so that the pipeline log is read as confirmation rather than as the primary signal.

Decision flow

01
Deployment failed
Start from cluster-level evidence rather than the CI log.
02
Check Kubernetes events
Scheduling, image-pull, volume-mounting, probe, or resource issues often appear here first.
03
Describe the affected pod
Container state, restart count, conditions, events, image, mounted configuration, and probe settings.
04
Check image, configuration, probe, resource, and scheduling errors
Map each event or condition to one of these categories before going further.
05
Read current and previous container logs
Previous-container logs are essential after a restart, because the current log may be empty or misleading.
06
Compare environment-specific configuration
ConfigMaps, Secrets, image tags, environment variables, resource requests, manifests.
07
Inspect CI orchestration
Once the cluster-level evidence is understood, confirm that the pipeline symptom matches the cluster-side cause.

Start with events

Kubernetes events often identify scheduling, image-pull, volume-mounting, probe, or resource issues earlier than a generic pipeline failure message. They are a useful first stop because they describe what the cluster tried to do and where it stopped.
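
As a rough sketch of this first stop, the snippet below groups Warning events into the categories named above. It assumes the input is the parsed JSON from `kubectl get events -n <namespace> -o json`. The reason-to-category mapping and the function names are illustrative assumptions, not an exhaustive or official list.

```python
from collections import defaultdict

# Assumed mapping from an event's `reason` to the categories used in this note.
# Illustrative only: clusters emit many more reasons than are listed here.
REASON_CATEGORY = {
    "FailedScheduling": "scheduling",
    "FailedMount": "volume-mounting",
    "FailedAttachVolume": "volume-mounting",
    "Unhealthy": "probe",
    "Evicted": "resource",
}


def categorize_event(event):
    """Map one event object to a coarse category."""
    reason = event.get("reason", "")
    message = event.get("message", "").lower()
    if reason in REASON_CATEGORY:
        return REASON_CATEGORY[reason]
    # Assumption: image-pull problems are recognised by a generic
    # "Failed" or "BackOff" reason whose message mentions the image.
    if reason in ("Failed", "BackOff") and "image" in message:
        return "image-pull"
    return "other"


def group_warnings(event_list):
    """Group the messages of Warning events by category."""
    grouped = defaultdict(list)
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue  # Normal events (Pulled, Started, ...) are skipped
        grouped[categorize_event(event)].append(event.get("message", ""))
    return dict(grouped)
```

Anything that lands in "other" still deserves a read; the grouping only decides what to look at first.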

Describe the pod

A pod description surfaces container state, restart count, conditions, recent events, image information, mounted configuration, and readiness and liveness probe settings. Together these usually narrow the cause to a small number of categories.
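
The same fields can be pulled out programmatically. The sketch below assumes a pod object parsed from `kubectl get pod <name> -n <namespace> -o json` and reduces each container status to the values that usually matter first. The function name and the keys of the returned dictionaries are hypothetical.

```python
def summarize_containers(pod):
    """Return one summary dict per container in a parsed pod object."""
    summaries = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting", {})
        last_exit = status.get("lastState", {}).get("terminated", {})
        summaries.append({
            "name": status.get("name"),
            "image": status.get("image"),
            "ready": status.get("ready", False),
            "restarts": status.get("restartCount", 0),
            # Why the container is not running now, e.g. CrashLoopBackOff
            "waiting_reason": waiting.get("reason"),
            # Why the previous instance stopped, e.g. OOMKilled
            "last_exit_reason": last_exit.get("reason"),
            "last_exit_code": last_exit.get("exitCode"),
        })
    return summaries
```

A non-zero restart count together with a last exit reason is often enough to pick the category before reading any logs.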

Check logs (current and previous)

Current container logs show what the running process is saying now. Previous-container logs show what the process said before the last restart. After a CrashLoopBackOff or an OOMKilled termination, the previous logs are usually the ones that explain the failure.
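
A small sketch of that ordering: the helper below builds the `kubectl logs` invocations to run and puts the `--previous` variant first whenever the container has restarted. The helper name and its parameters are hypothetical; it only assembles argument lists and does not run anything.

```python
def log_commands(pod_name, namespace, container, restart_count):
    """Return kubectl argument lists in the order they should be read."""
    current = ["kubectl", "logs", pod_name, "-n", namespace, "-c", container]
    commands = [current]
    if restart_count > 0:
        # After a restart, the previous instance's log goes first because
        # the current log may be empty or misleading.
        commands.insert(0, current + ["--previous"])
    return commands
```

Each returned list could be passed to `subprocess.run`. With a restart count of zero, only the current-log command is returned, since there is no previous instance to read.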

Compare configuration

Differences in ConfigMaps, Secrets, image tags, environment variables, resource requests, or deployment manifests can make an issue appear environment-specific. A structured comparison against a known-good environment often turns 'flaky in staging' into a specific configuration delta.
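
One minimal way to structure that comparison, assuming the configuration of each environment has already been loaded into a flat dictionary (for example the `data` field of a ConfigMap, or a container's environment variables). The function name and result keys are hypothetical. The sketch reports key names only, so it can be pointed at Secret data without printing values.

```python
def config_delta(known_good, suspect):
    """Compare two flat config dicts and report which keys differ."""
    good_keys = set(known_good)
    suspect_keys = set(suspect)
    return {
        "only_in_known_good": sorted(good_keys - suspect_keys),
        "only_in_suspect": sorted(suspect_keys - good_keys),
        "changed": sorted(
            key for key in good_keys & suspect_keys
            if known_good[key] != suspect[key]
        ),
    }
```

An empty result in all three lists is also informative: it rules configuration out and points back to events, scheduling, or resources.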

Inspect CI after Kubernetes evidence

Pipeline orchestration should be investigated after determining whether the cluster rejected, failed, or started the workload incorrectly. By that point the CI log usually confirms the cluster-side cause rather than introducing a new theory.
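
The three outcomes can be approximated from the pod object itself. The sketch below is one possible reading, not a definitive rule: it treats an unschedulable pod as "rejected", a container stuck in a failing wait state or terminated with a non-zero exit code as "failed", and a scheduled pod whose containers are not all ready as "started incorrectly". The function name, the returned labels, and the set of wait reasons are assumptions.

```python
# Assumed set of waiting reasons treated as outright failures.
FAILING_WAIT_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerConfigError",
}


def cluster_outcome(pod):
    """Classify a parsed pod object into a coarse cluster-side outcome."""
    status = pod.get("status", {})
    # Condition statuses are the strings "True", "False", or "Unknown".
    conditions = {c.get("type"): c.get("status")
                  for c in status.get("conditions", [])}
    if conditions.get("PodScheduled") == "False":
        return "rejected"

    containers = status.get("containerStatuses", [])
    for container in containers:
        state = container.get("state", {})
        if state.get("waiting", {}).get("reason") in FAILING_WAIT_REASONS:
            return "failed"
        terminated = state.get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return "failed"

    if containers and all(c.get("ready") for c in containers):
        return "started"
    return "started incorrectly"
```

A result of "started" only means the pod reports ready, which does not by itself show that the application behaves correctly.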

One important decision: cluster evidence first, pipeline log second

Reading Kubernetes events and pod descriptions before opening the CI log added a few minutes at the start of an investigation. In the failures I investigated, it still converged on the actual cause faster overall, because it avoided spending time on pipeline theories that the cluster could already disprove.

Limitations

Some failures originate in external dependencies (registries, network, cloud provider) and need evidence from outside the cluster
Some errors disappear before investigation begins, especially after a retry
Kubernetes events have limited retention (one hour by default on the API server, though managed clusters may differ)
A pod reporting healthy does not guarantee the application is behaving correctly

When this approach does not apply

Networking, cloud-provider, storage, or purely application-level failures may require investigation beyond the pod and the pipeline. If the failure is clearly outside the cluster, for example a build step that never produced an image, the CI log is the right starting point instead.

Conclusion

In the observed deployments, the single most useful habit was reading Kubernetes events and the pod description before opening the CI log. The decision flow above is the shape that habit ended up taking.