ENGINEERING NOTES
Practical, sanitized engineering notes. No client names, no internal system names, no proprietary details - the engineering shape of the problem, a checklist or decision flow, and the limits of the approach.
Reducing Keycloak configuration drift with repeatable automation
A practical note on driving Keycloak realm, client, and identity-provider configuration from code so the same desired state can be re-applied across environments.
Introduction
In this note I describe a general pattern for treating Keycloak configuration as desired state and reconciling it through automation, rather than configuring each environment through the admin UI. The goal is to make configuration repeatable and reviewable, not to claim that all authentication issues come from configuration.
The recurring problem
Authentication behaviour can differ across environments when realm, client, redirect URI, identity-provider, or authentication-flow configuration is changed by hand. Small differences between environments tend to surface as intermittent login or token failures that are hard to attribute, because the runtime symptom rarely names the misconfigured field.
Why it is difficult
Manual changes through the admin UI are easy to make but invisible to source control, so there is no shared record of what changed, when, or why. Drift accumulates slowly and is usually noticed only when a specific flow breaks in one environment.
Practical approach
Express the configuration that matters as a desired state, read the existing state from the Keycloak Admin REST API, and reconcile the two in a way that is safe to re-run. Keep the scope narrow: only the fields the workflow is willing to own should be reconciled. Everything else should be left alone so the automation does not silently overwrite changes it does not understand.
Sanitized setup sequence
- 01 Retrieve an administrative access token
- 02 Read the current realm or client configuration
- 03 Compare approved properties with the desired configuration
- 04 Create resources that are missing
- 05 Update approved properties that differ
- 06 Validate redirect URIs and authentication settings
- 07 Return a specific error when configuration validation fails
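Steps 06 and 07 can be sketched as a small validation function. This is a minimal sketch under an assumed policy (https everywhere except local hosts, no bare wildcard, no wildcard in the host); the rules, the names `check_redirect_uri` and `validate_redirect_uris`, and the error class are hypothetical and would need to match the project's own requirements.

```python
from urllib.parse import urlsplit


class ConfigValidationError(Exception):
    """Raised with a message that names the client and each rejected URI."""


# Assumed policy for illustration only; adjust to the project's own rules.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def check_redirect_uri(uri):
    """Return a reason string if the URI is rejected, otherwise None."""
    if uri.strip() == "*":
        return "bare wildcard allows any redirect target"
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        return "scheme must be http or https"
    if "*" in parts.netloc:
        return "wildcard in host is not allowed"
    if parts.scheme == "http" and parts.hostname not in LOCAL_HOSTS:
        return "plain http is only allowed for local hosts"
    return None


def validate_redirect_uris(client_id, uris):
    """Raise one specific error listing every rejected URI for the client."""
    problems = []
    for uri in uris:
        reason = check_redirect_uri(uri)
        if reason is not None:
            problems.append(f"{uri!r}: {reason}")
    if problems:
        raise ConfigValidationError(f"client {client_id!r}: " + "; ".join(problems))
```

Collecting every problem before raising means one failed run reports all rejected URIs for a client, rather than one per run.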
Desired state versus existing state
The general idea is to read the desired configuration from a checked-in source, read the existing configuration from Keycloak, and compare only the approved fields. From that comparison the workflow creates resources that are missing, updates only intended differences, validates critical settings (such as redirect URIs and authentication settings), and returns a clear failure message when validation fails. Each of these steps is described as a pattern; the exact implementation depends on the project.
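The comparison step can be sketched as a pure function over two dictionaries. This is an illustration, not the workflow's actual code: the function name `plan_client_changes` and the choice of owned fields are assumptions, and treating list-valued fields as unordered is an assumption about how the values should be compared. The field names themselves follow the Keycloak client representation.

```python
# Fields this hypothetical workflow is willing to own; everything else is ignored.
OWNED_CLIENT_FIELDS = ("redirectUris", "webOrigins", "standardFlowEnabled", "publicClient")


def _comparable(value):
    # Assumption: ordering of list values (for example redirect URIs) is not significant.
    if isinstance(value, list):
        return sorted(value)
    return value


def plan_client_changes(desired, existing, owned_fields=OWNED_CLIENT_FIELDS):
    """Return (action, payload) where action is 'create', 'update', or 'noop'."""
    if existing is None:
        return "create", dict(desired)
    changes = {
        field: desired[field]
        for field in owned_fields
        if field in desired and _comparable(desired[field]) != _comparable(existing.get(field))
    }
    if changes:
        return "update", changes
    return "noop", {}
```

Because the function only inspects `owned_fields`, a field such as a hand-written description on the existing client never appears in the update payload.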
Idempotency
Rerunning the workflow should not create duplicate clients, roles, flows, or identity-provider entries. The pattern is to look up resources by a stable identifier (for example client ID or alias), create them only when absent, and update only the approved subset of fields when they exist. This is an idempotency goal, not a guarantee: it holds only for the fields the workflow actually manages.
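A minimal sketch of the look-up, create-if-absent, update-if-different pattern follows. `InMemoryAdmin` is a hypothetical stand-in for whatever wrapper talks to the Admin REST API; it exists only so the re-run behaviour can be shown without a Keycloak instance.

```python
class InMemoryAdmin:
    """Hypothetical stand-in for an Admin REST API wrapper; not a Keycloak client."""

    def __init__(self):
        self.clients = {}
        self.calls = []  # records write operations so re-runs can be inspected

    def find_client(self, client_id):
        return self.clients.get(client_id)

    def create_client(self, representation):
        self.calls.append("create")
        self.clients[representation["clientId"]] = dict(representation)

    def update_client(self, client_id, changes):
        self.calls.append("update")
        self.clients[client_id].update(changes)


def ensure_client(admin, desired, owned_fields):
    """Create the client if absent, otherwise update only owned fields that differ."""
    client_id = desired["clientId"]  # stable identifier used for the lookup
    existing = admin.find_client(client_id)
    if existing is None:
        admin.create_client(desired)
        return "created"
    changes = {
        field: desired[field]
        for field in owned_fields
        if field in desired and existing.get(field) != desired[field]
    }
    if not changes:
        return "unchanged"
    admin.update_client(client_id, changes)
    return "updated"
```

Running `ensure_client` twice with the same desired state should report "created" and then "unchanged", with a single write recorded in `admin.calls`.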
Token expiry during longer workflows
Administrative access tokens have a limited lifetime. Workflows that run for more than a few minutes (large realms, many clients, retries) can outlive the token they started with. A practical approach is to acquire the token close to where it is used, check for token-expiry errors from the Admin API, and reacquire the token instead of failing the whole run. Neither the token nor the credentials used to obtain it should be logged or echoed.
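One way to sketch this is a small token holder that refreshes shortly before expiry and retries once when the API reports an expired token. The names `TokenSource`, `TokenExpired`, and `call_with_token` are hypothetical, and the sketch assumes the caller's HTTP wrapper translates an unauthorized response into `TokenExpired`. The `fetch` callable is assumed to return the token string and its lifetime in seconds.

```python
import time


class TokenExpired(Exception):
    """Assumed to be raised by the caller's HTTP wrapper on an expired-token response."""


class TokenSource:
    def __init__(self, fetch, clock=time.monotonic, skew_seconds=10):
        self._fetch = fetch          # callable returning (token, expires_in_seconds)
        self._clock = clock
        self._skew = skew_seconds    # refresh slightly early to avoid edge-of-expiry calls
        self._token = None
        self._expires_at = 0.0

    def refresh(self):
        token, expires_in = self._fetch()
        self._token = token
        self._expires_at = self._clock() + max(0, expires_in - self._skew)
        return token

    def get(self):
        if self._token is None or self._clock() >= self._expires_at:
            return self.refresh()
        return self._token


def call_with_token(tokens, request):
    """Run request(token); on TokenExpired, reacquire once and retry."""
    try:
        return request(tokens.get())
    except TokenExpired:
        return request(tokens.refresh())
```

The retry is deliberately limited to one attempt: a second expiry error right after a fresh token points at something other than token lifetime.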
Secret management
Client secrets, admin credentials, and identity-provider secrets should not be hardcoded in scripts, committed to source control, written to logs, or shipped in client-side configuration. They should be read from the environment or a secrets manager at the point of use, and the workflow should fail with a clear, non-revealing error when a required secret is missing.
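Reading a secret at the point of use, with a non-revealing failure, can be sketched as below. The function name `require_secret` and the variable name used in the usage note are hypothetical; a secrets manager would replace the environment lookup but the failure behaviour would stay the same.

```python
import os


class MissingSecretError(Exception):
    """Names the missing variable but never includes a secret value."""


def require_secret(name, environ=None):
    """Return the secret stored under `name`, or fail without revealing anything."""
    env = os.environ if environ is None else environ
    value = env.get(name, "")
    if not value.strip():
        # Only the variable name appears in the message, never a value.
        raise MissingSecretError(f"required secret {name} is not set")
    return value
```

A call such as `require_secret("KEYCLOAK_ADMIN_CLIENT_SECRET")` would sit next to the token request rather than at module import time, so the value lives in memory for as short a time as possible.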
One important decision - Reconcile only the fields the workflow owns
It is tempting to push the entire Keycloak export through automation. In this workflow, scoping the reconciliation to a narrow set of approved fields was more useful: it kept the change surface small, made review easier, and avoided overwriting fields that other teams or operators set deliberately.
Limitations
- Configuration not represented in automation can still drift
- Manual changes made directly in the admin UI can still create inconsistencies
- Environment-specific secrets require separate handling outside the workflow
- Automation does not prevent Keycloak product or infrastructure failures
When this approach does not apply
A simple one-off local environment, or a short-lived experiment, may not justify building a complete desired-state workflow. In those cases a documented manual setup is usually enough.
Conclusion
In this workflow, treating Keycloak configuration as desired state and reconciling a narrow, approved set of fields reduced the kind of drift that previously caused environment-specific authentication failures. The approach is most useful when the same configuration has to exist in more than one environment.
A practical Kafka and Strimzi upgrade validation checklist
A checklist-shaped note for Kafka and Strimzi upgrade rehearsals, organised so failures get attributed to the right layer rather than to 'the upgrade'.
Introduction
This note collects the checks I have found useful when rehearsing a Kafka and Strimzi upgrade. It is structured as separate checklists for before, during the operator upgrade, during the Kafka upgrade, and after, because in these rehearsals operator-side and broker-side failures looked similar from the outside until they were observed separately.
The recurring problem
Upgrades to Kafka or the Strimzi operator can introduce message-flow regressions, operator reconciliation surprises, or pod-recovery behaviour that masks the actual impact. Without an explicit checklist, it is easy to declare an upgrade successful while a subtle regression is still in flight.
Why it is difficult
Operator reconciliation behaviour can change between minor versions, and pods can recover on their own after a few minutes. Both effects make it harder to tell whether a symptom is the upgrade itself, a transient issue, or an application-side effect reacting to a broker restart.
Practical approach
Treat the rehearsal as four distinct phases (before, operator upgrade, Kafka upgrade, after) and validate each phase before moving to the next. Keep operator state and broker state as separate signals throughout, because mixing them obscures which layer changed.
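The phase gating can be sketched as a runner that evaluates every check in a phase and stops before the next phase if any check failed. The function name `run_phases` and the shape of the input are assumptions for illustration; the checks themselves would wrap whatever commands or queries the rehearsal uses.

```python
def run_phases(phases):
    """Run phases in order and stop after the first phase with a failed check.

    `phases` is a list of (phase_name, checks); each check is (label, callable)
    where the callable returns True when the check passes.
    """
    results = []
    for phase_name, checks in phases:
        # Run every check in the phase so one failure does not hide another.
        failed = [label for label, check in checks if not check()]
        results.append((phase_name, failed))
        if failed:
            break  # do not start the next phase on an unvalidated one
    return results
```

Keeping operator checks and broker checks in different phases means the result names the layer that failed, not just the upgrade as a whole.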
Before the upgrade
- Record current Kafka and Strimzi versions
- Review supported compatibility combinations
- Confirm operator reconciliation is healthy
- Confirm brokers and dependent applications are healthy
- Record relevant topic and consumer-group state
- Confirm producer and consumer validation paths
- Review rollback assumptions
- Capture current warnings or known issues
During the Strimzi operator upgrade
- Watch operator rollout status
- Inspect reconciliation events
- Check custom-resource status
- Confirm that expected resources remain managed
- Record unexpected warnings or errors
During the Kafka upgrade
- Observe broker restart behaviour
- Confirm brokers rejoin correctly
- Monitor application connectivity
- Validate producer behaviour
- Validate consumer behaviour
- Watch consumer-group stability
- Record message-flow failures
After the upgrade
- Produce and consume validation messages
- Confirm consumer offsets behave as expected
- Confirm applications reconnect successfully
- Verify operator and broker health
- Review logs for new warnings
- Recheck rollback assumptions
- Document observed compatibility issues
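The offset check in the list above depends on comparing a snapshot recorded before the upgrade with one taken afterwards. A minimal sketch follows, assuming each snapshot is a dictionary keyed by (consumer group, topic, partition) with the committed offset as the value; how the snapshots are collected is left out, and the function name `compare_offsets` is hypothetical.

```python
def compare_offsets(before, after):
    """Return a list of (key, finding) for entries that need a closer look."""
    findings = []
    for key in sorted(before):
        if key not in after:
            findings.append((key, "missing after upgrade"))
        elif after[key] < before[key]:
            findings.append(
                (key, f"committed offset moved backwards: {before[key]} -> {after[key]}")
            )
    for key in sorted(set(after) - set(before)):
        findings.append((key, "not present before upgrade"))
    return findings
```

A finding is a prompt to investigate, not a verdict: a backwards offset can also come from a deliberate reset, which is one reason the pre-upgrade notes should record known issues.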
One important decision - Treat operator and broker upgrades as separate validation phases
Running the operator upgrade and the Kafka upgrade as a single combined check made it hard to attribute failures in these rehearsals. Separating them lengthened the rehearsal, but a regression in operator reconciliation no longer looked the same as a broker-side issue, which made each one easier to investigate.
Limitations
- A checklist cannot prove that all production workloads, traffic patterns, schemas, or failure modes are safe
- Rehearsal traffic is rarely identical to production traffic
- Some regressions only appear under load or over longer time windows
- The checklist is only as useful as the comparison between pre-upgrade and post-upgrade state
When this approach does not apply
Exact validation steps vary based on Kafka version, Strimzi version, deployment architecture, and the guarantees the applications need. A fully managed Kafka offering where operator and broker behaviour are not surfaced may need a different shape of checklist.
Conclusion
During these upgrade rehearsals, the most useful single change was splitting validation into operator-side and broker-side phases. The checklist above is the form that ended up being practical to run; it is intentionally not a claim of zero-risk upgrades.
Investigating Kubernetes deployment failures before blaming CI
A decision flow for narrowing down a failed Kubernetes deployment using cluster-level evidence first, and turning to CI orchestration once the cluster-side picture is clear.
Introduction
This note describes the order in which I investigate a failed Kubernetes deployment. In the observed deployments, starting from Kubernetes events and pod descriptions identified the cause faster than starting from the CI pipeline log, because the pipeline log usually shows the symptom rather than the underlying cluster behaviour.
The recurring problem
Deployments can fail intermittently across environments with a mix of pod-level, configuration, and pipeline-stage causes. The CI log often shows only that the deployment did not become healthy in time, which is not enough information to choose where to look next.
Why it is difficult
A pipeline failure message and an application failure message can look similar to a casual reader. Re-running the pipeline sometimes succeeds even though the underlying fault is still present, which encourages classifying real issues as flaky.
Practical approach
Use cluster-level evidence first: events, pod description, current and previous container logs, and configuration comparison. Only inspect CI orchestration after the cluster picture is clear, so the pipeline log is read as confirmation rather than as the primary signal.
Decision flow
- 01 Deployment failed: start from cluster-level evidence rather than the CI log.
- 02 Check Kubernetes events: scheduling, image-pull, mounting, probe, or resource issues often appear here first.
- 03 Describe the affected pod: container state, restart count, conditions, events, image, mounted configuration, and probe settings.
- 04 Check image, configuration, probe, resource, and scheduling errors: map each event or condition to one of these categories before going further.
- 05 Read current and previous container logs: previous-container logs are essential after a restart, because the current log may be empty or misleading.
- 06 Compare environment-specific configuration: ConfigMaps, Secrets, image tags, environment variables, resource requests, manifests.
- 07 Inspect CI orchestration: only after the cluster-level evidence is understood, confirm the symptom matches the cause.
Start with events
Kubernetes events often identify scheduling, image-pull, volume-mounting, probe, or resource issues earlier than a generic pipeline failure message. They are a useful first stop because they describe what the cluster tried to do and where it stopped.
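Mapping events to categories can be sketched as a lookup over the `type`, `reason`, and `message` fields of each event, for example from the items returned by `kubectl get events -o json`. The reason table below is illustrative and deliberately incomplete: reason strings vary between components and versions, and the message check for image pulls is an assumption about typical kubelet wording. The names `classify_event` and `summarise_warnings` are hypothetical.

```python
from collections import Counter

# Illustrative, not exhaustive; extend from the events actually observed.
REASON_TO_CATEGORY = {
    "FailedScheduling": "scheduling",
    "FailedMount": "mounting",
    "FailedAttachVolume": "mounting",
    "Unhealthy": "probe",
    "Evicted": "resource",
}


def classify_event(event):
    """Return a coarse category for one event dictionary."""
    reason = event.get("reason", "")
    message = event.get("message", "").lower()
    if reason in REASON_TO_CATEGORY:
        return REASON_TO_CATEGORY[reason]
    # Assumption: image pull problems mention the image in the message text.
    if reason in ("Failed", "BackOff") and "image" in message:
        return "image-pull"
    return "unclassified"


def summarise_warnings(events):
    """Count Warning events per category, ignoring Normal events."""
    counts = Counter(classify_event(e) for e in events if e.get("type") == "Warning")
    return dict(counts)
```

Anything that lands in "unclassified" is worth reading in full; the point of the table is to shorten the common cases, not to hide the unusual ones.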
Describe the pod
A pod description surfaces container state, restart count, conditions, recent events, image information, mounted configuration, and readiness and liveness probe settings. Together these usually narrow the cause to a small number of categories.
Check logs (current and previous)
Current container logs show what the running process is saying now. Previous-container logs show what the process said before the last restart. After a CrashLoopBackOff or an out-of-memory kill (reported as OOMKilled), the previous logs are usually the ones that explain the failure.
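Deciding which logs to read can be sketched from the pod's own status, for example the JSON from `kubectl get pod -o json`. The sketch below only builds the `kubectl logs` commands to run, using the `--previous` flag for containers that have restarted; the function name `log_plan` and the output shape are assumptions.

```python
def log_plan(pod):
    """Return a list of (command_args, note), previous-container logs first."""
    name = pod["metadata"]["name"]
    namespace = pod["metadata"].get("namespace", "default")
    plan = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        base = ["kubectl", "logs", name, "-n", namespace, "-c", status["name"]]
        terminated = status.get("lastState", {}).get("terminated")
        if status.get("restartCount", 0) > 0 and terminated:
            # The last terminated state records why the previous container ended.
            reason = terminated.get("reason", "unknown")
            plan.append((base + ["--previous"], f"previous ({reason})"))
        plan.append((base, "current"))
    return plan
```

Listing the previous-container command first reflects the order in which the logs tend to be useful after a restart.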
Compare configuration
Differences in ConfigMaps, Secrets, image tags, environment variables, resource requests, or deployment manifests can make an issue appear environment-specific. A structured comparison against a known-good environment often turns 'flaky in staging' into a specific configuration delta.
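A structured comparison can be as small as a key-by-key delta between two flat dictionaries, one from the known-good environment and one from the failing one. In this sketch the function name `config_delta` is hypothetical, and values under keys listed in `secret_keys` are masked so the comparison output can be shared without exposing them.

```python
def config_delta(known_good, suspect, secret_keys=()):
    """Return {key: (known_good_value, suspect_value)} for keys that differ."""
    delta = {}
    for key in sorted(set(known_good) | set(suspect)):
        good, bad = known_good.get(key), suspect.get(key)
        if good == bad:
            continue
        if key in secret_keys:
            # Report that the values differ without showing either of them.
            good = None if good is None else "<redacted>"
            bad = None if bad is None else "<redacted>"
        delta[key] = (good, bad)
    return delta
```

A `None` on one side means the key is missing in that environment, which is often the more interesting finding than a changed value.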
Inspect CI after Kubernetes evidence
Pipeline orchestration should be investigated after determining whether the cluster rejected, failed, or started the workload incorrectly. By that point the CI log usually confirms the cluster-side cause rather than introducing a new theory.
One important decision - Cluster evidence first, pipeline log second
Reading Kubernetes events and pod descriptions before opening the CI log added a few minutes at the start of each investigation. In the failures I investigated it still reached the actual cause faster overall, because it avoided spending time on pipeline theories that the cluster could already disprove.
Limitations
- Some failures originate in external dependencies (registries, network, cloud provider) and need evidence from outside the cluster
- Some errors disappear before investigation begins, especially after a retry
- Kubernetes events have limited retention
- A pod reporting healthy does not guarantee the application is behaving correctly
When this approach does not apply
Networking, cloud-provider, storage, or purely application-level failures may require investigation beyond the pod and the pipeline. If the failure is clearly outside the cluster - for example a build step that never produced an image - the CI log is the right starting point instead.
Conclusion
In the observed deployments, the single most useful habit was reading Kubernetes events and the pod description before opening the CI log. The decision flow above is the shape that habit ended up taking.