PROFESSIONAL WORK · 2025
Kafka and Strimzi Upgrade Validation
Validated Kafka and Strimzi upgrades on Kubernetes for service compatibility, message-flow stability, and recovery across version changes.
- Apache Kafka
- Strimzi
- Kubernetes
- GitLab CI
This case study is a sanitized explanation of my contribution. Internal names, architecture details, and business information have been omitted or generalized.
Context
Apache Kafka was deployed on Kubernetes via the Strimzi operator. The wider team owned the Kafka architecture; my work focused on validating upgrades, not on owning the platform.
Problem
Kafka and Strimzi version upgrades carried real risk of message-flow regressions, operator-side surprises, and pod recovery issues, and needed systematic validation before rollout.
Constraints
- Upgrades had to be rehearsable without affecting production data
- Validation had to be reproducible across environments rather than one-shot
- Rollback paths had to be considered for each upgrade scenario
My contribution
Contributed to upgrade validation: verified producer and consumer behaviour, Kubernetes deployment changes, pod and service recovery, and message-flow stability across versions.
Technical approach
- Upgrade planning and pre-upgrade checks
- Compatibility verification across Kafka and Strimzi versions
- Validation of producer and consumer behaviour before and after upgrade
- Verification of Kubernetes deployment changes and operator reconciliation
- Investigation of pod and service failures observed during upgrade rehearsals
- Message-flow validation including consumer-group state
- Rollback considerations for each upgrade scenario
- Execution of validation steps through CI/CD
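The message-flow validation in the list above, including consumer-group state, can be illustrated with a small before/after comparison. This is a minimal sketch, not the actual pipeline code: the `GroupSnapshot` type, its fields, and the `compare_group` function are hypothetical names, and it assumes snapshots of each consumer group are captured before and after the upgrade by whatever tooling the pipeline uses. The state names follow Kafka's consumer-group states (for example `Stable` and `PreparingRebalance`).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSnapshot:
    """Hypothetical snapshot of one consumer group at a point in time."""
    state: str                # e.g. "Stable", "PreparingRebalance", "Empty"
    members: int              # number of active group members
    offsets: dict[str, int]   # "topic-partition" -> committed offset


def compare_group(before: GroupSnapshot, after: GroupSnapshot) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    # Offsets can look correct while the group is still rebalancing,
    # so group state and membership are checked independently of offsets.
    if after.state != "Stable":
        findings.append(f"group state is {after.state}, expected Stable")
    if after.members < before.members:
        findings.append(f"members dropped from {before.members} to {after.members}")
    for partition, old in sorted(before.offsets.items()):
        new = after.offsets.get(partition)
        if new is None:
            findings.append(f"{partition}: committed offset missing after upgrade")
        elif new < old:
            findings.append(f"{partition}: offset went backwards ({old} -> {new})")
    return findings
```

Checking state and membership separately from offsets is deliberate: it catches a group whose offsets look fine but whose membership has not settled yet.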
One important engineering decision
Decision
Validate operator-side reconciliation behaviour as a separate concern from broker behaviour, with its own checks in the upgrade rehearsal.
Why
Upgrade pain tended to come from the Strimzi operator's reconciliation between versions, not from the brokers themselves, but the two were being treated as one signal.
Trade-off
The rehearsal got longer because operator and broker checks now ran as distinct phases instead of one combined pass.
Alternatives considered
- Treating the upgrade as a single combined check (faster, but harder to attribute failures)
- Skipping rehearsals on minor version bumps (rejected because operator behaviour can still change between minors)
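The decision to treat operator reconciliation and broker behaviour as separate signals can be sketched as two independent checks. This is an illustrative sketch under stated assumptions, not the checks that actually ran: the function names are hypothetical, and the inputs are assumed to be the parsed JSON of the Kafka custom resource and of the broker pods (for example from `kubectl get ... -o json`). It relies on the Kafka resource exposing `status.observedGeneration` and a `Ready` condition, and on the standard Kubernetes pod `Ready` condition.

```python
from __future__ import annotations


def operator_reconciled(kafka_cr: dict) -> tuple[bool, str]:
    """Operator phase: has the operator processed the latest spec and reported Ready?"""
    generation = kafka_cr.get("metadata", {}).get("generation")
    status = kafka_cr.get("status", {})
    if generation is None:
        return False, "Kafka resource has no metadata.generation"
    if status.get("observedGeneration") != generation:
        return False, "operator has not yet reconciled the latest spec"
    for cond in status.get("conditions", []):
        if cond.get("type") == "Ready" and cond.get("status") == "True":
            return True, "operator reports Ready"
    return False, "no Ready=True condition on the Kafka resource"


def brokers_ready(pods: list[dict], expected: int) -> tuple[bool, str]:
    """Broker phase: are the expected number of broker pods Ready?"""
    ready = 0
    for pod in pods:
        conds = pod.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conds):
            ready += 1
    return ready == expected, f"{ready}/{expected} broker pods Ready"


def rehearse(kafka_cr: dict, pods: list[dict], expected: int) -> dict:
    """Report the two phases separately so a failure is attributed to one side."""
    return {
        "operator": operator_reconciled(kafka_cr),
        "brokers": brokers_ready(pods, expected),
    }
```

With the results kept apart, a run where every broker pod is Ready but the operator has not caught up with the latest spec shows up as an operator-side finding instead of a pass.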
Failure cases and edge cases
- Pods that recovered on their own after several minutes, masking a slow operator reconciliation
- Consumer groups whose offsets appeared correct but whose membership had not stabilised yet
- Transient deployment failures during the upgrade that looked like real regressions
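One way to keep the edge cases above from hiding in a plain pass/fail result is to record how long recovery took, not just whether it happened. The sketch below is an assumption about how that could look, not the tooling that was used: `classify_recovery`, its thresholds, and its labels are hypothetical, and `check` stands in for any readiness probe the rehearsal already has.

```python
import time
from typing import Callable


def classify_recovery(
    check: Callable[[], bool],
    fast_within: float,
    give_up_after: float,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `check` and return "healthy", "slow-recovery", or "failed".

    - "healthy": the check passed within `fast_within` seconds
    - "slow-recovery": the check passed, but only after `fast_within` seconds
    - "failed": the check had not passed once `give_up_after` seconds elapsed
    """
    start = clock()
    while True:
        passed = check()
        elapsed = clock() - start
        if passed:
            return "healthy" if elapsed <= fast_within else "slow-recovery"
        if elapsed >= give_up_after:
            return "failed"
        sleep(interval)
```

A pod that comes back on its own after several minutes would be reported as `slow-recovery` and not as a pass, which keeps a slow reconciliation visible. The `clock` and `sleep` parameters exist only so the function can be exercised without real waiting.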
Technologies used
- Apache Kafka
- Strimzi
- Kubernetes
- kubectl
- GitLab CI
Challenges
- Distinguishing environment flakiness from real upgrade regressions
- Operator reconciliation behaviour changing between versions
- Reproducing transient pod and service failures observed during upgrades
Verified outcome
Provided upgrade validation evidence that supported safer Kafka and Strimzi rollouts on Kubernetes and made operator-side regressions easier to spot during rehearsal.
What I learned
Several difficult failures during these upgrade rehearsals came from operator reconciliation rather than broker behaviour. Validation is only useful when it clearly separates 'flaky' from 'broken'.
What I would improve
I would automate the comparison of operator state and broker state before and after the upgrade into a single diff artifact attached to the pipeline run, instead of relying on kubectl inspection by hand.
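A minimal sketch of what such a diff artifact could look like, assuming the pipeline already captures operator and broker state as nested JSON-like snapshots before and after the upgrade. The function names, the snapshot shape, and the output layout are all hypothetical; this improvement was not built.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {"a.b[0].c": value} pairs.

    Empty dicts and lists produce no entries.
    """
    if isinstance(obj, dict):
        items = {}
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
        return items
    if isinstance(obj, list):
        items = {}
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
        return items
    return {prefix: obj}


def diff_snapshots(before, after):
    """Compare two snapshots and group the differences by kind."""
    b, a = flatten(before), flatten(after)
    return {
        "added": {k: a[k] for k in sorted(a.keys() - b.keys())},
        "removed": {k: b[k] for k in sorted(b.keys() - a.keys())},
        "changed": {
            k: {"before": b[k], "after": a[k]}
            for k in sorted(b.keys() & a.keys())
            if b[k] != a[k]
        },
    }


def write_artifact(before, after, path):
    """Write the diff as JSON so a CI job can publish the file as an artifact."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(diff_snapshots(before, after), handle, indent=2, sort_keys=True)
```

The resulting file could be listed under `artifacts: paths` in the GitLab CI job, so each pipeline run carries its own before/after comparison.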
Ownership breakdown
Wider system context
- The Kafka and Strimzi architecture and the upgrade itself were owned by the wider team
My contribution
- Upgrade validation across versions
- Pre- and post-upgrade verification steps in CI/CD
Components I investigated
- Pod and service failures observed during rehearsals
- Operator-side reconciliation behaviour across versions
Components I validated
- Producer and consumer behaviour before and after upgrade
- Message-flow stability and consumer-group state