PROFESSIONAL WORK · 2025

Kafka and Strimzi Upgrade Validation

Validated Kafka and Strimzi upgrades on Kubernetes for service compatibility, message-flow stability, and recovery across version changes.

Apache Kafka
Strimzi
Kubernetes
GitLab CI

This case study is a sanitized explanation of my contribution. Internal names, architecture details, and business information have been omitted or generalized.

Context

Apache Kafka was deployed on Kubernetes through the Strimzi operator. The wider team owned the Kafka architecture; my work focused on validating upgrades, not on owning the platform.

Problem

Kafka and Strimzi version upgrades carried real risk of message-flow regressions, unexpected operator behaviour, and pod recovery issues, so each upgrade needed systematic validation before rollout.

Constraints

Upgrades had to be rehearsable without affecting production data
Validation had to be reproducible across environments rather than a one-off run
Rollback paths had to be considered for each upgrade scenario

My contribution

Contributed to upgrade validation: verified producer and consumer behaviour, Kubernetes deployment changes, pod and service recovery, and message-flow stability across versions.

Technical approach

Upgrade planning and pre-upgrade checks
Compatibility verification across Kafka and Strimzi versions
Validation of producer and consumer behaviour before and after upgrade
Verification of Kubernetes deployment changes and operator reconciliation
Investigation of pod and service failures observed during upgrade rehearsals
Message-flow validation including consumer-group state
Rollback considerations for each upgrade scenario
Execution of validation steps through CI/CD
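
To illustrate the message-flow item in the list above, here is a minimal sketch of a before-and-after flow check. It is not the project's code. It assumes a test producer tags each message with an integer sequence id and a test consumer records the ids it reads back; the names `FlowReport` and `check_message_flow` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FlowReport:
    missing: list[int]     # sent but never received
    duplicates: list[int]  # received more than once
    unexpected: list[int]  # received but never sent by this test run

    @property
    def ok(self) -> bool:
        return not (self.missing or self.duplicates or self.unexpected)


def check_message_flow(sent_ids: list[int], received_ids: list[int]) -> FlowReport:
    """Compare the ids a test producer sent with the ids a consumer read back."""
    sent = set(sent_ids)
    counts = Counter(received_ids)
    received = set(counts)
    return FlowReport(
        missing=sorted(sent - received),
        # Duplicates can be legitimate under at-least-once delivery;
        # they are reported here so a reviewer can decide.
        duplicates=sorted(i for i, n in counts.items() if n > 1),
        unexpected=sorted(received - sent),
    )
```

Running the same check before and after the upgrade against a dedicated test topic gives two comparable reports without touching production data.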

One important engineering decision

Decision

Validate operator-side reconciliation behaviour as a separate concern from broker behaviour, with its own checks in the upgrade rehearsal.

Why

Upgrade pain tended to come from the Strimzi operator's reconciliation between versions, not from the brokers themselves, but the two were being treated as one signal.

Trade-off

The rehearsal got longer because operator and broker checks now ran as distinct phases instead of one combined pass.
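
As a rough sketch of what separate phases can look like (an illustration under stated assumptions, not the checks used in the project), the functions below judge operator reconciliation and broker readiness independently. They assume the inputs are already-parsed JSON, for example from `kubectl get kafka <name> -o json` and from the `items` of a pod listing. The function names are hypothetical.

```python
def operator_phase(kafka_cr: dict) -> list[str]:
    """Problems with operator reconciliation, judged from the Kafka custom resource."""
    problems = []
    generation = kafka_cr.get("metadata", {}).get("generation")
    status = kafka_cr.get("status", {})
    # The operator records the spec generation it last reconciled.
    if status.get("observedGeneration") != generation:
        problems.append("operator has not reconciled the latest spec generation")
    ready = [c for c in status.get("conditions", []) if c.get("type") == "Ready"]
    if not ready or ready[0].get("status") != "True":
        problems.append("Kafka resource does not report Ready=True")
    return problems


def broker_phase(pods: list[dict]) -> list[str]:
    """Problems with broker pods, judged from each pod's Ready condition."""
    problems = []
    for pod in pods:
        conditions = pod.get("status", {}).get("conditions", [])
        is_ready = any(
            c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
        )
        if not is_ready:
            problems.append(f"pod {pod['metadata']['name']} is not Ready")
    return problems


def run_rehearsal_checks(kafka_cr: dict, pods: list[dict]) -> dict[str, list[str]]:
    """Keep the two signals apart so a failure can be attributed to one side."""
    return {"operator": operator_phase(kafka_cr), "broker": broker_phase(pods)}
```

Reporting the two lists under separate keys is the point of the design: ready brokers with a stale `observedGeneration` show up as an operator problem rather than as a generic upgrade failure.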

Alternatives considered

Treating the upgrade as a single combined check (faster, but harder to attribute failures)
Skipping rehearsals on minor version bumps (rejected because operator behaviour can still change between minors)

Failure cases and edge cases

Pods that recovered on their own after several minutes, masking a slow operator reconciliation
Consumer groups whose offsets appeared correct but whose membership had not stabilised yet
Transient deployment failures during the upgrade that looked like real regressions
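
The consumer-group case above is the kind of thing a simple settling check can catch. The sketch below is illustrative and not taken from the project. It assumes the group's state and member ids are sampled at intervals by some external means, and it only accepts the group once several consecutive samples agree. `GroupSample`, `membership_settled`, and the default of three samples are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSample:
    state: str               # Kafka group state, e.g. "Stable" or "PreparingRebalance"
    members: frozenset[str]  # member ids seen in this sample


def membership_settled(samples: list[GroupSample], required: int = 3) -> bool:
    """True when the last `required` samples are all Stable with identical, non-empty membership."""
    if len(samples) < required:
        return False
    tail = samples[-required:]
    first = tail[0]
    return bool(first.members) and all(
        s.state == "Stable" and s.members == first.members for s in tail
    )
```

Correct-looking offsets alone would pass a one-shot check; requiring stable membership over several samples guards against declaring success in the middle of a rebalance.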

Technologies used

Apache Kafka
Strimzi
Kubernetes
kubectl
GitLab CI

Challenges

Distinguishing environment flakiness from real upgrade regressions
Operator reconciliation behaviour changing between versions
Reproducing transient pod and service failures observed during upgrades
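
One way to reason about the first challenge is to repeat the same check on the old and the new version and compare failure rates. The sketch below illustrates that idea and is not the method used in the project. The function name, the labels it returns, and the thresholds are all assumptions.

```python
def classify_failure(baseline_runs: list[bool], upgraded_runs: list[bool]) -> str:
    """Classify a check from repeated results on both versions (True means the run passed)."""
    if not baseline_runs or not upgraded_runs:
        return "inconclusive"
    baseline_fail = baseline_runs.count(False) / len(baseline_runs)
    upgraded_fail = upgraded_runs.count(False) / len(upgraded_runs)
    if upgraded_fail == 0:
        return "pass"
    if baseline_fail > 0:
        # The check also fails without the upgrade, so the environment is suspect.
        return "flaky-environment"
    if upgraded_fail == 1:
        # Clean before the upgrade, failing on every run after it.
        return "regression"
    # Clean before, sometimes failing after: needs investigation, not a verdict.
    return "intermittent-after-upgrade"
```

The middle category matters most in practice: an intermittent failure that only appears after the upgrade should not be filed as flakiness until it has been reproduced or explained.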

Verified outcome

Provided upgrade validation evidence that supported safer Kafka and Strimzi rollouts on Kubernetes and made operator-side regressions easier to spot during rehearsal.

What I learned

During these upgrade rehearsals, several of the difficult failures appeared in operator reconciliation rather than in broker behaviour. I also learned that validation is only useful when it clearly separates 'flaky' from 'broken'.

What I would improve

I would automate the comparison of operator state and broker state before and after the upgrade into a single diff artifact attached to the pipeline run, instead of relying on kubectl inspection by hand.
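
A minimal sketch of what that diff artifact could look like, assuming operator and broker state are captured as nested dictionaries (for example parsed from JSON) before and after the upgrade. This is a proposal in code form, not something that exists in the pipeline. The function names and the `upgrade-diff.json` file name are hypothetical.

```python
import json

_ABSENT = "<absent>"  # marks a path that exists in only one snapshot


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {"a.b[0].c": value} pairs."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def diff_snapshots(before: dict, after: dict) -> list[dict]:
    """List every path whose value differs between the two snapshots."""
    b, a = flatten(before), flatten(after)
    changes = []
    for path in sorted(b.keys() | a.keys()):
        old, new = b.get(path, _ABSENT), a.get(path, _ABSENT)
        if old != new:
            changes.append({"path": path, "before": old, "after": new})
    return changes


def write_artifact(changes: list[dict], path: str = "upgrade-diff.json") -> None:
    """Write the diff as JSON so a CI job can attach it to the pipeline run."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(changes, fh, indent=2)
```

Keeping operator and broker state under separate top-level keys in each snapshot means the diff paths themselves show which side changed.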

Ownership breakdown

Wider system context

The Kafka and Strimzi architecture and the upgrade itself were owned by the wider team

My contribution

Upgrade validation across versions
Pre- and post-upgrade verification steps in CI/CD

Components I investigated

Pod and service failures observed during rehearsals
Operator-side reconciliation behaviour across versions

Components I validated

Producer and consumer behaviour before and after upgrade
Message-flow stability and consumer-group state
