ENGINEERING NOTE · 2 min read
A practical Kafka and Strimzi upgrade validation checklist
A checklist-shaped note for Kafka and Strimzi upgrade rehearsals, organised so failures get attributed to the right layer rather than to 'the upgrade'.
- Kafka
- Strimzi
- Upgrades
Introduction
This note collects the checks I have found useful when rehearsing a Kafka and Strimzi upgrade. It is structured as four checklists: before the upgrade, during the Strimzi operator upgrade, during the Kafka upgrade, and after the upgrade. The split exists because, in these rehearsals, operator-side and broker-side failures looked similar from the outside until they were observed separately.
The recurring problem
Upgrades to Kafka or the Strimzi operator can introduce message-flow regressions, operator reconciliation surprises, or pod-recovery behaviour that masks the actual impact. Without an explicit checklist, it is easy to declare an upgrade successful while a subtle regression is still in flight.
Why it is difficult
Operator reconciliation behaviour can change between minor versions, and pods can recover on their own after a few minutes. Both effects make it harder to tell whether a symptom comes from the upgrade itself, from a transient issue, or from an application reacting to a broker restart.
Practical approach
Treat the rehearsal as four distinct phases (before, operator upgrade, Kafka upgrade, after) and validate each phase before moving to the next. Keep operator state and broker state as separate signals throughout, because mixing them obscures which layer changed.
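One way to make the phase gating concrete is a small runner that refuses to start a phase until the previous one is clean, and that tags every failure with the layer it belongs to. The sketch below is illustrative only: the phase names follow this note, while `run_rehearsal`, the `Check` tuple shape, and the layer labels are assumptions rather than part of any Kafka or Strimzi tooling.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Phase order follows the note; the names themselves are just labels.
PHASES = ("before", "operator-upgrade", "kafka-upgrade", "after")

# A check is (layer, name, function). The function returns None on success
# or a short failure message. The layer label ("operator", "broker",
# "application") is a hypothetical convention for keeping signals apart.
Check = Tuple[str, str, Callable[[], Optional[str]]]


def run_rehearsal(checks: Dict[str, List[Check]]) -> Tuple[Optional[str], List[str]]:
    """Run phases in order and stop at the first phase that has failures.

    Returns (failed_phase, failures). failed_phase is None if every phase passed.
    """
    for phase in PHASES:
        failures = []
        for layer, name, fn in checks.get(phase, []):
            message = fn()
            if message is not None:
                failures.append(f"[{phase}][{layer}] {name}: {message}")
        if failures:
            # Later phases are deliberately not run, so their symptoms cannot
            # be confused with the phase that actually failed.
            return phase, failures
    return None, []


if __name__ == "__main__":
    demo = {
        "before": [("operator", "reconciliation healthy", lambda: None)],
        "operator-upgrade": [("operator", "custom resource ready", lambda: "Ready condition is False")],
        "kafka-upgrade": [("broker", "brokers rejoined", lambda: None)],
    }
    print(run_rehearsal(demo))
```

Stopping at the first failing phase is a design choice, not a requirement: it trades a longer rehearsal for failures that are already attributed to a phase and a layer when they are reported.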
Before the upgrade
- Record current Kafka and Strimzi versions
- Review which Kafka versions the target Strimzi release supports
- Confirm operator reconciliation is healthy
- Confirm brokers and dependent applications are healthy
- Record relevant topic and consumer-group state
- Confirm producer and consumer validation paths
- Review rollback assumptions
- Capture current warnings or known issues
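Recorded state is most useful when it can be compared mechanically later. As a minimal sketch, assume the committed offsets have been exported into a plain dictionary of the form `{group: {"topic-partition": offset}}`. That snapshot shape and the `diff_offsets` helper are assumptions for illustration; how the snapshot is collected is left to whatever tooling is already in use.

```python
from typing import Dict, List

# Assumed snapshot shape: {consumer_group: {"topic-partition": committed_offset}}
Offsets = Dict[str, Dict[str, int]]


def diff_offsets(before: Offsets, after: Offsets) -> List[str]:
    """Report groups or partitions that disappeared, and offsets that moved backwards."""
    findings = []
    for group, partitions in sorted(before.items()):
        if group not in after:
            findings.append(f"{group}: group missing after upgrade")
            continue
        for tp, old in sorted(partitions.items()):
            new = after[group].get(tp)
            if new is None:
                findings.append(f"{group} {tp}: no committed offset after upgrade")
            elif new < old:
                findings.append(f"{group} {tp}: offset went backwards ({old} -> {new})")
    return findings


if __name__ == "__main__":
    pre = {"orders": {"orders-0": 100, "orders-1": 50}}
    post = {"orders": {"orders-0": 120, "orders-1": 40}}
    for line in diff_offsets(pre, post):
        print(line)
```

An offset that moved forward is treated as normal here, since consumers keep working during a rehearsal. Only disappearances and backwards movement are flagged, and both still need a human to decide whether they matter.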
During the Strimzi operator upgrade
- Watch operator rollout status
- Inspect reconciliation events
- Check custom-resource status
- Confirm that expected resources remain managed
- Record unexpected warnings or errors
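The custom-resource check can be partly automated by reading the resource's status block. The sketch below assumes the input is the parsed JSON of a Strimzi-managed resource, for example the output of `kubectl get kafka <name> -o json`, and that it carries `metadata.generation`, `status.observedGeneration`, and a `status.conditions` list with `Ready` and `Warning` entries. The helper name `cr_findings` is hypothetical, and the exact condition types should be confirmed against the Strimzi version in use.

```python
from typing import Any, Dict, List


def cr_findings(resource: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for one custom resource; empty means nothing flagged."""
    findings = []
    meta = resource.get("metadata", {})
    status = resource.get("status", {})
    name = meta.get("name", "<unnamed>")

    # If the operator has not yet processed the latest spec, the status
    # describes an older generation and should not be trusted.
    generation = meta.get("generation")
    observed = status.get("observedGeneration")
    if generation != observed:
        findings.append(f"{name}: observedGeneration {observed} != generation {generation}")

    conditions = status.get("conditions", [])
    ready = [c for c in conditions if c.get("type") == "Ready"]
    if not any(c.get("status") == "True" for c in ready):
        findings.append(f"{name}: no Ready=True condition")

    # Warnings do not block readiness, so they are easy to miss without this.
    for c in conditions:
        if c.get("type") == "Warning":
            findings.append(f"{name}: warning {c.get('reason')}: {c.get('message')}")
    return findings
```

Running this before and after the operator upgrade and comparing the two result lists gives an operator-side signal that does not depend on broker or application behaviour.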
During the Kafka upgrade
- Observe broker restart behaviour
- Confirm brokers rejoin correctly
- Monitor application connectivity
- Validate producer behaviour
- Validate consumer behaviour
- Watch consumer-group stability
- Record message-flow failures
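Consumer-group stability is easier to judge from a timeline than from a single reading. As a rough sketch, assume the group state has been sampled periodically during the rolling restart into `(seconds, state)` pairs, with state names such as `Stable` or `PreparingRebalance`. The sampling mechanism and the `unstable_windows` helper are assumptions for illustration, not part of any Kafka client API.

```python
from typing import List, Tuple

# Assumed sample shape: (seconds since the rehearsal started, group state)
Sample = Tuple[float, str]


def unstable_windows(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) windows during which the group was not Stable.

    A window that is still open at the last sample is closed at that
    sample's timestamp, so its reported length is a lower bound.
    """
    windows = []
    start = None
    for ts, state in samples:
        if state != "Stable" and start is None:
            start = ts
        elif state == "Stable" and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, samples[-1][0]))
    return windows


if __name__ == "__main__":
    timeline = [
        (0, "Stable"),
        (10, "PreparingRebalance"),
        (20, "CompletingRebalance"),
        (30, "Stable"),
    ]
    print(unstable_windows(timeline))
```

Some rebalancing around each broker restart is expected. What the windows make visible is whether the group settles between restarts or stays unsettled for the whole rollout.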
After the upgrade
- Produce and consume validation messages
- Confirm consumer offsets behave as expected
- Confirm applications reconnect successfully
- Verify operator and broker health
- Review logs for new warnings
- Recheck rollback assumptions
- Document observed compatibility issues
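For the validation messages, a simple comparison of what was sent against what was received covers the most common questions: was anything lost, duplicated, or unexpected? The sketch below assumes each validation message carries a unique identifier and that both sides have been collected into lists. The identifier scheme and the `compare_validation_run` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def compare_validation_run(sent: Iterable[str], received: Iterable[str]) -> Dict[str, List[str]]:
    """Classify validation message IDs as missing, duplicated, or unexpected."""
    sent_ids = list(sent)
    sent_set = set(sent_ids)
    counts = Counter(received)
    return {
        # Sent but never seen by the consumer, in send order.
        "missing": [m for m in sent_ids if counts[m] == 0],
        # Seen more than once by the consumer.
        "duplicated": sorted(m for m, n in counts.items() if n > 1 and m in sent_set),
        # Seen by the consumer but not part of this validation run.
        "unexpected": sorted(m for m in counts if m not in sent_set),
    }


if __name__ == "__main__":
    result = compare_validation_run(["v-1", "v-2", "v-3"], ["v-1", "v-3", "v-3"])
    print(result)
```

Duplicates are not automatically a regression, since at-least-once delivery allows them around broker restarts. They are still worth recording so they can be compared with the pre-upgrade run.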
One important decision: treat operator and broker upgrades as separate validation phases
Running the operator upgrade and the Kafka upgrade as a single combined check made it hard to attribute failures in these rehearsals. Separating them lengthened the rehearsal, but a regression in operator reconciliation no longer looked the same as a broker-side issue, which made each one easier to investigate.
Limitations
- A checklist cannot prove that all production workloads, traffic patterns, schemas, or failure modes are safe
- Rehearsal traffic is rarely identical to production traffic
- Some regressions only appear under load or over longer time windows
- The checklist is only as useful as the comparison between pre-upgrade and post-upgrade state
When this approach does not apply
Exact validation steps vary based on Kafka version, Strimzi version, deployment architecture, and the guarantees the applications need. A fully managed Kafka offering where operator and broker behaviour are not surfaced may need a different shape of checklist.
Conclusion
During these upgrade rehearsals, the most useful single change was splitting validation into operator-side and broker-side phases. The checklist above is the form that ended up being practical to run; it is intentionally not a claim of zero-risk upgrades.