ENGINEERING NOTE

A practical Kafka and Strimzi upgrade validation checklist

A checklist-shaped note for Kafka and Strimzi upgrade rehearsals, organised so failures get attributed to the right layer rather than to 'the upgrade'.

Introduction

This note collects the checks I have found useful when rehearsing a Kafka and Strimzi upgrade. It is structured as four separate checklists (before the upgrade, during the operator upgrade, during the Kafka upgrade, and after the upgrade) because, in these rehearsals, operator-side and broker-side failures looked similar from the outside until they were observed separately.

The recurring problem

Upgrades to Kafka or the Strimzi operator can introduce message-flow regressions, operator reconciliation surprises, or pod-recovery behaviour that masks the actual impact. Without an explicit checklist, it is easy to declare an upgrade successful while a subtle regression is still in flight.

Why it is difficult

Operator reconciliation behaviour can change between minor versions, and pods can recover on their own after a few minutes. Both effects make it harder to tell whether a symptom is the upgrade itself, a transient issue, or an application-side effect reacting to a broker restart.

Practical approach

Treat the rehearsal as four distinct phases (before, operator upgrade, Kafka upgrade, after) and validate each phase before moving to the next. Keep operator state and broker state as separate signals throughout, because mixing them obscures which layer changed.
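
The phase gating described above can be sketched as a small runner. This is a minimal illustration, not a prescribed tool: the `Phase` and `run_phases` names are hypothetical, and each check is assumed to be a callable returning a pass flag and a short detail string. Operator and broker checks are kept in separate buckets so that a failure is reported against its own layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A check returns (passed, detail). All names here are illustrative.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class Phase:
    name: str
    operator_checks: Dict[str, Check] = field(default_factory=dict)
    broker_checks: Dict[str, Check] = field(default_factory=dict)


def run_phases(phases: List[Phase]) -> List[dict]:
    """Run phases in order and stop after the first phase with a failed check."""
    results = []
    for phase in phases:
        # Failures are recorded per layer so the two signals never mix.
        failures = {"operator": [], "broker": []}
        layers = (("operator", phase.operator_checks),
                  ("broker", phase.broker_checks))
        for layer, checks in layers:
            for label, check in checks.items():
                passed, detail = check()
                if not passed:
                    failures[layer].append(f"{label}: {detail}")
        ok = not failures["operator"] and not failures["broker"]
        results.append({"phase": phase.name, "ok": ok, "failures": failures})
        if not ok:
            break  # do not start the next phase until this one validates
    return results
```

Every check in a phase runs even after one fails, so a single pass shows whether the operator side, the broker side, or both were affected before the rehearsal stops.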

Before the upgrade

Record current Kafka and Strimzi versions
Review supported compatibility combinations
Confirm operator reconciliation is healthy
Confirm brokers and dependent applications are healthy
Record relevant topic and consumer-group state
Confirm producer and consumer validation paths
Review rollback assumptions
Capture current warnings or known issues
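
Several of the items above amount to writing down a baseline. A minimal sketch of such a snapshot follows, under stated assumptions: the `Kafka` custom resource and the operator Deployment are fetched as JSON with `kubectl`, the Kafka version is read from `spec.kafka.version` (which may be unset if the operator default is in use), and committed offsets have already been collected by some other means into a `{group: {"topic:partition": offset}}` mapping. The function names are hypothetical.

```python
import json
import subprocess
from datetime import datetime, timezone


def kubectl_json(*args: str) -> dict:
    """Run kubectl with '-o json' and parse the result (needs cluster access)."""
    completed = subprocess.run(["kubectl", *args, "-o", "json"],
                               check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)


def operator_image(deployment: dict) -> str:
    """Image of the first container; the tag usually carries the Strimzi version."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return containers[0]["image"]


def build_snapshot(kafka_cr: dict, operator_deployment: dict,
                   group_offsets: dict, known_warnings: list) -> dict:
    """Collect the pre-upgrade baseline into one JSON-serialisable dict."""
    conditions = kafka_cr.get("status", {}).get("conditions", [])
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        # None here means the version was not pinned in the custom resource.
        "kafka_version": kafka_cr["spec"]["kafka"].get("version"),
        "operator_image": operator_image(operator_deployment),
        "conditions": [{"type": c.get("type"), "status": c.get("status")}
                       for c in conditions],
        "group_offsets": group_offsets,
        "known_warnings": list(known_warnings),
    }
```

In use, the two resources might be fetched with `kubectl_json("get", "kafka", "my-cluster", "-n", "kafka")` and `kubectl_json("get", "deployment", "strimzi-cluster-operator", "-n", "kafka")`, where the cluster name, namespace, and Deployment name are placeholders for whatever the installation actually uses. Saving the result with `json.dump` gives the post-upgrade checks something concrete to compare against.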

During the Strimzi operator upgrade

Watch operator rollout status
Inspect reconciliation events
Check custom-resource status
Confirm that expected resources remain managed
Record unexpected warnings or errors
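
The custom-resource status check can be made mechanical. The sketch below assumes the resource has been fetched as JSON (for example with `kubectl get kafka <name> -o json`) and that its status follows the convention I understand Strimzi to use: a `Ready` condition with status `True` when reconciliation succeeded, other condition types such as `NotReady` or `Warning` when it did not, and an `observedGeneration` that should match `metadata.generation` once the operator has processed the latest spec. Verify these field names against the Strimzi version in use; the function name is hypothetical.

```python
def reconciliation_problems(resource: dict) -> list:
    """Return human-readable problems found in a custom resource's status."""
    generation = resource.get("metadata", {}).get("generation")
    status = resource.get("status", {})
    observed = status.get("observedGeneration")
    conditions = status.get("conditions", [])

    problems = []
    # A mismatch suggests the operator has not yet reconciled the latest spec.
    if observed != generation:
        problems.append(f"observedGeneration {observed} != generation {generation}")
    # Reconciliation is only considered healthy with an explicit Ready=True.
    if not any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
        problems.append("no Ready=True condition")
    # Surface every other active condition rather than only the first one.
    for c in conditions:
        if c.get("type") != "Ready" and c.get("status") == "True":
            text = f"{c.get('type')}: {c.get('reason', '')} {c.get('message', '')}"
            problems.append(text.strip())
    return problems
```

An empty list means the resource looks reconciled at the moment it was fetched. Running the same function before and after the operator rollout keeps this signal on the operator side, independent of broker health.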

During the Kafka upgrade

Observe broker restart behaviour
Confirm brokers rejoin correctly
Monitor application connectivity
Validate producer behaviour
Validate consumer behaviour
Watch consumer-group stability
Record message-flow failures
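
Consumer-group stability is easier to judge from a summary than from watching output scroll past. The sketch below assumes the group state has been sampled periodically during the broker restarts (for instance from `kafka-consumer-groups.sh --describe --state`) into time-ordered `(unix_seconds, state)` pairs. Any state other than `Stable` is treated as unstable, which is a simplification; the function name and sample shape are hypothetical.

```python
from typing import List, Tuple


def stability_summary(samples: List[Tuple[float, str]]) -> dict:
    """Summarise how often and for how long a group left the Stable state."""
    departures = 0            # transitions from Stable to anything else
    longest_unstable = 0.0    # longest observed unstable stretch, in seconds
    unstable_since = None
    previous_state = None

    for ts, state in samples:
        stable = state == "Stable"
        if not stable and unstable_since is None:
            unstable_since = ts
            if previous_state == "Stable":
                departures += 1
        elif stable and unstable_since is not None:
            longest_unstable = max(longest_unstable, ts - unstable_since)
            unstable_since = None
        previous_state = state

    # An unstable stretch still open at the last sample counts up to that sample.
    still_unstable = unstable_since is not None
    if still_unstable:
        longest_unstable = max(longest_unstable, samples[-1][0] - unstable_since)

    return {"departures": departures,
            "longest_unstable_seconds": longest_unstable,
            "still_unstable": still_unstable}
```

Some departures from `Stable` are expected while brokers restart. The useful comparison is against the same summary taken during an ordinary rolling restart on the old version, not against zero.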

After the upgrade

Produce and consume validation messages
Confirm consumer offsets behave as expected
Confirm applications reconnect successfully
Verify operator and broker health
Review logs for new warnings
Recheck rollback assumptions
Document observed compatibility issues
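
The offset check is a comparison against whatever was recorded before the upgrade. A minimal sketch, assuming both sides are available as `{group: {"topic:partition": committed_offset}}` mappings (a hypothetical shape chosen for this example): it flags groups that disappeared, partitions that lost their committed offset, and offsets that moved backwards.

```python
def offset_regressions(before: dict, after: dict) -> list:
    """Compare committed offsets and return (group, partition, finding) tuples."""
    findings = []
    for group, partitions in before.items():
        after_partitions = after.get(group)
        if after_partitions is None:
            findings.append((group, None, "group missing after upgrade"))
            continue
        for tp, old in partitions.items():
            new = after_partitions.get(tp)
            if new is None:
                findings.append((group, tp, "no committed offset after upgrade"))
            elif new < old:
                findings.append((group, tp, f"offset moved backwards: {old} -> {new}"))
    return findings
```

A finding is a prompt to investigate rather than proof of a regression: a deliberate offset reset or an expired group would show up the same way.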

One important decision: treat operator and broker upgrades as separate validation phases

In these rehearsals, validating the operator upgrade and the Kafka upgrade as a single combined check made failures hard to attribute. Separating them lengthened the rehearsal, but a regression in operator reconciliation no longer looked the same as a broker-side issue, which made each one easier to investigate.

Limitations

A checklist cannot prove that all production workloads, traffic patterns, schemas, or failure modes are safe
Rehearsal traffic is rarely identical to production traffic
Some regressions only appear under load or over longer time windows
The checklist is only as useful as the comparison between pre-upgrade and post-upgrade state

When this approach does not apply

Exact validation steps vary based on Kafka version, Strimzi version, deployment architecture, and the guarantees the applications need. A fully managed Kafka offering where operator and broker behaviour are not surfaced may need a different shape of checklist.

Conclusion

During these upgrade rehearsals, the single most useful change was splitting validation into operator-side and broker-side phases. The checklist above is the form that ended up being practical to run; it does not claim that upgrades are risk-free.