Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always clean up leadership state if the partition processor stops #2295

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tillrohrmann
Copy link
Contributor

It is important to always fail the pending rpcs if the partition processor
stops. Otherwise the ingress might be hanging because a partition processor
failure does not entail a process crash.

The very concrete problem that I observed was when shutting down a PP, we are also shutting down the corresponding invoker. If shutting down the invoker completes first, then we would fail when become_follower where we check the abort_handle and never complete the pending rpcs.

It is important to always fail the pending rpcs if the partition processor
stops. Otherwise the ingress might be hanging because a partition processor
failure does not entail a process crash.
Copy link
Contributor

@AhmedSoliman AhmedSoliman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 🚀

Comment on lines +272 to +273
Ok(_) => warn!("Shutting partition processor down because it stopped unexpectedly."),
Err(err) => warn!("Shutting partition processor down because it failed: {err}"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be error! or is there a case where this is okay to happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can happen if we fail to find the tail, for example. In this case, it's up to the PPM to handle this situation. Right now it would stop the PP and wait for further instructions from the CC. But maybe warn is too high since it can also happen when shutting down. The PPM would have the context to distinguish a shut down case from an unexpected termination.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants