Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(server): abort on panic #4026

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from
Draft

feat(server): abort on panic #4026

wants to merge 5 commits into from

Conversation

jjbayer
Copy link
Member

@jjbayer jjbayer commented Sep 11, 2024

Make the tasks spawned by services joinable, and abort the entire process if one of the join handles returns a panic.

This is a partial implementation that requires follow-up. Only EnvelopeBuffer and EnvelopeProcessor are actively monitored for panics for now.

ref: https://getsentry.atlassian.net/browse/INC-875

@jjbayer jjbayer self-assigned this Sep 11, 2024
Copy link
Member

@iambriccardo iambriccardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea LGTM, it would be nice as I wrote if we could figure out in the future a more defensive mechanism for services to fallback in case of panics.

}
Err(e) => {
if e.is_panic() {
std::panic::resume_unwind(e.into_panic());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we maybe want in a future iteration to define a respawn behavior of services? It might be tricky to make sure existing channels are re-setup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually re-triggers the panic and makes the process terminate. Respawning services is another option I would like to discuss on Monday, but it has its drawbacks (what if the service keeps panicking on every re-spawn?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is something I thought of, I feel like for that we should have some global retry counters or heuristics to know when it's not possible to restart a service anymore.

}
Err(e) => {
if e.is_panic() {
std::panic::resume_unwind(e.into_panic());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually re-triggers the panic and makes the process terminate. Respawning services is another option I would like to discuss on Monday, but it has its drawbacks (what if the service keeps panicking on every re-spawn?).

@@ -1046,12 +1047,12 @@ mod tests {
impl Service for MockService {
type Interface = MockMessage;

fn spawn_handler(self, mut rx: Receiver<Self::Interface>) {
fn spawn_handler(self, mut rx: Receiver<Self::Interface>) -> JoinHandle<()> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: by requiring that spawn_handler returns exactly one JoinHandle, we restrict the impl to define exactly one main task. Not sure if this is what we want, because the purpose of the spawn handler was to give the implementor more liberty. If we do restrict it, we might as well replace the trait method spawn_handler by a trait method run, and call tokio::spawn from the outside.

@@ -225,6 +226,8 @@ impl Service for HealthCheckService {
});
}
});

j1 // TODO: should return j1 + j2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a few places where the spawn handler spawns more than one task. In a follow-up, we should transform these to something like

tokio::spawn(async {
    let subtask = tokio::spawn(async {...});
    /// ...
    subtask.await;
});

}
}
}
_ = Controller::shutdown_handle().finished() => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: when every service implements a shutdown listener, awaiting on finished becomes unnecessary: We can simply await on all the join_handles and guarantee that every service finished its main task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants