job controller doesn't respect graceful shutdown sent by k8s #347
My initial idea was that the

```python
from threading import Thread
import time

from flask import Flask

app = Flask(__name__)


@app.route("/")
def spawn():
    # Spawn a background daemon thread on every request.
    t = Thread(target=thread)
    t.daemon = True
    t.start()
    return "Spawned"


def thread():
    while True:
        print("this is daemon thread")
        time.sleep(3)


app.run()
```

Find a process with

So, maybe, the problem is with how the Flask app exits? I don't have an answer yet but found one interesting comment.
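To make the experiment above conclusive, one can register an explicit SIGTERM handler before `app.run()` and check whether the signal ever reaches the Python process. A minimal sketch (illustration only, not REANA code):

```python
import signal
import sys


def handle_sigterm(signum, frame):
    # If this prints when the pod is deleted, SIGTERM did reach the Flask process;
    # exiting here lets the pod terminate within the 30-second grace period.
    print("received SIGTERM, shutting down")
    sys.exit(0)


# Register the handler before app.run(); daemon threads die with the main process.
signal.signal(signal.SIGTERM, handle_sigterm)
```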
This issue can provide a possible fix.
Handle SIGTERM in `start-scheduler` to gracefully stop consuming the workflow submission queue. Closes reanahub/reana-job-controller#347
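The "gracefully stop consuming" part usually amounts to flipping a stop flag from the signal handler and letting the consume loop finish its current message. A rough sketch of that pattern with made-up helper names (not REANA's actual scheduler code):

```python
import signal
import time


class QueueConsumer:
    """Hypothetical consumer illustrating SIGTERM-driven graceful shutdown."""

    def __init__(self):
        self.should_stop = False
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        # Don't exit immediately: finish the current message, then leave the loop.
        self.should_stop = True

    def poll_queue(self):
        # Stand-in for reading one message from the workflow submission queue.
        time.sleep(1)
        return None

    def process(self, message):
        print("processing", message)

    def run(self):
        while not self.should_stop:
            message = self.poll_queue()
            if message is not None:
                self.process(message)
        print("SIGTERM received, stopped consuming cleanly")


if __name__ == "__main__":
    QueueConsumer().run()
```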
Avoid using `/bin/sh -c` to run uwsgi, as that breaks signal propagation. Closes reanahub/reana-job-controller#347
When using the shell form of `CMD`, the provided command is executed using `/bin/sh -c`, which breaks signal propagation. Use `exec` to substitute the `sh` process and fix handling of signals by uwsgi. Also handle SIGTERM in `consume-job-queue` to gracefully stop consuming the job status queue. Closes reanahub/reana-job-controller#347
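The effect of `exec` can also be illustrated from the Python side: a small launcher that replaces its own process image with the real server, so the server keeps the container's PID and receives SIGTERM directly instead of the signal stopping at an intermediate `sh`. A hedged sketch with placeholder arguments (not the actual REANA Dockerfile or uwsgi configuration):

```python
import os

# Any preparatory steps (reading configuration, exporting environment variables, ...)
# would happen here, before the hand-over.

# Replace the current process with uwsgi, exactly like `exec uwsgi ...` in a shell
# script: no wrapper process is left behind to swallow SIGTERM sent by Kubernetes.
os.execvp("uwsgi", ["uwsgi", "--ini", "uwsgi.ini"])
```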
Fixed for reana-server and reana-workflow-controller. Still missing:
Use `exec` to execute RabbitMQ's server, so that the server process can receive signals such as `SIGTERM`. Closes reanahub/reana-job-controller#347
Use `exec` to execute job-controller, so that the server can receive signals such as `SIGTERM`. Closes reanahub/reana-job-controller#347
When REANA asks k8s to terminate the batch pod, k8s sends the SIGTERM (soft kill) signal to all containers in the batch pod. At this time, `workflow-engine` has already exited with status 0, so the only running container that receives SIGTERM is `job-controller`. k8s waits for 30 seconds (the default graceful shutdown period, details) but `job-controller` doesn't exit, so k8s sends SIGKILL and marks the batch pod's `phase` (what is phase?) as `Failed`.

Additional info: How k8s terminates pods?
Originated in reanahub/reana#593 (in this conversation).
Details

Example output of the `kubectl get pods -o json` command:

Check the timestamps in `containerStatuses`. `workflow-engine` finished at `10:28:02`, so around that time REANA asks k8s to delete the pod. Take a look at the `job-controller` finish time, `10:28:32`, exactly 30 seconds after `workflow-engine` finishes. In addition, check the `exitCode`: it is 137 (SIGKILL).
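The same fields can also be inspected programmatically with the official Kubernetes Python client; a rough sketch (the pod name and namespace below are placeholders, not the real values):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Placeholder identifiers: substitute the actual batch pod name and namespace.
pod = v1.read_namespaced_pod(name="reana-run-batch-example", namespace="default")

print("pod phase:", pod.status.phase)
for cs in pod.status.container_statuses:
    terminated = cs.state.terminated
    if terminated is not None:
        # exitCode 137 means the container was killed with SIGKILL (128 + 9).
        print(cs.name, terminated.finished_at, terminated.exit_code)
```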