worker.py: Avoid letting a job get stuck if exception occurs
Some instances of stuck jobs were observed recently for PureOS.  From
the logs, I think a Python exception may have occurred after the build
completed but before the artifacts were uploaded.  I can't tell what
might have caused that exception, if it did occur.

This change would ensure that a 'rejected' status is sent if this
occurs, rather than leaving the job stuck with no result.

I applied this change to fennel.pureos.net (the new worker) to try to
identify what was causing the stuck jobs, but this never happened
again.  No jobs got stuck after that and this exception code was never
hit, so I can't identify the root cause.

Signed-off-by: Jonathon Hall <[email protected]>
JonathonHall-Purism committed Oct 21, 2024
1 parent 5d106b8 commit d891f9e
Showing 1 changed file with 24 additions and 11 deletions.
35 changes: 24 additions & 11 deletions spark/worker.py
@@ -184,19 +184,32 @@ def _request_job(self):
             # there are no jobs available for us
             return False
 
-        job_module = job_reply.get('module')
-        job_kind = job_reply.get('kind')
-        job_id = job_reply.get('uuid')
-
-        if job_kind in self._conf.accepted_job_kinds:
-            return self._run_job(job_reply)
-        else:
-            log.warning(
-                'Received job of type {0}::{1} which we can not handle.'.format(
-                    job_module, job_kind
+        # Now that we've accepted a job, we just reply to the server, even if
+        # an exception occurs. If we don't, the job is stuck indefinitely and
+        # we will not be able to accept another.
+        try:
+            job_module = job_reply.get('module')
+            job_kind = job_reply.get('kind')
+            job_id = job_reply.get('uuid')
+
+            if job_kind in self._conf.accepted_job_kinds:
+                return self._run_job(job_reply)
+            else:
+                log.warning(
+                    'Received job of type {0}::{1} which we can not handle.'.format(
+                        job_module, job_kind
+                    )
                 )
-            )
             self._conn.send_job_status(job_id, JobStatus.REJECTED)
             return False
+        except:  # noqa: E722 pylint: disable=bare-except
+            import traceback
+
+            tb = traceback.format_exc()
+            jlog.write(tb)
+            self._conn.send_job_status(job_id, JobStatus.REJECTED)
+            log.warning(tb)
+            log.info('Rejected job {} due to exception'.format(job_id))
+            return False
 
     def _update_archive_data(self) -> bool:
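
The pattern in this change (always reply to the server once a job has been accepted, even when something raises) can be tried in isolation. What follows is only a minimal sketch under stated assumptions, not the code in spark/worker.py: `FakeConnection`, `handle_job`, and the local `JobStatus` enum are hypothetical stand-ins made up for illustration. It also departs from the commit in two ways: it binds `job_id` before the `try` block so the handler can always reference it, and it catches `Exception` rather than using a bare `except`.

```python
import logging
import traceback
from enum import Enum

log = logging.getLogger("worker_sketch")


class JobStatus(Enum):
    # Hypothetical stand-in for the project's JobStatus type.
    REJECTED = "rejected"


class FakeConnection:
    """Hypothetical stand-in for the server connection; records what was sent."""

    def __init__(self):
        self.sent = []

    def send_job_status(self, job_id, status):
        self.sent.append((job_id, status))


def handle_job(conn, job_reply, accepted_kinds, run_job):
    """Run an accepted job; report REJECTED if it cannot be handled or raises."""
    # Assumption of this sketch: bind job_id up front so the except
    # handler never hits an unbound name if the lookup itself fails.
    job_id = None
    try:
        job_id = job_reply.get("uuid")
        job_kind = job_reply.get("kind")
        if job_kind in accepted_kinds:
            return run_job(job_reply)
        log.warning("Received job of kind %r which we can not handle.", job_kind)
        conn.send_job_status(job_id, JobStatus.REJECTED)
        return False
    except Exception:
        # Log the traceback, then still tell the server the job is rejected
        # so it is not left without a result.
        log.warning(traceback.format_exc())
        conn.send_job_status(job_id, JobStatus.REJECTED)
        log.info("Rejected job %s due to exception", job_id)
        return False
```

Calling `handle_job` with a `run_job` callable that raises returns `False` and leaves a single `REJECTED` entry in `conn.sent`, which is the behaviour the commit message describes for the real worker.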