
[doc] Feature request: Maybe slightly enhance the concurrency documentation? #99

Open
nulltoken opened this issue Oct 14, 2024 · 3 comments
Labels
documentation Improvements or additions to documentation

Comments

@nulltoken
Collaborator

nulltoken commented Oct 14, 2024

Just a random thought. Concurrency may happen when running the same job multiple times on one instance. It may also happen when the host application is deployed in a web farm.

Designing this sort of job properly may require prior experience, often earned through painful production burn scars.

I wonder if an example (or some high-level pseudo code) would help the reader understand what needs to be taken care of.

For instance, given a process feeding a database with things to process in a state A, how to safely design a job that could process those things and bring them to a state B in a concurrency-aware way:

  • Preventing different job instances from processing the same things
  • Recovering from a server crash that could happen mid-process
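To make the first bullet concrete, here is a minimal sketch of an atomic "claim" pattern. NCronJob itself is a .NET library; Python and SQLite are used here purely for illustration, and the table and column names (`things`, `state`, `claimed_by`, `claimed_at`) are hypothetical. The idea is that marking and selection happen under one transaction, so two concurrent job instances can never claim the same rows.

```python
import sqlite3
import uuid

def claim_batch(conn, limit=10):
    """Atomically claim up to `limit` rows in state 'A' so that two
    concurrent job instances never pick the same rows."""
    token = uuid.uuid4().hex  # unique marker for this job run
    with conn:  # one transaction: mark the rows, then read back ours
        conn.execute(
            """UPDATE things
               SET claimed_by = ?, claimed_at = datetime('now')
               WHERE id IN (SELECT id FROM things
                            WHERE state = 'A' AND claimed_by IS NULL
                            LIMIT ?)""",
            (token, limit),
        )
    rows = conn.execute(
        "SELECT id FROM things WHERE claimed_by = ?", (token,)
    ).fetchall()
    return token, [r[0] for r in rows]
```

A second instance calling `claim_batch` concurrently only sees rows where `claimed_by IS NULL`, so the two batches are disjoint by construction.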
@linkdotnet linkdotnet added the documentation Improvements or additions to documentation label Oct 14, 2024
@linkdotnet
Member

More documentation can't really hurt. The current state somewhat covers all the features but doesn't give a guideline on how to achieve certain things, exactly like you explained.

I am all in if you have concrete ideas/examples.

If you want, you can easily play around with the documentation without any setup. We have devcontainer support and predefined tasks for serving the docs.

@nulltoken nulltoken changed the title [doc] Feature request: Maybe slightly enhance the conccurency documentation? [doc] Feature request: Maybe slightly enhance the concurrency documentation? Oct 19, 2024
@falvarez1
Member

painful production burn scars

Yeah, I've been there, not with NCronJob specifically, but with similar concurrency challenges like deadlocks and race conditions that would take down a production app. I'd be happy to help with this.

@nulltoken, could you provide more details about your use case for 'Recovering from a server crash that could happen mid-process'? Currently, NCronJob doesn’t include built-in crash recovery strategies, but I’d be open to discussing how we might handle this. Are you thinking of something like a checkpointing system or idempotent job design to ensure safe restarts after a failure?

@nulltoken
Collaborator Author

@falvarez1 Sorry for the late response.

My question wasn't a hidden way to request a feature. To be honest, I'm pretty happy with the lib as it is. It ticks many checkboxes on my list.

  • Neat abstraction on top of timer-based hosted services
  • Easy offload of an async task from an HTTP handler
  • Orchestration of tasks based on their outcomes
  • Low ceremony
  • ...

From very far, it could be seen as an alternative to Azure Durable Functions, without all the clunky code/release process/operating model.

However, as a job lives in memory and has no idea how many of its siblings run at the same time, some design work has to be done to make this work.

Hence my initial request.

Regarding how I use it, although NCronJob has been on my watchlist for some time, I only started to use it in September when a simple enough project popped up where it could be battle tested.

  • Project context:

    • Some document is pushed to an HTTP endpoint
    • It has to go through a workflow of validation steps
      • Some of them requiring a deep analysis of the document itself
      • Others requiring the use of external services
    • When any check fails, the document is rejected with the reason of the rejection
    • When a step successfully passes, the document is "tagged" as such and is eligible to be processed by the next step in line
    • Once a document has successfully gone through all the steps, it's forwarded alongside the attached analysis results to a third party
    • The current state of processing of the document (along with the potential rejection reasons and underlying evidence) is exposed through another endpoint
  • Constraints:

    • Checks are regularly enhanced/fine-tuned by the analysts
    • Incoming documents are sent with no particular frequency. Some of them may arrive in bulk.
    • The deep analyses may be compute intensive. We want to avoid re-doing them whenever possible
    • The intermediate external services might be slow to respond or temporarily unavailable. Once they've been invoked for a specific document, we don't want to go through that again
    • We want to avoid sending a fully validated document to the final third-party more than once

As I mentioned earlier, this is pretty basic and a good testbed for NCronJob.

As such, the design was pretty straightforward:

  • One validation step => one job
  • When a new document is provided, a routing sheet is created alongside.
  • Each job is designed to process a document in a particular state
  • When a job finishes, the routing sheet is updated with either the new state (so that it can be processed by one of the next jobs) or a final rejection reason
  • When a job starts, it searches the database for the top X next documents to process (with a state matching what the job can process) and marks them in a transactional way as being processed. The marking is a combination of two pieces of information: the job name/version and the UTC date the job processing started on. Of course, the search for new documents to process excludes the marked ones.
  • During the processing of a job, nothing is persisted back in the database.
  • When a job has finished its task, the routing sheet is updated with the final decision (either valid for next step or rejected) and the marking is reset.
  • An observability container is fed with the metadata of each job execution and its dependencies (internal and external) which allows us to track the performance of the overall process.
  • An additional job is in charge of searching for routing sheets that have been marked for too long (longer than 150% of the P95 of each job's processing time) and simply resets the markings. It can be seen as a "garbage collecting" job. Its goal is to "rewind" a routing sheet in case of a brutal server crash.
  • All validation jobs inherit from a base class that deals with the initial "fetch-and-mark" phase and the final "persist-and-unmark".
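The base class described above could be sketched roughly as follows. Again, Python and SQLite are used only to illustrate the shape of the "fetch-and-mark" / "persist-and-unmark" phases; the class, table, and column names (`StateTransitionJob`, `routing_sheets`, `marked_by`, etc.) are hypothetical, not anything NCronJob provides.

```python
import sqlite3
from datetime import datetime, timezone

class StateTransitionJob:
    """Hypothetical base class: fetch-and-mark on start,
    persist-and-unmark on completion."""
    job_name = "base"      # job name/version, part of the marking
    from_state = None      # state this job is able to process
    to_state = None        # state on success
    batch_size = 5         # "top X" documents per run

    def fetch_and_mark(self, conn):
        # Marking = job name/version + UTC start date, as described above.
        now = datetime.now(timezone.utc).isoformat()
        marker = f"{self.job_name}@{now}"
        with conn:  # transactional: mark, then read back our rows
            conn.execute(
                """UPDATE routing_sheets
                   SET marked_by = ?
                   WHERE id IN (SELECT id FROM routing_sheets
                                WHERE state = ? AND marked_by IS NULL
                                LIMIT ?)""",
                (marker, self.from_state, self.batch_size),
            )
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM routing_sheets WHERE marked_by = ?", (marker,))]
        return marker, ids

    def persist_and_unmark(self, conn, marker, doc_id, accepted, reason=None):
        # Final decision: either eligible for the next step, or rejected.
        with conn:
            conn.execute(
                """UPDATE routing_sheets
                   SET state = ?, rejection_reason = ?, marked_by = NULL
                   WHERE id = ? AND marked_by = ?""",
                (self.to_state if accepted else "rejected",
                 reason, doc_id, marker),
            )
```

Each concrete validation job would subclass this, set `from_state`/`to_state`, and only implement the actual validation logic in between the two phases.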

So, in brief:

  • Idempotency as much as possible

  • Some kind of mutex at the routing sheet level to prevent over-consumption

  • A "release" mechanism to remove stale mutexes

  • A ton of metrics in an observability container

  • The only potential weak link in this is the final submission of a validated document to the third party. Should a crash occur between the submission and the "persist-and-unmark" of the routing sheet, the sheet would eventually be reset by the garbage-collecting job, re-processed and re-submitted.

    But that's the nature of distributed systems.

    Given the likelihood/impact matrix for this, we've decided not to special-case it and only monitor/report should it happen (and deal with it through a manual compensation method).
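The "release" mechanism for stale mutexes could look like this. As before, this is an illustrative Python/SQLite sketch with hypothetical names; in the real design the threshold would be derived from the observed P95 of each job's processing time, and here it is simply passed in as a fixed number of seconds.

```python
import sqlite3

def release_stale_markings(conn, max_age_seconds):
    """The 'garbage collecting' job: reset markings older than the
    allowed age, so documents stranded by a crashed job become
    eligible for processing again."""
    with conn:
        cur = conn.execute(
            """UPDATE routing_sheets
               SET marked_by = NULL, marked_at = NULL
               WHERE marked_at IS NOT NULL
                 AND marked_at < datetime('now', ?)""",
            (f"-{int(max_age_seconds)} seconds",),
        )
    return cur.rowcount  # number of routing sheets "rewound"
```

Run on its own schedule, this keeps the claim mechanism self-healing: a server crash mid-process only delays a document by roughly the staleness threshold.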

The total number of lines needed to make this work (of course, not counting the "validation" code in each job) is less than 150.

(And sorry for the wall of text...)
