I would expect the load time of the RPC servers to be mainly limited by the network bandwidth, so unless you have multiple NICs with direct connections to each server, I don't think this is likely to help significantly. The best way to reduce the load time would be to implement a tensor cache in the server (as previously mentioned in #9740 (comment)).
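For a rough sense of scale (illustrative numbers, not measurements): pushing ~40 GB of weights over a single gigabit link takes on the order of 320 seconds no matter how many threads issue the transfers, which is why caching rather than more parallelism is the lever here. A minimal sketch of what such a cache could look like, assuming tensor data is keyed by a content hash (names below are illustrative, not the actual rpc-server code):

```cpp
// Hypothetical sketch of a tensor cache on the rpc-server side (not the
// actual llama.cpp code): received tensor data is keyed by a content hash,
// so a repeated load of the same model can skip the network transfer.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TensorCache {
    // key: content hash of the tensor data, value: the cached bytes
    std::unordered_map<uint64_t, std::vector<uint8_t>> entries;

    // Returns a pointer to cached data for `hash`, or nullptr on a miss.
    const uint8_t * lookup(uint64_t hash) const {
        auto it = entries.find(hash);
        return it == entries.end() ? nullptr : it->second.data();
    }

    // Store data received over the network so the next load can reuse it.
    void insert(uint64_t hash, const uint8_t * data, size_t size) {
        entries[hash].assign(data, data + size);
    }
};
```

A persistent variant could spill the entries to disk so the cache survives server restarts, but the in-memory version above already captures the idea.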
-
Hi everyone,
I'm working on launching the ggml_backend_tensor_set calls in llama-model-loader.cpp asynchronously through a thread pool. Currently these calls are executed sequentially, one after another. Earlier I tried giving each device its own queue and processing the data per device, but that didn't yield the expected speedup.
Right now I'm temporarily using a single shared queue, but unfortunately that doesn't solve the problem either: I'm hitting a SIGSEGV. I will continue to look for a solution.
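To make it concrete, here is a stripped-down sketch of the shared-queue variant (simplified for this discussion, not my actual patch; class and member names are just illustrative). Each task would wrap one ggml_backend_tensor_set call:

```cpp
// Simplified sketch of the shared upload queue (not the actual patch):
// worker threads pop tasks and run them; each task would wrap one
// ggml_backend_tensor_set() call issued by the model loader.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class UploadQueue {
public:
    explicit UploadQueue(size_t n_workers) {
        for (size_t i = 0; i < n_workers; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }

    ~UploadQueue() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto & t : workers_) { t.join(); }
    }

    // Enqueue one tensor upload, e.g.
    //   queue.push([=] { ggml_backend_tensor_set(tensor, data, 0, size); });
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) { return; }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // this is where the SIGSEGV shows up for me
        }
    }

    std::vector<std::thread>          workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex                        mtx_;
    std::condition_variable           cv_;
    bool                              done_ = false;
};
```

The crash happens inside the tasks themselves, so my current suspicion is that the backend calls are not safe to run concurrently for the same device, which is also why I tried per-device queues first.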
I wanted to ask whether anyone is currently working on multithreaded offloading. I would appreciate any advice or ideas!
Why multithreading? Because our setup uses rpc-server devices spread across four PCs, and loading the model onto them takes a long time.
Thank you!