Using Tensor Parallel in the ipex-llm-serving-xpu Docker Image results in a crash. #12733
Comments
Good to know it is functioning on some systems, but on mine it still does not load the API configuration, and I am unsure why. I have checked through the docker image history: the last version to work with tensor parallel was b9.
These two environment variables can solve the b10+ vllm serving multi-TP stuck problem; make sure they are set in the docker container before vllm serving is started:
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
If you set these two environment variables and it still doesn't work, please adjust the oneCCL version.
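For example, a minimal sketch of a serving start script with both variables set before the server launches (the setvars.sh path and launch module are taken from the start script quoted later in this thread; any other flags depend on your setup):

```bash
#!/bin/bash
# Set the CCL variables suggested above before the vLLM server starts
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
# Source the oneCCL workspace environment shipped with the serving image
source /opt/intel/1ccl-wks/setvars.sh
# Launch the OpenAI-compatible server (add your usual model / tensor-parallel flags)
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server
```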
Thanks for your response. Both of these variables are set. Which oneCCL version would you recommend? Do you mean on the host machine or within the docker container? Thanks.
Install this version in the docker container: https://sourceforge.net/projects/oneccl-wks/files/2024.0.0.6.3-release/oneccl_wks_installer_2024.0.0.6.3.sh/download
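For illustration, the installer linked above can be fetched and run inside the running container roughly like this (a sketch; the install prefix is whatever the installer chooses, and /opt/intel/1ccl-wks/setvars.sh is the path referenced later in this thread):

```bash
# Run inside the serving container
wget -O oneccl_wks_installer_2024.0.0.6.3.sh \
  "https://sourceforge.net/projects/oneccl-wks/files/2024.0.0.6.3-release/oneccl_wks_installer_2024.0.0.6.3.sh/download"
chmod +x oneccl_wks_installer_2024.0.0.6.3.sh
./oneccl_wks_installer_2024.0.0.6.3.sh
# Afterwards, source the environment it provides before starting vLLM serving
source /opt/intel/1ccl-wks/setvars.sh
```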
Downloaded the file, made it executable, ran it and then retried from within the container. Unfortunately the API still refuses to start. Did I install this correctly?
Could you provide your docker start script and vllm serving start script?
Here's the docker command: Here's the vllm serving start script:
export CCL_WORKER_COUNT=2
export USE_XETLA=OFF
export CCL_SAME_STREAM=1
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server
Also yes, a very, very low gpu-memory-utilization, but on the b9 image it works and doesn't run that badly. It allows me to run two models on the same GPUs, one for fast queries and one for slow.
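For reference, a hypothetical reconstruction of the full launch command is sketched below (the exports are the ones quoted above; the api_server flags are read off the args dump in the log later in this issue and may not match the exact script used):

```bash
export CCL_WORKER_COUNT=2
export USE_XETLA=OFF
export CCL_SAME_STREAM=1
source /opt/intel/1ccl-wks/setvars.sh
# Flags below are assumptions taken from the vLLM args dump further down
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name llama \
  --model /llm/models/llama-3.2-3b-instruct-awq \
  --quantization awq \
  --load-in-low-bit asym_int4 \
  --dtype float16 \
  --max-model-len 2048 \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.075   # the low value discussed below; the pasted log itself shows 0.95
```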
Add "--privileged" when starting the docker container.
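For illustration, a minimal sketch of a docker run command with that flag added (the image tag, mounts, and other options here are assumptions; keep whatever you already use and just add --privileged):

```bash
docker run -itd \
  --privileged \
  --net=host \
  --device=/dev/dri \
  --shm-size=16g \
  -v /path/to/models:/llm/models \
  --name=ipex-llm-serving \
  intelanalytics/ipex-llm-serving-xpu:latest
```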
Done, but still not working. For reference, I did change my oneCCL version within the container again after checking to see whether 'privileged' made a difference. Neither did.
Why "--gpu-memory-utilization 0.075"? 0.75 is reasonable. Also, the low bit should be asym_int4, not sym_int4. Maybe you can try using the script I offered to start the model.
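In other words, the suggested change would look roughly like this on the serving command line (a sketch; keep the rest of your existing flags):

```bash
# Suggested values from the comment above; other flags unchanged
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --load-in-low-bit asym_int4 \
  --gpu-memory-utilization 0.75 \
  --tensor-parallel-size 2
```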
Changing the memory utilization to 0.75 and the low bit to asym_int4 doesn't make a meaningful change. If it were an issue with the GPUs, it would show errors related to memory utilisation or the low-bit format. This seems to be purely an issue with the API loading. The model and context window are all loaded into VRAM, which I can see from GPU VRAM usage, but the API refuses to load. Something between b9 and the latest image broke this, unless there is something very different about my system or docker configuration compared to yours.
Any update on this? |
Could you please share the device configurations? Especially the CPU. |
Using two Intel Arc A770s clocked at 2400 MHz, as per a previous support thread. Do you need anything else?
We recommend you use
Amazing, thank you. I thought I was going crazy for a while there, but glad it has been identified as a bug on core series CPUs. |
Hello!
When using the ipex-llm-serving-xpu docker image (latest), attempting to start an AWQ model with vLLM on a single GPU works just fine; however, increasing the tensor parallel size to 2 (to use two GPUs) results in a crash right before the OpenAI-compatible API loads. Below is the log showing how the model loads.
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from
torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have
libjpeg or libpng installed before building
torchvisionfrom source? warn( 2025-01-22 15:44:17,944 - INFO - intel_extension_for_pytorch auto imported INFO 01-22 15:44:19 api_server.py:529] vLLM API server version 0.6.2+ipexllm INFO 01-22 15:44:19 api_server.py:530] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, load_in_low_bit='asym_int4', model='/llm/models/llama-3.2-3b-instruct-awq', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=4000, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False) INFO 01-22 15:44:19 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/271150ef-d7c8-40e2-b6a5-f69fdab67938 for IPC Path. INFO 01-22 15:44:19 api_server.py:180] Started engine process with PID 1186 INFO 01-22 15:44:19 awq_marlin.py:94] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. 
Use quantization=awq_marlin for faster inference WARNING 01-22 15:44:19 config.py:319] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models. /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from
torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have
libjpeg or libpng installed before building
torchvisionfrom source? warn( 2025-01-22 15:44:21,614 - INFO - intel_extension_for_pytorch auto imported INFO 01-22 15:44:23 awq_marlin.py:94] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference WARNING 01-22 15:44:23 config.py:319] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models. 2025-01-22 15:44:23,770 INFO worker.py:1821 -- Started a local Ray instance. INFO 01-22 15:44:24 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/llama-3.2-3b-instruct-awq', speculative_config=None, tokenizer='/llm/models/llama-3.2-3b-instruct-awq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None) INFO 01-22 15:44:24 ray_gpu_executor.py:135] use_ray_spmd_worker: False (pid=1439) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from
torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have
libjpeg or libpng installed before building torchvision from source? (pid=1439) warn(
(pid=1439) 2025-01-22 15:44:26,909 - INFO - intel_extension_for_pytorch auto imported
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=1441) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
INFO 01-22 15:44:31 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7e8e7c096a50>, local_subscribe_port=52313, remote_subscribe_port=None)
INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.59it/s]
2025-01-22 15:44:31,698 - INFO - Converting the current model to asym_int4 format......
2025-01-22 15:44:31,699 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:31,997 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:31,997 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=1441) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from
torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source? (pid=1441) warn(
(pid=1441) 2025-01-22 15:44:30,162 - INFO - intel_extension_for_pytorch auto imported
2025-01-22 15:44:34,049 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-01-22 15:44:34,385 - INFO - Loading model weights took 1.2498 GB
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:37,683 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:38,424 - INFO - Loading model weights took 1.2498 GB
2025:01:22-15:44:39:( 2209) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2025:01:22-15:44:39:( 2211) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2025:01:22-15:44:39:( 1186) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2025:01:22-15:44:39:( 1186) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
-----> current rank: 0, world size: 2, byte_count: 24576000
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 2208) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 2210) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 1441) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 1441) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(WrapperWithLoadBit pid=1441) -----> current rank: 1, world size: 2, byte_count: 24576000
^CProcess ForkProcess-22:
Process ForkProcess-24:
Process ForkProcess-15:
Process ForkProcess-23:
Process ForkProcess-21:
Process ForkProcess-10:
Process ForkProcess-12:
Process ForkProcess-19:
Process ForkProcess-7:
Process ForkProcess-18:
Process ForkProcess-3:
Process ForkProcess-20:
Process ForkProcess-6:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Process ForkProcess-15:
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Process ForkProcess-5:
Process ForkProcess-16:
Process ForkProcess-2:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
Process ForkProcess-14:
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
Process ForkProcess-19:
Process ForkProcess-8:
Process ForkProcess-17:
Process ForkProcess-13:
Traceback (most recent call last):
Process ForkProcess-23:
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
*** SIGTERM received at time=1737531905 on cpu 7 ***
Process ForkProcess-4:
Process ForkProcess-24:
Process ForkProcess-8:
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
Process ForkProcess-16:
Traceback (most recent call last):
Process ForkProcess-22:
Process ForkProcess-20:
Process ForkProcess-11:
Traceback (most recent call last):
Process ForkProcess-1:
Process ForkProcess-21:
Process ForkProcess-3:
Process ForkProcess-17:
Process ForkProcess-18:
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
Process ForkProcess-10:
Process ForkProcess-5:
Process ForkProcess-13:
Process ForkProcess-14:
Traceback (most recent call last):
Process ForkProcess-9:
Traceback (most recent call last):
Process ForkProcess-12:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
Process ForkProcess-9:
Process ForkProcess-2:
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
Process ForkProcess-7:
Process ForkProcess-6:
Process ForkProcess-11:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
KeyboardInterrupt
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
Process ForkProcess-1:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Process ForkProcess-4:
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 103, in get
res = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/connection.py", line 395, in _recv
chunk = read(handle, remaining)
^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
KeyboardInterrupt
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
call_item = call_queue.get(block=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
with self._rlock:
File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in enter
return self._semlock.enter()
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
PC: @ 0x7e8a241c1e21 (unknown) ccl_yield()
@ 0x7e8fb6226520 (unknown) (unknown)
[2025-01-22 15:45:05,656 E 1186 1186] logging.cc:447: *** SIGTERM received at time=1737531905 on cpu 7 ***
[2025-01-22 15:45:05,656 E 1186 1186] logging.cc:447: PC: @ 0x7e8a241c1e21 (unknown) ccl_yield()
[2025-01-22 15:45:05,656 E 1186 1186] logging.cc:447: @ 0x7e8fb6226520 (unknown) (unknown)
Traceback (most recent call last):
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 541, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
await mp_engine_client.setup()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 225, in setup
response = await self._wait_for_server_rpc(socket)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 328, in _wait_for_server_rpc
return await self._send_get_data_rpc_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 259, in _send_get_data_rpc_request
if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 574, in
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/init.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 123, in run
raise KeyboardInterrupt()
KeyboardInterrupt
root@llmserver:/llm# /usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
^C
When it hangs right before starting the server, I send a keyboard interrupt. This kind of behaviour did not happen with the last image I used, which hadn't been updated since November.
Wondering if anyone has any insights into this.
I am using two Intel Arc A770s with an 11th Generation i5 CPU, kernel 6.5, and the i915-dkms driver.
Thank you!