Using Tensor Parallel in the ipex-llm-serving-xpu Docker Image results in a crash. #12733

Open
HumerousGorgon opened this issue Jan 22, 2025 · 17 comments


@HumerousGorgon

Hello!

When using the ipex-llm-serving-xpu Docker image (latest), starting an AWQ model with vLLM on a single GPU works just fine; however, increasing the tensor-parallel size to 2 (to use two GPUs) results in a crash right before the OpenAI-compatible API loads. Below is the log showing how the model loads and where it hangs.

```
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2025-01-22 15:44:17,944 - INFO - intel_extension_for_pytorch auto imported
INFO 01-22 15:44:19 api_server.py:529] vLLM API server version 0.6.2+ipexllm
INFO 01-22 15:44:19 api_server.py:530] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, load_in_low_bit='asym_int4', model='/llm/models/llama-3.2-3b-instruct-awq', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=4000, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 01-22 15:44:19 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/271150ef-d7c8-40e2-b6a5-f69fdab67938 for IPC Path.
INFO 01-22 15:44:19 api_server.py:180] Started engine process with PID 1186
INFO 01-22 15:44:19 awq_marlin.py:94] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
WARNING 01-22 15:44:19 config.py:319] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2025-01-22 15:44:21,614 - INFO - intel_extension_for_pytorch auto imported
INFO 01-22 15:44:23 awq_marlin.py:94] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
WARNING 01-22 15:44:23 config.py:319] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2025-01-22 15:44:23,770 INFO worker.py:1821 -- Started a local Ray instance.
INFO 01-22 15:44:24 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/llama-3.2-3b-instruct-awq', speculative_config=None, tokenizer='/llm/models/llama-3.2-3b-instruct-awq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 01-22 15:44:24 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(pid=1439) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
(pid=1439) warn(
(pid=1439) 2025-01-22 15:44:26,909 - INFO - intel_extension_for_pytorch auto imported
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=1441) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
INFO 01-22 15:44:31 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7e8e7c096a50>, local_subscribe_port=52313, remote_subscribe_port=None)
INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=1441) INFO 01-22 15:44:31 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 5.59it/s]

2025-01-22 15:44:31,698 - INFO - Converting the current model to asym_int4 format......
2025-01-22 15:44:31,699 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:31,997 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:31,997 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=1441) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
(pid=1441) warn(
(pid=1441) 2025-01-22 15:44:30,162 - INFO - intel_extension_for_pytorch auto imported
2025-01-22 15:44:34,049 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-01-22 15:44:34,385 - INFO - Loading model weights took 1.2498 GB
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:37,683 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=1441) 2025-01-22 15:44:38,424 - INFO - Loading model weights took 1.2498 GB
2025:01:22-15:44:39:( 2209) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2025:01:22-15:44:39:( 2211) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2025:01:22-15:44:39:( 1186) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2025:01:22-15:44:39:( 1186) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
-----> current rank: 0, world size: 2, byte_count: 24576000
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 2208) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 2210) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 1441) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(WrapperWithLoadBit pid=1441) 2025:01:22-15:44:39:( 1441) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(WrapperWithLoadBit pid=1441) -----> current rank: 1, world size: 2, byte_count: 24576000
^CProcess ForkProcess-22:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
    call_item = call_queue.get(block=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/queues.py", line 102, in get
    with self._rlock:
  File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 249, in _process_worker
    call_item = call_queue.get(block=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 395, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
[... the same KeyboardInterrupt tracebacks repeat, interleaved, for the remaining ForkProcess workers ...]
*** SIGTERM received at time=1737531905 on cpu 7 ***
PC: @ 0x7e8a241c1e21 (unknown) ccl_yield()
@ 0x7e8fb6226520 (unknown) (unknown)
[2025-01-22 15:45:05,656 E 1186 1186] logging.cc:447: *** SIGTERM received at time=1737531905 on cpu 7 ***
[2025-01-22 15:45:05,656 E 1186 1186] logging.cc:447: PC: @ 0x7e8a241c1e21 (unknown) ccl_yield()
[2025-01-22 15:45:05,656 E 1186 1186] logging.cc:447: @ 0x7e8fb6226520 (unknown) (unknown)
Traceback (most recent call last):
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 541, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
await mp_engine_client.setup()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 225, in setup
response = await self._wait_for_server_rpc(socket)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 328, in _wait_for_server_rpc
return await self._send_get_data_rpc_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/client.py", line 259, in _send_get_data_rpc_request
if await socket.poll(timeout=VLLM_RPC_TIMEOUT) == 0:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 574, in
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/init.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 123, in run
raise KeyboardInterrupt()
KeyboardInterrupt
root@llmserver:/llm# /usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
^C
```

It hangs right before starting the server, at which point I send a keyboard interrupt (the tracebacks above come from that interrupt). This behaviour did not happen with the last image I used, which hadn't been updated since November.
I'm wondering if anyone has any insights into this.

I am using two Intel Arc A770s with an 11th-generation Intel Core i5 CPU, kernel 6.5, and the i915-dkms driver.

Thank you!

@ACupofAir
Collaborator

I cannot reproduce your problem.
My image version:
[screenshot of the image version]

Here is my serving start script:

#!/bin/bash
model="/llm/models/Llama-3.2-3B-Instruct-AWQ"
served_model_name="Llama-3.2-3B-Instruct-AWQ"

export CCL_WORKER_COUNT=2
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.95 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --quantization awq \
  --load-in-low-bit asym_int4 \
  --max-model-len 2000 \
  --max-num-batched-tokens 3000 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --disable-async-output-proc \
  --distributed-executor-backend ray

Here is the curl script:

```bash
curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "Llama-3.2-3B-Instruct-AWQ",
          "prompt": "Introduce Shanghai",
          "max_tokens": 256
         }'
```

Output:

[screenshot of the completion output]

@HumerousGorgon
Author

HumerousGorgon commented Jan 26, 2025

Good to know it is functioning on some systems, but on mine it still does not load the API configuration, and I am unsure why.
Everything in my script is identical to yours; however, my setup stalls right before it's meant to load the config. Previous versions of the Docker image worked fine; it's only recent ones that have broken something.
Is there diagnostic data I can give you to determine what might be the cause?

I have checked through the docker image history: the last version to work with tensor parallel was b9.

@ACupofAir
Collaborator

These two environment variables can resolve the multi-TP hang seen with vLLM serving in the b10+ images; make sure they are set inside the Docker container before vLLM serving is started.

```bash
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
```

If you set these two environment variables and it still doesn't work, please try a different oneCCL version.
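
As a quick sanity check (just a sketch; the grep pattern is only illustrative), you can confirm the variables are visible in the shell that launches vLLM:

```bash
# Inside the container, in the same shell that will run the serving script:
env | grep -E 'CCL_SAME_STREAM|CCL_BLOCKING_WAIT'
# Expected output if the variables are set:
# CCL_SAME_STREAM=1
# CCL_BLOCKING_WAIT=0
```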

@HumerousGorgon
Author

Thanks for your response. Both of these variables are set. Which OneCCL version would you recommend setting? Do you mean on the host machine or within the docker container?

Thanks.

@ACupofAir
Collaborator

> Thanks for your response. Both of these variables are set. Which OneCCL version would you recommend setting? Do you mean on the host machine or within the docker container?
>
> Thanks.

Install this version in the Docker container: https://sourceforge.net/projects/oneccl-wks/files/2024.0.0.6.3-release/oneccl_wks_installer_2024.0.0.6.3.sh/download.
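
A minimal sketch of applying it inside the container (the working directory and use of wget are assumptions; the installer is a self-extracting script):

```bash
# Inside the serving container (paths are illustrative):
cd /tmp
wget -O oneccl_wks_installer_2024.0.0.6.3.sh \
  "https://sourceforge.net/projects/oneccl-wks/files/2024.0.0.6.3-release/oneccl_wks_installer_2024.0.0.6.3.sh/download"
chmod +x oneccl_wks_installer_2024.0.0.6.3.sh
./oneccl_wks_installer_2024.0.0.6.3.sh
# Re-source the oneCCL environment before starting vLLM serving:
source /opt/intel/1ccl-wks/setvars.sh
```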

@HumerousGorgon
Author

Downloaded the file, made it executable, ran it and then retried from within the container. Unfortunately the API still refuses to start.

Did I install this correctly?
All it tells me is:
Verifying archive integrity... 100% MD5 checksums are OK. All good. Uncompressing oneccl for workstation with multi ARC - v2024.0.0.6.3 100%

@ACupofAir
Collaborator

Could you provide your Docker start command and your vLLM serving start script?

@HumerousGorgon
Author

HumerousGorgon commented Jan 27, 2025

Here's the docker command:
```bash
docker run -itd --net host \
  -v /home/llm/models:/llm/models \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/card0:/dev/dri/card0 \
  --device /dev/dri/card1:/dev/dri/card1 \
  --shm-size="16g" \
  --name ipex-llm2 \
  intelanalytics/ipex-llm-serving-xpu:latest
```

Here's the vLLM serving start script:
```bash
#!/bin/bash
model="/llm/models/DeepSeek-R1-Distill-Qwen-1.5B"
served_model_name="DeepSeek-R1-Qwen-1.5B"

export CCL_WORKER_COUNT=2
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export TORCH_LLM_ALLREDUCE=0

export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.075 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 2048 \
  --max-num-batched-tokens 2500 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --disable-async-output-proc \
  --distributed-executor-backend ray
```

Also, yes, that is a very low gpu-memory-utilization, but on the b9 image it works and doesn't run that badly. It lets me run two models on the same GPUs, one for fast queries and one for slow ones.

@ACupofAir
Collaborator

ACupofAir commented Jan 27, 2025

Add `--privileged` when starting the Docker container.
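
For example, applied to the docker run command above (a sketch; only the `--privileged` flag is new):

```bash
docker run -itd --privileged --net host \
  -v /home/llm/models:/llm/models \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/card0:/dev/dri/card0 \
  --device /dev/dri/card1:/dev/dri/card1 \
  --shm-size="16g" \
  --name ipex-llm2 \
  intelanalytics/ipex-llm-serving-xpu:latest
```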

@HumerousGorgon
Author

Done, but it is still not working. For reference, I changed my oneCCL version within the container again after checking whether `--privileged` made a difference. Neither change helped.

@ACupofAir
Collaborator

Why `--gpu-memory-utilization 0.075`? 0.75 is a more reasonable value. Also, the low-bit format should be asym_int4, not sym_int4. Maybe you can try using the script I provided to start the model.

@HumerousGorgon
Author

Changing the memory utilization to 0.75 and the low-bit format to asym_int4 doesn't make a meaningful difference. If this were a GPU issue, I would expect errors related to memory utilisation or the low-bit setting. This seems to be an issue only with the API loading: the model and context window are loaded into VRAM, which I can see from the GPU VRAM usage, but the API refuses to load. Something between b9 and the latest image broke this, unless there is something very different about my system or Docker configuration compared to yours.

@HumerousGorgon
Author

Any update on this?

@glorysdj
Contributor

glorysdj commented Feb 6, 2025

Could you please share your device configuration, especially the CPU?

@HumerousGorgon
Author

Using two Intel Arc A770s clocked at 2400 MHz, as per a previous support thread. The CPU is an i5-11600KF clocked at 3.9 GHz.

Do you need anything else?

@ACupofAir
Collaborator

> Using two Intel Arc A770s clocked at 2400 MHz, as per a previous support thread. The CPU is an i5-11600KF clocked at 3.9 GHz.
>
> Do you need anything else?

We recommend you use intelanalytics/ipex-llm-serving-xpu:2.2.0-b9 for now. The multi-card TP crash in b10 and later versions occurs on Core-series CPUs, and we are working on a fix.
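
For reference, switching to the pinned tag is just (a sketch):

```bash
docker pull intelanalytics/ipex-llm-serving-xpu:2.2.0-b9
# Reuse the docker run command from earlier in the thread,
# replacing the :latest tag with :2.2.0-b9.
```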

@HumerousGorgon
Author

Amazing, thank you. I thought I was going crazy for a while there, but I'm glad it has been identified as a bug on Core-series CPUs.
