Error when running llama3.2-vision with ipex-llm's built-in ollama #12707

Open

1ngram433 opened this issue Jan 15, 2025 · 4 comments

I followed the steps in https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/ollama_quickstart.md to set up ollama. I can successfully run llama3.2 and qwen. However, when I run ollama run llama3.2-vision, I get the following error (a rough sketch of my setup is included after the error):
GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
time=2025-01-15T10:43:34.636+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-01-15T10:43:34.887+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed"
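
For reference, my setup roughly followed the quickstart linked above. A sketch of the commands I ran (the conda env name "llm-cpp" and the environment variables are the ones suggested in the quickstart, reproduced from memory, so treat this as approximate):

:: activate the conda env where ipex-llm[cpp] is installed and link the ollama binaries
conda activate llm-cpp
init-ollama.bat

:: environment variables recommended by the quickstart for Intel GPU
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1

:: start the server in one terminal
ollama serve

:: in a second terminal: llama3.2 and qwen run fine, llama3.2-vision crashes
ollama run llama3.2
ollama run qwen
ollama run llama3.2-vision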


1ngram433 commented Jan 15, 2025

The following is the complete and detailed error message:

time=2025-01-15T10:43:29.059+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2025-01-15T10:43:29.059+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-01-15T10:43:29.059+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-01-15T10:43:29.097+08:00 level=INFO source=server.go:105 msg="system memory" total="31.9 GiB" free="24.4 GiB" free_swap="25.2 GiB"
time=2025-01-15T10:43:29.097+08:00 level=INFO source=memory.go:356 msg="offload to device" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[24.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.9 GiB" memory.required.partial="0 B" memory.required.kv="656.2 MiB" memory.required.allocations="[10.9 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2025-01-15T10:43:29.111+08:00 level=INFO source=server.go:401 msg="starting llama server" cmd="C:\\Users\\1ngram4\\llama-cpp\\dist\\windows-amd64\\lib\\ollama\\runners\\ipex_llm\\ollama_llama_server.exe --model C:\\Users\\1ngram4\\.ollama\\models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --mmproj C:\\Users\\1ngram4\\.ollama\\models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 6 --no-mmap --parallel 1 --port 51907"
time=2025-01-15T10:43:29.116+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-01-15T10:43:29.116+08:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-01-15T10:43:29.117+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-01-15T10:43:29.151+08:00 level=INFO source=runner.go:963 msg="starting go runner"
time=2025-01-15T10:43:29.159+08:00 level=INFO source=runner.go:964 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(clang)" threads=6
time=2025-01-15T10:43:29.160+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:51907"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from C:\Users\1ngram4\.ollama\models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
time=2025-01-15T10:43:29.368+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = mllama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 11B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 9.78 B
llm_load_print_meta: model size       = 5.55 GiB (4.87 BPW)
llm_load_print_meta: general.name     = Model
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token        = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.36 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  5397.51 MiB
llm_load_tensors:  SYCL_Host buffer size =   281.83 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|    1.6|    448|    1024|   32|  8319M|            1.3.31441|
llama_kv_cache_init:      SYCL0 KV buffer size =   656.25 MiB
llama_new_context_with_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load:
mllama_model_load: vision using CPU backend
D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml.c:5964: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
time=2025-01-15T10:43:34.636+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-01-15T10:43:34.887+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed"
[GIN] 2025/01/15 - 10:43:34 | 500 |    5.8456854s |       127.0.0.1 | POST     "/api/generate"

sgwhat (Contributor) commented Jan 16, 2025

Hi @1ngram433, what's your ollama version? And could you please provide the specifications of your device?

@1ngram433 (Author)

Thank you. My ollama version is 0.5.1-ipexllm-20250114, and my ipex-llm version is 2.2.0b20250114.

CPU: AMD Ryzen 5 5500
GPU: Intel Arc A750 Graphics
RAM: 32 GB
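
For completeness, a sketch of how these versions can be checked (standard ollama and pip commands, run inside the same conda environment used for init-ollama):

:: prints the ollama build version (0.5.1-ipexllm-20250114 here)
ollama -v

:: prints the installed ipex-llm package version (2.2.0b20250114 here)
pip show ipex-llm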

sgwhat (Contributor) commented Jan 16, 2025

Well, I cannot reproduce your issue. I suspect it is related to the use of an AMD CPU on your device.
