
[BUG] [CPU Memory OOM] DeepSeek R1 gets OS oom-kill when packing model.layers #1355

Open
ShiningMaker opened this issue Feb 27, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@ShiningMaker

Describe the bug

From my dmesg output, it is evident that the GPTQ Python process (PID 1179327) was killed by the kernel's OOM killer because the system ran out of CPU memory.

[659992.292163] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/default/dbfe5de8b17117d4ce1260b30ec75b84a6fb13e34205aa8e361b6e086648f779,task=python3,pid=1179327,uid=0
[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998
[659992.468190] systemd[1]: [email protected]: Succeeded.
[659992.468649] systemd[1]: rdma-ndd.service: Main process exited, code=killed, status=9/KILL
[659992.478226] systemd[1]: rdma-ndd.service: Failed with result 'signal'.
[659992.487228] systemd[1]: [email protected]: Succeeded.
[659992.487563] systemd[1]: AssistDaemon.service: Main process exited, code=killed, status=9/KILL
[659992.497382] systemd[1]: AssistDaemon.service: Failed with result 'signal'.
[659992.506469] systemd[1]: pingmesh-lingjun-agent.service: Failed with result 'signal'.
[659992.516165] systemd[1]: systemd-logind.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[659992.516548] systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 7.
[659992.516556] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[660026.497156] oom_reaper: reaped process 1179327 (python3), now anon-rss:0kB, file-rss:79868kB, shmem-rss:17288kB

GPU Info

NVIDIA H20

Software Info

Show output of:

pip show gptqmodel torch transformers accelerate triton

Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, compressed-tensors, flashinfer-python, gptqmodel, lm_eval, outlines, peft, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.49.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, gptqmodel, lm_eval, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.4.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch

Model

DeepSeek-R1-BF16 from huggingface

To Reproduce

from gptqmodel import GPTQModel, QuantizeConfig

# model_path and calibration_dataset are defined elsewhere in the script
quant_config = QuantizeConfig(bits=8, group_size=128, desc_act=False)

model = GPTQModel.load(model_path, quant_config, device_map="auto", device="cuda",
                       trust_remote_code=True, low_cpu_mem_usage=True)

model.quantize(calibration_dataset, calibration_dataset_concat_size=1024,
               buffered_fwd=True, batch_size=2)

Additional context

When I asked GPT-4o, its reply was as follows:

Possible reasons:

Memory leak: Your Python program might have a memory leak, causing it to continuously consume memory during processing without releasing some objects that are no longer needed.

Handling large amounts of data: The program may be loading or processing a large amount of data, and if the memory requirements exceed the available memory of the system, it will trigger an OOM (Out of Memory) error.

Concurrent operations: If you are dealing with multiple processes or threads, it could lead to increased memory requirements.

Summary

OOM (Out of Memory) issues that lead to process termination due to insufficient memory are common problems, especially in tasks involving large datasets. It is recommended that you start by checking your code and optimizing memory usage, looking for potential memory leaks or improving the way memory is utilized.

ShiningMaker added the bug label on Feb 27, 2025
@Qubitium
Collaborator

Qubitium commented Feb 27, 2025

@ShiningMaker We need the following:

  1. How much VRAM do you have?
  2. How much CPU RAM do you have?
  3. How much CPU swap do you have?

For DeepSeek V3/R1 BF16, you should have 1.5TB of CPU memory to avoid OOM.
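
As a rough sanity check on that 1.5TB figure (assuming the commonly cited ~671B total parameters for DeepSeek V3/R1; this arithmetic is illustrative, not from the thread):

# Back-of-the-envelope estimate; the ~671B parameter count is an assumption.
params = 671e9
bf16_weights_tb = params * 2 / 1e12   # 2 bytes per BF16 weight -> ~1.34 TB
print(f"BF16 weights alone: ~{bf16_weights_tb:.2f} TB")

So ~1.5TB of CPU RAM is roughly the floor just to hold the BF16 weights, before any working copies or Python overhead.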

Qubitium changed the title from "[BUG] oom-kill when packing model.layers" to "[BUG] [CPU Memory OOM] DeepSeek R1 gets OS oom-kill when packing model.layers" on Feb 27, 2025
@Qubitium
Collaborator

@ShiningMaker I noticed the OOM-killed process had ~2TB of memory (total-vm), which includes mmap/disk-backed memory. What is the max CPU memory in your VM or compute instance? 2TB should be more than enough even for DeepSeek R1.

[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998

@ShiningMaker
Author

I restarted GPTQ to perform INT8 quantization. While it was quantizing layer 11/60, I checked memory with free -h. My understanding is that 2TB of memory should be sufficient for R1's requirements. However, during the packing process there may also be an INT8 copy of the model held in CPU memory. The original OOM occurred while packing layer 8/60.

              total        used        free      shared  buff/cache   available
Mem:           2.0Ti       1.6Ti        12Gi        24Mi       402Gi       403Gi
Swap:             0B          0B          0B
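
If that hypothesis is right, the arithmetic roughly fits (again assuming ~671B parameters; illustrative only, not a measurement from this run):

# Assumption: a BF16 copy and an INT8 copy of the weights coexist in CPU RAM.
params = 671e9
bf16_tb = params * 2 / 1e12   # ~1.34 TB
int8_tb = params * 1 / 1e12   # ~0.67 TB, ignoring scale/zero-point overhead
print(f"combined: ~{bf16_tb + int8_tb:.1f} TB")   # ~2.0 TB, close to the ~2.1TB anon-rss at kill time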

@Qubitium
Collaborator

Qubitium commented Feb 27, 2025

Are you saying that, while packing, another non-quant-related INT8 R1 model was loaded into memory by accident? Is that correct? That's bad news.

Also, I see about 400GB of buffer/cache (disk cache). You can check it with the Linux free CLI before and during packing, and try to release it (e.g. by dropping the page cache), to see if more memory becomes available. Linux does not release buffers as eagerly as macOS and other systems.
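
For reference, a minimal way to log the same numbers from inside the quantization script, so memory pressure can be tracked per layer (psutil is already installed as an accelerate dependency; the helper name is illustrative):

import psutil

def log_cpu_mem(tag: str) -> None:
    # Print used/available CPU memory, roughly mirroring `free -h`.
    vm = psutil.virtual_memory()
    gib = 1024 ** 3
    print(f"[{tag}] used={vm.used / gib:.0f} GiB  available={vm.available / gib:.0f} GiB")

log_cpu_mem("before packing layer 8")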

@ShiningMaker
Author

No, no, no. What I mean is that the INT8 model generated during the packing process coexists with the original BF16 model (is that situation possible?). And to release memory, would explicitly calling gc.collect() in gptqmodel's packing code work?

@Qubitium
Collaborator

@ShiningMaker You can try calling torch_empty_cache from our utils; it releases both GPU and CPU memory and runs gc at the same time.
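
For anyone wanting to test this, a sketch of what the call could look like between layers; the import path is an assumption, so verify where torch_empty_cache lives in your installed gptqmodel version:

import gc
import torch

try:
    # Assumed location of the helper mentioned above; check your gptqmodel version.
    from gptqmodel.utils.torch import torch_empty_cache
except ImportError:
    def torch_empty_cache():
        # Manual fallback doing what the comment describes: run gc and empty the CUDA cache.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

torch_empty_cache()  # e.g. after each layer is packed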

@1773226512

1773226512 commented Mar 3, 2025

@Qubitium I also encountered the same problem. I believe that after packing a layer, deleting the corresponding FP32 fake-quantized weights and releasing the CPU memory could help when packing large models like DeepSeek-V3. However, I'm not sure how to achieve this.
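
A rough sketch of that general idea, not gptqmodel's actual packing loop (the function and the way a layer is obtained are hypothetical, and dropping tensor storage only helps if nothing else still references it):

import gc
import torch

def drop_layer_fp_weights(layer: torch.nn.Module) -> None:
    # Hypothetical cleanup: replace each full-precision parameter's storage with an
    # empty tensor so the original buffer can be garbage-collected after packing.
    for _, param in layer.named_parameters(recurse=True):
        param.data = torch.empty(0)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()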
