
[BUG] [CPU Memory OOM] DeepSeek R1 gets OS oom-kill when packing model.layers #1355

Open
ShiningMaker opened this issue Feb 27, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@ShiningMaker

Describe the bug

From my dmesg output, it is evident that the GPTQ Python process (PID 1179327) was killed by the kernel's OOM killer because the system ran out of CPU memory.

[659992.292163] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/default/dbfe5de8b17117d4ce1260b30ec75b84a6fb13e34205aa8e361b6e086648f779,task=python3,pid=1179327,uid=0
[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998
[659992.468190] systemd[1]: [email protected]: Succeeded.
[659992.468649] systemd[1]: rdma-ndd.service: Main process exited, code=killed, status=9/KILL
[659992.478226] systemd[1]: rdma-ndd.service: Failed with result 'signal'.
[659992.487228] systemd[1]: [email protected]: Succeeded.
[659992.487563] systemd[1]: AssistDaemon.service: Main process exited, code=killed, status=9/KILL
[659992.497382] systemd[1]: AssistDaemon.service: Failed with result 'signal'.
[659992.506469] systemd[1]: pingmesh-lingjun-agent.service: Failed with result 'signal'.
[659992.516165] systemd[1]: systemd-logind.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[659992.516548] systemd[1]: systemd-logind.service: Scheduled restart job, restart counter is at 7.
[659992.516556] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[660026.497156] oom_reaper: reaped process 1179327 (python3), now anon-rss:0kB, file-rss:79868kB, shmem-rss:17288kB

GPU Info

NVIDIA H20

Software Info

Show output of:

pip show gptqmodel torch transformers accelerate triton

Name: gptqmodel
Version: 2.0.0.dev0
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Home-page: https://github.com/ModelCloud/GPTQModel
Author: ModelCloud
Author-email: [email protected]
License: Apache 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, datasets, device-smi, hf_transfer, huggingface_hub, lm-eval, numpy, packaging, pillow, protobuf, safetensors, threadpoolctl, tokenicer, torch, transformers
Required-by: 
---
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, setuptools, sympy, triton, typing-extensions
Required-by: accelerate, compressed-tensors, flashinfer-python, gptqmodel, lm_eval, outlines, peft, torchaudio, torchvision, vllm, xformers, xgrammar
---
Name: transformers
Version: 4.49.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, gptqmodel, lm_eval, peft, tokenicer, vllm, xgrammar
---
Name: accelerate
Version: 1.4.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: gptqmodel, lm_eval, peft
---
Name: triton
Version: 3.1.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: [email protected]
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock
Required-by: torch

Model

DeepSeek-R1-BF16 from huggingface

To Reproduce

from gptqmodel import GPTQModel, QuantizeConfig

# model_path and calibration_dataset are defined elsewhere in the script
quant_config = QuantizeConfig(bits=8, group_size=128, desc_act=False)

model = GPTQModel.load(model_path, quant_config, device_map="auto", device="cuda",
                       trust_remote_code=True, low_cpu_mem_usage=True)

model.quantize(calibration_dataset, calibration_dataset_concat_size=1024,
               buffered_fwd=True, batch_size=2)

Additional context

When I asked GPT-4o, its reply was as follows:

Possible reasons:

Memory leak: Your Python program might have a memory leak, causing it to continuously consume memory during processing without releasing some objects that are no longer needed.

Handling large amounts of data: The program may be loading or processing a large amount of data, and if the memory requirements exceed the available memory of the system, it will trigger an OOM (Out of Memory) error.

Concurrent operations: If you are dealing with multiple processes or threads, it could lead to increased memory requirements.

Summary

OOM (Out of Memory) issues that lead to process termination due to insufficient memory are common problems, especially in tasks involving large datasets. It is recommended that you start by checking your code and optimizing memory usage, looking for potential memory leaks or improving the way memory is utilized.

ShiningMaker added the bug label on Feb 27, 2025
@Qubitium
Collaborator

Qubitium commented Feb 27, 2025

@ShiningMaker We need the following:

  1. How much VRAM do you have?
  2. How much CPU RAM do you have?
  3. How much CPU swap do you have?

For DeepSeek V3/R1 BF16, you should have 1.5TB of CPU memory to avoid OOM.
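
As a rough sanity check on that 1.5TB figure (assuming the commonly cited ~671B total parameters for DeepSeek V3/R1; this arithmetic is illustrative, not from the thread):

# Back-of-the-envelope estimate; the ~671B parameter count is an assumption.
params = 671e9
bf16_weights_tb = params * 2 / 1e12   # 2 bytes per BF16 weight -> ~1.34 TB
print(f"BF16 weights alone: ~{bf16_weights_tb:.2f} TB")

So ~1.5TB of CPU RAM is roughly the floor just to hold the BF16 weights, before any working copies or Python overhead.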

Qubitium changed the title from "[BUG] oom-kill when packing model.layers" to "[BUG] [CPU Memory OOM] DeepSeek R1 gets OS oom-kill when packing model.layers" on Feb 27, 2025
@Qubitium
Collaborator

@ShiningMaker I noticed the OOM-killed process had ~2TB of memory (total-vm), which includes mmap/disk-backed memory. What is the max CPU memory in your VM or compute instance? 2TB should be more than enough even for DeepSeek R1.

[659992.293749] Out of memory: Killed process 1179327 (python3) total-vm:2189763248kB, anon-rss:2081193264kB, file-rss:430540kB, shmem-rss:17288kB, UID:0 pgtables:4122204kB oom_score_adj:-998

@ShiningMaker
Author

I restarted GPTQ to perform INT8 quantization. While it was quantizing layer 11/60, I checked memory with free -h. My understanding is that 2TB of memory should be sufficient for R1's requirements. However, during the packing process there may also be an INT8 copy of the model held in CPU memory. The original OOM occurred while packing layer 8/60.

              total        used        free      shared  buff/cache   available
Mem:           2.0Ti       1.6Ti        12Gi        24Mi       402Gi       403Gi
Swap:             0B          0B          0B
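
If that hypothesis is right, the arithmetic roughly fits (again assuming ~671B parameters; illustrative only, not a measurement from this run):

# Assumption: a BF16 copy and an INT8 copy of the weights coexist in CPU RAM.
params = 671e9
bf16_tb = params * 2 / 1e12   # ~1.34 TB
int8_tb = params * 1 / 1e12   # ~0.67 TB, ignoring scale/zero-point overhead
print(f"combined: ~{bf16_tb + int8_tb:.1f} TB")   # ~2.0 TB, close to the ~2.1TB anon-rss at kill time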

@Qubitium
Collaborator

Qubitium commented Feb 27, 2025

Are you saying that, while packing, another non-quant-related INT8 R1 model was loaded into memory by accident? Is that correct? That's bad news.

Also, I see about 400GB of buffer/cache (disk cache). You can check it with the Linux free CLI before and during packing, and try to release it (e.g. by dropping the page cache), to see if more memory becomes available. Linux does not release buffers as eagerly as macOS and other systems.
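
For reference, a minimal way to log the same numbers from inside the quantization script, so memory pressure can be tracked per layer (psutil is already installed as an accelerate dependency; the helper name is illustrative):

import psutil

def log_cpu_mem(tag: str) -> None:
    # Print used/available CPU memory, roughly mirroring `free -h`.
    vm = psutil.virtual_memory()
    gib = 1024 ** 3
    print(f"[{tag}] used={vm.used / gib:.0f} GiB  available={vm.available / gib:.0f} GiB")

log_cpu_mem("before packing layer 8")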

@ShiningMaker
Author

No, no, no. What I mean is that the INT8 model generated during the packing process coexists with the original BF16 model (is that situation possible?). And to release memory, would explicitly calling gc.collect() in gptqmodel's packing code work?

@Qubitium
Collaborator

@ShiningMaker You can try calling torch_empty_cache from our utils; it releases both GPU and CPU memory and runs gc at the same time.
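
For anyone wanting to test this, a sketch of what the call could look like between layers; the import path is an assumption, so verify where torch_empty_cache lives in your installed gptqmodel version:

import gc
import torch

try:
    # Assumed location of the helper mentioned above; check your gptqmodel version.
    from gptqmodel.utils.torch import torch_empty_cache
except ImportError:
    def torch_empty_cache():
        # Manual fallback doing what the comment describes: run gc and empty the CUDA cache.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

torch_empty_cache()  # e.g. after each layer is packed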

@1773226512

1773226512 commented Mar 3, 2025

@Qubitium I also encountered the same problem. I believe that after packing a layer, deleting the corresponding FP32 fake-quantized weights and releasing the CPU memory could help when packing large models like DeepSeek-V3. However, I'm not sure how to achieve this.
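
A rough sketch of that general idea, not gptqmodel's actual packing loop (the function and the way a layer is obtained are hypothetical, and dropping tensor storage only helps if nothing else still references it):

import gc
import torch

def drop_layer_fp_weights(layer: torch.nn.Module) -> None:
    # Hypothetical cleanup: replace each full-precision parameter's storage with an
    # empty tensor so the original buffer can be garbage-collected after packing.
    for _, param in layer.named_parameters(recurse=True):
        param.data = torch.empty(0)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()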
