SEGV fault from PState #1

Open
markhagemann opened this issue Jul 20, 2024 · 5 comments

Comments

@markhagemann

I imported faulthandler and got some info:

Jul 20 22:41:46 drache nvml-undervolt[86317]: Warning: Persistence mode is already enabled - make sure no oth>
Jul 20 22:43:35 drache nvml-undervolt[86317]: Fatal Python error: Segmentation fault
Jul 20 22:43:35 drache nvml-undervolt[86317]: Current thread 0x000076ad66104740 (most recent call first):
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/home/drache/.pyenv/versions/3.12.4/lib/python3.12/site>
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/usr/local/sbin/nvml-undervolt", line 154 in set_pstate>
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/usr/local/sbin/nvml-undervolt", line 377 in main
Jul 20 22:43:35 drache nvml-undervolt[86317]:   File "/usr/local/sbin/nvml-undervolt", line 441 in <module>
Jul 20 22:43:35 drache systemd[1]: nvml-undervolt.service: Main process exited, code=dumped, status=11/SEGV
Jul 20 22:43:35 drache systemd[1]: nvml-undervolt.service: Failed with result 'core-dump'.
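For anyone wanting to capture the same kind of trace, this is a minimal sketch of turning on faulthandler near the top of a script. Where exactly it was added in nvml-undervolt is not shown in this thread, so the placement here is an assumption.

```python
import faulthandler
import sys

# Dump the Python traceback of every thread when the process receives a
# fatal signal such as SIGSEGV. sys.__stderr__ is the interpreter's original
# stderr stream, which systemd captures into the journal for a service.
faulthandler.enable(file=sys.__stderr__, all_threads=True)

# Report whether the handler is active before the rest of the script runs.
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing the script by starting Python with `-X faulthandler` or setting the `PYTHONFAULTHANDLER` environment variable.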

Problem seems to be with setting pstate clocks

    def set_pstate_clocks(handle, clock_type, clock_offset, target_pstates):
        for pstate in range(0, target_pstates + 1):
            struct = c_nvmlClockOffset_t()
            struct.version = nvmlClockOffset_v1
            struct.type = clock_type
            struct.pstate = pstate
            struct.clockOffsetMHz = clock_offset
            return nvmlDeviceSetClockOffsets(handle, struct)

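A segfault inside a native call cannot be caught with try/except, so one way to narrow it down is to log each P-state right before the call is made. The sketch below is a hypothetical variant of the function above, not the script's actual code: `apply_offset` is an assumed callable that would build the `c_nvmlClockOffset_t` struct and call `nvmlDeviceSetClockOffsets`, and unlike the original it collects one result per P-state.

```python
def _flushed_print(message):
    # Flush immediately so the line is not lost if the process dies
    # inside the native library right after it is written.
    print(message, flush=True)


def set_pstate_clocks_traced(apply_offset, clock_type, clock_offset,
                             target_pstates, log=_flushed_print):
    """Apply clock_offset to P-states 0..target_pstates, logging before each call.

    apply_offset is a hypothetical callable (clock_type, pstate, offset) -> result
    standing in for the struct setup and the NVML call.
    """
    results = []
    for pstate in range(0, target_pstates + 1):
        # Log first: if the call crashes, this is the last line in the output
        # and it names the P-state that was being set.
        log(f"Applying offset {clock_offset} MHz (clock type {clock_type}) to P{pstate}")
        results.append(apply_offset(clock_type, pstate, clock_offset))
    return results
```

If the crash always follows the same "Applying offset ..." line, that would point at one specific P-state rather than at the call in general.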
@jacklul
Owner

jacklul commented Jul 20, 2024

I would like to see what happens before it segfaults - try running with verbose mode turned on and with sudo (verbose mode does not output to systemd journal).

@markhagemann
Author

Seems to hang indefinitely with verbose mode.

➜ sudo python3 nvml-undervolt.py --core-offset 100 --target-clock 1725 --transition-clock 1500 --power-limit 285 --temperature-limit 72
Detected NVIDIA GeForce RTX 3080 Ti (GPU-a2cb5b35-c9ba-34eb-f3ae-1f4687448ffa)
Warning: Persistence mode is already enabled - make sure no other script is controlling clocks
Running main loop (sleep = 0.5)...
[1]    18719 segmentation fault  sudo python3 nvml-undervolt.py --core-offset 100 --target-clock 1725  1500
➜ sudo python3 nvml-undervolt.py --verbose --core-offset 100 --target-clock 1725 --transition-clock 1500 --power-limit 285 --temperature-limit 72
Namespace(env=None, index=0, uuid='', core_offset=100, memory_offset=0, target_clock=1725, transition_clock=1500, curve=False, curve_increment=0.0, clock_step=0.0, power_limit=285, temperature_limit=72, pstates=0, sleep=0.5, verbose=True, test=False)
Detected NVIDIA GeForce RTX 3080 Ti (GPU-a2cb5b35-c9ba-34eb-f3ae-1f4687448ffa)
Supported core clocks: [2100, 2085, 2070, 2055, 2040, 2025, 2010, 1995, 1980, 1965, 1950, 1935, 1920, 1905, 1890, 1875, 1860, 1845, 1830, 1815, 1800, 1785, 1770, 1755, 1740, 1725, 1710, 1695, 1680, 1665, 1650, 1635, 1620, 1605, 1590, 1575, 1560, 1545, 1530, 1515, 1500, 1485, 1470, 1455, 1440, 1425, 1410, 1395, 1380, 1365, 1350, 1335, 1320, 1305, 1290, 1275, 1260, 1245, 1230, 1215, 1200, 1185, 1170, 1155, 1140, 1125, 1110, 1095, 1080, 1065, 1050, 1035, 1020, 1005, 990, 975, 960, 945, 930, 915, 900, 885, 870, 855, 840, 825, 810, 795, 780, 765, 750, 735, 720, 705, 690, 675, 660, 645, 630, 615, 600, 585, 570, 555, 540, 525, 510, 495, 480, 465, 450, 435, 420, 405, 390, 375, 360, 345, 330, 315, 300, 285, 270, 255, 240, 225, 210]
Clock step is 15.0 MHz
Warning: Persistence mode is already enabled - make sure no other script is controlling clocks
Setting power limit to 285 W
Setting temperature limit to 72 C
Running main loop (sleep = 0.5)...
Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500
Enabling undervolt settings at P0 1500
Locking core clocks at 1500 - 1725
Setting core offset to 100
Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500

@jacklul
Owner

jacklul commented Jul 20, 2024

Disabling undervolt settings at P5 1500
Setting core offset to 0
Locking core clocks at 0 - 1500

Is the card under load at this point?
Perhaps your card just isn't stable at 1500 MHz with a +100 offset?
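One way to answer the load question is to sample utilization, P-state and core clock around the moment of the crash. This is a rough sketch, assuming the `pynvml` bindings are passed in as `nvml`. The helper name `read_gpu_state` is made up for illustration and is not part of the script.

```python
def read_gpu_state(nvml, handle):
    """Return a snapshot of GPU load, P-state and core clock.

    nvml is expected to be the pynvml module (passed in so the helper can
    be exercised without a GPU); handle is an NVML device handle.
    """
    utilization = nvml.nvmlDeviceGetUtilizationRates(handle)
    return {
        # Percentage of time the GPU was busy over the last sample period.
        "gpu_util_percent": utilization.gpu,
        # Current performance state as an integer (0 corresponds to P0).
        "pstate": nvml.nvmlDeviceGetPerformanceState(handle),
        # Current graphics (core) clock in MHz.
        "core_clock_mhz": nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_GRAPHICS),
    }
```

With the real library this would be called as `read_gpu_state(pynvml, handle)` after `pynvml.nvmlInit()` and `pynvml.nvmlDeviceGetHandleByIndex(0)`, for example once per main-loop iteration.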

@markhagemann
Author

markhagemann commented Jul 20, 2024

Possibly, but I've experimented with other values, including ones such as 1410 that are stable in Windows with Afterburner (will attach screenshot), and it doesn't make a difference in verbose mode. It all still seems to work though, so I am quite content for the moment, but it will be interesting to see if anyone else encounters it.

[Screenshot: undervolt]

@jacklul
Owner

jacklul commented Jul 20, 2024

That segfault might also just be a bug in NVIDIA's library, which wouldn't be a surprise.
My script is very simple - it uses the API as documented - so nothing else really comes to mind right now...

Perhaps there is a difference in the API behavior between Windows and Linux - I tested the script on Windows only.

Edit: Tested it on Linux myself and it does indeed seem to happen. This has to be a platform-specific bug, because I was able to keep the script running for over an hour while playing a game on Windows and it did its job fine.
