Fix IDAKLU solver crash when running in multiple threads + when telemetry is prompted for #4583

agriyakhetarpal · 2024-11-13T00:32:40Z

Description

This came up as an unrelated issue in #4582 when testing the wheels and apparently stems from #4441 where a thread was being used to check for the input. This caused the Linux wheel tests to fail: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11805727556/job/32890103930. This PR redoes that with a multi-platform approach and avoids the use of threads altogether by checking for different environments: Windows via msvcrt and Linux/macOS via termios.
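For readers unfamiliar with the two backends, here is a minimal sketch of how the Windows side of a thread-free prompt could look. It is not the code from this PR: select_backend is a hypothetical helper, and the body of get_input_or_timeout_windows is an assumption built on the standard msvcrt.kbhit() and msvcrt.getwche() calls. It only illustrates polling the console without starting a thread.

```python
import sys
import time


def select_backend(platform=sys.platform):
    """Hypothetical helper: name the stdlib module used to read the console."""
    return "msvcrt" if platform == "win32" else "termios"


def get_input_or_timeout_windows(timeout):
    """Windows-only sketch: return the typed line, or None after `timeout` seconds."""
    import msvcrt  # only importable on Windows, so import it lazily

    deadline = time.monotonic() + timeout
    chars = []
    while time.monotonic() < deadline:
        if msvcrt.kbhit():  # a key press is waiting in the console buffer
            char = msvcrt.getwche()  # read it and echo it to the console
            if char in ("\r", "\n"):
                return "".join(chars)
            chars.append(char)
        else:
            time.sleep(0.1)  # avoid a busy loop while waiting for input
    return None
```

The Linux/macOS side would replace the kbhit() check with select on sys.stdin while the terminal is in raw mode.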

~~A possible corner case for Jupyter Notebooks (running in VS Code versus Jupyter Notebook/JupyterLab) has also been fixed.~~ I've resorted to a more rudimentary fix using the previous method for now, since that fix was erroneous and worked only intermittently.

Type of change

Please add a line in the relevant section of CHANGELOG.md to document the change (include PR #) - note reverse order of PR #s. If necessary, also add to the list of breaking changes.

New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)

Key checklist:

No style issues: $ pre-commit run (or $ nox -s pre-commit) (see CONTRIBUTING.md for how to set this up to run automatically when committing locally, in just two lines of code)
All tests pass: $ python run-tests.py --all (or $ nox -s tests)
The documentation builds: $ python run-tests.py --doctest (or $ nox -s doctests)

You can run integration tests, unit tests, and doctests together at once, using $ python run-tests.py --quick (or $ nox -s quick).

Further checks:

Code is commented, particularly in hard-to-understand areas
Tests added that prove fix is effective or that feature works

agriyakhetarpal · 2024-11-13T00:41:20Z

This approach might seem overkill but this is what came to mind first and seems like the most robust way to me, so I'd like to test this a bit more – I've converted this to a draft for now. I triggered a wheel build here, let's see if it passes: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11808256997

I've confirmed the fix locally with a fresh PyBaMM installation (i.e., with the pybamm/config.yml file not present in my Application Support folder) every time before importing PyBaMM:

inside a terminal
in an IPython shell
in a notebook running in JupyterLab
in a notebook running in VS Code

across the following scenarios:

telemetry disabled
telemetry enabled (which should be the same thing)
a timeout (no input was given)

codecov · 2024-11-13T00:57:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.26%. Comparing base (bcdd0b5) to head (5495814).
Report is 95 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4583      +/-   ##
===========================================
- Coverage    99.26%   99.26%   -0.01%     
===========================================
  Files          302      302              
  Lines        22889    22866      -23     
===========================================
- Hits         22721    22698      -23     
  Misses         168      168


agriyakhetarpal · 2024-11-13T01:10:35Z

The buttons aren't working in the notebooks, actually – I'm quite sure they were working when I was testing earlier, so I guess I made some additional changes which made them break. I'll debug later in the day, and I'll consider dropping the buttons in favour of a simple input in case I don't get them to work.

agriyakhetarpal · 2024-11-13T14:46:53Z

I switched to a more basic implementation for Jupyter notebooks, and I've seen bug reports about how inputs don't work properly with JupyterLab <4, but I don't think (too) many people are using version 3 now, so we should be good.

agriyakhetarpal · 2024-11-13T14:49:02Z

Additionally, I tested on Google Colab by installing PyBaMM in editable mode from my branch – we should be ready to go with this. Suggestions on cleaning up the code are welcome. The only place I haven't been able to test is Windows, because I don't have a Windows machine.

agriyakhetarpal · 2024-11-13T14:56:48Z

Linux wheel builds passing: https://github.com/agriyakhetarpal/PyBaMM/actions/runs/11819747790/job/32930388878

kratman · 2024-11-13T14:57:14Z

Ok I will take a close look at this in an hour or so

kratman · 2024-11-13T22:48:04Z

src/pybamm/config.py

+    if is_notebook():  # pragma: no cover
+        try:
+            from IPython.display import clear_output
+
+            user_input = input("Do you want to enable telemetry? (Y/n): ")
+            clear_output()
+
+            return user_input
+
+        except Exception:  # pragma: no cover
+            return None


Not a major issue, but this has no timeout, so people's notebooks will just hang if they don't realize this popped up

Yes, this is intentional. I opine that notebooks are meant to be used interactively by default – I expect this to be a one-time issue, since importing PyBaMM and choosing "yes" or "no" once would mean that the config gets saved and this never comes up again (until the config file gets deleted, of course).
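The is_notebook() check used in the hunk above is not shown in this thread. A common way to write such a check, given here only as an assumption and not as PyBaMM's actual implementation, is to look at the class name of the active IPython shell:

```python
def is_notebook():
    """Sketch: True only when running inside a Jupyter (ZMQ) kernel."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook
        return False

    shell = get_ipython()  # None in a plain Python interpreter
    # Jupyter kernels use ZMQInteractiveShell; the terminal IPython shell
    # uses TerminalInteractiveShell and is treated as "not a notebook" here.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"
```

Frontends other than Jupyter may report a different shell class, so the string comparison is a heuristic.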

kratman · 2024-11-13T22:51:43Z

src/pybamm/config.py

+                while time.time() - start_time < timeout:
+                    rlist, _, _ = select.select([sys.stdin], [], [], 0.1)


If we are already using select, do we need the loop? The timeout isn't very long

Yes, I checked and this loop ended up being required. While a normal input() call in Python blocks indefinitely, I found that there's no (built-in) way to time out if we are in raw terminal mode (set above in tty.setraw()). So we need to handle everything manually, which means reading the "y", "n", or empty characters and echoing them to the screen.

select lets us say "wake me up when, either":

there's input to read, or

one-tenth of a second has passed

which the loop apparently allows us to do. But if you have something else in mind, I'm happy to try that.
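The polling pattern described in this reply can be sketched as follows. This is an illustration under stated assumptions and not the code from the PR: read_line_with_timeout and poll_interval are hypothetical names, the function reads from a raw file descriptor so it can be tried on a pipe, and it assumes ASCII input on a POSIX system.

```python
import os
import select
import time


def read_line_with_timeout(fd, timeout, poll_interval=0.1):
    """Return the line read from `fd`, or None if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    chars = []
    while time.monotonic() < deadline:
        # Wake up when `fd` has data to read or `poll_interval` has elapsed.
        readable, _, _ = select.select([fd], [], [], poll_interval)
        if not readable:
            continue  # nothing typed yet, check the deadline again
        char = os.read(fd, 1).decode("ascii", errors="replace")
        if char in ("\r", "\n", ""):  # Enter, or EOF on a closed pipe
            return "".join(chars)
        chars.append(char)
    return None
```

A single select call with the full timeout would be enough to wait for the first key press; the loop is what allows several characters to be collected until Enter while still respecting one overall deadline.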

kratman · 2024-11-13T22:52:25Z

src/pybamm/config.py

+            import select
+
+            # Save terminal settings for later
+            old_settings = termios.tcgetattr(sys.stdin)


It seems excessive that we need to change the terminal settings for this

Saving and restoring the settings means you don't get runaway indentation when you write to the terminal again. I wrapped this in a try-finally block based on the Python docs: https://docs.python.org/3/library/termios.html#example.

In raw mode, sys.stdout.write("\n") does not return the cursor to the leftmost position, so we would have to do that manually, which I want to avoid. Restoring the saved attributes afterwards brings back the normal behaviour.

We end up with this, otherwise:

❯ python -c "import pybamm; print(pybamm.IDAKLUSolver())"
PyBaMM can collect usage data and send it to the PyBaMM team to help us improve the software. We do not collect any sensitive information such as models, parameters, or simulation results - only information on which parts of the code are being used and how frequently. This is entirely optional and does not impact the functionality of PyBaMM.
For more information, see https://docs.pybamm.org/en/latest/source/user_guide/index.html#telemetry
Do you want to enable telemetry? (Y/n): n
Telemetry disabled.
<pybamm.solvers.idaklu_solver.IDAKLUSolver object at 0x102de3ad0>
%

and this would break one's terminal until one restarts it – a more robust solution could be to use TCSAFLUSH instead of TCSADRAIN, I'll switch to that.
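The save-and-restore pattern discussed in this thread can be sketched as a small context manager. This is a minimal illustration following the termios example in the Python docs, not the code from the PR; raw_mode is a hypothetical name, and it assumes a POSIX system where the file descriptor refers to a terminal.

```python
import contextlib
import termios
import tty


@contextlib.contextmanager
def raw_mode(fd):
    """Put the terminal behind `fd` into raw mode and restore it on exit."""
    old_settings = termios.tcgetattr(fd)  # save the current attributes
    try:
        tty.setraw(fd)
        yield
    finally:
        # TCSAFLUSH applies the change after pending output is written and
        # also discards unread input queued while in raw mode.
        termios.tcsetattr(fd, termios.TCSAFLUSH, old_settings)
```

Usage would look like `with raw_mode(sys.stdin.fileno()): ...`; because the restore sits in the finally clause, the terminal is reset even if reading the input raises.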

kratman · 2024-11-13T22:57:16Z

src/pybamm/config.py

+        else:
+            print("Invalid input. Please enter 'Y/y' for yes or 'n/N' for no:")
+            user_input = get_input_or_timeout(timeout)
+            if user_input is None:
+                print("\nTimeout reached. Defaulting to not enabling telemetry.")
+                return False


I know I brought it up before, but if we just check for the yes options, we can skip having an infinite loop

If a person presses "Enter" in haste, that enables telemetry for them (because of your suggestion that I applied above), which might not be what they want – I'm trying to reduce the chance of that happening

So, should I keep it as "yes" if Enter is pressed or "no"? I feel the latter is the better option
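One way to make the two options concrete is the sketch below, which accepts only an explicit answer, re-prompts a bounded number of times, and treats a timeout as "no". It is an illustration of the idea being discussed, not the code in this PR: ask_for_telemetry, get_input, and max_attempts are hypothetical names, and get_input stands in for a call such as get_input_or_timeout(timeout) that returns a string or None.

```python
def ask_for_telemetry(get_input, max_attempts=3):
    """Return True only for an explicit yes; False for no, timeout, or giving up."""
    for _ in range(max_attempts):
        user_input = get_input()
        if user_input is None:  # timed out, so default to not enabling
            return False
        answer = user_input.strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        # Anything else, including a bare Enter, is invalid: ask again.
        print("Invalid input. Please enter 'Y/y' for yes or 'n/N' for no:")
    return False  # too many invalid answers, default to not enabling
```

Bounding the number of attempts avoids the infinite loop raised in the review, while a hasty Enter never opts the user in.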

agriyakhetarpal · 2024-11-14T10:52:08Z

Continuing from my question above: though I pushed e4859dc which fixes the failing test, it also changes the behaviour on pressing "Enter" to enable telemetry – which, to me, does not seem like good UX because it can cause an inadvertent "yes" from a user. It might be better to be explicit rather than implicit: either allow only a "yes" or a "no" in the response and ask them to enter their response again if it's empty, or let "Enter" disable telemetry rather than enabling it. I understand that it would be good for us to get a reasonable amount of telemetry data from as many users as we can, but we should also be conscientious with our approach. @valentinsulzer, do you have any thoughts? I might have missed this when reviewing #4441. That said, if prompts in other applications do regard pressing Enter/an empty input as a "yes", then this approach is completely fine with me.

kratman · 2024-11-19T03:27:49Z

Crashes should be resolved. I could not reproduce any of the notebook issues

agriyakhetarpal · 2024-11-19T04:08:01Z

I'd say that this PR could still be useful because it avoids using threading altogether for telemetry. Could you please re-review it? I was under the assumption that #4591 would go in after this PR went in.

kratman · 2024-11-20T02:50:00Z

@agriyakhetarpal I built the wheel with develop and did some testing on Windows.

A few things I noticed:

1. The tests in CI and wheel build all pass: Build wheel 24.11.0 kratman/PyBaMM#5
2. Some tests fail if I try to run the tests locally with the wheel, but the same tests fail when I repeat the process for the v24.9.0 release. The failing tests all appear to be from Jax
3. Forcing posthog to be active during tests does show posthog in some of the tracebacks, but they appear to be some of the same tests that I saw failing in (2).
4. When I forced posthog to run, I also disabled the pybamm.config.generate() function, so any remaining problems are likely in the internals of posthog, not from the use of threading while generating the config.
5. The timeout for creating the config is likely too short. With the slow import of a fresh PyBaMM install, it is not hard to miss the window for a 10 second opt-in.

The fact that tests are failing locally on Windows bothers me, but it looks like some of this was already present and not caught by all of our tests. It is also possible that the ~45 failing Jax tests are due to issues with my Windows setup since the tests pass in CI. I won't know for sure until I do additional testing for the release.

I will continue investigating the failures on Windows, but I don't think removing the threading or adding extra logic to text output is needed at the moment. The changes to the installation guide, noxfile.py, and the pyproject.toml look good.

Next steps:

Try to trigger crashes in notebooks and scripts instead of tests. This will rule out any threading issues related to pytest
Compare my Windows dev environment to the CI dev environment to see if the failing Jax tests are from a solvable issue.

agriyakhetarpal · 2024-11-20T03:50:57Z

Just a quick comment for now; I'll get back to the rest in a while: @kratman, if you can't reproduce the Windows crashes in CI, could you also check how old the machine that you are using locally is? Some JAX operations might crash or simply not work on very old machines because of the lack of AVX512 instructions, so maybe this is indeed just something that is for your setup and should be fine for slightly newer Windows computers.

kratman · 2024-11-20T14:36:34Z

Some JAX operations might crash or simply not work on very old machines because of the lack of AVX512 instructions, so maybe this is indeed just something that is for your setup and should be fine for slightly newer Windows computers.

This is not a particularly old machine. It is an i5-1235U (2022), part of the Alder Lake family of chips, which do not have AVX-512; it does have AVX and AVX2, though.

kratman · 2024-12-26T21:06:34Z

Can we close this now? The threading issues have not reappeared for telemetry

agriyakhetarpal · 2024-12-28T09:32:24Z

Thanks, sure, we can close. I'll keep my branch in case we ever need it.

agriyakhetarpal added 8 commits November 13, 2024 04:45

Make is_notebook a utility function

8709969

Fix import order

6fccf68

Add cross-platform input/timeout retrieval

9348522

Mark missing dependency on platformdirs

c89ee9c

Fix pytest invocation

b675502

Fix coverage and return value for prompt

e8680df

Fix tests and parametrize them

af0de0f

Append to CHANGELOG entry

925cbd6

agriyakhetarpal requested review from martinjrobins, Saransh-cpp, kratman, arjxn-py and a team as code owners November 13, 2024 00:32

agriyakhetarpal marked this pull request as draft November 13, 2024 00:32

style: pre-commit fixes

d8d37ed

kratman reviewed Nov 13, 2024

agriyakhetarpal added 4 commits November 13, 2024 19:46

Something simpler for Jupyter notebooks

fbb623e

Remove duplicate outputs from print statements

0826afe

Merge branch 'develop' into fix/solver-crash-telemetry

1da3d91

Simplify command for test invocation

4f8b48c

agriyakhetarpal changed the title ~~Fix IDAKLU solver crash when running in multiple threads + telemetry is prompted for~~ Fix IDAKLU solver crash when running in multiple threads + when telemetry is prompted for Nov 13, 2024

agriyakhetarpal marked this pull request as ready for review November 13, 2024 14:49

agriyakhetarpal requested review from valentinsulzer and kratman November 13, 2024 14:49

kratman reviewed Nov 13, 2024

agriyakhetarpal added 3 commits November 13, 2024 21:59

Return either input as a string, or None

91379bd

Keep prompting on invalid inputs

de60647

Fix tests for invalid inputs

083184b

kratman requested changes Nov 13, 2024

Code review suggestions

6f077e3

Co-authored-by: Eric G. Kratz <[email protected]>

agriyakhetarpal commented Nov 13, 2024

agriyakhetarpal added 3 commits November 14, 2024 05:22

Remove duplicate checks

7433555

Discard previous outputs as well

57553df

Fix failing test

e4859dc

Merge branch 'develop' into fix/solver-crash-telemetry

5495814

agriyakhetarpal linked an issue Nov 15, 2024 that may be closed by this pull request

Fortnightly build for wheels failed #4588

Closed

agriyakhetarpal closed this Dec 28, 2024
