p2p functional test failure #9755
A quick question before I dig deeper: have you seen this on the …
It did not trigger on …
I haven't gotten this to trigger locally, so I have no daemon logs to investigate, and I don't see a way to download the daemon logs from GitHub. My best guess is that this is a timing bug in the test itself: the sleep timeout is 5 seconds, but the fluff (incoming) timer for transactions is typically going to be in the 3-7.25 second range. The counter-argument is that it seems like the test should fail more frequently if this were the culprit. I don't have any other theories as to the cause at the moment. #9459 definitely seems unrelated, unless there was a p2p connection loss + reconnect just before …
Thank you for looking into it. If we can't reproduce it locally and it doesn't show up again on CI, I'm not sure there's anything else we can do.
I think the test can be improved quite a bit to avoid such spurious failures.
See #9762.
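For illustration only (this sketch is not taken from #9762): one way to make such a check event-driven rather than sleep-based is to poll the receiving daemon's transaction pool until the transaction arrives, with a deadline that comfortably exceeds the fluff window. The `daemon` wrapper, its `get_transaction_pool()` method, and the `id_hash` field are assumed names, not necessarily what the functional test framework actually exposes.

```python
import time

def wait_for_tx_in_pool(daemon, txid, timeout=30.0, poll_interval=0.25):
    """Poll until `txid` shows up in `daemon`'s tx pool or `timeout` expires.

    `daemon.get_transaction_pool()` and the 'id_hash' field are assumed names;
    adapt them to whatever the RPC wrapper in the functional tests exposes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pool = daemon.get_transaction_pool()
        txids = [tx.get('id_hash') for tx in pool.get('transactions', [])]
        if txid in txids:
            return True
        time.sleep(poll_interval)
    return False
```

Compared to a fixed 5-second sleep, the deadline only bounds the worst case; the common case returns as soon as the fluff timer fires.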
I did a bit of archaeology: roughly three weeks ago, the transfer functional tests started failing (the first PR I could find failing the test suite is #9491). Here are two sample runs with failures: … A week after that, the tx propagation test starts failing (first two failed runs I can find): … I also note that around Jan 1, this test (unit_tests.node_server.race_condition) started to break: … @tobtoht Do you know if the transfer test failure had to do with a broken commit, which was subsequently patched, or if it was just another badly written test? I ask because the observation that multiple timing-sensitive tests, which previously passed, started failing around the same time leads me to think that perhaps GitHub simply changed the runner environment (or the various build changes impacted it, or both) enough to expose the flaws in these tests, which had been relying on coincidence/timing in a specific environment.
I don't doubt it (it was added in this PR from 2021: https://github.com/monero-project/monero/pull/7873/files), but the logs on GitHub Actions only go back about three months, so I can't say whether the p2p test (added in 2020, if I recall) has been any better or worse behaved than this unit test.
This would be a prototypical candidate for …
Findings so far: The transaction propagation piece by itself works fine; it's the combination with …
@vtnerd @iamamyth I investigated the issue by running it in a loop on GitHub Actions. #9459 introduced it and #9762 fixes it. Now the question is: why does a low-level network code change cause this behaviour? Master branch, failed: https://github.com/selsta/monero/actions/runs/13269225338/job/37044415465
Your conclusions match my own. 7e766e1 (aka #9459) introduced the problem. Taking into account the finding that …
So the plausible theory is that the incorrect throttling code caused daemon operations to slow down, which in turn meant that the test was unintentionally waiting out the maximum delay. After the throttling code got fixed, there was a chance that the random delay was longer than the daemon operations took, which caused the sporadic test failure. I'm still not sure why it happened only on CI and not locally; maybe simply because CI runs on slower hardware.
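As a rough sanity check on this theory, here is a small sketch (my own, not part of the test suite) that treats the fluff delay as uniform over the 3-7.25 second range quoted earlier in the thread and models the accidental throttling as extra slack on the test's side. The distribution and the numbers are only illustrative; the daemon's real fluff delay distribution and the test's actual check differ.

```python
import random

def estimated_miss_rate(sleep_s=5.0, fluff_lo=3.0, fluff_hi=7.25,
                        extra_slack_s=0.0, trials=100_000):
    """Estimate how often a fixed sleep expires before the fluff timer fires.

    The uniform distribution and the 3-7.25 s bounds are crude assumptions;
    `extra_slack_s` stands in for incidental delay on the test/RPC side
    (e.g. the unintended throttling before it was fixed).
    """
    misses = 0
    for _ in range(trials):
        fluff_delay = random.uniform(fluff_lo, fluff_hi)
        if fluff_delay > sleep_s + extra_slack_s:
            misses += 1
    return misses / trials

# Under these (crude) assumptions the miss rate drops sharply once a couple of
# seconds of incidental slack are added, which is consistent with the failures
# only appearing after the unintended throttling was removed.
print(estimated_miss_rate())                   # no slack
print(estimated_miss_rate(extra_slack_s=2.5))  # with accidental slack
```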
As for #9762, I think the patch still makes sense, but once the cause of the slowdown has been established I'll revisit the appropriate timeout: a stricter value may help catch rate-limiting bugs, but if it's too strict it becomes a source of test noise.
I cannot narrow down the hypothesis any more than I already have, but I will just note that these tests, structurally, generate a ton of interference. For example, the end of the reorg test waits 10 s in each polling loop iteration to confirm the two daemons have synchronized, so a slight timing change has potentially very large consequences with respect to rate limiting, because a +10 s wait of course impacts the rate-limiting state in the next test. And the entire "functional test" suite shares one daemon and has pathological behaviors like "connection per request" and "sleep for some preset amount and hope that's enough". In a sense, it does an OK job of surfacing bugs because it's a networking nightmare, but it gives very little direction about the nature of the bugs (e.g. test suite 2 could be failing because of something its predecessor did, so good luck finding the cause).
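A bounded polling helper along these lines (again only a sketch, not code from the suite) would replace both the "sleep a preset amount and hope" pattern and the fixed 10-second per-iteration waits, so a slow test leaks far less timing state into the next one. The `get_info()['top_block_hash']` usage in the example is an assumption about the RPC wrapper, not its confirmed API.

```python
import time

def wait_until(predicate, timeout=60.0, poll_interval=0.5, desc="condition"):
    """Poll `predicate` until it returns a truthy value or `timeout` elapses.

    Short poll intervals keep the total wait close to the time the event
    actually takes, instead of always paying a full fixed sleep.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"timed out after {timeout}s waiting for {desc}")
        time.sleep(poll_interval)

# Hypothetical use for the reorg check mentioned above (field names assumed):
# wait_until(lambda: daemon_a.get_info()['top_block_hash'] ==
#                    daemon_b.get_info()['top_block_hash'],
#            timeout=120, desc="both daemons to converge on the same tip")
```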
I agree, the code prior to #9459 unintentionally rate-limited RPC. This could've caused a slight slowdown in processing times that made the tests work. There isn't really an explanation otherwise, unless UB was triggered before and/or after the PR.
https://github.com/monero-project/monero/actions/runs/12969219043/job/36173269466?pr=9723#step:11:1380
Before we can put out v0.18.4.0 we have to investigate this to make sure it is not a regression caused by recent merges. The test failure does not show up consistently.