Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test with macOS platform #287

Open
hsbt opened this issue May 24, 2024 · 5 comments
Open

Flaky test with macOS platform #287

hsbt opened this issue May 24, 2024 · 5 comments

Comments

@hsbt
Copy link
Member

hsbt commented May 24, 2024

I sometimes got the following failure with ruby/ruby and make test-bundled-gems.

Testing the net-imap gem
/Users/hsbt/Documents/github.com/ruby/ruby/exe/ruby -C /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap /Users/hsbt/Documents/github.com/ruby/ruby/.bundle/bin/rake test
Loaded suite /Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/rake-13.2.1/lib/rake/rake_test_loader
Started
/Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/test-unit-3.6.2/lib/test/unit/fixture.rb:284: warning: the block passed to 'priority_setup' defined at /Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/test-unit-3.6.2/lib/test/unit/priority.rb:183 may be ignored
F
====================================================================================================================================================================================
Failure: test_imaps_with_ca_file(IMAPTest):
  Exception raised:
  <#<OpenSSL::SSL::SSLError: SSL_read: tlsv1 alert unknown ca (SSL alert number 48)>>
  Backtrace:
    /Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::SSL::SSLSocket#sysread'
    /Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::Buffering#fill_rbuff'
    /Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:236:in 'OpenSSL::Buffering#gets'
    /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2720:in 'Net::IMAP#get_response'
    /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2629:in 'Net::IMAP#receive_responses'
    /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2606:in 'block in Net::IMAP#start_receiver_thread'.
/Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::SSL::SSLSocket#sysread'
/Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:80:in 'OpenSSL::Buffering#fill_rbuff'
/Users/hsbt/Documents/github.com/ruby/ruby/.ext/common/openssl/buffering.rb:236:in 'OpenSSL::Buffering#gets'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2720:in 'Net::IMAP#get_response'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2629:in 'Net::IMAP#receive_responses'
/Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/lib/net/imap.rb:2606:in 'block in Net::IMAP#start_receiver_thread'
====================================================================================================================================================================================
Finished in 8.730846 seconds.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
818 tests, 8689 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
99.8778% passed
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
93.69 tests/s, 995.21 assertions/s
rake aborted!
Command failed with status (1)
/Users/hsbt/Documents/github.com/ruby/ruby/.bundle/gems/rake-13.2.1/exe/rake:27:in '<top (required)>'
Tasks: TOP => test
(See full trace by running task with --trace)
Tests failed with exit code 1

My environment is macOS 14.5 with Apple Silicon and using OpenSSL 3 library provided from homebrew.

@nevans
Copy link
Collaborator

nevans commented May 24, 2024

@hsbt That's strange... is it only that one test? How easy is it to reproduce? Is it failing ~5% of the time or 95% or? I haven't noticed it failing in the CI workflow. I'm only running on Linux, and haven't seen it myself. I wonder if it's unique to Apple Silicon or a specific version of OpenSSL or something else that's different from macos-latest in GH actions.

I can see one thing that's unique about that test. For all of the tests that create an SSL connection:

This test is one of the four that uses imaps_test. And of those four, this is one of only two that expects a successful connection. And this is the only one that verifies the server's cert. So, if this is the only test that's flaky, the problem is probably in imaps_test.

At first, I couldn't see anything significant between how the server is configured for imaps_test vs any of the others. We're using the exact same SSL fixture files (ca_path, key, and cert) in all three places, and we seem to be creating the SSLContext identically. But, after looking more closely, I see that imaps_test is using OpenSSL::SSL::SSLServer.new(tcp_server, ssl_ctx) and the other two are using OpenSSL::SSL::SSLSocket.new(tcp_socket, ssl_ctx).

Looking at SSLServer, it seems like there's no difference between what it's doing and what FakeServer is doing with SSLSocket. The only difference I can discern is that SSLServer#initialize contains the following:

        unless ctx.session_id_context
          # see #6137 - session id may not exceed 32 bytes
          prng = ::Random.new($0.hash)
          session_id = prng.bytes(16).unpack1('H*')
          @ctx.session_id_context = session_id
        end

I have no idea why session_id_context would matter for this (it probably doesn't), but it's the only difference I could see.

We could experiment with simply converting imaps_test to use SSLSocket directly, or even convert those four tests over to run_fake_server_in_thread. But I'm still baffled and curious about why this would make a difference.

Should we ask about this over at https://github.com/ruby/openssl/issues?

@rhenium
Copy link
Member

rhenium commented May 26, 2024

I ran make test-bundled-gems on ruby/ruby ~10 times with no luck. I'm also running on Linux.

<#<OpenSSL::SSL::SSLError: SSL_read: tlsv1 alert unknown ca (SSL alert number 48)>>

This OpenSSL::SSL::SSLError on the client side seems weird. The error message says that it received the "unknown_ca" alert. It is usually sent when the server doesn't trust the client certificate, but I don't see any test cases in net-imap that use a client certificate.


session_id_context

I believe it's irrelevant. The seesion ID context is only used for TLS session resumption, which net-imap doesn't use.

@hsbt
Copy link
Member Author

hsbt commented May 27, 2024

That's strange... is it only that one test? How easy is it to reproduce? Is it failing ~5% of the time or 95% or?

I see this fail once every day.

I haven't noticed it failing in the CI workflow.

Yes, I also didn't see this fail our GitHub Actions. It may be caused by my development environment.

@hsbt
Copy link
Member Author

hsbt commented Sep 10, 2024

I got the following failure today.

Failure: test: Convert deprecated usessl (= true) and verify (= false), with warning(DeprecatedClientOptionsTest::InitializeTests):
  exceptions on 1 threads:
  #<Thread:0x00000001450976a8 /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server/test_helper.rb:10 dead>:
  /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:40:in 'Timeout::Error.handle_timeout': execution expired (Timeout::Error)
        from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:192:in 'Timeout.timeout'
        from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:73:in 'Net::IMAP::FakeServer#run'
        from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server/test_helper.rb:11:in 'block (2 levels) in Net::IMAP::FakeServer::TestHelper#run_fake_server_in_thread'
  /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:74:in 'TCPServer#accept': execution expired (Timeout::ExitException)
        from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:74:in 'block in Net::IMAP::FakeServer#run'
        from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:183:in 'block in Timeout.timeout'
        from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:38:in 'Timeout::Error.handle_timeout'
        from /Users/hsbt/Documents/github.com/ruby/ruby/lib/timeout.rb:192:in 'Timeout.timeout'
        from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server.rb:73:in 'Net::IMAP::FakeServer#run'
        from /Users/hsbt/Documents/github.com/ruby/ruby/gems/src/net-imap/test/net/imap/fake_server/test_helper.rb:11:in 'block (2 levels) in Net::IMAP::FakeServer::TestHelper#run_fake_server_in_thread'

@nevans
Copy link
Collaborator

nevans commented Sep 11, 2024

@hsbt 😦 This is still very confusing, and I still don't know if I've ever seen it. Is this still flaky, or does it happen every time?

The failure you just posted is a different error on the surface. Also, this is the server thread's timeout, and it would be very useful to know where the client thread failed, too. The server's timeout is probably caused by whatever happened in the client thread. On the other hand, run_fake_server_in_thread calls server&.shutdown from a different thread, and there's no mutex inside FakeServer#shutdown, so maybe we're triggering a race condition there?

But it seems reasonable to guess that it has the same root cause. And both failures you've posted are in tests that use SSL connections. And, unlike the first failure you reported, this time it's coming from run_fake_server_in_thread. So, assuming its the same root cause, we can rule out any strange differences between the different SSL server socket code.

It seems like it must somehow be related to your local openssl library configuration or version, but I'm not sure how to diagnose that. To start, maybe you can post your versions?

%i[VERSION OPENSSL_LIBRARY_VERSION OPENSSL_VERSION].to_h { [_1, OpenSSL.const_get(_1)] }

I assume that you have not had this sort of issue with the openssl gem's tests. I'd guess that they are more careful about constraining the OpenSSL environment and config. So perhaps I'll take a look at those and maybe we can update net-imap's tests to behave similarly? But if there's some issue there, e.g in how we've set up our test CA and cert or , I don't know why it would make flaky tests rather than just always failing.

nevans added a commit that referenced this issue Sep 13, 2024
While trying to track down the cause of #287 for @hsbt, I noticed that
this is likely a race condition.  While this can't be the cause of that
issue (the failures come from tests that don't even use FakeServer), it
seems like a good idea to fix anyway, and safe.
nevans added a commit that referenced this issue Sep 13, 2024
While trying to track down the cause of #287 for @hsbt, I noticed that
this is likely a race condition.  While this can't be the cause of that
issue (the failures come from tests that don't even use FakeServer), it
seems like a good idea to fix anyway, and safe.
nevans added a commit that referenced this issue Oct 13, 2024
While trying to track down the cause of #287 for @hsbt, I noticed that
this is likely a race condition.  While this can't be the cause of that
issue (the failures come from tests that don't even use FakeServer), it
seems like a good idea to fix anyway, and safe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants