-
-
Notifications
You must be signed in to change notification settings - Fork 764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Threadsafe configuration can cause deadlocks #2064
Comments
Thanks for the clear bug report -- having it bisected down to one commit is a huge help! @JoshCheek -- this report makes it sound like there may be a deadlock condition in the thread safety stuff you added. Are you able to take a look at this? |
Checking it out now. I have it reproduced locally. Difficult to tell what's going on, though, b/c Celluloid is doing some sort of metaprogramming that causes objects to not necessarily be what you think they are, and methods to disappear depending on where you're looking from. I'll spend another hour or two, but really hope I don't have to read all of the Celluloid source code. ...I should probably figure out what this gem actually does :P |
CCing @tarcieri, b/c it's not improbable that more people are going to hit a similar thing and report an issue like this. SummaryThe main thread acquires the mutex lock, but the code it is running instantiates a Celluloid actor (think that's the right word) which puts it to sleep and then winds up trying to acquire the same lock. High level explanation(note: low level explanation is down below)
Kinda shitty, but easy solutions
Detailed description of the problemFrom the main thread, we enter the spec here, which calls it "does not start with Ruby object notation" do
expect(subject.to_s).to_not start_with("#<")
end We enter the definition of define_method(name) { __memoized.fetch_or_store(name) { super(&nil) } } We enter def fetch_or_store(key)
@memoized.fetch(key) do # only first access pays for synchronization
@mutex.synchronize do
@memoized.fetch(key) { @memoized[key] = yield }
end
end
end This takes us back to the block above, which calls subject do
expect(Octokit::Client).to receive(:new) { connection }
described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end This takes it into def new(*args, &block)
proxy = Cell.new(allocate, behavior_options, actor_options).proxy
proxy._send_(:initialize, *args, &block)
proxy
end It delegates def response
Celluloid.suspend(:callwait, self)
end
def value
response.value
end I think a pool eventually pulls it out here or maybe here (I had followed it for a while, but wound up getting lost again, and don't feel like going through it again). At this point, it is in one of thread = Thread.new do
while proc = queue.pop
proc.call
# ...
end
end This eventually winds up calling into def initialize(options = {})
@connection = Octokit::Client.new(access_token: options[:access_token], auto_paginate: true,
api_endpoint: options[:api_endpoint], web_endpoint: options[:web_endpoint],
connection_options: {ssl: {verify: options[:ssl_verify].nil? ? true : options[:ssl_verify]}})
# ...
end That goes through some layers of RSpec::Mocks, until it eventually finds its way back to the block defined in the spec, which calls subject do
expect(Octokit::Client).to receive(:new) { connection }
described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end The call to connection goes through the same path as the call to def fetch_or_store(key)
@memoized.fetch(key) do # only first access pays for synchronization
@mutex.synchronize do
@memoized.fetch(key) { @memoized[key] = yield }
end
end
end Sidestep the problem by changing designFrankly, this code is much better than the vast majority of code that I see. But I get quite excited about testing and design, so I'll share the things I thought about as I considered how you could solve this issue. There are certain design issues that will make tests painful. In some regard, there will always be pain (if you want to do anything interesting), but the trick is to segregate the painful thing from the rest of the code. In this case, the pain comes from... well, a lot of things, TBH, but within code you control, it comes from having to mock out singleton methods on global objects. There are any number of ways to get out of having to do this, but the one that seems apparent to me right now is passing the connection in rather than taking all of its arguments and constructing it via the constant: This is better because it absolves the code from having to know contextual details that it can't always know. It lets the user to choose how it is instantiated, so if they had some need to give it more or different options, then they can easily do so. It also allows them to choose what your code will collaborate with... eg, to introduce a mock connection in a test, or talk to GH using some lib other than Now, people generally have this opinion that the lib should have some slim interface, and thus dislike that a user might have to configure Octokit. I generally disagree, and would rather have access to the wiring than an initialization method that does all the work for me (I find such code usually has a high cost and low value when I write it, and usually prevents me from accomplishing my goal when I other people's code that does this as it makes incorrect assumptions about my context). But sometimes there is value in convenience like this. Fortunately, it's very easy to provide both, we just move that convenience code into a different constructor: Interestingly, this one change fixes the issue, as well, because the The core insight to all of this is that when there is unit test pain, it should be looked at as an indication that there are issues with the design. And the solution is almost always the same: take the cause of the pain and push up the callstack. This was one of probably 3 big realizations that took me forever to discover, but ultimately made testing easy, valuable, and compelling. Once we push this pain up the callstack, it becomes the test's responsibility to provide the connection, which makes it trivial to mock out without needing to stub methods on singleton objects. Integration codeAnother thing to think about is that this is a wrapper around a 3rd party lib ( We have very low confidence that it actually works, because we don't control My general experience is that code this close to the boundaries is best tested at the integration level. There are simply too many reasons for it to change, and we control very few of them. We could introduce more abstraction, but there doesn't appear to be much value at this point: the problem code is already pulled out of the rest of the system and into this worker. The code in the worker deals almost entirely with wiring, not with business logic. So, probably something like I probably wouldn't put in the effort to do this right now, given that you can also get it up to date by just adding a |
Wow, amazing, sleuthing @JoshCheek! I think a simpler fix is to change: expect(Octokit::Client).to receive(:new) { connection } ...to: expect(Octokit::Client).to receive(:new).and_return(connection) This should cause Is there something we can do in RSpec to either work around this issue or detect it and warn the user or something? |
Aye, good call.
Maybe. I doubt we could tell them what the right thing to do is, but it might be possible to detect the recursive deadlock and give them enough information to resolve it themselves. If the thread is going to ask for the lock from the ReentrantMutex, and the lock is held by one of the thread's ancestors, then we can at least know the deadlock will occur, and possibly give them some useful info about the context. It's then a question of whether we can know a thread's ancestor threads (I don't see any way to get this info), how expensive it is to track / query, and whether we can get enough information to be useful (ie get the callstacks of the relevant threads). Interestingly, a deadlock situation like this would normally raise |
Is this a ticket that should still be open? |
I don't believe we've solved it yet. |
I guess it depends on your definition of solved. I think we can put some amount of work in to inform a user when they hit this situation, but IDK if it is worth it, unless lots of users wind up hitting this. While I have a desire to get huffy and think "write better code", that doesn't really help anyone or negate the fact that it's systemic. Alternatively, maybe it's not systemic, and showing them how is fine. Anyway, if it pops up a lot, then probably trying to automate the conveyance of useful debugging information is worthwhile. If not, then pointers here and there are probably fine. Ping me next time it happens, I'll evaluate how effortful it would be to automate the deadlock detection (though I have a general opinion that communicating nuanced info like this is very difficult). |
@tarcieri @JoshCheek I think I am running into a similar problem here: square/sqeduler@12babd9. Threads deadlocking if I don't eagerly load the inputs which are provided via |
Alternate fix to deadlock issue rspec/rspec-core#2064 (comment)
Yeah, # Here is a minimal reproduction
# note that `subject` here is a red herring, it is really the same as a let-block
RSpec.describe do
subject { Thread.new { b }.join }
let(:b) { }
example { subject }
end In your example, only the |
Thanks @JoshCheek! This makes a ton of sense after looking over the code last night. Thanks for explanation. |
Alternate fix to deadlock issue rspec/rspec-core#2064 (comment)
Alternate fix to deadlock issue rspec/rspec-core#2064 (comment)
So I'm thinking:
I don't want to leave this ticket open as is - it's not really clear what we'd fix inside RSpec. Thoughts? |
I'm on board. Ideally #2340 would have worked, providing the user with some feedback about why they're hitting the issue. However, it's definitely a complex domain, so I'm unsure if it's possible ("complex" here means that small changes in my understanding lead to wildly different conclusions). |
Oh, I think @myronmarston added the config, not me. Guess I should clarify whether we're talking about the configuration option or threadsafety as a feature? |
I requested it, but you actually added it in #1858. At the time, the point of the config was to allow users to opt-out of threadsafe
My read of @xaviershay's comment is that he's suggesting we deprecate threadsafe The config is another matter; if we plan to remove the threadsafe Anyhow, I too appreciate the great work @JoshCheek did on this but agree with @xaviershay's conclusion. |
I opened an issue for this discussion so that we don't wind hijacking this issue. |
Note that this problem is much more far-reaching with Ruby 3, because Ruby 3 changed locking semantics to be per-fiber instead of per-thread with this commit: ruby/ruby@3a3b19b This means that if # frozen_string_literal: true
RSpec.describe 'Ruby 3 dead locks' do
describe 'with fibers' do
let(:value) { 42 }
subject do
# At this point, the test thread (and fiber) hold the lock to memoize subject.
Fiber.new do
value # since the inner Fiber is now `current`, it will try to memoize the let block on the same lock
end.resume
end
it 'locks up' do
subject
end
end
end Running this with Ruby 2 works: $ rbenv shell 2.7.5
$ rspec ruby3_fiber_deadlock.rb
.
Finished in 0.00238 seconds (files took 0.06749 seconds to load)
1 example, 0 failures With Ruby 3.0.4+, it fails: $ rbenv shell 3.0.4
$ rspec ruby3_fiber_deadlock.rb
F
Failures:
1) Ruby 3 dead locks with fibers locks up
Failure/Error: value
ThreadError:
Attempt to unlock a mutex which is locked by another thread/fiber
# ./ruby3_fiber_deadlock.rb:10:in `block (4 levels) in <top (required)>'
# ------------------
# --- Caused by: ---
# fatal:
# No live threads left. Deadlock?
# 1 threads, 1 sleeps current:0x00000000011fb080 main thread:0x00000000011fb080
# * #<Thread:0x0000000001266ff0 sleep_forever>
# rb_thread_t:0x00000000011fb080 native:0x00007f2a62680740 int:0
#
# ./ruby3_fiber_deadlock.rb:10:in `block (4 levels) in <top (required)>'
Finished in 0.10269 seconds (files took 0.05068 seconds to load)
1 example, 1 failure
Failed examples:
rspec ./ruby3_fiber_deadlock.rb:14 # Ruby 3 dead locks with fibers locks up From a Ruby perspective, this is intended behavior; if a co-routine reads a value from memory and then suspends, then even on the same thread without locking primitives another co-routine may resume and mutate that value, invalidating assumptions the first co-routine may have about that value. So even in the absence of multi-threading this can be an issue. I am not sure what RSpec should do about that; I agree that the |
@xaviershay Aside from the consideration to deprecate |
You just surprised me twice, @mttkay, and both are valid statements ([1], [2]. The upcoming RSpec 4.0 is a perfect time to change the default. However, without rushing with this decision, what are the implications of making this option opt-in instead of opt-out by default? In theory, an accessor defined by a non-threadsafe Another thing that doesn't seem related is parallel testing, as most of the current and WIP tools resort to sub-processes, not threads, for obvious reasons of mocking. |
Yep, agree!
This example, and discussion, is just another instance of the old pro-GIL and contra-GIL argument. MRI takes an opinionated approach and says that you should free developers from thinking about these problems and not require a lock in this case. Others disagree and say that providing mutual exclusion under presence of concurrency is not the VM's job, but the developer's job (or: a library's job -- there are many ways to ensure thread-safety that do not require you to use locking primitives directly. Examples are atomic CAS, actors, reactive streams, async/await etc.) So, changing the default for Another way to look at this is to apply the pareto principle: which use case is rspec trying to optimize for? The 0.1% or so who write multi-threaded test implementations, or the 99.9% who don't? Locking also creates overhead; for very large test suites like the one GitLab runs, there will be hundreds of thousands of locking operations due to synchronized let and subject blocks. I am still looking into how
Hm, is this even affected? My understanding of parallel testing is that a test suite is subdivided and distributed across worker threads, but let and subject memoization only applies at the per-test level. So you can still distribute test methods across multiple threads and CPUs without running into integrity issues; or are there cases where rspec memoizes variables that will in fact be read across multiple test methods? Happy to be corrected here since I am not familiar with rspec internals, I just stumbled into this by working on Ruby 3 compatibility. |
We are not changing the default to false, we actually should remove the false option entirely in RSpec 4 per the original plan for its implementation, it was introduced originally as opt in and if my memory is correct because default in RSpec 3. Without threadsafe mode what happens is you can end up with different values memoized in different threads (and thus presumedly fibers) depending on race conditions. We should fix the implementation to better detect or prevent deadlocks. |
Do you have an example of how that could happen when multiple fibers but only a single thread are involved? If that is true, then I agree it may be safer to maintain the threadsafe toggle. Otherwise, you're essentially optimizing for the minority of users and use cases that use multi-threaded test implementations and/or Ruby VMs that do not use a GIL. |
@JonRowe Here is an executable test case that should demonstrate the sample code posted in the original addition of Updated the example to interleave each count to increment, which goes to show that even when switching between fibers for each call to increment, there are no integrity problems in the absence of synchronized let blocks: # frozen_string_literal: true
require 'fiber'
RSpec.configure do |config|
config.threadsafe = false
end
class Counter
def initialize
@counter = 0
end
def increment
puts "#{Fiber.current}: incr"
@counter += 1
end
def count
@counter
end
end
RSpec.describe 'Consistent reads' do
describe 'with fibers' do
let(:counter) { Counter.new }
it 'increments counter consistently' do
fibers = 10.times.map do
Fiber.new do
1000.times do
counter.increment
Fiber.yield
end
end
end
while fibers.any?(&:alive?) do
fibers.each(&:resume)
end
expect(counter.count).to eq 10_000
end
end
end This test passes for me in both Ruby 2 and Ruby 3 (MRI). This is because all fibers are serialized in the current test thread and execute sequentially. There are no race conditions. The problem described in the original MR is, as far as I understand, specific to multi-threaded test code in Rubies that do not use a GIL. In other words, deciding to enable |
Side-note: this may of course change if someone uses a different |
Hi.
I working on debian package for berkshelf-api, in debian we have rspec-* 3.3.0 and I have problem to run testsuite because it hangs (on this line: https://github.com/berkshelf/berkshelf-api/blob/master/spec/unit/berkshelf/api/cache_builder/worker/github_spec.rb#L51, https://github.com/berkshelf/berkshelf-api/blob/master/spec/support/human_reaable.rb#L3), While it works perfectly with 2.99.0.
I started to bisect and found that hangs are caused by rspec-core commit ffe00a1.
The text was updated successfully, but these errors were encountered: