Created: 4 years, 2 months ago by prashant.n
Modified: 4 years, 1 month ago
CC: chromium-reviews
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public
Description

base: Take care of unexpected wakeups in condition_variable_win.cc.
As per the MSDN documentation (https://goo.gl/tPHPkY), condition variables
are subject to spurious wakeups (those not associated with an explicit
wake) and stolen wakeups (another thread manages to run before the
woken thread). These unexpected wakeups can cause bizarre behaviour in
code that uses condition variables. This patch adds a function that
checks whether a wakeup is a real timeout; if it is not, it retries up to
a defined number of times, after which it crashes.
BUG=640577, 617010
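The patch diff itself is not reproduced in this review thread. A minimal sketch of the retry behavior the description implies, with illustrative names (SleepSRWChecked and kMaxTries are not from the patch), might look like:

#include <windows.h>
#include <cstdlib>

// Sleeps on |cv| while holding |lock| exclusively. Returns true if woken,
// false on a real timeout. Any other failure is retried a bounded number of
// times, then treated as fatal (the "after which it crashes" behavior).
bool SleepSRWChecked(CONDITION_VARIABLE* cv, SRWLOCK* lock, DWORD timeout_ms) {
  const int kMaxTries = 10;  // Illustrative limit; debated later in the review.
  for (int tries = 0; tries < kMaxTries; ++tries) {
    if (SleepConditionVariableSRW(cv, lock, timeout_ms, 0))
      return true;  // Woken, possibly spuriously; the caller rechecks its state.
    if (GetLastError() == ERROR_TIMEOUT)
      return false;  // A real timeout.
    // Unexpected error: fall through and retry the sleep.
  }
  abort();  // Crash after too many unexpected failures.
}

(Note that retrying with the same timeout_ms stretches the total wait; the discussion below considers computing the remaining timeout instead.)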
Patch Set 1
Patch Set 2: build break
Patch Set 3: nits
Patch Set 4: build break
Total comments: 3
Messages
Total messages: 45 (11 generated)
The CQ bit was checked by prashant.n@samsung.com to run a CQ dry run
Dry run: CQ is trying the patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
prashant.n@samsung.com changed reviewers: + danakj@chromium.org
For bug "Chrome: Crash Report - [Renderer hang] cc::TaskGraphWorkQueue::HasReadyToRunTasksInNamespace" suspect spurious wakeups of SleepConditionVariableSRW() as per call stack seen in crash. So adding this patch. ptal.
For bug "Chrome: Crash Report - [Renderer hang] cc::TaskGraphWorkQueue::HasReadyToRunTasksInNamespace" suspect spurious wakeups of SleepConditionVariableSRW() as per call stack seen in crash. So adding this patch. ptal.
Description was changed: "after which it crashes" was appended to the sentence describing the retry limit.
The CQ bit was checked by prashant.n@samsung.com to run a CQ dry run
Dry run: CQ is trying the patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
danakj@chromium.org changed reviewers: + robliao@chromium.org
+robliao
On 2016/09/29 19:22:33, danakj wrote:
> +robliao

Two questions:
1) Like most condition variables, the caller has to recheck the condition in any case before proceeding. If there is a spurious wakeup, the caller should recheck and go back to sleep. What is the goal of this retry logic?
2) What's the role of http://crbug.com/617010 for this bug?
On 2016/09/29 19:25:38, robliao wrote:
> Two questions:
> 1) Like most condition variables, the caller has to recheck the condition in
> any case before proceeding. If there is a spurious wakeup, the caller should
> recheck and go back to sleep. What is the goal of this retry logic?

As per the MSDN documentation (https://goo.gl/tPHPkY and https://msdn.microsoft.com/en-us/library/windows/desktop/ms682052(v=vs.85).aspx), we should continue to wait in a while loop until the predicate returns true. Here I loop until the function either succeeds or returns the expected error. Instead of an infinite loop, I put in a predefined number of tries, after which we crash.

> 2) What's the role of http://crbug.com/617010 for this bug?

That bug also had a hang, with a similar call stack to where this hang crash was observed.
On 2016/09/30 00:35:18, prashant.n wrote:
> As per the MSDN documentation (https://goo.gl/tPHPkY and
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms682052(v=vs.85).aspx),
> we should continue to wait in a while loop until the predicate returns true.
> Here I loop until the function either succeeds or returns the expected error.
> Instead of an infinite loop, I put in a predefined number of tries, after
> which we crash.

MSDN is referring to the normal checking that developers already have to do when using condition variables. Generally the pattern is:

[Lock]
while ([Check State Synchronized by Lock]) {
  [Condition Variable Sleep and Wait to be Woken Up]
}

At any wake, legitimate or spurious, you have to recheck the state. It's not clear to me that this retry logic is needed.

Also, by indirecting the CV one level, you potentially introduce some races where the CV is signaled during error retry. In the normal case, if the CV is signaled during a spurious wakeup, there's generally not a problem, as the conditions are rechecked. In the retry case, if the CV is signaled, that signal gets lost during the retry logic and we sleep until timeout or forever.
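For concreteness, the recheck-the-predicate pattern above, written with standard C++ primitives rather than base::ConditionVariable (all names here are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex g_mu;
std::condition_variable g_cv;
std::queue<int> g_tasks;  // The state synchronized by the lock.

int ConsumeOne() {
  std::unique_lock<std::mutex> lock(g_mu);  // [Lock]
  while (g_tasks.empty())                   // [Check state under the lock]
    g_cv.wait(lock);                        // [Sleep; every wake rechecks]
  int task = g_tasks.front();
  g_tasks.pop();
  return task;
}

A spurious or stolen wake simply re-evaluates the predicate and goes back to sleep; no retry counter is needed.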
> Also, by indirecting the CV one level, you potentially introduce some races
> where the CV is signaled during error retry.
>
> In the normal case, if the CV is signaled during a spurious wakeup, there's
> generally not a problem, as the conditions are rechecked.

But here there is no guarantee of whether the lock is acquired or not. And there is also the case of stolen wakeups, which would occur with Broadcast().

How about having a function like the one below, which tells us whether the lock is acquired after the call to SleepConditionVariableSRW(); if it is not, we continue to sleep until we get the expected return values on retry.

bool Lock::IsAcquired() const {
  return owning_thread_ref_ == PlatformThread::CurrentRef();
}

> In the retry case, if the CV is signaled, that signal gets lost during the
> retry logic and we sleep until timeout or forever.

This scenario would be: suppose there are two threads t1 and t2 using the condition variable.
1. t1 is sleeping on the condition variable and, due to a spurious wakeup, it comes out of the sleep function.
2. t2 signals the condition variable.
3. t1 retries the sleep.

In such a case, checking whether the lock is acquired before retrying would help. If the lock is not acquired and we sleep, that sleep should return immediately, as it can easily acquire the lock.

In the case of non-infinite timeouts, with a delta from Time::Now(), it will time out or succeed. In the case of infinite timeouts, we can keep the retry timeouts small (e.g. 50 ms), so that we only sleep momentarily.
> In such a case, checking whether the lock is acquired before retrying would
> help. If the lock is not acquired and we sleep, that sleep should return
> immediately, as it can easily acquire the lock.

Once lost, the signal is not usable later, as condition variables are stateless.

The situation of spurious and stolen wakeups would not arise if there were one producer and one consumer. When there are multiple consumers, a signal should ideally wake up only one consumer, but on a multiprocessor there can be multiple wakeups, out of which only one succeeds and the others fail. Due to an unexpected wakeup, SleepConditionVariableSRW() may return a different error. IMO, in such a case we can continue to wait again if the expected error is not returned.
IMO, the patch should work provided that SleepConditionVariableSRW() fails on spurious wakeups. If the function does not fail, then we can use the mechanism explained below.

1. Suppose there are three threads t1, t2 and t3 running on three different processor cores. t1 has acquired the lock, and the other two, t2 and t3, are waiting on the condition variable.
2. t1 is done and it signals the condition variable.
3. t2 and t3 both wake up at the same time.
4. Suppose t2 succeeds in acquiring the lock; then the sleep function of t3 fails, and we need to continue to wait for the remaining timeout as if there had been no signalling.

If we don't want to handle this mechanism at the src/base level, we can handle it on the client side, where the condition variable's spurious wakeups are causing problems.

+++++++++++++++++++++++++++++

If the sleep function does not fail, we can have logic something like below:

ConditionVariable::Wait() {
  // Check held and unmark.
  owning_thread_ref_ = PlatformThreadRef();  // Lock owning ref is null.

  while (1) {
    SleepConditionVariableSRW(timeout);

    // Only we acquired the lock.
    if (owning_thread_ref_ == PlatformThreadRef()) {
      // Check unheld and mark.
      owning_thread_ref_ = PlatformThread::CurrentRef();  // Current thread is the owner.
      break;
    }

    // Compute the remaining timeout.
  }
}
On 2016/09/30 11:34:37, prashant.n wrote:
> How about having a function like the one below, which tells us whether the
> lock is acquired after the call to SleepConditionVariableSRW(); if it is not,
> we continue to sleep until we get the expected return values on retry. [...]

SleepConditionVariableSRW guarantees the thread owns the lock regardless of return value. The last thing it does is attempt to reacquire the lock, on success or failure.

> In such a case, checking whether the lock is acquired before retrying would
> help. [...] Once lost, the signal is not usable later, as condition variables
> are stateless.

This is indistinguishable from a normal timeout, which suggests that there is a race condition in the code. t2 should only signal the condition variable once it has updated the requisite state to allow t1 to move forward. t1 should only check that state while holding the lock.

> The situation of spurious and stolen wakeups would not arise if there were
> one producer and one consumer. [...] Due to an unexpected wakeup,
> SleepConditionVariableSRW() may return a different error. IMO, in such a case
> we can continue to wait again if the expected error is not returned.

Designs that have multiple consumers need to account for this by rechecking their own state before proceeding.

> IMO, the patch should work provided that SleepConditionVariableSRW() fails on
> spurious wakeups. [...] If we don't want to handle this mechanism at the
> src/base level, we can handle it on the client side, where the condition
> variable's spurious wakeups are causing problems.

There are two main scenarios to consider when dealing with the CV:
1) A CV sleep with a timeout - a spurious wakeup is indistinguishable from a timeout.
2) A CV sleep with a timeout of infinite - a spurious wakeup is indistinguishable from a real wakeup.

In both cases, the code using the CV is required to check its state before proceeding or sleeping, on success or failure. The retry code doesn't help in either case.

Taking a step back, what evidence is there that spurious wakeups are actually a problem?
There is nothing windows-specific about spuriously waking up from wait either. pthreads says this: "When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return." [1]

[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_cond_timedwa...

We probably have code that is waiting and failing to check anything. I think I can name a couple of places like this, actually, and I guess they are wrong.
> There are two main scenarios to consider when dealing with the CV:
> 1) A CV sleep with a timeout - a spurious wakeup is indistinguishable from a
> timeout.
> 2) A CV sleep with a timeout of infinite - a spurious wakeup is
> indistinguishable from a real wakeup.
>
> In both cases, the code using the CV is required to check its state before
> proceeding or sleeping, on success or failure. The retry code doesn't help in
> either case.

In the case of a spurious wakeup, is acquiring the lock after the signal is received guaranteed? A spurious wakeup means a single signal acted on by two consumer threads simultaneously. In that case only one can acquire the lock; the other got signalled but may or may not acquire the lock.

> Taking a step back, what evidence is there that spurious wakeups are actually
> a problem?

The way the condition variable is used in the categorized worker pool is correct, and the same mechanism works perfectly fine on Linux. But on a few Windows machines it crashes, and the call stack shows calls from Wait() and the function to be called after Wait(). This is just a suspect; the root cause may be different.
On Fri, Sep 30, 2016 at 11:12 AM, <danakj@chromium.org> wrote:
> There is nothing windows-specific about spuriously waking up from wait
> either. [...] We probably have code that is waiting and failing to check
> anything. I think I can name a couple of places like this, actually, and I
> guess they are wrong.

Oh, I'm thinking of base::WaitableEvent, but you're talking about ConditionVariable. Maybe you want to be using WaitableEvent?
> In the case of a spurious wakeup, is acquiring the lock after the signal is
> received guaranteed? A spurious wakeup means a single signal acted on by two
> consumer threads simultaneously. In that case only one can acquire the lock;
> the other got signalled but may or may not acquire the lock.

Yes. It is an error not to reacquire the lock, because the thread presumes it had the lock before. It is a necessary postcondition in all cases, success or failure, to reacquire the lock. The code on Windows and on POSIX does this today.

> The way the condition variable is used in the categorized worker pool is
> correct, and the same mechanism works perfectly fine on Linux. [...] This is
> just a suspect; the root cause may be different.

If you're referring to https://crbug.com/617010 for the crashes, those aren't actually crashes. Those are hangs reported by the hang detector. It is known that it's misclassifying these threads as hangs. Incidentally, that bug is marked as fixed.

> Oh, I'm thinking of base::WaitableEvent, but you're talking about
> ConditionVariable. Maybe you want to be using WaitableEvent?

Agreed. If you don't need external state to proceed and simply need a wakeup call, WaitableEvent is the one to use. CategorizedWorkerPool::Run does the correct thing and rechecks its state on CV wakes.
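For reference, a minimal sketch of the WaitableEvent alternative mentioned above. The constructor shown is the enum-based form; the exact base API has varied across Chromium revisions, so treat this as illustrative:

#include "base/synchronization/waitable_event.h"

// Unlike a condition variable, the signaled state is latched by the event,
// so a Signal() that happens before Wait() is not lost, and there is no
// external predicate to recheck.
base::WaitableEvent g_wake(base::WaitableEvent::ResetPolicy::AUTOMATIC,
                           base::WaitableEvent::InitialState::NOT_SIGNALED);

void WaitForWork() {
  g_wake.Wait();  // Returns once signaled; auto-reset rearms the event.
}

void NotifyWork() {
  g_wake.Signal();
}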
> Yes. It is an error not to reacquire the lock, because the thread presumes it
> had the lock before. It is a necessary postcondition in all cases, success or
> failure, to reacquire the lock. The code on Windows and on POSIX does this
> today.

So my query is: can the sleep function return with some error if it fails to re-acquire the lock? But in the case of Broadcast(), all the consumers waiting on the condition variable are signalled in one shot, and they acquire the lock one by one after the first one releases it. So the above case is highly unlikely.
prashant.n@samsung.com changed reviewers: + reveman@chromium.org, vmpstr@chromium.org
+reveman@ and vmpstr@
My conclusion on this is: we cannot have spurious wakeups (we don't see them in the POSIX implementation). Bug 640577 also looks to be a hang, not a crash, similar to 617010. I'll try to get some more info on bug 640577, regarding whether this crash is observed when the user sees the "Kill/Wait" view.
A while loop for this on Windows seems like the appropriate fix based on the linked MSDN documentation. Returning without the CV being signaled is clearly an incorrect implementation of base::ConditionVariable as currently documented. Is there a way we can test this in a unit test?

https://codereview.chromium.org/2381893002/diff/60001/base/synchronization/co...
File base/synchronization/condition_variable_win.cc (right):

https://codereview.chromium.org/2381893002/diff/60001/base/synchronization/co...
base/synchronization/condition_variable_win.cc:20: static const int kMaxNumberOfTriesAfterFailure = 10;
do we need to limit the max number of retries? what if 10 retries is needed for a valid use case?
> Is there a way we can test this in a unit test?

I'm not sure we can generate the scenario of spurious wakeups in a test environment, but if we enable MultiThreadConsumerTest on Windows, which has been disabled due to its flaky behavior (http://crbug.com/10607), it may test this patch.
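Not the actual base MultiThreadConsumerTest, but a standard-C++ analogue of the multi-consumer scenario it exercises (all names illustrative): several consumers share one condition variable, and correctness relies only on the predicate recheck, never on counting wakeups:

#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
  std::mutex mu;
  std::condition_variable cv;
  std::queue<int> work;
  int consumed = 0;

  auto consumer = [&] {
    for (;;) {
      std::unique_lock<std::mutex> lock(mu);
      while (work.empty())
        cv.wait(lock);  // Robust to spurious wakeups: predicate rechecked.
      int item = work.front();
      work.pop();
      if (item == -1)  // Sentinel: this consumer is done.
        return;
      ++consumed;  // Guarded by |mu|.
    }
  };

  std::vector<std::thread> consumers;
  for (int i = 0; i < 4; ++i)
    consumers.emplace_back(consumer);

  {
    std::lock_guard<std::mutex> lock(mu);
    for (int i = 0; i < 100; ++i)
      work.push(i);
    for (int i = 0; i < 4; ++i)
      work.push(-1);  // One sentinel per consumer.
  }
  cv.notify_all();

  for (auto& t : consumers)
    t.join();
  assert(consumed == 100);
  return 0;
}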
https://codereview.chromium.org/2381893002/diff/60001/base/synchronization/co...
File base/synchronization/condition_variable_win.cc (right):

https://codereview.chromium.org/2381893002/diff/60001/base/synchronization/co...
base/synchronization/condition_variable_win.cc:20: static const int kMaxNumberOfTriesAfterFailure = 10;
On 2016/10/03 10:58:07, reveman wrote:
> do we need to limit the max number of retries? what if 10 retries is needed
> for a valid use case?

Hmm. Actually the MSDN documentation calls for a forever while loop, with no limits. I thought of putting in some limit so that we can avoid looping forever on erroneous conditions. Maybe we can decide on a proper limit.

In the case of CategorizedWorkerPool, a while loop without limits is okay, but for other users of the condition variable it may cause forever waits. (Not confirmed in code, though.)
On 2016/10/03 11:39:50, prashant.n wrote:
> > Is there a way we can test this in a unit test?
>
> I'm not sure we can generate the scenario of spurious wakeups in a test
> environment, but if we enable MultiThreadConsumerTest on Windows, which has
> been disabled due to its flaky behavior (http://crbug.com/10607), it may test
> this patch.

https://codereview.chromium.org/2388083002 is added to check MultiThreadConsumerTest on Windows with this patch.
https://codereview.chromium.org/2381893002/diff/60001/base/synchronization/co...
File base/synchronization/condition_variable_win.cc (right):

https://codereview.chromium.org/2381893002/diff/60001/base/synchronization/co...
base/synchronization/condition_variable_win.cc:20: static const int kMaxNumberOfTriesAfterFailure = 10;
On 2016/10/03 at 11:40:14, prashant.n wrote:
> Hmm. Actually the MSDN documentation calls for a forever while loop, with no
> limits. I thought of putting in some limit so that we can avoid looping
> forever on erroneous conditions. Maybe we can decide on a proper limit. [...]

The situation is very similar to how we handle interrupts on posix platforms. In that case we loop forever in debug builds and make 100 attempts in release builds: https://cs.chromium.org/chromium/src/base/posix/eintr_wrapper.h?rcl=0&l=37

maybe do the same here?
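A paraphrase of the pattern being pointed at, written as a function template rather than the real HANDLE_EINTR macro; it follows the description above (loop forever in debug, cap attempts in release), and the actual header may differ in detail:

#include <cerrno>

template <typename Call>
auto RetryOnEintr(Call call) -> decltype(call()) {
  auto result = call();
#if defined(NDEBUG)
  // Release: give up after a bounded number of interrupted attempts.
  for (int attempts = 0; result == -1 && errno == EINTR && attempts < 100;
       ++attempts) {
    result = call();
  }
#else
  // Debug: keep retrying for as long as the call is interrupted.
  while (result == -1 && errno == EINTR)
    result = call();
#endif
  return result;
}

Used as, e.g., RetryOnEintr([&] { return read(fd, buf, len); }).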
On 2016/10/03 13:23:40, reveman wrote:
> The situation is very similar to how we handle interrupts on posix platforms.
> In that case we loop forever in debug builds and make 100 attempts in release
> builds:
> https://cs.chromium.org/chromium/src/base/posix/eintr_wrapper.h?rcl=0&l=37
>
> maybe do the same here?

Ok, I'll make this change. One query: as condition variables on any platform are subject to spurious wakeups, should we make this change for all platforms?

danakj@ and robliao@, what is your opinion of the patch?
On 2016/10/03 13:23:40, reveman wrote:
> The situation is very similar to how we handle interrupts on posix platforms.
> [...] maybe do the same here?

As indicated before, clients will naturally have a while loop rechecking their condition when using a condition variable. There isn't a need for special handling of spurious wakeups. Spurious wakeups are indistinguishable from timeouts as well.
On 2016/10/03 17:05:08, robliao wrote:
> As indicated before, clients will naturally have a while loop rechecking
> their condition when using a condition variable. There isn't a need for
> special handling of spurious wakeups. Spurious wakeups are indistinguishable
> from timeouts as well.

In CWP (CategorizedWorkerPool), we do a few more operations while checking state and then go to sleep. Those operations assume the lock is acquired; if it is not, we are bound to see crashes.

while (true) {
  if (!RunTaskWithLockAcquired(categories)) {
    // We are no longer running tasks, which may allow another category to
    // start running. Signal other worker threads.
    SignalHasReadyToRunTasksWithLockAcquired();

    // Exit when shutdown is set and no more tasks are pending.
    if (shutdown_)
      break;

    // Wait for more tasks.
    has_ready_to_run_tasks_cv->Wait();
    continue;
  }
}

I thought the loop check as per this patch would have been better here. If not, then we cannot have a fix. I'm unsure whether we can fix the bug at the moment, as we may be leaving it as a hang at some place other than the reported call stack.
On 2016/10/03 at 17:05:08, robliao wrote:
> As indicated before, clients will naturally have a while loop rechecking
> their condition when using a condition variable. There isn't a need for
> special handling of spurious wakeups. Spurious wakeups are indistinguishable
> from timeouts as well.

Ah, it's documented here: https://cs.chromium.org/chromium/src/base/synchronization/condition_variable....

I think it's perfectly fine to leave this up to the caller, but it would be nice to also document it here: https://cs.chromium.org/chromium/src/base/synchronization/condition_variable....

as currently, if you read the comment next to the function declaration, it sounds like Wait() will return only if the CV was signaled.
On 2016/10/03 19:46:17, reveman wrote:
> Ah, it's documented here:
> https://cs.chromium.org/chromium/src/base/synchronization/condition_variable....
>
> I think it's perfectly fine to leave this up to the caller

Hmm. Then should we patch it up at the call site in CWP?

> but it would be nice to also document it here:
> https://cs.chromium.org/chromium/src/base/synchronization/condition_variable....
>
> as currently, if you read the comment next to the function declaration, it
> sounds like Wait() will return only if the CV was signaled.

I'll modify the comment, so that callers can be cautious.
On 2016/10/04 01:38:41, prashant.n wrote:
> Hmm. Then should we patch it up at the call site in CWP?

I checked the code of CWP once again. Even if a spurious wakeup happens on the condition variable, the lock cannot be acquired, so that particular thread will continue to wait to acquire the lock. Once it acquires the lock, it will work fine, as the other functions in the while loop (comment #37) work as expected with the lock acquired.

So we'll conclude that we don't need to do anything.
On 2016/10/04 07:26:23, prashant.n wrote:
> I checked the code of CWP once again. Even if a spurious wakeup happens on
> the condition variable, the lock cannot be acquired, so that particular
> thread will continue to wait to acquire the lock. Once it acquires the lock,
> it will work fine, as the other functions in the while loop (comment #37)
> work as expected with the lock acquired.
>
> So we'll conclude that we don't need to do anything.

Here I assume SleepConditionVariableSRW() will never return an error other than ERROR_TIMEOUT.
On 2016/10/04 13:42:05, prashant.n wrote:
> Here I assume SleepConditionVariableSRW() will never return an error other
> than ERROR_TIMEOUT.

CWP needs to be robust to spurious wakeups. If CWP can't make forward progress due to deadlocking, that suggests a bug.
On 2016/10/05 00:21:18, robliao wrote:
> CWP needs to be robust to spurious wakeups. If CWP can't make forward
> progress due to deadlocking, that suggests a bug.

CWP handles this situation and will not deadlock, assuming SleepConditionVariableSRW() provides atomic operations.

On waiting: SleepConditionVariableSRW() = Unlock + Start Waiting.
On signalling: it should function as Stop Waiting + Lock.

Suppose there are two threads t1 and t2. t1 is about to wait on the CV and t2 has already been waiting on the same CV. If, due to a spurious/stolen wakeup, both get woken up from the CV, only one can succeed in acquiring the lock; the other will continue to wait to acquire it.
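A minimal Win32 sketch of the semantics just described, using only documented Win32 calls (the producer/consumer framing is illustrative):

#include <windows.h>

CONDITION_VARIABLE g_cv = CONDITION_VARIABLE_INIT;
SRWLOCK g_lock = SRWLOCK_INIT;
bool g_has_work = false;  // The predicate, guarded by g_lock.

void Consumer() {
  AcquireSRWLockExclusive(&g_lock);
  // SleepConditionVariableSRW atomically releases the lock and starts
  // waiting; whenever it returns, the thread holds the lock again, so the
  // predicate recheck below is always safe.
  while (!g_has_work)
    SleepConditionVariableSRW(&g_cv, &g_lock, INFINITE, 0);
  g_has_work = false;
  ReleaseSRWLockExclusive(&g_lock);
}

void Producer() {
  AcquireSRWLockExclusive(&g_lock);
  g_has_work = true;  // Update state before signalling.
  ReleaseSRWLockExclusive(&g_lock);
  WakeConditionVariable(&g_cv);
}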
On 2016/10/05 07:01:27, prashant.n wrote:
> CWP handles this situation and will not deadlock, assuming
> SleepConditionVariableSRW() provides atomic operations. [...]
>
> Suppose there are two threads t1 and t2. t1 is about to wait on the CV and
> t2 has already been waiting on the same CV. If, due to a spurious/stolen
> wakeup, both get woken up from the CV, only one can succeed in acquiring the
> lock; the other will continue to wait to acquire it.

Are we all agreed that we do not need this change?
Message was sent while issue was closed.
On 2016/10/07 22:33:53, robliao wrote:
> Are we all agreed that we do not need this change?

Closing this patch for now. Feel free to reopen if the need presents itself.