|
|
Created:
6 years, 7 months ago by wibling-chromium Modified:
6 years, 7 months ago Reviewers:
Vyacheslav Egorov (Chromium), Mads Ager (chromium), tkent, zerny-chromium, Erik Corry, haraken, oilpan-reviews CC:
blink-reviews, kouhei+heap_chromium.org, Mads Ager (chromium) Base URL:
svn://svn.chromium.org/blink/trunk Visibility:
Public. |
Description[oilpan]: Make parking threads for GC timeout in the case parking exceeds 100 MS.
This is to avoid issues where a GC is hanging forever waiting for a thread to reach a safepoint. This should not happen, but in the case it does we would rather abandon the GC and attempt it again later.
R=ager@chromium.org, erik.corry@gmail.com, haraken@chromium.org, oilpan-reviews@chromium.org, tkent@chromium.org, vegorov@chromium.org, zerny@chromium.org
BUG=
Committed: https://src.chromium.org/viewvc/blink?view=rev&revision=173646
Patch Set 1 #
Total comments: 10
Patch Set 2 : Fix currentTime for blink_heap_unittests + review feedback #Patch Set 3 : #Patch Set 4 : Add safepoint scopes around yield #
Messages
Total messages: 13 (0 generated)
The approach looks reasonable, although I see a threading issue. Intuitively, 10 ms sounds too short to me. For low-end devices where CPU is slow and the number of CPU is limited, it would be possible that it takes more than 10 ms to coordinate all threads (I think it's possible that layout takes more than 10 ms, for example). For safety, we might want to start from a larger number (e.g., 100 ms). Also, we should consider adding UMA about the time GC has to wait etc. Since it takes time to collect UMA from the wild, we want to start the work as soon as possible. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); Nit: It would be more consistent if you can remove resumeOthers() from here and let ~GCScope() call resumeOthers(). GCScope() { if (ThreadState::stopThreads()) { m_parkedAllThreads = true; m_state->enterGC(); } } ~GCScope() { if (m_parkedAllThreads) { m_state->leaveGC(); } ThreadState::resumeThreads(); } Then you can remove the |bool barrierLocked| flag from resumeOthers(). https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:154: atomicSubtract(&m_unparkedThreadCount, threads.size()); I guess this will confuse m_unparkedThreadCount. At this point, we need to subtract X from m_unparkedThreadCount, where X is the number of threads that are parking at this point. Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and X == threads.size(). Thus we can subtract threads.size() from m_unparkedThreadCount. After this CL, it's not guaranteed that m_unparkedThreadCount == 0 nor X == threads.size(). Thus subtracting threads.size() from m_unparkedThreadCount will confuse the count.
Thanks for the review. I agree 10 MS are probably too little. I will experiment with what a good timeout will be. I am currently debugging an issue in the blink_heap_unittests since it seems currentTime() is returning 0 when called in the tests. I suspect this is a test only issue, but I would like to work around it. I will also look into UMA and A/B testing. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); On 2014/05/01 04:03:56, haraken wrote: > > Nit: It would be more consistent if you can remove resumeOthers() from here and > let ~GCScope() call resumeOthers(). > > GCScope() { > if (ThreadState::stopThreads()) { > m_parkedAllThreads = true; > m_state->enterGC(); > } > } > > ~GCScope() { > if (m_parkedAllThreads) { > m_state->leaveGC(); > } > ThreadState::resumeThreads(); > } > > Then you can remove the |bool barrierLocked| flag from resumeOthers(). Yes, that would be nice. However I am worried that this would mean we release m_mutex after timing out which will allow a thread to become parked in the meantime and signal the mutex since it was the last remaining thread. By doing it this way I am reasonably confident that cannot happen. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:154: atomicSubtract(&m_unparkedThreadCount, threads.size()); On 2014/05/01 04:03:56, haraken wrote: > > I guess this will confuse m_unparkedThreadCount. At this point, we need to > subtract X from m_unparkedThreadCount, where X is the number of threads that are > parking at this point. > > Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and X == > threads.size(). Thus we can subtract threads.size() from m_unparkedThreadCount. > > After this CL, it's not guaranteed that m_unparkedThreadCount == 0 nor X == > threads.size(). Thus subtracting threads.size() from m_unparkedThreadCount will > confuse the count. I initially read the code the same way, but parkOthers specifically add threads.size() to the count. resumeOthers subtracts it which means they cancel each other out. Similarly checkAndPark (specifically doPark call both decrement and increment which should also cancel each other out. Finally enterSafePoint (specifically doEnterSafePoint) and leaveSafePoint also both does a decrement and an increment which cancels each other out. As such I believe subtracting threads.size() is the correct thing to do even if all threads have not been parked.
I find your reply to Haraken's questions sound. LGTM with minor formatting change/revert. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:183: void checkAndPark(ThreadState* state) Nit: avoid reordering/formatting for unchanged code.
I think we need to be careful about the timeout value, since we cannot expect what happens in real worlds. I'll delegate the decision to you and Mads, but I think it's safer to set the timeout value as large as possible (e.g., 100 ms, 500 ms etc). https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); On 2014/05/01 06:51:48, wibling-chromium wrote: > On 2014/05/01 04:03:56, haraken wrote: > > > > Nit: It would be more consistent if you can remove resumeOthers() from here > and > > let ~GCScope() call resumeOthers(). > > > > GCScope() { > > if (ThreadState::stopThreads()) { > > m_parkedAllThreads = true; > > m_state->enterGC(); > > } > > } > > > > ~GCScope() { > > if (m_parkedAllThreads) { > > m_state->leaveGC(); > > } > > ThreadState::resumeThreads(); > > } > > > > Then you can remove the |bool barrierLocked| flag from resumeOthers(). > > Yes, that would be nice. However I am worried that this would mean we release > m_mutex after timing out which will allow a thread to become parked in the > meantime and signal the mutex since it was the last remaining thread. By doing > it this way I am reasonably confident that cannot happen. Yes, but is it problematic that we allow another thread to get parked while releasing the mutex? Probably I think it's problematic in the current code because line 154 and line 155 are not in the mutex. If we move the line 154 and 155 into the mutex, there is no problem in allowing another thread to get parked while releasing the mutex. (Then the code logic becomes clearer: we need to acquire the mutex whenever we touch m_unparkedThreadCount and m_canResume.) https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:154: atomicSubtract(&m_unparkedThreadCount, threads.size()); On 2014/05/01 06:51:48, wibling-chromium wrote: > On 2014/05/01 04:03:56, haraken wrote: > > > > I guess this will confuse m_unparkedThreadCount. At this point, we need to > > subtract X from m_unparkedThreadCount, where X is the number of threads that > are > > parking at this point. > > > > Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and X == > > threads.size(). Thus we can subtract threads.size() from > m_unparkedThreadCount. > > > > After this CL, it's not guaranteed that m_unparkedThreadCount == 0 nor X == > > threads.size(). Thus subtracting threads.size() from m_unparkedThreadCount > will > > confuse the count. > > I initially read the code the same way, but parkOthers specifically add > threads.size() to the count. resumeOthers subtracts it which means they cancel > each other out. Similarly checkAndPark (specifically doPark call both decrement > and increment which should also cancel each other out. Finally enterSafePoint > (specifically doEnterSafePoint) and leaveSafePoint also both does a decrement > and an increment which cancels each other out. > As such I believe subtracting threads.size() is the correct thing to do even if > all threads have not been parked. Thanks for the clarification. Your explanation sounds reasonable.
I think if 10ms is not enough we need to consider adding more safe points. That's 10 million instructions! It would be good to use this feature to find places with too few safe points, without inconveniencing users. For now we can flush out some places by having a low timeout when running tests and a higher one in the actual browser. On Thu, May 1, 2014 at 9:33 AM, <haraken@chromium.org> wrote: > I think we need to be careful about the timeout value, since we cannot > expect > what happens in real worlds. I'll delegate the decision to you and Mads, but > I > think it's safer to set the timeout value as large as possible (e.g., 100 > ms, > 500 ms etc). > > > > > > https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... > File Source/platform/heap/ThreadState.cpp (right): > > https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... > Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); > On 2014/05/01 06:51:48, wibling-chromium wrote: >> >> On 2014/05/01 04:03:56, haraken wrote: >> > >> > Nit: It would be more consistent if you can remove resumeOthers() > > from here >> >> and >> > let ~GCScope() call resumeOthers(). >> > >> > GCScope() { >> > if (ThreadState::stopThreads()) { >> > m_parkedAllThreads = true; >> > m_state->enterGC(); >> > } >> > } >> > >> > ~GCScope() { >> > if (m_parkedAllThreads) { >> > m_state->leaveGC(); >> > } >> > ThreadState::resumeThreads(); >> > } >> > >> > Then you can remove the |bool barrierLocked| flag from > > resumeOthers(). > >> Yes, that would be nice. However I am worried that this would mean we > > release >> >> m_mutex after timing out which will allow a thread to become parked in > > the >> >> meantime and signal the mutex since it was the last remaining thread. > > By doing >> >> it this way I am reasonably confident that cannot happen. > > > Yes, but is it problematic that we allow another thread to get parked > while releasing the mutex? > > Probably I think it's problematic in the current code because line 154 > and line 155 are not in the mutex. If we move the line 154 and 155 into > the mutex, there is no problem in allowing another thread to get parked > while releasing the mutex. (Then the code logic becomes clearer: we need > to acquire the mutex whenever we touch m_unparkedThreadCount and > m_canResume.) > > > https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... > Source/platform/heap/ThreadState.cpp:154: > atomicSubtract(&m_unparkedThreadCount, threads.size()); > On 2014/05/01 06:51:48, wibling-chromium wrote: >> >> On 2014/05/01 04:03:56, haraken wrote: >> > >> > I guess this will confuse m_unparkedThreadCount. At this point, we > > need to >> >> > subtract X from m_unparkedThreadCount, where X is the number of > > threads that >> >> are >> > parking at this point. >> > >> > Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and > > X == >> >> > threads.size(). Thus we can subtract threads.size() from >> m_unparkedThreadCount. >> > >> > After this CL, it's not guaranteed that m_unparkedThreadCount == 0 > > nor X == >> >> > threads.size(). Thus subtracting threads.size() from > > m_unparkedThreadCount >> >> will >> > confuse the count. > > >> I initially read the code the same way, but parkOthers specifically > > add >> >> threads.size() to the count. resumeOthers subtracts it which means > > they cancel >> >> each other out. Similarly checkAndPark (specifically doPark call both > > decrement >> >> and increment which should also cancel each other out. Finally > > enterSafePoint >> >> (specifically doEnterSafePoint) and leaveSafePoint also both does a > > decrement >> >> and an increment which cancels each other out. >> As such I believe subtracting threads.size() is the correct thing to > > do even if >> >> all threads have not been parked. > > > Thanks for the clarification. Your explanation sounds reasonable. > > https://codereview.chromium.org/260723003/ -- Erik Corry, Software Engineer Google Denmark ApS - Frederiksborggade 20B, 1 sal, 1360 København K - Denmark - CVR nr. 28 86 69 84 To unsubscribe from this group and stop receiving emails from it, send an email to blink-reviews+unsubscribe@chromium.org.
> I think if 10ms is not enough we need to consider adding more safe > points. That's 10 million instructions! > > It would be good to use this feature to find places with too few safe > points, without inconveniencing users. For now we can flush out some > places by having a low timeout when running tests and a higher one in > the actual browser. I agree that we should add more safe points if 10 ms is not enough, but we need to be more conservative until we really finish adding the safe points. Probably we can start with a larger timeout value for safety and collect UMA at the same time.
https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); On 2014/05/01 07:33:50, haraken wrote: > On 2014/05/01 06:51:48, wibling-chromium wrote: > > On 2014/05/01 04:03:56, haraken wrote: > > > > > > Nit: It would be more consistent if you can remove resumeOthers() from here > > and > > > let ~GCScope() call resumeOthers(). > > > > > > GCScope() { > > > if (ThreadState::stopThreads()) { > > > m_parkedAllThreads = true; > > > m_state->enterGC(); > > > } > > > } > > > > > > ~GCScope() { > > > if (m_parkedAllThreads) { > > > m_state->leaveGC(); > > > } > > > ThreadState::resumeThreads(); > > > } > > > > > > Then you can remove the |bool barrierLocked| flag from resumeOthers(). > > > > Yes, that would be nice. However I am worried that this would mean we release > > m_mutex after timing out which will allow a thread to become parked in the > > meantime and signal the mutex since it was the last remaining thread. By doing > > it this way I am reasonably confident that cannot happen. > > Yes, but is it problematic that we allow another thread to get parked while > releasing the mutex? No, that should be okay. However previously we would never signal m_parked unless someone were in parkOthers and had added threads.size() to the unparked count. Now we do have a window where a thread could signal even though parkOthers had failed (ie. timed out). That signal would now be lingering on until the next parkOther's call. It might not be a big deal since we would likely just loop once more in the parkOthers while(m_unparkedThreadCount > 0) loop and find that threads are not yet parked. That said I find the current code and API simpler to understand. As it is now the SafePointBarrier API for parkOthers is to call resumeOthers only if it succeeds and there are no interim side effects if it fails. It is also easier to verify that the effects of parkOthers are negated since we do it locally in the SafePointBarrier. > Probably I think it's problematic in the current code because line 154 and line > 155 are not in the mutex. If we move the line 154 and 155 into the mutex, there > is no problem in allowing another thread to get parked while releasing the > mutex. (Then the code logic becomes clearer: we need to acquire the mutex > whenever we touch m_unparkedThreadCount and m_canResume.) The accesses in 154 and 155 are both atomic so they should be sane given. Strictly speaking this is not changing as part of my change. It was the case before as well and from what I can tell is fine with the current usage. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:183: void checkAndPark(ThreadState* state) On 2014/05/01 07:05:38, zerny-chromium wrote: > Nit: avoid reordering/formatting for unchanged code. Done. I will revert that and do it in a separate change.
LGTM. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS... Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); On 2014/05/01 08:25:18, wibling-chromium wrote: > On 2014/05/01 07:33:50, haraken wrote: > > On 2014/05/01 06:51:48, wibling-chromium wrote: > > > On 2014/05/01 04:03:56, haraken wrote: > > > > > > > > Nit: It would be more consistent if you can remove resumeOthers() from > here > > > and > > > > let ~GCScope() call resumeOthers(). > > > > > > > > GCScope() { > > > > if (ThreadState::stopThreads()) { > > > > m_parkedAllThreads = true; > > > > m_state->enterGC(); > > > > } > > > > } > > > > > > > > ~GCScope() { > > > > if (m_parkedAllThreads) { > > > > m_state->leaveGC(); > > > > } > > > > ThreadState::resumeThreads(); > > > > } > > > > > > > > Then you can remove the |bool barrierLocked| flag from resumeOthers(). > > > > > > Yes, that would be nice. However I am worried that this would mean we > release > > > m_mutex after timing out which will allow a thread to become parked in the > > > meantime and signal the mutex since it was the last remaining thread. By > doing > > > it this way I am reasonably confident that cannot happen. > > > > Yes, but is it problematic that we allow another thread to get parked while > > releasing the mutex? > > No, that should be okay. However previously we would never signal m_parked > unless someone were in parkOthers and had added threads.size() to the unparked > count. Now we do have a window where a thread could signal even though > parkOthers had failed (ie. timed out). That signal would now be lingering on > until the next parkOther's call. It might not be a big deal since we would > likely just loop once more in the parkOthers while(m_unparkedThreadCount > 0) > loop and find that threads are not yet parked. > > That said I find the current code and API simpler to understand. As it is now > the SafePointBarrier API for parkOthers is to call resumeOthers only if it > succeeds and there are no interim side effects if it fails. > It is also easier to verify that the effects of parkOthers are negated since we > do it locally in the SafePointBarrier. > > > Probably I think it's problematic in the current code because line 154 and > line > > 155 are not in the mutex. If we move the line 154 and 155 into the mutex, > there > > is no problem in allowing another thread to get parked while releasing the > > mutex. (Then the code logic becomes clearer: we need to acquire the mutex > > whenever we touch m_unparkedThreadCount and m_canResume.) > > The accesses in 154 and 155 are both atomic so they should be sane given. > Strictly speaking this is not changing as part of my change. It was the case > before as well and from what I can tell is fine with the current usage. Thanks for the clarification, makes sense!
LGTM This would have avoided the deadlock we caused because of a missing safepoint in synchronous loading from a worker. As the next step we should look into A/B testing and having a release assert here for a small percentage of users to catch situations where we cannot reach a safepoint within a reasonable time.
The CQ bit was checked by wibling@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/wibling@chromium.org/260723003/60001
Message was sent while issue was closed.
Change committed as 173646 |