Issue 260723003: [oilpan]: Make parking threads for GC timeout in the case parking exceeds 100 MS

Issue 260723003: [oilpan]: Make parking threads for GC timeout in the case parking exceeds 100 MS (Closed)

Created:
6 years, 7 months ago by wibling-chromium

Modified:
6 years, 7 months ago

Reviewers:
Vyacheslav Egorov (Chromium), Mads Ager (chromium), tkent, zerny-chromium, Erik Corry, haraken, oilpan-reviews

CC:
blink-reviews, kouhei+heap_chromium.org, Mads Ager (chromium)

Base URL:
svn://svn.chromium.org/blink/trunk

Visibility:
Public.

More Reviews

Description

[oilpan]: Make parking threads for GC timeout in the case parking exceeds 100 MS. This is to avoid issues where a GC is hanging forever waiting for a thread to reach a safepoint. This should not happen, but in the case it does we would rather abandon the GC and attempt it again later. R=ager@chromium.org, erik.corry@gmail.com, haraken@chromium.org, oilpan-reviews@chromium.org, tkent@chromium.org, vegorov@chromium.org, zerny@chromium.org BUG= Committed: https://src.chromium.org/viewvc/blink?view=rev&revision=173646

Patch Set 2 : Fix currentTime for blink_heap_unittests + review feedback #

Patch Set 4 : Add safepoint scopes around yield #

Created: 6 years, 7 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+159 lines, -51 lines)			Patch
M	Source/platform/heap/Heap.cpp	View	1 2	3 chunks	+20 lines, -8 lines	0 comments	Download
M	Source/platform/heap/HeapTest.cpp	View	1 2 3	10 chunks	+84 lines, -10 lines	0 comments	Download
M	Source/platform/heap/RunAllTests.cpp	View	1 2	1 chunk	+2 lines, -1 line	0 comments	Download
M	Source/platform/heap/ThreadState.h	View	1 2	1 chunk	+1 line, -1 line	0 comments	Download
M	Source/platform/heap/ThreadState.cpp	View	1 2 3	7 chunks	+52 lines, -31 lines	0 comments	Download

Messages

Total messages: 13 (0 generated)

Expand Messages | Collapse Messages

haraken

The approach looks reasonable, although I see a threading issue. Intuitively, 10 ms sounds too ...

6 years, 7 months ago (2014-05-01 04:03:56 UTC) #2

The approach looks reasonable, although I see a threading issue.

Intuitively, 10 ms sounds too short to me. For low-end devices where CPU is slow
and the number of CPU is limited, it would be possible that it takes more than
10 ms to coordinate all threads (I think it's possible that layout takes more
than 10 ms, for example). For safety, we might want to start from a larger
number (e.g., 100 ms).

Also, we should consider adding UMA about the time GC has to wait etc. Since it
takes time to collect UMA from the wild, we want to start the work as soon as
possible.

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
File Source/platform/heap/ThreadState.cpp (right):

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:144: resumeOthers(true);

Nit: It would be more consistent if you can remove resumeOthers() from here and
let ~GCScope() call resumeOthers().

GCScope() {
  if (ThreadState::stopThreads()) {
    m_parkedAllThreads = true;
    m_state->enterGC();
  }
}

~GCScope() {
  if (m_parkedAllThreads) {
    m_state->leaveGC();
  }
  ThreadState::resumeThreads();
}

Then you can remove the |bool barrierLocked| flag from resumeOthers().

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:154: atomicSubtract(&m_unparkedThreadCount,
threads.size());

I guess this will confuse m_unparkedThreadCount. At this point, we need to
subtract X from m_unparkedThreadCount, where X is the number of threads that are
parking at this point.

Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and X ==
threads.size(). Thus we can subtract threads.size() from m_unparkedThreadCount.

After this CL, it's not guaranteed that m_unparkedThreadCount == 0 nor X ==
threads.size(). Thus subtracting threads.size() from m_unparkedThreadCount will
confuse the count.

wibling-chromium

Thanks for the review. I agree 10 MS are probably too little. I will experiment ...

6 years, 7 months ago (2014-05-01 06:51:48 UTC) #3

Thanks for the review. 

I agree 10 MS are probably too little. I will experiment with what a good
timeout will be. 

I am currently debugging an issue in the blink_heap_unittests since it seems
currentTime() is returning 0 when called in the tests. I suspect this is a test
only issue, but I would like to work around it.
I will also look into UMA and A/B testing.

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
File Source/platform/heap/ThreadState.cpp (right):

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:144: resumeOthers(true);
On 2014/05/01 04:03:56, haraken wrote:
> 
> Nit: It would be more consistent if you can remove resumeOthers() from here
and
> let ~GCScope() call resumeOthers().
> 
> GCScope() {
>   if (ThreadState::stopThreads()) {
>     m_parkedAllThreads = true;
>     m_state->enterGC();
>   }
> }
> 
> ~GCScope() {
>   if (m_parkedAllThreads) {
>     m_state->leaveGC();
>   }
>   ThreadState::resumeThreads();
> }
> 
> Then you can remove the |bool barrierLocked| flag from resumeOthers().

Yes, that would be nice. However I am worried that this would mean we release
m_mutex after timing out which will allow a thread to become parked in the
meantime and signal the mutex since it was the last remaining thread. By doing
it this way I am reasonably confident that cannot happen.

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:154: atomicSubtract(&m_unparkedThreadCount,
threads.size());
On 2014/05/01 04:03:56, haraken wrote:
> 
> I guess this will confuse m_unparkedThreadCount. At this point, we need to
> subtract X from m_unparkedThreadCount, where X is the number of threads that
are
> parking at this point.
> 
> Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and X ==
> threads.size(). Thus we can subtract threads.size() from
m_unparkedThreadCount.
> 
> After this CL, it's not guaranteed that m_unparkedThreadCount == 0 nor X ==
> threads.size(). Thus subtracting threads.size() from m_unparkedThreadCount
will
> confuse the count.

I initially read the code the same way, but parkOthers specifically add
threads.size() to the count. resumeOthers subtracts it which means they cancel
each other out. Similarly checkAndPark (specifically doPark call both decrement
and increment which should also cancel each other out. Finally enterSafePoint
(specifically doEnterSafePoint) and leaveSafePoint also both does a decrement
and an increment which cancels each other out.
As such I believe subtracting threads.size() is the correct thing to do even if
all threads have not been parked.

zerny-chromium

I find your reply to Haraken's questions sound. LGTM with minor formatting change/revert. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadState.cpp File ...

6 years, 7 months ago (2014-05-01 07:05:38 UTC) #4

haraken

I think we need to be careful about the timeout value, since we cannot expect ...

6 years, 7 months ago (2014-05-01 07:33:50 UTC) #5

I think we need to be careful about the timeout value, since we cannot expect
what happens in real worlds. I'll delegate the decision to you and Mads, but I
think it's safer to set the timeout value as large as possible (e.g., 100 ms,
500 ms etc).

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
File Source/platform/heap/ThreadState.cpp (right):

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:144: resumeOthers(true);
On 2014/05/01 06:51:48, wibling-chromium wrote:
> On 2014/05/01 04:03:56, haraken wrote:
> > 
> > Nit: It would be more consistent if you can remove resumeOthers() from here
> and
> > let ~GCScope() call resumeOthers().
> > 
> > GCScope() {
> >   if (ThreadState::stopThreads()) {
> >     m_parkedAllThreads = true;
> >     m_state->enterGC();
> >   }
> > }
> > 
> > ~GCScope() {
> >   if (m_parkedAllThreads) {
> >     m_state->leaveGC();
> >   }
> >   ThreadState::resumeThreads();
> > }
> > 
> > Then you can remove the |bool barrierLocked| flag from resumeOthers().
> 
> Yes, that would be nice. However I am worried that this would mean we release
> m_mutex after timing out which will allow a thread to become parked in the
> meantime and signal the mutex since it was the last remaining thread. By doing
> it this way I am reasonably confident that cannot happen.

Yes, but is it problematic that we allow another thread to get parked while
releasing the mutex?

Probably I think it's problematic in the current code because line 154 and line
155 are not in the mutex. If we move the line 154 and 155 into the mutex, there
is no problem in allowing another thread to get parked while releasing the
mutex. (Then the code logic becomes clearer: we need to acquire the mutex
whenever we touch m_unparkedThreadCount and m_canResume.)

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:154: atomicSubtract(&m_unparkedThreadCount,
threads.size());
On 2014/05/01 06:51:48, wibling-chromium wrote:
> On 2014/05/01 04:03:56, haraken wrote:
> > 
> > I guess this will confuse m_unparkedThreadCount. At this point, we need to
> > subtract X from m_unparkedThreadCount, where X is the number of threads that
> are
> > parking at this point.
> > 
> > Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and X ==
> > threads.size(). Thus we can subtract threads.size() from
> m_unparkedThreadCount.
> > 
> > After this CL, it's not guaranteed that m_unparkedThreadCount == 0 nor X ==
> > threads.size(). Thus subtracting threads.size() from m_unparkedThreadCount
> will
> > confuse the count.
> 
> I initially read the code the same way, but parkOthers specifically add
> threads.size() to the count. resumeOthers subtracts it which means they cancel
> each other out. Similarly checkAndPark (specifically doPark call both
decrement
> and increment which should also cancel each other out. Finally enterSafePoint
> (specifically doEnterSafePoint) and leaveSafePoint also both does a decrement
> and an increment which cancels each other out.
> As such I believe subtracting threads.size() is the correct thing to do even
if
> all threads have not been parked.

Thanks for the clarification. Your explanation sounds reasonable.

Erik Corry

I think if 10ms is not enough we need to consider adding more safe points. ...

6 years, 7 months ago (2014-05-01 08:11:30 UTC) #6

I think if 10ms is not enough we need to consider adding more safe
points.  That's 10 million instructions!

It would be good to use this feature to find places with too few safe
points, without inconveniencing users.  For now we can flush out some
places by having a low timeout when running tests and a higher one in
the actual browser.

On Thu, May 1, 2014 at 9:33 AM,  <haraken@chromium.org> wrote:
> I think we need to be careful about the timeout value, since we cannot
> expect
> what happens in real worlds. I'll delegate the decision to you and Mads, but
> I
> think it's safer to set the timeout value as large as possible (e.g., 100
> ms,
> 500 ms etc).
>
>
>
>
>
>
https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
> File Source/platform/heap/ThreadState.cpp (right):
>
>
https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
> Source/platform/heap/ThreadState.cpp:144: resumeOthers(true);
> On 2014/05/01 06:51:48, wibling-chromium wrote:
>>
>> On 2014/05/01 04:03:56, haraken wrote:
>> >
>> > Nit: It would be more consistent if you can remove resumeOthers()
>
> from here
>>
>> and
>> > let ~GCScope() call resumeOthers().
>> >
>> > GCScope() {
>> >   if (ThreadState::stopThreads()) {
>> >     m_parkedAllThreads = true;
>> >     m_state->enterGC();
>> >   }
>> > }
>> >
>> > ~GCScope() {
>> >   if (m_parkedAllThreads) {
>> >     m_state->leaveGC();
>> >   }
>> >   ThreadState::resumeThreads();
>> > }
>> >
>> > Then you can remove the |bool barrierLocked| flag from
>
> resumeOthers().
>
>> Yes, that would be nice. However I am worried that this would mean we
>
> release
>>
>> m_mutex after timing out which will allow a thread to become parked in
>
> the
>>
>> meantime and signal the mutex since it was the last remaining thread.
>
> By doing
>>
>> it this way I am reasonably confident that cannot happen.
>
>
> Yes, but is it problematic that we allow another thread to get parked
> while releasing the mutex?
>
> Probably I think it's problematic in the current code because line 154
> and line 155 are not in the mutex. If we move the line 154 and 155 into
> the mutex, there is no problem in allowing another thread to get parked
> while releasing the mutex. (Then the code logic becomes clearer: we need
> to acquire the mutex whenever we touch m_unparkedThreadCount and
> m_canResume.)
>
>
>
https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
> Source/platform/heap/ThreadState.cpp:154:
> atomicSubtract(&m_unparkedThreadCount, threads.size());
> On 2014/05/01 06:51:48, wibling-chromium wrote:
>>
>> On 2014/05/01 04:03:56, haraken wrote:
>> >
>> > I guess this will confuse m_unparkedThreadCount. At this point, we
>
> need to
>>
>> > subtract X from m_unparkedThreadCount, where X is the number of
>
> threads that
>>
>> are
>> > parking at this point.
>> >
>> > Before this CL, it's guaranteed that m_unparkedThreadCount == 0 and
>
> X ==
>>
>> > threads.size(). Thus we can subtract threads.size() from
>> m_unparkedThreadCount.
>> >
>> > After this CL, it's not guaranteed that m_unparkedThreadCount == 0
>
> nor X ==
>>
>> > threads.size(). Thus subtracting threads.size() from
>
> m_unparkedThreadCount
>>
>> will
>> > confuse the count.
>
>
>> I initially read the code the same way, but parkOthers specifically
>
> add
>>
>> threads.size() to the count. resumeOthers subtracts it which means
>
> they cancel
>>
>> each other out. Similarly checkAndPark (specifically doPark call both
>
> decrement
>>
>> and increment which should also cancel each other out. Finally
>
> enterSafePoint
>>
>> (specifically doEnterSafePoint) and leaveSafePoint also both does a
>
> decrement
>>
>> and an increment which cancels each other out.
>> As such I believe subtracting threads.size() is the correct thing to
>
> do even if
>>
>> all threads have not been parked.
>
>
> Thanks for the clarification. Your explanation sounds reasonable.
>
> https://codereview.chromium.org/260723003/



-- 
Erik Corry, Software Engineer
Google Denmark ApS - Frederiksborggade 20B, 1 sal,
1360 København K - Denmark - CVR nr. 28 86 69 84

To unsubscribe from this group and stop receiving emails from it, send an email
to blink-reviews+unsubscribe@chromium.org.

haraken

> I think if 10ms is not enough we need to consider adding more safe ...

6 years, 7 months ago (2014-05-01 08:21:11 UTC) #7

wibling-chromium

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadState.cpp File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadState.cpp#newcode144 Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); On 2014/05/01 07:33:50, haraken wrote: > On 2014/05/01 ...

6 years, 7 months ago (2014-05-01 08:25:18 UTC) #8

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
File Source/platform/heap/ThreadState.cpp (right):

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:144: resumeOthers(true);
On 2014/05/01 07:33:50, haraken wrote:
> On 2014/05/01 06:51:48, wibling-chromium wrote:
> > On 2014/05/01 04:03:56, haraken wrote:
> > > 
> > > Nit: It would be more consistent if you can remove resumeOthers() from
here
> > and
> > > let ~GCScope() call resumeOthers().
> > > 
> > > GCScope() {
> > >   if (ThreadState::stopThreads()) {
> > >     m_parkedAllThreads = true;
> > >     m_state->enterGC();
> > >   }
> > > }
> > > 
> > > ~GCScope() {
> > >   if (m_parkedAllThreads) {
> > >     m_state->leaveGC();
> > >   }
> > >   ThreadState::resumeThreads();
> > > }
> > > 
> > > Then you can remove the |bool barrierLocked| flag from resumeOthers().
> > 
> > Yes, that would be nice. However I am worried that this would mean we
release
> > m_mutex after timing out which will allow a thread to become parked in the
> > meantime and signal the mutex since it was the last remaining thread. By
doing
> > it this way I am reasonably confident that cannot happen.
> 
> Yes, but is it problematic that we allow another thread to get parked while
> releasing the mutex?

No, that should be okay. However previously we would never signal m_parked
unless someone were in parkOthers and had added threads.size() to the unparked
count. Now we do have a window where a thread could signal even though
parkOthers had failed (ie. timed out). That signal would now be lingering on
until the next parkOther's call. It might not be a big deal since we would
likely just loop once more in the parkOthers while(m_unparkedThreadCount > 0)
loop and find that threads are not yet parked.

That said I find the current code and API simpler to understand. As it is now
the SafePointBarrier API for parkOthers is to call resumeOthers only if it
succeeds and there are no interim side effects if it fails.
It is also easier to verify that the effects of parkOthers are negated since we
do it locally in the SafePointBarrier.

> Probably I think it's problematic in the current code because line 154 and
line
> 155 are not in the mutex. If we move the line 154 and 155 into the mutex,
there
> is no problem in allowing another thread to get parked while releasing the
> mutex. (Then the code logic becomes clearer: we need to acquire the mutex
> whenever we touch m_unparkedThreadCount and m_canResume.)

The accesses in 154 and 155 are both atomic so they should be sane given.
Strictly speaking this is not changing as part of my change. It was the case
before as well and from what I can tell is fine with the current usage.

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:183: void checkAndPark(ThreadState* state)
On 2014/05/01 07:05:38, zerny-chromium wrote:
> Nit: avoid reordering/formatting for unchanged code.

Done. I will revert that and do it in a separate change.

haraken

LGTM. https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadState.cpp File Source/platform/heap/ThreadState.cpp (right): https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadState.cpp#newcode144 Source/platform/heap/ThreadState.cpp:144: resumeOthers(true); On 2014/05/01 08:25:18, wibling-chromium wrote: > On ...

6 years, 7 months ago (2014-05-01 08:36:31 UTC) #9

LGTM.

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
File Source/platform/heap/ThreadState.cpp (right):

https://codereview.chromium.org/260723003/diff/1/Source/platform/heap/ThreadS...
Source/platform/heap/ThreadState.cpp:144: resumeOthers(true);
On 2014/05/01 08:25:18, wibling-chromium wrote:
> On 2014/05/01 07:33:50, haraken wrote:
> > On 2014/05/01 06:51:48, wibling-chromium wrote:
> > > On 2014/05/01 04:03:56, haraken wrote:
> > > > 
> > > > Nit: It would be more consistent if you can remove resumeOthers() from
> here
> > > and
> > > > let ~GCScope() call resumeOthers().
> > > > 
> > > > GCScope() {
> > > >   if (ThreadState::stopThreads()) {
> > > >     m_parkedAllThreads = true;
> > > >     m_state->enterGC();
> > > >   }
> > > > }
> > > > 
> > > > ~GCScope() {
> > > >   if (m_parkedAllThreads) {
> > > >     m_state->leaveGC();
> > > >   }
> > > >   ThreadState::resumeThreads();
> > > > }
> > > > 
> > > > Then you can remove the |bool barrierLocked| flag from resumeOthers().
> > > 
> > > Yes, that would be nice. However I am worried that this would mean we
> release
> > > m_mutex after timing out which will allow a thread to become parked in the
> > > meantime and signal the mutex since it was the last remaining thread. By
> doing
> > > it this way I am reasonably confident that cannot happen.
> > 
> > Yes, but is it problematic that we allow another thread to get parked while
> > releasing the mutex?
> 
> No, that should be okay. However previously we would never signal m_parked
> unless someone were in parkOthers and had added threads.size() to the unparked
> count. Now we do have a window where a thread could signal even though
> parkOthers had failed (ie. timed out). That signal would now be lingering on
> until the next parkOther's call. It might not be a big deal since we would
> likely just loop once more in the parkOthers while(m_unparkedThreadCount > 0)
> loop and find that threads are not yet parked.
> 
> That said I find the current code and API simpler to understand. As it is now
> the SafePointBarrier API for parkOthers is to call resumeOthers only if it
> succeeds and there are no interim side effects if it fails.
> It is also easier to verify that the effects of parkOthers are negated since
we
> do it locally in the SafePointBarrier.
> 
> > Probably I think it's problematic in the current code because line 154 and
> line
> > 155 are not in the mutex. If we move the line 154 and 155 into the mutex,
> there
> > is no problem in allowing another thread to get parked while releasing the
> > mutex. (Then the code logic becomes clearer: we need to acquire the mutex
> > whenever we touch m_unparkedThreadCount and m_canResume.)
> 
> The accesses in 154 and 155 are both atomic so they should be sane given.
> Strictly speaking this is not changing as part of my change. It was the case
> before as well and from what I can tell is fine with the current usage.

Thanks for the clarification, makes sense!

Mads Ager (chromium)

LGTM This would have avoided the deadlock we caused because of a missing safepoint in ...

6 years, 7 months ago (2014-05-01 09:02:01 UTC) #10

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/wibling@chromium.org/260723003/60001

6 years, 7 months ago (2014-05-08 13:01:46 UTC) #12

Message was sent while issue was closed.

Change committed as 173646

Expand Messages | Collapse Messages