Issue 9405034: Allow task tracing in official chrome.

jbates

ptal

8 years, 10 months ago (2012-02-16 00:15:39 UTC) #1

jar (doing other things)

This is pretty cool/exciting stuff! :-) I don't think you ever got around to changing ...

8 years, 10 months ago (2012-02-16 03:15:33 UTC) #2

nduca

If you're referring to our use of a static variable to store the çategory struct, ...

8 years, 10 months ago (2012-02-16 03:32:27 UTC) #3

jar (doing other things)

To help resolve this, I've added an expert (Dmitry) in threading to the CC list. ...

8 years, 10 months ago (2012-02-16 17:50:02 UTC) #4

jbates

On 2012/02/16 17:50:02, jar wrote: > To help resolve this, I've added an expert (Dmitry) ...

8 years, 10 months ago (2012-02-16 19:45:38 UTC) #5

On 2012/02/16 17:50:02, jar wrote:
> To help resolve this, I've added an expert (Dmitry) in threading to the CC
list.
>  He and Wan Teh have schooled me on this topic in the past year or two (as
> histogram.h evolved to its current state).  They can probably cite educational
> material that goes into all the stuff (and more), but they can also chime in
if
> I say any of this incorrectly.
> 
> The reason why the code in src/base/debug/trace_event.h (see the static
> initializer around line 491 and the associated ANNOTATE_BENIGN_RACE macro) is
> not yet safe is that you're counting on both the pointer, *and* the
> pointed-to-contents being visible on the receiving thread. 
> 
> When a receiving thread sees the pointer, there is no guarantee (currently)
that
> the pointed-to-contents will also be visible in the receiving core's cache...
> and in fact, old data may be sitting in that core's cache.  As a result, the
> receiving core may deref the pointer, and reach garbage.  For instance, it
could
> see an unterminated char* (if this is indeed a c-string) leading to
inaccessible
> memory.
> 
> The the rule is simple for this pattern: Both the sending and receiving
threads
> *must* use a barrier.  The fact that you don't currently have a barrier on the
> receiving (fast) path is what causes the race.
> 
> You can see some additional discussion of this topic in the context of
> histograms in the bug http://code.google.com/p/chromium/issues/detail?id=88519
> or in the resolving CL
> http://codereview.chromium.org/7535006
> 
> 
> Jim

I think we are saved by one simple difference from the histograms code: the
tracing macros operate safely even if they read garbage data from the static
unsigned char.

After reading through
http://code.google.com/p/data-race-test/wiki/AboutRaces#Racy_publication though,
I'm slightly less sure of myself regarding "Clarification 4". Considering that
Acquire_Load and Release_Store are essentially nops on x86, I'm not opposed to
changing trace_event to match histogram macros. For my own sanity though, I
would love to have the author of "Clarification 4" provide references so I can
edubicate myself!

jar (doing other things)

The objection you cited, due to "cleverness" in the compiler, is certainly an issue. If ...

8 years, 10 months ago (2012-02-16 20:40:36 UTC) #6

The objection you cited, due to "cleverness" in the compiler, is certainly
an issue.

If the string is garbage, it does not need to contain a null termination.
 As a result, any attempt to strcpy or strcmp could "run off the end" of
the string. If the allocated string was at the end of a page, adjacent to
unmapped memory, then you'd get a segv in such an access.

I guess if the only thing you do is copy the pointer.... and significantly
delay accessing the referenced memory... it would be safer <sigh>, but
there seems to be no reason to leave the code racing.

Also keep in mind that our code is now running on ARM for Android/Clank, so
the fact that x86 is "friendlier" with regard to cache coherence is not
always going to protect us.

...and clank is running on the TOT.

Jim

On Thu, Feb 16, 2012 at 11:45 AM, <jbates@chromium.org> wrote:

> On 2012/02/16 17:50:02, jar wrote:
>
>> To help resolve this, I've added an expert (Dmitry) in threading to the CC
>>
> list.
>
>>  He and Wan Teh have schooled me on this topic in the past year or two (as
>> histogram.h evolved to its current state).  They can probably cite
>> educational
>> material that goes into all the stuff (and more), but they can also chime
>> in
>>
> if
>
>> I say any of this incorrectly.
>>
>
>  The reason why the code in src/base/debug/trace_event.h (see the static
>> initializer around line 491 and the associated ANNOTATE_BENIGN_RACE
>> macro) is
>> not yet safe is that you're counting on both the pointer, *and* the
>> pointed-to-contents being visible on the receiving thread.
>>
>
>  When a receiving thread sees the pointer, there is no guarantee
>> (currently)
>>
> that
>
>> the pointed-to-contents will also be visible in the receiving core's
>> cache...
>> and in fact, old data may be sitting in that core's cache.  As a result,
>> the
>> receiving core may deref the pointer, and reach garbage.  For instance, it
>>
> could
>
>> see an unterminated char* (if this is indeed a c-string) leading to
>>
> inaccessible
>
>> memory.
>>
>
>  The the rule is simple for this pattern: Both the sending and receiving
>>
> threads
>
>> *must* use a barrier.  The fact that you don't currently have a barrier
>> on the
>> receiving (fast) path is what causes the race.
>>
>
>  You can see some additional discussion of this topic in the context of
>> histograms in the bug http://code.google.com/p/**
>>
chromium/issues/detail?id=**88519<http://code.google.com/p/chromium/issues/detail?id=88519>
>> or in the resolving CL
>>
http://codereview.chromium.**org/7535006<http://codereview.chromium.org/7535006>
>>
>
>
>  Jim
>>
>
> I think we are saved by one simple difference from the histograms code: the
> tracing macros operate safely even if they read garbage data from the
> static
> unsigned char.
>
> After reading through
> http://code.google.com/p/data-**race-test/wiki/AboutRaces#**
>
Racy_publication<http://code.google.com/p/data-race-test/wiki/AboutRaces#Racy_publication>though,
> I'm slightly less sure of myself regarding "Clarification 4". Considering
> that
> Acquire_Load and Release_Store are essentially nops on x86, I'm not
> opposed to
> changing trace_event to match histogram macros. For my own sanity though, I
> would love to have the author of "Clarification 4" provide references so I
> can
> edubicate myself!
>
>
https://chromiumcodereview.**appspot.com/9405034/<https://chromiumcodereview....
>

nduca

On 2012/02/16 20:40:36, jar wrote: > The objection you cited, due to "cleverness" in the ...

8 years, 10 months ago (2012-02-16 21:00:20 UTC) #7

jbates

On 2012/02/16 20:40:36, jar wrote: > The objection you cited, due to "cleverness" in the ...

8 years, 10 months ago (2012-02-16 22:59:31 UTC) #8

On 2012/02/16 20:40:36, jar wrote:
> The objection you cited, due to "cleverness" in the compiler, is certainly
> an issue.
> 
> If the string is garbage,

As we discussed, it's not a string, despite being a char* (I'll use a typedef to
make the code more readable). The macro code simply reads the value of a single
unsigned char. If it's 0, one branch is taken. If it's non-zero, another branch
is taken. Both branches are safe at all times, so it should handle garbage data
without risk.

What we're left with is the compiler doing something stupid. I'm convinced that
a theoretical compiler could take advantage of the single-threaded C spec and do
something insane such as described in "Clarification 4". (ie: load the unsigned
char *twice* into two different registers, one NULL, one non-NULL, and use those
registers later in any order.)

So, it looks like we need a NoBarrier_Load on the static pointer to ensure that
the compiler doesn't mess up. I will do that in a separate CL and cc the atomic
experts here.

> it does not need to contain a null termination.
>  As a result, any attempt to strcpy or strcmp could "run off the end" of
> the string. If the allocated string was at the end of a page, adjacent to
> unmapped memory, then you'd get a segv in such an access.
> 
> I guess if the only thing you do is copy the pointer.... and significantly
> delay accessing the referenced memory... it would be safer <sigh>, but
> there seems to be no reason to leave the code racing.
> 
> Also keep in mind that our code is now running on ARM for Android/Clank, so
> the fact that x86 is "friendlier" with regard to cache coherence is not
> always going to protect us.
> 
> ...and clank is running on the TOT.
> 
> Jim
> 
> On Thu, Feb 16, 2012 at 11:45 AM, <mailto:jbates@chromium.org> wrote:
> 
> > On 2012/02/16 17:50:02, jar wrote:
> >
> >> To help resolve this, I've added an expert (Dmitry) in threading to the CC
> >>
> > list.
> >
> >>  He and Wan Teh have schooled me on this topic in the past year or two (as
> >> histogram.h evolved to its current state).  They can probably cite
> >> educational
> >> material that goes into all the stuff (and more), but they can also chime
> >> in
> >>
> > if
> >
> >> I say any of this incorrectly.
> >>
> >
> >  The reason why the code in src/base/debug/trace_event.h (see the static
> >> initializer around line 491 and the associated ANNOTATE_BENIGN_RACE
> >> macro) is
> >> not yet safe is that you're counting on both the pointer, *and* the
> >> pointed-to-contents being visible on the receiving thread.
> >>
> >
> >  When a receiving thread sees the pointer, there is no guarantee
> >> (currently)
> >>
> > that
> >
> >> the pointed-to-contents will also be visible in the receiving core's
> >> cache...
> >> and in fact, old data may be sitting in that core's cache.  As a result,
> >> the
> >> receiving core may deref the pointer, and reach garbage.  For instance, it
> >>
> > could
> >
> >> see an unterminated char* (if this is indeed a c-string) leading to
> >>
> > inaccessible
> >
> >> memory.
> >>
> >
> >  The the rule is simple for this pattern: Both the sending and receiving
> >>
> > threads
> >
> >> *must* use a barrier.  The fact that you don't currently have a barrier
> >> on the
> >> receiving (fast) path is what causes the race.
> >>
> >
> >  You can see some additional discussion of this topic in the context of
> >> histograms in the bug http://code.google.com/p/**
> >>
>
chromium/issues/detail?id=**88519<http://code.google.com/p/chromium/issues/detail?id=88519>
> >> or in the resolving CL
> >>
>
http://codereview.chromium.**org/7535006%3Chttp://codereview.chromium.org/753...>
> >>
> >
> >
> >  Jim
> >>
> >
> > I think we are saved by one simple difference from the histograms code: the
> > tracing macros operate safely even if they read garbage data from the
> > static
> > unsigned char.
> >
> > After reading through
> > http://code.google.com/p/data-**race-test/wiki/AboutRaces#**
> >
>
Racy_publication<http://code.google.com/p/data-race-test/wiki/AboutRaces#Racy_publication>though,
> > I'm slightly less sure of myself regarding "Clarification 4". Considering
> > that
> > Acquire_Load and Release_Store are essentially nops on x86, I'm not
> > opposed to
> > changing trace_event to match histogram macros. For my own sanity though, I
> > would love to have the author of "Clarification 4" provide references so I
> > can
> > edubicate myself!
> >
> >
>
https://chromiumcodereview.**appspot.com/9405034/%3Chttps://chromiumcoderevie...>
> >

dvyukov

On Fri, Feb 17, 2012 at 2:59 AM, <jbates@chromium.org> wrote: > On 2012/02/16 20:40:36, jar ...

8 years, 10 months ago (2012-02-17 16:34:16 UTC) #9

On Fri, Feb 17, 2012 at 2:59 AM, <jbates@chromium.org> wrote:

> On 2012/02/16 20:40:36, jar wrote:
>
>> The objection you cited, due to "cleverness" in the compiler, is certainly
>> an issue.
>>
>
>  If the string is garbage,
>>
>
> As we discussed, it's not a string, despite being a char* (I'll use a
> typedef to
> make the code more readable). The macro code simply reads the value of a
> single
> unsigned char. If it's 0, one branch is taken. If it's non-zero, another
> branch
> is taken. Both branches are safe at all times, so it should handle garbage
> data
> without risk.
>
> What we're left with is the compiler doing something stupid. I'm convinced
> that
> a theoretical compiler could take advantage of the single-threaded C spec
> and do
> something insane such as described in "Clarification 4". (ie: load the
> unsigned
> char *twice* into two different registers, one NULL, one non-NULL, and use
> those
> registers later in any order.)
>
> So, it looks like we need a NoBarrier_Load on the static pointer to ensure
> that
> the compiler doesn't mess up. I will do that in a separate CL and cc the
> atomic
> experts here.
>
>
I agree that it hardly can break even on weekly ordered ARM.
Here are 3 moments:
1. Readers that access the data w/o a mutex only read it.
2. Readers are OK with any garbage.
3. The racy data is global. When we publish a pointer to a global data it's
different from publishing a ptr to a dynamically allocated data, because
global data is visible to all threads from process start.

However, formally the code contains data races, undefined behavior,
bla-bla-bla.
I would use NoBarrier_Load/Store to access both the pointer and the
pointed-to data. The patch solves first part of the problem. As for
pointed-to data, there are no 1-byte atomic operations and so it would
require changing char's to int's, so it's not a win-win solution... Hey,
now I've actually taken a look at TRACE_EVENT_MAX_CATEGORIES, and it is
just 100. It's only 300 bytes!

I also have some suggestions regarding performance if you are interested.
As far as I understand the fast-path is quite performance critical and code
bloat is important.
What you have now is along the lines of:

void func1() {
static char* flagptr;
char* local = flagptr;
if (__builtin_expect(local == 0, 0)) {
local = register1("foo");
flagptr = local;
}
if (__builtin_expect(*local, 0))
trace("foo", 42);
printf("real work");
}

It has 2 loads from memory, moreover the second is dependent on the first,
+ 2 conditional branches:

0000000000400600 <func1>:
  400600:       48 83 ec 08             sub    $0x8,%rsp
  400604:       48 8b 05 35 0a 20 00    mov    0x200a35(%rip),%rax        #
load the flagptr
  40060b:       48 85 c0                test   %rax,%rax                  #
check flagptr value
  40060e:       74 15                   je     400625 <func1+0x25>        #
branch to register slow-path
  400610:       80 38 00                cmpb   $0x0,(%rax)                #
load the flag
  400613:       75 23                   jne    400638 <func1+0x38>        #
branch to trace slow-path
# below is functions "real work"
  400615:       bf 40 07 40 00          mov    $0x400740,%edi
  40061a:       31 c0                   xor    %eax,%eax
  40061c:       48 83 c4 08             add    $0x8,%rsp
  400620:       e9 2b fe ff ff          jmpq   400450 <printf@plt>
# below are 2 slow-paths - register and trace
  400625:       bf 3c 07 40 00          mov    $0x40073c,%edi
  40062a:       e8 b1 ff ff ff          callq  4005e0 <register1>
  40062f:       48 89 05 0a 0a 20 00    mov    %rax,0x200a0a(%rip)        #
601040 <flagptr.2338>
  400636:       eb d8                   jmp    400610 <func1+0x10>
  400638:       be 2a 00 00 00          mov    $0x2a,%esi
  40063d:       bf 3c 07 40 00          mov    $0x40073c,%edi
  400642:       e8 89 ff ff ff          callq  4005d0 <trace>
  400647:       eb cc                   jmp    400615 <func1+0x15>

What you do is register internal flag in user's variable. I would do it
vise versa - register user's flag in internal variable:

void func2() {
static char flag = -1;
if (__builtin_expect(flag, 0))
register_or_trace("foo", &flag, 42);
printf("real work");
}

flag's values are: -1 - requires registration, 0 - registered+tracing
disabled, 1 - registered+tracing enabled. It has only 1 load and 1 branch:

0000000000400590 <func2>:
  400590:       48 83 ec 08             sub    $0x8,%rsp
  400594:       80 3d 8d 0a 20 00 00    cmpb   $0x0,0x200a8d(%rip)        #
load the flag
  40059b:       75 10                   jne    4005ad <func2+0x1d>        #
branch to slow-path (either need to register or to trace)
# below is functions "real work"
  40059d:       bf 40 07 40 00          mov    $0x400740,%edi
  4005a2:       31 c0                   xor    %eax,%eax
  4005a4:       48 83 c4 08             add    $0x8,%rsp
  4005a8:       e9 a3 fe ff ff          jmpq   400450 <printf@plt>
# below is the slow-path
  4005ad:       ba 2a 00 00 00          mov    $0x2a,%edx
  4005b2:       be 28 10 60 00          mov    $0x601028,%esi
  4005b7:       bf 3c 07 40 00          mov    $0x40073c,%edi
  4005bc:       e8 bf ff ff ff          callq  400580 <register_or_trace>
  4005c1:       eb da                   jmp    40059d <func2+0xd>

So it's 9 bytes of code vs 17 bytes of code for fast-path (and 22 vs 36
bytes of code for slow-path) + considerably faster code: -1 branch, -1
memory load, -1 data dependency, -1 cache line in cache.

As for 'tracing enabled' slow-path. As far as I understand, it has a single
event buffer and a single mutex. It would make sense to have small
fixed-size per-thread buffers and mutexes. Threads flush their buffers when
they are full. If the most recent data is required to output, then the
output thread can iterate over all threads, lock their mutexes and flush
their buffers... Hey, wait, it's not a server application, it has 1,5
active threads most of the time. Well, maybe it makes no sense, you know
better.

And... taking the opportunity, what about that change :)
http://codereview.chromium.org/8772004/

jbates

On 2012/02/17 16:34:16, dvyukov wrote: > On Fri, Feb 17, 2012 at 2:59 AM, <mailto:jbates@chromium.org> ...

8 years, 10 months ago (2012-02-21 18:13:41 UTC) #10

On 2012/02/17 16:34:16, dvyukov wrote:
> On Fri, Feb 17, 2012 at 2:59 AM, <mailto:jbates@chromium.org> wrote:
> 
> > On 2012/02/16 20:40:36, jar wrote:
> >
> >> The objection you cited, due to "cleverness" in the compiler, is certainly
> >> an issue.
> >>
> >
> >  If the string is garbage,
> >>
> >
> > As we discussed, it's not a string, despite being a char* (I'll use a
> > typedef to
> > make the code more readable). The macro code simply reads the value of a
> > single
> > unsigned char. If it's 0, one branch is taken. If it's non-zero, another
> > branch
> > is taken. Both branches are safe at all times, so it should handle garbage
> > data
> > without risk.
> >
> > What we're left with is the compiler doing something stupid. I'm convinced
> > that
> > a theoretical compiler could take advantage of the single-threaded C spec
> > and do
> > something insane such as described in "Clarification 4". (ie: load the
> > unsigned
> > char *twice* into two different registers, one NULL, one non-NULL, and use
> > those
> > registers later in any order.)
> >
> > So, it looks like we need a NoBarrier_Load on the static pointer to ensure
> > that
> > the compiler doesn't mess up. I will do that in a separate CL and cc the
> > atomic
> > experts here.
> >
> >
> I agree that it hardly can break even on weekly ordered ARM.
> Here are 3 moments:
> 1. Readers that access the data w/o a mutex only read it.
> 2. Readers are OK with any garbage.
> 3. The racy data is global. When we publish a pointer to a global data it's
> different from publishing a ptr to a dynamically allocated data, because
> global data is visible to all threads from process start.
> 
> However, formally the code contains data races, undefined behavior,
> bla-bla-bla.
> I would use NoBarrier_Load/Store to access both the pointer and the
> pointed-to data. The patch solves first part of the problem. As for
> pointed-to data, there are no 1-byte atomic operations and so it would
> require changing char's to int's, so it's not a win-win solution... Hey,
> now I've actually taken a look at TRACE_EVENT_MAX_CATEGORIES, and it is
> just 100. It's only 300 bytes!
> 
> I also have some suggestions regarding performance if you are interested.
> As far as I understand the fast-path is quite performance critical and code
> bloat is important.
> What you have now is along the lines of:
> 
> void func1() {
> static char* flagptr;
> char* local = flagptr;
> if (__builtin_expect(local == 0, 0)) {
> local = register1("foo");
> flagptr = local;
> }
> if (__builtin_expect(*local, 0))
> trace("foo", 42);
> printf("real work");
> }
> 
> It has 2 loads from memory, moreover the second is dependent on the first,
> + 2 conditional branches:

Awesome idea! I will try this soon. It would also reduce the API complexity for
exporting trace event to third party libraries.

> 
> 0000000000400600 <func1>:
>   400600:       48 83 ec 08             sub    $0x8,%rsp
>   400604:       48 8b 05 35 0a 20 00    mov    0x200a35(%rip),%rax        #
> load the flagptr
>   40060b:       48 85 c0                test   %rax,%rax                  #
> check flagptr value
>   40060e:       74 15                   je     400625 <func1+0x25>        #
> branch to register slow-path
>   400610:       80 38 00                cmpb   $0x0,(%rax)                #
> load the flag
>   400613:       75 23                   jne    400638 <func1+0x38>        #
> branch to trace slow-path
> # below is functions "real work"
>   400615:       bf 40 07 40 00          mov    $0x400740,%edi
>   40061a:       31 c0                   xor    %eax,%eax
>   40061c:       48 83 c4 08             add    $0x8,%rsp
>   400620:       e9 2b fe ff ff          jmpq   400450 <printf@plt>
> # below are 2 slow-paths - register and trace
>   400625:       bf 3c 07 40 00          mov    $0x40073c,%edi
>   40062a:       e8 b1 ff ff ff          callq  4005e0 <register1>
>   40062f:       48 89 05 0a 0a 20 00    mov    %rax,0x200a0a(%rip)        #
> 601040 <flagptr.2338>
>   400636:       eb d8                   jmp    400610 <func1+0x10>
>   400638:       be 2a 00 00 00          mov    $0x2a,%esi
>   40063d:       bf 3c 07 40 00          mov    $0x40073c,%edi
>   400642:       e8 89 ff ff ff          callq  4005d0 <trace>
>   400647:       eb cc                   jmp    400615 <func1+0x15>
> 
> What you do is register internal flag in user's variable. I would do it
> vise versa - register user's flag in internal variable:
> 
> void func2() {
> static char flag = -1;
> if (__builtin_expect(flag, 0))
> register_or_trace("foo", &flag, 42);
> printf("real work");
> }
> 
> flag's values are: -1 - requires registration, 0 - registered+tracing
> disabled, 1 - registered+tracing enabled. It has only 1 load and 1 branch:
> 
> 0000000000400590 <func2>:
>   400590:       48 83 ec 08             sub    $0x8,%rsp
>   400594:       80 3d 8d 0a 20 00 00    cmpb   $0x0,0x200a8d(%rip)        #
> load the flag
>   40059b:       75 10                   jne    4005ad <func2+0x1d>        #
> branch to slow-path (either need to register or to trace)
> # below is functions "real work"
>   40059d:       bf 40 07 40 00          mov    $0x400740,%edi
>   4005a2:       31 c0                   xor    %eax,%eax
>   4005a4:       48 83 c4 08             add    $0x8,%rsp
>   4005a8:       e9 a3 fe ff ff          jmpq   400450 <printf@plt>
> # below is the slow-path
>   4005ad:       ba 2a 00 00 00          mov    $0x2a,%edx
>   4005b2:       be 28 10 60 00          mov    $0x601028,%esi
>   4005b7:       bf 3c 07 40 00          mov    $0x40073c,%edi
>   4005bc:       e8 bf ff ff ff          callq  400580 <register_or_trace>
>   4005c1:       eb da                   jmp    40059d <func2+0xd>
> 
> So it's 9 bytes of code vs 17 bytes of code for fast-path (and 22 vs 36
> bytes of code for slow-path) + considerably faster code: -1 branch, -1
> memory load, -1 data dependency, -1 cache line in cache.
> 
> As for 'tracing enabled' slow-path. As far as I understand, it has a single
> event buffer and a single mutex. It would make sense to have small
> fixed-size per-thread buffers and mutexes. Threads flush their buffers when
> they are full. If the most recent data is required to output, then the
> output thread can iterate over all threads, lock their mutexes and flush
> their buffers... Hey, wait, it's not a server application, it has 1,5
> active threads most of the time. Well, maybe it makes no sense, you know
> better.
> 
> And... taking the opportunity, what about that change :)
> http://codereview.chromium.org/8772004/

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbates@chromium.org/9405034/1

8 years, 10 months ago (2012-02-23 00:42:30 UTC) #12

commit-bot: I haz the power

Try job failure for 9405034-1 (retry) (retry) on win_rel for step "ui_tests". It's a second ...

8 years, 10 months ago (2012-02-23 05:06:36 UTC) #13

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbates@chromium.org/9405034/1

8 years, 10 months ago (2012-02-23 17:34:44 UTC) #14

Change committed as 123293

Issue 9405034: Allow task tracing in official chrome. (Closed)

Description

Patch Set 1 #

Messages