|
|
Created:
8 years, 10 months ago by jbates Modified:
8 years, 10 months ago Reviewers:
jar (doing other things) CC:
chromium-reviews, dvyukov, wtc, willchan no longer on Chromium, nduca Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
DescriptionAllow task tracing in official chrome.
We have had various bug reports where it would have been helpful to see tasks in about:tracing reports.
R=jar
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=123293
Patch Set 1 #
Messages
Total messages: 15 (0 generated)
ptal
This is pretty cool/exciting stuff! :-) I don't think you ever got around to changing the static to use a memory barrier, so I'd expect to see a lot of crashes during the initialization and pointer-test in the field. I think I used the pattern you have now(?) initially on some of my histograms macros, and it became a notable crasher. Am I mistaken, and was that taken care of? It could be the fact that you only have this macro in one place will hold down the odds... as I had literally 100's of uses of the HISTOGRAM_* macros across the codebase, many on competing threads... but you've chosen *the* most racy place in the system (message loop), and hence I think you'd hit some of these issues. Am I correct that the only run-time cost (while dynamically disabled) will be the testing of the static pointer, and the examination referenced structure (vs some distant on/off switch) to see if you even need to do any of the string-compare tests? Thanks.
If you're referring to our use of a static variable to store the çategory struct, the key thing to realize is that its completely safe if it is double set. The way to follow this is that we have this: static TraceCategroy* __cat = 0; if(!cat) __cat = TraceLog::GetInstance()->GetCategory("cat"); GetCategory returns a pointer, but the Trace system owns the pointer. GetCategory always returns the same thing for a given string. If two threads see the category, then both will go call GetCategory. The GetCategory itself acquires a lock, then searches through a global list for the category pointer, and returns it. GetCategory("foo") will always return the same pointer. Whoever wins the race will set the pointer. Then the next person who wins the race will set the pointer to the same thing. Voila, safe.
To help resolve this, I've added an expert (Dmitry) in threading to the CC list. He and Wan Teh have schooled me on this topic in the past year or two (as histogram.h evolved to its current state). They can probably cite educational material that goes into all the stuff (and more), but they can also chime in if I say any of this incorrectly. The reason why the code in src/base/debug/trace_event.h (see the static initializer around line 491 and the associated ANNOTATE_BENIGN_RACE macro) is not yet safe is that you're counting on both the pointer, *and* the pointed-to-contents being visible on the receiving thread. When a receiving thread sees the pointer, there is no guarantee (currently) that the pointed-to-contents will also be visible in the receiving core's cache... and in fact, old data may be sitting in that core's cache. As a result, the receiving core may deref the pointer, and reach garbage. For instance, it could see an unterminated char* (if this is indeed a c-string) leading to inaccessible memory. The the rule is simple for this pattern: Both the sending and receiving threads *must* use a barrier. The fact that you don't currently have a barrier on the receiving (fast) path is what causes the race. You can see some additional discussion of this topic in the context of histograms in the bug http://code.google.com/p/chromium/issues/detail?id=88519 or in the resolving CL http://codereview.chromium.org/7535006 Jim
On 2012/02/16 17:50:02, jar wrote: > To help resolve this, I've added an expert (Dmitry) in threading to the CC list. > He and Wan Teh have schooled me on this topic in the past year or two (as > histogram.h evolved to its current state). They can probably cite educational > material that goes into all the stuff (and more), but they can also chime in if > I say any of this incorrectly. > > The reason why the code in src/base/debug/trace_event.h (see the static > initializer around line 491 and the associated ANNOTATE_BENIGN_RACE macro) is > not yet safe is that you're counting on both the pointer, *and* the > pointed-to-contents being visible on the receiving thread. > > When a receiving thread sees the pointer, there is no guarantee (currently) that > the pointed-to-contents will also be visible in the receiving core's cache... > and in fact, old data may be sitting in that core's cache. As a result, the > receiving core may deref the pointer, and reach garbage. For instance, it could > see an unterminated char* (if this is indeed a c-string) leading to inaccessible > memory. > > The the rule is simple for this pattern: Both the sending and receiving threads > *must* use a barrier. The fact that you don't currently have a barrier on the > receiving (fast) path is what causes the race. > > You can see some additional discussion of this topic in the context of > histograms in the bug http://code.google.com/p/chromium/issues/detail?id=88519 > or in the resolving CL > http://codereview.chromium.org/7535006 > > > Jim I think we are saved by one simple difference from the histograms code: the tracing macros operate safely even if they read garbage data from the static unsigned char. After reading through http://code.google.com/p/data-race-test/wiki/AboutRaces#Racy_publication though, I'm slightly less sure of myself regarding "Clarification 4". Considering that Acquire_Load and Release_Store are essentially nops on x86, I'm not opposed to changing trace_event to match histogram macros. For my own sanity though, I would love to have the author of "Clarification 4" provide references so I can edubicate myself!
The objection you cited, due to "cleverness" in the compiler, is certainly an issue. If the string is garbage, it does not need to contain a null termination. As a result, any attempt to strcpy or strcmp could "run off the end" of the string. If the allocated string was at the end of a page, adjacent to unmapped memory, then you'd get a segv in such an access. I guess if the only thing you do is copy the pointer.... and significantly delay accessing the referenced memory... it would be safer <sigh>, but there seems to be no reason to leave the code racing. Also keep in mind that our code is now running on ARM for Android/Clank, so the fact that x86 is "friendlier" with regard to cache coherence is not always going to protect us. ...and clank is running on the TOT. Jim On Thu, Feb 16, 2012 at 11:45 AM, <jbates@chromium.org> wrote: > On 2012/02/16 17:50:02, jar wrote: > >> To help resolve this, I've added an expert (Dmitry) in threading to the CC >> > list. > >> He and Wan Teh have schooled me on this topic in the past year or two (as >> histogram.h evolved to its current state). They can probably cite >> educational >> material that goes into all the stuff (and more), but they can also chime >> in >> > if > >> I say any of this incorrectly. >> > > The reason why the code in src/base/debug/trace_event.h (see the static >> initializer around line 491 and the associated ANNOTATE_BENIGN_RACE >> macro) is >> not yet safe is that you're counting on both the pointer, *and* the >> pointed-to-contents being visible on the receiving thread. >> > > When a receiving thread sees the pointer, there is no guarantee >> (currently) >> > that > >> the pointed-to-contents will also be visible in the receiving core's >> cache... >> and in fact, old data may be sitting in that core's cache. As a result, >> the >> receiving core may deref the pointer, and reach garbage. For instance, it >> > could > >> see an unterminated char* (if this is indeed a c-string) leading to >> > inaccessible > >> memory. >> > > The the rule is simple for this pattern: Both the sending and receiving >> > threads > >> *must* use a barrier. The fact that you don't currently have a barrier >> on the >> receiving (fast) path is what causes the race. >> > > You can see some additional discussion of this topic in the context of >> histograms in the bug http://code.google.com/p/** >> chromium/issues/detail?id=**88519<http://code.google.com/p/chromium/issues/detail?id=88519> >> or in the resolving CL >> http://codereview.chromium.**org/7535006<http://codereview.chromium.org/7535006> >> > > > Jim >> > > I think we are saved by one simple difference from the histograms code: the > tracing macros operate safely even if they read garbage data from the > static > unsigned char. > > After reading through > http://code.google.com/p/data-**race-test/wiki/AboutRaces#** > Racy_publication<http://code.google.com/p/data-race-test/wiki/AboutRaces#Racy_publication>though, > I'm slightly less sure of myself regarding "Clarification 4". Considering > that > Acquire_Load and Release_Store are essentially nops on x86, I'm not > opposed to > changing trace_event to match histogram macros. For my own sanity though, I > would love to have the author of "Clarification 4" provide references so I > can > edubicate myself! > > https://chromiumcodereview.**appspot.com/9405034/<https://chromiumcodereview.... >
On 2012/02/16 20:40:36, jar wrote: > The objection you cited, due to "cleverness" in the compiler, is certainly > an issue. > > If the string is garbage, it does not need to contain a null termination. > As a result, any attempt to strcpy or strcmp could "run off the end" of > the string. If the allocated string was at the end of a page, adjacent to > unmapped memory, then you'd get a segv in such an access. > I'm sorry, you lost me again. What strcmp are you talkign about? I know of no such operation. Also, I think your scenario doesn't happen. The only time incoherent caches happen is when you miss on the category pointer. In that case, all accesses to the cateogry struct go through a mutex, which itself has fencing behavior. That is to say, I'm still pretty sure its safe.
On 2012/02/16 20:40:36, jar wrote: > The objection you cited, due to "cleverness" in the compiler, is certainly > an issue. > > If the string is garbage, As we discussed, it's not a string, despite being a char* (I'll use a typedef to make the code more readable). The macro code simply reads the value of a single unsigned char. If it's 0, one branch is taken. If it's non-zero, another branch is taken. Both branches are safe at all times, so it should handle garbage data without risk. What we're left with is the compiler doing something stupid. I'm convinced that a theoretical compiler could take advantage of the single-threaded C spec and do something insane such as described in "Clarification 4". (ie: load the unsigned char *twice* into two different registers, one NULL, one non-NULL, and use those registers later in any order.) So, it looks like we need a NoBarrier_Load on the static pointer to ensure that the compiler doesn't mess up. I will do that in a separate CL and cc the atomic experts here. > it does not need to contain a null termination. > As a result, any attempt to strcpy or strcmp could "run off the end" of > the string. If the allocated string was at the end of a page, adjacent to > unmapped memory, then you'd get a segv in such an access. > > I guess if the only thing you do is copy the pointer.... and significantly > delay accessing the referenced memory... it would be safer <sigh>, but > there seems to be no reason to leave the code racing. > > Also keep in mind that our code is now running on ARM for Android/Clank, so > the fact that x86 is "friendlier" with regard to cache coherence is not > always going to protect us. > > ...and clank is running on the TOT. > > Jim > > On Thu, Feb 16, 2012 at 11:45 AM, <mailto:jbates@chromium.org> wrote: > > > On 2012/02/16 17:50:02, jar wrote: > > > >> To help resolve this, I've added an expert (Dmitry) in threading to the CC > >> > > list. > > > >> He and Wan Teh have schooled me on this topic in the past year or two (as > >> histogram.h evolved to its current state). They can probably cite > >> educational > >> material that goes into all the stuff (and more), but they can also chime > >> in > >> > > if > > > >> I say any of this incorrectly. > >> > > > > The reason why the code in src/base/debug/trace_event.h (see the static > >> initializer around line 491 and the associated ANNOTATE_BENIGN_RACE > >> macro) is > >> not yet safe is that you're counting on both the pointer, *and* the > >> pointed-to-contents being visible on the receiving thread. > >> > > > > When a receiving thread sees the pointer, there is no guarantee > >> (currently) > >> > > that > > > >> the pointed-to-contents will also be visible in the receiving core's > >> cache... > >> and in fact, old data may be sitting in that core's cache. As a result, > >> the > >> receiving core may deref the pointer, and reach garbage. For instance, it > >> > > could > > > >> see an unterminated char* (if this is indeed a c-string) leading to > >> > > inaccessible > > > >> memory. > >> > > > > The the rule is simple for this pattern: Both the sending and receiving > >> > > threads > > > >> *must* use a barrier. The fact that you don't currently have a barrier > >> on the > >> receiving (fast) path is what causes the race. > >> > > > > You can see some additional discussion of this topic in the context of > >> histograms in the bug http://code.google.com/p/** > >> > chromium/issues/detail?id=**88519<http://code.google.com/p/chromium/issues/detail?id=88519> > >> or in the resolving CL > >> > http://codereview.chromium.**org/7535006%3Chttp://codereview.chromium.org/753...> > >> > > > > > > Jim > >> > > > > I think we are saved by one simple difference from the histograms code: the > > tracing macros operate safely even if they read garbage data from the > > static > > unsigned char. > > > > After reading through > > http://code.google.com/p/data-**race-test/wiki/AboutRaces#** > > > Racy_publication<http://code.google.com/p/data-race-test/wiki/AboutRaces#Racy_publication>though, > > I'm slightly less sure of myself regarding "Clarification 4". Considering > > that > > Acquire_Load and Release_Store are essentially nops on x86, I'm not > > opposed to > > changing trace_event to match histogram macros. For my own sanity though, I > > would love to have the author of "Clarification 4" provide references so I > > can > > edubicate myself! > > > > > https://chromiumcodereview.**appspot.com/9405034/%3Chttps://chromiumcoderevie...> > >
On Fri, Feb 17, 2012 at 2:59 AM, <jbates@chromium.org> wrote: > On 2012/02/16 20:40:36, jar wrote: > >> The objection you cited, due to "cleverness" in the compiler, is certainly >> an issue. >> > > If the string is garbage, >> > > As we discussed, it's not a string, despite being a char* (I'll use a > typedef to > make the code more readable). The macro code simply reads the value of a > single > unsigned char. If it's 0, one branch is taken. If it's non-zero, another > branch > is taken. Both branches are safe at all times, so it should handle garbage > data > without risk. > > What we're left with is the compiler doing something stupid. I'm convinced > that > a theoretical compiler could take advantage of the single-threaded C spec > and do > something insane such as described in "Clarification 4". (ie: load the > unsigned > char *twice* into two different registers, one NULL, one non-NULL, and use > those > registers later in any order.) > > So, it looks like we need a NoBarrier_Load on the static pointer to ensure > that > the compiler doesn't mess up. I will do that in a separate CL and cc the > atomic > experts here. > > I agree that it hardly can break even on weekly ordered ARM. Here are 3 moments: 1. Readers that access the data w/o a mutex only read it. 2. Readers are OK with any garbage. 3. The racy data is global. When we publish a pointer to a global data it's different from publishing a ptr to a dynamically allocated data, because global data is visible to all threads from process start. However, formally the code contains data races, undefined behavior, bla-bla-bla. I would use NoBarrier_Load/Store to access both the pointer and the pointed-to data. The patch solves first part of the problem. As for pointed-to data, there are no 1-byte atomic operations and so it would require changing char's to int's, so it's not a win-win solution... Hey, now I've actually taken a look at TRACE_EVENT_MAX_CATEGORIES, and it is just 100. It's only 300 bytes! I also have some suggestions regarding performance if you are interested. As far as I understand the fast-path is quite performance critical and code bloat is important. What you have now is along the lines of: void func1() { static char* flagptr; char* local = flagptr; if (__builtin_expect(local == 0, 0)) { local = register1("foo"); flagptr = local; } if (__builtin_expect(*local, 0)) trace("foo", 42); printf("real work"); } It has 2 loads from memory, moreover the second is dependent on the first, + 2 conditional branches: 0000000000400600 <func1>: 400600: 48 83 ec 08 sub $0x8,%rsp 400604: 48 8b 05 35 0a 20 00 mov 0x200a35(%rip),%rax # load the flagptr 40060b: 48 85 c0 test %rax,%rax # check flagptr value 40060e: 74 15 je 400625 <func1+0x25> # branch to register slow-path 400610: 80 38 00 cmpb $0x0,(%rax) # load the flag 400613: 75 23 jne 400638 <func1+0x38> # branch to trace slow-path # below is functions "real work" 400615: bf 40 07 40 00 mov $0x400740,%edi 40061a: 31 c0 xor %eax,%eax 40061c: 48 83 c4 08 add $0x8,%rsp 400620: e9 2b fe ff ff jmpq 400450 <printf@plt> # below are 2 slow-paths - register and trace 400625: bf 3c 07 40 00 mov $0x40073c,%edi 40062a: e8 b1 ff ff ff callq 4005e0 <register1> 40062f: 48 89 05 0a 0a 20 00 mov %rax,0x200a0a(%rip) # 601040 <flagptr.2338> 400636: eb d8 jmp 400610 <func1+0x10> 400638: be 2a 00 00 00 mov $0x2a,%esi 40063d: bf 3c 07 40 00 mov $0x40073c,%edi 400642: e8 89 ff ff ff callq 4005d0 <trace> 400647: eb cc jmp 400615 <func1+0x15> What you do is register internal flag in user's variable. I would do it vise versa - register user's flag in internal variable: void func2() { static char flag = -1; if (__builtin_expect(flag, 0)) register_or_trace("foo", &flag, 42); printf("real work"); } flag's values are: -1 - requires registration, 0 - registered+tracing disabled, 1 - registered+tracing enabled. It has only 1 load and 1 branch: 0000000000400590 <func2>: 400590: 48 83 ec 08 sub $0x8,%rsp 400594: 80 3d 8d 0a 20 00 00 cmpb $0x0,0x200a8d(%rip) # load the flag 40059b: 75 10 jne 4005ad <func2+0x1d> # branch to slow-path (either need to register or to trace) # below is functions "real work" 40059d: bf 40 07 40 00 mov $0x400740,%edi 4005a2: 31 c0 xor %eax,%eax 4005a4: 48 83 c4 08 add $0x8,%rsp 4005a8: e9 a3 fe ff ff jmpq 400450 <printf@plt> # below is the slow-path 4005ad: ba 2a 00 00 00 mov $0x2a,%edx 4005b2: be 28 10 60 00 mov $0x601028,%esi 4005b7: bf 3c 07 40 00 mov $0x40073c,%edi 4005bc: e8 bf ff ff ff callq 400580 <register_or_trace> 4005c1: eb da jmp 40059d <func2+0xd> So it's 9 bytes of code vs 17 bytes of code for fast-path (and 22 vs 36 bytes of code for slow-path) + considerably faster code: -1 branch, -1 memory load, -1 data dependency, -1 cache line in cache. As for 'tracing enabled' slow-path. As far as I understand, it has a single event buffer and a single mutex. It would make sense to have small fixed-size per-thread buffers and mutexes. Threads flush their buffers when they are full. If the most recent data is required to output, then the output thread can iterate over all threads, lock their mutexes and flush their buffers... Hey, wait, it's not a server application, it has 1,5 active threads most of the time. Well, maybe it makes no sense, you know better. And... taking the opportunity, what about that change :) http://codereview.chromium.org/8772004/
On 2012/02/17 16:34:16, dvyukov wrote: > On Fri, Feb 17, 2012 at 2:59 AM, <mailto:jbates@chromium.org> wrote: > > > On 2012/02/16 20:40:36, jar wrote: > > > >> The objection you cited, due to "cleverness" in the compiler, is certainly > >> an issue. > >> > > > > If the string is garbage, > >> > > > > As we discussed, it's not a string, despite being a char* (I'll use a > > typedef to > > make the code more readable). The macro code simply reads the value of a > > single > > unsigned char. If it's 0, one branch is taken. If it's non-zero, another > > branch > > is taken. Both branches are safe at all times, so it should handle garbage > > data > > without risk. > > > > What we're left with is the compiler doing something stupid. I'm convinced > > that > > a theoretical compiler could take advantage of the single-threaded C spec > > and do > > something insane such as described in "Clarification 4". (ie: load the > > unsigned > > char *twice* into two different registers, one NULL, one non-NULL, and use > > those > > registers later in any order.) > > > > So, it looks like we need a NoBarrier_Load on the static pointer to ensure > > that > > the compiler doesn't mess up. I will do that in a separate CL and cc the > > atomic > > experts here. > > > > > I agree that it hardly can break even on weekly ordered ARM. > Here are 3 moments: > 1. Readers that access the data w/o a mutex only read it. > 2. Readers are OK with any garbage. > 3. The racy data is global. When we publish a pointer to a global data it's > different from publishing a ptr to a dynamically allocated data, because > global data is visible to all threads from process start. > > However, formally the code contains data races, undefined behavior, > bla-bla-bla. > I would use NoBarrier_Load/Store to access both the pointer and the > pointed-to data. The patch solves first part of the problem. As for > pointed-to data, there are no 1-byte atomic operations and so it would > require changing char's to int's, so it's not a win-win solution... Hey, > now I've actually taken a look at TRACE_EVENT_MAX_CATEGORIES, and it is > just 100. It's only 300 bytes! > > I also have some suggestions regarding performance if you are interested. > As far as I understand the fast-path is quite performance critical and code > bloat is important. > What you have now is along the lines of: > > void func1() { > static char* flagptr; > char* local = flagptr; > if (__builtin_expect(local == 0, 0)) { > local = register1("foo"); > flagptr = local; > } > if (__builtin_expect(*local, 0)) > trace("foo", 42); > printf("real work"); > } > > It has 2 loads from memory, moreover the second is dependent on the first, > + 2 conditional branches: Awesome idea! I will try this soon. It would also reduce the API complexity for exporting trace event to third party libraries. > > 0000000000400600 <func1>: > 400600: 48 83 ec 08 sub $0x8,%rsp > 400604: 48 8b 05 35 0a 20 00 mov 0x200a35(%rip),%rax # > load the flagptr > 40060b: 48 85 c0 test %rax,%rax # > check flagptr value > 40060e: 74 15 je 400625 <func1+0x25> # > branch to register slow-path > 400610: 80 38 00 cmpb $0x0,(%rax) # > load the flag > 400613: 75 23 jne 400638 <func1+0x38> # > branch to trace slow-path > # below is functions "real work" > 400615: bf 40 07 40 00 mov $0x400740,%edi > 40061a: 31 c0 xor %eax,%eax > 40061c: 48 83 c4 08 add $0x8,%rsp > 400620: e9 2b fe ff ff jmpq 400450 <printf@plt> > # below are 2 slow-paths - register and trace > 400625: bf 3c 07 40 00 mov $0x40073c,%edi > 40062a: e8 b1 ff ff ff callq 4005e0 <register1> > 40062f: 48 89 05 0a 0a 20 00 mov %rax,0x200a0a(%rip) # > 601040 <flagptr.2338> > 400636: eb d8 jmp 400610 <func1+0x10> > 400638: be 2a 00 00 00 mov $0x2a,%esi > 40063d: bf 3c 07 40 00 mov $0x40073c,%edi > 400642: e8 89 ff ff ff callq 4005d0 <trace> > 400647: eb cc jmp 400615 <func1+0x15> > > What you do is register internal flag in user's variable. I would do it > vise versa - register user's flag in internal variable: > > void func2() { > static char flag = -1; > if (__builtin_expect(flag, 0)) > register_or_trace("foo", &flag, 42); > printf("real work"); > } > > flag's values are: -1 - requires registration, 0 - registered+tracing > disabled, 1 - registered+tracing enabled. It has only 1 load and 1 branch: > > 0000000000400590 <func2>: > 400590: 48 83 ec 08 sub $0x8,%rsp > 400594: 80 3d 8d 0a 20 00 00 cmpb $0x0,0x200a8d(%rip) # > load the flag > 40059b: 75 10 jne 4005ad <func2+0x1d> # > branch to slow-path (either need to register or to trace) > # below is functions "real work" > 40059d: bf 40 07 40 00 mov $0x400740,%edi > 4005a2: 31 c0 xor %eax,%eax > 4005a4: 48 83 c4 08 add $0x8,%rsp > 4005a8: e9 a3 fe ff ff jmpq 400450 <printf@plt> > # below is the slow-path > 4005ad: ba 2a 00 00 00 mov $0x2a,%edx > 4005b2: be 28 10 60 00 mov $0x601028,%esi > 4005b7: bf 3c 07 40 00 mov $0x40073c,%edi > 4005bc: e8 bf ff ff ff callq 400580 <register_or_trace> > 4005c1: eb da jmp 40059d <func2+0xd> > > So it's 9 bytes of code vs 17 bytes of code for fast-path (and 22 vs 36 > bytes of code for slow-path) + considerably faster code: -1 branch, -1 > memory load, -1 data dependency, -1 cache line in cache. > > As for 'tracing enabled' slow-path. As far as I understand, it has a single > event buffer and a single mutex. It would make sense to have small > fixed-size per-thread buffers and mutexes. Threads flush their buffers when > they are full. If the most recent data is required to output, then the > output thread can iterate over all threads, lock their mutexes and flush > their buffers... Hey, wait, it's not a server application, it has 1,5 > active threads most of the time. Well, maybe it makes no sense, you know > better. > > And... taking the opportunity, what about that change :) > http://codereview.chromium.org/8772004/
lgtm
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbates@chromium.org/9405034/1
Try job failure for 9405034-1 (retry) (retry) on win_rel for step "ui_tests". It's a second try, previously, step "ui_tests" failed. http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=win_rel&nu...
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbates@chromium.org/9405034/1
Change committed as 123293 |