Chromium Code Reviews
Created: 4 years, 7 months ago by Dmitry Skiba
Modified: 4 years, 6 months ago
Reviewers: Primiano Tucci (use gerrit), jln (very slow on Chromium), rickyz (no longer on Chrome), Nico, Maria
CC: chromium-reviews, Maria
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.
Description: Check stack pointer to be inside stack when unwinding.
The TraceStackFramePointers() function, which unwinds the stack using frame pointers,
sometimes crashes on Android when it dives deep into JVM internals and finds
a bad stack pointer there. See details here: crbug.com/602701#c18
This CL adds checks to make sure only valid stack pointers are dereferenced.
BUG=602701
Committed: https://crrev.com/0bed5150f19acdd4897dd12ea8e4d4802c51d75f
Cr-Commit-Position: refs/heads/master@{#398134}
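The check the CL describes can be illustrated with a minimal sketch. All names and the frame layout below are illustrative, not the actual Chromium code: the idea is simply that each frame pointer is validated against the known stack bounds before being dereferenced, so a bogus value such as 0xfff2f2ee is rejected instead of crashing the unwinder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical frame-pointer unwinder with a stack-bounds check.
// Assumed frame layout: fp[0] = caller's frame pointer, fp[1] = return address.
size_t UnwindWithBoundsCheck(uintptr_t fp,
                             uintptr_t stack_low,
                             uintptr_t stack_high,
                             const void** out_trace,
                             size_t max_depth) {
  size_t depth = 0;
  while (depth < max_depth) {
    // Reject pointers outside [stack_low, stack_high) BEFORE dereferencing;
    // this is the check that stops crashes on values that merely look like
    // stack addresses.
    if (fp < stack_low || fp + 2 * sizeof(uintptr_t) > stack_high)
      break;
    const uintptr_t* frame = reinterpret_cast<const uintptr_t*>(fp);
    uintptr_t next_fp = frame[0];
    uintptr_t return_address = frame[1];
    if (return_address == 0)
      break;
    out_trace[depth++] = reinterpret_cast<const void*>(return_address);
    // Frame pointers must strictly grow toward the stack base.
    if (next_fp <= fp)
      break;
    fp = next_fp;
  }
  return depth;
}
```

The same shape works whether the stack bounds come from pthread_getattr_np() or from the mincore() estimate discussed later in this thread.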
Patch Set 1
Patch Set 2: rebase
Total comments: 2
Patch Set 3: rebase
Patch Set 4: Check on Linux too; rename function to GetStackStart()
Patch Set 5: Disable Linux
Patch Set 6: Avoid issues on Linux
Patch Set 7: Fix renderer deadlock on Linux
Total comments: 4
Patch Set 8: Implement mincore() approach
Total comments: 11
Patch Set 9: Address comments
Patch Set 10: rebase
Patch Set 11: Revert to the LGTMed patchset 1

Messages
Total messages: 64 (12 generated)
Description was changed from

==========
Check stack pointer to be inside stack when unwinding.

TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android it dives deep into JVM internals and founds bad stack pointer. Details are here: crbug.com/602701#c18

This CL fixes the issue by getting 'end of stack' pointer and checking it before reading from frame pointer.

BUG=602701
==========

to

==========
Check stack pointer to be inside stack when unwinding.

TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android it dives deep into JVM internals and founds bad stack pointer. Details are here: crbug.com/602701#c18

This CL makes sure we never read past 'end of stack' pointer on Android.

BUG=602701
==========
dskiba@google.com changed reviewers: + primiano@chromium.org, thakis@chromium.org
Performance numbers:

Before:
TraceStackFramePointers(main thread): first call took 4us
TraceStackFramePointers(main thread): next 10000 calls took 3724us
TraceStackFramePointers(some other thread): first call took 4us
TraceStackFramePointers(some other thread): next 10000 calls took 7117us

After:
TraceStackFramePointers(main thread): first call took 4380us
TraceStackFramePointers(main thread): next 10000 calls took 4100us
TraceStackFramePointers(some other thread): first call took 35us
TraceStackFramePointers(some other thread): next 10000 calls took 8976us
Description was changed from

==========
Check stack pointer to be inside stack when unwinding.

TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android it dives deep into JVM internals and founds bad stack pointer. Details are here: crbug.com/602701#c18

This CL makes sure we never read past 'end of stack' pointer on Android.

BUG=602701
==========

to

==========
Check stack pointer to be inside stack when unwinding.

TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android when it dives deep into JVM internals and finds bad stack pointer there. See details here: crbug.com/602701#c18

This CL adds checks to make sure only valid stack pointers are dereferenced.

BUG=602701
==========
The patch itself LGTM. However, I just realized that Oilpan does pretty much the same in StackFrameDepth::getUnderestimatedStackSize(), but they solved all the various problems for the various platforms. Considering that WebKit/Source/platform/heap/ can include base/, I wonder if the right thing to do here is just moving that function to base and sharing it between Oilpan and your unwinder. Nico, WDYT?
On 2016/05/17 20:08:07, Primiano Tucci wrote: > The patch itself LGTM. Ehm I just realized that my LG is pointless anyways, I thought we were in trace_event here, realized that it is not the case too late. My question about moving getUnderestimatedStackSize still holds though.
On 2016/05/17 20:09:21, Primiano Tucci wrote:
> On 2016/05/17 20:08:07, Primiano Tucci wrote:
> > The patch itself LGTM.
> Ehm I just realized that my LG is pointless anyways, I thought we were in
> trace_event here, realized that it is not the case too late.
> My question about moving getUnderestimatedStackSize still holds though.
Also, you could define an opaque "struct ThreadLocalStackTraceCache" and
take a pointer to that as an optional argument to TraceStackFramePointers,
where struct ThreadLocalStackTraceCache { uintptr_t stack_limit; }.
In essence the semantics would be: the caller of TraceStackFramePointers is
strongly encouraged to hold a StackTraceCache in a TLS and pass it back, without
having to know its contents (StackTraceCache would be opaque to the caller).
If the caller does that, TraceStackFramePointers() will be faster. You
can pass nullptr, but that will make it slower.
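The suggestion above can be sketched as follows. This is an illustrative shape only, not Chromium's actual API: the struct name comes from the message, while ComputeStackLimit() and GetStackLimit() are stand-in names, with pthread_getattr_np()/pthread_attr_getstack() used as the slow lookup that the cache amortizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <pthread.h>

// Opaque per-thread cache: the caller keeps one in TLS and passes it back,
// so the expensive stack-limit lookup runs at most once per thread.
struct ThreadLocalStackTraceCache {
  uintptr_t stack_limit = 0;  // 0 means "not computed yet".
};

// Stand-in for the real stack-limit lookup (glibc/bionic specific).
uintptr_t ComputeStackLimit() {
  pthread_attr_t attr;
  void* base = nullptr;
  size_t size = 0;
  if (pthread_getattr_np(pthread_self(), &attr) == 0) {
    pthread_attr_getstack(&attr, &base, &size);  // base = lowest address
    pthread_attr_destroy(&attr);
  }
  return reinterpret_cast<uintptr_t>(base);
}

uintptr_t GetStackLimit(ThreadLocalStackTraceCache* cache) {
  if (cache && cache->stack_limit)
    return cache->stack_limit;  // Fast path: reuse the cached value.
  uintptr_t limit = ComputeStackLimit();
  if (cache)
    cache->stack_limit = limit;  // Slow path runs once per cache.
  return limit;
}
```

A caller would hold `thread_local ThreadLocalStackTraceCache g_cache;` and pass `&g_cache`; passing nullptr still works, it just recomputes every time.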
I suppose thakis@ just won another question :)
Welcome back Nico! PTAL :)
lgtm

Re comment 12 on the bug: this not working in component builds would probably be fine (we never ship those), if that makes the other approach any simpler.

https://codereview.chromium.org/1975393002/diff/20001/base/debug/stack_trace.cc
File base/debug/stack_trace.cc (right):
https://codereview.chromium.org/1975393002/diff/20001/base/debug/stack_trace....
base/debug/stack_trace.cc:97: uintptr_t stack_end = GetStackEnd();
hm, should we do this on linux too? seems nicer if chrome/linux and chrome/android behave the same way here
https://codereview.chromium.org/1975393002/diff/20001/base/debug/stack_trace.cc
File base/debug/stack_trace.cc (right):
https://codereview.chromium.org/1975393002/diff/20001/base/debug/stack_trace....
base/debug/stack_trace.cc:97: uintptr_t stack_end = GetStackEnd();
On 2016/05/24 20:29:48, Nico wrote:
> hm, should we do this on linux too? seems nicer if chrome/linux and
> chrome/android behave the same way here

On Linux the heuristic check that we have (>100000) is enough; I've never seen it crash. And on Android we're crashing because we get that magic 0xfff2f2ee value, which is not a pointer, but something else. It just so happens that it's close to the stack and passes the 100000 test. The value is always the same, and we could just hardcode a check for it, but this solution is more generic. I'll enable it on Linux too if it starts crashing there.
> https://codereview.chromium.org/1975393002/diff/20001/base/debug/stack_trace....
> base/debug/stack_trace.cc:97: uintptr_t stack_end = GetStackEnd();
> On 2016/05/24 20:29:48, Nico wrote:
> > hm, should we do this on linux too? seems nicer if chrome/linux and
> > chrome/android behave the same way here

+1

> This not working in component builds would probably be fine (we never ship those), if that makes the other approach any simpler.

I think that does not apply to this CL, which should work in both component and non-component builds. The thing that wouldn't have worked in component mode is the heuristic stack scanning proposed in https://codereview.chromium.org/1996243002/, where he'd scan for addresses on the stack and use /proc/maps to figure out whether a word is possibly a PC or not. But that was wildly esoteric and we decided to do something simpler like this.

> On Linux heuristic check that we have (>100000) is enough, I've never seen it
> crash.

Well, the fact that it didn't crash in your experiments doesn't mean that there cannot be something we don't know about which might cause it to crash there too. I think the question is more: what is the reason for not having this on Linux as well? If the answer is purely performance, we should probably make this faster (maybe in a separate CL) and do the trick where the caller can pass an optional cache argument (essentially the cache content is opaque to the caller, whose only responsibility is keeping it alive in its TLS). Having the same code used on Linux/Android should make maintenance easier.
PTAL again. Performance numbers from Linux:

Before:
main thread: first call took 4us.
main thread: next 1000000 calls took 52933us, 0.052933 us/call
some other thread: first call took 2us.
some other thread: next 1000000 calls took 61104us, 0.061104 us/call

After:
main thread: first call took 123us.
main thread: next 1000000 calls took 153815us, 0.153815 us/call
some other thread: first call took 9us.
some other thread: next 1000000 calls took 332724us, 0.332724 us/call

So it now takes 3x longer for the main thread and 5x longer for a non-main thread. But the absolute numbers are still minuscule, so I think this is fine.
OK, I spoke too soon. Since malloc is weak on Linux, and pthread_getattr_np() allocates memory, we recurse back. And if I work around that by ignoring recursive calls to AllocationContextTracker::GetContextSnapshot() (using ignore_scope_depth_), I get seccomp-bpf failures later:

../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0204
Received signal 11 SEGV_MAPERR 0000020020cc
#0 0x7f53658d3246 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#1 0x7f535d518340 <unknown>
#2 0x7f53630f2e5e sandbox::CrashSIGSYS_Handler()
#3 0x7f53630f17ae sandbox::Trap::SigSys()
#4 0x7f535d518340 <unknown>
#5 0x7f535d5185e4 __pthread_getaffinity_new
#6 0x7f535d51201e pthread_getattr_np
#7 0x7f53658d26b0 base::debug::TraceStackFramePointers()
#8 0x7f5365928bcb base::trace_event::AllocationContextTracker::GetContextSnapshot()
#9 0x7f53659484b7 base::trace_event::MallocDumpProvider::InsertAllocation()
#10 0x7f53659485a9 base::trace_event::(anonymous namespace)::HookAlloc()
#11 0x7f5361b0e36e ShimMalloc
#12 0x7f535a68a83a __strdup

On Android malloc is not weak, so even though pthread_getattr_np() allocates, it doesn't cause any issues. So let's focus on Android for now and return to this later.
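The ignore_scope_depth_ idea mentioned above can be sketched in a few lines. This is a simplified, hypothetical model (names are mine, not Chromium's): a thread-local depth counter makes the allocation hook bail out when it is re-entered by an allocation it triggered itself, instead of recursing or deadlocking.

```cpp
#include <cassert>

// Thread-local re-entrancy guard for an allocation hook: if taking a stack
// snapshot itself allocates (e.g. inside pthread_getattr_np), the nested
// hook invocation is skipped instead of recursing forever.
thread_local int g_hook_depth = 0;

int g_snapshots_taken = 0;  // Stand-in for real profiler bookkeeping.

void SomethingThatAllocates();  // Simulates an allocating callee.

void OnAllocationHook() {
  if (g_hook_depth > 0)
    return;  // Re-entrant call: bail out.
  ++g_hook_depth;
  ++g_snapshots_taken;       // Stand-in for GetContextSnapshot().
  SomethingThatAllocates();  // Re-enters the hook; the guard stops it.
  --g_hook_depth;
}

void SomethingThatAllocates() {
  // Simulates malloc inside the hook re-triggering the hook.
  OnAllocationHook();
}
```

Note this guard fixes the recursion, but (as the message shows) not the seccomp-bpf failure, which comes from the syscall pthread_getattr_np makes.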
On 2016/05/25 19:00:57, Dmitry Skiba wrote: > So it now takes 3x longer for the main thread and 5x longer for non-main thread. > But the absolute numbers are still minuscule, so I think this is fine. Thanks for the perf numbers. From my viewpoint if chrome is still reasonably usable it makes sense to share the same code paths and get the extra protection. Note: this cl, as is, seems to be still android-only. I suppose you changed the define locally?
On 2016/05/25 19:44:59, Primiano Tucci wrote: > On 2016/05/25 19:00:57, Dmitry Skiba wrote: > > So it now takes 3x longer for the main thread and 5x longer for non-main > thread. > > But the absolute numbers are still minuscule, so I think this is fine. > > Thanks for the perf numbers. From my viewpoint if chrome is still reasonably > usable it makes sense to share the same code paths and get the extra protection. > Note: this cl, as is, seems to be still android-only. I suppose you changed the > define locally? See my other comment, I disabled it on Linux again because it caused seccomp issues. We can return to Linux issues later.
what makes us recurse back? can we prevent that? (i really think not turning this on on linux 'cause it happens to work isn't so good)
On 2016/05/25 19:51:43, Nico wrote: > what makes us recurse back? can we prevent that? > > (i really think not turning this on on linux 'cause it happens to work isn't so > good) Yes, I can prevent deadlocks caused by recursion, but then I hit seccomp issues.
On 2016/05/25 19:51:43, Nico wrote:
> what makes us recurse back? can we prevent that?
>
> (i really think not turning this on on linux 'cause it happens to work isn't so
> good)

OK, the issue here is not recursion (that's one problem which, as you say, is easy to solve). The problem is that pthread_getattr_np seems to call pthread_getaffinity (not sure why you end up getting the cpu affinity if you just asked for the stack limit). pthread_getaffinity causes a __NR_sched_getaffinity syscall, which is filtered in the BPF sandbox: https://code.google.com/p/chromium/codesearch#chromium/src/content/common/san...

If I read the code correctly, the BPF policy allows only calls of the form getaffinity(0) (0 == self thread). I suppose the default implementation is passing the actual TID, which trips the sandbox blacklist. This means that this code, as it is, will also crash on Android arm64 where, IIRC, the BPF sandbox is on by default.
> This means that this code will crash as it is also on Android arm64 where, IIRC, > the BPF sandbox is on by default. I take this back, I think it's only glibc that calls getaffinity from getattr. Android bionic is way simpler: https://android.googlesource.com/platform/bionic/+/master/libc/bionic/pthread... The thing that really makes me struggle is: How does the code in blink's StackFrameDepth.cpp not hit the sandbox? It is doing literally the same thing. In which process are you hitting this?
On 2016/05/25 20:08:02, Primiano Tucci wrote: > > This means that this code will crash as it is also on Android arm64 where, > IIRC, > > the BPF sandbox is on by default. > > I take this back, I think it's only glibc that calls getaffinity from getattr. > Android bionic is way simpler: > https://android.googlesource.com/platform/bionic/+/master/libc/bionic/pthread... > > The thing that really makes me struggle is: > How does the code in blink's StackFrameDepth.cpp not hit the sandbox? It is > doing literally the same thing. > > In which process are you hitting this? I think the error was from the renderer, because Chrome was still on the screen, and later I got 'Bad tab / Kill / Wait' dialog.
OK, so the overall question is whether I should make it work on Linux to submit this CL, or whether I can work on the Linux issues later. What do you guys think?
On 2016/05/25 20:47:04, Dmitry Skiba wrote:
> OK, so the overall question is whether I should make it work on Linux to submit
> this CL, or can I work on Linux issues later. What do you guys think?

IMHO it would be great to try and see if there is an evident reason why the equivalent code in PartitionAlloc doesn't hit that sandbox issue. Maybe there is something trivial you can fix here?
OK, so I added a seccomp exception in profiling mode, and that avoids the seccomp
crashes.
But! Now it deadlocks in the renderer:
#0 __lll_lock_wait_private ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007ffff4938286 in _L_lock_40 ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007ffff4937fa9 in pthread_getattr_np (thread_id=140737353818136,
attr=0x7fffffffb848) at pthread_getattr_np.c:41
#3 0x000055555a24d6bb in base::debug::TraceStackFramePointers(void const**,
unsigned long, unsigned long, base::debug::PerThreadStackInfo*) ()
#4 0x000055555a2a3c7f in
base::trace_event::AllocationContextTracker::GetContextSnapshot() ()
#5 0x000055555a2c3567 in
base::trace_event::MallocDumpProvider::InsertAllocation(void*, unsigned long) ()
#6 0x000055555a2c3929 in base::trace_event::(anonymous
namespace)::HookRealloc(base::allocator::AllocatorDispatch const*, void*,
unsigned long) ()
#7 0x00005555564894c9 in realloc ()
#8 0x00007ffff1a9706b in _IO_getdelim (lineptr=lineptr@entry=0x7fffffffc160,
n=n@entry=0x7fffffffc188, delimiter=delimiter@entry=10,
fp=fp@entry=0x7ffff7e59d40) at iogetdelim.c:106
#9 0x00007ffff49381be in pthread_getattr_np (thread_id=140737353816576,
attr=0x7fffffffc1f0) at pthread_getattr_np.c:116
#10 0x000055555915a9a6 in blink::StackFrameDepth::getStackStart() ()
#11 0x000055555915aa4e in blink::ThreadState::ThreadState(bool) ()
#12 0x0000555558f16829 in blink::Platform::initialize(blink::Platform*) ()
#13 0x00005555591b004f in blink::initialize(blink::Platform*) ()
#14 0x0000555559ffb220 in
content::RenderThreadImpl::InitializeWebKit(scoped_refptr<base::SingleThreadTaskRunner>&)
()
#15 0x0000555559ffa201 in
content::RenderThreadImpl::Init(scoped_refptr<base::SingleThreadTaskRunner>&) ()
#16 0x0000555559ffa074 in
content::RenderThreadImpl::RenderThreadImpl(std::unique_ptr<base::MessageLoop,
std::default_delete<base::MessageLoop> >,
std::unique_ptr<scheduler::RendererScheduler,
std::default_delete<scheduler::RendererScheduler> >) ()
#17 0x0000555559ff9d3c in
content::RenderThreadImpl::Create(std::unique_ptr<base::MessageLoop,
std::default_delete<base::MessageLoop> >,
std::unique_ptr<scheduler::RendererScheduler,
std::default_delete<scheduler::RendererScheduler> >) ()
#18 0x000055555a023a1c in content::RendererMain(content::MainFunctionParams
const&) ()
#19 0x000055555a228702 in content::ContentMainRunnerImpl::Run() ()
#20 0x000055555a2271d6 in content::ContentMain(content::ContentMainParams
const&) ()
#21 0x000055555606c20a in ChromeMain ()
#22 0x00007ffff1a49ec5 in __libc_start_main (main=0x55555606c1c0 <main>,
argc=28, argv=0x7fffffffcb58, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffcb48) at libc-start.c:287
#23 0x000055555606c0dd in _start ()
pthread_getattr_np() recursively calls itself. Yet another reason why we
shouldn't override weak malloc on Linux - it catches too many things.
(Update: see the next message - "recursively calls itself" is bad wording on my part.)
On 2016/05/27 08:02:43, Dmitry Skiba wrote:
> pthread_getattr_np() recursively calls itself.

This is not true. What's happening is that you are causing a reentrant call of pthread_getattr_np. The situation seems pretty clear from this stack:
1. Oilpan code in Blink calls getStackStart()
2. Oilpan's getStackStart() calls pthread_getattr_np
3. pthread_getattr_np calls malloc (well, realloc, whatever)
4. you intercept malloc
5. you call pthread_getattr_np() in the malloc hook
6. you have now caused a re-entrant call into pthread_getattr_np :)

Welcome to the wonderful world of heap profiling, where the assumption that whatever function you call doesn't malloc() doesn't always hold :)

> Yet another reason why we shouldn't override weak malloc on Linux - it catches too many things.

I really don't understand how this is related.
- How do you intend to do heap profiling without hooking malloc?
- How is "weak" related? Contrary to popular belief, malloc is NOT a weak symbol. What is weak are the glibc malloc_hook symbols, but that is a different story and is irrelevant here.
- We've always overridden malloc on Linux, before any heap profiling. We did that for the sake of security checks and having our own allocator (TCMalloc). Not sure why we should change that now.
- This problem is not about overriding malloc. The problem is that writing a heap profiler is tricky. Here you were assuming that pthread_getattr_np doesn't call malloc, re-entering the shim. You just found that this is not true.
On 2016/05/27 08:39:43, Primiano Tucci wrote:
> On 2016/05/27 08:02:43, Dmitry Skiba wrote:
> > pthread_getattr_np() recursively calls itself.
> This is not true. What's happening is that you are causing a reentrant call of
> pthread_getattr_np.

Right. Bad wording on my side.

> I really don't understand how this is related.
> - How do you intend to do heap profiling without hooking malloc?
> - How is "weak" related? Conversely to popular belief malloc is NOT a weak
> symbol. What is weak are the glibc malloc_hook symbols, but that is a different
> story and is irrelevant here.

The problem with weak malloc is that we get calls from system libraries. We could just wrap malloc as we do on Android and limit our scope to Chrome only.

> - This problem is not about overriding malloc. The problem is that writing a
> heap profiler is tricky. Here you were assuming that pthread_gettattr_np doesn't
> call malloc re-entering the shim. You just found that this is not true.

The problem is not in the heap profiler, but in the fact that pthread_getattr_np() is not reentrant by design. And I can't blame the glibc guys - they obviously thought that if you don't explicitly reenter a function, you don't need to make it reentrant. BTW, the thread-name deadlock situation in Chrome is exactly the same - we can't ask for a thread name in GetCurrentSnapshot() because GetThreadName() would deadlock if indirectly called from SetThreadName() via an allocation. So on Android, where we initialize heap profiling too late, we have to get thread names through backchannels.
Anyway, I have a workaround for that reentrancy deadlock, but now I found that the GPU process reports way fewer allocations, and doesn't report the 'gpu' dumper (in the same row where 'malloc' is).
Here's an alternative proposal, which will not require you to punch any hole in the sandbox (^__^), will avoid any re-entrancy, and should work on both Linux and Android (and even Mac).

Don't use pthread_getattr_np. Estimate the stack start using mincore(). mincore is a very nice and fast syscall which can tell you two important things:
1. Whether a range of pages is resident in memory.
2. Whether all the pages in the range are mapped, or accessing them would segfault.

What you want here is 2. So my proposal is: do a binary search from (cur_stack) to (cur_stack - X), and check for errno=ENOMEM, where X is some reasonable expected stack amount (~256k). Also make sure you double-check at the end that the area [result_of_binary_search; cur_stack] is fully mapped (you didn't hit holes in the middle) by doing a final mincore() call.

On Linux desktop the cost of a mincore() call seems to be ~260 ns. If it turns out to be a perf issue, cache the stack start address in the TLS.

Also, mincore is supported on Mac too (not sure if it has the same semantics though, check). So the odds are that this code would work there as well. Definitely should work on Linux and Android. You are welcome ;-)
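The proposal above can be sketched as follows. This is a hedged sketch, not the code that eventually landed: for brevity it uses a linear page walk rather than the binary search the message suggests (the result is the same for contiguous stacks), and the function names are mine. The key primitive is that mincore() returns -1 with errno == ENOMEM when the probed range contains unmapped memory, i.e. exactly the "would segfault" signal wanted here.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// True if |page| (page-aligned) is mapped. mincore() succeeds for mapped
// pages (resident or swapped out) and fails with ENOMEM for unmapped ones.
bool IsMapped(uintptr_t page) {
  unsigned char vec;
  if (mincore(reinterpret_cast<void*>(page), 1, &vec) == 0)
    return true;
  return errno != ENOMEM;  // ENOMEM => unmapped memory in the range.
}

// Walks downward from |addr| and returns the lowest mapped page reachable
// without crossing an unmapped page, probing at most |max_bytes| down.
uintptr_t EstimateStackLowerBound(uintptr_t addr, size_t max_bytes) {
  const uintptr_t page_size =
      static_cast<uintptr_t>(sysconf(_SC_PAGESIZE));
  uintptr_t page = addr & ~(page_size - 1);
  uintptr_t limit = page > max_bytes ? page - max_bytes : 0;
  while (page > limit && IsMapped(page - page_size))
    page -= page_size;
  return page;  // First stop: either the probe limit or an unmapped page.
}
```

Because the walk stops at the first unmapped page, the "no holes in the middle" double-check from the message comes for free in this linear variant; a binary-search version would need the extra final mincore() call.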
On 2016/05/27 09:21:16, Primiano Tucci wrote:
> Here's an alternative proposal, which will not require you to punch any hole in
> the sandbox (^__^), will avoid any re-entrancy and should work on both linux and
> android (And even mac).

I don't think there is a problem with punching a hole in the sandbox - we're already punching a hole for sanitizers, so we might as well punch one more for profiling builds.

> Don't use pthread_getattr_np. Estimate the stack start using mincore().
> mincore is a very nice and fast syscall which cal tell you two important things:
> 1. If a range of page pages is resident in memory. 2. If all the pages in the
> range are mapped or accessing it would segfault.
> What you want here is 2.
>
> So my proposal is: do a binary search from (cur_stack) to (cur_stack - X), and
> check for errno=ENOMEM. Where X some reasonable expected stack amount (~256k).
> Make also sure you doublecheck at the end that the area
> [result_of_binary_search; cur_stack] is fully mapped (you didn't hit holes in
> the middle) by doing a final mincore() call.
> On linux desktop the cost of a mincore() call seems to be ~260 ns. If it turns
> out to be a perf issue, cache the stack start address in the TLS.
>
> Also, mincore is supported also on Mac (not sure if it has the same semantics
> though, check). So the odds are that this code would work also there.
> Definitely should work on Linux and Android.

Hmm, documentation says that mincore() determines whether accessing a page would cause a page fault (not a segfault). So I can imagine that for large stacks some pages would go to the page file and would be reported by mincore() as absent, causing us to stop early.

Overall, I've already spent a day and a half on a problem that doesn't exist (crashing during unwinding on Linux), so my plan now is to see whether I can fix the issues with the GPU process, and revert to pure Android if not.
BTW, it seems that OSX builds everything with frame pointers, so we're fine there (judging from its backtrace() function, which uses frame pointers).
On 2016/05/27 09:43:22, Dmitry Skiba wrote:
> I don't think there is a problem with punching a hole inside sandbox - we're
> already punching a hole for sanitizers, so we might as well punch one more for
> profiling builds.

That is a discussion that you should have with chrome-security then.

> Hmm, documentation says that mincore() determines whether accessing a page would
> cause a page fault (not a segfault). So I can imagine that for large stacks some
> pages would go to the page file and would be reported by mincore() as absent,
> causing us to stop early.

Read more, it says both. Given a range of pages, if ret=0 it determines whether accessing a page will cause a page fault, as you say. BUT, if ret=-1 && errno=ENOMEM -> "addr to addr + length contained unmapped memory." (which is what causes your segfault)
On 2016/05/27 09:50:11, Primiano Tucci wrote:
> Read more, it says both.
> Given a range of pages, if ret=0 it determines whether access a page will cause
> a page fault, as you say.
> BUT, if ret=-1 && errno=ENOMEM -> "addr to addr + length contained unmapped
> memory." (which is what causes your segfault)

Oh right. Anyway, I found that the GPU process issue is not really an issue; it's just that three gpu: dump providers are unregistered after some time. I was checking different points in the dumps.
dskiba@google.com changed reviewers: + rickyz@chromium.org
Ricky, please review the seccomp exception I added for __NR_sched_getaffinity for Linux profiling builds (enable_profiling = true). See comment #13 for the issue it avoids (the backtrace is from the GPU process): https://codereview.chromium.org/1975393002/#msg13 BTW, a similar exception exists for sanitizers.
Ehm, something I am missing here is: how did the re-entrancy case mentioned in #23 go away?

P.S. IMHO you are adding a whole bunch of #ifdefs and a special snowflake in the sandbox when you could just add a few lines of mincore(). But as long as all owners are fine, if you really think that's the right way to go, I'm fine.
On 2016/05/27 11:09:15, Primiano Tucci wrote:
> Ehm somethign I am missing here is: how did the re-entrancy case menioned in #23
> go away?
>
> P.S. IMHO you are adding a all bunch of #ifdef and a special snowflake in the
> sandbox when you could just add few lines of mincore().
> But as long as all owners are fine, if you really thing that's the right way to
> go I'm fine.

Because we're now not calling pthread_getattr_np() for the main thread, which avoids the problem.

Honestly, I don't like the complexity of the mincore() approach. I would rather just parse /proc/maps and find the region that contains the stack. My other CL even has the necessary FindMappedRegion() function. However, there are no issues left with the current approach, so I think we're fine.
On 2016/05/27 15:51:09, Dmitry Skiba wrote:
> Honestly, I don't like complexity of mincore() approach. I would rather just
> parse /proc/maps and find the region that contains the stack. My other CL even
> has the necessary FindMappedRegion() function.

Parsing /proc/maps is:
1) Slow: the kernel has to generate the gigantic textual section for you, take the mm lock and fill all the pages. It will take something on the order of ms to do that.
2) Complex: your code there depended on //third_party/symbolize, and AFAIK base cannot depend on anything else. Also, would you be able to do all that without triggering any malloc? If not, you are back to the reentrancy problem again.
3) You are blocked by the sandbox. You cannot open() in child processes on Linux, neither /proc/self/maps nor anything else.

> complexity of mincore() approach

I think we are talking about something like 10-20 lines of C which invoke only a 200ns syscall. ¯\_(ツ)_/¯
On 2016/05/27 16:19:30, Primiano Tucci wrote:
> Parsing /proc/maps is:
> 1) Slow, the kernel has to generate the gigantic textual section for you, take
> the mm lock and fill all the pages. It will take something in the order of ms.
> for doing that.

That parsing is currently happening in pthread_getattr_np() for the main thread for both bionic and glibc. Looks pretty fast.

> 2) Complex. Your code there depended on //third_part/symbolize, AFAIK base
> cannot depend on anything else. Also would you be able to do all that without
> triggering any malloc? If not you are back to the reentrancy problem again.

That dependency is already there for Linux because stack_trace_posix.cc uses google::Symbolize(). It's a one-line change to add it for Android.

> 3) You are blocked by the sandbox. You cannot open() in child processes on
> Linux, neither /proc/self/maps nor anything else.

Maybe we can punch yet another hole?

> complexity of mincore() approach
> I think we are talking about something like 10-20 lines of C which invoke only a
> 200ns syscall. ¯\_(ツ)_/¯

Why do you want to do mincore() anyway? The current approach works fine.
On 2016/05/27 16:28:30, Dmitry Skiba wrote:
> That parsing is currently happening in pthread_getattr_np() for the main thread
> for both bionic and glibc. Looks pretty fast.

http://pastebin.com/sF9hzJWd - reading /proc/.../maps for chrome takes ~300 us without any processing.

> Maybe we can punch a yet another hole?

Again, that is something you need to discuss with chrome-security. I am not a security expert, but:
- in principle /proc/self/maps is the last file you want to expose, as it reveals all ASLR :)
- technically the BPF sandbox can filter only on the arguments of the syscall and cannot dereference any memory, as it could be an invalid pointer. In other words, you cannot white/blacklist depending on the result of dereferencing a char* pointer.

> Why do you want to do mincore() anyway? Current approach works fine.

Anyway, as long as the other owners here are fine, the change in //base/trace_event itself LGTM.

https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_...
File base/trace_event/heap_profiler_allocation_context_tracker.cc (right):
https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_...
base/trace_event/heap_profiler_allocation_context_tracker.cc:212: #if HAVE_TRACE_STACK_FRAME_POINTERS && !defined(OS_NACL)
tip: maybe exclude OS_NACL in HAVE_TRACE_STACK_FRAME_POINTERS so you have to check only for HAVE_TRACE_STACK_FRAME_POINTERS without putting OS_NACL everywhere?
mmenke@chromium.org changed reviewers: + mmenke@chromium.org
https://codereview.chromium.org/1975393002/diff/120001/base/debug/stack_trace.cc File base/debug/stack_trace.cc (right): https://codereview.chromium.org/1975393002/diff/120001/base/debug/stack_trace... base/debug/stack_trace.cc:78: CHECK(!error); include base/logging.h
OK, so while I've worked around the pthread_getattr_np() reentrancy deadlock for the main thread, it still deadlocks for any other thread. A deadlock is 100% guaranteed if the first function we call in a new thread is pthread_getattr_np() itself.

So I'm going to implement the mincore() approach. It doesn't seem to be as complex as I thought, and the bonus part is that we won't need to unify GetStackStart() with blink. And compared to FindMappedRegion() it's less intrusive. Although we'll still need to punch a hole for the mincore syscall.

https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_...
File base/trace_event/heap_profiler_allocation_context_tracker.cc (right):
https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_...
base/trace_event/heap_profiler_allocation_context_tracker.cc:212: #if HAVE_TRACE_STACK_FRAME_POINTERS && !defined(OS_NACL)
On 2016/05/27 16:49:43, Primiano Tucci wrote:
> tip: maybe exclude OS_NACL in HAVE_TRACE_STACK_FRAME_POINTERS so you have to
> check only for HAVE_TRACE_STACK_FRAME_POINTERS without putting OS_NACL
> everywhere?

OS_NACL is there because in NaCl mode stack_trace.cc is not compiled. So stack_trace.h/cc itself is fine, and it shouldn't care about NaCl. The problem is in this file, which (without the NACL check) tries to use something that is not available.
On 2016/05/27 18:50:09, Dmitry Skiba wrote: > OK, so while I've worked around pthread_getattr_np() reentrancy deadlock for the > main thread, it still deadlocks for any other thread. Deadlock is 100% > guaranteed if first function we call in a new thread is pthread_getattr_np() > itself. > > So I'm going to implement mincore() approach. It doesn't seem to be that complex > as I thought, and the bonus part is that we won't need to unify GetStackStart() > with blink. And compared to FindMappedRegion() it's less intrusive. Although > we'll still need to punch a hole for mincore syscall. > there is no need to punch any hole. mincore is already allowed. we use it to implement resident stats in discardable memory dumper. > https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_... > File base/trace_event/heap_profiler_allocation_context_tracker.cc (right): > > https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_... > base/trace_event/heap_profiler_allocation_context_tracker.cc:212: #if > HAVE_TRACE_STACK_FRAME_POINTERS && !defined(OS_NACL) > On 2016/05/27 16:49:43, Primiano Tucci wrote: > > tip: maybe exclude OS_NACL in HAVE_TRACE_STACK_FRAME_POINTERS so you have to > > check only for HAVE_TRACE_STACK_FRAME_POINTERS without putting OS_NACL > > everywhere? > > OS_NACL is there because in NaCl mode stack_trace.cc is not compiled. So > stack_trace.h/cc itself is fine, and it shouldn't care about NACL. The problem > is in this file which (without NACL check) tries to use something that is not > available.
On 2016/05/27 20:04:03, Primiano Tucci wrote: > On 2016/05/27 18:50:09, Dmitry Skiba wrote: > > OK, so while I've worked around pthread_getattr_np() reentrancy deadlock for > the > > main thread, it still deadlocks for any other thread. Deadlock is 100% > > guaranteed if first function we call in a new thread is pthread_getattr_np() > > itself. > > > > So I'm going to implement mincore() approach. It doesn't seem to be that > complex > > as I thought, and the bonus part is that we won't need to unify > GetStackStart() > > with blink. And compared to FindMappedRegion() it's less intrusive. Although > > we'll still need to punch a hole for mincore syscall. > > > there is no need to punch any hole. mincore is already allowed. we use it to > implement resident stats in discardable memory dumper. It didn't work without punching. Search src/sandbox - mincore syscall is mentioned only once, in IsAllowedAddressSpaceAccess() where it causes it to return false. > > > https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_... > > File base/trace_event/heap_profiler_allocation_context_tracker.cc (right): > > > > > https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_... > > base/trace_event/heap_profiler_allocation_context_tracker.cc:212: #if > > HAVE_TRACE_STACK_FRAME_POINTERS && !defined(OS_NACL) > > On 2016/05/27 16:49:43, Primiano Tucci wrote: > > > tip: maybe exclude OS_NACL in HAVE_TRACE_STACK_FRAME_POINTERS so you have to > > > check only for HAVE_TRACE_STACK_FRAME_POINTERS without putting OS_NACL > > > everywhere? > > > > OS_NACL is there because in NaCl mode stack_trace.cc is not compiled. So > > stack_trace.h/cc itself is fine, and it shouldn't care about NACL. The problem > > is in this file which (without NACL check) tries to use something that is not > > available.
dskiba@google.com changed reviewers: + jln@chromium.org
Julien, since Ricky is OOO next week, can you please review the mincore exception for profiling builds?
On 2016/05/28 00:03:51, Dmitry Skiba wrote: > Julien, since Ricky is OOO next week, can you please review mincore exception > for profiling builds? This seems to have a pretty significant security impact. When is #if BUILDFLAG(ENABLE_PROFILING) true? Does such a config ever ship to users? I expect it can't be restricted to DEBUG builds, can it? On which OS do you need this?
On 2016/05/28 00:14:42, jln (very slow on Chromium) wrote: > On 2016/05/28 00:03:51, Dmitry Skiba wrote: > > Julien, since Ricky is OOO next week, can you please review the mincore exception > > for profiling builds? > > This seems to have a pretty significant security impact. > When is #if BUILDFLAG(ENABLE_PROFILING) true? Does such a config ever ship to users? When the enable_profiling GN variable is set to true (profiling = 1 for GYP). Enabling profiling also turns on all sorts of other things (e.g. stack frame pointers, -g2, no relocation packing), so I don't think that we ever ship such builds to users. I use this mode for native heap profiling, where we collect stack frames on each allocation. Primiano, can we guarantee that we won't ever ship profiling builds to users? > I expect it can't be restricted to DEBUG builds, can it? On which OS do you need this? Yeah, we actually need release profiling builds, as they are more performant. I need this only for Linux / Android, but since I modified a file in the linux/ folder, I suppose it's useless to add #ifdefs there?
Nico, Primiano, please review.
Thanks for the work. Makes sense conceptually but I think the code can be simplified quite a bit (removing template, lambdas and double nested loops :) ) On 2016/05/28 00:14:42, jln (very slow on Chromium) wrote: > On 2016/05/28 00:03:51, Dmitry Skiba wrote: > This seems to have a pretty significant security impact. > When is #if BUILDFLAG(ENABLE_PROFILING) true? Does such a config ever ship to users? > so I don't think that we ever ship such builds to users. Yup, I am not aware of any case where we ship this. profiling=true should just be for local perf development workflow. > Primiano, can we guarantee that we won't ever ship profiling builds to users? Would it make everybody feel better if we added, in the sandbox code, something like: #if defined(GOOGLE_CHROME_BUILD) && BUILDFLAG(ENABLE_PROFILING) #error Nononon this is not meant to be shipped #endif to make sure we never ship a chrome-branded thing built with profiling=1? https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_... File base/trace_event/heap_profiler_allocation_context_tracker.cc (right): https://codereview.chromium.org/1975393002/diff/120001/base/trace_event/heap_... base/trace_event/heap_profiler_allocation_context_tracker.cc:212: #if HAVE_TRACE_STACK_FRAME_POINTERS && !defined(OS_NACL) On 2016/05/27 18:50:09, Dmitry Skiba wrote: > On 2016/05/27 16:49:43, Primiano Tucci wrote: > > tip: maybe exclude OS_NACL in HAVE_TRACE_STACK_FRAME_POINTERS so you have to > > check only for HAVE_TRACE_STACK_FRAME_POINTERS without putting OS_NACL > > everywhere? > > OS_NACL is there because in NaCl mode stack_trace.cc is not compiled. So > stack_trace.h/cc itself is fine, and it shouldn't care about NACL. The problem > is in this file which (without NACL check) tries to use something that is not > available. Yeah but since we generally don't care about what happens in nacl, can we be aggressive (always set HAVE_TRACE_STACK_FRAME_POINTERS = false on nacl) in favor of code readability?
https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace.cc File base/debug/stack_trace.cc (right): https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:62: template <size_t PageCount> why do you need a template function with a lambda inside? Can this function be simpler? https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:66: int result = HANDLE_EINTR( according to its manpage mincore doesn't return EINTR (vm operations are typically not interruptible because they take the map_sem). At most it can return EAGAIN if the kernel is temporarily OOM. So maybe you want to do something like https://code.google.com/p/chromium/codesearch#chromium/src/base/trace_event/p... https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:72: CHECK_EQ(errno, ENOMEM); I'd remove this CHECK or make it a DCHECK. 1) you can get an EAGAIN 2) If you fail this estimation, worst case you'll end up scanning a reduced stack, which is not undefined behavior (which is what CHECK should protect against). https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:88: for (size_t pages = PageCount; pages != 1;) { I think this is a bit hard to follow as you do a binary search here and an outer while below. Could this be just a single loop in one place so it's easier to read? https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace.h File base/debug/stack_trace.h (right): https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.h:107: struct BASE_EXPORT PerThreadStackInfo { maybe s/PerThreadStackInfo/ThreadStackLimits/ https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.h:122: // Note on |stack_info|.
By default the function relies on heuristics to check
IMHO this comment is a bit too apologetic. I'd just say: |stack_info| is an optional cache that avoids recomputing the stack limits on each invocation. The caller is supposed to just keep it in a TLS and pass it back.
mariakhomenko@chromium.org changed reviewers: + mariakhomenko@chromium.org
> Thanks for the work. Makes sense conceptually but I think the code can be > simplified quite a bit (removing template, lambdas and double nested loops :) ) I refactored AdvanceMappedPages() into a more obvious FindUnmappedPage() function, and moved the while loop there, but I still need all those things: 1. I need the lambda to abstract nasty mincore() details, since we're really interested in a binary answer. Code is just cleaner this way. 2. I need the while() loop because mincore() wants the 'vec' array to cover the whole page range. I don't want to allocate, and the input address range can be arbitrary, so we might need several probe iterations to cover the whole input range. 3. We have two distinct cases for probe() calls - the first one checks if the whole range is fine (and is really a part of #2), the second is inside the binary search loop, and divides the range in half on each iteration. I peppered FindUnmappedPages() with comments, hopefully it's easier to follow now. https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace.cc File base/debug/stack_trace.cc (right): https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:66: int result = HANDLE_EINTR( On 2016/05/31 16:13:07, Primiano Tucci wrote: > according to its manpage mincore doesn't EINTR (vm operations are typically not > interruptible because they take the map_sem). At most can EAGAIN if the kernel > is temporarily OOM. So maybe you want to do something like > https://code.google.com/p/chromium/codesearch#chromium/src/base/trace_event/p... Done. https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:72: CHECK_EQ(errno, ENOMEM); On 2016/05/31 16:13:06, Primiano Tucci wrote: > I'd remove this CHECK or make it a DCHECK. > 1) you can get an EAGAIN > 2) If you fail this estimation worst case you'll end up scanning a reduced stack > which is not undefined behavior (which is what CHECK should protect against).
Done. https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.cc:88: for (size_t pages = PageCount; pages != 1;) { On 2016/05/31 16:13:06, Primiano Tucci wrote: > I think this is a bit hard to follow as you do a binary search here and an outer > while below. > Could this be just a single loop in one place so it's easier to read? I don't see how it can be a single loop. When binary searching I halve the range on each iteration. So the first check that checks the whole range doesn't really belong here. Besides, I need to return true / false depending on the first check. https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace.h File base/debug/stack_trace.h (right): https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.h:107: struct BASE_EXPORT PerThreadStackInfo { On 2016/05/31 16:13:07, Primiano Tucci wrote: > maybe s/PerThreadStackInfo/ThreadStackLimits/ Acknowledged. https://codereview.chromium.org/1975393002/diff/140001/base/debug/stack_trace... base/debug/stack_trace.h:122: // Note on |stack_info|. By default the function relies on heuristics to check On 2016/05/31 16:13:07, Primiano Tucci wrote: > IMHO this comment is a bit too apologetic. I'd just say: > |stack_info| is an optional cache that avoids recomputing the stack limits on > each invocation. The caller is supposed to just keep it in a TLS and pass it > back. Hmm, I don't see it that way. I think it explains the motivation behind stack_info, which can help in deciding whether or not you need it.
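For context, the binary-search structure under discussion (one probe over the whole range first, then halving until the first unmapped page is isolated) can be sketched roughly as follows. The names RangeMapped and FindFirstUnmappedPage are assumptions for illustration, not the patchset's actual identifiers:

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <vector>

// True iff every page in [start, start + pages * page_size) is mapped.
// |start| must be page-aligned (mincore() returns EINVAL otherwise).
bool RangeMapped(uintptr_t start, size_t pages, size_t page_size) {
  std::vector<unsigned char> vec(pages);  // One residency byte per page.
  return mincore(reinterpret_cast<void*>(start), pages * page_size,
                 vec.data()) == 0;
}

// Sketch of the binary search (assumed name): returns the index of the first
// unmapped page in the range, or |pages| if the whole range is mapped.
size_t FindFirstUnmappedPage(uintptr_t start, size_t pages, size_t page_size) {
  if (RangeMapped(start, pages, page_size))
    return pages;  // First probe: the whole range is fine.
  size_t lo = 0;
  size_t hi = pages;  // Invariant: the first unmapped page is in [lo, hi).
  while (hi - lo > 1) {
    const size_t mid = lo + (hi - lo) / 2;
    if (RangeMapped(start + lo * page_size, mid - lo, page_size))
      lo = mid;  // [lo, mid) fully mapped; the answer is in [mid, hi).
    else
      hi = mid;  // [lo, mid) already contains an unmapped page.
  }
  return lo;
}
```

The initial RangeMapped() call is the "check the whole range" step Dmitry says does not belong inside the loop; the loop itself then needs only O(log n) mincore() probes to isolate the first unmapped page.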
Julien: I added !defined(GOOGLE_CHROME_BUILD) to the condition, to make sure it doesn't get into official builds.
mmenke@chromium.org changed reviewers: - mmenke@chromium.org
I think this can be really simplified. Unless I am missing something you can do
all this in 1/4 of the LOC, in a way which is IMHO more readable and less
obscure.
I might be a bit old style but I really don't see the point of using fancy
lambdas which in your case just hide the method name and require explanatory comments.
Here's a proposal in < 40 LOC
const size_t kMaxFrameSize = 128 * 1024;

#if defined(HAVE_MINCORE)
bool IsRangeMapped(uintptr_t start, uintptr_t end) {
  static uintptr_t page_size = 0;
  if (!page_size)
    page_size = GetPageSize();
  const uintptr_t aligned_start = start & ~(page_size - 1);
  const uintptr_t aligned_end = end & ~(page_size - 1);
  const size_t num_pages = 1 + (aligned_end - aligned_start) / page_size;
  const size_t kMaxProbePages = kMaxFrameSize / 4096;
  DCHECK_LE(num_pages, kMaxProbePages);
  uint8_t unused[kMaxProbePages];
  int error_counter = 0;
  int result;
  do {
    // Note: mincore() takes a length in bytes, not in pages.
    result = mincore(reinterpret_cast<void*>(aligned_start),
                     num_pages * page_size, unused);
  } while (result == -1 && errno == EAGAIN && error_counter++ < 100);
  if (result == 0)
    return true;
  // mincore returns ENOMEM if at least one page in the range is not mapped.
  DCHECK_EQ(ENOMEM, errno);
  return false;
}
#endif

size_t TraceStackFramePointers(..) {
  ...
  {
    uintptr_t next_sp = reinterpret_cast<const uintptr_t*>(sp)[0];
    // With the stack growing downwards, an older stack frame must be
    // at a greater address than the current one.
    if (next_sp <= sp)
      break;
    // Check alignment.
    if (sp & (sizeof(void*) - 1))
      break;
    // Assume stack frames larger than kMaxFrameSize bytes are bogus.
    if (next_sp - sp > kMaxFrameSize)
      break;
#if defined(HAVE_MINCORE)
    uintptr_t stack_limit = stack_info ? stack_info->stack_limit : 0;
    // If the next sp is beyond the previously probed stack limit, probe the
    // new limit and cache it.
    if (next_sp > stack_limit) {
      if (IsRangeMapped(sp, next_sp)) {
        stack_limit = base::Align(next_sp, page_size) - 1;
        if (stack_info)
          stack_info->stack_limit = stack_limit;
      }
    }
    if (next_sp > stack_limit)
      break;
#endif
    sp = next_sp;
  }
  ...
}
On 2016/06/01 20:10:58, Primiano Tucci wrote: > I think this can be really simplified. Unless I am missing something you can do > all this in 1/4 of the LOC, in a way which is IMHO more readable and less > obscure. > I might be a bit old style but I really don't see the point of using fancy > lambdas which in your case just hide the method name and require you comments. > Here's a proposal in < 40 LOC Hmm, your code is not equivalent:
1. It will likely probe all pages one by one, since you are probing the [sp, next_sp) range, and next_sp is likely to be close. I.e. it will be slower.
2. For large stack frames it will give inaccurate results, since when IsRangeMapped(sp, next_sp) fails, you don't know which page is unmapped, and simply use the previous stack_limit. Eventually, because of #1, it will find the true stack limit, but at a runtime cost.
3. IsRangeMapped hardcodes the page size to be 4096 (kMaxProbePages = kMaxFrameSize / 4096). This is probably fine, but all code everywhere else uses GetPageSize(), and I feel that using the function is safer.
If we want to go this route, we can simply have IsPageMapped() and check all pages between sp and next_sp. This will also solve some stylistic issues, like the fact that kMaxFrameSize, which is an implementation detail of TraceStackFramePointers(), is exposed. What exactly bothers you in my solution? I think it's pretty well commented, and the lambda actually contributes to the function's clarity.
Just found an issue with the mincore approach on Linux:
7f4299c1f000-7f429a41f000 rw-p 00000000 00:00 0 [stack:109410]
7f429a41f000-7f429a420000 ---p 00000000 00:00 0
7f429a420000-7f429ac20000 rw-p 00000000 00:00 0 [stack:109287]
7f429ac20000-7f429ac21000 ---p 00000000 00:00 0
7f429ac21000-7f429b421000 rw-p 00000000 00:00 0 [stack:109286]
As you can see, the stack for thread 109410 ends with an inaccessible '---p' page (a guard page?), which is immediately followed by the stack for thread 109287. So mincore will report that the guard page is mapped, which however doesn't mean that it's readable. What's more, mincore is actually useless here, since the stacks follow each other and there are no gaps. So if there is a bad pointer which nevertheless passes the <100000 check, it's probably perfectly readable.
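The guard-page problem described above is easy to demonstrate: a PROT_NONE mapping (which is what a guard page is) passes a mincore()-based check even though any read through it would fault. This is a standalone sketch illustrating the limitation, not code from this CL:

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>

// Returns true if mincore() considers [addr, addr + len) mapped. Note that
// "mapped" says nothing about page protection, which is exactly why a
// mincore()-based stack check cannot detect PROT_NONE guard pages.
bool MincoreSaysMapped(void* addr, size_t len) {
  unsigned char vec[1];  // Enough for a single page.
  return mincore(addr, len, vec) == 0;
}
```

For example, mmap()ing one page with PROT_NONE and passing it to this helper returns true, while an actual load from that page would crash, consistent with why the CL ultimately reverted to the simpler pointer-sanity checks of patchset 1.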
On 2016/06/01 23:26:23, Dmitry Skiba wrote: > 1. It will likely probe all pages one by one, since you are probing [sp, > next_sp) range, and next_sp is likely to be close. I.e. it will be slower. Yes, but only once per thread. Is it a real problem? Don't think so. > 2. For large stack frames it will give inaccurate results, since when > IsRangeMapped(sp, next_sp) fails, you don't know which page is unmapped, and > simply use previous stack_limit. Eventually because of #1 it will find true > stack limit, but at runtime cost. If a (sp, next_sp) range is not fully mapped, it means that the stack is corrupt, at which point I am not sure we care about whatever happens. The current code will stop at the latest valid stack limit found, which doesn't seem unreasonable as a behavior. > 3. IsRangeMapped hardcodes page size to be 4096 (kMaxProbePages = kMaxFrameSize > / 4096). This is probably fine, but all code everywhere else uses GetPageSize(), > and I feel that using function is safer. Right, sorry: just divide that by page_size, not 4096. My bad. > If we want to go this route, we can simply have IsPageMapped() and check all > pages between sp and next_sp. I just thought this code was the simplest and easiest to read. > This will also solve some stylistic issues, like > the fact that kMaxFrameSize, which is TraceStackFramePointers' implementation > detail, is exposed. You had that kMaxFrameSize anyway. It's just called anonymously 100000. I don't see the issue with turning that into a constant. This is not exposed anywhere; it's a const in the anonymous namespace. > What exactly bothers you in my solution? The fact that it is ~4x longer and, IMHO, way harder to read. The time spent on this CL is nothing compared to the time that we'll spend 6 months from now debugging this code. > I think it's pretty well commented, and lambda actually contributes to the function's clarity. OK, I don't know what else to say. I'm not even a reviewer here; I was just trying to help.
Figure it out with a //base/ OWNER. > So mincore will report that guard page is mapped, which however doesn't mean that it's readable. What's more, mincore is actually useless here, since stacks follow each other, and there are no gaps. So if there is a bad pointer, which however passes <100000 check, it's probably perfectly readable. Good point. So either you look at the resident bit of mincore() and just drop the ball early if you hit a stack that has been swapped out (I wouldn't expect that, as the stack should be in LRU, but I never checked actual data), or figure out some other solution.
The CQ bit was checked by dskiba@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from thakis@chromium.org, primiano@chromium.org Link to the patchset: https://codereview.chromium.org/1975393002/#ps200001 (title: "Revert to the LGTMed patchset 1")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1975393002/200001
Message was sent while issue was closed.
Description was changed from ========== Check stack pointer to be inside stack when unwinding. TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android when it dives deep into JVM internals and finds bad stack pointer there. See details here: crbug.com/602701#c18 This CL adds checks to make sure only valid stack pointers are dereferenced. BUG=602701 ========== to ========== Check stack pointer to be inside stack when unwinding. TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android when it dives deep into JVM internals and finds bad stack pointer there. See details here: crbug.com/602701#c18 This CL adds checks to make sure only valid stack pointers are dereferenced. BUG=602701 ==========
Message was sent while issue was closed.
Committed patchset #11 (id:200001)
Message was sent while issue was closed.
Description was changed from ========== Check stack pointer to be inside stack when unwinding. TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android when it dives deep into JVM internals and finds bad stack pointer there. See details here: crbug.com/602701#c18 This CL adds checks to make sure only valid stack pointers are dereferenced. BUG=602701 ========== to ========== Check stack pointer to be inside stack when unwinding. TraceStackFramePointers() function, which unwinds stack using frame pointers, sometimes crashes on Android when it dives deep into JVM internals and finds bad stack pointer there. See details here: crbug.com/602701#c18 This CL adds checks to make sure only valid stack pointers are dereferenced. BUG=602701 Committed: https://crrev.com/0bed5150f19acdd4897dd12ea8e4d4802c51d75f Cr-Commit-Position: refs/heads/master@{#398134} ==========
Message was sent while issue was closed.
Patchset 11 (id:??) landed as https://crrev.com/0bed5150f19acdd4897dd12ea8e4d4802c51d75f Cr-Commit-Position: refs/heads/master@{#398134}
Message was sent while issue was closed.
Documented my attempts here: crbug.com/617730
