Issue 1980263002: GPU Watchdog to check I/O before terminating GPU process

stanisc

Description was changed from ========== GPU Watchdog to check I/O before terminating BUG= ========== to ...

4 years, 7 months ago (2016-05-17 22:02:28 UTC) #1

stanisc

stanisc@chromium.org changed reviewers: + jbauman@chromium.org, wfh@chromium.org

4 years, 7 months ago (2016-05-17 22:23:52 UTC) #2

stanisc

jbauman@ - please review changes in GPU watchdog wfh@ - I had to add a ...

4 years, 7 months ago (2016-05-17 22:23:53 UTC) #3

jbauman

https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdog_thread.cc File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdog_thread.cc#newcode277 content/gpu/gpu_watchdog_thread.cc:277: if (use_temp_file_for_io_checking_) { Instead of using a flag here, ...

4 years, 7 months ago (2016-05-17 22:57:20 UTC) #4

stanisc

https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdog_thread.cc File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdog_thread.cc#newcode277 content/gpu/gpu_watchdog_thread.cc:277: if (use_temp_file_for_io_checking_) { On 2016/05/17 22:57:20, jbauman wrote: > ...

4 years, 7 months ago (2016-05-18 00:01:30 UTC) #5

Will Harris

I don't quite understand how doing an unbuffered write will help unblock the process. Can ...

4 years, 7 months ago (2016-05-18 08:52:55 UTC) #6

stanisc

Here is the rationale for doing the I/O. There is some evidence that many of ...

4 years, 7 months ago (2016-05-18 17:25:31 UTC) #7

Will Harris

I really don't like this solution. I'd be happy to give a temporary l-g-t-m if ...

4 years, 7 months ago (2016-05-19 10:25:01 UTC) #8

stanisc

OK, let's pause this for now. There is another change that will increase watchdog timeout ...

4 years, 7 months ago (2016-05-19 18:47:26 UTC) #9

stanisc

stanisc@chromium.org changed reviewers: + manzagop@chromium.org, nick@chromium.org, pmonette@chromium.org

4 years, 6 months ago (2016-06-09 00:33:45 UTC) #12

stanisc

+nick@chromium.org for content/common code +manzagop@ and pmonette@ who are also investigating Chrome hangs. I've addressed ...

4 years, 6 months ago (2016-06-09 00:33:46 UTC) #13

stanisc

stanisc@chromium.org changed reviewers: + brucedawson@chromium.org

4 years, 6 months ago (2016-06-09 00:45:12 UTC) #14

stanisc

Submitted this too soon by mistake. Also +brucedawson@ I've limited this change to OS_WIN only ...

4 years, 6 months ago (2016-06-09 00:45:12 UTC) #15

Will Harris

On 2016/06/09 00:45:12, stanisc wrote: > Submitted this too soon by mistake. > > Also ...

4 years, 6 months ago (2016-06-09 00:50:14 UTC) #16

brucedawson

If a task is blocked behind I/O then it is probably blocked by multiple I/Os, ...

4 years, 6 months ago (2016-06-09 01:01:13 UTC) #17

ncarter (slow)

lgtm https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_watchdog_utils.cc File content/common/gpu_watchdog_utils.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_watchdog_utils.cc#newcode12 content/common/gpu_watchdog_utils.cc:12: CONTENT_EXPORT bool GetGpuWatchdogTempFile(base::FilePath* file_path) { Can this CONTENT_EXPORT ...

4 years, 6 months ago (2016-06-09 16:14:43 UTC) #18

brucedawson

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchdog_thread.cc File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchdog_thread.cc#newcode281 content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); On 2016/06/09 16:14:43, ncarter wrote: > ...

4 years, 6 months ago (2016-06-09 17:29:05 UTC) #19

manzagop (departed)

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchdog_thread.cc File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchdog_thread.cc#newcode281 content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); > (because 'this' is in a ...

4 years, 6 months ago (2016-06-09 19:41:54 UTC) #20

ncarter (slow)

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchdog_thread.cc File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchdog_thread.cc#newcode281 content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); On 2016/06/09 19:41:54, manzagop wrote: > ...

4 years, 6 months ago (2016-06-09 20:11:20 UTC) #21

brucedawson

> > Drive-by question! Is memory pulled in around registers? I thought it was only ...

4 years, 6 months ago (2016-06-09 20:24:50 UTC) #22

stanisc

4 years, 6 months ago (2016-06-09 20:52:28 UTC) #23

I am fine with reverting once we get more data. Let's see what kind of impact it
has on rate of [GPU hang] crashes and what kind of I/O write times we get back
in crash dumps.

https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_wat...
File content/common/gpu_watchdog_utils.cc (right):

https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_wat...
content/common/gpu_watchdog_utils.cc:12: CONTENT_EXPORT bool
GetGpuWatchdogTempFile(base::FilePath* file_path) {
On 2016/06/09 16:14:43, ncarter wrote:
> Can this CONTENT_EXPORT be omitted from the .cc file? Usually it's only needed
> in the header.

Done.

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd...
File content/gpu/gpu_watchdog_thread.cc (right):

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd...
content/gpu/gpu_watchdog_thread.cc:96: base::File::FLAG_SHARE_DELETE);
On 2016/06/09 16:14:43, ncarter wrote:
> I imagine that running browsertests in parallel with
> --enable-pixel-output-in-tests will give decent coverage of contention on this
> file.

I've tested contention on this file by running two instances of Chrome with
different profile directories. The file gets deleted when the last instance of
Chrome is closed. The file is written to only when a hang has already been
detected (15 sec without an acknowledgement) and Chrome is about to crash. If we
limit this solution to Windows only, we could probably avoid creating this file
until we are ready to write to it, but on other platforms I think it needs to be
created before the sandbox is activated (which is here, in the constructor).
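
For readers following along, here is a rough sketch of the idea being discussed: time a small write to a temp file so the duration can be inspected later. It is not the code in this CL. It uses only the C++ standard library rather than base::File, the helper name TimedProbeWrite and the file path are made up for illustration, and flush() only hands the data to the OS, so it does not reproduce an unbuffered write.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper (not part of Chromium): writes a small zeroed buffer to
// |path|, flushes it to the OS, and returns the elapsed time in microseconds.
// Returns -1 if the file could not be opened or written.
inline long long TimedProbeWrite(const std::string& path) {
  const char probe[32] = {};  // Same size as temp_data in the review.
  const auto start = std::chrono::steady_clock::now();

  std::ofstream file(path, std::ios::binary | std::ios::trunc);
  if (!file)
    return -1;

  file.write(probe, sizeof(probe));
  file.flush();  // Pushes data to the OS; this is not a guarantee of disk I/O.
  if (!file)
    return -1;

  const auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}
```

A caller would store the returned duration somewhere that survives into the crash dump (the CL keeps it in io_check_duration_) and remove the probe file afterwards.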

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd...
content/gpu/gpu_watchdog_thread.cc:277: const char temp_data[32] = {0};
On 2016/06/09 01:01:13, brucedawson wrote:
> VC++ still generates sub-optimal code for = {0}. You should prefer = {}
> instead.
>
> I just tested and the compiler still generates *crazy* code that writes 1
> byte, then 8, 8, 8, 4, 2, 1. I only reported the bug six years ago - give
> them time. On a 64-bit (release) test build this wastes 13 bytes of code.

Done.
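
As a quick illustration of the point above, the two initializer forms are equivalent in result: both leave every element zero, so switching to = {} changes only the generated code, not the behavior. The helper names below are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Returns true if every byte in the buffer is zero.
inline bool AllZero(const char* data, std::size_t size) {
  return std::all_of(data, data + size, [](char c) { return c == 0; });
}

// Builds the buffer with both initializer forms and reports whether both end
// up fully zeroed.
inline bool BothFormsZeroInitialize() {
  const char with_zero[32] = {0};  // First element explicit, rest zeroed.
  const char with_empty[32] = {};  // Every element value-initialized to zero.
  return AllZero(with_zero, sizeof(with_zero)) &&
         AllZero(with_empty, sizeof(with_empty));
}
```

The difference Bruce describes is purely in the instructions the compiler emits, which this sketch does not attempt to measure.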

https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd...
content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed();
On 2016/06/09 20:11:20, ncarter wrote:
> On 2016/06/09 19:41:54, manzagop wrote:
> > > (because 'this' is in a register when we crash)
> > 
> > Drive-by question! Is memory pulled in around registers? I thought it was
> > only around stack addresses? (Also, is the answer for MiniDumpWriteDump or
> > for CrashPad?)
>
> The memory around |*this|, for all stack frames, are captured pretty reliably
> in minidumps on Windows nowadays. It regressed for about a month and a half
> during the breakpad/crashpad switchover late last year, but has been decent
> since. No idea about other platforms. One can use crashkeys if you want the
> values to be visible on the crash/ web UI.

I've added a comment suggested by Bruce.

Using a crash key is a good idea for something like the CPU time delta for the
watched thread. I think as long as io_check_duration_ is visible when debugging
the crash dump, that should be good enough.

stanisc

I'll commit this tomorrow morning if there is no further feedback.

4 years, 6 months ago (2016-06-10 01:24:16 UTC) #24

stanisc

The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, wfh@chromium.org Link to ...

4 years, 6 months ago (2016-06-10 17:25:23 UTC) #26

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1980263002/120001

4 years, 6 months ago (2016-06-10 17:25:51 UTC) #27

commit-bot: I haz the power

Description was changed from ========== GPU Watchdog to check I/O before terminating The idea is ...

4 years, 6 months ago (2016-06-10 17:47:15 UTC) #28

commit-bot: I haz the power

Description was changed from ========== GPU Watchdog to check I/O before terminating The idea is ...

4 years, 6 months ago (2016-06-10 17:48:47 UTC) #31

commit-bot: I haz the power

Patchset 5 (id:??) landed as https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf Cr-Commit-Position: refs/heads/master@{#399222}

4 years, 6 months ago (2016-06-10 17:48:49 UTC) #32

Ken Russell (switch to Gerrit)

4 years, 6 months ago (2016-06-15 18:34:01 UTC) #33

Message was sent while issue was closed.

A revert of this CL (patchset #5 id:120001) has been created in
https://codereview.chromium.org/2071613002/ by kbr@chromium.org.

The reason for reverting is: This CL seems to have caused intermittent assertion
failures in the context_lost_tests on the commit queue and reliable assertion
failures on some of the GPU bots. See http://crbug.com/619196 .

Issue 1980263002: GPU Watchdog to check I/O before terminating GPU process (Closed)

Patch Set 1

Patch Set 2 : Avoid creating the temp file in advance

Patch Set 3 : Addressed feedback

Patch Set 4

Patch Set 5 : Addressed CR feedback