|
|
Chromium Code Reviews
DescriptionGPU Watchdog to check I/O before terminating
The idea is to try to do an unbuffered write in GPU watchdog
just before crashing the process. If the process is slow due
to heavy I/O this should give it more time to unblock.
This should theoretically help with some of GPU hangs.
BUG=612607
Committed: https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf
Cr-Commit-Position: refs/heads/master@{#399222}
Patch Set 1 #Patch Set 2 : Avoid creating the temp file in advance #
Total comments: 3
Patch Set 3 : Addressed feedback #
Total comments: 4
Patch Set 4 : #
Total comments: 11
Patch Set 5 : Addressed CR feedback #
Messages
Total messages: 33 (10 generated)
Description was changed from ========== GPU Watchdog to check I/O before terminating BUG= ========== to ========== GPU Watchdog to check I/O before terminating The idea is to try to do an unbuffered write in GPU watchdog just before crashing the process. If the process is slow due to heavy I/O this should give it more time to unblock. This should theoretically help with some of GPU hangs. BUG=612607 ==========
stanisc@chromium.org changed reviewers: + jbauman@chromium.org, wfh@chromium.org
jbauman@ - please review changes in GPU watchdog wfh@ - I had to add a sandbox rule to allow GPU watchdog to write to a temporary file. Please review that. Please note that I've tried avoiding adding the rule by creating the temp file before lowering the sandbox but that didn't seem to work.
https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdo... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:277: if (use_temp_file_for_io_checking_) { Instead of using a flag here, we should make this into a separate method. Everything that currently does a PostTask of GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang would instead do a posttask of the new method (GpuWatchdogThread::CheckForIo ?), and this new method would be the only thing to PostTask GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang. https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:286: temp_file.Initialize( We're not doing anything special with the sandbox on non-windows, so we need to make sure that this works there.
https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdo... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/20001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:277: if (use_temp_file_for_io_checking_) { On 2016/05/17 22:57:20, jbauman wrote: > Instead of using a flag here, we should make this into a separate method. > Everything that currently does a PostTask of > GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang would instead do a > posttask of the new method (GpuWatchdogThread::CheckForIo ?), and this new > method would be the only thing to PostTask > GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang. Good point! I didn't realize that adding sandbox rules works only on Windows. I might have to go back to the first iteration where I created the temp file before lowering the sandbox (see patch set 1). Another option is to limit this to Windows only.
I don't quite understand how doing an unbuffered write will help unblock the process. Can you perhaps describe more on why you think this helps (happy to also discuss on the bug if you prefer). Thanks.
Here is the rationale for doing the I/O. There is some evidence that many of these GPU hangs are not true hangs but just the machine running slow. Basically if the GPU watchdog thread doesn't get acknowledge from the GPU main thread's task observer within 10 seconds it assumes that main thread is hung and deliberately crashes the process that generates the crash dump. I looked at a large number of these crash dumps and I've noticed the following: - The main thread ends up in a large variety of call stacks during these crash dumps - In some crash dumps it looks like the main thread (that is supposedly hung) manages to make some progress after the crash has already been triggered. For example in some cases it starts processing a new task (which is evident from the task timestamp). So perhaps the reason the main thread is being slow is because it touches a swapped memory page or does something else that requires an I/O and there is an I/O bottleneck. The code that I am adding to the watchdog makes a write a file and flushes it. The flush operation would have to wait for the write to finish, and if the I/O queue is saturated it would have to clear before the write goes through (assuming it is the same drive). Meanwhile if an acknowledge comes from the main thread while waiting for the I/O it would reset the watchdog and cancel the termination.
I really don't like this solution. Id be happy to give a temporary l-g-t-m if this was only going in for a few releases of Canary to test if your hypothesis was correct, but I'm not really happy with this code landing all the way to stable. It seems very speculative - is there no other way of coordinating between the main thread and the watchdog thread to determine what might be happening here? https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:91: temp_dir_path.Append(FILE_PATH_LITERAL("gpu-watchdog.tmp")); use a constant for "gpu-watchdog.tmp" as it's used in two places. https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:99: base::File::FLAG_SHARE_DELETE); two chrome's running on the same machine as same user e.g. chrome canary and chrome stable, will use the same file but the share permissions here don't allow the file to be opened by both.
OK, let's pause this for now. There is another change that will increase watchdog timeout to 15 seconds. Let's see how much that affects the GPU hang rate. I'll start a document to discuss this approach and other related info about GPU hangs. https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:99: base::File::FLAG_SHARE_DELETE); On 2016/05/19 10:25:01, Will Harris (slow thru 23 May) wrote: > two chrome's running on the same machine as same user e.g. chrome canary and > chrome stable, will use the same file but the share permissions here don't allow > the file to be opened by both. It does work with two clients. I've tested that. The file gets deleted when the last client exits. If you look at the implementation in file.h The "share write" flag is added by default unless the "restrict write" flag is specified.
Patchset #4 (id:60001) has been deleted
Patchset #4 (id:80001) has been deleted
stanisc@chromium.org changed reviewers: + manzagop@chromium.org, nick@chromium.org, pmonette@chromium.org
+nick@chromium.org for content/common code +manzagop@ and pmonette@ who are also investigating Chrome hangs. I've addressed previous https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/40001/content/gpu/gpu_watchdo... content/gpu/gpu_watchdog_thread.cc:91: temp_dir_path.Append(FILE_PATH_LITERAL("gpu-watchdog.tmp")); On 2016/05/19 10:25:01, Will Harris wrote: > use a constant for "gpu-watchdog.tmp" as it's used in two places. Done.
stanisc@chromium.org changed reviewers: + brucedawson@chromium.org
Submitted this too soon by mistake. Also +brucedawson@ I've limited this change to OS_WIN only to minimize the risk and because we are observing these [GPU hang] crashes on Windows only. There is a new evidence from a hang report provided by manzagop@ and pmonette@ that memory and disk load seems to be a big cause. More specifically "at least 60% of the reports have 0% disk idle" and "the median avg disk queue len is 6". So it worth to try this on Canary and see if it has any impact on the crash rate. I don't know if doing only one write would be sufficient though. But we should at least get the I/O timing back in crash dumps.
On 2016/06/09 00:45:12, stanisc wrote: > Submitted this too soon by mistake. > > Also +brucedawson@ > > I've limited this change to OS_WIN only to minimize the risk and because we are > observing these [GPU hang] crashes on Windows only. > > There is a new evidence from a hang report provided by manzagop@ and pmonette@ > that memory and disk load seems to be a big cause. > More specifically "at least 60% of the reports have 0% disk idle" and "the > median avg disk queue len is 6". > > So it worth to try this on Canary and see if it has any impact on the crash > rate. I don't know if doing only one write would be sufficient though. But we > should at least get the I/O timing back in crash dumps. I'm happy to try this on Canary but wouldn't want this shipping further without a concrete plan, so happy to lgtm this and then perhaps revert in a few days once you have some data.
If a task is blocked behind I/O then it is probably blocked by multiple I/Os, many of which are not yet queued (page faults from a single thread never lead to long queue lengths) so this is unlikely to resolve the hang. However it may give us information on how long the write takes. If we find that many writes are taking 500+ ms then that gives us a hint that the machine is under heavy I/O stress. https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:277: const char temp_data[32] = {0}; VC++ still generates sub-optimal code for = {0}. You should prefer = {} instead. I just tested and the compiler still generates *crazy* code that writes 1 byte, then 8, 8, 8, 4, 2, 1. I only reported the bug six years ago - give them time. On a 64-bit (release) test build this wastes 13 bytes of code.
lgtm https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_wat... File content/common/gpu_watchdog_utils.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_wat... content/common/gpu_watchdog_utils.cc:12: CONTENT_EXPORT bool GetGpuWatchdogTempFile(base::FilePath* file_path) { Can this CONTENT_EXPORT be omitted from the .cc file? Usually it's only needed in the header. https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:96: base::File::FLAG_SHARE_DELETE); I imagine that running browsertests in parallel with --enable-pixel-output-in-tests will give decent coverage of contention on this file. https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); Did you intend to use this for anything?
https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); On 2016/06/09 16:14:43, ncarter wrote: > Did you intend to use this for anything? The io_check_duration_ member will end up in the crash dump (because 'this' is in a register when we crash) so this value may give us some hints about how big a factor I/O load is. Stan - add a comment to that effect?
https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); > (because 'this' is in a register when we crash) Drive-by question! Is memory pulled in around registers? I thought it was only around stack addresses? (Also, is the answer for MiniDumpWriteDump or for CrashPad?)
https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); On 2016/06/09 19:41:54, manzagop wrote: > > (because 'this' is in a register when we crash) > > Drive-by question! Is memory pulled in around registers? I thought it was only > around stack addresses? (Also, is the answer for MiniDumpWriteDump or for > CrashPad?) The memory around |*this|, for all stack frames, are captured pretty reliably in minidumps on Windows nowadays. It regressed for about a month and a half during the breakpad/crashpad switchover late last year, but has been decent since. No idea about other platforms. One can use crashkeys if you want the values to be visible on the crash/ web UI.
> > Drive-by question! Is memory pulled in around registers? I thought it was
only
> > around stack addresses? (Also, is the answer for MiniDumpWriteDump or for
> > CrashPad?)
>
> The memory around |*this|, for all stack frames, are captured pretty reliably
in
> minidumps on Windows nowadays. It regressed for about a month and a half
during
> the breakpad/crashpad switchover late last year, but has been decent since. No
> idea about other platforms. One can use crashkeys if you want the values to be
> visible on the crash/ web UI.
That's a good drive-by question. I think that MiniDumpWriteDump saves 256 bytes
centered around any pointers it finds in registers. This causes the code near
EIP/RIP to be recorded, as well as part of whatever object is referenced in
ECX/RCX. I think that crashpad emulates this, and I believe it happens for all
threads. The saving of code bytes allows for detecting of some types of code
corruption by running this cryptic command:
!for_each_module "!chkimg -d @#ModuleName"
You can also use MiniDumpWithIndirectlyReferencedMemory
(https://msdn.microsoft.com/en-us/library/windows/desktop/ms680519(v=vs.85).aspx)
to get MiniDumpWriteDump to look for pointers on the stack(s), and therefore
save even more chunks of memory.
If memory besides module information, stacks, and register-pointed data is being
saved then I'm surprised. !vadump will show all blocks that were saved. Further
discussion should probably be elsewhere.
I am fine with reverting once we get more data. Let's see what kind of impact it has on rate of [GPU hang] crashes and what kind of I/O write times we get back in crash dumps. https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_wat... File content/common/gpu_watchdog_utils.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/common/gpu_wat... content/common/gpu_watchdog_utils.cc:12: CONTENT_EXPORT bool GetGpuWatchdogTempFile(base::FilePath* file_path) { On 2016/06/09 16:14:43, ncarter wrote: > Can this CONTENT_EXPORT be omitted from the .cc file? Usually it's only needed > in the header. Done. https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... File content/gpu/gpu_watchdog_thread.cc (right): https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:96: base::File::FLAG_SHARE_DELETE); On 2016/06/09 16:14:43, ncarter wrote: > I imagine that running browsertests in parallel with > --enable-pixel-output-in-tests will give decent coverage of contention on this > file. I've tested contention on this file by running two instances of Chrome with different profile directories. The file gets deleted when the last instance of Chrome is closed. The file gets written to only when the hang is already detected (15 sec on not getting acknowledge) and chrome is about to crash. If we limit this solution to Windows only we could probably not create this file until we are ready to write to it, but on other platforms I think it needs to be created before the sandbox is activated (which is here, in constructor). https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:277: const char temp_data[32] = {0}; On 2016/06/09 01:01:13, brucedawson wrote: > VC++ still generates sub-optimal code for = {0}. You should prefer = {} instead. > > I just tested and the compiler still generates *crazy* code that writes 1 byte, > then 8, 8, 8, 4, 2, 1. I only reported the bug six years ago - give them time. > On a 64-bit (release) test build this wastes 13 bytes of code. Done. https://codereview.chromium.org/1980263002/diff/100001/content/gpu/gpu_watchd... content/gpu/gpu_watchdog_thread.cc:281: io_check_duration_ = timer.Elapsed(); On 2016/06/09 20:11:20, ncarter wrote: > On 2016/06/09 19:41:54, manzagop wrote: > > > (because 'this' is in a register when we crash) > > > > Drive-by question! Is memory pulled in around registers? I thought it was only > > around stack addresses? (Also, is the answer for MiniDumpWriteDump or for > > CrashPad?) > > The memory around |*this|, for all stack frames, are captured pretty reliably in > minidumps on Windows nowadays. It regressed for about a month and a half during > the breakpad/crashpad switchover late last year, but has been decent since. No > idea about other platforms. One can use crashkeys if you want the values to be > visible on the crash/ web UI. I've added a comment suggested by Bruce. Using crash key is a good idea for something like the CPU time delta for the watched thread. I think as long as io_check_duration_ is visible when debugging the crash dump that should be good enough.
I'll commit this tomorrow morning if there is no further feedback.
The CQ bit was checked by stanisc@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, wfh@chromium.org Link to the patchset: https://codereview.chromium.org/1980263002/#ps120001 (title: "Addressed CR feedback")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1980263002/120001
Message was sent while issue was closed.
Description was changed from ========== GPU Watchdog to check I/O before terminating The idea is to try to do an unbuffered write in GPU watchdog just before crashing the process. If the process is slow due to heavy I/O this should give it more time to unblock. This should theoretically help with some of GPU hangs. BUG=612607 ========== to ========== GPU Watchdog to check I/O before terminating The idea is to try to do an unbuffered write in GPU watchdog just before crashing the process. If the process is slow due to heavy I/O this should give it more time to unblock. This should theoretically help with some of GPU hangs. BUG=612607 ==========
Message was sent while issue was closed.
Committed patchset #5 (id:120001)
Message was sent while issue was closed.
CQ bit was unchecked
Message was sent while issue was closed.
Description was changed from ========== GPU Watchdog to check I/O before terminating The idea is to try to do an unbuffered write in GPU watchdog just before crashing the process. If the process is slow due to heavy I/O this should give it more time to unblock. This should theoretically help with some of GPU hangs. BUG=612607 ========== to ========== GPU Watchdog to check I/O before terminating The idea is to try to do an unbuffered write in GPU watchdog just before crashing the process. If the process is slow due to heavy I/O this should give it more time to unblock. This should theoretically help with some of GPU hangs. BUG=612607 Committed: https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf Cr-Commit-Position: refs/heads/master@{#399222} ==========
Message was sent while issue was closed.
Patchset 5 (id:??) landed as https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf Cr-Commit-Position: refs/heads/master@{#399222}
Message was sent while issue was closed.
A revert of this CL (patchset #5 id:120001) has been created in https://codereview.chromium.org/2071613002/ by kbr@chromium.org. The reason for reverting is: This CL seems to have caused intermittent assertion failures in the context_lost_tests on the commit queue and reliable assertion failures on some of the GPU bots. See http://crbug.com/619196 . . |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
