Chromium Code Reviews
Description
Keep track of output_snippet bytes and drop output snippets if output is too large
BUG=665159
Patch Set 1 #
Total comments: 1
Messages
Total messages: 24 (3 generated)
mcgreevy@google.com changed reviewers: + mcgreevy@google.com, phajdan.jr@chromium.org
Paweł, are you available to look at this? I'd like to get this in today before the holiday break, if possible. Let me know if you'd like someone else to review it.
phajdan.jr@chromium.org changed reviewers: + dpranke@chromium.org, jam@chromium.org
+dpranke,jam

I'm not sure if landing this CL quickly is advisable. It can have far-reaching consequences which may be hard to reverse - as we can see from the difficulty of tightening the per-test output limit.

I'd like to get more input from Dirk and John, and I'd strongly suggest we prevent silent regressions in this area.

Thanks for tackling this - let's work on finding the best solution.

https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
File base/test/launcher/test_results_tracker.cc (right):

https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
base/test/launcher/test_results_tracker.cc:399: ? "lengthy output elided"
I'm concerned this seems to result in a silent failure. It seems all too easy for us to run with effectively no output snippets until someone discovers a latent issue by accident.

I'd strongly advocate surfacing this condition in some way, possibly by failing the test - see e.g. https://codereview.chromium.org/2406243004 .

Another option is to further tighten the per-test output limit, such that even with the largest test binary we can't exceed the max limit. See https://groups.google.com/a/chromium.org/d/msg/chromium-dev/ymxI-AaZ7-o/TgUWT... .

We could also consider getting 64-bit python on the bots.

For further context, please read https://goto.google.com/epoll .
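For reference, the eliding behavior under discussion amounts to tracking a running total of snippet bytes and substituting a placeholder once a budget would be exceeded. The following is an illustrative sketch only, not the actual CL; the function name and budget parameter are assumptions:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch (not the actual CL code): accumulate the bytes of
// snippets emitted so far, and once a snippet would push the total past
// the budget, emit a short placeholder instead.
std::string MaybeElideSnippet(const std::string& snippet,
                              size_t max_total_bytes,
                              size_t* bytes_written) {
  if (*bytes_written + snippet.size() > max_total_bytes) {
    // The snippet is dropped silently - the behavior this review questions.
    return "lengthy output elided";
  }
  *bytes_written += snippet.size();
  return snippet;
}
```

Note that nothing in this sketch surfaces the elision to the caller, which is exactly the "silent failure" concern raised above.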
Description was changed from
==========
Keep track of output_snippet bytes and drop output snippets if output is too large
BUG=665159
==========
to
==========
Keep track of output_snippet bytes and drop output snippets if output is too large
BUG=665159
==========
+1 to surfacing a failure when this happens. There should be a clear reason in the build run that the step failed because it had too much output.
On 2016/12/22 19:11:03, jam wrote:
> +1 to surfacing a failure when this happens. It should be a clear reason in the build run that a step failed because it had too much output.

Agreed. And if we're doing anything in processing test output that needs 64-bit python, we've already gone off the rails :).
On 2016/12/22 14:18:39, Paweł Hajdan Jr. wrote:
> +dpranke,jam
>
> I'm not sure if getting this CL quickly is advisable. It can have far-reaching consequences which may be hard to reverse - as we can see by difficulty of tightening per-test output limit.
>
> I'd like to get more input from Dirk and John, and I'd strongly suggest we prevent silent regressions in this area.
>
> Thanks for tackling this - let's work on finding the best solution.
>
> https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
> File base/test/launcher/test_results_tracker.cc (right):
>
> https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
> base/test/launcher/test_results_tracker.cc:399: ? "lengthy output elided"
> I'm concerned this seems to result in a silent failure. It seems very easy we might run with effectively no output snippets until someone discovers a latent issue by accident.
>
> I'd strongly advocate surfacing this condition in some way, possibly by failing the test - see e.g. https://codereview.chromium.org/2406243004 .

It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?

> Another way we could do is further tighten per-test output limit, such that even with largest test binary we can't exceed max limit. See https://groups.google.com/a/chromium.org/d/msg/chromium-dev/ymxI-AaZ7-o/TgUWT... .

How many tests do you expect in the "largest test binary"? Do we have any way of controlling that? It sounds like we'd end up tuning this parameter forevermore if we want to use it to limit the total output size and the number of tests in a binary can grow unbounded.

> We could also consider getting 64-bit python on the bots.
>
> For further context, please read https://goto.google.com/epoll .
On 2017/01/11 05:54:34, mcgreevy wrote:
> It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?

That's one of the options - adopt a technique similar to the above code, but use it in test_launcher.cc, since unit_test_launcher.cc is more specific.

I'd recommend adding kUnreliableResultsTag (in addition to failing _some_ tests, of course). This marks the entire test run as invalid, and makes consumers not look at individual test failures.

> How many tests do you expect in the "largest test binary"?

I used 10k as an estimate, based on 6-8k for the largest test binaries like browser_tests.

> Do we have any way of controlling that? It sounds like we'd end tuning this parameter forevermore if we want to use it to limit the total output size and the number of tests in a binary can grow unbounded.

That's a good point. Please consider involving the wider chromium community (chromium-dev) in the related discussion.
On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> On 2017/01/11 05:54:34, mcgreevy wrote:
> > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?

We should have a mechanism for marking a whole run as failed; we do for other test types, and I thought we did for gtest-based tests, too?

We shouldn't mark every test as failed, since that's not what actually happened.

> > How many tests do you expect in the "largest test binary"?
>
> I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.

The layout_tests are somewhere between 40k-70k tests. I believe the full web-platform-tests suite is ~40k or more as well, as is the webgl conformance suite.

Those aren't gtest-based tests, but I think we should have the same constraints regardless of the test type. And we can't easily shrink the size of those larger suites, though hopefully they won't get much larger.

I agree we should set a policy on chromium-dev, but we should propose one based on the conversations Paweł et al. have already had on this topic, and just look for confirmation or objections.

-- Dirk
Just making sure: did you see https://groups.google.com/a/chromium.org/d/msg/chromium-dev/ymxI-AaZ7-o/TgUWT... ?

I'd be interested in what you think the best next steps are here.
On 2017/01/11 21:24:33, Dirk Pranke wrote:
> On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> > On 2017/01/11 05:54:34, mcgreevy wrote:
> > > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?
>
> We should have a mechanism for marking a whole run as failed; we do for other test types, and I thought we did for gtest-based tests, too?
>
> We shouldn't mark every test as failed, since that's not what actually happened.
>
> > > How many tests do you expect in the "largest test binary"?
> >
> > I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.
>
> The layout_tests are somewhere between 40k-70k tests. I believe the full web-platform-tests suite is ~40k or more as well, as is the webgl conformance suite.
>
> Those aren't gtest-based tests, but I think we should have the same constraints regardless of the test type.

The WebGL 2.0 conformance tests run about 2500 top-level tests, but they perform a lot of assertions internally. The logs are huge, and we do plan to do some more work on reducing their size, but I wouldn't want to impose an artificial constraint on them yet. As long as the constraint can be lifted manually for each test harness, that's fine.

> And we can't easily shrink the size of those larger suites, though hopefully they won't get much larger.
>
> I agree we should set a policy on chromium-dev, but we should propose one based on the conversations Paweł et al. have already had on this topic, and just look for confirmation or objections.
>
> -- Dirk
On 2017/01/12 00:08:36, Ken Russell wrote:
> On 2017/01/11 21:24:33, Dirk Pranke wrote:
> > On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> > > On 2017/01/11 05:54:34, mcgreevy wrote:
> > > > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?
> >
> > We should have a mechanism for marking a whole run as failed; we do for other test types, and I thought we did for gtest-based tests, too?
> >
> > We shouldn't mark every test as failed, since that's not what actually happened.
> >
> > > > How many tests do you expect in the "largest test binary"?
> > >
> > > I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.
> >
> > The layout_tests are somewhere between 40k-70k tests. I believe the full web-platform-tests suite is ~40k or more as well, as is the webgl conformance suite.
> >
> > Those aren't gtest-based tests, but I think we should have the same constraints regardless of the test type.
>
> The WebGL 2.0 conformance tests run about 2500 top-level tests but they perform a lot of assertions internally. The logs are huge and we do plan to do some more work on reducing their size but I wouldn't want to impose an artificial constraint on them yet. As long as the constraint can be lifted manually for each test harness then that's fine.

P.S. A summary of all the test names can be found in src/content/test/data/gpu/webgl2_conformance_tests_output.json .

> > And we can't easily shrink the size of those larger suites, though hopefully they won't get much larger.
> >
> > I agree we should set a policy on chromium-dev, but we should propose one based on the conversations Paweł et al. have already had on this topic, and just look for confirmation or objections.
> >
> > -- Dirk
On 2017/01/11 21:28:58, Paweł Hajdan Jr. wrote:
> I'd be interested what you think are best next steps here.

Who was this question directed at?
On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> On 2017/01/11 05:54:34, mcgreevy wrote:
> > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?
>
> That's one of the options - adopt a technique similar to above code, but use it in test_launcher.cc, since unit_test_launcher.cc is more specific.
>
> I'd recommend adding kUnreliableResultsTag (in addition to failing _some_ tests of course). This marks entire test run as invalid, and makes consumers not look into individual test failures.

Re: "failing _some_ tests": If we fail tests because they exceed a test-run snippet budget, the two options for where to implement that seem to be:

(a) in TestLauncher::OnTestFinished, or
(b) as we are generating the JSON (e.g. in TestResultsTracker::SaveSummaryAsJSON).

Neither of these seems like a good idea:

A pro of (a) is that we can decide on the final status of the test early and then leave it alone. The corresponding con for (b) is that there are places other than SaveSummaryAsJSON that output test results, and it would be weird for them to treat a test as having succeeded while SaveSummaryAsJSON reports it as having failed.

A pro of (b) is that we can easily keep track of the number of bytes (raw + base64-encoded snippet) that we are outputting. The corresponding con for (a) is that it would be weird to do the base64 snippet encoding in OnTestFinished just to know how many bytes a test result will consume when we later output it as JSON.

If, however, we can just tag the entire test run with kUnreliableResultsTag and leave the original test statuses alone[1], then neither of the cons above applies. So, can we actually rely on consumers respecting kUnreliableResultsTag?

[1] We'd still need to truncate some snippets, of course.

> I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.
>
> > Do we have any way of controlling that? It sounds like we'd end tuning this parameter forevermore if we want to use it to limit the total output size and the number of tests in a binary can grow unbounded.
>
> That's a good point. Please consider involving wider chromium community (chromium-dev) in related discussion.
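A side note on the "raw + base64-encoded snippet" accounting mentioned above: the encoded size can be computed without actually performing the encoding, since base64 emits 4 output characters per 3 input bytes, padding the final partial group. A hypothetical helper illustrating the arithmetic (not part of the CL or Chromium's API):

```cpp
#include <cstddef>

// Base64 output length for a given raw byte count: each group of 3 input
// bytes becomes 4 output characters, and a final partial group is padded
// out to a full 4 characters. (Hypothetical helper, not Chromium code.)
size_t Base64EncodedSize(size_t raw_bytes) {
  return 4 * ((raw_bytes + 2) / 3);
}
```

This kind of up-front size calculation would let option (a) estimate a result's JSON footprint in OnTestFinished without doing the encoding there.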
On 2017/01/12 01:14:15, Dirk Pranke wrote:
> On 2017/01/11 21:28:58, Paweł Hajdan Jr. wrote:
> > I'd be interested what you think are best next steps here.
>
> Who was this question directed at?

Both you (for overall technical direction) and Michael (the patch author).

In recipes, chromium_tests/steps.py marks the entire test as failed when the UNRELIABLE_RESULTS tag is present, and disables without-patch retries.
On 2017/01/12 13:39:15, Paweł Hajdan Jr. wrote:
> On 2017/01/12 01:14:15, Dirk Pranke wrote:
> > On 2017/01/11 21:28:58, Paweł Hajdan Jr. wrote:
> > > I'd be interested what you think are best next steps here.
> >
> > Who was this question directed at?
>
> Both you (for overall technical direction) and Michael (patch author).
>
> In recipes chromium_tests/steps.py marks entire test as failed when UNRELIABLE_RESULTS tag is present, and disables without patch retries.

Okay, my guidance is that:

1) Excessive output for the step as a whole should cause the whole test step to fail. We shouldn't classify individual tests as failing in this situation, but rather throw them out as per UNRELIABLE_RESULTS above.

2) We should document and enforce a policy that limits per-test output and per-step output as a whole. The limits should apply to all test types.

I tend to agree w/ thakis' comment on the related email thread that a successful test should produce no output at all other than that the test passed. But we might not be able to implement that immediately, so I'm also okay with setting that as a goal and enforcing a small-ish limit. Failing tests should have a larger, enforced limit as per the thread discussion. We should also enforce a limit on the number of failing tests per step.

I want us to get to a point where we enforce strict limits on both logs and durations for individual tests, steps as a whole, and builds as a whole.
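The layered limits described above could be expressed as a simple policy check along these lines. All names and numbers here are illustrative placeholders, not agreed-on values:

```cpp
#include <cstddef>

// Sketch of layered output limits: a per-test cap, a per-step (whole run)
// cap, and a cap on failing tests per step. The struct, field names, and
// default values are hypothetical placeholders for discussion.
struct OutputPolicy {
  size_t max_bytes_per_test = 100 * 1024;
  size_t max_bytes_per_step = 300 * 1024 * 1024;
  size_t max_failing_tests_per_step = 100;
};

// Returns true when a step's observed stats stay within every limit.
bool WithinPolicy(const OutputPolicy& policy,
                  size_t largest_test_bytes,
                  size_t step_bytes,
                  size_t failing_tests) {
  return largest_test_bytes <= policy.max_bytes_per_test &&
         step_bytes <= policy.max_bytes_per_step &&
         failing_tests <= policy.max_failing_tests_per_step;
}
```

Applying one policy uniformly across test types is what makes the harness-specific exemptions (e.g. WebGL conformance) an explicit, visible decision rather than an accident.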
OK, I've modified my CL (not yet uploaded) to fail the whole test run via the UNRELIABLE_RESULTS tag, but I'm still not quite happy with it. I have a couple of questions:

1. Setting UNRELIABLE_RESULTS signals *that* the test run failed, but not *why* it failed. What is your preferred way of signalling why it failed?
(a) Replacing output_snippets (for tests which are output after we exceed a byte threshold) with some explanatory text (e.g. "lengthy output elided" in my original CL).
(b) Setting another tag in addition to UNRELIABLE_RESULTS, e.g. "TEST_OUTPUT_LIMIT_EXCEEDED".
(c) Something else?

2. Do we want to output any of the output_snippet data if we are failing the whole test run? It seems like it might be helpful for tracking down why so much output was generated in the first place, but we can't output all of it.

3. Since I wrote my CL, test_results_tracker.cc has been modified to output even more information, namely a "summary" and "message" for each "result_part" (see https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker....). I presume that these may also contain a lot of data and that I should also track how much data is being output due to this. Tracking just the bytes used by output snippets is starting to look quite hacky, and I am thinking about wrapping the summary_root DictionaryValue in something which can keep track of the number of bytes that it contains. This still doesn't allow us to set an exact limit on the output.json, as the serialized JSON will require some extra bytes for, e.g., quote characters, but it should be good enough for keeping a lid on the output size without too much bookkeeping in SaveSummaryAsJSON. WDYT?
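The wrapper idea in question 3 might look roughly like this: a thin layer that accumulates key and value sizes as entries are added, approximating the serialized size while deliberately ignoring JSON framing overhead. Everything here is hypothetical, including the class and method names:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch of wrapping the summary dictionary in a byte counter:
// every key and string value added bumps a running total. JSON framing
// (quotes, colons, commas) is ignored, so the total only approximates the
// final output.json size - as the message above accepts.
class ByteCountingDict {
 public:
  void SetString(const std::string& key, const std::string& value) {
    approx_bytes_ += key.size() + value.size();
    values_[key] = value;
  }
  size_t approx_bytes() const { return approx_bytes_; }

 private:
  std::map<std::string, std::string> values_;
  size_t approx_bytes_ = 0;
};
```

A fuller version would subtract the old value's size when a key is overwritten; this sketch only ever adds.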
On 2017/01/16 06:23:17, mcgreevy wrote:
> 1. Setting UNRELIABLE_RESULTS signals *that* the test run failed, but not *why* it failed. What is your preferred way of signalling why it failed?
> (a) Replacing output_snippets (for tests which are output after we exceed a byte threshold) with some explanatory text, (e.g. "lengthy output elided" in my original CL).
> (b) Setting another tag in addition to UNRELIABLE_RESULTS, e.g. "TEST_OUTPUT_LIMIT_EXCEEDED".
> (c) something else?

I prefer (b). We can do (a) in addition to that - up to you.

> 2. Do we want to output any of the output_snippet data if we are failing the whole test run? It seems like it might be helpful for tracking down why so much output was generated in the first place, but we can't output all of it.

Yes, sounds good. We could even create a second summary file with full snippets, which gets copied to GS without any processing.

> 3. Since I wrote my CL, test_results_tracker.cc has been modified to output even more information, namely a "summary" and "message" for each "result_part" (see https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker....). I presume that these may also contain a lot of data and that I should also track how much data is being output due to this. Tracking just the bytes used by output snippets is starting to look quite hacky, and I am thinking about wrapping the summary_root DictionaryValue in something which can keep track of the number of bytes that it contains. This still doesn't allow us to set an exact limit on the output.json as the serialized JSON will require some extra bytes for, e.g. quote characters, but it should be good enough for keeping a lid on the output size without too much bookkeeping in SaveSummaryAsJSON. WDYT?

Sounds good.
On 2017/01/16 09:05:44, Paweł Hajdan Jr. wrote:
> On 2017/01/16 06:23:17, mcgreevy wrote:
> > 1. Setting UNRELIABLE_RESULTS signals *that* the test run failed, but not *why* it failed. What is your preferred way of signalling why it failed?
> > (a) Replacing output_snippets (for tests which are output after we exceed a byte threshold) with some explanatory text, (e.g. "lengthy output elided" in my original CL).
> > (b) Setting another tag in addition to UNRELIABLE_RESULTS, e.g. "TEST_OUTPUT_LIMIT_EXCEEDED".
> > (c) something else?
>
> I prefer (b). We can do (a) in addition to that - up to you.

I would do both (a) and (b), and also make sure you log something to stderr as close to the end of the run as possible to indicate why you're returning that error code.

> > 2. Do we want to output any of the output_snippet data if we are failing the whole test run? It seems like it might be helpful for tracking down why so much output was generated in the first place, but we can't output all of it.
>
> Yes, sounds good. We could even create a second summary file with full snippets, which gets copied to GS without any processing.

I would output things. I don't know that I would create the second summary file, given that the whole point is that all of this output is causing performance problems.

> > 3. Since I wrote my CL, test_results_tracker.cc has been modified to output even more information, namely a "summary" and "message" for each "result_part" (see https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker....). I presume that these may also contain a lot of data and that I should also track how much data is being output due to this. Tracking just the bytes used by output snippets is starting to look quite hacky, and I am thinking about wrapping the summary_root DictionaryValue in something which can keep track of the number of bytes that it contains. This still doesn't allow us to set an exact limit on the output.json as the serialized JSON will require some extra bytes for, e.g. quote characters, but it should be good enough for keeping a lid on the output size without too much bookkeeping in SaveSummaryAsJSON. WDYT?
>
> Sounds good.

+1.
Rietveld CL cleanup time ... is this CL still relevant, or should it be closed? I thought we landed something to handle this, but I'm not sure what.
On 2017/07/15 22:18:07, Dirk Pranke wrote:
> Rietveld CL cleanup time ... is this CL still relevant, or should it be closed? I thought we landed something to handle this, but I'm not sure what.

tansell mentioned the other week that he thought he'd addressed this in another way. I'll verify tomorrow.
