Chromium Code Reviews
Description
Keep track of output_snippet bytes and drop output snippets if output is too large
BUG=665159
Patch Set 1 #
Total comments: 1
Messages
Total messages: 24 (3 generated)
mcgreevy@google.com changed reviewers: + mcgreevy@google.com, phajdan.jr@chromium.org
Paweł, are you available to look at this? I'd like to get this in today before the holiday break, if possible. Let me know if you'd like someone else to review it.
phajdan.jr@chromium.org changed reviewers: + dpranke@chromium.org, jam@chromium.org
+dpranke,jam

I'm not sure if landing this CL quickly is advisable. It can have far-reaching consequences which may be hard to reverse - as we can see from the difficulty of tightening the per-test output limit.

I'd like to get more input from Dirk and John, and I'd strongly suggest we prevent silent regressions in this area.

Thanks for tackling this - let's work on finding the best solution.

https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
File base/test/launcher/test_results_tracker.cc (right):

https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
base/test/launcher/test_results_tracker.cc:399: ? "lengthy output elided"
I'm concerned this seems to result in a silent failure. It seems all too easy for us to run with effectively no output snippets until someone discovers a latent issue by accident.

I'd strongly advocate surfacing this condition in some way, possibly by failing the test - see e.g. https://codereview.chromium.org/2406243004 .

Another option is to further tighten the per-test output limit, such that even with the largest test binary we can't exceed the max limit. See https://groups.google.com/a/chromium.org/d/msg/chromium-dev/ymxI-AaZ7-o/TgUWT... .

We could also consider getting 64-bit python on the bots.

For further context, please read https://goto.google.com/epoll .
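For reference, the eliding behavior under discussion amounts to tracking a running total of snippet bytes and substituting a placeholder once a budget would be exceeded. The following is an illustrative sketch only, not the actual CL; the function name and budget parameter are assumptions:

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch (not the actual CL code): accumulate the bytes of
// snippets emitted so far, and once a snippet would push the total past
// the budget, emit a short placeholder instead.
std::string MaybeElideSnippet(const std::string& snippet,
                              size_t max_total_bytes,
                              size_t* bytes_written) {
  if (*bytes_written + snippet.size() > max_total_bytes) {
    // The snippet is dropped silently - the behavior this review questions.
    return "lengthy output elided";
  }
  *bytes_written += snippet.size();
  return snippet;
}
```

Note that nothing in this sketch surfaces the elision to the caller, which is exactly the "silent failure" concern raised above.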
Description was changed from
==========
Keep track of output_snippet bytes and drop output snippets if output is too large
BUG=665159
==========
to
==========
Keep track of output_snippet bytes and drop output snippets if output is too large
BUG=665159
==========
+1 to surfacing a failure when this happens. There should be a clear reason in the build run that the step failed because it had too much output.
On 2016/12/22 19:11:03, jam wrote:
> +1 to surfacing a failure when this happens. It should be a clear reason in the build run that a step failed because it had too much output.

Agreed. And if we're doing anything in processing test output that needs 64-bit python, we've already gone off the rails :).
On 2016/12/22 14:18:39, Paweł Hajdan Jr. wrote:
> +dpranke,jam
>
> I'm not sure if getting this CL quickly is advisable. It can have far-reaching consequences which may be hard to reverse - as we can see by difficulty of tightening per-test output limit.
>
> I'd like to get more input from Dirk and John, and I'd strongly suggest we prevent silent regressions in this area.
>
> Thanks for tackling this - let's work on finding the best solution.
>
> https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
> File base/test/launcher/test_results_tracker.cc (right):
>
> https://codereview.chromium.org/2592923002/diff/1/base/test/launcher/test_res...
> base/test/launcher/test_results_tracker.cc:399: ? "lengthy output elided"
> I'm concerned this seems to result in a silent failure. It seems very easy we might run with effectively no output snippets until someone discovers a latent issue by accident.
>
> I'd strongly advocate surfacing this condition in some way, possibly by failing the test - see e.g. https://codereview.chromium.org/2406243004 .

It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?

> Another way we could do is further tighten per-test output limit, such that even with largest test binary we can't exceed max limit. See https://groups.google.com/a/chromium.org/d/msg/chromium-dev/ymxI-AaZ7-o/TgUWT... .

How many tests do you expect in the "largest test binary"? Do we have any way of controlling that? It sounds like we'd end up tuning this parameter forevermore if we want to use it to limit the total output size and the number of tests in a binary can grow unbounded.

> We could also consider getting 64-bit python on the bots.
>
> For further context, please read https://goto.google.com/epoll .
On 2017/01/11 05:54:34, mcgreevy wrote:
> It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?

That's one of the options - adopt a technique similar to the above code, but use it in test_launcher.cc, since unit_test_launcher.cc is more specific.

I'd recommend adding kUnreliableResultsTag (in addition to failing _some_ tests, of course). This marks the entire test run as invalid, and makes consumers not look at individual test failures.

> How many tests do you expect in the "largest test binary"?

I used 10k as an estimate, based on 6-8k for the largest test binaries like browser_tests.

> Do we have any way of controlling that? It sounds like we'd end tuning this parameter forevermore if we want to use it to limit the total output size and the number of tests in a binary can grow unbounded.

That's a good point. Please consider involving the wider chromium community (chromium-dev) in the related discussion.
On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> On 2017/01/11 05:54:34, mcgreevy wrote:
> > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?

We should have a mechanism for marking a whole run as failed; we do for other test types, and I thought we did for gtest-based tests, too?

We shouldn't mark every test as failed, since that's not what actually happened.

> > How many tests do you expect in the "largest test binary"?
>
> I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.

The layout_tests are somewhere between 40k-70k tests. I believe the full web-platform-tests suite is ~40k or more as well, as is the webgl conformance suite.

Those aren't gtest-based tests, but I think we should have the same constraints regardless of the test type. And we can't easily shrink the size of those larger suites, though hopefully they won't get much larger.

I agree we should set a policy on chromium-dev, but we should propose one based on the conversations Paweł et al. have already had on this topic, and just look for confirmation or objections.

-- Dirk
Just making sure: did you see https://groups.google.com/a/chromium.org/d/msg/chromium-dev/ymxI-AaZ7-o/TgUWT... ?

I'd be interested in what you think the best next steps are here.
On 2017/01/11 21:24:33, Dirk Pranke wrote:
> On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> > On 2017/01/11 05:54:34, mcgreevy wrote:
> > > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?
>
> We should have a mechanism for marking a whole run as failed; we do for other test types, and I thought we did for gtest-based tests, too?
>
> We shouldn't mark every test as failed, since that's not what actually happened.
>
> > > How many tests do you expect in the "largest test binary"?
> >
> > I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.
>
> The layout_tests are somewhere between 40k-70k tests. I believe the full web-platform-tests suite is ~40k or more as well, as is the webgl conformance suite.
>
> Those aren't gtest-based tests, but I think we should have the same constraints regardless of the test type.

The WebGL 2.0 conformance tests run about 2500 top-level tests, but they perform a lot of assertions internally. The logs are huge, and we do plan to do some more work on reducing their size, but I wouldn't want to impose an artificial constraint on them yet. As long as the constraint can be lifted manually for each test harness, that's fine.

> And we can't easily shrink the size of those larger suites, though hopefully they won't get much larger.
>
> I agree we should set a policy on chromium-dev, but we should propose one based on the conversations Paweł et al. have already had on this topic, and just look for confirmation or objections.
>
> -- Dirk
On 2017/01/12 00:08:36, Ken Russell wrote:
> On 2017/01/11 21:24:33, Dirk Pranke wrote:
> > On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> > > On 2017/01/11 05:54:34, mcgreevy wrote:
> > > > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?
> >
> > We should have a mechanism for marking a whole run as failed; we do for other test types, and I thought we did for gtest-based tests, too?
> >
> > We shouldn't mark every test as failed, since that's not what actually happened.
> >
> > > > How many tests do you expect in the "largest test binary"?
> > >
> > > I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.
> >
> > The layout_tests are somewhere between 40k-70k tests. I believe the full web-platform-tests suite is ~40k or more as well, as is the webgl conformance suite.
> >
> > Those aren't gtest-based tests, but I think we should have the same constraints regardless of the test type.
>
> The WebGL 2.0 conformance tests run about 2500 top-level tests but they perform a lot of assertions internally. The logs are huge and we do plan to do some more work on reducing their size but I wouldn't want to impose an artificial constraint on them yet. As long as the constraint can be lifted manually for each test harness then that's fine.

P.S. A summary of all the test names can be found in src/content/test/data/gpu/webgl2_conformance_tests_output.json .

> > And we can't easily shrink the size of those larger suites, though hopefully they won't get much larger.
> >
> > I agree we should set a policy on chromium-dev, but we should propose one based on the conversations Paweł et al. have already had on this topic, and just look for confirmation or objections.
> >
> > -- Dirk
On 2017/01/11 21:28:58, Paweł Hajdan Jr. wrote:
> I'd be interested what you think are best next steps here.

Who was this question directed at?
On 2017/01/11 10:55:56, Paweł Hajdan Jr. wrote:
> On 2017/01/11 05:54:34, mcgreevy wrote:
> > It would be a bit odd to start marking tests as failed only from the point at which the total output exceeds a given threshold, since the preceding tests also contributed to exceeding the threshold. Ideally we could just mark the whole run as having failed, but I don't think we have a mechanism for doing that. Lacking such a mechanism, I think that it would be more correct to mark every test as failed in this case (perhaps somewhere around here?: https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc...). WDYT?
>
> That's one of the options - adopt a technique similar to above code, but use it in test_launcher.cc, since unit_test_launcher.cc is more specific.
>
> I'd recommend adding kUnreliableResultsTag (in addition to failing _some_ tests of course). This marks entire test run as invalid, and makes consumers not look into individual test failures.

Re: "failing _some_ tests": If we fail tests because they exceed a test-run snippet budget, the two options for where to implement that seem to be:

(a) in TestLauncher::OnTestFinished, or
(b) as we are generating the JSON (e.g. in TestResultsTracker::SaveSummaryAsJSON).

Neither of these seems like a good idea:

A pro of (a) is that we can decide on the final status of the test early and then leave it alone. The corresponding con for (b) is that there are places other than SaveSummaryAsJSON that output test results, and it would be weird for them to treat a test as having succeeded while SaveSummaryAsJSON reports it as having failed.

A pro of (b) is that we can easily keep track of the number of bytes (raw + base64-encoded snippet) that we are outputting. The corresponding con for (a) is that it would be weird to do the base64 snippet encoding in OnTestFinished just to know how many bytes a test result will consume when we later output it as JSON.

If, however, we can just tag the entire test run with kUnreliableResultsTag and leave the original test statuses alone[1], then neither of the cons above applies. So, can we actually rely on consumers respecting kUnreliableResultsTag?

[1] We'd still need to truncate some snippets, of course.

> I used 10k as an estimate, based on 6-8k for largest test binaries like browser_tests.
>
> > Do we have any way of controlling that? It sounds like we'd end tuning this parameter forevermore if we want to use it to limit the total output size and the number of tests in a binary can grow unbounded.
>
> That's a good point. Please consider involving wider chromium community (chromium-dev) in related discussion.
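A side note on the "raw + base64-encoded snippet" accounting mentioned above: the encoded size can be computed without actually performing the encoding, since base64 emits 4 output characters per 3 input bytes, padding the final partial group. A hypothetical helper illustrating the arithmetic (not part of the CL or Chromium's API):

```cpp
#include <cstddef>

// Base64 output length for a given raw byte count: each group of 3 input
// bytes becomes 4 output characters, and a final partial group is padded
// out to a full 4 characters. (Hypothetical helper, not Chromium code.)
size_t Base64EncodedSize(size_t raw_bytes) {
  return 4 * ((raw_bytes + 2) / 3);
}
```

This kind of up-front size calculation would let option (a) estimate a result's JSON footprint in OnTestFinished without doing the encoding there.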
On 2017/01/12 01:14:15, Dirk Pranke wrote:
> On 2017/01/11 21:28:58, Paweł Hajdan Jr. wrote:
> > I'd be interested what you think are best next steps here.
>
> Who was this question directed at?

Both you (for overall technical direction) and Michael (the patch author).

In recipes, chromium_tests/steps.py marks the entire test as failed when the UNRELIABLE_RESULTS tag is present, and disables without-patch retries.
On 2017/01/12 13:39:15, Paweł Hajdan Jr. wrote:
> On 2017/01/12 01:14:15, Dirk Pranke wrote:
> > On 2017/01/11 21:28:58, Paweł Hajdan Jr. wrote:
> > > I'd be interested what you think are best next steps here.
> >
> > Who was this question directed at?
>
> Both you (for overall technical direction) and Michael (patch author).
>
> In recipes chromium_tests/steps.py marks entire test as failed when UNRELIABLE_RESULTS tag is present, and disables without patch retries.

Okay, my guidance is that:

1) Excessive output for the step as a whole should cause the whole test step to fail. We shouldn't classify individual tests as failing in this situation, but rather throw them out as per UNRELIABLE_RESULTS above.

2) We should document and enforce a policy that limits per-test output and per-step output as a whole. The limits should apply to all test types.

I tend to agree w/ thakis' comment on the related email thread that a successful test should produce no output at all other than that the test passed. But we might not be able to implement that immediately, so I'm also okay with setting that as a goal and enforcing a small-ish limit. Failing tests should have a larger, enforced limit as per the thread discussion. We should also enforce a limit on the number of failing tests per step.

I want us to get to a point where we enforce strict limits on both logs and durations for individual tests, steps as a whole, and builds as a whole.
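The layered limits described above could be expressed as a simple policy check along these lines. All names and numbers here are illustrative placeholders, not agreed-on values:

```cpp
#include <cstddef>

// Sketch of layered output limits: a per-test cap, a per-step (whole run)
// cap, and a cap on failing tests per step. The struct, field names, and
// default values are hypothetical placeholders for discussion.
struct OutputPolicy {
  size_t max_bytes_per_test = 100 * 1024;
  size_t max_bytes_per_step = 300 * 1024 * 1024;
  size_t max_failing_tests_per_step = 100;
};

// Returns true when a step's observed stats stay within every limit.
bool WithinPolicy(const OutputPolicy& policy,
                  size_t largest_test_bytes,
                  size_t step_bytes,
                  size_t failing_tests) {
  return largest_test_bytes <= policy.max_bytes_per_test &&
         step_bytes <= policy.max_bytes_per_step &&
         failing_tests <= policy.max_failing_tests_per_step;
}
```

Applying one policy uniformly across test types is what makes the harness-specific exemptions (e.g. WebGL conformance) an explicit, visible decision rather than an accident.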
OK, I've modified my CL (not yet uploaded) to fail the whole test run via the UNRELIABLE_RESULTS tag, but I'm still not quite happy with it. I have a couple of questions:

1. Setting UNRELIABLE_RESULTS signals *that* the test run failed, but not *why* it failed. What is your preferred way of signalling why it failed?
(a) Replacing output_snippets (for tests which are output after we exceed a byte threshold) with some explanatory text (e.g. "lengthy output elided" in my original CL).
(b) Setting another tag in addition to UNRELIABLE_RESULTS, e.g. "TEST_OUTPUT_LIMIT_EXCEEDED".
(c) Something else?

2. Do we want to output any of the output_snippet data if we are failing the whole test run? It seems like it might be helpful for tracking down why so much output was generated in the first place, but we can't output all of it.

3. Since I wrote my CL, test_results_tracker.cc has been modified to output even more information, namely a "summary" and "message" for each "result_part" (see https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker....). I presume that these may also contain a lot of data and that I should also track how much data is being output due to this. Tracking just the bytes used by output snippets is starting to look quite hacky, and I am thinking about wrapping the summary_root DictionaryValue in something which can keep track of the number of bytes that it contains. This still doesn't allow us to set an exact limit on the output.json, as the serialized JSON will require some extra bytes for, e.g., quote characters, but it should be good enough for keeping a lid on the output size without too much bookkeeping in SaveSummaryAsJSON. WDYT?
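The wrapper idea in question 3 might look roughly like this: a thin layer that accumulates key and value sizes as entries are added, approximating the serialized size while deliberately ignoring JSON framing overhead. Everything here is hypothetical, including the class and method names:

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch of wrapping the summary dictionary in a byte counter:
// every key and string value added bumps a running total. JSON framing
// (quotes, colons, commas) is ignored, so the total only approximates the
// final output.json size - as the message above accepts.
class ByteCountingDict {
 public:
  void SetString(const std::string& key, const std::string& value) {
    approx_bytes_ += key.size() + value.size();
    values_[key] = value;
  }
  size_t approx_bytes() const { return approx_bytes_; }

 private:
  std::map<std::string, std::string> values_;
  size_t approx_bytes_ = 0;
};
```

A fuller version would subtract the old value's size when a key is overwritten; this sketch only ever adds.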
On 2017/01/16 06:23:17, mcgreevy wrote:
> 1. Setting UNRELIABLE_RESULTS signals *that* the test run failed, but not *why* it failed. What is your preferred way of signalling why it failed?
> (a) Replacing output_snippets (for tests which are output after we exceed a byte threshold) with some explanatory text, (e.g. "lengthy output elided" in my original CL).
> (b) Setting another tag in addition to UNRELIABLE_RESULTS, e.g. "TEST_OUTPUT_LIMIT_EXCEEDED".
> (c) something else?

I prefer (b). We can do (a) in addition to that - up to you.

> 2. Do we want to output any of the output_snippet data if we are failing the whole test run? It seems like it might be helpful for tracking down why so much output was generated in the first place, but we can't output all of it.

Yes, sounds good. We could even create a second summary file with full snippets, which gets copied to GS without any processing.

> 3. Since I wrote my CL, test_results_tracker.cc has been modified to output even more information, namely a "summary" and "message" for each "result_part" (see https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker....). I presume that these may also contain a lot of data and that I should also track how much data is being output due to this. Tracking just the bytes used by output snippets is starting to look quite hacky, and I am thinking about wrapping the summary_root DictionaryValue in something which can keep track of the number of bytes that it contains. This still doesn't allow us to set an exact limit on the output.json as the serialized JSON will require some extra bytes for, e.g. quote characters, but it should be good enough for keeping a lid on the output size without too much bookkeeping in SaveSummaryAsJSON. WDYT?

Sounds good.
On 2017/01/16 09:05:44, Paweł Hajdan Jr. wrote:
> On 2017/01/16 06:23:17, mcgreevy wrote:
> > 1. Setting UNRELIABLE_RESULTS signals *that* the test run failed, but not *why* it failed. What is your preferred way of signalling why it failed?
> > (a) Replacing output_snippets (for tests which are output after we exceed a byte threshold) with some explanatory text, (e.g. "lengthy output elided" in my original CL).
> > (b) Setting another tag in addition to UNRELIABLE_RESULTS, e.g. "TEST_OUTPUT_LIMIT_EXCEEDED".
> > (c) something else?
>
> I prefer (b). We can do (a) in addition to that - up to you.

I would do both (a) and (b), and also make sure you log something to stderr as close to the end of the run as possible to indicate why you're returning that error code.

> > 2. Do we want to output any of the output_snippet data if we are failing the whole test run? It seems like it might be helpful for tracking down why so much output was generated in the first place, but we can't output all of it.
>
> Yes, sounds good. We could even create a second summary file with full snippets, which gets copied to GS without any processing.

I would output things. I don't know that I would create the second summary file, given that the whole point is that all of this output is causing performance problems.

> > 3. Since I wrote my CL, test_results_tracker.cc has been modified to output even more information, namely a "summary" and "message" for each "result_part" (see https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker....). I presume that these may also contain a lot of data and that I should also track how much data is being output due to this. Tracking just the bytes used by output snippets is starting to look quite hacky, and I am thinking about wrapping the summary_root DictionaryValue in something which can keep track of the number of bytes that it contains. This still doesn't allow us to set an exact limit on the output.json as the serialized JSON will require some extra bytes for, e.g. quote characters, but it should be good enough for keeping a lid on the output size without too much bookkeeping in SaveSummaryAsJSON. WDYT?
>
> Sounds good.

+1.
Rietveld CL cleanup time ... is this CL still relevant, or should it be closed? I thought we landed something to handle this, but I'm not sure what.
On 2017/07/15 22:18:07, Dirk Pranke wrote:
> Rietveld CL cleanup time ... is this CL still relevant, or should it be closed? I thought we landed something to handle this, but I'm not sure what.

tansell mentioned the other week that he thought he'd addressed this in another way. I'll verify tomorrow.
