|
|
Created:
7 years, 5 months ago by Noel Gordon Modified:
6 years, 6 months ago CC:
blink-reviews, eae+blinkwatch, dglazkov+blink, jeez, kbalazs, Ken Russell (switch to Gerrit), tonyg Base URL:
svn://svn.chromium.org/blink/trunk Visibility:
Public. |
DescriptionImprove PNG decode performance: avoid branching in the row write loop
No change in behavior, no new tests.
BUG=258324
Committed: https://src.chromium.org/viewvc/blink?view=rev&revision=175666
Patch Set 1 #
Total comments: 2
Patch Set 2 : repetitive patch. #Patch Set 3 : template patch. #
Total comments: 11
Patch Set 4 : #Patch Set 5 : Use setRGBA<false>, remove setRGBARaw. #Patch Set 6 : Redo comment, re-run trys. #Patch Set 7 : Once more, measurement post facto. #Patch Set 8 : Refresh trys. #
Messages
Total messages: 64 (0 generated)
https://codereview.chromium.org/18099004/diff/1/Source/core/platform/image-de... File Source/core/platform/image-decoders/ImageDecoder.h (right): https://codereview.chromium.org/18099004/diff/1/Source/core/platform/image-de... Source/core/platform/image-decoders/ImageDecoder.h:177: inline void setRGBAPremultiply(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) I think we ought to be able to define setRGBA() in terms of these next two functions. I'm aware that we're trying to avoid the GIF decoder regression that came from doing this sort of thing previously. I think the main problem with that patch was that its version of setRGBAPremultiply() moved the "if (!a)" clause outside the "if (a < 255)" clause, resulting in the common case doing two comparisons instead of one. Since you haven't done that here, I think this patch ought to be more performant than that one. Can we try it?
https://codereview.chromium.org/18099004/diff/1/Source/core/platform/image-de... File Source/core/platform/image-decoders/ImageDecoder.h (right): https://codereview.chromium.org/18099004/diff/1/Source/core/platform/image-de... Source/core/platform/image-decoders/ImageDecoder.h:177: inline void setRGBAPremultiply(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) On 2013/07/10 18:24:47, Peter Kasting wrote: > I think we ought to be able to define setRGBA() in terms of these next two > functions. Seems perfectly reasonable ... > I'm aware that we're trying to avoid the GIF decoder regression that came from > doing this sort of thing previously. I think the main problem with that patch > was that its version of setRGBAPremultiply() moved the "if (!a)" clause outside > the "if (a < 255)" clause, resulting in the common case doing two comparisons > instead of one. In https://chromiumcodereview.appspot.com/17448010#msg12 I mentioned the exact same thing to b.keleman post-facto in review after his patch landed, viz., why was the if (!a) test moved? He prepared a change to put it back to where it was, and we landed that: https://src.chromium.org/viewvc/blink?revision=153264&view=revision but no joy: it did not fix the perf issue as noted in https://code.google.com/p/chromium/issues/detail?id=254462#c9 > Since you haven't done that here, I think this patch ought to > be more performant than that one. Can we try it? so I believe we'd see the perf regression if we went this way again.
I measured again to confirm: decoding the 154 GIF corpus images 500 times each, "separate" is 20% faster on my t410 win7 laptop. separate-500-gif: aka the current patch, 220.00 secs https://docs.google.com/spreadsheet/ccc?key=0Asd5UUPIMYYsdFo3dFdkNlNPTnhOcmo2... combined-500-gif: write setRGBA in terms of the other two, 274.97 secs https://docs.google.com/spreadsheet/ccc?key=0Asd5UUPIMYYsdFZndjZQSTE3TVR6V2ht...
On 2013/07/11 15:02:21, Noel Gordon (Google) wrote: > I measured again to confirm: decoding the 154 GIF corpus images 500 times each, > "separate" > is 20% faster on my t410 win7 laptop. > > separate-500-gif: aka the current patch, 220.00 secs > > https://docs.google.com/spreadsheet/ccc?key=0Asd5UUPIMYYsdFo3dFdkNlNPTnhOcmo2... > > combined-500-gif: write setRGBA in terms of the other two, 274.97 secs > > https://docs.google.com/spreadsheet/ccc?key=0Asd5UUPIMYYsdFZndjZQSTE3TVR6V2ht... Can we take a look at the generated assembly for these two cases? I can't see why we should get a slowdown. Maybe the compiler is not actually inlining where it should and we need to do more to force it to inline?
The code inlines they way I wrote it, and does not inline when written as you suggested. Inline is hint, but is done at the discretion of compiler using way more information than we have available. Forcing inline is not-a-great-idea, imho, and from the advice given at msdn.
On 2013/07/12 03:22:51, Noel Gordon (Google) wrote: > The code inlines they way I wrote it, and does not inline when written as you > suggested. Inline is hint, but is done at the discretion of compiler using way > more information than we have available. Forcing inline is not-a-great-idea, > imho, and from the advice given at msdn. I'm an ex-compiler engineer and much less inclined to trust the compiler's instincts than you are :). And it's not true that the compiler has more information than us. We have critical bits like actual image decoder speed test results. Do we know why things aren't inlining? Something as simple as declaration order could be the problem. Depending on when the compiler performs different inlining phases, it may not have the necessary IR available for later functions when earlier ones are constructed. In any case, if we write this code expecting inlining in order to achieve a particular level of performance, we should ensure the compiler doesn't stop inlining down the road, crippling that performance. In rare cases like this, forcing inlining can make sense. If we don't want to do that, we could consider writing the function bodies as #defines, but that is a sort of last resort.
On 2013/07/12 03:31:22, Peter Kasting wrote: > On 2013/07/12 03:22:51, Noel Gordon (Google) wrote: > > The code inlines they way I wrote it, and does not inline when written as you > > suggested. Inline is hint, but is done at the discretion of compiler using > way > > more information than we have available. Forcing inline is not-a-great-idea, > > imho, and from the advice given at msdn. > > I'm an ex-compiler engineer and much less inclined to trust the compiler's > instincts than you are :). And it's not true that the compiler has more > information than us. We have critical bits like actual image decoder speed test > results. I'm not trusting the compilers instincts, I'm trusting my measurements :) > Do we know why things aren't inlining? Something as simple as declaration order > could be the problem. Depending on when the compiler performs different > inlining phases, it may not have the necessary IR available for later functions > when earlier ones are constructed. Nope. > In any case, if we write this code expecting inlining in order to achieve a > particular level of performance, we should ensure the compiler doesn't stop > inlining down the road, crippling that performance. In rare cases like this, > forcing inlining can make sense. If we don't want to do that, we could consider > writing the function bodies as #defines, but that is a sort of last resort. We have image decode perf bots for this very reason.
On 2013/07/12 03:54:20, Noel Gordon (Google) wrote: > On 2013/07/12 03:31:22, Peter Kasting wrote: > > On 2013/07/12 03:22:51, Noel Gordon (Google) wrote: > > > The code inlines they way I wrote it, and does not inline when written as > you > > > suggested. Inline is hint, but is done at the discretion of compiler using > > way > > > more information than we have available. Forcing inline is > not-a-great-idea, > > > imho, and from the advice given at msdn. > > > > I'm an ex-compiler engineer and much less inclined to trust the compiler's > > instincts than you are :). And it's not true that the compiler has more > > information than us. We have critical bits like actual image decoder speed > test > > results. > > I'm not trusting the compilers instincts, I'm trusting my measurements :) I think we're talking past each other. I am saying, "the compiler is a genius so always trust its inlining decisions" is the wrong way to go. If we need to force the compiler to inline, we should do it and not look back. It's not a bad thing. By contrast, duplicating the code absolutely is a bad thing, and I'm not going to approve a patch that looks like that.
On 2013/07/12 04:39:03, Peter Kasting wrote: > On 2013/07/12 03:54:20, Noel Gordon (Google) wrote: > > On 2013/07/12 03:31:22, Peter Kasting wrote: > > > On 2013/07/12 03:22:51, Noel Gordon (Google) wrote: > > > > The code inlines they way I wrote it, and does not inline when written as > > you > > > > suggested. Inline is hint, but is done at the discretion of compiler > using > > > way > > > > more information than we have available. Forcing inline is > > not-a-great-idea, > > > > imho, and from the advice given at msdn. > > > > > > I'm an ex-compiler engineer and much less inclined to trust the compiler's > > > instincts than you are :). And it's not true that the compiler has more > > > information than us. We have critical bits like actual image decoder speed > > test > > > results. > > > > I'm not trusting the compilers instincts, I'm trusting my measurements :) > > I think we're talking past each other. I am saying, "the compiler is a genius > so always trust its inlining decisions" is the wrong way to go. That's fine. Again, I'm am not trusting the compiler. > If we need to > force the compiler to inline, we should do it and not look back. It's not a bad > thing. But I have forced it to inline.
On 2013/07/12 04:51:24, Noel Gordon (Google) wrote: > But I have forced it to inline. ?? The visible patch on this bug still implements setRGBA() and setRGBAPremultiply() in terms of a nearly-identical block of nontrivial code. My request was to avoid this, and then fix the issue where avoiding this resulted in the compiler's default heuristic being to no longer inline the code, which from my understanding is the result of the GIF decoder slowdown. Considering that the code I can see hasn't changed since you wrote "Forcing inline is not-a-great-idea, imho," I'm not totally sure how to parse your sentence here. I also wasn't clear from your emails whether the "Nope" in response to "do we know why things aren't inlining" paragraph was "no, I don't know why" or "no, changing declaration order doesn't help".
Noel, is there any way to get the desired speedup here without much code duplication? Perhaps we can change the m_premultiplyAlpha variable-check to a template argument to make this compile-time-specified, a la: // |premultiply| is provided as a template argument to avoid a branch in the non-premultiply case. inline void setRGBA<bool premultiply>(...) { if (premultiply && a < 255) ... Then we don't need setRGBAPremultiply() or setRGBARaw(), because callers can do setRGBA<true>() and setRGBA<false>(), respectively, and get the same code; and setRGBA() calls become setRGBA<m_premultiply>(). Would this still provide the desired perf increase?
This has seen no activity for almost seven months (really, almost ten months; my last message was just an unanswered question). Is there a way to move forward here? If this still hasn't seen any activity in another couple of months, I'm going to close this.
Was a taken off this onto more pressing thinks. It's Saturday here and I decided look at this gain in my own time, sorry I don't really have a lot of time to commit to this at present. First thing I did was get the repetitive code patch applied at tip and uploaded.
Next I, got the patch on https://codereview.chromium.org/18107003 working again so I could accurately measure decode time. A few hours later, with deferred image decoding disabled, it's ready for testing. I used the same gif and png corpus images. Each image is decoded 500 times to compute the average decode time in seconds, on a lenovo T530 laptop running win7. I computed the total decode time (in seconds) for each corpus, without and with the change, and the results are: PNG corpus: 300.188292 (before), 255.765057 (after). GIF corpus: 72.040065 (before), 70.506371 (after). PNG corpus improves, small improvement in the the GIF corpus, for the repetitive patch case. Next I looked at the template you suggested ... // |premultiply| is provided as a template argument to avoid a branch in the non-premultiply case. inline void setRGBA<bool premultiply>(...) { if (premultiply && a < 255) ... setRGBARaw() is being used by the JPEG decoder, so I left that alone. I ditched setRGBAPremultiply() and experimented with changing the body of setRGBA to use the template. setRGBA<false> was slower than using SkPackARGB32NoCheck so I retained the later. Uploaded the patch I measured with: I call it the template patch.
My measured results with that template patch are: PNG : 300.188292 (before), 255.765057 (repetitive), 255.428839 (template) GIF : 72.040065 (before), 70.506371 (repetitive), 71.669523 (template) PNG results look good. GIF-template is not as fast as GIF-repetitive, but both are faster than ToT (more) for this GIF image corpus. C++ Compiler was VS2013, all results tabulated at: https://docs.google.com/spreadsheets/d/1MZ5pVzTXHALk9ugfvTQ3ghcTUCtSf9b3uizWC...
Sorry meant, "are faster than ToT (before) for this ...", late here, wife's long gone to bed and so should I.
If we have a way to run our image decoding benchmarks on a mobile device, that sounds very interesting. I'm not sure how much image decoding figures into total page time or page responsiveness, but I suspect it's a non-zero cost. That said, the change looks reasonable to me. pkasting and you are both more familiar with this code than I.
we dont have a good wholistic performance test story for image decode. noel has some stuff, i know, but its not hooked into telemetry so we currently do not definitively know how the image stack actually impacts pages, nor how a given change affects that, using real world content. this having been said, stuff like this where there's a pretty good speedup without any clear downside, and thats a good thing.
This seems good in principle. https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... File Source/platform/image-decoders/ImageFrame.h (right): https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:167: if (m_premultiplyAlpha) We could simplify this to just: setRGBA<m_premultiplyAlpha>(dest, r, g, b, a); https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:176: inline void setRGBA(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) This can be static, can't it? https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:195: inline void setRGBARaw(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) Seems like this should just disappear, and callers should call setRGBA<false>(). https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... File Source/platform/image-decoders/png/PNGImageDecoder.cpp (right): https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/png/PNGImageDecoder.cpp:498: if (hasAlpha) { Nit: Consider adding a comment here about how we write the code in this more repetitive form in order to keep comparisons out of the inner loops.
https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... File Source/platform/image-decoders/ImageFrame.h (right): https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:167: if (m_premultiplyAlpha) On 2014/05/12 19:36:11, Peter Kasting wrote: > We could simplify this to just: > > setRGBA<m_premultiplyAlpha>(dest, r, g, b, a); Doesn't compile on any C++ compiler I know. VS2010, VS2013 complain with "Template parameter must be a compile-time constant expression". https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:176: inline void setRGBA(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) On 2014/05/12 19:36:11, Peter Kasting wrote: > This can be static, can't it? Could be, but you'd then add function calling overhead to every pixel. https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:195: inline void setRGBARaw(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) On 2014/05/12 19:36:11, Peter Kasting wrote: > Seems like this should just disappear, and callers should call setRGBA<false>(). Mentioned that in #15. Was a little slower according to my measurements, so I left it as is.
https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... File Source/platform/image-decoders/ImageFrame.h (right): https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:167: if (m_premultiplyAlpha) On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > On 2014/05/12 19:36:11, Peter Kasting wrote: > > We could simplify this to just: > > > > setRGBA<m_premultiplyAlpha>(dest, r, g, b, a); > > Doesn't compile on any C++ compiler I know. VS2010, VS2013 complain with > "Template parameter must be a compile-time constant expression". Huh, that's odd. I could swear I've done precisely this before. Oh well. I guess this should just manually call the two forms of setRGBA(), then. (See below) https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:176: inline void setRGBA(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > On 2014/05/12 19:36:11, Peter Kasting wrote: > > This can be static, can't it? > > Could be, but you'd then add function calling overhead to every pixel. Why? Can't something be both static and inline? All marking it in static should do is make it clear to the reader that we don't touch anything in |this|. https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/ImageFrame.h:195: inline void setRGBARaw(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > On 2014/05/12 19:36:11, Peter Kasting wrote: > > Seems like this should just disappear, and callers should call > setRGBA<false>(). > > Mentioned that in #15. Was a little slower according to my measurements, so I > left it as is. If the difference isn't large, let's make the change anyway. Better to have less code, and callers all calling one core function to set things, than having duplicated SkPackARGB32NoCheck() calls in multiple places. Though, there really isn't a good reason this ought to be slower. I wonder if somehow the optimizer is getting confused. Does reversing the function declaration order help? In any case, if we don't end up with any calls to setRGBA<false>(), then declaring it as a template arg makes no sense. We should instead just rename setRGBA<true> to setRGBAPremultiply, and setRGBARaw to setRGBANoPremultiply, and be done.
https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... File Source/platform/image-decoders/png/PNGImageDecoder.cpp (right): https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... Source/platform/image-decoders/png/PNGImageDecoder.cpp:498: if (hasAlpha) { On 2014/05/12 19:36:11, Peter Kasting wrote: > Nit: Consider adding a comment here about how we write the code in this more > repetitive form in order to keep comparisons out of the inner loops. Added a comment.
On 2014/05/13 01:54:56, Peter Kasting wrote: > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > File Source/platform/image-decoders/ImageFrame.h (right): > > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > Source/platform/image-decoders/ImageFrame.h:167: if (m_premultiplyAlpha) > On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > > On 2014/05/12 19:36:11, Peter Kasting wrote: > > > We could simplify this to just: > > > > > > setRGBA<m_premultiplyAlpha>(dest, r, g, b, a); > > > > Doesn't compile on any C++ compiler I know. VS2010, VS2013 complain with > > "Template parameter must be a compile-time constant expression". > > Huh, that's odd. I could swear I've done precisely this before. Oh well. I > guess this should just manually call the two forms of setRGBA(), then. (See > below) Nothing odd about it. > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > Source/platform/image-decoders/ImageFrame.h:176: inline void setRGBA(PixelData* > dest, unsigned r, unsigned g, unsigned b, unsigned a) > On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > > On 2014/05/12 19:36:11, Peter Kasting wrote: > > > This can be static, can't it? > > > > Could be, but you'd then add function calling overhead to every pixel. > > Why? Can't something be both static and inline? All marking it in static > should do is make it clear to the reader that we don't touch anything in |this|. I thought you meant static, not static inline. There are a whole bunch of functions like that in ImageFrame (inline but not static, and could be) shall we change them all?
On 2014/05/13 02:01:49, Noel Gordon (Google) wrote: > On 2014/05/13 01:54:56, Peter Kasting wrote: > > > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > > Source/platform/image-decoders/ImageFrame.h:176: inline void > setRGBA(PixelData* > > dest, unsigned r, unsigned g, unsigned b, unsigned a) > > On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > > > On 2014/05/12 19:36:11, Peter Kasting wrote: > > > > This can be static, can't it? > > > > > > Could be, but you'd then add function calling overhead to every pixel. > > > > Why? Can't something be both static and inline? All marking it in static > > should do is make it clear to the reader that we don't touch anything in > |this|. > > I thought you meant static, not static inline. There are a whole bunch of > functions like that in ImageFrame (inline but not static, and could be) shall we > change them all? If something is non-static inline and could just as well be static inline, then yeah, we should change it.
On 2014/05/13 02:03:01, Peter Kasting wrote: > If something is non-static inline and could just as well be static inline, then > yeah, we should change it. Correction, it's one place only so let's do that.
On 2014/05/13 01:54:56, Peter Kasting wrote: > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > File Source/platform/image-decoders/ImageFrame.h (right): > > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > Source/platform/image-decoders/ImageFrame.h:167: if (m_premultiplyAlpha) > On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > > On 2014/05/12 19:36:11, Peter Kasting wrote: > > > We could simplify this to just: > > > > > > setRGBA<m_premultiplyAlpha>(dest, r, g, b, a); > > > > Doesn't compile on any C++ compiler I know. VS2010, VS2013 complain with > > "Template parameter must be a compile-time constant expression". > > Huh, that's odd. I could swear I've done precisely this before. Oh well. I > guess this should just manually call the two forms of setRGBA(), then. (See > below) > > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > Source/platform/image-decoders/ImageFrame.h:176: inline void setRGBA(PixelData* > dest, unsigned r, unsigned g, unsigned b, unsigned a) > On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > > On 2014/05/12 19:36:11, Peter Kasting wrote: > > > This can be static, can't it? > > > > Could be, but you'd then add function calling overhead to every pixel. > > Why? Can't something be both static and inline? All marking it in static > should do is make it clear to the reader that we don't touch anything in |this|. > > https://codereview.chromium.org/18099004/diff/35001/Source/platform/image-dec... > Source/platform/image-decoders/ImageFrame.h:195: inline void > setRGBARaw(PixelData* dest, unsigned r, unsigned g, unsigned b, unsigned a) > On 2014/05/13 01:39:19, Noel Gordon (Google) wrote: > > On 2014/05/12 19:36:11, Peter Kasting wrote: > > > Seems like this should just disappear, and callers should call > > setRGBA<false>(). > > > > Mentioned that in #15. Was a little slower according to my measurements, so I > > left it as is. > > If the difference isn't large, let's make the change anyway. Better to have > less code, and callers all calling one core function to set things, than having > duplicated SkPackARGB32NoCheck() calls in multiple places. > > Though, there really isn't a good reason this ought to be slower. I wonder if > somehow the optimizer is getting confused. Does reversing the function > declaration order help? > > In any case, if we don't end up with any calls to setRGBA<false>(), then > declaring it as a template arg makes no sense. OK lemme replace all setRGBRaw's with setRGBA<false>() to see how that looks.
New patch uploaded, let me know if you want to retain the template or not.
On 2014/05/11 08:44:36, eseidel wrote: > If we have a way to run our image decoding benchmarks on a mobile device, that > sounds very interesting. I believe the perf bots image decode test is run on Nexus 5/4/7. A more interesting use of someone's time might be to upgrade libpng to 1.6.x and turn on its ARM Neon SIMD code. If users have opted into the image transcoding proxy, then png and all other image formats (except webp) don't matter.
On 2014/05/13 02:38:55, Noel Gordon (Google) wrote: > New patch uploaded, let me know if you want to retain the template or not. Can't land the current patch: slows the GIF corpus by a full 6 seconds (72 -> 78).
On 2014/05/13 15:04:42, Noel Gordon (Google) wrote: > On 2014/05/13 02:38:55, Noel Gordon (Google) wrote: > > New patch uploaded, let me know if you want to retain the template or not. > > Can't land the current patch: slows the GIF corpus by a full 6 seconds (72 -> > 78). How can that be, when GIFImageDecoder.cpp doesn't even use setRGBA() or its variants? It writes directly to the framebuffer in haveDecodedRow().
I was suspicious that the changes suggested lead us back to a patch similar to that submitted by b.kelemen, which was later reverted because it regressed GIF decoding according to the perf bots. So I measured again to see if the GIF regression reoccured: seems the answer is yes, based on my measurements on win32 [1] [1] https://docs.google.com/spreadsheets/d/17wNTiwS3KYPjJj9GfyWauPePcmC5e4tBzkkui... I then tried to different ways to shape the code -- use the template or not, static inline or not, restore setRGBRaw or not, re-order the code, and so on -- to see if I could avoid the GIF regression. The answer is to use setRGBARaw, and optionally drop the template / use static inline [1]. On mac, such a change is modestly positive for GIF [2] [2] https://docs.google.com/spreadsheet/ccc?key=0Asd5UUPIMYYsdGxfQjI4UTdhdHUycGpQ... I propose to post a patch based on these findings and let our pref bots decide the matter if we want to submit the patch.
On 2014/05/14 06:33:20, Noel Gordon (Google) wrote: > So I measured again to see if the GIF regression reoccured: seems the answer is > yes, based on my measurements on win32 [1] > > [1] > https://docs.google.com/spreadsheets/d/17wNTiwS3KYPjJj9GfyWauPePcmC5e4tBzkkui... > > I then tried to different ways to shape the code -- use the template or not, > static inline or not, restore setRGBRaw or not, re-order the code, and so on -- > to see if I could avoid the GIF regression. The answer is to use setRGBARaw, and > optionally drop the template / use static inline [1]. On mac, such a change is > modestly positive for GIF [2] But my point is that the GIF decoder _doesn't use any of the code you're touching_. If these different patch forms are affecting measured GIF performance, either something is very wrong with our test harness and it's not measuring what we think it's measuring, or we're not actually using the GIF decoding code in the tree and instead are using something somewhere that actually calls your code. In either case, simply continuing to run patches through the test harness is pointless. The next step is to actually debug the code and step through GIF decoding on a machine that demonstrates the performance difference to figure out how on earth it reaches the code you're touching.
Are you saying that out test-harness on the perf bots are wrong? Ot that the graphs they produced in http://crbug.com/258324 are wrong too?
On 2014/05/14 07:04:59, Noel Gordon (Google) wrote: > Are you saying that out test-harness on the perf bots are wrong? Ot that the > graphs they produced in http://crbug.com/258324 are wrong too? Neither patch ought to have regressed GIF decoding performance, because neither patch is touching code the GIF decoder actually calls. So I think something is very wrong. The only way I can see a real effect happening is via a more subtle mechanism, e.g. somehow this change results in some reordering of executable code in memory that places things the GIF decoder calls into the same page, thus improving the icache hit rate. Things like that. We need to understand this effect, rather than tinkering blindly until the bots spit out a "no regressions" line. Otherwise, we don't know what's happening in the real world and whether the bots are actually measuring it, nor do we understand all the factors potentially impacting performance and thus how to ultimately attain the best performance.
As an ex-compiler engineer, the above should be right up your street. Not really my area (compiler-innards). I placed the test harness in www/~noel/data/png-decode-test-suite.zip if you'd like to grab it and confirm my results. This regression only happens on win, and only when setRGBARaw() is removed from ImageFrame.h. I don't tinker.
On 2014/05/14 07:37:58, Noel Gordon (Google) wrote: > As an ex-compiler engineer, the above should be right up your street. Not > really my area (compiler-innards). I placed the test harness in > www/~noel/data/png-decode-test-suite.zip if you'd like to grab it and confirm my > results. I've never worked on Intel chips, nor did I ever do icache perf debugging. I can run test suites, but I'm not able to look at low-level x86 assembly and make anything of it. > This regression only happens on win, and only when setRGBARaw() is > removed from ImageFrame.h. I don't tinker. Please don't take my comments as some sort of personal attack. I believe you're seeing something. I just don't believe what you're seeing means what it's supposed to mean.
On 2014/05/14 15:51:47, Peter Kasting wrote: > On 2014/05/14 07:37:58, Noel Gordon (Google) wrote: > > As an ex-compiler engineer, the above should be right up your street. Not > > really my area (compiler-innards). I placed the test harness in > > www/~noel/data/png-decode-test-suite.zip if you'd like to grab it and confirm > my > > results. > > I've never worked on Intel chips, nor did I ever do icache perf debugging. I > can run test suites, but I'm not able to look at low-level x86 assembly and make > anything of it. > > > This regression only happens on win, and only when setRGBARaw() is > > removed from ImageFrame.h. I don't tinker. > > Please don't take my comments as some sort of personal attack. I believe you're > seeing something. I just don't believe what you're seeing means what it's > supposed to mean. Let's everyone please take a deep breath. :) Just from code examination it does look like the GIFImageDecoder doesn't use the setRGBA methods at all, so it is strange that this CL would affect performance of that suite at all. Peter's point about putting a breakpoint in these methods in a debug build and running one of the GIF decoding tests is a good one, just to confirm our understanding. The use of "static inline" in this header seems strange to me. Since the compiler should inline everything, the passing of the "this" pointer should be eliminated. Perhaps the "static inline" directive is causing the compiler to put useless definitions in .obj files all over the place, changing the code layout in the final linked executable? I also don't understand why the forward use of setRGBAPremultiply without a declaration compiles at all. Noel, could you possibly try replacing "static inline" with "inline", moving setRGBAPremultiply up above setRGBA, re-confirming that the PNG speedup is still achieved, and going forward with the commit? Peter, will you approve his patch assuming the PNG speedup is still present? I think at some level we need to just rely on the perf bots to find any regressions. There are enough differences between typical developer builds and official builds (incremental linking, whole program optimizations, etc.) that I think it's difficult to reliably reproduce small performance deviations with developer builds.
On 2014/05/14 20:16:35, Ken Russell wrote: > Peter, will you approve his patch assuming the PNG speedup is still present? > > I think at some level we need to just rely on the perf bots to find any > regressions. There are enough differences between typical developer builds and > official builds (incremental linking, whole program optimizations, etc.) that I > think it's difficult to reliably reproduce small performance deviations with > developer builds. I think it's critical for us to first understand what is creating the apparent GIF perf regression. We may well find that the perf tests aren't testing what we thought they were testing. I had intended to do the "set a breakpoint and see if it's hit" test myself later, but I got a bunch of things pushed onto my stack ahead of that :(
On 2014/05/14 20:16:35, Ken Russell wrote: > Noel, could you possibly try replacing "static inline" with "inline", moving > setRGBAPremultiply up above setRGBA, re-confirming that the PNG speedup is still > achieved, and going forward with the commit? I had already measured these cases in the spreadsheet [1], the sheet names give the results, which are summarized here. Test of no-static vs static: png-no-check-raw-no-static 254.696 png-no-check-raw-static-final 255.634 gif-no-check-raw-no-static 72.146 gif-no-check-raw-static-final 72.015 xxx-*-final is the current patch, where I retained the static, it's maybe a wash. Test code order: where I moved both setRGBAPremultiply and setRGBARaw above the setRGBA routine definitions: png-no-check-raw-no-static-order 255.715 gif-no-check-raw-no-static-order 72.349 [1] https://docs.google.com/spreadsheets/d/17wNTiwS3KYPjJj9GfyWauPePcmC5e4tBzkkui...
On 2014/05/14 20:28:19, Peter Kasting wrote: > On 2014/05/14 20:16:35, Ken Russell wrote: > > Peter, will you approve his patch assuming the PNG speedup is still present? > > > > I think at some level we need to just rely on the perf bots to find any > > regressions. There are enough differences between typical developer builds and > > official builds (incremental linking, whole program optimizations, etc.) that > I > > think it's difficult to reliably reproduce small performance deviations with > > developer builds. > > I think it's critical for us to first understand what is creating the apparent > GIF perf regression. We may well find that the perf tests aren't testing what > we thought they were testing. > > I had intended to do the "set a breakpoint and see if it's hit" test myself > later, but I got a bunch of things pushed onto my stack ahead of that :( Put break-points on all ImageFrame::setRGBA*, and decoded a GIF in debug chrome with the VS2013 debugger attached to the renderer: no break-points hit. Applied the current patch, re-compiled, and retested: no break-points hit.
On 2014/05/16 04:06:02, Noel Gordon (Google) wrote: > Put break-points on all ImageFrame::setRGBA*, and decoded a GIF in debug chrome > with the VS2013 debugger attached to the renderer: no break-points hit. Applied > the current patch, re-compiled, and retested: no break-points hit. Darn, I was hoping that somehow we were reaching this and then could just evaluate whether those callstacks made sense. If we're indeed not reaching this, it makes it harder to track down. I wonder whether reordering ImageFrame functions, or adding a bunch more functions to it, would have any effect. In principle, it seems like we don't want our perf tests to be sensitive to things like the code order here, because it would mean there's a higher chance of them acting flaky, or performance in the field varying from what we test (especially on other chips, e.g. AMD vs. Intel). Right? Ken, how do we proceed here? Are there folks with VTune experience that could help track this down?
Wait a minute, I just had an idea. Maybe we're seeing some of the effects of link-time code folding/optimization. GIFImageDecoder::haveDecodedRow() writes directly to the framebuffer instead of calling one of the setRGBA() functions to do the write. What if we first changed that function to use the appropriate setRGBA() function (which I think would always be setRGBARaw()). Then after that, maybe we wouldn't see so much variation between these different patches.
On 2014/05/16 20:36:10, Peter Kasting wrote: > Maybe we're seeing some of the effects of link-time code folding/optimization. > GIFImageDecoder::haveDecodedRow() writes directly to the framebuffer instead of > calling one of the setRGBA() functions to do the write. Yes. Using the GIF color pixel look-up table. (that is the fastest way to decode rows, btw, write direct to the frame buffer). > What if we first > changed that function to use the appropriate setRGBA() function (which I think > would always be setRGBARaw()). Then after that, maybe we wouldn't see so much > variation between these different patches. The GIF color table already has (pre-packed and premultiplied) pixel values, and the decoder reads from that table when it needs to write a decoded pixel to the framebuffer. The write code could changed I suppose, and yes, setRGBARaw() is the right routine to use, but its use would mean unpacking the pixel read from the color table, just to pass it to setRGBARaw(), which just repacks the pixel all over again. Concretely, in the innards of the GIF decoder pixel write loop, we'd see: // *currentAddress = colorTableIter[sourceValue]; buffer.setRGBARaw(currentAddress, SkGetPackedR32(colorTableIter[sourceValue]), SkGetPackedG32(colorTableIter[sourceValue]), SkGetPackedB32(colorTableIter[sourceValue]), SkGetPackedA32(colorTableIter[sourceValue])); Seems like it'd cost performance -- regression was in fact 72.0- > 85.0 secs for the GIF corpus.
On 2014/05/19 04:48:04, Noel Gordon (Google) wrote: > The GIF color table already has (pre-packed and premultiplied) pixel values, and > the decoder reads from that table when it needs to write a decoded pixel to the > framebuffer. The write code could changed I suppose, and yes, setRGBARaw() is > the right routine to use, but its use would mean unpacking the pixel read from > the color table, just to pass it to setRGBARaw(), which just repacks the pixel > all over again. Well, I don't know what to tell you then. We're back to having no easily-testable reason why GIF performance should appear to be affected by your patches. I really, really don't want to just say "well, no one understand why any of the numbers say what they do, but we believe they're accurate, so we're going to land patch X and move on". At the same time, I'm not personally capable of diving into the generated assembly here and really figuring out what's up. Do you have any ideas?
If the pref bots can't be trusted, I don't know who to trust anymore. I'm out of ideas here.
You can try reaching out to the speed team (tonyg and co) if you need perf testing assistance. But it sounds like this CL is DOA until we have a way to show that it makes a perf improvement.
On 2014/05/28 23:54:31, eseidel wrote: > You can try reaching out to the speed team (tonyg and co) if you need perf > testing assistance. But it sounds like this CL is DOA until we have a way to > show that it makes a perf improvement. It's not DOA; I'm willing to land it if people like Ken and Tony truly think it's likely to be a win. I just want the people who own this infrastructure to first have the opportunity to examine why these surprising results are happening, so if at all possible we can understand them or decide that they mean the current results aren't trustworthy.
Presumably by ken you mean kbr? ccing.
On 2014/05/28 23:54:31, eseidel wrote: > You can try reaching out to the speed team (tonyg and co) if you need perf > testing assistance. But it sounds like this CL is DOA until we have a way to > show that it makes a perf improvement. I didn't read all the context on this change, but I'd expect to see raw image decode perf changes here: $ tools/perf/run_benchmark --browser=android-content-shell image_decoding.image_decoding_measurement It is pretty microbenchmarky. If there're any tradeoffs involved, it would be more compelling to see if any smoothness or loading benchmarks are affected.
On 2014/05/29 00:04:39, tonyg wrote: > I didn't read all the context on this change, Understandable (long discussion). > but I'd expect to see raw image decode perf changes here: > $ tools/perf/run_benchmark --browser=android-content-shell > image_decoding.image_decoding_measurement One thing to note was that the pref regression was observed on win32 only: PNG decoding improved, but GIF decoding regressed. There's a picture of regression on https://crbug.com/258324 I could run the android perf tests I suppose, but I'm not sure they'd help shed light on why GIF decoding regressed on win32. The win32 perf bots spotted the regression and the original patch out was rolled out, creating bug 258324. The goal of the current patch is to avoid the GIF perf regression.
> On 2014/05/29 00:04:39, tonyg wrote: > > > > but I'd expect to see raw image decode perf changes here: > > $ tools/perf/run_benchmark --browser=android-content-shell > > image_decoding.image_decoding_measurement +tonyg :) On 2014/05/29 13:03:24, Noel Gordon (Google) wrote: > I could run the android perf tests I suppose, but I'm not sure they'd help shed > light on why GIF decoding regressed on win32. Ran the test on a nexus 7, seems to measure the temperature of the board. Where is this test running on the perf bots btw? Best I can tell, there's no useful data for the test that I could find. And the same for other decoding tests: image_decoding.tough_decoding_cases, image_decoding.tough_decoding_case. What am I missing?
This change should be landed. It yields a dramatic speedup in PNG decoding, measured by Noel above. LGTM
Filed issue 381548 about the perf bots reporting no data for the image decode tests.
The CQ bit was checked by noel@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/noel@chromium.org/18099004/115001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: win_gpu_triggered_tests on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: win_gpu_triggered_tests on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered...)
The CQ bit was checked by noel@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/noel@chromium.org/18099004/135001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: blink_presubmit on tryserver.blink (http://build.chromium.org/p/tryserver.blink/builders/blink_presubmit/builds/7150)
Message was sent while issue was closed.
Change committed as 175666 |