|
|
Created:
7 years, 6 months ago by kbalazs_ Modified:
7 years, 5 months ago CC:
blink-reviews, eae+blinkwatch, jeez, Stephen White Base URL:
https://chromium.googlesource.com/chromium/blink.git@master Visibility:
Public. |
DescriptionSpeed up png decoding by eliminating branches from the core loop that fills ImageFrame with pixels
The original idea came from the work of Allan Jensen landed in WebKit as r150252.
He eliminated the branches in PNGImageDecoder::rowAvailable to make it benefit from
auto vectorization. I tweaked it further to make this change a performance win
without auto vectorization as well (which we do not use by default, only with -O3 or
-ftree-vectorize). I measured the performance gain with callgrind by comparing the
inclusive cost of the affected function. I used simple local pages with a few image
for benchmarking.
Results:
* opaque PNG: 7.8% faster
* PNG with alpha: 24% faster
R=thakis@chromium.org
Committed: https://src.chromium.org/viewvc/blink?view=rev&revision=153004
Patch Set 1 #
Total comments: 2
Patch Set 2 : Speed up png decoding by eliminating branches from the core loop that fills ImageFrame with pixels #Patch Set 3 : Speed up png decoding by eliminating branches from the core loop that fills ImageFrame with pixels #
Total comments: 3
Messages
Total messages: 23 (0 generated)
Continued from https://codereview.chromium.org/15466004/
lgtm since this is preserving the existing logic. I'd remove the mention of -O2 from the CL description since we use -O3 on mac and the best optimization msvc can muster on win. In general, we can probably simplify this a lot now that the only graphics backend is skia. I think skia always wants premultiplied data; the only caller passing ImageSource::AlphaNotPremultiplied is GraphicsContext3D.cpp and maybe that could be removed too. https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-de... File Source/core/platform/image-decoders/ImageDecoder.h (right): https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-de... Source/core/platform/image-decoders/ImageDecoder.h:208: *dest = SkPackARGB32NoCheck(a, r, g, b); Maybe the body of this function could be just *dest = SkPremultiplyARGBInline(a, r, g, b); ?
On 2013/06/24 17:54:15, Nico wrote: > lgtm since this is preserving the existing logic. > > I'd remove the mention of -O2 from the CL description since we use -O3 on mac > and the best optimization msvc can muster on win. > > In general, we can probably simplify this a lot now that the only graphics > backend is skia. I think skia always wants premultiplied data; the only caller > passing ImageSource::AlphaNotPremultiplied is GraphicsContext3D.cpp and maybe > that could be removed too. > > https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-de... > File Source/core/platform/image-decoders/ImageDecoder.h (right): > > https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-de... > Source/core/platform/image-decoders/ImageDecoder.h:208: *dest = > SkPackARGB32NoCheck(a, r, g, b); > Maybe the body of this function could be just > > *dest = SkPremultiplyARGBInline(a, r, g, b); > > ? I'm not sure which is better but it does more operation. It does one multiplication, 2 addition and 2 shift for each channel while currently we do just one multiplication and a shift for each channel plus one additional multiplication. I'm going to check which is the faster (at least on desktop where I can test) and land the better one.
On Mon, Jun 24, 2013 at 11:31 AM, <b.kelemen@sisa.samsung.com> wrote: > On 2013/06/24 17:54:15, Nico wrote: > >> lgtm since this is preserving the existing logic. >> > > I'd remove the mention of -O2 from the CL description since we use -O3 on >> mac >> and the best optimization msvc can muster on win. >> > > In general, we can probably simplify this a lot now that the only graphics >> backend is skia. I think skia always wants premultiplied data; the only >> caller >> passing ImageSource::**AlphaNotPremultiplied is GraphicsContext3D.cpp >> and maybe >> that could be removed too. >> > > > https://codereview.chromium.**org/17448010/diff/1/Source/** > core/platform/image-decoders/**ImageDecoder.h<https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-decoders/ImageDecoder.h> > >> File Source/core/platform/image-**decoders/ImageDecoder.h (right): >> > > > https://codereview.chromium.**org/17448010/diff/1/Source/** > core/platform/image-decoders/**ImageDecoder.h#newcode208<https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-decoders/ImageDecoder.h#newcode208> > >> Source/core/platform/image-**decoders/ImageDecoder.h:208: *dest = >> SkPackARGB32NoCheck(a, r, g, b); >> Maybe the body of this function could be just >> > > *dest = SkPremultiplyARGBInline(a, r, g, b); >> > > ? >> > > I'm not sure which is better but it does more operation. It does one > multiplication, 2 addition and 2 shift for each channel while currently we > do > just one multiplication and a shift for each channel plus one additional > multiplication. I'm going to check which is the faster (at least on desktop > where I can test) and land the better one. > Sounds good. You can also land this as is and switch to SkPremultiplyARGBInline() in a follow-up if it turns out to be faster. > > https://codereview.chromium.**org/17448010/<https://codereview.chromium.org/1... >
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/b.kelemen@sisa.samsung.com/17448010/1
> > Sounds good. You can also land this as is and switch to > SkPremultiplyARGBInline() in a follow-up if it turns out to be faster. > The current procedure in ImageDecoder.h is slightly faster.
Retried try job too often on mac_layout_rel for step(s) webkit_tests http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=mac_layout...
https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-de... File Source/core/platform/image-decoders/png/PNGImageDecoder.cpp (right): https://codereview.chromium.org/17448010/diff/1/Source/core/platform/image-de... Source/core/platform/image-decoders/png/PNGImageDecoder.cpp:515: buffer.setHasAlpha(hasNonTrivialAlpha); Oops, this war wrong, we cannot make !hasAlpha() just because there was no alpha in the current line.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/b.kelemen@sisa.samsung.com/17448010/27002
Message was sent while issue was closed.
Change committed as 153004
Message was sent while issue was closed.
Hey, any reason Peter or myself we not consulted here?
Message was sent while issue was closed.
https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... File Source/core/platform/image-decoders/ImageDecoder.h (right): https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... Source/core/platform/image-decoders/ImageDecoder.h:195: if (!a) { Why did you change the shape of this routine? The test of 0 was done after the a< 255 test before and for good reason. Why do we need the ASSERT at all (the routine does what it says)? https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... Source/core/platform/image-decoders/ImageDecoder.h:213: ASSERT(!m_premultiplyAlpha); Ditto re the ASSERT, why do we need it all when the routine does what it says? https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... File Source/core/platform/image-decoders/png/PNGImageDecoder.cpp (right): https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... Source/core/platform/image-decoders/png/PNGImageDecoder.cpp:500: buffer.setRGBAPremultiply(address + x, pixel[0], pixel[1], pixel[2], pixel[3]); address++ not working for you? Why address + x or is there some measured benefit (I don't see any mention of it in the change log).
Message was sent while issue was closed.
And a as practical matter, the cost of writing row data is about 10-15% of the over decoding cost of a PNG image in my experience. Some timings here https://bugs.webkit.org/show_bug.cgi?id=103216 for example, prior to our moving the code to the even faster fixed point multiply. Yet this patch claims PNG with alpha: 24% faster?
Message was sent while issue was closed.
On 2013/06/26 04:58:16, Noel Gordon (Google) wrote: > Hey, any reason Peter or myself we not consulted here? Sorry for that, I was messing up the email addresses (and your names actually). My original intention was to make you review this. Sorry for both of you.
Message was sent while issue was closed.
On 2013/06/26 05:15:26, Noel Gordon (Google) wrote: > https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... > File Source/core/platform/image-decoders/ImageDecoder.h (right): > > https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... > Source/core/platform/image-decoders/ImageDecoder.h:195: if (!a) { > Why did you change the shape of this routine? The test of 0 was done after the > a< 255 test before and for good reason. Why do we need the ASSERT at all (the > routine does what it says)? The assert is to make sure callers not call it my mistake. Maybe it's not worth. I was not considering that I reordered the test. > https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... > Source/core/platform/image-decoders/png/PNGImageDecoder.cpp:500: > buffer.setRGBAPremultiply(address + x, pixel[0], pixel[1], pixel[2], pixel[3]); > address++ not working for you? Why address + x or is there some measured > benefit (I don't see any mention of it in the change log). Yes I think address + x can be faster (although I can't tell you how much right now). At least in a previous version of this patch I experienced that using a running index (dst[i]) is faster than *dst++.
Message was sent while issue was closed.
On 2013/06/26 06:06:34, Noel Gordon (Google) wrote: > And a as practical matter, the cost of writing row data is about 10-15% of the > over decoding cost of a PNG image in my experience. Some timings here > https://bugs.webkit.org/show_bug.cgi?id=103216 for example, prior to our moving > the code to the even faster fixed point multiply. Yet this patch claims PNG > with alpha: 24% faster? This is what I was measuring by the method I explained in the commit message. What valgrind measures is executed instructions, so the actual time const can be different (yes it was a mistake to just say it 24% faster). Still I believe this is a notable speedup.
Message was sent while issue was closed.
On 2013/06/26 14:31:03, kbalazs wrote: > On 2013/06/26 06:06:34, Noel Gordon (Google) wrote: > > And a as practical matter, the cost of writing row data is about 10-15% of the > > over decoding cost of a PNG image in my experience. Some timings here > > https://bugs.webkit.org/show_bug.cgi?id=103216 for example, prior to our > moving > > the code to the even faster fixed point multiply. Yet this patch claims PNG > > with alpha: 24% faster? > > This is what I was measuring by the method I explained in the commit message. > What valgrind measures is executed instructions, so the actual time const can be > different (yes it was a mistake to just say it 24% faster). Still I believe this > is a notable speedup. I see now. In https://bugs.webkit.org/show_bug.cgi?id=103216 you measured the full cost of decoding and the win on it. I only compared the cost of the rowAvailable function (ref/patched). I don't think I stated otherwise (but sorry if I was not clear).
Message was sent while issue was closed.
On 2013/06/26 14:22:26, kbalazs wrote: > On 2013/06/26 04:58:16, Noel Gordon (Google) wrote: > > Hey, any reason Peter or myself we not consulted here? > > Sorry for that, I was messing up the email addresses (and your names actually). > My original intention was to make you review this. Sorry for both of you. Ok thanks for explanation. The "Publish and mail" reviewer box should auto-complete our names. It it slow sometimes, enough to appear broken.
Message was sent while issue was closed.
On 2013/06/26 14:26:24, kbalazs wrote: > https://chromiumcodereview.appspot.com/17448010/diff/27002/Source/core/platfo... > > Source/core/platform/image-decoders/png/PNGImageDecoder.cpp:500: > > buffer.setRGBAPremultiply(address + x, pixel[0], pixel[1], pixel[2],pixel[3]); > > address++ not working for you? Why address + x or is there some measured > > benefit (I don't see any mention of it in the change log). > > Yes I think address + x can be faster (although I can't tell you how much right > now). At least in a previous version of this patch I experienced that using a > running index (dst[i]) is faster than *dst++. I'd love to see some data on that if you have it. We use the ++ form is a few places.
Message was sent while issue was closed.
On 2013/06/26 15:00:09, kbalazs wrote: > > I see now. In https://bugs.webkit.org/show_bug.cgi?id=103216 you measured the > full cost of decoding and the win on it. Yes, measured the total decoding time, and how much of that time was spent writing row data in rowAvailable(). I see now that your stated 24% perf increase ... > I only compared the cost of the rowAvailable function (ref/patched). I don't > think I stated otherwise (but sorry if I was not clear). was for rowAvailable(). That would translate to ~2% improvement in decode speed for PNG based on the result seen in that webkit bug.
Message was sent while issue was closed.
Patch regressed GIF decode perf https://code.google.com/p/chromium/issues/detail?id=254462
Message was sent while issue was closed.
https://codereview.chromium.org/17906002 fix attempt
Message was sent while issue was closed.
Patch regressed JPEG decode perf on some Androids (eg., Nexus 4 uses the system JPEG). |