Issue 1563393002: Use SkOpts routines in SkSwizzler

msarett

Description was changed from ========== Use SkOpts routines in SkSwizzler This is a bit tricky ...

4 years, 11 months ago (2016-01-07 23:44:10 UTC) #1

msarett

msarett@google.com changed reviewers: + mtklein@google.com, scroggo@google.com

4 years, 11 months ago (2016-01-08 13:55:59 UTC) #3

msarett

No performance improvements yet. We haven't added the optimized routines to SkOpts yet. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File ...

4 years, 11 months ago (2016-01-08 13:56:00 UTC) #4

mtklein

What sampling rates would we need in practice? It'd be nice to keep those in ...

4 years, 11 months ago (2016-01-08 15:20:20 UTC) #5

What sampling rates would we need in practice?  It'd be nice to keep those in
mind while we write up the SIMD code.  Small powers of 1/2 shouldn't be terrible
to implement.

Have you measured perf, or are we just assuming there's no difference?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
File src/codec/SkSwizzler.cpp (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ;
On 2016/01/08 13:56:00, msarett wrote:
> We'll want to benchmark against the regular version and the "skip zeros"
> version.  This is a good time to find out the benefit of skipping zeros.

There's more to this feature than just speed, right?  Even if we can go as fast,
seems like the feature is really to avoid physically touching those bytes so
that those pages stay backed by a shared zero page instead of their own physical
RAM?  (Note I am personally still skeptical that optimizing for ~1000 pixel
transparent blocks in Skia is better than, say, cropping the images...)

Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or
AVX2), but I suspect it's the sort of thing that will completely hose NEON
performance (no movemask/ptest, and it's quite slow to communicate from NEON
back to ARM).  Still, if we must do it, it seems worth trying.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:734: } else {
If I'm right above about this being more than just a speed thing, can't we only
use the opt_ method inside the else here?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:825: int SkSwizzler::onSetSampleX(int sampleX) {
I don't suppose x-sampling can move to the constructor?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:835: fRowProc = fScalarRowProc;
This preference logic might be simpler to follow generally if we tweak it a
little.  Today it seems like the logic is "use optProc unless we can't, then
fall back to scalarProc".  What if we write it as "use optProc if we have one,
defaulting to scalarProc"?  I.e. make having optProc the special case, rather
than not having optProc?

That'd make this line something like

   fOptProc = nullptr;

and I think it might be less intrusive for the rest of SkSwizzler.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h
File src/codec/SkSwizzler.h (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h#...
src/codec/SkSwizzler.h:199: const RowProc       fScalarRowProc;
Scalar sounds like SkScalar to me.  What about just fRowProc + fFastRowProc?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h
File src/codec/SkSwizzlerOpts.h (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:11: // This is quite common for PNGs.
My preference would be to plop this function in SkSwizzler.cpp among the rest of
the RowProcs.  Putting this into a header makes me wonder, why is it in a
header, are there two files that need to call this?  The scope of a static
function in a .cpp file, on the other hand, is pretty darn clear.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:14: int offset, const SkPMColor ctable[]) {
Have you ever thought about ditching offset as a parameter to these procs?
You don't ever do anything other than start off with src += offset, right?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:20: SkOpts::premul_xxxa((uint32_t*) dst, (uint32_t*)
(src + offset), width);
probably clearest to use (const uint32_t*)(src + offset) in these calls.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:26: }
We can also do UPM RGBA -> UPM N32 right?

#ifdef SK_PMCOLOR_IS_RGBA
  memmove(...)
#else
  SkOpts::swaprb_xxxa(...)
#endif

scroggo

> What sampling rates would we need in practice? It'd be nice to keep those ...

4 years, 11 months ago (2016-01-08 16:02:57 UTC) #6

> What sampling rates would we need in practice?  It'd be nice to keep those in
> mind while we write up the SIMD code.  Small powers of 1/2 shouldn't be
terrible
> to implement.

The common ones used by Android will be powers of 2 sampling (see
http://developer.android.com/reference/android/graphics/BitmapFactory.Options...),
but I've seen really big ones (e.g. 64)

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
File src/codec/SkSwizzler.cpp (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ;
On 2016/01/08 15:20:19, mtklein wrote:
> On 2016/01/08 13:56:00, msarett wrote:
> > We'll want to benchmark against the regular version and the "skip zeros"
> > version.  This is a good time to find out the benefit of skipping zeros.
> 
> There's more to this feature than just speed, right? 

Yes.

> Even if we can go as fast,
> seems like the feature is really to avoid physically touching those bytes so
> that those pages stay backed by a shared zero page instead of their own
physical
> RAM?  (Note I am personally still skeptical that optimizing for ~1000 pixel
> transparent blocks in Skia is better than, say, cropping the images...)

+1. But I think we'll need to update lots of client code in order to crop the
images instead. Not saying that it's not worth doing, but we'll need to change
the framework (and probably get buy-in, because they'll need to continue doing
it the new way) *and* make apps update to the new way. I think the right way to
switch is to still support this strange optimization but show the all-around
benefits of cropping the images, so people will opt-in. If we were to drop the
feature, suddenly Android devices may have a lot less free memory.

> 
> Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or
> AVX2), but I suspect it's the sort of thing that will completely hose NEON
> performance (no movemask/ptest, and it's quite slow to communicate from NEON
> back to ARM).  Still, if we must do it, it seems worth trying.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:776: if (nullptr == optProc) {
So are we guaranteed that there's an optProc, but there might also be a
scalarProc?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:825: int SkSwizzler::onSetSampleX(int sampleX) {
On 2016/01/08 15:20:19, mtklein wrote:
> I don't suppose x-sampling can move to the constructor?

It used to be in the constructor, but we moved it out in crrev.com/1372973002.
This allowed us to factor out sampling from the codec implementations.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h
File src/codec/SkSwizzlerOpts.h (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:24: // FIXME: Should we continue to support
reallyHasAlpha()?
I thought we had already decided not to.

mtklein

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp#newcode733 src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 16:02:56, scroggo wrote: > ...

4 years, 11 months ago (2016-01-08 16:23:01 UTC) #7

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
File src/codec/SkSwizzler.cpp (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ;
On 2016/01/08 16:02:56, scroggo wrote:
> On 2016/01/08 15:20:19, mtklein wrote:
> > On 2016/01/08 13:56:00, msarett wrote:
> > > We'll want to benchmark against the regular version and the "skip zeros"
> > > version.  This is a good time to find out the benefit of skipping zeros.
> > 
> > There's more to this feature than just speed, right? 
> 
> Yes.
> 
> > Even if we can go as fast,
> > seems like the feature is really to avoid physically touching those bytes so
> > that those pages stay backed by a shared zero page instead of their own
> physical
> > RAM?  (Note I am personally still skeptical that optimizing for ~1000 pixel
> > transparent blocks in Skia is better than, say, cropping the images...)
> 
> +1. But I think we'll need to update lots of client code in order to crop the
> images instead. Not saying that it's not worth doing, but we'll need to change
> the framework (and probably get buy-in, because they'll need to continue doing
> it the new way) *and* make apps update to the new way. I think the right way
to
> switch is to still support this strange optimization but show the all-around
> benefits of cropping the images, so people will opt-in. If we were to drop the
> feature, suddenly Android devices may have a lot less free memory.
> 
> > 
> > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or
> > AVX2), but I suspect it's the sort of thing that will completely hose NEON
> > performance (no movemask/ptest, and it's quite slow to communicate from NEON
> > back to ARM).  Still, if we must do it, it seems worth trying.
> 

And we're certain there really are runs of >=1024 transparent pixels right?  If
we touch any byte on a (4K) page, we might as well touch them all as far as
memory savings go.  1024 transparent pixels is the smallest value that makes
this worth doing, and that's when we're perfectly aligned to a page, which may
be optimistic.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:776: if (nullptr == optProc) {
On 2016/01/08 16:02:57, scroggo wrote:
> So are we guaranteed that there's an optProc, but there might also be a
> scalarProc?

(That's what I find confusing.  In reality, we're guaranteed there's a
general-case proc, and the optimized procs are sparse.  This code makes it feel
the other way around.)

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h
File src/codec/SkSwizzlerOpts.h (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:24: // FIXME: Should we continue to support
reallyHasAlpha()?
On 2016/01/08 16:02:57, scroggo wrote:
> I thought we had already decided not to. 

If we haven't yet, let's decide now to drop it unless later proven necessary.

scroggo

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp#newcode733 src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 16:23:01, mtklein wrote: > ...

4 years, 11 months ago (2016-01-08 16:50:51 UTC) #8

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
File src/codec/SkSwizzler.cpp (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ;
On 2016/01/08 16:23:01, mtklein wrote:
> On 2016/01/08 16:02:56, scroggo wrote:
> > On 2016/01/08 15:20:19, mtklein wrote:
> > > On 2016/01/08 13:56:00, msarett wrote:
> > > > We'll want to benchmark against the regular version and the "skip zeros"
> > > > version.  This is a good time to find out the benefit of skipping zeros.
> > > 
> > > There's more to this feature than just speed, right? 
> > 
> > Yes.
> > 
> > > Even if we can go as fast,
> > > seems like the feature is really to avoid physically touching those bytes
so
> > > that those pages stay backed by a shared zero page instead of their own
> > physical
> > > RAM?  (Note I am personally still skeptical that optimizing for ~1000
pixel
> > > transparent blocks in Skia is better than, say, cropping the images...)
> > 
> > +1. But I think we'll need to update lots of client code in order to crop
the
> > images instead. Not saying that it's not worth doing, but we'll need to
change
> > the framework (and probably get buy-in, because they'll need to continue
doing
> > it the new way) *and* make apps update to the new way. I think the right way
> to
> > switch is to still support this strange optimization but show the all-around
> > benefits of cropping the images, so people will opt-in. If we were to drop
the
> > feature, suddenly Android devices may have a lot less free memory.
> > 
> > > 
> > > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+
or
> > > AVX2), but I suspect it's the sort of thing that will completely hose NEON
> > > performance (no movemask/ptest, and it's quite slow to communicate from
NEON
> > > back to ARM).  Still, if we must do it, it seems worth trying.
> > 
> 
> And we're certain there really are runs of >=1024 transparent pixels right? 
If
> we touch any byte on a (4K) page, we might as well touch them all as far as
> memory savings go.  1024 transparent pixels is the smallest value that makes
> this worth doing, and that's when we're perfectly aligned to a page, which may
> be optimistic.

Yes. (I can dig up the bug, but when the change first landed it saved many MB of
RAM, IIRC.) For all around simplicity, I did not make any checks to see how many
consecutive transparent pixels there are. I'm not opposed to making the check if
we can do it efficiently.

mtklein

On 2016/01/08 16:50:51, scroggo wrote: > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp > File src/codec/SkSwizzler.cpp (right): > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp#newcode733 > ...

4 years, 11 months ago (2016-01-08 17:13:15 UTC) #9

On 2016/01/08 16:50:51, scroggo wrote:
> https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
> File src/codec/SkSwizzler.cpp (right):
> 
>
https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
> src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ;
> On 2016/01/08 16:23:01, mtklein wrote:
> > On 2016/01/08 16:02:56, scroggo wrote:
> > > On 2016/01/08 15:20:19, mtklein wrote:
> > > > On 2016/01/08 13:56:00, msarett wrote:
> > > > > We'll want to benchmark against the regular version and the "skip
zeros"
> > > > > version.  This is a good time to find out the benefit of skipping
zeros.
> > > > 
> > > > There's more to this feature than just speed, right? 
> > > 
> > > Yes.
> > > 
> > > > Even if we can go as fast,
> > > > seems like the feature is really to avoid physically touching those
bytes
> so
> > > > that those pages stay backed by a shared zero page instead of their own
> > > physical
> > > > RAM?  (Note I am personally still skeptical that optimizing for ~1000
> pixel
> > > > transparent blocks in Skia is better than, say, cropping the images...)
> > > 
> > > +1. But I think we'll need to update lots of client code in order to crop
> the
> > > images instead. Not saying that it's not worth doing, but we'll need to
> change
> > > the framework (and probably get buy-in, because they'll need to continue
> doing
> > > it the new way) *and* make apps update to the new way. I think the right
way
> > to
> > > switch is to still support this strange optimization but show the
all-around
> > > benefits of cropping the images, so people will opt-in. If we were to drop
> the
> > > feature, suddenly Android devices may have a lot less free memory.
> > > 
> > > > 
> > > > Skipping zeros is something we could do well in SSE2+ (especially
SSE4.1+
> or
> > > > AVX2), but I suspect it's the sort of thing that will completely hose
NEON
> > > > performance (no movemask/ptest, and it's quite slow to communicate from
> NEON
> > > > back to ARM).  Still, if we must do it, it seems worth trying.
> > > 
> > 
> > And we're certain there really are runs of >=1024 transparent pixels right? 
> If
> > we touch any byte on a (4K) page, we might as well touch them all as far as
> > memory savings go.  1024 transparent pixels is the smallest value that makes
> > this worth doing, and that's when we're perfectly aligned to a page, which
may
> > be optimistic.
> 
> Yes. (I can dig up the bug, but when the change first landed it saved many MB
of
> RAM, IIRC.) For all around simplicity, I did not make any checks to see how
many
> consecutive transparent pixels there are. I'm not opposed to making the check
if
> we can do it efficiently.

OK, I'm game to try.  I'm pretty confident we can preserve the write-no-zeros
behavior on x86 while speeding up the code.  I don't know if we can speed up the
code on ARM while preserving write-no-zeros behavior, but we can experiment.

mtklein

On 2016/01/08 17:13:15, mtklein wrote: > On 2016/01/08 16:50:51, scroggo wrote: > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp > ...

4 years, 11 months ago (2016-01-08 17:33:36 UTC) #10

On 2016/01/08 17:13:15, mtklein wrote:
> On 2016/01/08 16:50:51, scroggo wrote:
> >
https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
> > File src/codec/SkSwizzler.cpp (right):
> > 
> >
>
https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
> > src/codec/SkSwizzler.cpp:733: scalarProc =
&swizzle_rgba_to_n32_premul_skipZ;
> > On 2016/01/08 16:23:01, mtklein wrote:
> > > On 2016/01/08 16:02:56, scroggo wrote:
> > > > On 2016/01/08 15:20:19, mtklein wrote:
> > > > > On 2016/01/08 13:56:00, msarett wrote:
> > > > > > We'll want to benchmark against the regular version and the "skip
> zeros"
> > > > > > version.  This is a good time to find out the benefit of skipping
> zeros.
> > > > > 
> > > > > There's more to this feature than just speed, right? 
> > > > 
> > > > Yes.
> > > > 
> > > > > Even if we can go as fast,
> > > > > seems like the feature is really to avoid physically touching those
> bytes
> > so
> > > > > that those pages stay backed by a shared zero page instead of their
own
> > > > physical
> > > > > RAM?  (Note I am personally still skeptical that optimizing for ~1000
> > pixel
> > > > > transparent blocks in Skia is better than, say, cropping the
images...)
> > > > 
> > > > +1. But I think we'll need to update lots of client code in order to
crop
> > the
> > > > images instead. Not saying that it's not worth doing, but we'll need to
> > change
> > > > the framework (and probably get buy-in, because they'll need to continue
> > doing
> > > > it the new way) *and* make apps update to the new way. I think the right
> way
> > > to
> > > > switch is to still support this strange optimization but show the
> all-around
> > > > benefits of cropping the images, so people will opt-in. If we were to
drop
> > the
> > > > feature, suddenly Android devices may have a lot less free memory.
> > > > 
> > > > > 
> > > > > Skipping zeros is something we could do well in SSE2+ (especially
> SSE4.1+
> > or
> > > > > AVX2), but I suspect it's the sort of thing that will completely hose
> NEON
> > > > > performance (no movemask/ptest, and it's quite slow to communicate
from
> > NEON
> > > > > back to ARM).  Still, if we must do it, it seems worth trying.
> > > > 
> > > 
> > > And we're certain there really are runs of >=1024 transparent pixels
right? 
> > If
> > > we touch any byte on a (4K) page, we might as well touch them all as far
as
> > > memory savings go.  1024 transparent pixels is the smallest value that
makes
> > > this worth doing, and that's when we're perfectly aligned to a page, which
> may
> > > be optimistic.
> > 
> > Yes. (I can dig up the bug, but when the change first landed it saved many
MB
> of
> > RAM, IIRC.) For all around simplicity, I did not make any checks to see how
> many
> > consecutive transparent pixels there are. I'm not opposed to making the
check
> if
> > we can do it efficiently.
> 
> OK, I'm game to try.  I'm pretty confident we can preserve the write-no-zeros
> behavior on x86 while speeding up the code.  I don't know if we can speed up
the
> code on ARM while preserving write-no-zeros behavior, but we can experiment.

Think there's an regularity we can exploit to get the bulk of our skip-zero
logic without having each swizzle care?

E.g. when skipping zeros, just skip them until the first non-zero pixel per
scanline?
This would skip all fully transparent scanlines, which I imagine is where we get
all the win.

while (width > 0 && *src == 0x00000000) {
   width--;
   src++;
}

mtklein

> while (width > 0 && *src == 0x00000000) { > width--; > src++; > ...

4 years, 11 months ago (2016-01-08 17:34:22 UTC) #11

scroggo

On 2016/01/08 17:33:36, mtklein wrote: > On 2016/01/08 17:13:15, mtklein wrote: > > On 2016/01/08 ...

4 years, 11 months ago (2016-01-08 18:31:29 UTC) #12

On 2016/01/08 17:33:36, mtklein wrote:
> On 2016/01/08 17:13:15, mtklein wrote:
> > On 2016/01/08 16:50:51, scroggo wrote:
> > >
> https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
> > > File src/codec/SkSwizzler.cpp (right):
> > > 
> > >
> >
>
https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
> > > src/codec/SkSwizzler.cpp:733: scalarProc =
> &swizzle_rgba_to_n32_premul_skipZ;
> > > On 2016/01/08 16:23:01, mtklein wrote:
> > > > On 2016/01/08 16:02:56, scroggo wrote:
> > > > > On 2016/01/08 15:20:19, mtklein wrote:
> > > > > > On 2016/01/08 13:56:00, msarett wrote:
> > > > > > > We'll want to benchmark against the regular version and the "skip
> > zeros"
> > > > > > > version.  This is a good time to find out the benefit of skipping
> > zeros.
> > > > > > 
> > > > > > There's more to this feature than just speed, right? 
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > Even if we can go as fast,
> > > > > > seems like the feature is really to avoid physically touching those
> > bytes
> > > so
> > > > > > that those pages stay backed by a shared zero page instead of their
> own
> > > > > physical
> > > > > > RAM?  (Note I am personally still skeptical that optimizing for
~1000
> > > pixel
> > > > > > transparent blocks in Skia is better than, say, cropping the
> images...)
> > > > > 
> > > > > +1. But I think we'll need to update lots of client code in order to
> crop
> > > the
> > > > > images instead. Not saying that it's not worth doing, but we'll need
to
> > > change
> > > > > the framework (and probably get buy-in, because they'll need to
continue
> > > doing
> > > > > it the new way) *and* make apps update to the new way. I think the
right
> > way
> > > > to
> > > > > switch is to still support this strange optimization but show the
> > all-around
> > > > > benefits of cropping the images, so people will opt-in. If we were to
> drop
> > > the
> > > > > feature, suddenly Android devices may have a lot less free memory.
> > > > > 
> > > > > > 
> > > > > > Skipping zeros is something we could do well in SSE2+ (especially
> > SSE4.1+
> > > or
> > > > > > AVX2), but I suspect it's the sort of thing that will completely
hose
> > NEON
> > > > > > performance (no movemask/ptest, and it's quite slow to communicate
> from
> > > NEON
> > > > > > back to ARM).  Still, if we must do it, it seems worth trying.
> > > > > 
> > > > 
> > > > And we're certain there really are runs of >=1024 transparent pixels
> right? 
> > > If
> > > > we touch any byte on a (4K) page, we might as well touch them all as far
> as
> > > > memory savings go.  1024 transparent pixels is the smallest value that
> makes
> > > > this worth doing, and that's when we're perfectly aligned to a page,
which
> > may
> > > > be optimistic.
> > > 
> > > Yes. (I can dig up the bug, but when the change first landed it saved many
> MB
> > of
> > > RAM, IIRC.) For all around simplicity, I did not make any checks to see
how
> > many
> > > consecutive transparent pixels there are. I'm not opposed to making the
> check
> > if
> > > we can do it efficiently.
> > 
> > OK, I'm game to try.  I'm pretty confident we can preserve the
write-no-zeros
> > behavior on x86 while speeding up the code.  I don't know if we can speed up
> the
> > code on ARM while preserving write-no-zeros behavior, but we can experiment.
> 
> Think there's an regularity we can exploit to get the bulk of our skip-zero
> logic without having each swizzle care?
> 
> E.g. when skipping zeros, just skip them until the first non-zero pixel per
> scanline?
> This would skip all fully transparent scanlines, which I imagine is where we
get
> all the win.
> 
> while (width > 0 && *src == 0x00000000) {
>    width--;
>    src++;
> }

Yes, that should be sufficient.

msarett

Not planning to land this until after "SkSwizzler: Factor skipping zeros out into a helper ...

4 years, 11 months ago (2016-01-11 20:33:25 UTC) #15

Not planning to land this until after "SkSwizzler: Factor skipping zeros out
into a helper function." lands.
https://codereview.chromium.org/1566653007/

Sorry about the rebase...I really thought there were going to be conflicts.

"Have you measured perf?"

I did on Patch Set 3.  Looks like no change on my desktop.

"What sampling rates do you need in practice?"

Technically we support any integer sampling rate.  Though,
obviously, we cannot optimize for every single possibility.

As Leon mentioned, powers of 2 are most common.  I could
see optimizing 2, 4, 8, but IMO it's a lower priority than
full decode optimizations.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp
File src/codec/SkSwizzler.cpp (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ;
On 2016/01/08 16:50:51, scroggo wrote:
> On 2016/01/08 16:23:01, mtklein wrote:
> > On 2016/01/08 16:02:56, scroggo wrote:
> > > On 2016/01/08 15:20:19, mtklein wrote:
> > > > On 2016/01/08 13:56:00, msarett wrote:
> > > > > We'll want to benchmark against the regular version and the "skip
zeros"
> > > > > version.  This is a good time to find out the benefit of skipping
zeros.
> > > > 
> > > > There's more to this feature than just speed, right? 
> > > 
> > > Yes.
> > > 
> > > > Even if we can go as fast,
> > > > seems like the feature is really to avoid physically touching those
bytes
> so
> > > > that those pages stay backed by a shared zero page instead of their own
> > > physical
> > > > RAM?  (Note I am personally still skeptical that optimizing for ~1000
> pixel
> > > > transparent blocks in Skia is better than, say, cropping the images...)
> > > 
> > > +1. But I think we'll need to update lots of client code in order to crop
> the
> > > images instead. Not saying that it's not worth doing, but we'll need to
> change
> > > the framework (and probably get buy-in, because they'll need to continue
> doing
> > > it the new way) *and* make apps update to the new way. I think the right
way
> > to
> > > switch is to still support this strange optimization but show the
all-around
> > > benefits of cropping the images, so people will opt-in. If we were to drop
> the
> > > feature, suddenly Android devices may have a lot less free memory.
> > > 
> > > > 
> > > > Skipping zeros is something we could do well in SSE2+ (especially
SSE4.1+
> or
> > > > AVX2), but I suspect it's the sort of thing that will completely hose
NEON
> > > > performance (no movemask/ptest, and it's quite slow to communicate from
> NEON
> > > > back to ARM).  Still, if we must do it, it seems worth trying.
> > > 
> > 
> > And we're certain there really are runs of >=1024 transparent pixels right? 
> If
> > we touch any byte on a (4K) page, we might as well touch them all as far as
> > memory savings go.  1024 transparent pixels is the smallest value that makes
> > this worth doing, and that's when we're perfectly aligned to a page, which
may
> > be optimistic.
> 
> Yes. (I can dig up the bug, but when the change first landed it saved many MB
of
> RAM, IIRC.) For all around simplicity, I did not make any checks to see how
many
> consecutive transparent pixels there are. I'm not opposed to making the check
if
> we can do it efficiently.

I think Mike's idea to skip leading zeros will be sufficient.
https://bugs.chromium.org/p/skia/issues/detail?id=4787

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:734: } else {
On 2016/01/08 15:20:19, mtklein wrote:
> If I'm right above about this being more than just a speed thing, can't we
only
> use the opt_ method inside the else here?

Yes we could.  I don't think that's a good option though because Android decodes
are almost always zeroInit (and our first goal is to optimize Android).

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:776: if (nullptr == optProc) {
On 2016/01/08 16:23:01, mtklein wrote:
> On 2016/01/08 16:02:57, scroggo wrote:
> > So are we guaranteed that there's an optProc, but there might also be a
> > scalarProc?
> 
> (That's what I find confusing.  In reality, we're guaranteed there's a
> general-case proc, and the optimized procs are sparse.  This code makes it
feel
> the other way around.)

Done.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:825: int SkSwizzler::onSetSampleX(int sampleX) {
On 2016/01/08 16:02:57, scroggo wrote:
> On 2016/01/08 15:20:19, mtklein wrote:
> > I don't suppose x-sampling can move to the constructor?
> 
> It used to be in the constructor, but we moved it out in crrev.com/1372973002.
> This allowed us to factor out sampling from the codec implementations.

I think it makes sense to keep things this way for now.

If we find another client of Codec that also wants sampling, maybe it would be
worth considering making sampling a part of Codec?

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp...
src/codec/SkSwizzler.cpp:835: fRowProc = fScalarRowProc;
On 2016/01/08 15:20:19, mtklein wrote:
> This preference logic might be simpler to follow generally if we tweak it a
> little.  Today it seems like the logic is "use optProc unless we can't, then
> fall back to scalarProc".  What if we write it as "use optProc if we have one,
> defaulting to scalarProc"?  I.e. make having optProc the special case, rather
> than not having optProc?
> 
> That'd make this line something like
> 
>    fOptProc = nullptr;
> 
> and I think it might be less intrusive for the rest of SkSwizzler.

Done.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h
File src/codec/SkSwizzler.h (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h#...
src/codec/SkSwizzler.h:199: const RowProc       fScalarRowProc;
On 2016/01/08 15:20:19, mtklein wrote:
> Scalar sounds like SkScalar to me.  What about just fRowProc + fFastRowProc?

Done.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h
File src/codec/SkSwizzlerOpts.h (right):

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:11: // This is quite common for PNGs.
On 2016/01/08 15:20:19, mtklein wrote:
> My preference would be to plop this function in SkSwizzler.cpp among the rest
of
> the RowProcs.  Putting this into a header makes me wonder, why is it in a
> header, are there two files that need to call this?  The scope of a static
> function in a .cpp file, on the other hand, is pretty darn clear.

Sounds good.  I was trying to avoid (more) clutter, but it's worth the extra
confusion.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:14: int offset, const SkPMColor ctable[]) {
On 2016/01/08 15:20:20, mtklein wrote:
> Have you ever thought about ditching offset as a parameter to these procs?
> You don't ever do anything other than start off with src += offset, right?

We have some swizzler routines where bits per pixel is less than 8.

In those cases, the offset could be some number of bits that is not a multiple
of 8.  In those cases, we need to know the offset in the swizzle function.

Alternatively, we could increment src by offset before calling swizzle(), and
then the offset parameter could most of the time be unused.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:20: SkOpts::premul_xxxa((uint32_t*) dst, (uint32_t*)
(src + offset), width);
On 2016/01/08 15:20:20, mtklein wrote:
> probably clearest to use (const uint32_t*)(src + offset) in these calls.

Done.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:24: // FIXME: Should we continue to support
reallyHasAlpha()?
On 2016/01/08 16:23:01, mtklein wrote:
> On 2016/01/08 16:02:57, scroggo wrote:
> > I thought we had already decided not to. 
> 
> If we haven't yet, let's decide now to drop it unless later proven necessary.

Done.

https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt...
src/codec/SkSwizzlerOpts.h:26: }
On 2016/01/08 15:20:19, mtklein wrote:
> We can also do UPM RGBA -> UPM N32 right?
> 
> #ifdef SK_PMCOLOR_IS_RGBA
>   memmove(...)
> #else
>   SkOpts::swaprb_xxxa(...)
> #endif

Yeah I'm starting with one at a time :)

mtklein

lgtm mod skip zeros https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.cpp#newcode775 src/codec/SkSwizzler.cpp:775: if (fFastProc) { You might ...

4 years, 11 months ago (2016-01-11 21:07:21 UTC) #17

msarett

Rebase on SkipLeadingZeros. https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.cpp#newcode775 src/codec/SkSwizzler.cpp:775: if (fFastProc) { On 2016/01/11 21:07:21, ...

4 years, 11 months ago (2016-01-12 14:35:51 UTC) #18

mtklein

The patchset sent to the CQ was uploaded after l-g-t-m from scroggo@google.com Link to the ...

4 years, 11 months ago (2016-01-12 14:40:04 UTC) #21

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1563393002/120001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1563393002/120001

4 years, 11 months ago (2016-01-12 14:40:08 UTC) #22

commit-bot: I haz the power

Description was changed from ========== Use SkOpts routines in SkSwizzler This is a bit tricky ...

4 years, 11 months ago (2016-01-12 14:51:13 UTC) #23

commit-bot: I haz the power

4 years, 11 months ago (2016-01-12 14:51:14 UTC) #24

Message was sent while issue was closed.

Committed patchset #4 (id:120001) as
https://skia.googlesource.com/skia/+/a51e7782b2e028a38712a159c412e6151eca1666

Issue 1563393002: Use SkOpts routines in SkSwizzler (Closed)

Description

Patch Set 1 : #

Patch Set 2 : Rebase #

Patch Set 3 : Response to comments #

Patch Set 4 : Rebase on SkipLeadingZeros #

Messages