|
|
DescriptionUse SkOpts routines in SkSwizzler
This is a bit tricky because we don't intend to support sampling
in SkOpts. The swizzler will fallback to a default routine in
the event that sampling is requested.
BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1563393002
Committed: https://skia.googlesource.com/skia/+/a51e7782b2e028a38712a159c412e6151eca1666
Patch Set 1 : #
Total comments: 29
Patch Set 2 : Rebase #Patch Set 3 : Response to comments #
Total comments: 2
Patch Set 4 : Rebase on SkipLeadingZeros #
Depends on Patchset: Messages
Total messages: 24 (8 generated)
Description was changed from ========== Use SkOpts routines in SkSwizzler This is a bit tricky because we don't intend to support sampling in SkOpts. The swizzler will fallback to a default routine in the event that sampling is requested. BUG=skia:4767 ========== to ========== Use SkOpts routines in SkSwizzler This is a bit tricky because we don't intend to support sampling in SkOpts. The swizzler will fallback to a default routine in the event that sampling is requested. BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Patchset #1 (id:1) has been deleted
msarett@google.com changed reviewers: + mtklein@google.com, scroggo@google.com
No performance improvements yet. We haven't added the optimized routines to SkOpts yet. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; We'll want to benchmark against the regular version and the "skip zeros" version. This is a good time to find out the benefit of skipping zeros.
What sampling rates would we need in practice? It'd be nice to keep those in mind while we write up the SIMD code. Small powers of 1/2 shouldn't be terrible to implement. Have you measured perf, or are we just assuming there's no difference? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 13:56:00, msarett wrote: > We'll want to benchmark against the regular version and the "skip zeros" > version. This is a good time to find out the benefit of skipping zeros. There's more to this feature than just speed, right? Even if we can go as fast, seems like the feature is really to avoid physically touching those bytes so that those pages stay backed by a shared zero page instead of their own physical RAM? (Note I am personally still skeptical that optimizing for ~1000 pixel transparent blocks in Skia is better than, say, cropping the images...) Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or AVX2), but I suspect it's the sort of thing that will completely hose NEON performance (no movemask/ptest, and it's quite slow to communicate from NEON back to ARM). Still, if we must do it, it seems worth trying. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:734: } else { If I'm right above about this being more than just a speed thing, can't we only use the opt_ method inside the else here? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:825: int SkSwizzler::onSetSampleX(int sampleX) { I don't suppose x-sampling can move to the constructor? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:835: fRowProc = fScalarRowProc; This preference logic might be simpler to follow generally if we tweak it a little. Today it seems like the logic is "use optProc unless we can't, then fall back to scalarProc". What if we write it as "use optProc if we have one, defaulting to scalarProc"? I.e. make having optProc the special case, rather than not having optProc? That'd make this line something like fOptProc = nullptr; and I think it might be less intrusive for the rest of SkSwizzler. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h File src/codec/SkSwizzler.h (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h#... src/codec/SkSwizzler.h:199: const RowProc fScalarRowProc; Scalar sounds like SkScalar to me. What about just fRowProc + fFastRowProc? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h File src/codec/SkSwizzlerOpts.h (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:11: // This is quite common for PNGs. My preference would be to plop this function in SkSwizzler.cpp among the rest of the RowProcs. Putting this into a header makes me wonder, why is it in a header, are there two files that need to call this? The scope of a static function in a .cpp file, on the other hand, is pretty darn clear. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:14: int offset, const SkPMColor ctable[]) { Have you ever thought about ditching offset as a parameter to these procs? You don't ever do anything other than start off with src += offset, right? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:20: SkOpts::premul_xxxa((uint32_t*) dst, (uint32_t*) (src + offset), width); probably clearest to use (const uint32_t*)(src + offset) in these calls. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:26: } We can also do UPM RGBA -> UPM N32 right? #ifdef SK_PMCOLOR_IS_RGBA memmove(...) #else SkOpts::swaprb_xxxa(...) #endif
> What sampling rates would we need in practice? It'd be nice to keep those in > mind while we write up the SIMD code. Small powers of 1/2 shouldn't be terrible > to implement. The common ones used by Android will be powers of 2 sampling (see http://developer.android.com/reference/android/graphics/BitmapFactory.Options...), but I've seen really big ones (e.g. 64) https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 15:20:19, mtklein wrote: > On 2016/01/08 13:56:00, msarett wrote: > > We'll want to benchmark against the regular version and the "skip zeros" > > version. This is a good time to find out the benefit of skipping zeros. > > There's more to this feature than just speed, right? Yes. > Even if we can go as fast, > seems like the feature is really to avoid physically touching those bytes so > that those pages stay backed by a shared zero page instead of their own physical > RAM? (Note I am personally still skeptical that optimizing for ~1000 pixel > transparent blocks in Skia is better than, say, cropping the images...) +1. But I think we'll need to update lots of client code in order to crop the images instead. Not saying that it's not worth doing, but we'll need to change the framework (and probably get buy-in, because they'll need to continue doing it the new way) *and* make apps update to the new way. I think the right way to switch is to still support this strange optimization but show the all-around benefits of cropping the images, so people will opt-in. If we were to drop the feature, suddenly Android devices may have a lot less free memory. > > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or > AVX2), but I suspect it's the sort of thing that will completely hose NEON > performance (no movemask/ptest, and it's quite slow to communicate from NEON > back to ARM). Still, if we must do it, it seems worth trying. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:776: if (nullptr == optProc) { So are we guaranteed that there's an optProc, but there might also be a scalarProc? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:825: int SkSwizzler::onSetSampleX(int sampleX) { On 2016/01/08 15:20:19, mtklein wrote: > I don't suppose x-sampling can move to the constructor? It used to be in the constructor, but we moved it out in crrev.com/1372973002. This allowed us to factor out sampling from the codec implementations. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h File src/codec/SkSwizzlerOpts.h (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:24: // FIXME: Should we continue to support reallyHasAlpha()? I thought we had already decided not to.
https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 16:02:56, scroggo wrote: > On 2016/01/08 15:20:19, mtklein wrote: > > On 2016/01/08 13:56:00, msarett wrote: > > > We'll want to benchmark against the regular version and the "skip zeros" > > > version. This is a good time to find out the benefit of skipping zeros. > > > > There's more to this feature than just speed, right? > > Yes. > > > Even if we can go as fast, > > seems like the feature is really to avoid physically touching those bytes so > > that those pages stay backed by a shared zero page instead of their own > physical > > RAM? (Note I am personally still skeptical that optimizing for ~1000 pixel > > transparent blocks in Skia is better than, say, cropping the images...) > > +1. But I think we'll need to update lots of client code in order to crop the > images instead. Not saying that it's not worth doing, but we'll need to change > the framework (and probably get buy-in, because they'll need to continue doing > it the new way) *and* make apps update to the new way. I think the right way to > switch is to still support this strange optimization but show the all-around > benefits of cropping the images, so people will opt-in. If we were to drop the > feature, suddenly Android devices may have a lot less free memory. > > > > > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or > > AVX2), but I suspect it's the sort of thing that will completely hose NEON > > performance (no movemask/ptest, and it's quite slow to communicate from NEON > > back to ARM). Still, if we must do it, it seems worth trying. > And we're certain there really are runs of >=1024 transparent pixels right? If we touch any byte on a (4K) page, we might as well touch them all as far as memory savings go. 1024 transparent pixels is the smallest value that makes this worth doing, and that's when we're perfectly aligned to a page, which may be optimistic. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:776: if (nullptr == optProc) { On 2016/01/08 16:02:57, scroggo wrote: > So are we guaranteed that there's an optProc, but there might also be a > scalarProc? (That's what I find confusing. In reality, we're guaranteed there's a general-case proc, and the optimized procs are sparse. This code makes it feel the other way around.) https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h File src/codec/SkSwizzlerOpts.h (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:24: // FIXME: Should we continue to support reallyHasAlpha()? On 2016/01/08 16:02:57, scroggo wrote: > I thought we had already decided not to. If we haven't yet, let's decide now to drop it unless later proven necessary.
https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 16:23:01, mtklein wrote: > On 2016/01/08 16:02:56, scroggo wrote: > > On 2016/01/08 15:20:19, mtklein wrote: > > > On 2016/01/08 13:56:00, msarett wrote: > > > > We'll want to benchmark against the regular version and the "skip zeros" > > > > version. This is a good time to find out the benefit of skipping zeros. > > > > > > There's more to this feature than just speed, right? > > > > Yes. > > > > > Even if we can go as fast, > > > seems like the feature is really to avoid physically touching those bytes so > > > that those pages stay backed by a shared zero page instead of their own > > physical > > > RAM? (Note I am personally still skeptical that optimizing for ~1000 pixel > > > transparent blocks in Skia is better than, say, cropping the images...) > > > > +1. But I think we'll need to update lots of client code in order to crop the > > images instead. Not saying that it's not worth doing, but we'll need to change > > the framework (and probably get buy-in, because they'll need to continue doing > > it the new way) *and* make apps update to the new way. I think the right way > to > > switch is to still support this strange optimization but show the all-around > > benefits of cropping the images, so people will opt-in. If we were to drop the > > feature, suddenly Android devices may have a lot less free memory. > > > > > > > > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ or > > > AVX2), but I suspect it's the sort of thing that will completely hose NEON > > > performance (no movemask/ptest, and it's quite slow to communicate from NEON > > > back to ARM). Still, if we must do it, it seems worth trying. > > > > And we're certain there really are runs of >=1024 transparent pixels right? If > we touch any byte on a (4K) page, we might as well touch them all as far as > memory savings go. 1024 transparent pixels is the smallest value that makes > this worth doing, and that's when we're perfectly aligned to a page, which may > be optimistic. Yes. (I can dig up the bug, but when the change first landed it saved many MB of RAM, IIRC.) For all around simplicity, I did not make any checks to see how many consecutive transparent pixels there are. I'm not opposed to making the check if we can do it efficiently.
On 2016/01/08 16:50:51, scroggo wrote: > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp > File src/codec/SkSwizzler.cpp (right): > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... > src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; > On 2016/01/08 16:23:01, mtklein wrote: > > On 2016/01/08 16:02:56, scroggo wrote: > > > On 2016/01/08 15:20:19, mtklein wrote: > > > > On 2016/01/08 13:56:00, msarett wrote: > > > > > We'll want to benchmark against the regular version and the "skip zeros" > > > > > version. This is a good time to find out the benefit of skipping zeros. > > > > > > > > There's more to this feature than just speed, right? > > > > > > Yes. > > > > > > > Even if we can go as fast, > > > > seems like the feature is really to avoid physically touching those bytes > so > > > > that those pages stay backed by a shared zero page instead of their own > > > physical > > > > RAM? (Note I am personally still skeptical that optimizing for ~1000 > pixel > > > > transparent blocks in Skia is better than, say, cropping the images...) > > > > > > +1. But I think we'll need to update lots of client code in order to crop > the > > > images instead. Not saying that it's not worth doing, but we'll need to > change > > > the framework (and probably get buy-in, because they'll need to continue > doing > > > it the new way) *and* make apps update to the new way. I think the right way > > to > > > switch is to still support this strange optimization but show the all-around > > > benefits of cropping the images, so people will opt-in. If we were to drop > the > > > feature, suddenly Android devices may have a lot less free memory. > > > > > > > > > > > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ > or > > > > AVX2), but I suspect it's the sort of thing that will completely hose NEON > > > > performance (no movemask/ptest, and it's quite slow to communicate from > NEON > > > > back to ARM). Still, if we must do it, it seems worth trying. > > > > > > > And we're certain there really are runs of >=1024 transparent pixels right? > If > > we touch any byte on a (4K) page, we might as well touch them all as far as > > memory savings go. 1024 transparent pixels is the smallest value that makes > > this worth doing, and that's when we're perfectly aligned to a page, which may > > be optimistic. > > Yes. (I can dig up the bug, but when the change first landed it saved many MB of > RAM, IIRC.) For all around simplicity, I did not make any checks to see how many > consecutive transparent pixels there are. I'm not opposed to making the check if > we can do it efficiently. OK, I'm game to try. I'm pretty confident we can preserve the write-no-zeros behavior on x86 while speeding up the code. I don't know if we can speed up the code on ARM while preserving write-no-zeros behavior, but we can experiment.
On 2016/01/08 17:13:15, mtklein wrote: > On 2016/01/08 16:50:51, scroggo wrote: > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp > > File src/codec/SkSwizzler.cpp (right): > > > > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... > > src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; > > On 2016/01/08 16:23:01, mtklein wrote: > > > On 2016/01/08 16:02:56, scroggo wrote: > > > > On 2016/01/08 15:20:19, mtklein wrote: > > > > > On 2016/01/08 13:56:00, msarett wrote: > > > > > > We'll want to benchmark against the regular version and the "skip > zeros" > > > > > > version. This is a good time to find out the benefit of skipping > zeros. > > > > > > > > > > There's more to this feature than just speed, right? > > > > > > > > Yes. > > > > > > > > > Even if we can go as fast, > > > > > seems like the feature is really to avoid physically touching those > bytes > > so > > > > > that those pages stay backed by a shared zero page instead of their own > > > > physical > > > > > RAM? (Note I am personally still skeptical that optimizing for ~1000 > > pixel > > > > > transparent blocks in Skia is better than, say, cropping the images...) > > > > > > > > +1. But I think we'll need to update lots of client code in order to crop > > the > > > > images instead. Not saying that it's not worth doing, but we'll need to > > change > > > > the framework (and probably get buy-in, because they'll need to continue > > doing > > > > it the new way) *and* make apps update to the new way. I think the right > way > > > to > > > > switch is to still support this strange optimization but show the > all-around > > > > benefits of cropping the images, so people will opt-in. If we were to drop > > the > > > > feature, suddenly Android devices may have a lot less free memory. > > > > > > > > > > > > > > Skipping zeros is something we could do well in SSE2+ (especially > SSE4.1+ > > or > > > > > AVX2), but I suspect it's the sort of thing that will completely hose > NEON > > > > > performance (no movemask/ptest, and it's quite slow to communicate from > > NEON > > > > > back to ARM). Still, if we must do it, it seems worth trying. > > > > > > > > > > And we're certain there really are runs of >=1024 transparent pixels right? > > If > > > we touch any byte on a (4K) page, we might as well touch them all as far as > > > memory savings go. 1024 transparent pixels is the smallest value that makes > > > this worth doing, and that's when we're perfectly aligned to a page, which > may > > > be optimistic. > > > > Yes. (I can dig up the bug, but when the change first landed it saved many MB > of > > RAM, IIRC.) For all around simplicity, I did not make any checks to see how > many > > consecutive transparent pixels there are. I'm not opposed to making the check > if > > we can do it efficiently. > > OK, I'm game to try. I'm pretty confident we can preserve the write-no-zeros > behavior on x86 while speeding up the code. I don't know if we can speed up the > code on ARM while preserving write-no-zeros behavior, but we can experiment. Think there's an regularity we can exploit to get the bulk of our skip-zero logic without having each swizzle care? E.g. when skipping zeros, just skip them until the first non-zero pixel per scanline? This would skip all fully transparent scanlines, which I imagine is where we get all the win. while (width > 0 && *src == 0x00000000) { width--; src++; }
> while (width > 0 && *src == 0x00000000) { > width--; > src++; > } (er, clearly also dst++ or whatever's needed to move dst along with src.)
On 2016/01/08 17:33:36, mtklein wrote: > On 2016/01/08 17:13:15, mtklein wrote: > > On 2016/01/08 16:50:51, scroggo wrote: > > > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp > > > File src/codec/SkSwizzler.cpp (right): > > > > > > > > > https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... > > > src/codec/SkSwizzler.cpp:733: scalarProc = > &swizzle_rgba_to_n32_premul_skipZ; > > > On 2016/01/08 16:23:01, mtklein wrote: > > > > On 2016/01/08 16:02:56, scroggo wrote: > > > > > On 2016/01/08 15:20:19, mtklein wrote: > > > > > > On 2016/01/08 13:56:00, msarett wrote: > > > > > > > We'll want to benchmark against the regular version and the "skip > > zeros" > > > > > > > version. This is a good time to find out the benefit of skipping > > zeros. > > > > > > > > > > > > There's more to this feature than just speed, right? > > > > > > > > > > Yes. > > > > > > > > > > > Even if we can go as fast, > > > > > > seems like the feature is really to avoid physically touching those > > bytes > > > so > > > > > > that those pages stay backed by a shared zero page instead of their > own > > > > > physical > > > > > > RAM? (Note I am personally still skeptical that optimizing for ~1000 > > > pixel > > > > > > transparent blocks in Skia is better than, say, cropping the > images...) > > > > > > > > > > +1. But I think we'll need to update lots of client code in order to > crop > > > the > > > > > images instead. Not saying that it's not worth doing, but we'll need to > > > change > > > > > the framework (and probably get buy-in, because they'll need to continue > > > doing > > > > > it the new way) *and* make apps update to the new way. I think the right > > way > > > > to > > > > > switch is to still support this strange optimization but show the > > all-around > > > > > benefits of cropping the images, so people will opt-in. If we were to > drop > > > the > > > > > feature, suddenly Android devices may have a lot less free memory. > > > > > > > > > > > > > > > > > Skipping zeros is something we could do well in SSE2+ (especially > > SSE4.1+ > > > or > > > > > > AVX2), but I suspect it's the sort of thing that will completely hose > > NEON > > > > > > performance (no movemask/ptest, and it's quite slow to communicate > from > > > NEON > > > > > > back to ARM). Still, if we must do it, it seems worth trying. > > > > > > > > > > > > > And we're certain there really are runs of >=1024 transparent pixels > right? > > > If > > > > we touch any byte on a (4K) page, we might as well touch them all as far > as > > > > memory savings go. 1024 transparent pixels is the smallest value that > makes > > > > this worth doing, and that's when we're perfectly aligned to a page, which > > may > > > > be optimistic. > > > > > > Yes. (I can dig up the bug, but when the change first landed it saved many > MB > > of > > > RAM, IIRC.) For all around simplicity, I did not make any checks to see how > > many > > > consecutive transparent pixels there are. I'm not opposed to making the > check > > if > > > we can do it efficiently. > > > > OK, I'm game to try. I'm pretty confident we can preserve the write-no-zeros > > behavior on x86 while speeding up the code. I don't know if we can speed up > the > > code on ARM while preserving write-no-zeros behavior, but we can experiment. > > Think there's an regularity we can exploit to get the bulk of our skip-zero > logic without having each swizzle care? > > E.g. when skipping zeros, just skip them until the first non-zero pixel per > scanline? > This would skip all fully transparent scanlines, which I imagine is where we get > all the win. > > while (width > 0 && *src == 0x00000000) { > width--; > src++; > } Yes, that should be sufficient.
Patchset #2 (id:40001) has been deleted
Patchset #3 (id:80001) has been deleted
Not planning to land this until after "SkSwizzler: Factor skipping zeros out into a helper function." lands. https://codereview.chromium.org/1566653007/ Sorry about the rebase...I really thought there were going to be conflicts. "Have you measured perf?" I did on Patch Set 3. Looks like no change on my desktop. "What sampling rates do you need in practice?" Technically we support any integer sampling rate. Though, obviously, we cannot optimize for every single possibility. As Leon mentioned, powers of 2 are most common. I could see optimizing 2, 4, 8, but IMO it's a lower priority than full decode optimizations. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:733: scalarProc = &swizzle_rgba_to_n32_premul_skipZ; On 2016/01/08 16:50:51, scroggo wrote: > On 2016/01/08 16:23:01, mtklein wrote: > > On 2016/01/08 16:02:56, scroggo wrote: > > > On 2016/01/08 15:20:19, mtklein wrote: > > > > On 2016/01/08 13:56:00, msarett wrote: > > > > > We'll want to benchmark against the regular version and the "skip zeros" > > > > > version. This is a good time to find out the benefit of skipping zeros. > > > > > > > > There's more to this feature than just speed, right? > > > > > > Yes. > > > > > > > Even if we can go as fast, > > > > seems like the feature is really to avoid physically touching those bytes > so > > > > that those pages stay backed by a shared zero page instead of their own > > > physical > > > > RAM? (Note I am personally still skeptical that optimizing for ~1000 > pixel > > > > transparent blocks in Skia is better than, say, cropping the images...) > > > > > > +1. But I think we'll need to update lots of client code in order to crop > the > > > images instead. Not saying that it's not worth doing, but we'll need to > change > > > the framework (and probably get buy-in, because they'll need to continue > doing > > > it the new way) *and* make apps update to the new way. I think the right way > > to > > > switch is to still support this strange optimization but show the all-around > > > benefits of cropping the images, so people will opt-in. If we were to drop > the > > > feature, suddenly Android devices may have a lot less free memory. > > > > > > > > > > > Skipping zeros is something we could do well in SSE2+ (especially SSE4.1+ > or > > > > AVX2), but I suspect it's the sort of thing that will completely hose NEON > > > > performance (no movemask/ptest, and it's quite slow to communicate from > NEON > > > > back to ARM). Still, if we must do it, it seems worth trying. > > > > > > > And we're certain there really are runs of >=1024 transparent pixels right? > If > > we touch any byte on a (4K) page, we might as well touch them all as far as > > memory savings go. 1024 transparent pixels is the smallest value that makes > > this worth doing, and that's when we're perfectly aligned to a page, which may > > be optimistic. > > Yes. (I can dig up the bug, but when the change first landed it saved many MB of > RAM, IIRC.) For all around simplicity, I did not make any checks to see how many > consecutive transparent pixels there are. I'm not opposed to making the check if > we can do it efficiently. I think Mike's idea to skip leading zeros will be sufficient. https://bugs.chromium.org/p/skia/issues/detail?id=4787 https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:734: } else { On 2016/01/08 15:20:19, mtklein wrote: > If I'm right above about this being more than just a speed thing, can't we only > use the opt_ method inside the else here? Yes we could. I don't think that's a good option though because Android decodes are almost always zeroInit (and our first goal is to optimize Android). https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:776: if (nullptr == optProc) { On 2016/01/08 16:23:01, mtklein wrote: > On 2016/01/08 16:02:57, scroggo wrote: > > So are we guaranteed that there's an optProc, but there might also be a > > scalarProc? > > (That's what I find confusing. In reality, we're guaranteed there's a > general-case proc, and the optimized procs are sparse. This code makes it feel > the other way around.) Done. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:825: int SkSwizzler::onSetSampleX(int sampleX) { On 2016/01/08 16:02:57, scroggo wrote: > On 2016/01/08 15:20:19, mtklein wrote: > > I don't suppose x-sampling can move to the constructor? > > It used to be in the constructor, but we moved it out in crrev.com/1372973002. > This allowed us to factor out sampling from the codec implementations. I think it makes sense to keep things this way for now. If we find another client of Codec that also wants sampling, maybe it would be worth considering making sampling a part of Codec? https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.cp... src/codec/SkSwizzler.cpp:835: fRowProc = fScalarRowProc; On 2016/01/08 15:20:19, mtklein wrote: > This preference logic might be simpler to follow generally if we tweak it a > little. Today it seems like the logic is "use optProc unless we can't, then > fall back to scalarProc". What if we write it as "use optProc if we have one, > defaulting to scalarProc"? I.e. make having optProc the special case, rather > than not having optProc? > > That'd make this line something like > > fOptProc = nullptr; > > and I think it might be less intrusive for the rest of SkSwizzler. Done. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h File src/codec/SkSwizzler.h (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzler.h#... src/codec/SkSwizzler.h:199: const RowProc fScalarRowProc; On 2016/01/08 15:20:19, mtklein wrote: > Scalar sounds like SkScalar to me. What about just fRowProc + fFastRowProc? Done. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpts.h File src/codec/SkSwizzlerOpts.h (right): https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:11: // This is quite common for PNGs. On 2016/01/08 15:20:19, mtklein wrote: > My preference would be to plop this function in SkSwizzler.cpp among the rest of > the RowProcs. Putting this into a header makes me wonder, why is it in a > header, are there two files that need to call this? The scope of a static > function in a .cpp file, on the other hand, is pretty darn clear. Sounds good. I was trying to avoid (more) clutter, but it's worth the extra confusion. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:14: int offset, const SkPMColor ctable[]) { On 2016/01/08 15:20:20, mtklein wrote: > Have you ever thought about ditching offset as a parameter to these procs? > You don't ever do anything other than start off with src += offset, right? We have some swizzler routines where bits per pixel is less than 8. In those cases, the offset could be some number of bits that is not a multiple of 8. In those cases, we need to know the offset in the swizzle function. Alternatively, we could increment src by offset before calling swizzle(), and then the offset parameter could most of the time be unused. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:20: SkOpts::premul_xxxa((uint32_t*) dst, (uint32_t*) (src + offset), width); On 2016/01/08 15:20:20, mtklein wrote: > probably clearest to use (const uint32_t*)(src + offset) in these calls. Done. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:24: // FIXME: Should we continue to support reallyHasAlpha()? On 2016/01/08 16:23:01, mtklein wrote: > On 2016/01/08 16:02:57, scroggo wrote: > > I thought we had already decided not to. > > If we haven't yet, let's decide now to drop it unless later proven necessary. Done. https://codereview.chromium.org/1563393002/diff/20001/src/codec/SkSwizzlerOpt... src/codec/SkSwizzlerOpts.h:26: } On 2016/01/08 15:20:19, mtklein wrote: > We can also do UPM RGBA -> UPM N32 right? > > #ifdef SK_PMCOLOR_IS_RGBA > memmove(...) > #else > SkOpts::swaprb_xxxa(...) > #endif Yeah I'm starting with one at a time :)
lgtm
lgtm mod skip zeros https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.c... src/codec/SkSwizzler.cpp:775: if (fFastProc) { You might do RowProc proc = fFastProc ? fFastProc : fProc; proc(...)
Rebase on SkipLeadingZeros. https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.cpp File src/codec/SkSwizzler.cpp (right): https://codereview.chromium.org/1563393002/diff/100001/src/codec/SkSwizzler.c... src/codec/SkSwizzler.cpp:775: if (fFastProc) { On 2016/01/11 21:07:21, mtklein wrote: > You might do > > RowProc proc = fFastProc ? fFastProc : fProc; > proc(...) Done.
The CQ bit was checked by mtklein@google.com
lgtm
The patchset sent to the CQ was uploaded after l-g-t-m from scroggo@google.com Link to the patchset: https://codereview.chromium.org/1563393002/#ps120001 (title: "Rebase on SkipLeadingZeros")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1563393002/120001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1563393002/120001
Message was sent while issue was closed.
Description was changed from ========== Use SkOpts routines in SkSwizzler This is a bit tricky because we don't intend to support sampling in SkOpts. The swizzler will fallback to a default routine in the event that sampling is requested. BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== Use SkOpts routines in SkSwizzler This is a bit tricky because we don't intend to support sampling in SkOpts. The swizzler will fallback to a default routine in the event that sampling is requested. BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... Committed: https://skia.googlesource.com/skia/+/a51e7782b2e028a38712a159c412e6151eca1666 ==========
Message was sent while issue was closed.
Committed patchset #4 (id:120001) as https://skia.googlesource.com/skia/+/a51e7782b2e028a38712a159c412e6151eca1666 |