|
|
DescriptionOptional gradient dithering
Update raster gradient impls to observe the paint dithering flag.
For now this new behavior is disabled pending internal/external
client updates to mitigate diffs.
BUG=skia:4436
R=reed@google.com,mtklein@google.com,tomhudson@google.com
Committed: https://skia.googlesource.com/skia/+/37d8688927988cc71e968e0b5e90689e4e6ed7ab
Patch Set 1 #Patch Set 2 : cleanup #Patch Set 3 : template-based #Patch Set 4 : straight kDitherStride32 runtime plumbing #Patch Set 5 : table-based approach #
Messages
Total messages: 27 (6 generated)
The CQ bit was checked by fmalita@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1391303002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1391303002/40001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
fmalita@chromium.org changed reviewers: + mtklein@google.com, reed@google.com, tomhudson@google.com
Patch set #2 is fairly contained, parameterizing the toggle dither stride (0 disables dithering). Unfortunately it yields a significant nanobench regression (~15% for linear gradients). Apparently this little toggle is quite hot :( Patch set #3 is template-based and intrusive, but not only it doesn't regress - it improves perf when dithering is disabled by ~13%. Based on these numbers, the complexity cost of PS#2 may be justified. WDYT? ToT: out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 37.1ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 94.1µs 94.7µs 102µs 279µs 19% 3995 8888 gradient_linear_clamp_shallow 33/33 MB 5 80.9µs 81.1µs 81.4µs 139µs 2% 2325 565 gradient_linear_clamp_shallow 33/33 MB 5 76.5µs 76.8µs 77.1µs 119µs 2% 2330 8888 gradient_linear_clamp_shallow_dither 33/33 MB 5 80.9µs 81.1µs 81.5µs 144µs 3% 2323 565 gradient_linear_clamp_shallow_dither 33/33 MB 5 76.5µs 76.8µs 77.1µs 120µs 3% 2329 8888 gradient_linear_clamp_3color 33/33 MB 5 80.9µs 81.1µs 81.3µs 93.7µs 1% 2328 565 gradient_linear_clamp_3color 33/33 MB 5 76.5µs 76.8µs 77.2µs 135µs 3% 2328 8888 gradient_linear_clamp_hicolor 33/33 MB 5 80.8µs 81.1µs 81.3µs 109µs 1% 2328 565 gradient_linear_clamp_hicolor 33/33 MB 5 76.5µs 76.8µs 77.3µs 137µs 5% 2323 8888 gradient_linear_clamp 33/33 MB 5 80.9µs 81.1µs 81.4µs 136µs 2% 2326 565 gradient_linear_clamp PS#2: out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 39.8ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 90.9µs 91.1µs 98µs 333µs 20% 4109 8888 gradient_linear_clamp_shallow 32/33 MB 5 89.9µs 90.7µs 90.9µs 127µs 2% 2095 565 gradient_linear_clamp_shallow 33/33 MB 5 90.9µs 91.2µs 91.8µs 156µs 4% 1988 8888 gradient_linear_clamp_shallow_dither 33/33 MB 5 89.9µs 90.6µs 90.9µs 114µs 1% 2096 565 gradient_linear_clamp_shallow_dither 33/33 MB 5 90.9µs 91.2µs 92.4µs 188µs 8% 1975 8888 gradient_linear_clamp_3color 33/33 MB 5 89.9µs 90.7µs 90.9µs 127µs 2% 2096 565 gradient_linear_clamp_3color 33/33 MB 5 90.9µs 91.2µs 91.6µs 151µs 3% 1992 8888 gradient_linear_clamp_hicolor 33/33 MB 5 89.9µs 90.9µs 91µs 159µs 2% 2094 565 gradient_linear_clamp_hicolor 33/33 MB 5 90.9µs 91.2µs 91.5µs 149µs 2% 1994 8888 gradient_linear_clamp 33/33 MB 5 89.9µs 90.7µs 90.9µs 127µs 2% 2096 565 gradient_linear_clamp PS#3: out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 36.3ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 93.5µs 95.8µs 111µs 261µs 27% 3651 8888 gradient_linear_clamp_shallow 32/33 MB 5 68.5µs 70.1µs 70.2µs 110µs 2% 2675 565 gradient_linear_clamp_shallow 33/33 MB 5 79µs 80.3µs 80.4µs 120µs 2% 2242 8888 gradient_linear_clamp_shallow_dither 33/33 MB 5 81µs 81.5µs 81.8µs 139µs 3% 2315 565 gradient_linear_clamp_shallow_dither 33/33 MB 5 65.8µs 69.2µs 69.2µs 85.4µs 2% 2564 8888 gradient_linear_clamp_3color 33/33 MB 5 68.2µs 70µs 70.1µs 119µs 3% 2676 565 gradient_linear_clamp_3color 33/33 MB 5 66.1µs 69.1µs 69.2µs 80.8µs 2% 2564 8888 gradient_linear_clamp_hicolor 33/33 MB 5 67.9µs 70µs 70.2µs 111µs 3% 2675 565 gradient_linear_clamp_hicolor 33/33 MB 5 66.3µs 69.2µs 69.3µs 103µs 2% 2561 8888 gradient_linear_clamp 33/33 MB 5 67.9µs 69.9µs 70.1µs 122µs 3% 2678 565 gradient_linear_clamp
Could we not have just made SkGradientShaderBase::kDitherStride32 be a runtime value instead of a const?
On 2015/10/08 14:39:20, reed1 wrote: > Could we not have just made SkGradientShaderBase::kDitherStride32 be a runtime > value instead of a const? That's what I tried in set #2, but it had a surprisingly high perf impact. I can try again without the wrapper/helper class just to rule out potential in that area.
On 2015/10/08 15:42:48, f(malita) wrote: > On 2015/10/08 14:39:20, reed1 wrote: > > Could we not have just made SkGradientShaderBase::kDitherStride32 be a runtime > > value instead of a const? > > That's what I tried in set #2, but it had a surprisingly high perf impact. I > can try again without the wrapper/helper class just to rule out potential in > that area. "potential overhead"
Another 2nd-order consideration: If we're going to NOT dither, then the tables we built are not exactly what we want. In the non-dither case, we want a round-bias of 1/2, not the 1/8,3/8,5/8,7/8 bias in the tables... Not critical but something to consider for completeness.
On 2015/10/08 15:44:23, reed1 wrote: > Another 2nd-order consideration: > > If we're going to NOT dither, then the tables we built are not exactly what we > want. In the non-dither case, we want a round-bias of 1/2, not the > 1/8,3/8,5/8,7/8 bias in the tables... Not critical but something to consider for > completeness. I see. I was toying with the idea of playing some table tricks (making all tables identical should also disable dithering IIUC). If we're not dithering we don't even need 4 tables, and since the round-bias also needs tweaking maybe that's a good thing to focus on.
On 2015/10/08 16:01:03, f(malita) wrote: > On 2015/10/08 15:44:23, reed1 wrote: > > Another 2nd-order consideration: > > > > If we're going to NOT dither, then the tables we built are not exactly what we > > want. In the non-dither case, we want a round-bias of 1/2, not the > > 1/8,3/8,5/8,7/8 bias in the tables... Not critical but something to consider > for > > completeness. > > I see. > > I was toying with the idea of playing some table tricks (making all tables > identical should also disable dithering IIUC). If we're not dithering we don't > even need 4 tables, and since the round-bias also needs tweaking maybe that's a > good thing to focus on. Also a good trick/idea...
Since Mike's got concerns about the approach, reading the code may not have been useful, but the implementation in PS#3 looks pretty good to me. What PS#2 does looks like a really expensive, inefficient way to do it - object construction, object assignment, obj-to-int cast overridden to return a member field. Are we confident that the compiler was really optimizing all that away? Or could we get a more primitive implementation that wouldn't have the perf impact? Also, it isn't quite exactly the same thing as Mike proposed in #7, "make kDitherStride32 a runtime value".
On 2015/10/08 16:43:04, tomhudson wrote: > What PS#2 does looks like a really expensive, inefficient way to do it - object > construction, object assignment, obj-to-int cast overridden to return a member > field. Are we confident that the compiler was really optimizing all that away? Confident is overstating it, but yes, I would hope modern compilers inline/optimize all that away. > Or could we get a more primitive implementation that wouldn't have the perf > impact? Also, it isn't quite exactly the same thing as Mike proposed in #7, > "make kDitherStride32 a runtime value". Yup, that's what I was just trying - see patch set #4. Mike, is this what you had in mind? Unfortunately it still shows the same perf degradation as PS#2. It's surprising to me that swapping a constant with something that should fit in a register for the duration of the hot loops has such an impact. Register spilling? [ToT] out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 36.9ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 93.1µs 93.6µs 101µs 226µs 19% 4020 8888 gradient_linear_clamp_shallow 32/33 MB 5 80.7µs 81.1µs 81.4µs 209µs 4% 2324 565 gradient_linear_clamp_shallow 33/33 MB 5 76.4µs 79.1µs 79.1µs 146µs 3% 2275 8888 gradient_linear_clamp_shallow_dither 33/33 MB 5 80.8µs 81µs 81.4µs 125µs 2% 2325 565 gradient_linear_clamp_shallow_dither 33/33 MB 5 76.4µs 78.6µs 78.8µs 135µs 3% 2282 8888 gradient_linear_clamp_3color 33/33 MB 5 80.8µs 81.1µs 81.5µs 158µs 3% 2321 565 gradient_linear_clamp_3color 33/33 MB 5 76.4µs 78.5µs 78.8µs 115µs 2% 2284 8888 gradient_linear_clamp_hicolor 33/33 MB 5 80.8µs 81µs 81.4µs 140µs 2% 2325 565 gradient_linear_clamp_hicolor 33/33 MB 5 76.4µs 78.8µs 79µs 149µs 4% 2278 8888 gradient_linear_clamp 33/33 MB 5 80.8µs 81.3µs 85.7µs 156µs 15% 2209 565 gradient_linear_clamp [PS#4] out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 33ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 87.2µs 96.5µs 108µs 349µs 25% 3737 8888 gradient_linear_clamp_shallow 32/33 MB 4 87.2µs 89.5µs 89.8µs 162µs 3% 2616 565 gradient_linear_clamp_shallow 33/33 MB 4 87.1µs 93.1µs 93.7µs 206µs 4% 2388 8888 gradient_linear_clamp_shallow_dither 33/33 MB 4 87.3µs 87.5µs 88µs 135µs 2% 2666 565 gradient_linear_clamp_shallow_dither 33/33 MB 4 86.9µs 92.7µs 94.4µs 149µs 7% 2366 8888 gradient_linear_clamp_3color 33/33 MB 4 87.8µs 89.6µs 89.8µs 122µs 1% 2617 565 gradient_linear_clamp_3color 33/33 MB 4 86.9µs 92.9µs 93.8µs 169µs 4% 2386 8888 gradient_linear_clamp_hicolor 33/33 MB 4 87.3µs 89.6µs 89.8µs 177µs 3% 2615 565 gradient_linear_clamp_hicolor 33/33 MB 4 86.9µs 92.6µs 93.5µs 178µs 6% 2390 8888 gradient_linear_clamp 33/33 MB 4 87.3µs 89.6µs 90.2µs 170µs 5% 2603 565 gradient_linear_clamp
On 2015/10/08 19:47:26, f(malita) wrote: > On 2015/10/08 16:43:04, tomhudson wrote: > > What PS#2 does looks like a really expensive, inefficient way to do it - > object > > construction, object assignment, obj-to-int cast overridden to return a member > > field. Are we confident that the compiler was really optimizing all that away? > > Confident is overstating it, but yes, I would hope modern compilers > inline/optimize > all that away. > > > Or could we get a more primitive implementation that wouldn't have the perf > > impact? Also, it isn't quite exactly the same thing as Mike proposed in #7, > > "make kDitherStride32 a runtime value". > > Yup, that's what I was just trying - see patch set #4. Mike, is this what you > had in > mind? > > Unfortunately it still shows the same perf degradation as PS#2. It's surprising > to me > that swapping a constant with something that should fit in a register for the > duration > of the hot loops has such an impact. Register spilling? We may not be able to use perf on Android Framework (grumble), but you can run perf on this, then hit 'a' to annotate the function and see what the actual assembly is and where the hotspot is.
Hmmmm. 1. I can believe the slow down, esp. in the very trivial cases where our loop is only 3-4 instructions. 2. Not sure that I like using the template per-se, in that it is a lot of binary code-bloat. 3. In the "big" picture, not sure how important this perf-diff is 4. In the medium term, we are very likely to rewrite much of this code anyway, to handle the new hot bugs around steep gradients with lots of color-stops. 5. Given all of that, I am willing to go with whatever approach you choose, but I do think we should consider the table itself (i.e. setting the bias to 1/2).
On 2015/10/08 20:32:18, reed1 wrote: > Hmmmm. > > 1. I can believe the slow down, esp. in the very trivial cases where our loop is > only 3-4 instructions. > > 2. Not sure that I like using the template per-se, in that it is a lot of binary > code-bloat. > > 3. In the "big" picture, not sure how important this perf-diff is > > 4. In the medium term, we are very likely to rewrite much of this code anyway, > to handle the new hot bugs around steep gradients with lots of color-stops. ACK > 5. Given all of that, I am willing to go with whatever approach you choose, but > I do think we should consider the table itself (i.e. setting the bias to 1/2). Done, see patch set #5. I like this approach because it's contained and solves both the rounding bias & disables dithering in one swoop - plus there's no perf regression. The gotcha is we may need to recompute the cache tables if the shader is reused with a different dither flag, but I presume this case is uncommon. (not sure whether there was any hidden bias in Build16bitCache that I missed, please double-check) [ToT] out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 37.1ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 93.9µs 94.5µs 102µs 221µs 19% 3997 8888 gradient_linear_clamp_shallow 32/33 MB 5 80.8µs 81.2µs 81.7µs 140µs 4% 2318 565 gradient_linear_clamp_shallow 33/33 MB 5 76.5µs 77.4µs 77.6µs 110µs 2% 2316 8888 gradient_linear_clamp_shallow_dither 33/33 MB 5 80.8µs 81.2µs 81.5µs 135µs 3% 2323 565 gradient_linear_clamp_shallow_dither 33/33 MB 5 76.5µs 77.4µs 77.5µs 97.8µs 1% 2319 8888 gradient_linear_clamp_3color 33/33 MB 5 80.8µs 81.2µs 81.5µs 140µs 3% 2323 565 gradient_linear_clamp_3color 33/33 MB 5 76.5µs 77.4µs 77.6µs 108µs 2% 2317 8888 gradient_linear_clamp_hicolor 33/33 MB 5 80.8µs 81.2µs 81.6µs 134µs 4% 2320 565 gradient_linear_clamp_hicolor 33/33 MB 5 76.5µs 77.4µs 77.4µs 89.4µs 1% 2320 8888 gradient_linear_clamp 33/33 MB 5 80.8µs 81.2µs 81.6µs 133µs 3% 2322 565 gradient_linear_clamp [PS#5] out/Release/nanobench -m gradient_linear_clamp --samplingTime 1s --config 8888 565 Timer overhead: 37.3ns curr/maxrss loops min median mean max stddev samples config bench 33/33 MB 2 94.1µs 94.6µs 102µs 218µs 19% 3994 8888 gradient_linear_clamp_shallow 32/33 MB 5 80.7µs 81.1µs 81.5µs 103µs 2% 2322 565 gradient_linear_clamp_shallow 33/33 MB 5 76.6µs 77.2µs 77.4µs 124µs 2% 2321 8888 gradient_linear_clamp_shallow_dither 33/33 MB 5 80.8µs 81.2µs 81.7µs 134µs 3% 2318 565 gradient_linear_clamp_shallow_dither 33/33 MB 5 76.6µs 77.2µs 77.5µs 97.4µs 2% 2320 8888 gradient_linear_clamp_3color 33/33 MB 5 80.8µs 81.1µs 81.6µs 142µs 3% 2320 565 gradient_linear_clamp_3color 33/33 MB 5 76.6µs 77.2µs 77.5µs 127µs 3% 2318 8888 gradient_linear_clamp_hicolor 33/33 MB 5 80.7µs 81.3µs 82.6µs 121µs 5% 2293 565 gradient_linear_clamp_hicolor 33/33 MB 5 76.6µs 77.2µs 77.5µs 96.1µs 2% 2319 8888 gradient_linear_clamp 33/33 MB 5 80.7µs 81.4µs 88.6µs 218µs 22% 2137 565 gradient_linear_clamp
The CQ bit was checked by fmalita@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1391303002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1391303002/80001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
> The gotcha is we may need to recompute the cache tables if the shader is reused > with a different dither flag, but I presume this case is uncommon. We're turning off dithering when we detect that we're on hardware that has problems (temporal aliasing? driver bugs?) with our dithering implementation, right? Were there any other times we anticipate turning it off? If not, then yeah, that would be really uncommon. I think chrishtr@ had another case, but it doesn't seem to be in the record of the linked crbug.
On 2015/10/08 21:37:32, tomhudson wrote: > > The gotcha is we may need to recompute the cache tables if the shader is > reused > > with a different dither flag, but I presume this case is uncommon. > > We're turning off dithering when we detect that we're on hardware that has > problems > (temporal aliasing? driver bugs?) with our dithering implementation, right? Were > there > any other times we anticipate turning it off? If not, then yeah, that would be > really > uncommon. I think chrishtr@ had another case, but it doesn't seem to be in the > record > of the linked crbug. It's unclear what the final fix for that bug will look like - detecting problematic hw seems intractable, last I heard Chromium folks were tempted to use a bigger hammer and always disable dithering on Mac. But it's safe to say that whatever we'll do, for a given [Chromium_version,machine] tuple, dithering should be consistent (either on of off).
lgtm
The CQ bit was checked by fmalita@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1391303002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1391303002/80001
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://skia.googlesource.com/skia/+/37d8688927988cc71e968e0b5e90689e4e6ed7ab |