Created: 6 years, 7 months ago by Tom Hudson
Modified: 6 years, 7 months ago
CC: skia-review@googlegroups.com
Base URL: https://skia.googlesource.com/skia.git@master
Visibility: Public
Description

By default, Chromium optimizes for size when compiling on Android.
Forcing the SkAlphaMulQ() function inline can yield as much as a 5-10%
improvement in rasterization time for some of Chromium's telemetry
tests on the Nexus 10, since it's in the inner loop of complex blends.
R=mtklein@google.com,reed@google.com
Committed: http://code.google.com/p/skia/source/detail?r=14723
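For context, the body being forced inline is tiny; it appears verbatim in the disassembly and patch below. It scales all four channels of a premultiplied 32-bit pixel by an 8-bit factor, which is why it sits in the inner loop of complex blends. A minimal standalone sketch of that arithmetic (kMask_00FF00FF stands in for Skia's gMask_00FF00FF global, and the 0..256 range for scale is an assumption here):

    #include <cstdint>

    // For the usual ARGB packing, the 0x00FF00FF mask lets two 8-bit channels
    // be scaled with one 32-bit multiply: c & mask picks out red and blue,
    // (c >> 8) & mask picks out alpha and green.
    static const uint32_t kMask_00FF00FF = 0x00FF00FF;  // stand-in for gMask_00FF00FF

    // Sketch of SkAlphaMulQ: scale a premultiplied ARGB pixel by 'scale'.
    static inline uint32_t AlphaMulQ(uint32_t c, unsigned scale) {
        uint32_t mask = kMask_00FF00FF;
        uint32_t rb = ((c & mask) * scale) >> 8;  // red and blue
        uint32_t ag = ((c >> 8) & mask) * scale;  // alpha and green
        return (rb & mask) | (ag & ~mask);        // reassemble the pixel
    }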
Patch Set 1
Patch Set 2: Document use of forced inlining.
Patch Set 3: Rebase

Messages
Total messages: 11 (0 generated)
PTAL - there were 18 candidates for forced inlining on Blink, but in a first pass this is the only one that stands out in Skia. It has ~ 50 call sites, but isn't very complex, so the executable bloat should be pretty small.
+reed for include/

Tack on a comment about this in the code so we don't wonder in the future why we're forcing it?

Can you share around the profiles or stack traces you're seeing this under? It's disappointing performance-wise if SkAlphaMulQ is taking up any time. That means we've missed an opportunity for a bigger win with NEON.
lgtm
On 2014/05/13 12:07:41, mtklein wrote:
> Tack on a comment about this in the code so we don't wonder in the future why
> we're forcing it?

Done.

> Can you share around the profiles or stack traces you're seeing this under?
> It's disappointing performance-wise if SkAlphaMulQ is taking up any time. That
> means we've missed an opportunity for a bigger win with NEON.

At ToT on a Nexus 10, SkAlphaMulQ shows up as 2% of CPU cycles during the mobile-news swipe test from telemetry's rasterize_and_record_micro.key_silk_cases.

+   8.03%  1454  CompositorRaste  libchrome.1988.0.so  [.] memset32_loop128
+   7.81%  1433  CompositorRaste  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
+   7.64%  1376  CompositorRaste  [kernel.kallsyms]    [k] __memzero
+   6.87%  1242  CompositorRaste  libchrome.1988.0.so  [.] sk_memset32(unsigned int*, unsigned int, int)
+   4.08%   744  CompositorRaste  libchrome.1988.0.so  [.] D32_A8_Opaque(void*, unsigned int, void const*, unsigned int, unsigned int, int, in
+   3.24%   584  CompositorRaste  [kernel.kallsyms]    [k] handle_pte_fault
+   2.78%   499  CompositorRaste  [kernel.kallsyms]    [k] do_page_fault
+   2.09%   379  CompositorRaste  libchrome.1988.0.so  [.] SkAlphaMulQ(unsigned int, unsigned int)
+   2.01%   361  CompositorRaste  [kernel.kallsyms]    [k] get_page_from_freelist
+   1.87%   339  CompositorRaste  [kernel.kallsyms]    [k] _raw_spin_unlock
+   1.46%   262  CompositorRaste  [kernel.kallsyms]    [k] free_hot_cold_page
+   1.11%   200  CompositorRaste  [kernel.kallsyms]    [k] mem_cgroup_count_vm_event
+   0.99%   178  CompositorRaste  [kernel.kallsyms]    [k] _test_and_set_bit
+   0.90%   173  CompositorRaste  libc.so              [.] 0x000000000000e73c
+   0.88%   157  CompositorRaste  [kernel.kallsyms]    [k] page_remove_rmap
+   0.85%   154  CompositorRaste  [kernel.kallsyms]    [k] unmap_single_vma
+   0.84%   151  CompositorRaste  [kernel.kallsyms]    [k] handle_mm_fault
+   0.79%   142  CompositorRaste  [kernel.kallsyms]    [k] _clear_bit
+   0.78%   140  CompositorRaste  [kernel.kallsyms]    [k] __might_sleep
+   0.75%   134  CompositorRaste  [kernel.kallsyms]    [k] __mem_cgroup_uncharge_common
+   0.68%   122  CompositorRaste  [kernel.kallsyms]    [k] _set_bit
+   0.67%   120  CompositorRaste  [kernel.kallsyms]    [k] release_pages
+   0.63%   114  CompositorRaste  libchrome.1988.0.so  [.] S32_opaque_D32_filter_DX_neon(SkBitmapProcState const&, unsigned int const*, int, u
+   0.61%   110  CompositorRaste  [kernel.kallsyms]    [k] __raw_spin_lock_irqsave
+   0.60%   108  CompositorRaste  libchrome.1988.0.so  [.] Filter_32_opaque_neon(unsigned int, unsigned int, unsigned int, unsigned int, unsig
+   0.54%    97  CompositorRaste  [kernel.kallsyms]    [k] free_pages_prepare
+   0.52%    92  CompositorRaste  [kernel.kallsyms]    [k] __mem_cgroup_try_charge.constprop.42
+   0.51%    92  CompositorRaste  libchrome.1988.0.so  [.] S32A_Blend_BlitRow32_neon(unsigned int*, unsigned int const*, int, unsigned int)
+   0.51%    91  CompositorRaste  [kernel.kallsyms]    [k] __mem_cgroup_commit_charge.constprop.45

I think half of those cycles are spent stalling for the doubly-indirect load of gMask_00FF00FF, which should be fixed by a recent change, and 40% more are spent in function prelude and postlude, which this change helps address.
Disassembly of section .text:

02454148 <SkAlphaMulQ(unsigned int, unsigned int)>:
                 return SkPackARGB32(a, r, g, b);
             }

             SK_API extern const uint32_t gMask_00FF00FF;

             static /* SK_ALWAYS_INLINE */ uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
 20.32           ldr    r2, [pc, #48]   ; 2454180 <SkAlphaMulQ(unsigned int, unsigned int)+0x38>
                 mov    ip, sp
  6.86           push   {fp, ip, lr, pc}
                 uint32_t mask = gMask_00FF00FF;
 23.22           ldr    r2, [pc, r2]
                 sub    fp, ip, #4
 27.70           ldr    r3, [r2]

                 uint32_t rb = ((c & mask) * scale) >> 8;
                 uint32_t ag = ((c >> 8) & mask) * scale;
                 and    r2, r3, r0, lsr #8
                 and    r0, r0, r3
                 mul    r2, r1, r2
  0.26           mul    r1, r0, r1
                 return (rb & mask) | (ag & ~mask);
                 bic    r2, r2, r3
                 and    r3, r3, r1, lsr #8
             }
                 orr    r0, r2, r3
 21.64           ldm    sp, {fp, sp, pc}
                 .word  0x00cc64fc

The function is also responsible for ~2.7% of dcache misses, which again mostly come from the indirect load of the constant. When I read a profile of dcache misses from that test, I see a seemingly high miss rate for blitters / blenders (expected?), for memzero / memset (expected?), and then less expectedly for SkAlphaMulQ, SkGlyphCache::lookupMetrics (0.83%, bug filed), SkPicturePlayback::draw (0.52%), and stuff in the weeds.
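To make the prologue/epilogue and constant-load cost concrete, here is a purely hypothetical per-pixel loop (not Skia's actual blitter code; scalePixel and blendRow are made-up names): once the helper is forced inline, the compiler can keep the 0x00FF00FF mask in a register for the whole row, and the per-call push/ldm pair seen above disappears.

    #include <cstdint>

    // Forced-inline helper, roughly as GCC/Clang would see it after this change.
    static inline __attribute__((always_inline))
    uint32_t scalePixel(uint32_t c, unsigned scale, uint32_t mask) {
        uint32_t rb = ((c & mask) * scale) >> 8;
        uint32_t ag = ((c >> 8) & mask) * scale;
        return (rb & mask) | (ag & ~mask);
    }

    // With the call inlined, 'mask' is set up once per row instead of being
    // re-fetched through a doubly-indirect load on every pixel.
    void blendRow(uint32_t* dst, const uint32_t* src, int n, unsigned scale) {
        const uint32_t mask = 0x00FF00FF;
        for (int i = 0; i < n; ++i) {
            dst[i] = scalePixel(src[i], scale, mask);
        }
    }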
The CQ bit was checked by tomhudson@google.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/tomhudson@chromium.org/283753003/20001
The CQ bit was unchecked by commit-bot@chromium.org
Failed to apply patch for include/core/SkColorPriv.h:
While running patch -p1 --forward --force --no-backup-if-mismatch;
  patching file include/core/SkColorPriv.h
  Hunk #1 FAILED at 517.
  1 out of 1 hunk FAILED -- saving rejects to file include/core/SkColorPriv.h.rej

Patch: include/core/SkColorPriv.h
Index: include/core/SkColorPriv.h
diff --git a/include/core/SkColorPriv.h b/include/core/SkColorPriv.h
index 9591f22725a81289059a04bb82bddc79c849cc25..e056b68f58e4d7c0539c98a77243260a233fa597 100644
--- a/include/core/SkColorPriv.h
+++ b/include/core/SkColorPriv.h
@@ -517,7 +517,9 @@ SkPMColor SkPremultiplyARGBInline(U8CPU a, U8CPU r, U8CPU g, U8CPU b) {
 
 SK_API extern const uint32_t gMask_00FF00FF;
 
-static inline uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
+// When Android is compiled optimizing for size, SkAlphaMulQ doesn't get
+// inlined; forcing inlining significantly improves performance.
+static SK_ALWAYS_INLINE uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
     uint32_t mask = gMask_00FF00FF;
 
     uint32_t rb = ((c & mask) * scale) >> 8;
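For reference, forced inlining of this kind is normally spelled with a compiler attribute. The macro below is a common way an SK_ALWAYS_INLINE-style switch is expressed for GCC/Clang and MSVC; it is a sketch of the general pattern, not a copy of Skia's actual definition, which lives in Skia's own configuration headers.

    // Common pattern for a forced-inline macro (sketch, not Skia's exact code).
    #if defined(__GNUC__) || defined(__clang__)
    #   define SK_ALWAYS_INLINE inline __attribute__((always_inline))
    #elif defined(_MSC_VER)
    #   define SK_ALWAYS_INLINE __forceinline
    #else
    #   define SK_ALWAYS_INLINE inline
    #endif

The attribute only overrides the inliner's cost model (here, the size-optimized Android build deciding the ~50 call sites aren't worth the code growth); the function body still has to be visible at each call site for the compiler to inline it, which SkColorPriv.h already satisfies.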
The CQ bit was checked by tomhudson@google.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/tomhudson@chromium.org/283753003/40001
Message was sent while issue was closed.
Change committed as 14723