Created: 6 years, 7 months ago by Tom Hudson
Modified: 6 years, 7 months ago
CC: skia-review@googlegroups.com
Base URL: https://skia.googlesource.com/skia.git@master
Visibility: Public
Description

By default, Chromium optimizes for size when compiling on Android.
Forcing the SkAlphaMulQ() function inline can yield as much as a 5-10%
improvement in rasterization time for some of Chromium's telemetry
tests on the Nexus 10, since it's in the inner loop of complex blends.
R=mtklein@google.com,reed@google.com
Committed: http://code.google.com/p/skia/source/detail?r=14723
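For context, the body being forced inline is tiny; it appears verbatim in the disassembly and patch below. It scales all four channels of a premultiplied 32-bit pixel by an 8-bit factor, which is why it sits in the inner loop of complex blends. A minimal standalone sketch of that arithmetic (kMask_00FF00FF stands in for Skia's gMask_00FF00FF global, and the 0..256 range for scale is an assumption here):

    #include <cstdint>

    // For the usual ARGB packing, the 0x00FF00FF mask lets two 8-bit channels
    // be scaled with one 32-bit multiply: c & mask picks out red and blue,
    // (c >> 8) & mask picks out alpha and green.
    static const uint32_t kMask_00FF00FF = 0x00FF00FF;  // stand-in for gMask_00FF00FF

    // Sketch of SkAlphaMulQ: scale a premultiplied ARGB pixel by 'scale'.
    static inline uint32_t AlphaMulQ(uint32_t c, unsigned scale) {
        uint32_t mask = kMask_00FF00FF;
        uint32_t rb = ((c & mask) * scale) >> 8;  // red and blue
        uint32_t ag = ((c >> 8) & mask) * scale;  // alpha and green
        return (rb & mask) | (ag & ~mask);        // reassemble the pixel
    }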
Patch Set 1
Patch Set 2: Document use of forced inlining.
Patch Set 3: Rebase

Messages
Total messages: 11 (0 generated)
PTAL - there were 18 candidates for forced inlining on Blink, but in a first pass this is the only one that stands out in Skia. It has ~ 50 call sites, but isn't very complex, so the executable bloat should be pretty small.
+reed for include/

Tack on a comment about this in the code so we don't wonder in the future why we're forcing it?

Can you share around the profiles or stack traces you're seeing this under? It's disappointing performance-wise if SkAlphaMulQ is taking up any time. That means we've missed an opportunity for a bigger win with NEON.
lgtm
On 2014/05/13 12:07:41, mtklein wrote:
> Tack on a comment about this in the code so we don't wonder in the future why
> we're forcing it?

Done.

> Can you share around the profiles or stack traces you're seeing this under?
> It's disappointing performance-wise if SkAlphaMulQ is taking up any time. That
> means we've missed an opportunity for a bigger win with NEON.

At ToT on a Nexus 10, SkAlphaMulQ shows up as 2% of CPU cycles during the mobile-news swipe test from telemetry's rasterize_and_record_micro.key_silk_cases.

+   8.03%  1454  CompositorRaste  libchrome.1988.0.so  [.] memset32_loop128
+   7.81%  1433  CompositorRaste  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
+   7.64%  1376  CompositorRaste  [kernel.kallsyms]    [k] __memzero
+   6.87%  1242  CompositorRaste  libchrome.1988.0.so  [.] sk_memset32(unsigned int*, unsigned int, int)
+   4.08%   744  CompositorRaste  libchrome.1988.0.so  [.] D32_A8_Opaque(void*, unsigned int, void const*, unsigned int, unsigned int, int, in
+   3.24%   584  CompositorRaste  [kernel.kallsyms]    [k] handle_pte_fault
+   2.78%   499  CompositorRaste  [kernel.kallsyms]    [k] do_page_fault
+   2.09%   379  CompositorRaste  libchrome.1988.0.so  [.] SkAlphaMulQ(unsigned int, unsigned int)
+   2.01%   361  CompositorRaste  [kernel.kallsyms]    [k] get_page_from_freelist
+   1.87%   339  CompositorRaste  [kernel.kallsyms]    [k] _raw_spin_unlock
+   1.46%   262  CompositorRaste  [kernel.kallsyms]    [k] free_hot_cold_page
+   1.11%   200  CompositorRaste  [kernel.kallsyms]    [k] mem_cgroup_count_vm_event
+   0.99%   178  CompositorRaste  [kernel.kallsyms]    [k] _test_and_set_bit
+   0.90%   173  CompositorRaste  libc.so              [.] 0x000000000000e73c
+   0.88%   157  CompositorRaste  [kernel.kallsyms]    [k] page_remove_rmap
+   0.85%   154  CompositorRaste  [kernel.kallsyms]    [k] unmap_single_vma
+   0.84%   151  CompositorRaste  [kernel.kallsyms]    [k] handle_mm_fault
+   0.79%   142  CompositorRaste  [kernel.kallsyms]    [k] _clear_bit
+   0.78%   140  CompositorRaste  [kernel.kallsyms]    [k] __might_sleep
+   0.75%   134  CompositorRaste  [kernel.kallsyms]    [k] __mem_cgroup_uncharge_common
+   0.68%   122  CompositorRaste  [kernel.kallsyms]    [k] _set_bit
+   0.67%   120  CompositorRaste  [kernel.kallsyms]    [k] release_pages
+   0.63%   114  CompositorRaste  libchrome.1988.0.so  [.] S32_opaque_D32_filter_DX_neon(SkBitmapProcState const&, unsigned int const*, int, u
+   0.61%   110  CompositorRaste  [kernel.kallsyms]    [k] __raw_spin_lock_irqsave
+   0.60%   108  CompositorRaste  libchrome.1988.0.so  [.] Filter_32_opaque_neon(unsigned int, unsigned int, unsigned int, unsigned int, unsig
+   0.54%    97  CompositorRaste  [kernel.kallsyms]    [k] free_pages_prepare
+   0.52%    92  CompositorRaste  [kernel.kallsyms]    [k] __mem_cgroup_try_charge.constprop.42
+   0.51%    92  CompositorRaste  libchrome.1988.0.so  [.] S32A_Blend_BlitRow32_neon(unsigned int*, unsigned int const*, int, unsigned int)
+   0.51%    91  CompositorRaste  [kernel.kallsyms]    [k] __mem_cgroup_commit_charge.constprop.45

I think half of those cycles are spent stalling for the doubly-indirect load of gMask_00FF00FF, which should be fixed by a recent change, and 40% more are spent in function prelude and postlude, which this change helps address.
Disassembly of section .text:

02454148 <SkAlphaMulQ(unsigned int, unsigned int)>:
                 return SkPackARGB32(a, r, g, b);
             }

             SK_API extern const uint32_t gMask_00FF00FF;

             static /* SK_ALWAYS_INLINE */ uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
 20.32           ldr    r2, [pc, #48]   ; 2454180 <SkAlphaMulQ(unsigned int, unsigned int)+0x38>
                 mov    ip, sp
  6.86           push   {fp, ip, lr, pc}
                 uint32_t mask = gMask_00FF00FF;
 23.22           ldr    r2, [pc, r2]
                 sub    fp, ip, #4
 27.70           ldr    r3, [r2]

                 uint32_t rb = ((c & mask) * scale) >> 8;
                 uint32_t ag = ((c >> 8) & mask) * scale;
                 and    r2, r3, r0, lsr #8
                 and    r0, r0, r3
                 mul    r2, r1, r2
  0.26           mul    r1, r0, r1
                 return (rb & mask) | (ag & ~mask);
                 bic    r2, r2, r3
                 and    r3, r3, r1, lsr #8
             }
                 orr    r0, r2, r3
 21.64           ldm    sp, {fp, sp, pc}
                 .word  0x00cc64fc

The function is also responsible for ~2.7% of dcache misses, which again mostly come from the indirect load of the constant. When I read a profile of dcache misses from that test, I see a seemingly high miss rate for blitters / blenders (expected?), for memzero / memset (expected?), and then less expectedly for SkAlphaMulQ, SkGlyphCache::lookupMetrics (0.83%, bug filed), SkPicturePlayback::draw (0.52%), and stuff in the weeds.
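To make the prologue/epilogue and constant-load cost concrete, here is a purely hypothetical per-pixel loop (not Skia's actual blitter code; scalePixel and blendRow are made-up names): once the helper is forced inline, the compiler can keep the 0x00FF00FF mask in a register for the whole row, and the per-call push/ldm pair seen above disappears.

    #include <cstdint>

    // Forced-inline helper, roughly as GCC/Clang would see it after this change.
    static inline __attribute__((always_inline))
    uint32_t scalePixel(uint32_t c, unsigned scale, uint32_t mask) {
        uint32_t rb = ((c & mask) * scale) >> 8;
        uint32_t ag = ((c >> 8) & mask) * scale;
        return (rb & mask) | (ag & ~mask);
    }

    // With the call inlined, 'mask' is set up once per row instead of being
    // re-fetched through a doubly-indirect load on every pixel.
    void blendRow(uint32_t* dst, const uint32_t* src, int n, unsigned scale) {
        const uint32_t mask = 0x00FF00FF;
        for (int i = 0; i < n; ++i) {
            dst[i] = scalePixel(src[i], scale, mask);
        }
    }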
The CQ bit was checked by tomhudson@google.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/tomhudson@chromium.org/283753003/20001
The CQ bit was unchecked by commit-bot@chromium.org
Failed to apply patch for include/core/SkColorPriv.h:
While running patch -p1 --forward --force --no-backup-if-mismatch;
  patching file include/core/SkColorPriv.h
  Hunk #1 FAILED at 517.
  1 out of 1 hunk FAILED -- saving rejects to file include/core/SkColorPriv.h.rej

Patch: include/core/SkColorPriv.h
Index: include/core/SkColorPriv.h
diff --git a/include/core/SkColorPriv.h b/include/core/SkColorPriv.h
index 9591f22725a81289059a04bb82bddc79c849cc25..e056b68f58e4d7c0539c98a77243260a233fa597 100644
--- a/include/core/SkColorPriv.h
+++ b/include/core/SkColorPriv.h
@@ -517,7 +517,9 @@ SkPMColor SkPremultiplyARGBInline(U8CPU a, U8CPU r, U8CPU g, U8CPU b) {
 
 SK_API extern const uint32_t gMask_00FF00FF;
 
-static inline uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
+// When Android is compiled optimizing for size, SkAlphaMulQ doesn't get
+// inlined; forcing inlining significantly improves performance.
+static SK_ALWAYS_INLINE uint32_t SkAlphaMulQ(uint32_t c, unsigned scale) {
     uint32_t mask = gMask_00FF00FF;
 
     uint32_t rb = ((c & mask) * scale) >> 8;
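For reference, forced inlining of this kind is normally spelled with a compiler attribute. The macro below is a common way an SK_ALWAYS_INLINE-style switch is expressed for GCC/Clang and MSVC; it is a sketch of the general pattern, not a copy of Skia's actual definition, which lives in Skia's own configuration headers.

    // Common pattern for a forced-inline macro (sketch, not Skia's exact code).
    #if defined(__GNUC__) || defined(__clang__)
    #   define SK_ALWAYS_INLINE inline __attribute__((always_inline))
    #elif defined(_MSC_VER)
    #   define SK_ALWAYS_INLINE __forceinline
    #else
    #   define SK_ALWAYS_INLINE inline
    #endif

The attribute only overrides the inliner's cost model (here, the size-optimized Android build deciding the ~50 call sites aren't worth the code growth); the function body still has to be visible at each call site for the compiler to inline it, which SkColorPriv.h already satisfies.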
The CQ bit was checked by tomhudson@google.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/tomhudson@chromium.org/283753003/40001
Message was sent while issue was closed.
Change committed as 14723