|
|
Created:
6 years, 7 months ago by henrik.smiding Modified:
6 years, 5 months ago CC:
skia-review_googlegroups.com Base URL:
https://skia.googlesource.com/skia.git@master Visibility:
Public. |
DescriptionAdd SSE4 optimization of S32A_Opaque_Blitrow
Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
instruction set. Special case for when alpha is zero or opaque.
Performance increase of 10%-400% compared to the existing SSE2
optimization (measured on Silvermont architecture).
Noticeable in ~25 different skia bench subtests, especially in
bitmap_8888_*, repeatTile_*, and morph_*.
bitmap_8888_A - 100% faster
bitmap_8888_A_source_transparent - 250% faster
bitmap_8888_A_source_opaque - 25% faster
bitmap_8888_A_scale_bicubic - 75% faster
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Committed: https://skia.googlesource.com/skia/+/e2527b147679b0c43019fae7d59cc3777d2d097e
Committed: https://skia.googlesource.com/skia/+/b5c281e1e06af3be804309877de1dac6145686b9
Committed: https://skia.googlesource.com/skia/+/3bb195ef0d9691384027d7b61b0b8ef8379aaf5d
Patch Set 1 #
Total comments: 15
Patch Set 2 : Fixed review comments #Patch Set 3 : Fixed section type #Patch Set 4 : Another section fix... #Patch Set 5 : Fix section, again #Patch Set 6 : Trying to please clang #Patch Set 7 : And again... #Patch Set 8 : Clang is being difficult #Patch Set 9 : Linker fix(?) #Patch Set 10 : Excluded clang #Patch Set 11 : More Clang syntax fixes #Patch Set 12 : Change from exclude to include check #Patch Set 13 : Exclude gcc-llvm-combo #Patch Set 14 : exclude Mac10.6 #Patch Set 15 : llvm section fix #Patch Set 16 : llvm really doesn't support gcc/gas... #
Total comments: 2
Patch Set 17 : Added comment and flag #Patch Set 18 : Fixed Valgrind errors in blitter #Patch Set 19 : Test to find which clang version wants which label prefix #Patch Set 20 : Test of label fix #Patch Set 21 : Fix for double label #Patch Set 22 : Fix symbols V2 #Patch Set 23 : Fix deferred test case for Valgrind #
Messages
Total messages: 66 (0 generated)
Is it possible to write these SSE4 blits in intrinsics without major performance loss? Skia's not completely opposed to SIMD-in-assembly (notably, we use it in places where NEON intrinsics just can't express what we want), but we do like to use intrinsics if reasonable. We find them more maintainable, more readable (though your comments are excellent help), and it'd be nice to get these wins on Windows too.
On 2014/05/14 17:12:43, mtklein wrote: > Is it possible to write these SSE4 blits in intrinsics without major performance > loss? > > Skia's not completely opposed to SIMD-in-assembly (notably, we use it in places > where NEON intrinsics just can't express what we want), but we do like to use > intrinsics if reasonable. We find them more maintainable, more readable (though > your comments are excellent help), and it'd be nice to get these wins on Windows > too. Unfortunately, no. I did parts of an initial version using intrinsics, but the performance, especially for smaller widths, was about half compared to the assembly version. This was probably, in part, due to the limitations of the 'ptest' instruction when using intrinsics. Since this is one of the most used blitters in Skia, I wanted it to be as fast as I could make it. I hope to at least get time convert it to Intel assembly syntax, so it can be build with Visual Studio as well, but it'll probably have to wait until after the Android L-deadline.
https://codereview.chromium.org/289473009/diff/1/gyp/skia_lib.gyp File gyp/skia_lib.gyp (right): https://codereview.chromium.org/289473009/diff/1/gyp/skia_lib.gyp#newcode18 gyp/skia_lib.gyp:18: 'opts.gyp:opts_sse4', Derek, how come we exclude ssse3 and sse4 from android? https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4.h File src/opts/SkBlitRow_opts_SSE4.h (right): https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.h:18: extern "C" void S32A_Blend_BlitRow32_SSE4_asm(SkPMColor* SK_RESTRICT dst, Remove until implemented? https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... File src/opts/SkBlitRow_opts_SSE4_asm.S (right): https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:48: pcmpeqw %xmm6, %xmm6 // 16-bit 256 to calculate inv. alpha Does the interlaced instruction scheduling here really help? If not, it sure hurts comprehensibility. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:66: ptest %xmm7, %xmm1 // Check if all alphas are zero or opaque Is this the sort of place intrinsics fail us? I guess you'd have to have used two _mm_test* here to get the same effect as this three-way ptest? https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:108: lddqu (%eax, %edi), %xmm1 // Load last four source pixels (overlapping) I was expecting we'd fall back on non-SIMD or do something complicated to handle the tail. I will have to remember this trick. Can you add a note that we can only do all four when alpha == 0 or FF, and that when we actually blend we have to be careful about not double-blending the overlapping pixels? https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:113: // Handle mixed alphas (calculate and scale) Can we share or macro away this big blend block? I think I count it 11 times, sometimes with slightly different register choices. Even if it can't all be shared, it'd be nice to factor apart the required and incidental differences. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:219: ptest %xmm7, %xmm1 // Check if all alphas are zero or opaque Do you think we're benefitting by having everything here in SSE? I'm wondering if it'd simplify things to handle these edgy cases (small, unaligned) in portable code and call into this asm only for the long-haul when we can assert dst is aligned and count >= 4 or 16? If it doesn't kill performance, I want to break this monster method up into chunks that are easier to reason about. If we're going to handle all of unaligned dst, aligned dst/unaligned src, and aligned src+dst, maybe we can pull those tests out into C++ code up front, and have it dispatch to three different methods in here? https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... File src/opts/SkBlitRow_opts_SSE4_x64_asm.S (right): https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_x64_asm.S:571: .LAlphaCheckMask: Looks like the differences here are: 1) calling convention is different (of course) 2) pointers and offsets go in different registers and use different op suffixes (of course) 3) We store constants in the x86-64 code but construct them programmatically in the x86 code Can we do 3) in both? Or, more aggressively, perhaps we can just forget about writing x86 fast paths and focus on x86-64? I take it Silvermont is x86-64, right? And so will be all future chips?
https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4.h File src/opts/SkBlitRow_opts_SSE4.h (right): https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.h:18: extern "C" void S32A_Blend_BlitRow32_SSE4_asm(SkPMColor* SK_RESTRICT dst, On 2014/05/16 18:06:38, mtklein wrote: > Remove until implemented? Done. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... File src/opts/SkBlitRow_opts_SSE4_asm.S (right): https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:48: pcmpeqw %xmm6, %xmm6 // 16-bit 256 to calculate inv. alpha On 2014/05/16 18:06:38, mtklein wrote: > Does the interlaced instruction scheduling here really help? If not, it sure > hurts comprehensibility. On a Haswell core, probably not. On a Silvermont/Atom, especially for SSE, it does. But in this case I see no problem doing it in sequence, since it's initialization and not a loop. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:66: ptest %xmm7, %xmm1 // Check if all alphas are zero or opaque On 2014/05/16 18:06:38, mtklein wrote: > Is this the sort of place intrinsics fail us? I guess you'd have to have used > two _mm_test* here to get the same effect as this three-way ptest? That's correct. It was all about not making the worst case worse (pixel alphas in the range 1-254). I'm guessing there was a problem with register allocation and op-code ordering as well. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:108: lddqu (%eax, %edi), %xmm1 // Load last four source pixels (overlapping) On 2014/05/16 18:06:38, mtklein wrote: > I was expecting we'd fall back on non-SIMD or do something complicated to handle > the tail. I will have to remember this trick. Can you add a note that we can > only do all four when alpha == 0 or FF, and that when we actually blend we have > to be careful about not double-blending the overlapping pixels? I've improved the comments a bit. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:113: // Handle mixed alphas (calculate and scale) On 2014/05/16 18:06:38, mtklein wrote: > Can we share or macro away this big blend block? I think I count it 11 times, > sometimes with slightly different register choices. Even if it can't all be > shared, it'd be nice to factor apart the required and incidental differences. Done. I replaced about 200 lines of code with macros. It shouldn't affect performance at all. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_asm.S:219: ptest %xmm7, %xmm1 // Check if all alphas are zero or opaque On 2014/05/16 18:06:38, mtklein wrote: > Do you think we're benefitting by having everything here in SSE? I'm wondering > if it'd simplify things to handle these edgy cases (small, unaligned) in > portable code and call into this asm only for the long-haul when we can assert > dst is aligned and count >= 4 or 16? > > If it doesn't kill performance, I want to break this monster method up into > chunks that are easier to reason about. If we're going to handle all of > unaligned dst, aligned dst/unaligned src, and aligned src+dst, maybe we can pull > those tests out into C++ code up front, and have it dispatch to three different > methods in here? That would kill performance of short blits, like 1-16 pixels, which are quite common. It would also force at least three copies of the initialization code, and duplication of the code paths that are reused between the blocks. If you want to reduce the number of lines further, I could test if the 'perfect alignment' path could be removed without affecting performance. https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... File src/opts/SkBlitRow_opts_SSE4_x64_asm.S (right): https://codereview.chromium.org/289473009/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4_x64_asm.S:571: .LAlphaCheckMask: On 2014/05/16 18:06:38, mtklein wrote: > Looks like the differences here are: > 1) calling convention is different (of course) > 2) pointers and offsets go in different registers and use different op > suffixes (of course) > 3) We store constants in the x86-64 code but construct them programmatically > in the x86 code > > Can we do 3) in both? Or, more aggressively, perhaps we can just forget about > writing x86 fast paths and focus on x86-64? I take it Silvermont is x86-64, > right? And so will be all future chips? I tested doing a position independent version in 32-bit. It was slower. Probably due to the time it took to fetch the IP. Yes, Silvermont supports x86-64. But just like on Windows, many apps still run in x86 or x32 mode. In Android L there will be both a 64-bit and a 32-bit version of Skia, simultaneously. Some legacy apps with native code 32-bit code will still use 32-bit Skia. I'm guessing the webview app will also run both in 32 and 64-bit versions.
Thanks for your explanations, changes, and patience. This LGTM.
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/20001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/40001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/60001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/80001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/10...
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/12...
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/14...
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Mac10.8-Clang-x86-Release-Trybot on tryserver.chromium (http://108.170.220.76:10117/builders/Build-Mac10.8-Clang-x86-Release-Trybot/b...)
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/16...
Message was sent while issue was closed.
Change committed as e2527b147679b0c43019fae7d59cc3777d2d097e
Message was sent while issue was closed.
A revert of this CL has been created in https://codereview.chromium.org/311053009/ by jvanverth@google.com. The reason for reverting is: Buildbot failures on Mac 10.6 and Mac 10.7..
Message was sent while issue was closed.
On 2014/06/05 15:15:34, jvanverth1 wrote: > A revert of this CL has been created in > https://codereview.chromium.org/311053009/ by mailto:jvanverth@google.com. > > The reason for reverting is: Buildbot failures on Mac 10.6 and Mac 10.7.. Thanks Jim.
Message was sent while issue was closed.
On 2014/06/05 15:22:12, mtklein wrote: > On 2014/06/05 15:15:34, jvanverth1 wrote: > > A revert of this CL has been created in > > https://codereview.chromium.org/311053009/ by mailto:jvanverth@google.com. > > > > The reason for reverting is: Buildbot failures on Mac 10.6 and Mac 10.7.. > > Thanks Jim. For what it's worth, our ASAN bot (Clang 3.4 or 3.5, using -m64 -fsanitize=address) failed while linking: http://108.170.220.120:10117/builders/Test-Ubuntu13.10-GCE-NoGPU-x86_64-Debug...
On 2014/06/05 15:29:39, mtklein wrote: > On 2014/06/05 15:22:12, mtklein wrote: > > On 2014/06/05 15:15:34, jvanverth1 wrote: > > > A revert of this CL has been created in > > > https://codereview.chromium.org/311053009/ by mailto:jvanverth@google.com. > > > > > > The reason for reverting is: Buildbot failures on Mac 10.6 and Mac 10.7.. > > > > Thanks Jim. > > For what it's worth, our ASAN bot (Clang 3.4 or 3.5, using -m64 > -fsanitize=address) failed while linking: > http://108.170.220.120:10117/builders/Test-Ubuntu13.10-GCE-NoGPU-x86_64-Debug... Please add Clang Build-bots for Mac 10.6, 10.7, and 10.8
On 2014/06/09 14:14:34, henrik.smiding wrote: > On 2014/06/05 15:29:39, mtklein wrote: > > On 2014/06/05 15:22:12, mtklein wrote: > > > On 2014/06/05 15:15:34, jvanverth1 wrote: > > > > A revert of this CL has been created in > > > > https://codereview.chromium.org/311053009/ by mailto:jvanverth@google.com. > > > > > > > > The reason for reverting is: Buildbot failures on Mac 10.6 and Mac 10.7.. > > > > > > Thanks Jim. > > > > For what it's worth, our ASAN bot (Clang 3.4 or 3.5, using -m64 > > -fsanitize=address) failed while linking: > > > http://108.170.220.120:10117/builders/Test-Ubuntu13.10-GCE-NoGPU-x86_64-Debug... > > Please add Clang Build-bots for Mac 10.6, 10.7, and 10.8 Done. And, should be that tomorrow morning you'll have access to trigger these yourself.
On 2014/06/09 14:22:28, mtklein wrote: > On 2014/06/09 14:14:34, henrik.smiding wrote: > > On 2014/06/05 15:29:39, mtklein wrote: > > > On 2014/06/05 15:22:12, mtklein wrote: > > > > On 2014/06/05 15:15:34, jvanverth1 wrote: > > > > > A revert of this CL has been created in > > > > > https://codereview.chromium.org/311053009/ by > mailto:jvanverth@google.com. > > > > > > > > > > The reason for reverting is: Buildbot failures on Mac 10.6 and Mac > 10.7.. > > > > > > > > Thanks Jim. > > > > > > For what it's worth, our ASAN bot (Clang 3.4 or 3.5, using -m64 > > > -fsanitize=address) failed while linking: > > > > > > http://108.170.220.120:10117/builders/Test-Ubuntu13.10-GCE-NoGPU-x86_64-Debug... > > > > Please add Clang Build-bots for Mac 10.6, 10.7, and 10.8 > > Done. And, should be that tomorrow morning you'll have access to trigger these > yourself. Awesome. No more embarrassing reverts for me!
LGTM Checked out those bots. The IntelRHB bot is just an incredibly slow bot, so it's no surprise it timed out. Sort of worthless. Yesterday we had a lot of "fun" with our build bots, with Windows bots in particular. I wouldn't worry about that one stray update failure. The GM comparison failures on the Ubuntu bot are a known, unrelated issue. https://codereview.chromium.org/289473009/diff/300001/src/opts/SkBlitRow_opts... File src/opts/SkBlitRow_opts_SSE4.h (right): https://codereview.chromium.org/289473009/diff/300001/src/opts/SkBlitRow_opts... src/opts/SkBlitRow_opts_SSE4.h:13: #if defined(__clang__) || (defined(__GNUC__) && !defined(SK_BUILD_FOR_MAC)) So, help me parse this condition now? Clang -> Yes GCC -> Yes, unless on Mac Meaning, we'll build for Mac with Clang, Linux with Clang or GCC, Android with Clang or GCC, and not at all for Windows. Is that right? If so, SGTM. Can you just tack on a reminder comment that the GCC-on-Mac check is to guard against LLVM-GCC 4.2? Presumably vanilla GCC on Mac would work fine, but that's not really a viable choice anymore anyway. Maybe define SK_SSE4_ASM_SUPPORTED when this is true so that the other places that use this same condition can have something a bit more semantic to check?
https://codereview.chromium.org/289473009/diff/300001/src/opts/SkBlitRow_opts... File src/opts/SkBlitRow_opts_SSE4.h (right): https://codereview.chromium.org/289473009/diff/300001/src/opts/SkBlitRow_opts... src/opts/SkBlitRow_opts_SSE4.h:13: #if defined(__clang__) || (defined(__GNUC__) && !defined(SK_BUILD_FOR_MAC)) On 2014/06/17 15:14:29, mtklein wrote: > So, help me parse this condition now? > > Clang -> Yes > GCC -> Yes, unless on Mac > > Meaning, we'll build for Mac with Clang, Linux with Clang or GCC, Android with > Clang or GCC, and not at all for Windows. Is that right? If so, SGTM. > > Can you just tack on a reminder comment that the GCC-on-Mac check is to guard > against LLVM-GCC 4.2? Presumably vanilla GCC on Mac would work fine, but that's > not really a viable choice anymore anyway. > > Maybe define SK_SSE4_ASM_SUPPORTED when this is true so that the other places > that use this same condition can have something a bit more semantic to check? Correct. Added a flag and updated the two checks in opts_check file.
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/32...
Message was sent while issue was closed.
Change committed as b5c281e1e06af3be804309877de1dac6145686b9
Message was sent while issue was closed.
I'm seeing a lot of valgrind errors popping up after this commit: % valgrind out/Debug/tests [...] ==8063== Conditional jump or move depends on uninitialised value(s) ==8063== at 0x885D73: S32A_Opaque_BlitRow32_SSE4_asm (SkBlitRow_opts_SSE4_x64_asm.S:431) ==8063== by 0x6104A2: SkARGB32_Shader_Blitter::blitH(int, int, int) (SkBlitter_ARGB32.cpp:331) ==8063== by 0x6ADD6B: walk_convex_edges(SkEdge*, SkPath::FillType, SkBlitter*, int, int, void (*)(SkBlitter*, int, bool)) (SkScan_Path.cpp:272) ==8063== by 0x6AEB83: sk_fill_triangle(SkPoint const*, SkIRect const*, SkBlitter*, SkIRect const&) (SkScan_Path.cpp:700) ==8063== by 0x6AED21: SkScan::FillTriangle(SkPoint const*, SkRasterClip const&, SkBlitter*) (SkScan_Path.cpp:731) ==8063== by 0x635186: SkDraw::drawVertices(SkCanvas::VertexMode, int, SkPoint const*, SkPoint const*, unsigned int const*, SkXfermode*, unsigned short const*, int, SkPaint const&) const (SkDraw.cpp:2236) ==8063== by 0x5EA45E: SkBitmapDevice::drawVertices(SkDraw const&, SkCanvas::VertexMode, int, SkPoint const*, SkPoint const*, unsigned int const*, SkXfermode*, unsigned short const*, int, SkPaint const&) (SkBitmapDevice.cpp:429) ==8063== by 0x61DC70: SkCanvas::drawVertices(SkCanvas::VertexMode, int, SkPoint const*, SkPoint const*, unsigned int const*, SkXfermode*, unsigned short const*, int, SkPaint const&) (SkCanvas.cpp:2346) ==8063== by 0x48A769: Mesh::draw(SkCanvas*, SkPaint*) (BlitRowTest.cpp:177) ==8063== by 0x48A3C2: test_diagonal(skiatest::Reporter*) (BlitRowTest.cpp:241) [...] ==8063== Conditional jump or move depends on uninitialised value(s) ==8063== at 0x885D00: S32A_Opaque_BlitRow32_SSE4_asm (SkBlitRow_opts_SSE4_x64_asm.S:406) ==8063== by 0x61086A: SkARGB32_Shader_Blitter::blitRect(int, int, int, int) (SkBlitter_ARGB32.cpp:404) ==8063== by 0x6A59FF: blitrect(SkBlitter*, SkIRect const&) (SkScan.cpp:15) ==8063== by 0x6A5A96: SkScan::FillIRect(SkIRect const&, SkRegion const*, SkBlitter*) (SkScan.cpp:26) ==8063== by 0x6A5C72: SkScan::FillIRect(SkIRect const&, SkRasterClip const&, SkBlitter*) (SkScan.cpp:73) ==8063== by 0x62E8EE: SkDraw::drawPaint(SkPaint const&) const (SkDraw.cpp:288) ==8063== by 0x5E9D7A: SkBitmapDevice::drawPaint(SkDraw const&, SkPaint const&) (SkBitmapDevice.cpp:261) ==8063== by 0x61BB18: SkCanvas::internalDrawPaint(SkPaint const&) (SkCanvas.cpp:1867) ==8063== by 0x61BA5A: SkCanvas::drawPaint(SkPaint const&) (SkCanvas.cpp:1860) ==8063== by 0x6CCD88: drawPaint_rp(SkCanvas*, SkReader32*, unsigned int, SkGPipeState*) (SkGPipeRead.cpp:358) ==8063== by 0x6CE79E: SkGPipeReader::playback(void const*, unsigned long, unsigned int, unsigned long*) (SkGPipeRead.cpp:890) ==8063== by 0x84AE2E: DeferredPipeController::playback(bool) (SkDeferredCanva
Message was sent while issue was closed.
(As a side note, it would be really nice if these functions used intrinsics instead. I do realize that it's sometimes not possible to match the performance of hand-tuned asm with intrinsics, but in those cases where it is, it gives us portability to x86/x86_64/win/win64 in one shot.)
Message was sent while issue was closed.
On 2014/06/17 22:43:35, Stephen White wrote: > (As a side note, it would be really nice if these functions used intrinsics > instead. I do realize that it's sometimes not possible to match the performance > of hand-tuned asm with intrinsics, but in those cases where it is, it gives us > portability to x86/x86_64/win/win64 in one shot.) This is one of those times. The key here is ptest, which is more flexible for three-way branching in assembly than intrinsics.
Message was sent while issue was closed.
On 2014/06/17 23:04:22, mtklein wrote: > On 2014/06/17 22:43:35, Stephen White wrote: > > (As a side note, it would be really nice if these functions used intrinsics > > instead. I do realize that it's sometimes not possible to match the > performance > > of hand-tuned asm with intrinsics, but in those cases where it is, it gives us > > portability to x86/x86_64/win/win64 in one shot.) > > This is one of those times. The key here is ptest, which is more flexible for > three-way branching in assembly than intrinsics. Cool, thanks for the explanation. BTW, the valgrind bot seems to corroborate my findings: [12:47:12.735195] [176/304] 9293ms DeviceLooper [12:47:12.735222] [177/304] 16ms Deque==28584== Conditional jump or move depends on uninitialised value(s) [12:47:12.735246] ==28584== at 0x6A9A15: S32A_Opaque_BlitRow32_SSE4_asm (SkBlitRow_opts_SSE4_x64_asm.S:347) [12:47:12.735283] ==28584== by 0x5E28B6: Sprite_D32_S32::blitRect(int, int, int, int) (SkSpriteBlitter_ARGB32.cpp:48) [12:47:12.735306] ==28584== by 0x593E30: SkScan::FillIRect(SkIRect const&, SkRegion const*, SkBlitter*) (SkScan.cpp:15) [12:47:12.735331] ==28584== by 0x59403E: SkScan::FillIRect(SkIRect const&, SkRasterClip const&, SkBlitter*) (SkScan.cpp:73) [12:47:12.735354] ==28584== by 0x55F94D: SkDraw::drawBitmap(SkBitmap const&, SkMatrix const&, SkPaint const&) const (SkDraw.cpp:1260) [12:47:12.735376] ==28584== by 0x537352: SkBitmapDevice::drawBitmap(SkDraw const&, SkBitmap const&, SkMatrix const&, SkPaint const&) (SkBitmapDevice.cpp:309) [12:47:12.735403] ==28584== by 0x554C6F: SkCanvas::internalDrawBitmap(SkBitmap const&, SkMatrix const&, SkPaint const*) (SkCanvas.cpp:1181) [12:47:12.735428] ==28584== by 0x554E76: SkCanvas::drawBitmap(SkBitmap const&, float, float, SkPaint const*) (SkCanvas.cpp:2044) http://108.170.220.120:10117/builders/Test-Ubuntu12-ShuttleA-GTX550Ti-x86_64-...
Message was sent while issue was closed.
On 2014/06/17 23:10:46, Stephen White wrote: > On 2014/06/17 23:04:22, mtklein wrote: > > On 2014/06/17 22:43:35, Stephen White wrote: > > > (As a side note, it would be really nice if these functions used intrinsics > > > instead. I do realize that it's sometimes not possible to match the > > performance > > > of hand-tuned asm with intrinsics, but in those cases where it is, it gives > us > > > portability to x86/x86_64/win/win64 in one shot.) > > > > This is one of those times. The key here is ptest, which is more flexible for > > three-way branching in assembly than intrinsics. > > Cool, thanks for the explanation. > > BTW, the valgrind bot seems to corroborate my findings: > > [12:47:12.735195] [176/304] 9293ms DeviceLooper > [12:47:12.735222] [177/304] 16ms Deque==28584== Conditional jump or move > depends on uninitialised value(s) > [12:47:12.735246] ==28584== at 0x6A9A15: S32A_Opaque_BlitRow32_SSE4_asm > (SkBlitRow_opts_SSE4_x64_asm.S:347) > [12:47:12.735283] ==28584== by 0x5E28B6: Sprite_D32_S32::blitRect(int, int, > int, int) (SkSpriteBlitter_ARGB32.cpp:48) > [12:47:12.735306] ==28584== by 0x593E30: SkScan::FillIRect(SkIRect const&, > SkRegion const*, SkBlitter*) (SkScan.cpp:15) > [12:47:12.735331] ==28584== by 0x59403E: SkScan::FillIRect(SkIRect const&, > SkRasterClip const&, SkBlitter*) (SkScan.cpp:73) > [12:47:12.735354] ==28584== by 0x55F94D: SkDraw::drawBitmap(SkBitmap const&, > SkMatrix const&, SkPaint const&) const (SkDraw.cpp:1260) > [12:47:12.735376] ==28584== by 0x537352: SkBitmapDevice::drawBitmap(SkDraw > const&, SkBitmap const&, SkMatrix const&, SkPaint const&) > (SkBitmapDevice.cpp:309) > [12:47:12.735403] ==28584== by 0x554C6F: > SkCanvas::internalDrawBitmap(SkBitmap const&, SkMatrix const&, SkPaint const*) > (SkCanvas.cpp:1181) > [12:47:12.735428] ==28584== by 0x554E76: SkCanvas::drawBitmap(SkBitmap > const&, float, float, SkPaint const*) (SkCanvas.cpp:2044) > > http://108.170.220.120:10117/builders/Test-Ubuntu12-ShuttleA-GTX550Ti-x86_64-... Yup. Let's disable for now. https://codereview.chromium.org/338353009/
Message was sent while issue was closed.
A revert of this CL has been created in https://codereview.chromium.org/336413007/ by mtklein@google.com. The reason for reverting is: Valgrind bot's seeing this code use uninitialized memory, and it's somehow blocking our roll into Chrome too: > ld: warning: could not create compact unwind for S32A_Opaque_BlitRow32_SSE4_asm: > stack subq instruction is too different from dwarf stack size > [10339/10982 | 3247.792] PACKAGE FRAMEWORK "Chromium Framework.framework", > POSTBUILDS > FAILED: ./gyp-mac-tool package-framework "Chromium Framework.framework" A && > (export > BUILT_PRODUCTS_DIR=/Volumes/data/b/build/slave/mac_gpu/build/src/out/Release; > export CONFIGURATION=Release; export CONTENTS_FOLDER_PATH="Chromium > Framework.framework/Versions/A"; export > DYLIB_INSTALL_NAME_BASE=@executable_path/../Versions/37.0.2056.0; export > EXECUTABLE_NAME="Chromium Framework"; export EXECUTABLE_PATH="Chromium > Framework.framework/Versions/A/Chromium Framework"; export > FULL_PRODUCT_NAME="Chromium Framework.framework"; export > INFOPLIST_PATH="Chromium Framework.framework/Versions/A/Resources/Info.plist"; > export LD_DYLIB_INSTALL_NAME="@executable_path/../Versions/37.0.2056.0/Chromium > Framework.framework/Chromium Framework"; export MACH_O_TYPE=mh_dylib; export > PRODUCT_NAME="Chromium Framework"; export > PRODUCT_TYPE=com.apple.product-type.framework; export > SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.6.sdk; > export > SRCROOT=/Volumes/data/b/build/slave/mac_gpu/build/src/out/Release/../../chrome; > export SOURCE_ROOT="${SRCROOT}"; export > TARGET_BUILD_DIR=/Volumes/data/b/build/slave/mac_gpu/build/src/out/Release; > export TEMP_DIR="${TMPDIR}"; export UNLOCALIZED_RESOURCES_FOLDER_PATH="Chromium > Framework.framework/Versions/A/Resources"; export WRAPPER_NAME="Chromium > Framework.framework"; (cd ../../chrome && ../build/mac/tweak_info_plist.py > "--breakpad=1" "--breakpad_uploads=0" "--keystone=0" "--scm=1" > "--branding=Chromium" && ln -fns Versions/Current/Libraries > "${BUILT_PRODUCTS_DIR}/${WRAPPER_NAME}/Libraries" && > tools/build/mac/verify_order _ChromeMain > "${BUILT_PRODUCTS_DIR}/${EXECUTABLE_PATH}"); G=$?; ((exit $G) || rm -rf > 'Chromium Framework.framework') && exit $G) && touch "Chromium > Framework.framework" > tools/build/mac/verify_order: unordered symbols in > /Volumes/data/b/build/slave/mac_gpu/build/src/out/Release/Chromium > Framework.framework/Versions/A/Chromium Framework: > S32A_Opaque_BlitRow32_SSE4_asm > _S32A_Opaque_BlitRow32_SSE4_asm > ninja: build stopped: subcommand failed. .
Remaining valgrind errors comes from the 'DeferredCanvas_CPU' test. They seem to come from the 'TestDeferredCanvasBitmapShaderNoLeak' subtest. The test sends in uninitialized bitmaps to the blitter. Verified this by adding the same pixel alpha check in the existing SSE2 versions of the blitter, and getting the same errors. Should I start investigating and modifying the test, or do you have someone better acquainted with the code? Git blame showed a lot of different authors.
Thanks for fixing the test. Everything LGTM. We're still going to have the problem of tools/build/mac/verify_order failing when we roll into Chromium, no? Any thoughts on that?
On 2014/06/27 13:53:44, mtklein wrote: > Thanks for fixing the test. Everything LGTM. > > We're still going to have the problem of tools/build/mac/verify_order failing > when we roll into Chromium, no? Any thoughts on that? I don't think so. I've split the labels so there should never be more than one. The problem was that previously there were two labels with the same offset, right?
On 2014/06/27 14:46:11, henrik.smiding wrote: > On 2014/06/27 13:53:44, mtklein wrote: > > Thanks for fixing the test. Everything LGTM. > > > > We're still going to have the problem of tools/build/mac/verify_order failing > > when we roll into Chromium, no? Any thoughts on that? > > I don't think so. I've split the labels so there should never be more than one. > The problem was that previously there were two labels with the same offset, > right? I'm still not clear what the heck they're checking there. I think we can give this another try. If it breaks the roll, I'll disable it by commenting out the #define SK_ATT_ASM_SUPPORTED so we can keep it landed but off.
The CQ bit was checked by henrik.smiding@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/henrik.smiding@intel.com/289473009/43...
Message was sent while issue was closed.
Change committed as 3bb195ef0d9691384027d7b61b0b8ef8379aaf5d |