skia - 2D graphics library

	Commit message (Collapse)	Author	Age
*	Port S32A_opaque blit row to SkOpts.	mtklein	2016-03-23
\| \| \| \| \| \| \| \| \| \|	This should be a pixel-for-pixel (i.e. bug-for-bug) port. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1820313002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1820313002
*	Revert of Clean up SSSE3 and SSE4 stubs. (patchset #1 id:1 of ↵	mtklein	2016-03-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://codereview.chromium.org/1810183003/ ) Reason for revert: I've just had a better idea about how to fix this. Let's revert while I work on it. Original issue's description: > Clean up SSSE3 and SSE4 stubs. > > We added these stubs to work around OpenBSD's old compiler, which had > support for SSE2 but not SSSE3 or SSE4. > > We now already have other unstubbed files that require SSSE3 and SSE4 compiler > support. All the compilers we support have SSSE3 and SSE4 support, and all the > way up to at least AVX2. > > (Requiring C++11 has had some nice ripple effects...) > > > And, <immintrin.h> is already auto-included for these files, so no need for smmintrin or tmmintrin. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1810183003 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/2b1b40e11afc41452b4d2f74cdebb1b6e6f7cc96 TBR=djsollen@google.com,mtklein@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG=skia: Review URL: https://codereview.chromium.org/1819223003
*	Clean up SSSE3 and SSE4 stubs.	mtklein	2016-03-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We added these stubs to work around OpenBSD's old compiler, which had support for SSE2 but not SSSE3 or SSE4. We now already have other unstubbed files that require SSSE3 and SSE4 compiler support. All the compilers we support have SSSE3 and SSE4 support, and all the way up to at least AVX2. (Requiring C++11 has had some nice ripple effects...) And, <immintrin.h> is already auto-included for these files, so no need for smmintrin or tmmintrin. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1810183003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1810183003
*	Add SkMSAN.h	mtklein	2016-02-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This lets us tag up pieces of code as requiring initialized inputs. Almost all code requires initialized inputs, of course. This is for code that works correctly with uninitialized data but triggers false positive warnings in MSAN. E.g., imagine MSAN's found use of uninitialized data in this max function: static uint8_t max(uint8_t x, uint8_t y) { return x > y ? x : y; } There's no bug in here... if there's uninitialized data being branched upon here for the first time, it's sure not max's fault, it's its caller's fault. So we might do this: static uint8_t max(uint8_t x, uint8_t y) { // This function uses branching, so if MSAN finds a problem here, // we can assert x and y are initialized. This will remind us the // problem somewhere in the caller or above, not here. sk_msan_assert_initialized(&x, &x+1); sk_masn_assert_initialized(&y, &y+1); return x > y ? x : y; } By allowing code to assert its inputs must be initialized, we can make the blame for use of uninitialized data more clear. (Sometimes we have another option, to rewrite the code to avoid branching: static uint8_t max(uint8_t x, uint8_t y) { // This function is branchfree, so MSAN won't complain here. // No real need to assert anything as requiring initialization. int diff = x - y; int negative = diff >> (sizeof(int)*8 - 1); return (y & negative) \| (x & ~negative); } These approaches to fixing MSAN false positives are orthogonal.) BUG=chromium:574114 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1658913005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1658913005
*	add/fix copyrights	reed	2015-06-26
\| \| \| \| \| \| \|	BUG=skia: TBR= Review URL: https://codereview.chromium.org/1212393002
*	Replace SSE optimization of Color32A_D565	henrik.smiding	2015-03-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adds an SSE2 version of the Color32A_D565 function, to replace the existing SSE4 version. Also does some minor cleanup. Performance improvement in the following Skia benchmarks. Measured on Atom Silvermont: Xfermode_SrcOver - x3 luma_colorfilter_large - x4.6 luma_colorfilter_small - x2 tablebench - ~15% chart_bw - ~10% Measured on Corei7 Haswell: luma_colorfilter_large running SSE2 - x2 luma_colorfilter_large running SSE4 - x2.3 Also improves performance in WPS Office application and 2D subtest of 0xbenchmark on Android. Signed-off-by: Henrik Smiding <henrik.smiding@intel.com> Review URL: https://codereview.chromium.org/923523002
*	Add SSE optimization of Color32A_D565	henrik.smiding	2015-02-10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adds an SSE4.1 version of the Color32A_D565 function. Performance improvement in the following benchmarks: Xfermode_SrcOver - ~100% luma_colorfilter_large - ~150% luma_colorfilter_small - ~60% tablebench - ~10% chart_bw - ~10% (Measured on a Atom Silvermont core) Signed-off-by: Henrik Smiding <henrik.smiding@intel.com> Review URL: https://codereview.chromium.org/892623002
*	Revert of Revert of SSE4 opaque blend using intrinsics instead of assembly. ↵	stephana	2015-02-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(patchset #1 id:1 of https://codereview.chromium.org/873553003/) Reason for revert: Reverted the wrong CL. Original issue's description: > Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/) > > Reason for revert: > This causes a bug on the 'hittestpath' GM on MacMini 4,1 > > See: > > https://gold.skia.org/#/triage/hittestpath?head=0 > > for details. > > Original issue's description: > > SSE4 opaque blend using intrinsics instead of assembly. > > > > Since we had such a hard time with the assembly versions of this blit (to the > > point that we have them completely disabled everywhere), I thought I'd take > > a shot at writing a version of the blit using intrinsics. > > > > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) > > to skip the blend when the 16 src pixels we consider each loop are all opaque > > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract > > all those alphas. > > > > It's worth looking to see if we can backport this type of logic to SSE2 using > > _mm_movemask_epi8, or up to 32 pixels at a time using AVX. > > > > My local performance testing doesn't show this to be an unambiguous win > > (there are probably microbenchmarks and SKPs where we'd be better off just > > powering through the blend rather than looking at alphas), but the potential > > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) > > > > DM says it draws pixel perfect compare to the old code. > > > > Microbenchmarks: > > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x > > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x > > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x > > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x > > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x > > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x > > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x > > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x > > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x > > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x > > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x > > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x > > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x > > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x > > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x > > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x > > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x > > > > Running over our ~70 SKP web page captures, this looks like we spend 0.7x > > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should > > be a decent predictor of real-world impact. > > > > BUG=chromium:399842 > > > > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 > > > > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot > > > > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493 > > TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org > NOPRESUBMIT=true > NOTREECHECKS=true > NOTRY=true > BUG=chromium:399842 > > Committed: https://skia.googlesource.com/skia/+/4988891a1173cd405bf1c1dd3a3668c451f45e4c TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=chromium:399842 Review URL: https://codereview.chromium.org/894083002
*	Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset ↵	stephana	2015-02-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#16 id:300001 of https://codereview.chromium.org/874863002/) Reason for revert: This causes a bug on the 'hittestpath' GM on MacMini 4,1 See: https://gold.skia.org/#/triage/hittestpath?head=0 for details. Original issue's description: > SSE4 opaque blend using intrinsics instead of assembly. > > Since we had such a hard time with the assembly versions of this blit (to the > point that we have them completely disabled everywhere), I thought I'd take > a shot at writing a version of the blit using intrinsics. > > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) > to skip the blend when the 16 src pixels we consider each loop are all opaque > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract > all those alphas. > > It's worth looking to see if we can backport this type of logic to SSE2 using > _mm_movemask_epi8, or up to 32 pixels at a time using AVX. > > My local performance testing doesn't show this to be an unambiguous win > (there are probably microbenchmarks and SKPs where we'd be better off just > powering through the blend rather than looking at alphas), but the potential > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) > > DM says it draws pixel perfect compare to the old code. > > Microbenchmarks: > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x > > Running over our ~70 SKP web page captures, this looks like we spend 0.7x > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should > be a decent predictor of real-world impact. > > BUG=chromium:399842 > > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 > > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot > > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493 TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=chromium:399842 Review URL: https://codereview.chromium.org/873553003
*	add a paranoid assert	mtklein	2015-01-28
\| \| \| \| \| \| \| \|	NOTREECHECKS=true BUG=chromium:399842 Review URL: https://codereview.chromium.org/881253003
*	SSE4 opaque blend using intrinsics instead of assembly.	mtklein	2015-01-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since we had such a hard time with the assembly versions of this blit (to the point that we have them completely disabled everywhere), I thought I'd take a shot at writing a version of the blit using intrinsics. The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) to skip the blend when the 16 src pixels we consider each loop are all opaque or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract all those alphas. It's worth looking to see if we can backport this type of logic to SSE2 using _mm_movemask_epi8, or up to 32 pixels at a time using AVX. My local performance testing doesn't show this to be an unambiguous win (there are probably microbenchmarks and SKPs where we'd be better off just powering through the blend rather than looking at alphas), but the potential does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) DM says it draws pixel perfect compare to the old code. Microbenchmarks: bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x bitmap_RGBA_8888 5.73us -> 5.58us 0.97x bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x Running over our ~70 SKP web page captures, this looks like we spend 0.7x the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should be a decent predictor of real-world impact. BUG=chromium:399842 Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot Review URL: https://codereview.chromium.org/874863002
*	Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset ↵	bungeman	2015-01-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#14 id:260001 of https://codereview.chromium.org/874863002/) Reason for revert: This kills Mac 10.6 bots. FAILED: c++ -MMD -MF obj/src/opts/opts_sse4.SkBlitRow_opts_SSE4.o.d -DSK_INTERNAL -DSK_GAMMA_SRGB -DSK_GAMMA_APPLY_TO_A8 -DSK_SCALAR_TO_FLOAT_EXCLUDED -DSK_ALLOW_STATIC_GLOBAL_INITIALIZERS=1 -DSK_SUPPORT_GPU=1 -DSK_SUPPORT_OPENCL=0 -DSK_FORCE_DISTANCE_FIELD_TEXT=0 -DSK_BUILD_FOR_MAC -DSK_CRASH_HANDLER -DSK_DEVELOPER=1 -I../../src/core -I../../src/utils -I../../include/c -I../../include/config -I../../include/core -I../../include/pathops -I../../include/pipe -I../../include/utils/mac -I../../include/effects -O0 -gdwarf-2 -mmacosx-version-min=10.6 -arch x86_64 -mssse3 -Wall -Wextra -Winit-self -Wpointer-arith -Wsign-compare -Wno-unused-parameter -Wno-invalid-offsetof -msse4.1 -c ../../src/opts/SkBlitRow_opts_SSE4.cpp -o obj/src/opts/opts_sse4.SkBlitRow_opts_SSE4.o ../../src/opts/SkBlitRow_opts_SSE4.cpp:15:27: warning: x86intrin.h: No such file or directory ../../src/opts/SkBlitRow_opts_SSE4.cpp: In function 'void S32A_Opaque_BlitRow32_SSE4(SkPMColor, const SkPMColor, int, U8CPU)': ../../src/opts/SkBlitRow_opts_SSE4.cpp:40: error: '_mm_testz_si128' was not declared in this scope ../../src/opts/SkBlitRow_opts_SSE4.cpp:45: error: '_mm_testc_si128' was not declared in this scope Original issue's description: > SSE4 opaque blend using intrinsics instead of assembly. > > Since we had such a hard time with the assembly versions of this blit (to the > point that we have them completely disabled everywhere), I thought I'd take > a shot at writing a version of the blit using intrinsics. > > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) > to skip the blend when the 16 src pixels we consider each loop are all opaque > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract > all those alphas. > > It's worth looking to see if we can backport this type of logic to SSE2 using > _mm_movemask_epi8, or up to 32 pixels at a time using AVX. > > My local performance testing doesn't show this to be an unambiguous win > (there are probably microbenchmarks and SKPs where we'd be better off just > powering through the blend rather than looking at alphas), but the potential > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) > > DM says it draws pixel perfect compare to the old code. > > Microbenchmarks: > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x > > Running over our ~70 SKP web page captures, this looks like we spend 0.7x > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should > be a decent predictor of real-world impact. > > BUG=chromium:399842 > > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785 TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=chromium:399842 Review URL: https://codereview.chromium.org/874033004
*	SSE4 opaque blend using intrinsics instead of assembly.	mtklein	2015-01-26
	Since we had such a hard time with the assembly versions of this blit (to the point that we have them completely disabled everywhere), I thought I'd take a shot at writing a version of the blit using intrinsics. The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*) to skip the blend when the 16 src pixels we consider each loop are all opaque or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract all those alphas. It's worth looking to see if we can backport this type of logic to SSE2 using _mm_movemask_epi8, or up to 32 pixels at a time using AVX. My local performance testing doesn't show this to be an unambiguous win (there are probably microbenchmarks and SKPs where we'd be better off just powering through the blend rather than looking at alphas), but the potential does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.) DM says it draws pixel perfect compare to the old code. Microbenchmarks: bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x bitmap_RGBA_8888 5.73us -> 5.58us 0.97x bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x Running over our ~70 SKP web page captures, this looks like we spend 0.7x the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should be a decent predictor of real-world impact. BUG=chromium:399842 Review URL: https://codereview.chromium.org/874863002