| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds an SSE2 version of the Color32A_D565 function, to replace
the existing SSE4 version. Also does some minor cleanup.
Performance improvement in the following Skia benchmarks.
Measured on Atom Silvermont:
Xfermode_SrcOver - x3
luma_colorfilter_large - x4.6
luma_colorfilter_small - x2
tablebench - ~15%
chart_bw - ~10%
Measured on Corei7 Haswell:
luma_colorfilter_large running SSE2 - x2
luma_colorfilter_large running SSE4 - x2.3
Also improves performance in WPS Office application and 2D subtest of 0xbenchmark on Android.
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Review URL: https://codereview.chromium.org/923523002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds an SSE4.1 version of the Color32A_D565 function.
Performance improvement in the following benchmarks:
Xfermode_SrcOver - ~100%
luma_colorfilter_large - ~150%
luma_colorfilter_small - ~60%
tablebench - ~10%
chart_bw - ~10%
(Measured on a Atom Silvermont core)
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Review URL: https://codereview.chromium.org/892623002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(patchset #1 id:1 of https://codereview.chromium.org/873553003/)
Reason for revert:
Reverted the wrong CL.
Original issue's description:
> Revert of SSE4 opaque blend using intrinsics instead of assembly. (patchset #16 id:300001 of https://codereview.chromium.org/874863002/)
>
> Reason for revert:
> This causes a bug on the 'hittestpath' GM on MacMini 4,1
>
> See:
>
> https://gold.skia.org/#/triage/hittestpath?head=0
>
> for details.
>
> Original issue's description:
> > SSE4 opaque blend using intrinsics instead of assembly.
> >
> > Since we had such a hard time with the assembly versions of this blit (to the
> > point that we have them completely disabled everywhere), I thought I'd take
> > a shot at writing a version of the blit using intrinsics.
> >
> > The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> > to skip the blend when the 16 src pixels we consider each loop are all opaque
> > or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> > all those alphas.
> >
> > It's worth looking to see if we can backport this type of logic to SSE2 using
> > _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
> >
> > My local performance testing doesn't show this to be an unambiguous win
> > (there are probably microbenchmarks and SKPs where we'd be better off just
> > powering through the blend rather than looking at alphas), but the potential
> > does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
> >
> > DM says it draws pixel perfect compare to the old code.
> >
> > Microbenchmarks:
> > bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
> > bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
> > bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
> > bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
> > bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
> > bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
> > bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
> > bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
> > bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
> > bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
> > bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
> > bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
> > bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
> > bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
> > bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
> > bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
> > bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
> >
> > Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> > the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> > be a decent predictor of real-world impact.
> >
> > BUG=chromium:399842
> >
> > Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
> >
> > CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
> >
> > Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493
>
> TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
> NOPRESUBMIT=true
> NOTREECHECKS=true
> NOTRY=true
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/4988891a1173cd405bf1c1dd3a3668c451f45e4c
TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842
Review URL: https://codereview.chromium.org/894083002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
#16 id:300001 of https://codereview.chromium.org/874863002/)
Reason for revert:
This causes a bug on the 'hittestpath' GM on MacMini 4,1
See:
https://gold.skia.org/#/triage/hittestpath?head=0
for details.
Original issue's description:
> SSE4 opaque blend using intrinsics instead of assembly.
>
> Since we had such a hard time with the assembly versions of this blit (to the
> point that we have them completely disabled everywhere), I thought I'd take
> a shot at writing a version of the blit using intrinsics.
>
> The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> to skip the blend when the 16 src pixels we consider each loop are all opaque
> or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> all those alphas.
>
> It's worth looking to see if we can backport this type of logic to SSE2 using
> _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
>
> My local performance testing doesn't show this to be an unambiguous win
> (there are probably microbenchmarks and SKPs where we'd be better off just
> powering through the blend rather than looking at alphas), but the potential
> does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
>
> DM says it draws pixel perfect compare to the old code.
>
> Microbenchmarks:
> bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
> bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
> bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
> bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
> bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
> bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
> bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
> bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
> bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
> bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
> bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
> bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
> bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
> bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
>
> Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> be a decent predictor of real-world impact.
>
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
>
> CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/6dbfb21a6c88af6d94e8c823c3ad559f1a41b493
TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842
Review URL: https://codereview.chromium.org/873553003
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.
The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.
It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or up to 32 pixels at a time using AVX.
My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
DM says it draws pixel perfect compare to the old code.
Microbenchmarks:
bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.
BUG=chromium:399842
Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
CQ_EXTRA_TRYBOTS=client.skia:Test-Mac10.6-MacMini4.1-GeForce320M-x86_64-Release-Trybot
Review URL: https://codereview.chromium.org/874863002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
#14 id:260001 of https://codereview.chromium.org/874863002/)
Reason for revert:
This kills Mac 10.6 bots.
FAILED: c++ -MMD -MF obj/src/opts/opts_sse4.SkBlitRow_opts_SSE4.o.d -DSK_INTERNAL -DSK_GAMMA_SRGB -DSK_GAMMA_APPLY_TO_A8 -DSK_SCALAR_TO_FLOAT_EXCLUDED -DSK_ALLOW_STATIC_GLOBAL_INITIALIZERS=1 -DSK_SUPPORT_GPU=1 -DSK_SUPPORT_OPENCL=0 -DSK_FORCE_DISTANCE_FIELD_TEXT=0 -DSK_BUILD_FOR_MAC -DSK_CRASH_HANDLER -DSK_DEVELOPER=1 -I../../src/core -I../../src/utils -I../../include/c -I../../include/config -I../../include/core -I../../include/pathops -I../../include/pipe -I../../include/utils/mac -I../../include/effects -O0 -gdwarf-2 -mmacosx-version-min=10.6 -arch x86_64 -mssse3 -Wall -Wextra -Winit-self -Wpointer-arith -Wsign-compare -Wno-unused-parameter -Wno-invalid-offsetof -msse4.1 -c ../../src/opts/SkBlitRow_opts_SSE4.cpp -o obj/src/opts/opts_sse4.SkBlitRow_opts_SSE4.o
../../src/opts/SkBlitRow_opts_SSE4.cpp:15:27: warning: x86intrin.h: No such file or directory
../../src/opts/SkBlitRow_opts_SSE4.cpp: In function 'void S32A_Opaque_BlitRow32_SSE4(SkPMColor*, const SkPMColor*, int, U8CPU)':
../../src/opts/SkBlitRow_opts_SSE4.cpp:40: error: '_mm_testz_si128' was not declared in this scope
../../src/opts/SkBlitRow_opts_SSE4.cpp:45: error: '_mm_testc_si128' was not declared in this scope
Original issue's description:
> SSE4 opaque blend using intrinsics instead of assembly.
>
> Since we had such a hard time with the assembly versions of this blit (to the
> point that we have them completely disabled everywhere), I thought I'd take
> a shot at writing a version of the blit using intrinsics.
>
> The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
> to skip the blend when the 16 src pixels we consider each loop are all opaque
> or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
> all those alphas.
>
> It's worth looking to see if we can backport this type of logic to SSE2 using
> _mm_movemask_epi8, or up to 32 pixels at a time using AVX.
>
> My local performance testing doesn't show this to be an unambiguous win
> (there are probably microbenchmarks and SKPs where we'd be better off just
> powering through the blend rather than looking at alphas), but the potential
> does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
>
> DM says it draws pixel perfect compare to the old code.
>
> Microbenchmarks:
> bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
> bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
> bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
> bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
> bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
> bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
> bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
> bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
> bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
> bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
> bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
> bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
> bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
> bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
> bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
>
> Running over our ~70 SKP web page captures, this looks like we spend 0.7x
> the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
> be a decent predictor of real-world impact.
>
> BUG=chromium:399842
>
> Committed: https://skia.googlesource.com/skia/+/04bc91b972417038fecfa87c484771eac2b9b785
TBR=henrik.smiding@intel.com,mtklein@google.com,herb@google.com,reed@google.com,thakis@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:399842
Review URL: https://codereview.chromium.org/874033004
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we had such a hard time with the assembly versions of this blit (to the
point that we have them completely disabled everywhere), I thought I'd take
a shot at writing a version of the blit using intrinsics.
The key feature of SSE4 we're exploiting is that we can use ptest (_mm_test*)
to skip the blend when the 16 src pixels we consider each loop are all opaque
or all transparent. _mm_shuffle_epi8 from SSSE3 also lends a hand to extract
all those alphas.
It's worth looking to see if we can backport this type of logic to SSE2 using
_mm_movemask_epi8, or up to 32 pixels at a time using AVX.
My local performance testing doesn't show this to be an unambiguous win
(there are probably microbenchmarks and SKPs where we'd be better off just
powering through the blend rather than looking at alphas), but the potential
does seem tantalizing enough to let skiaperf vet it on the bots. (< 1.0x is a win.)
DM says it draws pixel perfect compare to the old code.
Microbenchmarks:
bitmap_RGBA_8888_A_source_stripes_two 14us -> 14.4us 1.03x
bitmap_RGBA_8888_A_source_stripes_three 14.3us -> 14.5us 1.01x
bitmap_RGBA_8888_scale_bilerp 61.9us -> 62.2us 1.01x
bitmap_RGBA_8888_update_volatile_scale_rotate_bilerp 102us -> 101us 0.99x
bitmap_RGBA_8888_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_scale 18.4us -> 18.2us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bicubic 71us -> 70us 0.99x
bitmap_RGBA_8888_update_scale_rotate_bilerp 103us -> 101us 0.99x
bitmap_RGBA_8888_A_scale_rotate_bilerp 112us -> 109us 0.98x
bitmap_RGBA_8888_update_volatile 5.72us -> 5.58us 0.98x
bitmap_RGBA_8888 5.73us -> 5.58us 0.97x
bitmap_RGBA_8888_update 5.78us -> 5.6us 0.97x
bitmap_RGBA_8888_A_scale_bilerp 70.7us -> 68us 0.96x
bitmap_RGBA_8888_A_scale_bicubic 23.7us -> 21.8us 0.92x
bitmap_RGBA_8888_A 13.9us -> 10.9us 0.78x
bitmap_RGBA_8888_A_source_opaque 14us -> 6.29us 0.45x
bitmap_RGBA_8888_A_source_transparent 14us -> 3.65us 0.26x
Running over our ~70 SKP web page captures, this looks like we spend 0.7x
the time in S32A_Opaque_BlitRow compared to the SSE2 version, which should
be a decent predictor of real-world impact.
BUG=chromium:399842
Review URL: https://codereview.chromium.org/874863002
|
|
|
|
|
|
|
|
|
|
|
| |
This code sometimes generates a build warning that bothers Chrome.
BUG=399842,skia:2906
R=reed@google.com, mtklein@google.com
Author: mtklein@chromium.org
Review URL: https://codereview.chromium.org/538463003
|
|
|
|
|
|
|
|
|
| |
BUG=391016
R=tomhudson@chromium.org, mtklein@google.com, rnk@chromium.org, thakis@chromium.org
Author: mtklein@chromium.org
Review URL: https://codereview.chromium.org/363983004
|
|
|
|
|
|
|
|
|
|
| |
MemorySanitizer is an unitialized memory use detector which is used in
Chromium, and does not presently support assembly code.
BUG=chromium:344505, chromium:373739
R=mtklein@google.com
Review URL: https://codereview.chromium.org/367973005
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Marks the symbols in the S32A_Opaque_BlitRow32_SSE4 files as hidden,
so Chromium can build.
Also enables the optimizations.
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
R=mtklein@google.com, joakim.landberg@intel.com
Author: henrik.smiding@intel.com
Review URL: https://codereview.chromium.org/368573002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reason for revert:
Mac Chrome builders still failing.
Original issue's description:
> Re-enable SSE4.
>
> I will roll this into Chrome with https://codereview.chromium.org/332393003.
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/a75b0fadbdec4214afec6dd727fd224d34ed164f
R=reed@google.com, mtklein@chromium.org
TBR=mtklein@chromium.org, reed@google.com
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Author: mtklein@google.com
Review URL: https://codereview.chromium.org/337093004
|
|
|
|
|
|
|
|
|
|
|
| |
I will roll this into Chrome with https://codereview.chromium.org/332393003.
BUG=skia:
R=reed@google.com, mtklein@google.com
Author: mtklein@chromium.org
Review URL: https://codereview.chromium.org/357593003
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Chrome canary failing to link chrome:
http://108.170.220.120:10115/builders/Canary-Chrome-Ubuntu13.10-Ninja-x86_64-ToT/builds/1009/steps/BuildChrome/logs/stdio
BUG=skia:
NOTRY=true
R=mtklein@google.com, rmistry@google.com
Author: mtklein@chromium.org
Review URL: https://codereview.chromium.org/361493002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
instruction set. Special case for when alpha is zero or opaque.
Performance increase of 10%-400% compared to the existing SSE2
optimization (measured on Silvermont architecture).
Noticeable in ~25 different skia bench subtests, especially in
bitmap_8888_*, repeatTile_*, and morph_*.
bitmap_8888_A - 100% faster
bitmap_8888_A_source_transparent - 250% faster
bitmap_8888_A_source_opaque - 25% faster
bitmap_8888_A_scale_bicubic - 75% faster
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Committed: https://skia.googlesource.com/skia/+/e2527b147679b0c43019fae7d59cc3777d2d097e
Committed: https://skia.googlesource.com/skia/+/b5c281e1e06af3be804309877de1dac6145686b9
R=reed@google.com, mtklein@google.com, tomhudson@google.com, djsollen@google.com, joakim.landberg@intel.com
Author: henrik.smiding@intel.com
Review URL: https://codereview.chromium.org/289473009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(https://codereview.chromium.org/289473009/)
NOTREECHECKS=true
NOTRY=true
Reason for revert:
Valgrind bot's seeing this code use uninitialized memory, and it's somehow blocking our roll into Chrome too:
> ld: warning: could not create compact unwind for
S32A_Opaque_BlitRow32_SSE4_asm:
> stack subq instruction is too different from dwarf stack size
> [10339/10982 | 3247.792] PACKAGE FRAMEWORK "Chromium Framework.framework",
> POSTBUILDS
> FAILED: ./gyp-mac-tool package-framework "Chromium Framework.framework" A &&
> (export
> BUILT_PRODUCTS_DIR=/Volumes/data/b/build/slave/mac_gpu/build/src/out/Release;
> export CONFIGURATION=Release; export CONTENTS_FOLDER_PATH="Chromium
> Framework.framework/Versions/A"; export
> DYLIB_INSTALL_NAME_BASE=@executable_path/../Versions/37.0.2056.0; export
> EXECUTABLE_NAME="Chromium Framework"; export EXECUTABLE_PATH="Chromium
> Framework.framework/Versions/A/Chromium Framework"; export
> FULL_PRODUCT_NAME="Chromium Framework.framework"; export
> INFOPLIST_PATH="Chromium Framework.framework/Versions/A/Resources/Info.plist";
> export
LD_DYLIB_INSTALL_NAME="@executable_path/../Versions/37.0.2056.0/Chromium
> Framework.framework/Chromium Framework"; export MACH_O_TYPE=mh_dylib; export
> PRODUCT_NAME="Chromium Framework"; export
> PRODUCT_TYPE=com.apple.product-type.framework; export
>
SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.6.sdk;
> export
>
SRCROOT=/Volumes/data/b/build/slave/mac_gpu/build/src/out/Release/../../chrome;
> export SOURCE_ROOT="${SRCROOT}"; export
> TARGET_BUILD_DIR=/Volumes/data/b/build/slave/mac_gpu/build/src/out/Release;
> export TEMP_DIR="${TMPDIR}"; export
UNLOCALIZED_RESOURCES_FOLDER_PATH="Chromium
> Framework.framework/Versions/A/Resources"; export WRAPPER_NAME="Chromium
> Framework.framework"; (cd ../../chrome && ../build/mac/tweak_info_plist.py
> "--breakpad=1" "--breakpad_uploads=0" "--keystone=0" "--scm=1"
> "--branding=Chromium" && ln -fns Versions/Current/Libraries
> "${BUILT_PRODUCTS_DIR}/${WRAPPER_NAME}/Libraries" &&
> tools/build/mac/verify_order _ChromeMain
> "${BUILT_PRODUCTS_DIR}/${EXECUTABLE_PATH}"); G=$?; ((exit $G) || rm -rf
> 'Chromium Framework.framework') && exit $G) && touch "Chromium
> Framework.framework"
> tools/build/mac/verify_order: unordered symbols in
> /Volumes/data/b/build/slave/mac_gpu/build/src/out/Release/Chromium
> Framework.framework/Versions/A/Chromium Framework:
> S32A_Opaque_BlitRow32_SSE4_asm
> _S32A_Opaque_BlitRow32_SSE4_asm
> ninja: build stopped: subcommand failed.
Original issue's description:
> Add SSE4 optimization of S32A_Opaque_Blitrow
>
> Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
> instruction set. Special case for when alpha is zero or opaque.
>
> Performance increase of 10%-400% compared to the existing SSE2
> optimization (measured on Silvermont architecture).
> Noticeable in ~25 different skia bench subtests, especially in
> bitmap_8888_*, repeatTile_*, and morph_*.
>
> bitmap_8888_A - 100% faster
> bitmap_8888_A_source_transparent - 250% faster
> bitmap_8888_A_source_opaque - 25% faster
> bitmap_8888_A_scale_bicubic - 75% faster
>
> Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
>
> Committed: https://skia.googlesource.com/skia/+/e2527b147679b0c43019fae7d59cc3777d2d097e
>
> Committed: https://skia.googlesource.com/skia/+/b5c281e1e06af3be804309877de1dac6145686b9
R=reed@google.com, tomhudson@google.com, djsollen@google.com, joakim.landberg@intel.com, henrik.smiding@intel.com, mtklein@chromium.org
Author: mtklein@google.com
Review URL: https://codereview.chromium.org/336413007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
instruction set. Special case for when alpha is zero or opaque.
Performance increase of 10%-400% compared to the existing SSE2
optimization (measured on Silvermont architecture).
Noticeable in ~25 different skia bench subtests, especially in
bitmap_8888_*, repeatTile_*, and morph_*.
bitmap_8888_A - 100% faster
bitmap_8888_A_source_transparent - 250% faster
bitmap_8888_A_source_opaque - 25% faster
bitmap_8888_A_scale_bicubic - 75% faster
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Committed: https://skia.googlesource.com/skia/+/e2527b147679b0c43019fae7d59cc3777d2d097e
R=reed@google.com, mtklein@google.com, tomhudson@google.com, djsollen@google.com, joakim.landberg@intel.com
Author: henrik.smiding@intel.com
Review URL: https://codereview.chromium.org/289473009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(https://codereview.chromium.org/289473009/)
Reason for revert:
Buildbot failures on Mac 10.6 and Mac 10.7.
R=reed@google.com, mtklein@google.com, tomhudson@google.com, djsollen@google.com, joakim.landberg@intel.com, henrik.smiding@intel.com
TBR=reed@google.com
NOTRY=True
Original issue's description:
> Add SSE4 optimization of S32A_Opaque_Blitrow
>
> Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
> instruction set. Special case for when alpha is zero or opaque.
>
> Performance increase of 10%-400% compared to the existing SSE2
> optimization (measured on Silvermont architecture).
> Noticeable in ~25 different skia bench subtests, especially in
> bitmap_8888_*, repeatTile_*, and morph_*.
>
> bitmap_8888_A - 100% faster
> bitmap_8888_A_source_transparent - 250% faster
> bitmap_8888_A_source_opaque - 25% faster
> bitmap_8888_A_scale_bicubic - 75% faster
>
> Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
>
> Committed: https://skia.googlesource.com/skia/+/e2527b147679b0c43019fae7d59cc3777d2d097e
Author: jvanverth@google.com
Review URL: https://codereview.chromium.org/311053009
|
|
Adds optimization of Skia S32A_Opaque_Blitrow blitter using SSE4.2 SIMD
instruction set. Special case for when alpha is zero or opaque.
Performance increase of 10%-400% compared to the existing SSE2
optimization (measured on Silvermont architecture).
Noticeable in ~25 different skia bench subtests, especially in
bitmap_8888_*, repeatTile_*, and morph_*.
bitmap_8888_A - 100% faster
bitmap_8888_A_source_transparent - 250% faster
bitmap_8888_A_source_opaque - 25% faster
bitmap_8888_A_scale_bicubic - 75% faster
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
R=reed@google.com, mtklein@google.com, tomhudson@google.com, djsollen@google.com, joakim.landberg@intel.com
Author: henrik.smiding@intel.com
Review URL: https://codereview.chromium.org/289473009
|