aboutsummaryrefslogtreecommitdiffhomepage
path: root/src/opts/SkNx_sse.h
Commit message (Collapse)AuthorAge
...
* Make load4 and store4 part of SkNx properly.Gravatar Mike Klein2016-10-06
| | | | | | | | | | | | | | | Every type now nominally has Load4() and Store4() methods. The ones that we use are implemented. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=3046 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Change-Id: I7984f0c2063ef8acbc322bd2e968f8f7eaa0d8fd Reviewed-on: https://skia-review.googlesource.com/3046 Reviewed-by: Matt Sarett <msarett@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* Support Float32 output from SkColorSpaceXformGravatar msarett2016-09-16
| | | | | | | | | | | | | | | * Adds Float32 support to SkColorSpaceXform * Changes API to allows clients to ask for F32, updates clients to new API * Adds Sk4f_load4 and Sk4f_store4 to SkNx * Make use of new xform in SkGr.cpp BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2339233003 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/43d6651111374b5d1e4ddd9030dcf079b448ec47 Review-Url: https://codereview.chromium.org/2339233003
* Revert of Support Float32 output from SkColorSpaceXform (patchset #7 ↵Gravatar msarett2016-09-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | id:140001 of https://codereview.chromium.org/2339233003/ ) Reason for revert: Hitting an assert Original issue's description: > Support Float32 output from SkColorSpaceXform > > * Adds Float32 support to SkColorSpaceXform > * Changes API to allows clients to ask for F32, updates clients to > new API > * Adds Sk4f_load4 and Sk4f_store4 to SkNx > * Make use of new xform in SkGr.cpp > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2339233003 > CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/43d6651111374b5d1e4ddd9030dcf079b448ec47 TBR=brianosman@google.com,mtklein@google.com,scroggo@google.com,mtklein@chromium.org,bsalomon@google.com # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review-Url: https://codereview.chromium.org/2347473007
* Support Float32 output from SkColorSpaceXformGravatar msarett2016-09-16
| | | | | | | | | | | | | | * Adds Float32 support to SkColorSpaceXform * Changes API to allows clients to ask for F32, updates clients to new API * Adds Sk4f_load4 and Sk4f_store4 to SkNx * Make use of new xform in SkGr.cpp BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2339233003 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2339233003
* Refactor of SkColorSpaceXformOptsGravatar msarett2016-08-02
| | | | | | | | | | | | | | | | | (1) Performance is better or stays the same. (2) Code is split into functions (RasterPipeline-ish design). IMO, it's not really more or less readable. But I think it's now much easier add capabilities, apply optimizations, or do more refactors. Or to actually use RasterPipeline. I help back from trying any of these to try to keep this CL sane. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2194303002 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2194303002
* SkNx: add Sk4uGravatar mtklein2016-07-29
| | | | | | | | | | This lets us get at logical >> in a nicely principled way. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2197683002 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2197683002
* Add Sk4h_load4 for loading F16.Gravatar mtklein2016-07-26
| | | | | | | | | | | | | | Should feel very similar to Sk4h_store4: NEON uses its native instruction, SSE unpacks manually. Since we'll have our F16s in 4 Sk4h by the time we're done here, this also extracts an Sk4h->Sk4f routine from the old uint64_t->Sk4f one. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2184753002 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2184753002
* Improve naive SkColorXform to half floatsGravatar msarett2016-07-19
| | | | | | | | | | | | | | | | | | | | | | | | This should give us a good baseline to explore using SkRasterPipeline. A particular colorxform to half float drops from 425us to 282us on my desktop. Color Xform to Half Float (HP z620) Original 425us Trans16 (not 32) 355us Vector Trans16 378us Trans16 + Keep Halfs in Vector 335us Vector Trans16 + Keep Halfs in Vector 282us Final 282us Color Xform to Half Float (Nexus 5X) Original 556us Final 472us BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2159993003 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2159993003
* Add a bench to measure the best way to pack from int to uint16_t with SSE.Gravatar mtklein2016-07-15
| | | | | | | | | | | | | | | | | | | | | | | I measured relative runtimes on my laptop: pack_int_uint16_t_ss… 1036 …e41 1x …se3 1.01x …e2_b 3.01x …e2_a 3.02x I've run into Clang problems with the actual _mm_packus_epi32 instruction, I think, so I'm going to exercise a little cowardice and leave that option disabled for now. The ssse3 version probably looks a little faster than it will be in practice. We'll usually need to load its mask, which here is hoisted out of the bench loop. The two sse2 variants are close enough in speed that I'm tie breaking them on other concerns: the <<16, >>16 version doesn't need any scratch registers or to load any constants, so it wins. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2150343002 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast-Trybot Review-Url: https://codereview.chromium.org/2150343002
* Expand _01 half<->float limitation to _finite. Simplify.Gravatar mtklein2016-07-15
| | | | | | | | | | | | | | | | | | | | | | | It's become clear we need to sometimes deal with values <0 or >1. I'm not yet convinced we care about NaN or +-inf. We had some fairly clever tricks and optimizations here for NEON and SSE. I've thrown them out in favor of a single implementation. If we find the specializations mattered, we can certainly figure out how to extend them to this new range/domain. This happens to add a vectorized float -> half for ARMv7, which was missing from the _01 version. (The SSE strategy was not portable to platforms that flush denorm floats to zero.) I've tested the full float range for FloatToHalf on my desktop and a 5x. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2145663003 CQ_INCLUDE_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast-Trybot Committed: https://skia.googlesource.com/skia/+/3296bee70d074bb8094b3229dbe12fa016657e90 Review-Url: https://codereview.chromium.org/2145663003
* Revert of Expand _01 half<->float limitation to _finite. Simplify. ↵Gravatar mtklein2016-07-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (patchset #7 id:120001 of https://codereview.chromium.org/2145663003/ ) Reason for revert: Unit tests fail on Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast Original issue's description: > Expand _01 half<->float limitation to _finite. Simplify. > > It's become clear we need to sometimes deal with values <0 or >1. > I'm not yet convinced we care about NaN or +-inf. > > We had some fairly clever tricks and optimizations here for NEON > and SSE. I've thrown them out in favor of a single implementation. > If we find the specializations mattered, we can certainly figure out > how to extend them to this new range/domain. > > This happens to add a vectorized float -> half for ARMv7, which was > missing from the _01 version. (The SSE strategy was not portable to > platforms that flush denorm floats to zero.) > > I've tested the full float range for FloatToHalf on my desktop and a 5x. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2145663003 > CQ_INCLUDE_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast-Trybot > > Committed: https://skia.googlesource.com/skia/+/3296bee70d074bb8094b3229dbe12fa016657e90 TBR=msarett@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review-Url: https://codereview.chromium.org/2151023003
* Expand _01 half<->float limitation to _finite. Simplify.Gravatar mtklein2016-07-14
| | | | | | | | | | | | | | | | | | | | | | It's become clear we need to sometimes deal with values <0 or >1. I'm not yet convinced we care about NaN or +-inf. We had some fairly clever tricks and optimizations here for NEON and SSE. I've thrown them out in favor of a single implementation. If we find the specializations mattered, we can certainly figure out how to extend them to this new range/domain. This happens to add a vectorized float -> half for ARMv7, which was missing from the _01 version. (The SSE strategy was not portable to platforms that flush denorm floats to zero.) I've tested the full float range for FloatToHalf on my desktop and a 5x. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2145663003 CQ_INCLUDE_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2145663003
* SkRasterPipeline preliminariesGravatar mtklein2016-07-12
| | | | | | | | | | | | | | Re-uploading to see if I can get a CL number < 2^31. patch from issue 2147533002 at patchset 240001 (http://crrev.com/2147533002#ps240001) Already reviewed at the other crrev link. TBR= BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2147533002 CQ_INCLUDE_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2144573004
* Add Sk4f_RoundToIntGravatar msarett2016-07-12
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2134753006 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2134753006
* Revert of try to speed-up maprect + round2i + contains (patchset #8 ↵Gravatar msarett2016-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | id:140001 of https://codereview.chromium.org/2133413002/ ) Reason for revert: Breaking the roll... https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/253294/steps/compile%20%28with%20patch%29/logs/stdio Original issue's description: > try to speed-up maprect + round2i + contains > > We call roundOut in a few places. If we can get SkNx::Ceil we could efficiently implement that as well. > > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2133413002 > CQ_INCLUDE_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/b42b785d1cbc98bd34aceae338060831b974f9c5 TBR=mtklein@google.com,reed@google.com # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review-Url: https://codereview.chromium.org/2136343002
* try to speed-up maprect + round2i + containsGravatar reed2016-07-11
| | | | | | | | | | We call roundOut in a few places. If we can get SkNx::Ceil we could efficiently implement that as well. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2133413002 CQ_INCLUDE_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2133413002
* Clean up hyper-local SkCpu feature test experiment.Gravatar mtklein2016-07-11
| | | | | | | | | | | | | | | | | | | | | | | | | | This removes the code paths where we make SkCpu::Supports() calls from within a tight loop. It keeps code paths using SkCpu::Supports() to choose entire routines from src/opts/. We can't rely on these hyper-local checks to be hoisted up reliably enough. It worked pretty well with the first couple platforms we tried (e.g. Clang on Linux/Mac) but we can't gaurantee it works everywhere. Further, I'm not able to actually do anything fancy with those tests outside of x86... I've not found a way to get, say, NEON+F16 conversion code embedded into ordinary NEON code outside writing then entire function in external assembly. This whole idea becomes less important now that we've got a way to chain separate function calls together efficiently. We can now, e.g., use an AVX+F16C method to load some pixels, then chain that into an ordinary AVX method to color filter them. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2138073002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2138073002
* Support sRGB dsts in opt codeGravatar msarett2016-06-20
| | | | | | | | | | | | | | | 201295.jpg on HP z620 (300x280) QCMS Xform 0.418 ms Skia NEW Xform 0.378 ms Vs QCMS 1.11x BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2078623002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2078623002
* port SkColorXform_opts to Sk4fGravatar mtklein2016-06-17
| | | | | | | | | | I have tested that this compiles. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2078913003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review-Url: https://codereview.chromium.org/2078913003
* Move immintrin/arm_neon includes to where they are used.Gravatar mtklein2016-06-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | On my Mac (so, immintrin), this improves compile time, both wall and cpu, by about 16%. To test I ran this on an SSD with files hot in their caches: $ env CC=/usr/bin/clang CXX=/usr/bin/clang++ ./gyp_skia && \ ninja -C out/Release -t clean && \ time ninja -C out/Release Before: 159 wall / 3367 cpu 159 wall / 3368 cpu After: 137 wall / 2860 cpu 136 wall / 2863 cpu I also tried further refining immintrin down to emmintrin / tmmintrin / smmintrin etc. That made no signficant difference, so I've kept immintrin for its simplicity. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2045633002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot TBR=reed@google.com No public API changes. Committed: https://skia.googlesource.com/skia/+/12dfaaa53c23f3d03050bde8f64136ac1f44164a Review-Url: https://codereview.chromium.org/2045633002
* Revert of Move immintrin/arm_neon includes to where they are used. (patchset ↵Gravatar mtklein2016-06-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #2 id:20001 of https://codereview.chromium.org/2045633002/ ) Reason for revert: Appears to have broken the ARMv7 aspect of the Google3 roll in bizarre seemingly-unrelated ways. Original issue's description: > Move immintrin/arm_neon includes to where they are used. > > On my Mac (so, immintrin), this improves compile time, both wall and cpu, > by about 16%. To test I ran this on an SSD with files hot in their caches: > > $ env CC=/usr/bin/clang CXX=/usr/bin/clang++ ./gyp_skia && \ > ninja -C out/Release -t clean && \ > time ninja -C out/Release > > Before: 159 wall / 3367 cpu > 159 wall / 3368 cpu > > After: 137 wall / 2860 cpu > 136 wall / 2863 cpu > > I also tried further refining immintrin down to emmintrin / tmmintrin / smmintrin etc. > That made no signficant difference, so I've kept immintrin for its simplicity. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2045633002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > TBR=reed@google.com > No public API changes. > > Committed: https://skia.googlesource.com/skia/+/12dfaaa53c23f3d03050bde8f64136ac1f44164a TBR=herb@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review-Url: https://codereview.chromium.org/2046213002
* Move immintrin/arm_neon includes to where they are used.Gravatar mtklein2016-06-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | On my Mac (so, immintrin), this improves compile time, both wall and cpu, by about 16%. To test I ran this on an SSD with files hot in their caches: $ env CC=/usr/bin/clang CXX=/usr/bin/clang++ ./gyp_skia && \ ninja -C out/Release -t clean && \ time ninja -C out/Release Before: 159 wall / 3367 cpu 159 wall / 3368 cpu After: 137 wall / 2860 cpu 136 wall / 2863 cpu I also tried further refining immintrin down to emmintrin / tmmintrin / smmintrin etc. That made no signficant difference, so I've kept immintrin for its simplicity. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2045633002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot TBR=reed@google.com No public API changes. Review-Url: https://codereview.chromium.org/2045633002
* skcpu: sse4.1 floor, f16c f16<->f32Gravatar mtklein2016-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - floor with roundps is about 4.5x faster when available - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: +0x180 movups (%r15), %xmm1 +0x184 vcvtph2ps (%rbx), %xmm2 +0x189 movaps %xmm1, %xmm3 +0x18c shufps $255, %xmm3, %xmm3 +0x190 movaps %xmm0, %xmm4 +0x193 subps %xmm3, %xmm4 +0x196 mulps %xmm2, %xmm4 +0x199 addps %xmm1, %xmm4 +0x19c vcvtps2ph $0, %xmm4, (%rbx) +0x1a2 addq $16, %r15 +0x1a6 addq $8, %rbx +0x1aa decl %r14d +0x1ad jne +0x180 If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 Committed: https://skia.googlesource.com/skia/+/3faf74b8364491ca806f523fbb1d8a97be592663 Review URL: https://codereview.chromium.org/1891513002
* Revert of skcpu: sse4.1 floor, f16c f16<->f32 (patchset #11 id:200001 of ↵Gravatar mtklein2016-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1891513002/ ) Reason for revert: this depends on a CL I want to revert Original issue's description: > skcpu: sse4.1 floor, f16c f16<->f32 > > - floor with roundps is about 4.5x faster when available > - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: > > +0x180 movups (%r15), %xmm1 > +0x184 vcvtph2ps (%rbx), %xmm2 > +0x189 movaps %xmm1, %xmm3 > +0x18c shufps $255, %xmm3, %xmm3 > +0x190 movaps %xmm0, %xmm4 > +0x193 subps %xmm3, %xmm4 > +0x196 mulps %xmm2, %xmm4 > +0x199 addps %xmm1, %xmm4 > +0x19c vcvtps2ph $0, %xmm4, (%rbx) > +0x1a2 addq $16, %r15 > +0x1a6 addq $8, %rbx > +0x1aa decl %r14d > +0x1ad jne +0x180 > > If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 > > Committed: https://skia.googlesource.com/skia/+/3faf74b8364491ca806f523fbb1d8a97be592663 TBR=fmalita@chromium.org,herb@google.com,reed@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1897433002
* skcpu: sse4.1 floor, f16c f16<->f32Gravatar mtklein2016-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | - floor with roundps is about 4.5x faster when available - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: +0x180 movups (%r15), %xmm1 +0x184 vcvtph2ps (%rbx), %xmm2 +0x189 movaps %xmm1, %xmm3 +0x18c shufps $255, %xmm3, %xmm3 +0x190 movaps %xmm0, %xmm4 +0x193 subps %xmm3, %xmm4 +0x196 mulps %xmm2, %xmm4 +0x199 addps %xmm1, %xmm4 +0x19c vcvtps2ph $0, %xmm4, (%rbx) +0x1a2 addq $16, %r15 +0x1a6 addq $8, %rbx +0x1aa decl %r14d +0x1ad jne +0x180 If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 Review URL: https://codereview.chromium.org/1891513002
* Revert of skcpu: sse4.1 floor, f16c f16<->f32 (patchset #10 id:180001 of ↵Gravatar mtklein2016-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1891513002/ ) Reason for revert: Need to change around my #if guards so that clang-cl is treated like GCC and Clang, rather than MSVC. Original issue's description: > skcpu: sse4.1 floor, f16c f16<->f32 > > - floor with roundps is about 4.5x faster when available > - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: > > +0x180 movups (%r15), %xmm1 > +0x184 vcvtph2ps (%rbx), %xmm2 > +0x189 movaps %xmm1, %xmm3 > +0x18c shufps $255, %xmm3, %xmm3 > +0x190 movaps %xmm0, %xmm4 > +0x193 subps %xmm3, %xmm4 > +0x196 mulps %xmm2, %xmm4 > +0x199 addps %xmm1, %xmm4 > +0x19c vcvtps2ph $0, %xmm4, (%rbx) > +0x1a2 addq $16, %r15 > +0x1a6 addq $8, %rbx > +0x1aa decl %r14d > +0x1ad jne +0x180 > > If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 TBR=fmalita@chromium.org,herb@google.com,reed@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1891993002
* skcpu: sse4.1 floor, f16c f16<->f32Gravatar mtklein2016-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | - floor with roundps is about 4.5x faster when available - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: +0x180 movups (%r15), %xmm1 +0x184 vcvtph2ps (%rbx), %xmm2 +0x189 movaps %xmm1, %xmm3 +0x18c shufps $255, %xmm3, %xmm3 +0x190 movaps %xmm0, %xmm4 +0x193 subps %xmm3, %xmm4 +0x196 mulps %xmm2, %xmm4 +0x199 addps %xmm1, %xmm4 +0x19c vcvtps2ph $0, %xmm4, (%rbx) +0x1a2 addq $16, %r15 +0x1a6 addq $8, %rbx +0x1aa decl %r14d +0x1ad jne +0x180 If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1891513002
* SkNx refreshGravatar mtklein2016-03-21
| | | | | | | | | | | | | | | | | | | | - rearrange a bit - fewer macros - hooks for all operators - add left and right scalar operator overrides - add +=, &=, <<=, etc. - add SkNx_split() and SkNx_join() - simplify the many rsqrt() and invert() options to just what we actually use This refactoring pointed out that our float <-> int NEON conversions are not specialized, so I've implemented them. It seems nice that this is an error rather than silently falling back to serial code. It's unclear to me if split/join want to be external, static methods, or non-static methods (SkNx_join(), Sk4f::Join(), x.join()). Time will tell? BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1812233003 CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1812233003
* Add swizzle for rgb8888.Gravatar herb2016-03-01
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1746153002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1746153002
* SkNx: kth<...>() -> [...]Gravatar mtklein2016-02-21
| | | | | | | | | | Just some syntax cleanup. No real change: kth<...>() was calling [...] already. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1714363002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1714363002
* fast sk4f <-> sk4i SSE methodsGravatar mtklein2016-02-17
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1707673002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1707673002
* Revert of SkNx refactoring (patchset #4 id:60001 of ↵Gravatar mtklein2016-02-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1690633003/ ) Reason for revert: Precautionary revert for chromium:586487 Original issue's description: > SkNx refactoring > > - add back Sk4i typedef > - define SSE casts in terms of Sk4i > * uint8 <-> float becomes uint8 <-> int <-> float > * uint16 <-> float becomes uint16 <-> int <-> float > > This has the nice side effect of specializing uint8 <-> int > and uint16 <-> int, which are useful in their own right. > > There are many cast specializations now, some of which call each other. > I have tried to arrange them in some sort of sensible order, subject to > the constraint that those called must precede those who call. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1690633003 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/c1eb311f4e98934476f1b2ad5d6de772cf140d60 TBR=herb@google.com,mtklein@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG=chromium:586487 Review URL: https://codereview.chromium.org/1696903002
* SkNx refactoringGravatar mtklein2016-02-11
| | | | | | | | | | | | | | | | | | | | - add back Sk4i typedef - define SSE casts in terms of Sk4i * uint8 <-> float becomes uint8 <-> int <-> float * uint16 <-> float becomes uint16 <-> int <-> float This has the nice side effect of specializing uint8 <-> int and uint16 <-> int, which are useful in their own right. There are many cast specializations now, some of which call each other. I have tried to arrange them in some sort of sensible order, subject to the constraint that those called must precede those who call. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1690633003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1690633003
* Sk4f: floor() via int32_t roundtrip.Gravatar mtklein2016-02-10
| | | | | | | | | | About 25% faster on both x86 and ARMv7. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1682953002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1682953002
* Sk4f: add floor()Gravatar mtklein2016-02-09
| | | | | | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1685773002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/86c6c4935171a1d2d6a9ffbff37ec6dac1326614 CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus7-GPU-Tegra3-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-GPU-TegraK1-Arm64-Release-Trybot Review URL: https://codereview.chromium.org/1685773002
* Revert of Sk4f: add floor() (patchset #3 id:40001 of ↵Gravatar mtklein2016-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1685773002/ ) Reason for revert: build break must be this, right? Original issue's description: > Sk4f: add floor() > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1685773002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/86c6c4935171a1d2d6a9ffbff37ec6dac1326614 TBR=herb@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1679343004
* Sk4f: add floor()Gravatar mtklein2016-02-09
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1685773002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1685773002
* restore sk4i SSE specializationGravatar mtklein2016-02-09
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1679343003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1679343003
* sknx refactoringGravatar mtklein2016-02-09
| | | | | | | | | | | | | | | | | | | | | | | | - trim unused specializations (Sk4i, Sk2d) and apis (SkNx_dup) - expand apis a little * v[0] == v.kth<0>() * SkNx_shuffle can now convert to different-sized vectors, e.g. Sk2f <-> Sk4f - remove anonymous namespace I believe it's safe to remove the anonymous namespace right now. We're worried about violating the One Definition Rule; the anonymous namespace protected us from that. In Release builds, this is mostly moot, as everything tends to inline completely. In Debug builds, violating the ODR is at worst an inconvenience, time spent trying to figure out why the bot is broken. Now that we're building with SSE2/NEON everywhere, very few bots have even a chance about getting confused by two definitions of the same type or function. Where we do compile variants depending on, e.g., SSSE3, we do so in static inline functions. These are not subject to the ODR. I plan to follow up with a tedious .kth<...>() -> [...] auto-replace. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1683543002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1683543002
* fix float <---> uint16_t conversion, with Mike's tests.Gravatar mtklein2016-02-08
| | | | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1679713002 patch from issue 1679713002 at patchset 1 (http://crrev.com/1679713002#ps1) CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1676893002
* could not resist: fast sse float <--> u16Gravatar mtklein2016-02-06
| | | | | | | | | | | - generalizes the bench to float <--> {u8,u16} - must remember to implement NEON version at some point BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1676853002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1676853002
* SkNx Load/store: take any pointer.Gravatar mtklein2016-01-31
| | | | | | | | | | This means we can remove a lot of explicit casts in code that uses SkNx. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1650653002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1650653002
* SkNx miplevel buildingGravatar mtklein2016-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | All sizes approximately twice as fast. Before: micros bench 1649.35 mipmap_build_512x512 nonrendering 1824.42 mipmap_build_511x512 nonrendering 2100.66 ? mipmap_build_512x511 nonrendering 2375.94 mipmap_build_511x511 nonrendering After: micros bench 730.32 ! mipmap_build_512x512 nonrendering 922.12 mipmap_build_511x512 nonrendering 999.07 mipmap_build_512x511 nonrendering 1342.93 ! mipmap_build_511x511 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1606013003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/3bd5aba2a0e165997f683cf3aa306661e71464f6 Review URL: https://codereview.chromium.org/1606013003
* Revert of SkNx miplevel building (patchset #3 id:40001 of ↵Gravatar mtklein2016-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1606013003/ ) Reason for revert: Paranoid revert to see if it helps skia:4823 Original issue's description: > SkNx miplevel building > > All sizes approximately twice as fast. > > Before: > micros bench > 1649.35 mipmap_build_512x512 nonrendering > 1824.42 mipmap_build_511x512 nonrendering > 2100.66 ? mipmap_build_512x511 nonrendering > 2375.94 mipmap_build_511x511 nonrendering > > After: > micros bench > 730.32 ! mipmap_build_512x512 nonrendering > 922.12 mipmap_build_511x512 nonrendering > 999.07 mipmap_build_512x511 nonrendering > 1342.93 ! mipmap_build_511x511 nonrendering > > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1606013003 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/3bd5aba2a0e165997f683cf3aa306661e71464f6 TBR=reed@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1607323002
* SkNx miplevel buildingGravatar mtklein2016-01-19
| | | | | | | | | | | | | | | | | | | | | | | | All sizes approximately twice as fast. Before: micros bench 1649.35 mipmap_build_512x512 nonrendering 1824.42 mipmap_build_511x512 nonrendering 2100.66 ? mipmap_build_512x511 nonrendering 2375.94 mipmap_build_511x511 nonrendering After: micros bench 730.32 ! mipmap_build_512x512 nonrendering 922.12 mipmap_build_511x512 nonrendering 999.07 mipmap_build_512x511 nonrendering 1342.93 ! mipmap_build_511x511 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1606013003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1606013003
* add SkNx::abs(), for now only implemented for Sk4fGravatar mtklein2016-01-15
| | | | | | | | | | | There's no reason we couldn't implement this for all ints and floats; just don't want to land unused code. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1590843003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1590843003
* Specialize Sk2d for SSE2Gravatar mtklein2015-12-15
| | | | | | | | | | Given the autovectorization we've seen, I wouldn't expect big speedups from this, but it does give us a point of control over what's going on. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1526923003
* Unify some SkNx codeGravatar mtklein2015-12-14
| | | | | | | | | | | | | | | - one base case and one N=1 case instead of two each (or three with doubles) - use SkNx_cast instead of FromBytes/toBytes - 4-at-a-time Sk4f::ToBytes becomes a special standalone Sk4f_ToBytes If I did everything right, this'll be perf- and pixel- neutral. https://gold.skia.org/search2?issue=1526523003&unt=true&query=source_type%3Dgm&master=false BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1526523003
* Don't use the Sk4f gradient impl without SIMDGravatar fmalita2015-12-03
| | | | | | | | | | | Also remove the SK_SUPPORT_LEGACY_LINEAR_GRADIENT_TABLE guard since it is no longer used in Chromium. BUG=chromium:563492 R=reed@google.com,mtklein@google.com CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1489233005
* Add Sk4f::ToBytes(uint8_t[16], Sk4f, Sk4f, Sk4f, Sk4f)Gravatar mtklein2015-12-01
| | | | | | | | | | | | This is a big speedup for float -> byte. E.g. gradient_linear_clamp_3color: x86-64 147µs -> 103µs (Broadwell MBP) arm64 2.03ms -> 648µs (Galaxy S6) armv7 1.12ms -> 489µs (Galaxy S6, same device!) BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot Review URL: https://codereview.chromium.org/1483953002