aboutsummaryrefslogtreecommitdiffhomepage
path: root/src/opts/SkNx_sse.h
Commit message (Collapse)AuthorAge
* skcpu: sse4.1 floor, f16c f16<->f32Gravatar mtklein2016-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - floor with roundps is about 4.5x faster when available - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: +0x180 movups (%r15), %xmm1 +0x184 vcvtph2ps (%rbx), %xmm2 +0x189 movaps %xmm1, %xmm3 +0x18c shufps $255, %xmm3, %xmm3 +0x190 movaps %xmm0, %xmm4 +0x193 subps %xmm3, %xmm4 +0x196 mulps %xmm2, %xmm4 +0x199 addps %xmm1, %xmm4 +0x19c vcvtps2ph $0, %xmm4, (%rbx) +0x1a2 addq $16, %r15 +0x1a6 addq $8, %rbx +0x1aa decl %r14d +0x1ad jne +0x180 If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 Committed: https://skia.googlesource.com/skia/+/3faf74b8364491ca806f523fbb1d8a97be592663 Review URL: https://codereview.chromium.org/1891513002
* Revert of skcpu: sse4.1 floor, f16c f16<->f32 (patchset #11 id:200001 of ↵Gravatar mtklein2016-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1891513002/ ) Reason for revert: this depends on a CL I want to revert Original issue's description: > skcpu: sse4.1 floor, f16c f16<->f32 > > - floor with roundps is about 4.5x faster when available > - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: > > +0x180 movups (%r15), %xmm1 > +0x184 vcvtph2ps (%rbx), %xmm2 > +0x189 movaps %xmm1, %xmm3 > +0x18c shufps $255, %xmm3, %xmm3 > +0x190 movaps %xmm0, %xmm4 > +0x193 subps %xmm3, %xmm4 > +0x196 mulps %xmm2, %xmm4 > +0x199 addps %xmm1, %xmm4 > +0x19c vcvtps2ph $0, %xmm4, (%rbx) > +0x1a2 addq $16, %r15 > +0x1a6 addq $8, %rbx > +0x1aa decl %r14d > +0x1ad jne +0x180 > > If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 > > Committed: https://skia.googlesource.com/skia/+/3faf74b8364491ca806f523fbb1d8a97be592663 TBR=fmalita@chromium.org,herb@google.com,reed@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1897433002
* skcpu: sse4.1 floor, f16c f16<->f32Gravatar mtklein2016-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | - floor with roundps is about 4.5x faster when available - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: +0x180 movups (%r15), %xmm1 +0x184 vcvtph2ps (%rbx), %xmm2 +0x189 movaps %xmm1, %xmm3 +0x18c shufps $255, %xmm3, %xmm3 +0x190 movaps %xmm0, %xmm4 +0x193 subps %xmm3, %xmm4 +0x196 mulps %xmm2, %xmm4 +0x199 addps %xmm1, %xmm4 +0x19c vcvtps2ph $0, %xmm4, (%rbx) +0x1a2 addq $16, %r15 +0x1a6 addq $8, %rbx +0x1aa decl %r14d +0x1ad jne +0x180 If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 Review URL: https://codereview.chromium.org/1891513002
* Revert of skcpu: sse4.1 floor, f16c f16<->f32 (patchset #10 id:180001 of ↵Gravatar mtklein2016-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1891513002/ ) Reason for revert: Need to change around my #if guards so that clang-cl is treated like GCC and Clang, rather than MSVC. Original issue's description: > skcpu: sse4.1 floor, f16c f16<->f32 > > - floor with roundps is about 4.5x faster when available > - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: > > +0x180 movups (%r15), %xmm1 > +0x184 vcvtph2ps (%rbx), %xmm2 > +0x189 movaps %xmm1, %xmm3 > +0x18c shufps $255, %xmm3, %xmm3 > +0x190 movaps %xmm0, %xmm4 > +0x193 subps %xmm3, %xmm4 > +0x196 mulps %xmm2, %xmm4 > +0x199 addps %xmm1, %xmm4 > +0x19c vcvtps2ph $0, %xmm4, (%rbx) > +0x1a2 addq $16, %r15 > +0x1a6 addq $8, %rbx > +0x1aa decl %r14d > +0x1ad jne +0x180 > > If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9 TBR=fmalita@chromium.org,herb@google.com,reed@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1891993002
* skcpu: sse4.1 floor, f16c f16<->f32Gravatar mtklein2016-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | - floor with roundps is about 4.5x faster when available - f16 srcover_n is similar to but a little faster than the version in https://codereview.chromium.org/1884683002. This new one fuses the dst load/stores into the f16<->f32 conversions: +0x180 movups (%r15), %xmm1 +0x184 vcvtph2ps (%rbx), %xmm2 +0x189 movaps %xmm1, %xmm3 +0x18c shufps $255, %xmm3, %xmm3 +0x190 movaps %xmm0, %xmm4 +0x193 subps %xmm3, %xmm4 +0x196 mulps %xmm2, %xmm4 +0x199 addps %xmm1, %xmm4 +0x19c vcvtps2ph $0, %xmm4, (%rbx) +0x1a2 addq $16, %r15 +0x1a6 addq $8, %rbx +0x1aa decl %r14d +0x1ad jne +0x180 If we decide to land this it'd be a good idea to convert most or all users of SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1891513002
* SkNx refreshGravatar mtklein2016-03-21
| | | | | | | | | | | | | | | | | | | | - rearrange a bit - fewer macros - hooks for all operators - add left and right scalar operator overrides - add +=, &=, <<=, etc. - add SkNx_split() and SkNx_join() - simplify the many rsqrt() and invert() options to just what we actually use This refactoring pointed out that our float <-> int NEON conversions are not specialized, so I've implemented them. It seems nice that this is an error rather than silently falling back to serial code. It's unclear to me if split/join want to be external, static methods, or non-static methods (SkNx_join(), Sk4f::Join(), x.join()). Time will tell? BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1812233003 CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1812233003
* Add swizzle for rgb8888.Gravatar herb2016-03-01
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1746153002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1746153002
* SkNx: kth<...>() -> [...]Gravatar mtklein2016-02-21
| | | | | | | | | | Just some syntax cleanup. No real change: kth<...>() was calling [...] already. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1714363002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1714363002
* fast sk4f <-> sk4i SSE methodsGravatar mtklein2016-02-17
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1707673002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1707673002
* Revert of SkNx refactoring (patchset #4 id:60001 of ↵Gravatar mtklein2016-02-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1690633003/ ) Reason for revert: Precautionary revert for chromium:586487 Original issue's description: > SkNx refactoring > > - add back Sk4i typedef > - define SSE casts in terms of Sk4i > * uint8 <-> float becomes uint8 <-> int <-> float > * uint16 <-> float becomes uint16 <-> int <-> float > > This has the nice side effect of specializing uint8 <-> int > and uint16 <-> int, which are useful in their own right. > > There are many cast specializations now, some of which call each other. > I have tried to arrange them in some sort of sensible order, subject to > the constraint that those called must precede those who call. > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1690633003 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/c1eb311f4e98934476f1b2ad5d6de772cf140d60 TBR=herb@google.com,mtklein@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG=chromium:586487 Review URL: https://codereview.chromium.org/1696903002
* SkNx refactoringGravatar mtklein2016-02-11
| | | | | | | | | | | | | | | | | | | | - add back Sk4i typedef - define SSE casts in terms of Sk4i * uint8 <-> float becomes uint8 <-> int <-> float * uint16 <-> float becomes uint16 <-> int <-> float This has the nice side effect of specializing uint8 <-> int and uint16 <-> int, which are useful in their own right. There are many cast specializations now, some of which call each other. I have tried to arrange them in some sort of sensible order, subject to the constraint that those called must precede those who call. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1690633003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1690633003
* Sk4f: floor() via int32_t roundtrip.Gravatar mtklein2016-02-10
| | | | | | | | | | About 25% faster on both x86 and ARMv7. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1682953002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1682953002
* Sk4f: add floor()Gravatar mtklein2016-02-09
| | | | | | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1685773002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/86c6c4935171a1d2d6a9ffbff37ec6dac1326614 CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus7-GPU-Tegra3-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-GPU-TegraK1-Arm64-Release-Trybot Review URL: https://codereview.chromium.org/1685773002
* Revert of Sk4f: add floor() (patchset #3 id:40001 of ↵Gravatar mtklein2016-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1685773002/ ) Reason for revert: build break must be this, right? Original issue's description: > Sk4f: add floor() > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1685773002 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/86c6c4935171a1d2d6a9ffbff37ec6dac1326614 TBR=herb@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1679343004
* Sk4f: add floor()Gravatar mtklein2016-02-09
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1685773002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1685773002
* restore sk4i SSE specializationGravatar mtklein2016-02-09
| | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1679343003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1679343003
* sknx refactoringGravatar mtklein2016-02-09
| | | | | | | | | | | | | | | | | | | | | | | | - trim unused specializations (Sk4i, Sk2d) and apis (SkNx_dup) - expand apis a little * v[0] == v.kth<0>() * SkNx_shuffle can now convert to different-sized vectors, e.g. Sk2f <-> Sk4f - remove anonymous namespace I believe it's safe to remove the anonymous namespace right now. We're worried about violating the One Definition Rule; the anonymous namespace protected us from that. In Release builds, this is mostly moot, as everything tends to inline completely. In Debug builds, violating the ODR is at worst an inconvenience, time spent trying to figure out why the bot is broken. Now that we're building with SSE2/NEON everywhere, very few bots have even a chance about getting confused by two definitions of the same type or function. Where we do compile variants depending on, e.g., SSSE3, we do so in static inline functions. These are not subject to the ODR. I plan to follow up with a tedious .kth<...>() -> [...] auto-replace. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1683543002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1683543002
* fix float <---> uint16_t conversion, with Mike's tests.Gravatar mtklein2016-02-08
| | | | | | | | | | BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1679713002 patch from issue 1679713002 at patchset 1 (http://crrev.com/1679713002#ps1) CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1676893002
* could not resist: fast sse float <--> u16Gravatar mtklein2016-02-06
| | | | | | | | | | | - generalizes the bench to float <--> {u8,u16} - must remember to implement NEON version at some point BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1676853002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1676853002
* SkNx Load/store: take any pointer.Gravatar mtklein2016-01-31
| | | | | | | | | | This means we can remove a lot of explicit casts in code that uses SkNx. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1650653002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1650653002
* SkNx miplevel buildingGravatar mtklein2016-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | All sizes approximately twice as fast. Before: micros bench 1649.35 mipmap_build_512x512 nonrendering 1824.42 mipmap_build_511x512 nonrendering 2100.66 ? mipmap_build_512x511 nonrendering 2375.94 mipmap_build_511x511 nonrendering After: micros bench 730.32 ! mipmap_build_512x512 nonrendering 922.12 mipmap_build_511x512 nonrendering 999.07 mipmap_build_512x511 nonrendering 1342.93 ! mipmap_build_511x511 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1606013003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/3bd5aba2a0e165997f683cf3aa306661e71464f6 Review URL: https://codereview.chromium.org/1606013003
* Revert of SkNx miplevel building (patchset #3 id:40001 of ↵Gravatar mtklein2016-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1606013003/ ) Reason for revert: Paranoid revert to see if it helps skia:4823 Original issue's description: > SkNx miplevel building > > All sizes approximately twice as fast. > > Before: > micros bench > 1649.35 mipmap_build_512x512 nonrendering > 1824.42 mipmap_build_511x512 nonrendering > 2100.66 ? mipmap_build_512x511 nonrendering > 2375.94 mipmap_build_511x511 nonrendering > > After: > micros bench > 730.32 ! mipmap_build_512x512 nonrendering > 922.12 mipmap_build_511x512 nonrendering > 999.07 mipmap_build_512x511 nonrendering > 1342.93 ! mipmap_build_511x511 nonrendering > > > BUG=skia: > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1606013003 > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot > > Committed: https://skia.googlesource.com/skia/+/3bd5aba2a0e165997f683cf3aa306661e71464f6 TBR=reed@google.com,mtklein@chromium.org # Skipping CQ checks because original CL landed less than 1 days ago. NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1607323002
* SkNx miplevel buildingGravatar mtklein2016-01-19
| | | | | | | | | | | | | | | | | | | | | | | | All sizes approximately twice as fast. Before: micros bench 1649.35 mipmap_build_512x512 nonrendering 1824.42 mipmap_build_511x512 nonrendering 2100.66 ? mipmap_build_512x511 nonrendering 2375.94 mipmap_build_511x511 nonrendering After: micros bench 730.32 ! mipmap_build_512x512 nonrendering 922.12 mipmap_build_511x512 nonrendering 999.07 mipmap_build_512x511 nonrendering 1342.93 ! mipmap_build_511x511 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1606013003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1606013003
* add SkNx::abs(), for now only implemented for Sk4fGravatar mtklein2016-01-15
| | | | | | | | | | | There's no reason we couldn't implement this for all ints and floats; just don't want to land unused code. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1590843003 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1590843003
* Specialize Sk2d for SSE2Gravatar mtklein2015-12-15
| | | | | | | | | | Given the autovectorization we've seen, I wouldn't expect big speedups from this, but it does give us a point of control over what's going on. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1526923003
* Unify some SkNx codeGravatar mtklein2015-12-14
| | | | | | | | | | | | | | | - one base case and one N=1 case instead of two each (or three with doubles) - use SkNx_cast instead of FromBytes/toBytes - 4-at-a-time Sk4f::ToBytes becomes a special standalone Sk4f_ToBytes If I did everything right, this'll be perf- and pixel- neutral. https://gold.skia.org/search2?issue=1526523003&unt=true&query=source_type%3Dgm&master=false BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1526523003
* Don't use the Sk4f gradient impl without SIMDGravatar fmalita2015-12-03
| | | | | | | | | | | Also remove the SK_SUPPORT_LEGACY_LINEAR_GRADIENT_TABLE guard since it is no longer used in Chromium. BUG=chromium:563492 R=reed@google.com,mtklein@google.com CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1489233005
* Add Sk4f::ToBytes(uint8_t[16], Sk4f, Sk4f, Sk4f, Sk4f)Gravatar mtklein2015-12-01
| | | | | | | | | | | | This is a big speedup for float -> byte. E.g. gradient_linear_clamp_3color: x86-64 147µs -> 103µs (Broadwell MBP) arm64 2.03ms -> 648µs (Galaxy S6) armv7 1.12ms -> 489µs (Galaxy S6, same device!) BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot Review URL: https://codereview.chromium.org/1483953002
* Add SkNx_cast().Gravatar mtklein2015-11-20
| | | | | | | | | | | | | | | | | | | | SkNx_cast() can cast between any of our vector types, provided they have the same number of elements. Any types should work with the default implementation, and we can drop in specializations as needed, like the SSE and NEON Sk4f -> Sk4i I included here as an example. To make this work, I made some internal name changes: SkNi<N,T> -> SkNx<N, T> SkNf<N> -> SkNx<N, float> User aliases (Sk4f, Sk16b, etc.) stay the same. We can land this first (it's PS1) if that makes things easier. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1464623002
* float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.Gravatar mtklein2015-11-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Xfermode_ColorDodge_aa 10.3ms -> 7.85ms 0.76x Xfermode_SoftLight_aa 13.8ms -> 10.2ms 0.74x Xfermode_ColorBurn_aa 10.7ms -> 7.82ms 0.73x Xfermode_SoftLight 33.6ms -> 23.2ms 0.69x Xfermode_ColorDodge 25ms -> 16.5ms 0.66x Xfermode_ColorBurn 26.1ms -> 16.6ms 0.63x Ought to be no pixel diffs: https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dgm&master=false Incidental stuff: I made the SkNx(T) constructors implicit to make writing math expressions simpler. This allows us to write expressions like Sk4f v; ... v = v*4; rather than Sk4f v; ... v = v * Sk4f(4); As written it only works when the constant is on the right-hand side, so expressions like `(Sk4f(1) - da)` have to stay for now. I plan on following up with a CL that lets those become `(1 - da)` too. BUG=skia:4117 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1432903002
* prune unused SkNx featuresGravatar mtklein2015-11-09
| | | | | | | | | | | | | | | - remove float -> int conversion, keeping float -> byte - remove support for doubles I was thinking of specializing Sk8f for AVX. This will help keep the complexity down. This may cause minor diffs in radial gradients: toBytes() rounds where castTrunc() truncated. But I don't see any diffs in Gold. https://gold.skia.org/search2?issue=1411563008&unt=true&query=source_type%3Dgm&master=false BUG=skia:4117 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Review URL: https://codereview.chromium.org/1411563008
* Require Sk4f::toBytes() clampsGravatar mtklein2015-09-01
| | | | | | | | BUG=skia:4117 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot Review URL: https://codereview.chromium.org/1312053004
* Move float<->byte conversions into Sk4f.Gravatar mtklein2015-08-31
| | | | | | | | | | | | | | | | | | | This lets us avoid conversions to [0.0, 1.0] space and rounding that aren't necessary for SkColorCubeFilter_opts.h. Dropping rounding on the way back to bytes means we'll see a bunch of off-by-1 diffs. Rough perf effect: SSSE3: 110 -> 93 (~15%) NEON: 465 -> 375 (~20%) This is the beginning of the end for SkPMFloat as an entity distinct from Sk4f. I've kept it for now so I can convert sites one by one and think about how things that really want to keep PM color order will work. BUG=skia:4117 Review URL: https://codereview.chromium.org/1319413003
* Revert of Refactor to put SkXfermode_opts inside SK_OPTS_NS. (patchset #1 ↵Gravatar mtklein2015-08-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | id:1 of https://codereview.chromium.org/1286093004/ ) Reason for revert: Maybe causing test / gold problems? Original issue's description: > Refactor to put SkXfermode_opts inside SK_OPTS_NS. > > Without this refactor I was getting warnings previously about having code > inside namespace SK_OPTS_NS (e.g. namespace sse2, namespace neon) referring to > code inside an anonymous namespace (Sk4px, SkPMFloat, Sk4f, etc) [1]. > > That low-level code was in an anonymous namespace to allow multiple independent > copies of its methods to be instantiated without the linker getting confused / > offended about violating the One Definition Rule. This was only happening in > Debug mode where the methods were not being inlined. > > To fix this all, I've force-inlined the methods of the low-level code and > removed the anonymous namespace. > > BUG=skia:4117 > > > [1] Here is what those errors looked like: > > In file included from ../../../../src/core/SkOpts.cpp:18:0: > ../../../../src/opts/SkXfermode_opts.h:193:7: error: 'portable::Sk4pxXfermode' has a field 'portable::Sk4pxXfermode::fProc4' whose type uses the anonymous namespace [-Werror] > class Sk4pxXfermode : public SkProcCoeffXfermode { > ^ > ../../../../src/opts/SkXfermode_opts.h:193:7: error: 'portable::Sk4pxXfermode' has a field 'portable::Sk4pxXfermode::fAAProc4' whose type uses the anonymous namespace [-Werror] > ../../../../src/opts/SkXfermode_opts.h:235:7: error: 'portable::SkPMFloatXfermode' has a field 'portable::SkPMFloatXfermode::fProcF' whose type uses the anonymous namespace [-Werror] > class SkPMFloatXfermode : public SkProcCoeffXfermode { > ^ > cc1plus: all warnings being treated as errors > > Committed: https://skia.googlesource.com/skia/+/b07bee3121680b53b98b780ac08d14d374dd4c6f TBR=djsollen@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia:4117 Review URL: https://codereview.chromium.org/1284333002
* Refactor to put SkXfermode_opts inside SK_OPTS_NS.Gravatar mtklein2015-08-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without this refactor I was getting warnings previously about having code inside namespace SK_OPTS_NS (e.g. namespace sse2, namespace neon) referring to code inside an anonymous namespace (Sk4px, SkPMFloat, Sk4f, etc) [1]. That low-level code was in an anonymous namespace to allow multiple independent copies of its methods to be instantiated without the linker getting confused / offended about violating the One Definition Rule. This was only happening in Debug mode where the methods were not being inlined. To fix this all, I've force-inlined the methods of the low-level code and removed the anonymous namespace. BUG=skia:4117 [1] Here is what those errors looked like: In file included from ../../../../src/core/SkOpts.cpp:18:0: ../../../../src/opts/SkXfermode_opts.h:193:7: error: 'portable::Sk4pxXfermode' has a field 'portable::Sk4pxXfermode::fProc4' whose type uses the anonymous namespace [-Werror] class Sk4pxXfermode : public SkProcCoeffXfermode { ^ ../../../../src/opts/SkXfermode_opts.h:193:7: error: 'portable::Sk4pxXfermode' has a field 'portable::Sk4pxXfermode::fAAProc4' whose type uses the anonymous namespace [-Werror] ../../../../src/opts/SkXfermode_opts.h:235:7: error: 'portable::SkPMFloatXfermode' has a field 'portable::SkPMFloatXfermode::fProcF' whose type uses the anonymous namespace [-Werror] class SkPMFloatXfermode : public SkProcCoeffXfermode { ^ cc1plus: all warnings being treated as errors Review URL: https://codereview.chromium.org/1286093004
* 3-15% speedup to HardLight / Overlay xfermodes.Gravatar mtklein2015-07-14
| | | | | | | | | | | | | | | While investigating my bug (skia:4052) I saw this TODO and figured it'd make me feel better about an otherwise unsuccessful investigation. This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls to 2 cheap comparisons and 1 expensive div255(). NEON speeds up by a more modest ~3%. BUG=skia: Review URL: https://codereview.chromium.org/1230663005
* Color dodge and burn with SkPMFloat.Gravatar mtklein2015-06-26
| | | | | | | | | | | | | | | | | Both 25-35% faster with SSE. With NEON, Burn measures as a ~10% regression, Dodge a huge 2.9x improvement. The Burn regression is somewhat artificial: we're drawing random colored rects onto an opaque white dst, so we're heavily biased toward the (d==da) fast path in the serial code. In the vector code there's no short-circuiting and we always pay a fixed cost for ColorBurn regardless of src or dst content. Dodge's fast paths, in contrast, only trigger when (s==sa) or (d==0), neither of which happens any more than randomly in our benchmark. I don't think (d==0) should happen at all. Similarly, the (s==0) Burn fast path is really only going to happen as often as SkRandom allows. In practice, the existing Burn benchmark is hitting its fast path 100% of the time. So I actually feel really great that this only dings the benchmark by 10%. Chrome's still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll lift after finishing the last xfermode, SoftLight. BUG=skia: Review URL: https://codereview.chromium.org/1214443002
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1196713004
* Revert of Implement four more xfermodes with Sk4px. (patchset #16 id:290001 ↵Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of https://codereview.chromium.org/1196713004/) Reason for revert: 64-bit ARM build failures. Original issue's description: > Implement four more xfermodes with Sk4px. > > HardLight, Overlay, Darken, and Lighten are all > ~2x faster with SSE, ~25% faster with NEON. > > This covers all previously-implemented NEON xfermodes. > 3 previous SSE xfermodes remain. Those need division > and sqrt, so I'm planning on using SkPMFloat for them. > It'll help the readability and NEON speed if I move that > into [0,1] space first. > > The main new concept here is c.thenElse(t,e), which behaves like > (c ? t : e) except, of course, both t and e are evaluated. This allows > us to emulate conditionals with vectors. > > This also removes the concept of SkNb. Instead of a standalone bool > vector, each SkNi or SkNf will just return their own types for > comparisons. Turns out to be a lot more manageable this way. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1205703008
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Review URL: https://codereview.chromium.org/1196713004
* Update some Sk4px APIs.Gravatar mtklein2015-06-22
| | | | | | | | | | | | | | | Mostly this is about ergonomics, making it easier to do good operations and hard / impossible to do bad ones. - SkAlpha / SkPMColor constructors become static factories. - Remove div255TruncNarrow(), rename div255RoundNarrow() to div255(). In practice we always want to round, and the narrowing to 8-bit is contextually obvious. - Rename fastMulDiv255Round() approxMulDiv255() to stress it's approximate-ness over its speed. Drop Round for the same reason as above... we should always round. - Add operator overloads so we don't have to keep throwing in seemingly-random Sk4px() or Sk4px::Wide() casts. - use operator*() for 8-bit x 8-bit -> 16-bit math. It's always what we want, and there's generally no 8x8->8 alternative. - MapFoo can take a const Func&. Don't think it makes a big difference, but nice to do. BUG=skia: Review URL: https://codereview.chromium.org/1202013002
* Everyone gets a namespace {}.Gravatar mtklein2015-05-22
| | | | | | | | | | | | | | | | | | If we include Sk4px.h, SkPMFloat.h, or SkNx.h into files with different SIMD flags, that could cause different definitions of the same method. Normally that's moot, because all the code inlines, but in Debug it tends not to. So in Debug, the linker picks one definition for us. That breaks _someone_. Wrapping everything in a namespace {} keeps the definitions separate. Tested locally, it fixes this bug. BUG=skia:3861 This code is not yet enabled in Chrome, so shouldn't affect the roll. NOTREECHECKS=true Review URL: https://codereview.chromium.org/1154523004
* add Min to SkNi, specialized for u8 and u16 on SSE and NEONGravatar mtklein2015-05-14
| | | | | | | | | | | 0x8001 / 0x7fff don't seem to work, but we were close: 0x8000 does. I plan to use this to implement the Difference xfermode, and it seems generally handy. BUG=skia: Review URL: https://codereview.chromium.org/1133933004
* Plus xfermode using Sk4px.Gravatar mtklein2015-05-12
| | | | | | | | | | | | | | | | | | | Xfermode_Plus runs 4-5x faster. We expect mixed_xfermodes to have a small diff. This is because kFoldCoverageIntoSrcAlpha was incorrectly set to true. This implementation handily beats the Sk4f impl, the portable impl, and the existing SSE2 impl. Reading the SkXfermodes_opts_SSE2.cpp file, I'm pretty confident that we'll be able to beat all SSE2 impls. I believe this impl will beat or match the existing NEON impl too, but that may not be true for more complicated xfermodes. They can take advantage of transposing ARGBARGB... to AAAARRRR.... cheaply and I haven't figured out an abstraction for that yet that doesn't screw SSE. Adds: - MapDstSrc() to Sk4px - saturatedAdd() to SkNi (only implemented as far as it's used). - div255Narrow() BUG=skia: Review URL: https://codereview.chromium.org/1138893002
* Sk4pxGravatar mtklein2015-05-12
| | | | | | | | | | Xfermode_SrcOver: SSE: 2.08ms -> 2.03ms (~2% faster) NEON: my N5 is noisy, but there appears to be no perf change BUG=skia: Review URL: https://codereview.chromium.org/1132273004
* Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARMGravatar mtklein2015-04-27
| | | | | | | | | | | | This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. BUG=skia: Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297 CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1109913002
* Revert of Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision ↵Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | | | | on ARM (patchset #2 id:20001 of https://codereview.chromium.org/1109913002/) Reason for revert: arm64 typos Original issue's description: > Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARM > > This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1105233003
* Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARMGravatar mtklein2015-04-27
| | | | | | | | This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. BUG=skia: Review URL: https://codereview.chromium.org/1109913002
* Mike's radial gradient CL with better float -> int.Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) This looks quite launchable. radial_gradient3, min of 100 samples: N5: 985µs -> 946µs MBP: 395µs -> 279µs On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot,Build-Ubuntu-GCC-Arm7-Release-Android_NoNeon-Trybot Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec Review URL: https://codereview.chromium.org/1109643002
* Revert of Mike's radial gradient CL with better float -> int. (patchset #7 ↵Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | id:120001 of https://codereview.chromium.org/1109643002/) Reason for revert: compile failures. Original issue's description: > Mike's radial gradient CL with better float -> int. > > patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) > > This looks quite launchable. radial_gradient3, min of 100 samples: > N5: 985µs -> 946µs > MBP: 395µs -> 279µs > > On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? > > BUG=skia: > > CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot > > Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec TBR=reed@google.com,robertphillips@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1109883003