path: root/src/opts
Commit message (author, date)
* Optimized premultiplying swizzles for NEON (msarett, 2016-01-13)

  Improves decode performance for RGBA encoded PNGs.

  Swizzle Time on Nexus 9 (with clang):
    SwapPremul 0.44x
    Premul 0.44x

  Decode Time On Nexus 9 (with clang):
    ZeroInit Decodes 0.85x
    Regular Decodes 0.86x

  Swizzle Time on Nexus 6P (with clang):
    SwapPremul 0.14x
    Premul 0.14x

  Decode Time On Nexus 6P (with clang):
    ZeroInit Decodes 0.93x
    Regular Decodes 0.95x

  Notes:
  ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android).
  A profile on Nexus 9 shows that the premultiplication step of PNG decoding is now ~5% of decode time (down from ~20%).

  BUG=skia:4767
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1577703006
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1577703006
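For readers unfamiliar with the operation being optimized, here is a scalar sketch of a premultiplying swizzle. It is not the NEON code from this commit; the function names, the byte order (R, G, B, A in memory), and the round-to-nearest division are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Round-to-nearest division by 255, valid for products of two bytes.
static inline uint32_t div255_round(uint32_t x) {
    assert(x <= 255u * 255u);
    return (x + 127) / 255;
}

// Premultiply `count` pixels stored as unpremultiplied R,G,B,A bytes.
// When swap_rb is true the red and blue channels also trade places
// (the "SwapPremul" case); otherwise channel order is kept ("Premul").
// Both names here are hypothetical, not Skia's.
static void premul_row(const uint8_t* src, uint8_t* dst, int count, bool swap_rb) {
    for (int i = 0; i < count; ++i) {
        const uint32_t a = src[4 * i + 3];
        const uint8_t r = static_cast<uint8_t>(div255_round(src[4 * i + 0] * a));
        const uint8_t g = static_cast<uint8_t>(div255_round(src[4 * i + 1] * a));
        const uint8_t b = static_cast<uint8_t>(div255_round(src[4 * i + 2] * a));
        dst[4 * i + 0] = swap_rb ? b : r;
        dst[4 * i + 1] = g;
        dst[4 * i + 2] = swap_rb ? r : b;
        dst[4 * i + 3] = static_cast<uint8_t>(a);
    }
}
```

A SIMD version would process several pixels per iteration, but the per-channel arithmetic is the same as in this loop.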
* Clean up order of arguments to d,s[,aa]. (mtklein, 2016-01-08)

  This gets rid of those unsightly lambdas, and makes the file more consistent both with itself and with Sk4px.

  BUG=skia:4765
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1569373002
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1569373002
* Clean up SkXfermode_opts.h (mtklein, 2016-01-08)

  It seems that MSVC + __vectorcall don't play well together, so back ourselves out into a situation where we don't need it.

    - Inline transfermode functions. This removes the need for SK_VECTORCALL.
    - Remove 565 destination specializations. Blending into 565 is not speed-critical enough to merit the code bloat.
    - Removing 565 specializations means a bunch of Sk4px code is now dead.

  8888 xfermodes generally speed up a bit from inlining, smoothly ranging from no change down to 0.65x for the fastest functions like Plus or Modulate.

  565 xfermodes generally slow down because we're doing 565 -> 8888 and 8888->565 conversion serially[1] and using the stack, smoothly ranging from no change up to 2x slower for the fastest functions like Plus and Modulate.

  [1] the 565->8888 conversion is actually being autovectorized

  BUG=skia:4765,skia:4776
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1565223002
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  No public API changes.
  TBR=reed@google.com

  Review URL: https://codereview.chromium.org/1565223002
* clean up dead x86 filter opts code (mtklein, 2016-01-05)

  This is dead after removing shadeSpan16().

  BUG=skia:
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1553233004
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1553233004
* remove shadeSpan16 from shader (reed, 2016-01-05)

  BUG=skia:
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1556003003
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1556003003
* Revert of Try using std::call_once (patchset #1 id:1 of https://codereview.chromium.org/1550893002/ ) (mtklein, 2016-01-04)

  Reason for revert:
  Can't use on XP. :(

  Original issue's description:
  > Try using std::call_once
  >
  > Now that we've got std library support, perhaps we should start using it.
  > This CL acts as a little canary, and may help fix the linked bug.
  >
  > I'm not really sure what's going on in the linked bug, but using
  > std::call_once over homegrown atomics has to be the right answer...
  >
  > BUG=chromium:418041
  > GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1550893002
  > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
  >
  > Going to land this ahead of review while the tree is quiet to see how it rolls.
  > TBR=herb@google.com
  >
  > Committed: https://skia.googlesource.com/skia/+/8895b72f789e5dc8bb99cb9727875439005fc919

  TBR=herb@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=chromium:418041

  Review URL: https://codereview.chromium.org/1552333003
* Try using std::call_once (mtklein, 2015-12-28)

  Now that we've got std library support, perhaps we should start using it.
  This CL acts as a little canary, and may help fix the linked bug.

  I'm not really sure what's going on in the linked bug, but using
  std::call_once over homegrown atomics has to be the right answer...

  BUG=chromium:418041
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1550893002
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Going to land this ahead of review while the tree is quiet to see how it rolls.
  TBR=herb@google.com

  Review URL: https://codereview.chromium.org/1550893002
* count is an int, so constrain it to a 32-bit w-register. (mtklein, 2015-12-16)

  This piece of code is already 64-bit only, so we don't need to think about ARMv7.

  Hopefully this shuts up the warnings. They were harmless.

  If this doesn't work (it's a relatively new modifier, so maybe some compilers barf), an alternative is to cast count to a size_t.

  BUG=skia:4686
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1527123003
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1527123003
* SSE 4.1 SrcOver blits: color32, blitmask. (mtklein, 2015-12-16)

  This is mainly warmup for an AVX2 version. The machine I'm typing this on just doesn't support AVX2.

  This strategy should translate easily down to SSSE3 and SSE2.

    Xfermode_SrcOver: 2.73ms -> 2.62ms (0.96x) (That's Color32.)
    Xfermode_SrcOver_aa: 3.48ms -> 3.09ms (0.89x) (That's BlitMask_D32_A8.)

  AA text blits (text_16_AA_{88,FF,WT,BK}) show speedups in the range of 5 to 20%.

  Unlike previous versions of this code, all the div255() are exactly (x+127)/255. This won't fix any major bugs, but it does correct our bias in the middle. There will be many diffs, all minor.

  I've punted for now on pmaddubsw for lerping. I do intend to try that, but I want this (relatively simple) code as my basis for comparison.

  BUG=skia:
  GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1526883004
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1526883004
* Specialize Sk2d for SSE2 (mtklein, 2015-12-15)

  Given the autovectorization we've seen, I wouldn't expect big speedups from this, but it does give us a point of control over what's going on.

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1526923003
* Unify some SkNx code (mtklein, 2015-12-14)

    - one base case and one N=1 case instead of two each (or three with doubles)
    - use SkNx_cast instead of FromBytes/toBytes
    - 4-at-a-time Sk4f::ToBytes becomes a special standalone Sk4f_ToBytes

  If I did everything right, this'll be perf- and pixel- neutral.

  https://gold.skia.org/search2?issue=1526523003&unt=true&query=source_type%3Dgm&master=false

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1526523003
* archive skpx... currently dead code (mtklein, 2015-12-11)

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1521623003
* better NEON div255 (mtklein, 2015-12-07)

  We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

  Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits

  On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost.

  This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines.

  This CL should not change GMs or Blink.
  https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dgm&master=false

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot

  Review URL: https://codereview.chromium.org/1502843002
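The two formulations in this message can be checked against each other in plain scalar C++. The sketch below is not the NEON code; the function names are made up for illustration, and the input range is assumed to be the product of two bytes (0 through 255*255).

```cpp
#include <cassert>
#include <cstdint>

// Reference: round-to-nearest division by 255.
static inline uint32_t div255_exact(uint32_t x) {
    assert(x <= 255u * 255u);
    return (x + 127) / 255;
}

// Three-step form: add 128, shift a copy right by 8, add and shift again.
static inline uint32_t div255_three_step(uint32_t x) {
    const uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

// Two-step form: compute (x+128)>>8 first, then add x, that value,
// and 128 together before the final shift.
static inline uint32_t div255_two_step(uint32_t x) {
    const uint32_t hi = (x + 128) >> 8;
    return (x + hi + 128) >> 8;
}

// Returns true if both shift-based forms match the reference everywhere
// in the assumed input range.
static bool div255_forms_agree() {
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        if (div255_three_step(x) != div255_exact(x)) return false;
        if (div255_two_step(x) != div255_exact(x)) return false;
    }
    return true;
}
```

The two-step form is algebraically the same sum as the three-step form; the saving described above comes from how the additions map onto instructions, which this scalar sketch does not model.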
* Don't use the Sk4f gradient impl without SIMD (fmalita, 2015-12-03)

  Also remove the SK_SUPPORT_LEGACY_LINEAR_GRADIENT_TABLE guard since it is no longer used in Chromium.

  BUG=chromium:563492
  R=reed@google.com,mtklein@google.com
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1489233005
* Add Sk4f::ToBytes(uint8_t[16], Sk4f, Sk4f, Sk4f, Sk4f) (mtklein, 2015-12-01)

  This is a big speedup for float -> byte. E.g.

  gradient_linear_clamp_3color:
    x86-64 147µs -> 103µs (Broadwell MBP)
    arm64 2.03ms -> 648µs (Galaxy S6)
    armv7 1.12ms -> 489µs (Galaxy S6, same device!)

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot

  Review URL: https://codereview.chromium.org/1483953002
* Add SkNx_cast(). (mtklein, 2015-11-20)

  SkNx_cast() can cast between any of our vector types, provided they have the same number of elements. Any types should work with the default implementation, and we can drop in specializations as needed, like the SSE and NEON Sk4f -> Sk4i I included here as an example.

  To make this work, I made some internal name changes:
    SkNi<N,T> -> SkNx<N, T>
    SkNf<N> -> SkNx<N, float>
  User aliases (Sk4f, Sk16b, etc.) stay the same. We can land this first (it's PS1) if that makes things easier.

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1464623002
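The idea of a default lane-by-lane cast between vector types with the same element count can be sketched as follows. `Vec` and `vec_cast` are hypothetical stand-ins invented for this example, not Skia's SkNx or SkNx_cast, and the real default implementation may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical N-lane vector type used only for this sketch.
template <int N, typename T>
struct Vec {
    T lanes[N];
};

// Default cast: convert each lane with static_cast. The destination
// element type D is given explicitly; N and the source type S are deduced,
// so source and destination always have the same number of lanes.
template <typename D, int N, typename S>
Vec<N, D> vec_cast(const Vec<N, S>& src) {
    Vec<N, D> dst;
    for (int i = 0; i < N; ++i) {
        dst.lanes[i] = static_cast<D>(src.lanes[i]);
    }
    return dst;
}
```

A platform-specific conversion (for example a single SIMD float-to-int instruction) could then be supplied as an overload for the particular pair of types that benefits from it, leaving every other pair on the generic path.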
* Revert float xfermodes back to Sk4f (from Sk8f). (mtklein, 2015-11-19)

  Generally this was a performance win, even on devices without AVX due to unrolling, but on ARM+NEON it looks like that unrolling hurt a bit.

    while (...) { blend a pixel }
  ~~~>
    while (...) { blend two pixels }
    if (n % 2) { blend last pixel }

  BUG=chromium:555278
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1465483002
* Revert SkBlitMask_opts.h back to hand-coded NEON. (mtklein, 2015-11-18)

  SkPx has triggered a bunch of small (2-9%) regressions on NEON devices.

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1462783002
* div255(x) as ((x+128)*257)>>16 with SSE (mtklein, 2015-11-17)

  _mm_mulhi_epu16 makes the (...*257)>>16 part simple.

  This seems to speed up every transfermode that uses div255(), in the 7-25% range.

  It even appears to obviate the need for approxMulDiv255() on SSE. I'm not sure about NEON yet, so I'll keep approxMulDiv255() for now.

  Should be no pixels change:
  https://gold.skia.org/search2?issue=1452903004&unt=true&query=source_type%3Dgm&master=false

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1452903004
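A scalar model of the multiply-high trick may make the formula easier to follow. The helper below imitates what one 16-bit lane of _mm_mulhi_epu16 computes (the upper 16 bits of an unsigned 16x16 product); the function names are made up for this sketch, and the input is assumed to be the product of two bytes.

```cpp
#include <cassert>
#include <cstdint>

// Upper 16 bits of an unsigned 16-bit by 16-bit multiply, i.e. what a
// single lane of _mm_mulhi_epu16 produces.
static inline uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return static_cast<uint16_t>((static_cast<uint32_t>(a) * b) >> 16);
}

// div255 as ((x+128)*257)>>16. x+128 still fits in 16 bits because
// x is at most 255*255 = 65025.
static inline uint16_t div255_mulhi(uint16_t x) {
    assert(x <= 255u * 255u);
    return mulhi_u16(static_cast<uint16_t>(x + 128), 257);
}

// Returns true if the multiply-high form equals rounded division by 255
// for every product of two bytes.
static bool div255_mulhi_is_exact() {
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        if (div255_mulhi(static_cast<uint16_t>(x)) != (x + 127) / 255) return false;
    }
    return true;
}
```

Multiplying by 257 is the same as adding the value to itself shifted left by 8, which is why this form agrees with the shift-and-add versions of div255.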
* trim some fat from SSE2 fixed point alpha code (mtklein, 2015-11-17)

    - extract alpha from a pixel: 5 1-cycle ops to 4 1-cycle ops
    - load alphas: drop 4 unnecessary ops

  Should be no pixel diffs.

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1447273004
* float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. (mtklein, 2015-11-11)

    Xfermode_ColorDodge_aa 10.3ms -> 7.85ms 0.76x
    Xfermode_SoftLight_aa 13.8ms -> 10.2ms 0.74x
    Xfermode_ColorBurn_aa 10.7ms -> 7.82ms 0.73x
    Xfermode_SoftLight 33.6ms -> 23.2ms 0.69x
    Xfermode_ColorDodge 25ms -> 16.5ms 0.66x
    Xfermode_ColorBurn 26.1ms -> 16.6ms 0.63x

  Ought to be no pixel diffs:
  https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dgm&master=false

  Incidental stuff:

  I made the SkNx(T) constructors implicit to make writing math expressions simpler. This allows us to write expressions like
    Sk4f v;
    ...
    v = v*4;
  rather than
    Sk4f v;
    ...
    v = v * Sk4f(4);

  As written it only works when the constant is on the right-hand side, so expressions like `(Sk4f(1) - da)` have to stay for now. I plan on following up with a CL that lets those become `(1 - da)` too.

  BUG=skia:4117
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1432903002
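Why an implicit constructor lets `v*4` compile while `4*v` still does not can be seen in a small sketch. `Vec4f` below is a hypothetical four-float type written for this example, not Skia's Sk4f, and it assumes the multiply is a member operator.

```cpp
#include <cassert>

// Hypothetical 4-lane float vector, for illustration only.
struct Vec4f {
    float v[4];

    // Implicit (non-explicit) splat constructor: a scalar converts to a
    // vector with the same value in every lane.
    Vec4f(float x) : v{x, x, x, x} {}
    Vec4f(float a, float b, float c, float d) : v{a, b, c, d} {}

    // Member operator: the left operand must already be a Vec4f, but the
    // right operand may be implicitly converted from a scalar.
    Vec4f operator*(const Vec4f& o) const {
        return Vec4f(v[0] * o.v[0], v[1] * o.v[1], v[2] * o.v[2], v[3] * o.v[3]);
    }
};

// Scales every lane by 4 using the implicit scalar-to-vector conversion.
static Vec4f times_four(const Vec4f& x) {
    return x * 4;  // 4 -> 4.0f -> Vec4f(4.0f), then member operator*
    // `4 * x` would not compile here: a member operator is never
    // considered with a converted left-hand operand.
}
```

Making the operator a non-member function taking two `Vec4f` arguments is one common way to allow the scalar on either side; whether the follow-up CL mentioned above did that is not stated here.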
* SkPx: use namespaces as namespaces (mtklein, 2015-11-09)

  This is a pure refactor. No behavior change. I'm just getting tired of typing out the names...

  BUG=skia:4117
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1436513002
* prune unused SkNx features (mtklein, 2015-11-09)

    - remove float -> int conversion, keeping float -> byte
    - remove support for doubles

  I was thinking of specializing Sk8f for AVX. This will help keep the complexity down.

  This may cause minor diffs in radial gradients: toBytes() rounds where castTrunc() truncated. But I don't see any diffs in Gold.

  https://gold.skia.org/search2?issue=1411563008&unt=true&query=source_type%3Dgm&master=false

  BUG=skia:4117
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1411563008
* SkPx: new approach to fixed-point SIMD (mtklein, 2015-11-06)

  SkPx is like Sk4px, except each platform implementation of SkPx can declare
  a different sweet spot of N pixels, with extra loads and stores to handle the
  ragged edge of 0<n<N pixels.

  In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
  we can now use NEON's transposing loads and stores, and _none is just 1.
  This makes operations involving alpha considerably more efficient on NEON,
  as alpha is its own distinct 8x8 bit plane that's easy to toss around.

  This incorporates a few other improvements I've been wanting:
    - no requirement that we're dealing with SkPMColor. SkColor works too.
    - no anonymous namespace hack to differentiate implementations.

  Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
  The NEON code looks very similar to the old NEON code, as intended.
  No .skp or GM diffs on my laptop. Don't expect any.

  I intend this to replace Sk4px. Plan after landing:
    - port SkXfermode_opts.h
    - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
      SkOpts code)
    - delete all Sk4px-related code
    - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
      leaving Sk2f, Sk4f (and Sk2s, Sk4s).
    - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
      at a time.

  In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.

  BUG=skia:4117

  Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9

  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot

  Committed: https://skia.googlesource.com/skia/+/a7627dc5cc2bf5d9a95d883d20c40d477ecadadf

  Review URL: https://codereview.chromium.org/1317233005
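The "sweet spot of N pixels plus a ragged edge" pattern can be illustrated without any SIMD. In this sketch the batch size `kN`, the per-batch operation, and all function names are assumptions chosen for the example; SkPx's actual loads and stores are not shown in this log.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sweet spot: how many pixels one batch handles.
constexpr int kN = 4;

// Stand-in for a platform-specific batch operation on exactly kN pixels.
// Here it simply inverts every bit of each pixel.
static void invert_batch(uint32_t px[kN]) {
    for (int i = 0; i < kN; ++i) px[i] = ~px[i];
}

// Process `count` pixels: full batches first, then one partial batch for
// the ragged edge of 0 < n < kN remaining pixels.
static void invert_pixels(uint32_t* dst, const uint32_t* src, int count) {
    assert(count >= 0);
    uint32_t tmp[kN];
    while (count >= kN) {
        std::memcpy(tmp, src, sizeof(tmp));   // full load
        invert_batch(tmp);
        std::memcpy(dst, tmp, sizeof(tmp));   // full store
        src += kN; dst += kN; count -= kN;
    }
    if (count > 0) {
        std::memset(tmp, 0, sizeof(tmp));
        std::memcpy(tmp, src, count * sizeof(uint32_t));  // partial load
        invert_batch(tmp);
        std::memcpy(dst, tmp, count * sizeof(uint32_t));  // partial store
    }
}
```

The partial load and store touch only `count` pixels, so the tail never reads or writes past the end of the caller's buffers even though the batch operation itself always works on `kN` lanes.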
* Revert of SkPx: new approach to fixed-point SIMD (patchset #12 id:220001 of https://codereview.chromium.org/1317233005/ ) (mtklein, 2015-11-06)

  Reason for revert:
  master-skia unhappy: https://android-build.storage.googleapis.com/builds/git_master-skia-linux-volantis-userdebug/2404853/e6c439e806fb0bd0f872a3d7a5cf0637d4ad11bfaa89e9bc18b651dc65f0a36b/logs/build_error.log?GoogleAccessId=701025073339-mqn0q2nvir9iurm6q5d00tdv7blbgvjr%40developer.gserviceaccount.com&Signature=WOqQO7xHkv83SmC4h5tNUIp%2BREaYULqK11hNTWlhj1XXo0NAOQd7GNSIHl775uRRZpBw2LkHeb2Ups3LsgRPrldqymposFtDa%2BUEW0Jv2NWAr%2F1Cqt6lwWsfknvJLN9NiEGfpCCye3Q%2FEYx9bU1ozMBG6h2DRHJUMRS%2FjstkJg0%3D&Expires=1446838937

  Original issue's description:
  > SkPx: new approach to fixed-point SIMD
  >
  > SkPx is like Sk4px, except each platform implementation of SkPx can declare
  > a different sweet spot of N pixels, with extra loads and stores to handle the
  > ragged edge of 0<n<N pixels.
  >
  > In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
  > we can now use NEON's transposing loads and stores, and _none is just 1.
  > This makes operations involving alpha considerably more efficient on NEON,
  > as alpha is its own distinct 8x8 bit plane that's easy to toss around.
  >
  > This incorporates a few other improvements I've been wanting:
  > - no requirement that we're dealing with SkPMColor. SkColor works too.
  > - no anonymous namespace hack to differentiate implementations.
  >
  > Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
  > The NEON code looks very similar to the old NEON code, as intended.
  > No .skp or GM diffs on my laptop. Don't expect any.
  >
  > I intend this to replace Sk4px. Plan after landing:
  > - port SkXfermode_opts.h
  > - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
  > SkOpts code)
  > - delete all Sk4px-related code
  > - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
  > leaving Sk2f, Sk4f (and Sk2s, Sk4s).
  > - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
  > at a time.
  >
  > In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
  >
  > BUG=skia:4117
  >
  > Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9
  >
  > CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot
  >
  > Committed: https://skia.googlesource.com/skia/+/a7627dc5cc2bf5d9a95d883d20c40d477ecadadf

  TBR=msarett@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:4117

  Review URL: https://codereview.chromium.org/1409843005
* SkPx: new approach to fixed-point SIMD (mtklein, 2015-11-06)

  SkPx is like Sk4px, except each platform implementation of SkPx can declare
  a different sweet spot of N pixels, with extra loads and stores to handle the
  ragged edge of 0<n<N pixels.

  In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
  we can now use NEON's transposing loads and stores, and _none is just 1.
  This makes operations involving alpha considerably more efficient on NEON,
  as alpha is its own distinct 8x8 bit plane that's easy to toss around.

  This incorporates a few other improvements I've been wanting:
    - no requirement that we're dealing with SkPMColor. SkColor works too.
    - no anonymous namespace hack to differentiate implementations.

  Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
  The NEON code looks very similar to the old NEON code, as intended.
  No .skp or GM diffs on my laptop. Don't expect any.

  I intend this to replace Sk4px. Plan after landing:
    - port SkXfermode_opts.h
    - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
      SkOpts code)
    - delete all Sk4px-related code
    - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
      leaving Sk2f, Sk4f (and Sk2s, Sk4s).
    - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
      at a time.

  In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.

  BUG=skia:4117

  Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9

  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot

  Review URL: https://codereview.chromium.org/1317233005
* Make SkBlurImageFilter capable of cropping during blur (raster path) (senorblanco, 2015-11-02)

  SkBlurImageFilter can currently only process a source image
  which is larger than or equal to the destination rect. If
  the source image (or crop rect) is smaller, it is padded
  out to dest size with transparent black via the 6-param
  version of applyCropRect().

  Fixing this requires modifying all the flavours of RGBA
  box_blur() to accept a src crop rect.

  BUG=skia:4502, skia:4526
  CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Committed: https://skia.googlesource.com/skia/+/1b82ceb737c73327412f2e8a91748481e1aec9e4

  Review URL: https://codereview.chromium.org/1415653003
* Revert of Make SkBlurImageFilter capable of cropping during blur (patchset #16 id:400001 of https://codereview.chromium.org/1415653003/ ) (senorblanco, 2015-11-02)

  Reason for revert:
  ASAN failures (see https://codereview.chromium.org/1415653003/)

  Original issue's description:
  > Make SkBlurImageFilter capable of cropping during blur (raster path)
  >
  > SkBlurImageFilter can currently only process a source image
  > which is larger than or equal to the destination rect. If
  > the source image (or crop rect) is smaller, it is padded
  > out to dest size with transparent black via the 6-param
  > version of applyCropRect().
  >
  > Fixing this requires modifying all the flavours of RGBA
  > box_blur() to accept a src crop rect.
  >
  > BUG=skia:4502, skia:4526
  > CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
  >
  > Committed: https://skia.googlesource.com/skia/+/1b82ceb737c73327412f2e8a91748481e1aec9e4

  TBR=mtklein@google.com,reed@google.com
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:4502, skia:4526

  Review URL: https://codereview.chromium.org/1428053002
* Make SkBlurImageFilter capable of cropping during blur (raster path) (senorblanco, 2015-11-02)

  SkBlurImageFilter can currently only process a source image
  which is larger than or equal to the destination rect. If
  the source image (or crop rect) is smaller, it is padded
  out to dest size with transparent black via the 6-param
  version of applyCropRect().

  Fixing this requires modifying all the flavours of RGBA
  box_blur() to accept a src crop rect.

  BUG=skia:4502, skia:4526
  CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1415653003
* SkBlurImageFilter_opts: optimize NEON box_blur_double in separate loops. (senorblanco, 2015-10-28)

  Stop leaning so hard on the branch predictor, and pull the conditionals out of the loops for box_blur_double() (NEON).

  This is conceptually the same change as https://codereview.chromium.org/1426583004/ for the NEON double-pixel loop.

  R=mtklein@google.com
  BUG=skia:4526
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1412793009
* SkBlurImageFilter_opt.h: break conditions into separate loops. (senorblanco, 2015-10-28)

  This gives ~15% improvement on blur_image on Linux Z620, and should allow me to implement cropping without incurring a perf hit.

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1426583004
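"Breaking conditions into separate loops" can be illustrated on a one-dimensional sliding-window sum. This is only a sketch of the general loop-splitting idea under assumed names and a simplified problem (an unnormalized box sum over ints); it is not the box_blur() code touched by this commit.

```cpp
#include <cassert>
#include <vector>

// Reference version: one loop with two edge conditions per iteration.
// dst[x] is the sum of src over the window [x-r, x+r] clipped to the array.
static std::vector<int> box_sum_branchy(const std::vector<int>& src, int r) {
    const int w = static_cast<int>(src.size());
    std::vector<int> dst(w);
    int sum = 0;
    for (int i = 0; i < r && i < w; ++i) sum += src[i];
    for (int x = 0; x < w; ++x) {
        if (x + r < w) sum += src[x + r];            // leading edge in range?
        if (x - r - 1 >= 0) sum -= src[x - r - 1];   // trailing edge in range?
        dst[x] = sum;
    }
    return dst;
}

// Split version: the same work as three loops with no per-pixel branches.
static std::vector<int> box_sum_split(const std::vector<int>& src, int r) {
    const int w = static_cast<int>(src.size());
    assert(r >= 0);
    if (w < 2 * r + 1) return box_sum_branchy(src, r);  // window wider than row
    std::vector<int> dst(w);
    int sum = 0;
    for (int i = 0; i < r; ++i) sum += src[i];
    int x = 0;
    for (; x <= r; ++x) {          // only the leading edge is in range
        sum += src[x + r];
        dst[x] = sum;
    }
    for (; x < w - r; ++x) {       // both edges are in range
        sum += src[x + r];
        sum -= src[x - r - 1];
        dst[x] = sum;
    }
    for (; x < w; ++x) {           // only the trailing edge is in range
        sum -= src[x - r - 1];
        dst[x] = sum;
    }
    return dst;
}
```

The middle loop covers most of a wide row and contains no conditionals, which is the property the commit message credits for the speedup.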
* move reinterpret_cast into SK_PREFETCH (mtklein, 2015-10-28)

  no public API changes
  TBR=reed@google.com

  BUG=skia:

  Review URL: https://codereview.chromium.org/1419573011
* Refactor SkBlurImageFilter_Opts.h. (senorblanco, 2015-10-27)

  Refactor box_blur() into a single driver function which SSE*, NEON and generic code paths can use. I've used macros to do this in order to keep debug performance reasonable, but it's fairly ugly. I'm open to other suggestions.

  BUG=skia:
  CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

  Review URL: https://codereview.chromium.org/1408003007
* Revert of SkPx: new approach to fixed-point SIMD (patchset #9 id:160001 of https://codereview.chromium.org/1317233005/ ) (mtklein, 2015-09-14)

  Reason for revert:
  http://build.chromium.org/p/client.skia.compile/builders/Build-Mac10.8-Clang-Arm7-Debug-Android/builds/4627

  Original issue's description:
  > SkPx: new approach to fixed-point SIMD
  >
  > SkPx is like Sk4px, except each platform implementation of SkPx can declare
  > a different sweet spot of N pixels, with extra loads and stores to handle the
  > ragged edge of 0<n<N pixels.
  >
  > In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
  > we can now use NEON's transposing loads and stores, and _none is just 1.
  > This makes operations involving alpha considerably more efficient on NEON,
  > as alpha is its own distinct 8x8 bit plane that's easy to toss around.
  >
  > This incorporates a few other improvements I've been wanting:
  > - no requirement that we're dealing with SkPMColor. SkColor works too.
  > - no anonymous namespace hack to differentiate implementations.
  >
  > Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
  > The NEON code looks very similar to the old NEON code, as intended.
  > No .skp or GM diffs on my laptop. Don't expect any.
  >
  > I intend this to replace Sk4px. Plan after landing:
  > - port SkXfermode_opts.h
  > - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
  > SkOpts code)
  > - delete all Sk4px-related code
  > - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
  > leaving Sk2f, Sk4f (and Sk2s, Sk4s).
  > - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
  > at a time.
  >
  > In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
  >
  > BUG=skia:4117
  >
  > Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9

  TBR=mtklein@google.com,msarett@google.com
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:4117

  Review URL: https://codereview.chromium.org/1336423002
* SkPx: new approach to fixed-point SIMD (mtklein, 2015-09-14)

  SkPx is like Sk4px, except each platform implementation of SkPx can declare
  a different sweet spot of N pixels, with extra loads and stores to handle the
  ragged edge of 0<n<N pixels.

  In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
  we can now use NEON's transposing loads and stores, and _none is just 1.
  This makes operations involving alpha considerably more efficient on NEON,
  as alpha is its own distinct 8x8 bit plane that's easy to toss around.

  This incorporates a few other improvements I've been wanting:
    - no requirement that we're dealing with SkPMColor. SkColor works too.
    - no anonymous namespace hack to differentiate implementations.

  Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
  The NEON code looks very similar to the old NEON code, as intended.
  No .skp or GM diffs on my laptop. Don't expect any.

  I intend this to replace Sk4px. Plan after landing:
    - port SkXfermode_opts.h
    - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
      SkOpts code)
    - delete all Sk4px-related code
    - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
      leaving Sk2f, Sk4f (and Sk2s, Sk4s).
    - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
      at a time.

  In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.

  BUG=skia:4117

  Review URL: https://codereview.chromium.org/1317233005
* Revert of use new shuffle to speed up affine matrix mappts (patchset #3 id:40001 of https://codereview.chromium.org/1333983002/ ) (mtklein, 2015-09-10)

  Reason for revert:
  Unexpected perf impact, and a whole bunch of new images in gold (mostly invisibly different).

  Original issue's description:
  > use new shuffle to speed up affine matrix mappts
  >
  > sse: 25 -> 18
  > neon: 95 -> 86
  >
  > BUG=skia:
  >
  > Committed: https://skia.googlesource.com/skia/+/e70afc9f48d00828ee6b707899a8ff542b0e8b98

  TBR=reed@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:

  Review URL: https://codereview.chromium.org/1335003002
* use new shuffle to speed up affine matrix mappts (mtklein, 2015-09-10)

  sse: 25 -> 18
  neon: 95 -> 86

  BUG=skia:

  Review URL: https://codereview.chromium.org/1333983002
* SkNx_shuffle (mtklein, 2015-09-10)

  This allows us to express shuffles more directly in code while also giving us a convenient point to platform-specify particular shuffles for particular types.

  No specializations yet. Everyone just uses the (pretty good) default option.

  BUG=skia:

  Review URL: https://codereview.chromium.org/1301413006
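A compile-time shuffle with a generic default can be sketched with a template parameter pack of lane indices. `Vec` and `shuffle` here are hypothetical names for this example; they model the idea described above and are not the SkNx_shuffle implementation.

```cpp
#include <cassert>

// Hypothetical N-lane vector type used only for this sketch.
template <int N, typename T>
struct Vec {
    T lanes[N];
};

// Generic default shuffle: output lane i is input lane Ix[i].
// The indices are template arguments, so a platform could overload or
// specialize particular patterns with a single instruction.
template <int... Ix, int N, typename T>
Vec<static_cast<int>(sizeof...(Ix)), T> shuffle(const Vec<N, T>& v) {
    return {{v.lanes[Ix]...}};
}

// Example pattern: swap neighbours within each pair, ABCD -> BADC.
static Vec<4, float> swap_pairs(const Vec<4, float>& v) {
    return shuffle<1, 0, 3, 2>(v);
}
```

Because the number of indices sets the output width, the same helper can also narrow or broadcast, for instance `shuffle<0, 0>(v)` yields a two-lane vector holding the first lane twice.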
* Port SkMatrix opts to SkOpts. (mtklein, 2015-09-10)

  No changes to the code, just moved around.

  This will have the effect of enabling vectorized code on ARMv7. Should be no effect on ARMv8 or x86, which would have been vectorized already.

  nanobench --match mappoints changes on Nexus 5 (ARMv7):
    _affine: 132 -> 95
    _scale: 118 -> 47
    _trans: 60 -> 37

  A teaser: We should next look at the ABCD->BADC shuffle we've noted that we need in _affine. A quick hack showed doing that optimally is another ~35% speedup on x86. Got to figure out how to do it best on ARM though: that same quick hack was a 2x slowdown there. Good reason to resurrect that SkNx_shuffle() CL! (I believe the answers are vrev64q_f32(v) and _mm_shuffle_ps(v,v, _MM_SHUFFLE(2,3,0,1)), but we should probably find out in another CL.)

  BUG=skia:4117

  Review URL: https://codereview.chromium.org/1320673014
* Port SkBlitRow::Color32 to SkOpts. (mtklein, 2015-09-10)

  This was a pre-SkOpts attempt that we can bring under its wing now.

  This should be a perf no-op, deo volente.

  BUG=skia:4117

  Review URL: https://codereview.chromium.org/1314863006
* Port uses of SkLazyPtr to SkOncePtr. [mtklein, 2015-09-09]

    This gives SkOncePtr a non-trivial destructor that uses std::default_delete by default. This is overrideable, as seen in SkColorTable.

    SK_DECLARE_STATIC_ONCE_PTR still just leaves its pointers hanging at EOP.

    BUG=skia:

    No public API changes.
    TBR=reed@google.com

    Committed: https://skia.googlesource.com/skia/+/a1254acdb344174e761f5061c820559dab64a74c

    Review URL: https://codereview.chromium.org/1322933005
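A lazily created pointer whose destructor frees its object with std::default_delete unless a different deleter is supplied might look roughly like the sketch below. This is not Skia's SkOncePtr: `OncePtr`, `get`, and `once_ptr_demo` are hypothetical names, and this version assumes it is acceptable for a losing thread to create and immediately delete a spare object during a race.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical sketch of a create-on-first-use pointer with a pluggable deleter.
template <typename T, typename Delete = std::default_delete<T>>
class OncePtr {
public:
    OncePtr() = default;
    OncePtr(const OncePtr&) = delete;
    OncePtr& operator=(const OncePtr&) = delete;

    // Non-trivial destructor: frees the object, if one was ever created.
    ~OncePtr() {
        if (T* p = fPtr.load(std::memory_order_acquire)) {
            Delete()(p);
        }
    }

    // Returns the shared object, calling create() if none exists yet.
    template <typename Create>
    T* get(Create&& create) {
        T* p = fPtr.load(std::memory_order_acquire);
        if (!p) {
            T* fresh = create();
            if (fPtr.compare_exchange_strong(p, fresh,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                p = fresh;        // we installed our object
            } else {
                Delete()(fresh);  // another thread won; p now holds its object
            }
        }
        return p;
    }

private:
    std::atomic<T*> fPtr{nullptr};
};

// Usage check: the second call reuses the object from the first.
inline void once_ptr_demo() {
    int calls = 0;
    OncePtr<int> once;
    int* a = once.get([&] { ++calls; return new int(7); });
    int* b = once.get([&] { ++calls; return new int(9); });
    assert(a == b && *a == 7 && calls == 1);
}
```

Making the deleter a template parameter with a default is one way to get the "overrideable" behaviour the message describes without changing call sites that are happy with plain `delete`.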
* Revert of Port uses of SkLazyPtr to SkOncePtr. (patchset #7 id:110001 of https://codereview.chromium.org/1322933005/ ) [mtklein, 2015-09-09]

    Reason for revert:
    Breaks Chrome roll.

    obj/skia/ext/skia_chrome.skia_memory_dump_provider.o does not have -I include/private on its include path, but transitively includes SkMessageBus.h.

    Original issue's description:
    > Port uses of SkLazyPtr to SkOncePtr.
    >
    > This gives SkOncePtr a non-trivial destructor that uses std::default_delete
    > by default. This is overrideable, as seen in SkColorTable.
    >
    > SK_DECLARE_STATIC_ONCE_PTR still just leaves its pointers hanging at EOP.
    >
    > BUG=skia:
    >
    > No public API changes.
    > TBR=reed@google.com
    >
    > Committed: https://skia.googlesource.com/skia/+/a1254acdb344174e761f5061c820559dab64a74c

    TBR=herb@google.com,mtklein@chromium.org
    NOPRESUBMIT=true
    NOTREECHECKS=true
    NOTRY=true
    BUG=skia:

    Review URL: https://codereview.chromium.org/1334523002
* Port uses of SkLazyPtr to SkOncePtr. [mtklein, 2015-09-09]

    This gives SkOncePtr a non-trivial destructor that uses std::default_delete by default. This is overrideable, as seen in SkColorTable.

    SK_DECLARE_STATIC_ONCE_PTR still just leaves its pointers hanging at EOP.

    BUG=skia:

    No public API changes.
    TBR=reed@google.com

    Review URL: https://codereview.chromium.org/1322933005
* Restore old NEON blit_mask_d32_a8 methods. [mtklein, 2015-09-01]

    As you'll see from the BUG line, we have a strong indication that the new Sk4px methods regress some devices. This restores the old code as literally as possible while still fitting in the SkOpts framework.

    This is ideally temporary breathing room. We should get an early indication of whether those bugs will improve by watching https://perf.skia.org/#4004

    BUG=skia:4117,525844,519596,524149

    Review URL: https://codereview.chromium.org/1312763009
* SkColorCubeFilter_opts: rounding is actually free here. [mtklein, 2015-09-01]

    (Sk4f(float) is statically initializable, unlike the old SkPMFloat(SkPMColor).)

    BUG=skia:4117

    Review URL: https://codereview.chromium.org/1317593007
* Require Sk4f::toBytes() clamps [mtklein, 2015-09-01]

    BUG=skia:4117
    CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot

    Review URL: https://codereview.chromium.org/1312053004
* Clean up remaining users of SkPMFloat [mtklein, 2015-08-31]

    This switches over SkXfermodes_opts.h and SkColorMatrixFilter to use Sk4f, and converts the SkPMFloat benches to Sk4f benches.

    No pixels should change here, and no code beyond the Sk4f_ benches should change speed. The benches are faster than the old versions.

    BUG=skia:4117

    Review URL: https://codereview.chromium.org/1324743002
* Move float<->byte conversions into Sk4f. [mtklein, 2015-08-31]

    This lets us avoid conversions to [0.0, 1.0] space and rounding that aren't necessary for SkColorCubeFilter_opts.h. Dropping rounding on the way back to bytes means we'll see a bunch of off-by-1 diffs.

    Rough perf effect:
        SSSE3: 110 -> 93 (~15%)
        NEON: 465 -> 375 (~20%)

    This is the beginning of the end for SkPMFloat as an entity distinct from Sk4f. I've kept it for now so I can convert sites one by one and think about how things that really want to keep PM color order will work.

    BUG=skia:4117

    Review URL: https://codereview.chromium.org/1319413003
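The off-by-1 diffs mentioned above come from the difference between rounding and truncating when a float goes back to a byte. A minimal scalar sketch, assuming values already in [0, 255] space: `clamp255`, `to_byte_round`, and `to_byte_trunc` are hypothetical helpers for illustration, not the Sk4f methods.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Clamp to the representable byte range. With this argument order a NaN
// input falls through std::max as 0.0f rather than reaching the cast.
inline float clamp255(float f) {
    return std::min(255.0f, std::max(0.0f, f));
}

// Round to nearest: add one half, then truncate.
inline std::uint8_t to_byte_round(float f) {
    return static_cast<std::uint8_t>(clamp255(f) + 0.5f);
}

// Truncate only: one add cheaper, but can land one below the rounded value.
inline std::uint8_t to_byte_trunc(float f) {
    return static_cast<std::uint8_t>(clamp255(f));
}

// Usage check: the two conversions differ by exactly one for 127.6.
inline void byte_conversion_demo() {
    assert(to_byte_round(127.6f) == 128);
    assert(to_byte_trunc(127.6f) == 127);
    assert(to_byte_round(300.0f) == 255 && to_byte_trunc(-5.0f) == 0);
}
```

Any input whose fractional part is one half or more shows the one-step difference, which is consistent with seeing many small pixel diffs rather than a few large ones.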
* Style Change: NULL->nullptr [halcanary, 2015-08-27]

    DOCS_PREVIEW= https://skia.org/?cl=1316233002

    Review URL: https://codereview.chromium.org/1316233002
* SkColorCubeFilter_opts: start with a statically-initializable zero. [mtklein, 2015-08-27]

    SkPMFloat(0) and SkPMFloat(0,0,0,0) end up with the same value, but the first goes through math to get there. The second is a lot more transparent to the compiler, and should compile all the way down to just `xorps xmmN,xmmN` or even be optimized away.

    Didn't measure any additional benefit from hoisting the zero outside the loop and writing `SkPMFloat color = zero;`.

    Perf win is <2%.

    BUG=skia:

    Review URL: https://codereview.chromium.org/1314763007