path: root/src/opts
Commit message | Author | Age
* Port morphology to SkOpts. | mtklein | 2015-08-04
  Nothing too fancy. Direction enums become enum classes so they don't get all confused. An alternative is to create one single Direction enum that both blur and morphology opts use.
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1267343004
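A minimal sketch of why enum classes keep the two Direction types apart. The namespaces, enumerator names, and helper functions here are hypothetical stand-ins, not the names used in the Skia sources.

```cpp
#include <cassert>

// Hypothetical stand-ins; the real Skia enums may use different names and values.
namespace blur       { enum class Direction { kX, kY }; }
namespace morphology { enum class Direction { kX, kY }; }

// Each routine accepts only its own Direction type.
inline int blur_stride(blur::Direction d, int width) {
    return d == blur::Direction::kX ? 1 : width;
}
inline int morph_stride(morphology::Direction d, int width) {
    return d == morphology::Direction::kX ? 1 : width;
}

// blur_stride(morphology::Direction::kX, 4);  // would not compile: distinct types
```

With plain enums both arguments would convert to int and the mixed call could compile silently, which is the confusion the message refers to.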
* Reorganize to keep similar code together. | Mike Klein | 2015-08-04
  This organizes memset16, memset32, and rsqrt the same way as the other code. No functional change.
  BUG=skia:4117
  R=djsollen@google.com
  Review URL: https://codereview.chromium.org/1264423002
* Remove dead code. | Mike Klein | 2015-08-04
  BUG=skia:4117
  R=mtklein@google.com
  Review URL: https://codereview.chromium.org/1262213005
* Port SkBlurImage opts to SkOpts. | mtklein | 2015-08-04
  +268 -535 lines
  I also rearranged the code a little bit to encapsulate itself better, mostly replacing static helper functions with lambdas. This also let me merge the SSE2 and SSE4.1 code paths.
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1264103004
* Port SkXfermode opts to SkOpts.h | mtklein | 2015-07-31
  Renames Sk4pxXfermode.h to SkXfermode_opts.h, and refactors it a tiny bit internally.
  This moves xfermode optimization from being "compile-time everywhere but NEON" to simply "runtime everywhere". I don't anticipate any effect on perf or correctness.
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1264543006
* Port SkUtils opts to SkOpts. | mtklein | 2015-07-31
  With this new arrangement, the benefits of inlining sk_memset16/32 have changed.
  On x86, they're not significantly different, except for small N<=10 where the inlined code is significantly slower.
  On ARMv7 with NEON, our custom code is still significantly faster for N>10 (up to 2x faster). For small N<=10 inlining is still significantly faster.
  On ARMv7 without NEON, our custom code is still ridiculously faster (up to 10x) than inlining for N>10, though for small N<=10 inlining is still a little faster.
  We were not using the NEON memset16 and memset32 procs on ARMv8. At first blush, that seems to be an oversight, but if so it's an extremely lucky one. The ARMv8 code generation for our memset16/32 procs is total garbage, leaving those methods ~8x slower than just inlining the memset, using the compiler's autovectorization.
  So, no need to inline any more on x86, and still inline for N<=10 on ARMv7. Always inline for ARMv8.
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1270573002
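The "inline for small N, call out for large N" policy described above can be sketched roughly as follows. The names memset16_outofline, g_memset16, and kInlineThreshold are hypothetical, and the threshold of 10 is simply the figure quoted in the message; this is not the code from the change.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical out-of-line routine standing in for a runtime-selected proc.
inline void memset16_outofline(uint16_t* dst, uint16_t value, int n) {
    for (int i = 0; i < n; i++) { dst[i] = value; }
}
static void (*g_memset16)(uint16_t*, uint16_t, int) = memset16_outofline;

// Assumed threshold, taken from the N<=10 figure in the message.
constexpr int kInlineThreshold = 10;

inline void memset16(uint16_t* dst, uint16_t value, int n) {
    if (n <= kInlineThreshold) {
        while (n-- > 0) { *dst++ = value; }  // small fills: stay inline
        return;
    }
    g_memset16(dst, value, n);               // large fills: go through the pointer
}
```

On a platform where the out-of-line proc never wins (ARMv8 in the message), the same wrapper would skip the pointer call entirely.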
* Runtime CPU detection for rsqrt(). | mtklein | 2015-07-30
  This enables the NEON sk_float_rsqrt() code for configurations that have NEON at run-time but not compile-time. These devices will see about a 2x (1.26 -> 2.33) slowdown in sk_float_rsqrt(), but it should be more precise than our portable fallback.
  (When inlined, the portable fallback and the NEON code are almost identical in speed. The only difference is precision. Going through a function pointer is causing all this slowdown. This is a good example of a place where Skia really benefits from compile-time NEON.)
  BUG=skia:4117,skia:4114
  No public API changes.
  TBR=reed@google.com
  Review URL: https://codereview.chromium.org/1264893002
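A rough sketch of the function-pointer dispatch the message attributes the slowdown to. Every name here (rsqrt_portable, rsqrt_specialized, cpu_has_fast_rsqrt, g_rsqrt) is hypothetical, and both implementations are plain C++ placeholders; a real specialized version would use NEON intrinsics.

```cpp
#include <cassert>
#include <cmath>

// Portable fallback; stands in for the real implementation.
inline float rsqrt_portable(float x) { return 1.0f / std::sqrt(x); }

// Hypothetical "specialized" version; a real one would use NEON intrinsics.
inline float rsqrt_specialized(float x) { return 1.0f / std::sqrt(x); }

// Hypothetical CPU probe; real code would query the OS or CPUID.
inline bool cpu_has_fast_rsqrt() { return false; }

using RsqrtProc = float (*)(float);

// Chosen once at startup. Every later call pays for an indirect call,
// which the compiler cannot inline into the caller.
inline RsqrtProc choose_rsqrt() {
    return cpu_has_fast_rsqrt() ? rsqrt_specialized : rsqrt_portable;
}
static const RsqrtProc g_rsqrt = choose_rsqrt();
```

For a function this small, the indirect call can cost more than the work itself, which matches the tradeoff the message describes.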
* Lay groundwork for SkOpts. | mtklein | 2015-07-30
  This doesn't really do anything yet. It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks.
  BUG=skia:4117
  Committed: https://skia.googlesource.com/skia/+/ce2c5055cee5d5d3c9fc84c1b3eeed4b4d84a827
  Review URL: https://codereview.chromium.org/1255193002
* Revert of Optimize RGB16 blitH functions with NEON for ARM platform. (patchset #2 id:20001 of https://codereview.chromium.org/1229673008/) | mtklein | 2015-07-30
  Reason for revert:
  This doesn't draw correctly, e.g. our GM test named dashcubics.
  Good: https://gold.skia.org/img/images/0f7e8e226379afbad8a700e0a80fd8f1.png
  Bad: https://gold.skia.org/img/images/56ce15fc67436065a3db4b8ee31f13ae.png
  Original issue's description:
  > Optimize RGB16 blitH functions with NEON for ARM platform.
  >
  > Here are some performance results on Nexus 9:
  > SkRGB16BlitterBlitH_neon:
  > +--------+-----------+
  > | height | C/NEON    |
  > +--------+-----------+
  > | 1      | 0.888531  |
  > | 8      | 1.231800  |
  > | 18     | 1.073327  |
  > | 32     | 1.136991  |
  > | 76     | 1.174638  |
  > | 85     | 1.188551  |
  > | 120    | 1.180261  |
  > | 128    | 1.183726  |
  > | 512    | 1.220806  |
  > +--------+-----------+
  >
  > BUG=skia:
  >
  > Committed: https://skia.googlesource.com/skia/+/6c72d5740231f47c664a8e765a8df05cd124c88c
  TBR=djsollen@google.com,caryclark@google.com,reed@google.com,bero@linaro.com,yang.zhang@linaro.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:
  Review URL: https://codereview.chromium.org/1268513003
* Optimize RGB16 blitH functions with NEON for ARM platform. | yang.zhang | 2015-07-30
  Here are some performance results on Nexus 9:
  SkRGB16BlitterBlitH_neon:
  +--------+-----------+
  | height | C/NEON    |
  +--------+-----------+
  | 1      | 0.888531  |
  | 8      | 1.231800  |
  | 18     | 1.073327  |
  | 32     | 1.136991  |
  | 76     | 1.174638  |
  | 85     | 1.188551  |
  | 120    | 1.180261  |
  | 128    | 1.183726  |
  | 512    | 1.220806  |
  +--------+-----------+
  BUG=skia:
  Review URL: https://codereview.chromium.org/1229673008
* Revert of Lay groundwork for SkOpts. (patchset #3 id:40001 of https://codereview.chromium.org/1255193002/) | mtklein | 2015-07-27
  Reason for revert:
  Chromium doesn't call SkGraphics::Init(). This setup won't work.
  Original issue's description:
  > Lay groundwork for SkOpts.
  >
  > This doesn't really do anything yet. It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks.
  >
  > BUG=skia:4117
  >
  > Committed: https://skia.googlesource.com/skia/+/ce2c5055cee5d5d3c9fc84c1b3eeed4b4d84a827
  TBR=djsollen@google.com
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1261743002
* Remove sk_memcpy32 | mtklein | 2015-07-27
  It's only implemented on x86, where the existing benchmark says memcpy() is faster for all cases:
  Timer overhead: 24ns
  curr/maxrss  loops  min     median  mean    max     stddev  samples     config        bench
  10/10 MB     1      35.9µs  36.2µs  36.2µs  36.6µs  1%      ▁▂▄▅▅▃█▄▄▅  nonrendering  sk_memcpy32_100000
  10/10 MB     13     2.27µs  2.28µs  2.28µs  2.29µs  0%      █▄▃▅▃▁▃▅▁▄  nonrendering  sk_memcpy32_10000
  11/11 MB     677    91.6ns  95.9ns  94.5ns  99.4ns  3%      ▅▅▅▅▅█▁▁▁▁  nonrendering  sk_memcpy32_1000
  11/11 MB     1171   20ns    20.9ns  21.3ns  23.4ns  6%      ▁▁▇▃▃▃█▇▃▃  nonrendering  sk_memcpy32_100
  11/11 MB     1952   14ns    14ns    14.3ns  15.2ns  3%      ▁▁██▁▁▁▁▁▁  nonrendering  sk_memcpy32_10
  11/11 MB     5      33.6µs  33.7µs  34.1µs  35.2µs  2%      ▆▇█▁▁▁▁▁▁▁  nonrendering  memcpy32_memcpy_100000
  11/11 MB     18     2.12µs  2.22µs  2.24µs  2.39µs  5%      ▂█▄▇█▄▇▁▁▁  nonrendering  memcpy32_memcpy_10000
  11/11 MB     1112   87.3ns  87.3ns  89.1ns  93.7ns  3%      ▄██▄▁▁▁▁▁▁  nonrendering  memcpy32_memcpy_1000
  11/11 MB     2124   12.8ns  13.3ns  13.5ns  14.8ns  6%      ▁▁▁█▃▃█▇▃▃  nonrendering  memcpy32_memcpy_100
  11/11 MB     3077   9ns     9.41ns  9.52ns  10.2ns  4%      ▃█▁█▃▃▃▃▃▃  nonrendering  memcpy32_memcpy_10
  (Why? One fewer thing to port to SkOpts.)
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1256763003
* Lay groundwork for SkOpts. | mtklein | 2015-07-27
  This doesn't really do anything yet. It's just the CPU detection code, skeleton new .cpp files, and a few little .gyp tweaks.
  BUG=skia:4117
  Review URL: https://codereview.chromium.org/1255193002
* NEON has a ternary instruction. | mtklein | 2015-07-27
  Nothing seems to run any faster or slower, but it is terser.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1255913004
* 565 support for SIMD xfermodes | mtklein | 2015-07-22
  This uses the most basic approach possible:
    - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors.
    - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially.
  Clearly, we can optimize these loads and stores. That's a TODO.
  The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling.
  The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster.
  This will cause minor GM diffs, but I don't think any layout test changes.
  BUG=skia:
  Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0
  Committed: https://skia.googlesource.com/skia/+/860dcaa2ddfdadc050af4f943a84a9d499315066
  Review URL: https://codereview.chromium.org/1245673002
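The "convert serially on the stack" approach can be sketched in portable C++ as below. The function names are hypothetical, the 32-bit packing (A in the top byte, then R, G, B) is an assumption since SkPMColor's channel order is platform-dependent, and the bit-replication expansion is a common technique that may not match what the change actually does.

```cpp
#include <cassert>
#include <cstdint>

// Expand one RGB565 pixel to an opaque 8888 pixel.
// Packing as A<<24 | R<<16 | G<<8 | B is an assumption for this sketch.
inline uint32_t expand_565(uint16_t c) {
    uint32_t r = (c >> 11) & 0x1F, g = (c >> 5) & 0x3F, b = c & 0x1F;
    r = (r << 3) | (r >> 2);   // replicate high bits into low bits: 5 -> 8 bits
    g = (g << 2) | (g >> 4);   // 6 -> 8 bits
    b = (b << 3) | (b >> 2);   // 5 -> 8 bits
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}

// Truncate an 8888 pixel back to RGB565 (alpha is dropped).
inline uint16_t pack_565(uint32_t c) {
    uint32_t r = (c >> 16) & 0xFF, g = (c >> 8) & 0xFF, b = c & 0xFF;
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// "Load": convert four 565 pixels one at a time into a small buffer,
// which a vector type could then load in a single step.
inline void load4_from_565(const uint16_t src[4], uint32_t out[4]) {
    for (int i = 0; i < 4; i++) { out[i] = expand_565(src[i]); }
}

// "Store": the vector is written to a small buffer first, then narrowed serially.
inline void store4_to_565(const uint32_t in[4], uint16_t dst[4]) {
    for (int i = 0; i < 4; i++) { dst[i] = pack_565(in[i]); }
}
```

The per-pixel loops are the serial part the message marks as a TODO to optimize.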
* Revert of 565 support for SIMD xfermodes (patchset #4 id:60001 of https://codereview.chromium.org/1245673002/) | mtklein | 2015-07-21
  Reason for revert:
  NEON 565 gold images have gone ugly. This is what I get for writing and testing SSE and just writing NEON.
  E.g. colortype_xfermodes, dstreadshuffle, bigbitmaprect, pictures, textbloblooper, aaxfermodes (only Plus)
  Original issue's description:
  > 565 support for SIMD xfermodes
  >
  > This uses the most basic approach possible:
  >   - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors.
  >   - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially.
  >
  > Clearly, we can optimize these loads and stores. That's a TODO.
  >
  > The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling.
  >
  > The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster.
  >
  > This will cause minor GM diffs, but I don't think any layout test changes.
  >
  > BUG=skia:
  >
  > Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0
  >
  > Committed: https://skia.googlesource.com/skia/+/860dcaa2ddfdadc050af4f943a84a9d499315066
  TBR=msarett@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:
  Review URL: https://codereview.chromium.org/1248893004
* 565 support for SIMD xfermodes | mtklein | 2015-07-21
  This uses the most basic approach possible:
    - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors.
    - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially.
  Clearly, we can optimize these loads and stores. That's a TODO.
  The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling.
  The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster.
  This will cause minor GM diffs, but I don't think any layout test changes.
  BUG=skia:
  Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0
  Review URL: https://codereview.chromium.org/1245673002
* Clean up dead xfermode opts code. | mtklein | 2015-07-20
  Now that SK_SUPPORT_LEGACY_XFERMODES is unused, tons of code becomes dead.
  Nothing is needed in opts/ anymore for x86.
  We still do runtime NEON detection, which just duplicates Sk4pxXfermode.
  TBR=reed@google.com
  BUG=skia:
  Review URL: https://codereview.chromium.org/1230023011
* Optimize RGB16 blitV functions with NEON for ARM platform. | yang.zhang | 2015-07-15
  Here are some performance results on Nexus 9:
  SkRGB16BlitterBlitV_neon:
  +--------+-----------+
  | height | C/NEON    |
  +--------+-----------+
  | 1      | 0.765230  |
  | 8      | 1.273330  |
  | 18     | 1.441462  |
  | 32     | 1.627798  |
  | 76     | 1.683131  |
  | 85     | 1.679456  |
  | 120    | 1.721311  |
  | 128    | 1.725482  |
  | 512    | 1.784117  |
  +--------+-----------+
  BUG=skia:
  Review URL: https://codereview.chromium.org/1213723002
* 3-15% speedup to HardLight / Overlay xfermodes. | mtklein | 2015-07-14
  While investigating my bug (skia:4052) I saw this TODO and figured it'd make me feel better about an otherwise unsuccessful investigation.
  This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls to 2 cheap comparisons and 1 expensive div255().
  NEON speeds up by a more modest ~3%.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1230663005
* SoftLight with SkPMFloat | mtklein | 2015-06-29
  SSE speeds up about 4.5x over existing integer SSE, NEON speeds up about 3x over serial integer code.
  We expect 1-2 bit component diffs in the usual GMs.
  Still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll now try to lift in Chrome.
  BUG=skia:
  Committed: https://skia.googlesource.com/skia/+/3e47d49b46b3ab62071218ef3dd44642c9713e04
  CQ_EXTRA_TRYBOTS=client.skia:Test-ChromeOS-GCC-Daisy-CPU-NEON-Arm7-Debug-Trybot
  Review URL: https://codereview.chromium.org/1221493002
* Revert of SoftLight with SkPMFloat (patchset #6 id:100001 of https://codereview.chromium.org/1221493002/) | mtklein | 2015-06-29
  Reason for revert:
  xfermodes and xfermodes2 show major diffs on Nexus 5 and Daisy (both ARMv7 w/NEON). Nexus 9 and SSE all look fine...
  Original issue's description:
  > SoftLight with SkPMFloat
  >
  > SSE speeds up about 4.5x over existing integer SSE,
  > NEON speeds up about 3x over serial integer code.
  >
  > We expect 1-2 bit component diffs in the usual GMs.
  >
  > Still guarded by SK_SUPPORT_LEGACY_XFERMODES,
  > which I'll now try to lift in Chrome.
  >
  > BUG=skia:
  >
  > Committed: https://skia.googlesource.com/skia/+/3e47d49b46b3ab62071218ef3dd44642c9713e04
  TBR=reed@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:
  Review URL: https://codereview.chromium.org/1221683002
* SoftLight with SkPMFloat | mtklein | 2015-06-29
  SSE speeds up about 4.5x over existing integer SSE, NEON speeds up about 3x over serial integer code.
  We expect 1-2 bit component diffs in the usual GMs.
  Still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll now try to lift in Chrome.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1221493002
* Add extra braces for uintNNxMx4_t initializers. | mtklein | 2015-06-26
  These structs are always implemented as
      struct uintNNxMx4_t { uintNNxM val[4]; };
  So, the first set of braces is for the struct, the second for val.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1221453002
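To illustrate the two levels of braces without requiring NEON headers, here is a sketch using stand-in types. u8x8 and u8x8x4 are hypothetical substitutes for a NEON vector type and its x4 wrapper from <arm_neon.h>, and make_planes is an invented helper.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for a NEON vector type and its x4 wrapper.
struct u8x8   { uint8_t lane[8]; };
struct u8x8x4 { u8x8 val[4]; };

inline u8x8x4 make_planes(u8x8 a, u8x8 r, u8x8 g, u8x8 b) {
    // Outer braces initialize the struct, inner braces initialize its val[4] member.
    u8x8x4 planes = {{ a, r, g, b }};
    return planes;
}
```

A single pair of braces relies on brace elision, which some compilers warn about; spelling out both levels avoids the warning.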
* Color dodge and burn with SkPMFloat. | mtklein | 2015-06-26
  Both 25-35% faster with SSE.
  With NEON, Burn measures as a ~10% regression, Dodge a huge 2.9x improvement.
  The Burn regression is somewhat artificial: we're drawing random colored rects onto an opaque white dst, so we're heavily biased toward the (d==da) fast path in the serial code. In the vector code there's no short-circuiting and we always pay a fixed cost for ColorBurn regardless of src or dst content.
  Dodge's fast paths, in contrast, only trigger when (s==sa) or (d==0), neither of which happens any more than randomly in our benchmark. I don't think (d==0) should happen at all. Similarly, the (s==0) Burn fast path is really only going to happen as often as SkRandom allows.
  In practice, the existing Burn benchmark is hitting its fast path 100% of the time. So I actually feel really great that this only dings the benchmark by 10%.
  Chrome's still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll lift after finishing the last xfermode, SoftLight.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1214443002
* add/fix copyrights | reed | 2015-06-26
  BUG=skia:
  TBR=
  Review URL: https://codereview.chromium.org/1212393002
* What did we learn today? 255 != 256 | mtklein | 2015-06-25
  vcvt_n_f32_u32 and _u32_f32 work in power-of-2 fixed point, so (...,8) meant 'please multiply or divide by 256'. We need to use 255. :(
  BUG=skia:
  Review URL: https://codereview.chromium.org/1204363002
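The difference between the two scale factors is easy to see in scalar code. This sketch uses hypothetical helper names and plain float arithmetic instead of the NEON conversion intrinsics named above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Fixed-point style conversion with 8 fractional bits: divides by 2^8 = 256.
inline float byte_to_unit_fixed8(uint8_t b) { return b / 256.0f; }

// What a [0,1] color range needs: 255 must map to exactly 1.0.
inline float byte_to_unit(uint8_t b) { return b / 255.0f; }

// Inverse of byte_to_unit, with rounding. Assumes f is already within [0,1].
inline uint8_t unit_to_byte(float f) {
    return static_cast<uint8_t>(std::lround(f * 255.0f));
}
```

With the 256 scale, a fully opaque byte lands at 255/256 instead of 1.0, so "opaque" and "white" are slightly off after conversion.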
* Convert SkPMFloat to [0,1] range and prune its API. | mtklein | 2015-06-25
  Now that Sk4px exists, there's a lot less sense in eking out every cycle of speed from SkPMFloat: if we need to go _really_ fast, we should use Sk4px. SkPMFloat's going to be used for things that are already slow: large-range intermediates, divides, sqrts, etc.
  A [0,1] range is easier to work with, and can even be faster if we eliminate enough *255 and *1/255 steps. This is particularly true on ARM, where NEON can do the *255 and /255 steps for us while converting float<->int.
  We have lots of experimental SkPMFloat <-> SkPMColor APIs that I'm now removing. Of the existing APIs, roundClamp() is the sanest, so I've kept only that, now called round(). The 4-at-a-time APIs never panned out, so they're gone.
  There will be small diffs on:
    colormatrix
    coloremoji
    colorfilterimagefilter
    fadefilter
    imagefilters_xfermodes
    imagefilterscropexpand
    imagefiltersgraph
    tileimagefilter
  BUG=skia:
  Review URL: https://codereview.chromium.org/1201343004
* Implement four more xfermodes with Sk4px. | mtklein | 2015-06-24
  HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON.
  This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first.
  The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors.
  This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way.
  BUG=skia:
  Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc
  CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot
  Review URL: https://codereview.chromium.org/1196713004
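The thenElse() idea can be sketched with a plain four-lane struct. Vec4 and lessThan are hypothetical stand-ins and the real SkNi interface may differ; the point is only that a comparison yields a per-lane mask of the same type, and the select is done with bitwise operations rather than branches.

```cpp
#include <cassert>
#include <cstdint>

// Four-lane stand-in for a SIMD integer vector; the real SkNi API may differ.
struct Vec4 {
    uint32_t lane[4];

    // Comparison returns the same type: all-ones lanes for true, zero for false.
    Vec4 lessThan(const Vec4& o) const {
        Vec4 m;
        for (int i = 0; i < 4; i++) { m.lane[i] = lane[i] < o.lane[i] ? 0xFFFFFFFFu : 0u; }
        return m;
    }

    // Treating *this as a mask, pick t where set and e where clear.
    // Both t and e have already been fully computed by the caller.
    Vec4 thenElse(const Vec4& t, const Vec4& e) const {
        Vec4 r;
        for (int i = 0; i < 4; i++) {
            r.lane[i] = (lane[i] & t.lane[i]) | (~lane[i] & e.lane[i]);
        }
        return r;
    }
};
```

For example, a.lessThan(b).thenElse(a, b) computes a per-lane minimum, the kind of selection Darken and Lighten need.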
* Revert of Implement four more xfermodes with Sk4px. (patchset #16 id:290001 of https://codereview.chromium.org/1196713004/) | mtklein | 2015-06-24
  Reason for revert:
  64-bit ARM build failures.
  Original issue's description:
  > Implement four more xfermodes with Sk4px.
  >
  > HardLight, Overlay, Darken, and Lighten are all
  > ~2x faster with SSE, ~25% faster with NEON.
  >
  > This covers all previously-implemented NEON xfermodes.
  > 3 previous SSE xfermodes remain. Those need division
  > and sqrt, so I'm planning on using SkPMFloat for them.
  > It'll help the readability and NEON speed if I move that
  > into [0,1] space first.
  >
  > The main new concept here is c.thenElse(t,e), which behaves like
  > (c ? t : e) except, of course, both t and e are evaluated. This allows
  > us to emulate conditionals with vectors.
  >
  > This also removes the concept of SkNb. Instead of a standalone bool
  > vector, each SkNi or SkNf will just return their own types for
  > comparisons. Turns out to be a lot more manageable this way.
  >
  > BUG=skia:
  >
  > Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc
  TBR=reed@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:
  Review URL: https://codereview.chromium.org/1205703008
* Implement four more xfermodes with Sk4px. | mtklein | 2015-06-24
  HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON.
  This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first.
  The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors.
  This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1196713004
* Use vmulq_n_u32(..., 0x01010101) to distribute alphas. | mtklein | 2015-06-22
  This seems to make alphas() faster and Load[24]Alphas() no slower.
  The change is particularly noticeable on xfermodes that call alphas() twice (on src and dst), with a 10-12% speedup.
  Xfermode_Difference_aa  29ms   -> 28.4ms  0.98x
  Xfermode_DstATop_aa     27.2ms -> 26.7ms  0.98x
  Xfermode_Xor_aa         27.2ms -> 26.5ms  0.98x
  Xfermode_DstOver        23.6ms -> 22.9ms  0.97x
  Xfermode_DstOver_aa     27.8ms -> 26.8ms  0.96x
  Xfermode_DstOut         22.6ms -> 21.7ms  0.96x
  Xfermode_Multiply_aa    30ms   -> 28.5ms  0.95x
  Xfermode_DstOut_aa      26.1ms -> 24.8ms  0.95x
  Xfermode_DstIn_aa       25.4ms -> 24.1ms  0.95x
  Xfermode_DstATop        28.7ms -> 26ms    0.9x
  Xfermode_Multiply       35.5ms -> 31.3ms  0.88x
  Xfermode_Difference     31.8ms -> 27.7ms  0.87x
  Xfermode_Xor            30.1ms -> 26.1ms  0.87x
  BUG=skia:
  Review URL: https://codereview.chromium.org/1203513002
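The arithmetic behind the 0x01010101 multiply can be shown on a single 32-bit pixel. splat_alpha is a hypothetical name, and placing alpha in the top byte is an assumption for this sketch; the NEON version applies the same multiply to four lanes at once.

```cpp
#include <cassert>
#include <cstdint>

// Replicate a pixel's alpha byte into all four byte positions.
// Alpha is assumed to live in the top byte for this sketch.
inline uint32_t splat_alpha(uint32_t pixel) {
    uint32_t a = pixel >> 24;   // isolate alpha as a value in 0..255
    // a * 0x01010101 == a + (a << 8) + (a << 16) + (a << 24).
    // Since a < 256, the four copies never carry into each other.
    return a * 0x01010101u;
}
```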
* Update some Sk4px APIs. | mtklein | 2015-06-22
  Mostly this is about ergonomics, making it easier to do good operations and hard / impossible to do bad ones.
    - SkAlpha / SkPMColor constructors become static factories.
    - Remove div255TruncNarrow(), rename div255RoundNarrow() to div255(). In practice we always want to round, and the narrowing to 8-bit is contextually obvious.
    - Rename fastMulDiv255Round() to approxMulDiv255() to stress its approximate-ness over its speed. Drop Round for the same reason as above... we should always round.
    - Add operator overloads so we don't have to keep throwing in seemingly-random Sk4px() or Sk4px::Wide() casts.
    - Use operator*() for 8-bit x 8-bit -> 16-bit math. It's always what we want, and there's generally no 8x8->8 alternative.
    - MapFoo can take a const Func&. Don't think it makes a big difference, but nice to do.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1202013002
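As a scalar illustration of a widening 8x8->16 multiply followed by a rounding divide by 255, here is a sketch using a widely known shift-and-add formula. Whether Sk4px::div255() uses this exact formula is an assumption; mul_wide and this div255 are illustrative helpers, not the Skia functions.

```cpp
#include <cassert>
#include <cstdint>

// 8-bit x 8-bit -> 16-bit product; the largest result, 255*255 = 65025, fits in 16 bits.
inline uint16_t mul_wide(uint8_t a, uint8_t b) {
    return static_cast<uint16_t>(a * b);
}

// Rounding divide by 255, narrowed back to 8 bits.
// Exact for inputs in [0, 255*255], which covers every mul_wide() result.
inline uint8_t div255(uint16_t x) {
    uint32_t t = x + 128u;
    return static_cast<uint8_t>((t + (t >> 8)) >> 8);
}
```

Because the intermediate product needs 16 bits, a same-width 8x8->8 multiply would lose the high bits before the divide, which is why the widening form is the useful one here.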
* Plumb through out_row byte length so we can assert we stay underneath it. | mtklein | 2015-06-18
  Sadly, not asserting for me yet. Can't hurt.
  BUG=chromium:491660
  Review URL: https://codereview.chromium.org/1187173005
* switch bitmapshader internals over to pixmap | reed | 2015-06-04
  BUG=skia:
  NOTRY=True
  Review URL: https://codereview.chromium.org/1158273007
* Everyone gets a namespace {}. | mtklein | 2015-05-22
  If we include Sk4px.h, SkPMFloat.h, or SkNx.h into files with different SIMD flags, that could cause different definitions of the same method. Normally that's moot, because all the code inlines, but in Debug it tends not to. So in Debug, the linker picks one definition for us. That breaks _someone_.
  Wrapping everything in a namespace {} keeps the definitions separate.
  Tested locally, it fixes this bug.
  BUG=skia:3861
  This code is not yet enabled in Chrome, so shouldn't affect the roll.
  NOTREECHECKS=true
  Review URL: https://codereview.chromium.org/1154523004
* Move Sk4px Xfermode code to a header so we can use it twice. | mtklein | 2015-05-22
    - Once in SkXfermode as usual to pick up compile-time SSE and NEON
    - Once in SkXfermode_arm_neon to pick up run-time NEON
  This allows us to start cleaning up SkXfermode_arm_neon as we've done for SkXfermode_SSE2. I'm saving this catharsis for a day when I need it.
  The Sk4px xfermodes are generally faster than the existing NEON procs, so this should also have the side effect of a perf win there.
  This means our new Plus-AA code works for runtime NEON too.
  BUG=skia:3852
  Review URL: https://codereview.chromium.org/1150313003
* Re-proc SkBlitRow::Color32 for ARM. | mtklein | 2015-05-22
  This is a spiritual revert of http://crrev.com/1104183004.
  BUG=skia:
  Committed: https://skia.googlesource.com/skia/+/4e13a23d8f720e17660f26657b45b89fe4339004
  Review URL: https://codereview.chromium.org/1145283003
* Revert of Re-proc SkBlitRow::Color32 for ARM. (patchset #3 id:40001 of https://codereview.chromium.org/1145283003/) | mtklein | 2015-05-22
  Reason for revert:
  http://build.chromium.org/p/tryserver.chromium.mac/builders/ios_rel_device_ninja/builds/70016/steps/compile%20%28with%20patch%29/logs/stdio
  Original issue's description:
  > Re-proc SkBlitRow::Color32 for ARM.
  >
  > This is a spiritual revert of http://crrev.com/1104183004.
  >
  > BUG=skia:
  >
  > Committed: https://skia.googlesource.com/skia/+/4e13a23d8f720e17660f26657b45b89fe4339004
  TBR=reed@google.com,mtklein@chromium.org
  NOPRESUBMIT=true
  NOTREECHECKS=true
  NOTRY=true
  BUG=skia:
  Review URL: https://codereview.chromium.org/1157633003
* Re-proc SkBlitRow::Color32 for ARM. | mtklein | 2015-05-21
  This is a spiritual revert of http://crrev.com/1104183004.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1145283003
* Clean up Sk4f xfermodes and covered _SSE2 xfermodes. | mtklein | 2015-05-21
  Before I get going on fixing Plus, it's nice to clear out the dead cruft.
  BUG=skia:3852
  Review URL: https://codereview.chromium.org/1150833003
* Sk4px: Difference and Exclusion | mtklein | 2015-05-15
  This will cause minor (off-by-one) diffs due to a little lost precision:
    colortype_xfermodes
    mixed_xfermodes
    xfermodes2
    xfermodeimagefilter
    xfermodes3
    xfermodes
  Desktop:
    Xfermode_Difference_aa  9.77ms -> 7.32ms  0.75x
    Xfermode_Exclusion_aa   8.49ms -> 6.21ms  0.73x
    Xfermode_Difference     17ms   -> 7.54ms  0.44x
    Xfermode_Exclusion      13.5ms -> 5.09ms  0.38x
  N7:
    Xfermode_Difference_aa  32.2ms -> 27.6ms  0.86x
    Xfermode_Difference     43.9ms -> 32ms    0.73x
    Xfermode_Exclusion_aa   40.5ms -> 26.7ms  0.66x
    Xfermode_Exclusion      71.5ms -> 23.9ms  0.33x
  This wraps up the xfermodes implemented in Sk4f.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1141213002
* add Min to SkNi, specialized for u8 and u16 on SSE and NEON | mtklein | 2015-05-14
  0x8001 / 0x7fff don't seem to work, but we were close: 0x8000 does.
  I plan to use this to implement the Difference xfermode, and it seems generally handy.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1133933004
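One plausible reading of the 0x8000 remark is the common trick for building an unsigned 16-bit min out of a signed one: flip the top bit of both inputs, take the signed min, then flip it back. That this is what the change does is an assumption, and min_u16_via_signed is a hypothetical scalar illustration rather than the SIMD code.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned 16-bit min computed with a signed comparison.
// XOR with 0x8000 maps unsigned order onto signed order:
// 0x0000 becomes the most negative value, 0xFFFF the most positive.
inline uint16_t min_u16_via_signed(uint16_t a, uint16_t b) {
    const uint16_t kBias = 0x8000;
    int16_t sa = static_cast<int16_t>(static_cast<uint16_t>(a ^ kBias));
    int16_t sb = static_cast<int16_t>(static_cast<uint16_t>(b ^ kBias));
    int16_t m  = sa < sb ? sa : sb;                 // the signed min
    return static_cast<uint16_t>(static_cast<uint16_t>(m) ^ kBias);
}
```

Only a bias of exactly 0x8000 preserves ordering across the whole range, which would be consistent with 0x8001 and 0x7fff not working.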
* Sk4px: alphas() and Load[24]Alphas() | mtklein | 2015-05-13
  alphas() extracts the 4 alphas from an existing Sk4px as another Sk4px.
  LoadNAlphas() constructs an Sk4px from N packed alphas.
  In both cases, we end up with 4x repeated alphas aligned with their pixels.
  alphas()
    A0 R0 G0 B0  A1 R1 G1 B1  A2 R2 G2 B2  A3 R3 G3 B3
    ->
    A0 A0 A0 A0  A1 A1 A1 A1  A2 A2 A2 A2  A3 A3 A3 A3
  Load4Alphas()
    A0 A1 A2 A3
    ->
    A0 A0 A0 A0  A1 A1 A1 A1  A2 A2 A2 A2  A3 A3 A3 A3
  Load2Alphas()
    A0 A1
    ->
    A0 A0 A0 A0  A1 A1 A1 A1  0 0 0 0  0 0 0 0
  This is a 5-10% speedup for AA on Intel, and a wash on ARM. AA is still mostly dominated by the final lerp.
  alphas() isn't used yet, but it's similar enough to Load[24]Alphas() that it was easier to write all at once.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1138333003
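The three layouts above can be reproduced with a 16-byte array standing in for an Sk4px. Px4 and the lower-case function names are hypothetical, and the byte order A R G B simply follows the diagrams; the real code does this with SIMD shuffles rather than loops.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Four pixels as 16 bytes, in the A R G B byte order used by the diagrams.
using Px4 = std::array<uint8_t, 16>;

// Copy each pixel's alpha (its first byte) across all four of its bytes.
inline Px4 alphas(const Px4& px) {
    Px4 out{};
    for (int i = 0; i < 16; i++) { out[i] = px[(i / 4) * 4]; }
    return out;
}

// Expand four packed alphas into four repeated groups.
inline Px4 load4_alphas(const uint8_t a[4]) {
    Px4 out{};
    for (int i = 0; i < 16; i++) { out[i] = a[i / 4]; }
    return out;
}

// Expand two packed alphas; the upper two pixels are zero-filled.
inline Px4 load2_alphas(const uint8_t a[2]) {
    Px4 out{};
    for (int i = 0; i < 8; i++) { out[i] = a[i / 4]; }
    return out;
}
```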
* Turn on Sk4px xfermodes when we have NEON too. | mtklein | 2015-05-13
  For SSE, Sk4px is better than Sk4f is better than SkXfermodes_opts_SSE2 (where implemented).
  For NEON, Sk4px is better than SkXfermodes_opts_arm_neon is better than Sk4f (where implemented).
  This is a 1.6-1.9x speedup for Plus, Modulate, and Screen for NEON.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1128053004
* Plus xfermode using Sk4px. | mtklein | 2015-05-12
  Xfermode_Plus runs 4-5x faster.
  We expect mixed_xfermodes to have a small diff. This is because kFoldCoverageIntoSrcAlpha was incorrectly set to true.
  This implementation handily beats the Sk4f impl, the portable impl, and the existing SSE2 impl. Reading the SkXfermodes_opts_SSE2.cpp file, I'm pretty confident that we'll be able to beat all SSE2 impls.
  I believe this impl will beat or match the existing NEON impl too, but that may not be true for more complicated xfermodes. They can take advantage of transposing ARGBARGB... to AAAARRRR.... cheaply and I haven't figured out an abstraction for that yet that doesn't screw SSE.
  Adds:
    - MapDstSrc() to Sk4px
    - saturatedAdd() to SkNi (only implemented as far as it's used).
    - div255Narrow()
  BUG=skia:
  Review URL: https://codereview.chromium.org/1138893002
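Plus is a per-channel saturating add, which can be sketched in scalar form as below. saturated_add and plus are hypothetical helpers operating on one packed 32-bit pixel, with no coverage handling; the vector version processes 16 bytes at a time with a single saturating-add instruction.

```cpp
#include <cassert>
#include <cstdint>

// Add two bytes, clamping at 255 instead of wrapping around.
inline uint8_t saturated_add(uint8_t a, uint8_t b) {
    unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<uint8_t>(sum > 255 ? 255 : sum);
}

// Plus on one packed 32-bit pixel: saturating add of each of the four bytes.
inline uint32_t plus(uint32_t src, uint32_t dst) {
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t s = static_cast<uint8_t>(src >> shift);
        uint8_t d = static_cast<uint8_t>(dst >> shift);
        out |= static_cast<uint32_t>(saturated_add(s, d)) << shift;
    }
    return out;
}
```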
* Sk4px | mtklein | 2015-05-12
  Xfermode_SrcOver:
    SSE: 2.08ms -> 2.03ms (~2% faster)
    NEON: my N5 is noisy, but there appears to be no perf change
  BUG=skia:
  Review URL: https://codereview.chromium.org/1132273004
* We don't use boxBlurY. | mtklein | 2015-05-07
  Also noticed nobody sets SK_DISABLE_BLUR_DIVISION_OPTIMIZATION.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1134513003
* Really use SSE4 (and SSSE3) in SkBlurImage_SSE4 | mtklein | 2015-05-06
  We don't seem to be making good use of the available instruction set. SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and SSSE3 gave us an easy way to do the reverse.
  This should be bit-perfect and about a 10% speedup.
  BUG=skia:
  Review URL: https://codereview.chromium.org/1123263003
* De-proc Color32 | mtklein | 2015-04-27
  Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed.
  Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it.
  Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together.
  BUG=skia:
  Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b
  Committed: https://skia.googlesource.com/skia/+/d65dc0cedd5b50dd407b6ff8fdc39123f11511cc
  CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Mips-Debug-Android-Trybot
  Review URL: https://codereview.chromium.org/1104183004