aboutsummaryrefslogtreecommitdiffhomepage
path: root/src/opts
Commit message (Collapse)AuthorAge
* Optimize RGB16 blitV functions with NEON for ARM platform.Gravatar yang.zhang2015-07-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here are some performance resultsi on Nexus 9: SkRGB16BlitterBlitV_neon: +--------+-----------+ |height | C/NEON | +--------+-----------+ |1 | 0.765230 | +--------+-----------+ |8 | 1.273330 | +--------+-----------+ |18 | 1.441462 | +--------+-----------+ |32 | 1.627798 | +--------+-----------+ |76 | 1.683131 | +--------+-----------+ |85 | 1.679456 | +--------+-----------+ |120 | 1.721311 | +--------+-----------+ |128 | 1.725482 | +--------+-----------+ |512 | 1.784117 | +--------+-----------+ BUG=skia: Review URL: https://codereview.chromium.org/1213723002
* 3-15% speedup to HardLight / Overlay xfermodes.Gravatar mtklein2015-07-14
| | | | | | | | | | | | | | | While investigating my bug (skia:4052) I saw this TODO and figured it'd make me feel better about an otherwise unsuccessful investigation. This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls to 2 cheap comparisons and 1 expensive div255(). NEON speeds up by a more modest ~3%. BUG=skia: Review URL: https://codereview.chromium.org/1230663005
* SoftLight with SkPMFloatGravatar mtklein2015-06-29
| | | | | | | | | | | | | | | | | | SSE speeds up about 4.5x over existing integer SSE, NEON speeds up about 3x over serial integer code. We expect 1-2 bit component diffs in the usual GMs. Still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll now try to lift in Chrome. BUG=skia: Committed: https://skia.googlesource.com/skia/+/3e47d49b46b3ab62071218ef3dd44642c9713e04 CQ_EXTRA_TRYBOTS=client.skia:Test-ChromeOS-GCC-Daisy-CPU-NEON-Arm7-Debug-Trybot Review URL: https://codereview.chromium.org/1221493002
* Revert of SoftLight with SkPMFloat (patchset #6 id:100001 of ↵Gravatar mtklein2015-06-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1221493002/) Reason for revert: xfermodes and xfermodes2 show major diffs on Nexus 5 and Daisy (both ARMv7 w/NEON). Nexus 9 and SSE all look fine... Original issue's description: > SoftLight with SkPMFloat > > SSE speeds up about 4.5x over existing integer SSE, > NEON speeds up about 3x over serial integer code. > > We expect 1-2 bit component diffs in the usual GMs. > > Still guarded by SK_SUPPORT_LEGACY_XFERMODES, > which I'll now try to lift in Chrome. > > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/3e47d49b46b3ab62071218ef3dd44642c9713e04 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1221683002
* SoftLight with SkPMFloatGravatar mtklein2015-06-29
| | | | | | | | | | | | | | SSE speeds up about 4.5x over existing integer SSE, NEON speeds up about 3x over serial integer code. We expect 1-2 bit component diffs in the usual GMs. Still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll now try to lift in Chrome. BUG=skia: Review URL: https://codereview.chromium.org/1221493002
* Add extra braces for uintNNxMx4_t initializers.Gravatar mtklein2015-06-26
| | | | | | | | | | | | These structs are always implemented as struct uintNNxMx4_t { uintNNxM val[4]; }; So, the first set of braces is for the struct, the second for val. BUG=skia: Review URL: https://codereview.chromium.org/1221453002
* Color dodge and burn with SkPMFloat.Gravatar mtklein2015-06-26
| | | | | | | | | | | | | | | | | Both 25-35% faster with SSE. With NEON, Burn measures as a ~10% regression, Dodge a huge 2.9x improvement. The Burn regression is somewhat artificial: we're drawing random colored rects onto an opaque white dst, so we're heavily biased toward the (d==da) fast path in the serial code. In the vector code there's no short-circuiting and we always pay a fixed cost for ColorBurn regardless of src or dst content. Dodge's fast paths, in contrast, only trigger when (s==sa) or (d==0), neither of which happens any more than randomly in our benchmark. I don't think (d==0) should happen at all. Similarly, the (s==0) Burn fast path is really only going to happen as often as SkRandom allows. In practice, the existing Burn benchmark is hitting its fast path 100% of the time. So I actually feel really great that this only dings the benchmark by 10%. Chrome's still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll lift after finishing the last xfermode, SoftLight. BUG=skia: Review URL: https://codereview.chromium.org/1214443002
* add/fix copyrightsGravatar reed2015-06-26
| | | | | | | BUG=skia: TBR= Review URL: https://codereview.chromium.org/1212393002
* What did we learn today? 255 != 256Gravatar mtklein2015-06-25
| | | | | | | | | vcvt_n_f32_u32 and _u32_f32 work in power-of-2 fixed point, so (...,8) meant 'please multiply or divide by 256'. We need to use 255. :( BUG=skia: Review URL: https://codereview.chromium.org/1204363002
* Convert SkPMFloat to [0,1] range and prune its API.Gravatar mtklein2015-06-25
| | | | | | | | | | | | | | | | | | | | | | | | Now that Sk4px exists, there's a lot less sense in eeking out every cycle of speed from SkPMFloat: if we need to go _really_ fast, we should use Sk4px. SkPMFloat's going to be used for things that are already slow: large-range intermediates, divides, sqrts, etc. A [0,1] range is easier to work with, and can even be faster if we eliminate enough *255 and *1/255 steps. This is particularly true on ARM, where NEON can do the *255 and /255 steps for us while converting float<->int. We have lots of experimental SkPMFloat <-> SkPMColor APIs that I'm now removing. Of the existing APIs, roundClamp() is the sanest, so I've kept only that, now called round(). The 4-at-a-time APIs never panned out, so they're gone. There will be small diffs on: colormatrix coloremoji colorfilterimagefilter fadefilter imagefilters_xfermodes imagefilterscropexpand imagefiltersgraph tileimagefilter BUG=skia: Review URL: https://codereview.chromium.org/1201343004
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1196713004
* Revert of Implement four more xfermodes with Sk4px. (patchset #16 id:290001 ↵Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of https://codereview.chromium.org/1196713004/) Reason for revert: 64-bit ARM build failures. Original issue's description: > Implement four more xfermodes with Sk4px. > > HardLight, Overlay, Darken, and Lighten are all > ~2x faster with SSE, ~25% faster with NEON. > > This covers all previously-implemented NEON xfermodes. > 3 previous SSE xfermodes remain. Those need division > and sqrt, so I'm planning on using SkPMFloat for them. > It'll help the readability and NEON speed if I move that > into [0,1] space first. > > The main new concept here is c.thenElse(t,e), which behaves like > (c ? t : e) except, of course, both t and e are evaluated. This allows > us to emulate conditionals with vectors. > > This also removes the concept of SkNb. Instead of a standalone bool > vector, each SkNi or SkNf will just return their own types for > comparisons. Turns out to be a lot more manageable this way. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1205703008
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Review URL: https://codereview.chromium.org/1196713004
* Use vmulq_n_u32(..., 0x01010101) to distribute alphas.Gravatar mtklein2015-06-22
| | | | | | | | | | | | | | | | | | | | | | | This seems to make alphas() faster and Load[24]Alphas() no slower. The change is particularly noticeable on xfermodes that call alphas() twice (on src and dst), with a 10-12% speedup. Xfermode_Difference_aa 29ms -> 28.4ms 0.98x Xfermode_DstATop_aa 27.2ms -> 26.7ms 0.98x Xfermode_Xor_aa 27.2ms -> 26.5ms 0.98x Xfermode_DstOver 23.6ms -> 22.9ms 0.97x Xfermode_DstOver_aa 27.8ms -> 26.8ms 0.96x Xfermode_DstOut 22.6ms -> 21.7ms 0.96x Xfermode_Multiply_aa 30ms -> 28.5ms 0.95x Xfermode_DstOut_aa 26.1ms -> 24.8ms 0.95x Xfermode_DstIn_aa 25.4ms -> 24.1ms 0.95x Xfermode_DstATop 28.7ms -> 26ms 0.9x Xfermode_Multiply 35.5ms -> 31.3ms 0.88x Xfermode_Difference 31.8ms -> 27.7ms 0.87x Xfermode_Xor 30.1ms -> 26.1ms 0.87x BUG=skia: Review URL: https://codereview.chromium.org/1203513002
* Update some Sk4px APIs.Gravatar mtklein2015-06-22
| | | | | | | | | | | | | | | Mostly this is about ergonomics, making it easier to do good operations and hard / impossible to do bad ones. - SkAlpha / SkPMColor constructors become static factories. - Remove div255TruncNarrow(), rename div255RoundNarrow() to div255(). In practice we always want to round, and the narrowing to 8-bit is contextually obvious. - Rename fastMulDiv255Round() approxMulDiv255() to stress it's approximate-ness over its speed. Drop Round for the same reason as above... we should always round. - Add operator overloads so we don't have to keep throwing in seemingly-random Sk4px() or Sk4px::Wide() casts. - use operator*() for 8-bit x 8-bit -> 16-bit math. It's always what we want, and there's generally no 8x8->8 alternative. - MapFoo can take a const Func&. Don't think it makes a big difference, but nice to do. BUG=skia: Review URL: https://codereview.chromium.org/1202013002
* Plumb through out_row byte length so we can assert we stay underneath it.Gravatar mtklein2015-06-18
| | | | | | | | Sadly, not asserting for me yet. Can't hurt. BUG=chromium:491660 Review URL: https://codereview.chromium.org/1187173005
* switch bitmapshader internals over to pixmapGravatar reed2015-06-04
| | | | | | | BUG=skia: NOTRY=True Review URL: https://codereview.chromium.org/1158273007
* Everyone gets a namespace {}.Gravatar mtklein2015-05-22
| | | | | | | | | | | | | | | | | | If we include Sk4px.h, SkPMFloat.h, or SkNx.h into files with different SIMD flags, that could cause different definitions of the same method. Normally that's moot, because all the code inlines, but in Debug it tends not to. So in Debug, the linker picks one definition for us. That breaks _someone_. Wrapping everything in a namespace {} keeps the definitions separate. Tested locally, it fixes this bug. BUG=skia:3861 This code is not yet enabled in Chrome, so shouldn't affect the roll. NOTREECHECKS=true Review URL: https://codereview.chromium.org/1154523004
* Move Sk4px Xfermode code to a header so we can use it twice.Gravatar mtklein2015-05-22
| | | | | | | | | | | | | | | | - Once in SkXfermode as usual to pick up compile-time SSE and NEON - Once in SkXfermode_arm_neon to pick up run-time NEON This allows us to start cleaning up SkXfermode_arm_neon as we've done for SkXfermode_SSE2. I'm saving this catharsis for a day when I need it. The Sk4px xfermodes are generally faster than the existing NEON procs, so this should also have the side effect of a perf win there. This means our new Plus-AA code works for runtime NEON too. BUG=skia:3852 Review URL: https://codereview.chromium.org/1150313003
* Re-proc SkBlitRow::Color32 for ARM.Gravatar mtklein2015-05-22
| | | | | | | | | | This is a spiritual revert of http://crrev.com/1104183004. BUG=skia: Committed: https://skia.googlesource.com/skia/+/4e13a23d8f720e17660f26657b45b89fe4339004 Review URL: https://codereview.chromium.org/1145283003
* Revert of Re-proc SkBlitRow::Color32 for ARM. (patchset #3 id:40001 of ↵Gravatar mtklein2015-05-22
| | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1145283003/) Reason for revert: http://build.chromium.org/p/tryserver.chromium.mac/builders/ios_rel_device_ninja/builds/70016/steps/compile%20%28with%20patch%29/logs/stdio Original issue's description: > Re-proc SkBlitRow::Color32 for ARM. > > This is a spiritual revert of http://crrev.com/1104183004. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/4e13a23d8f720e17660f26657b45b89fe4339004 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1157633003
* Re-proc SkBlitRow::Color32 for ARM.Gravatar mtklein2015-05-21
| | | | | | | | This is a spiritual revert of http://crrev.com/1104183004. BUG=skia: Review URL: https://codereview.chromium.org/1145283003
* Clean up Sk4f xfermodes and covered _SSE2 xfermodes.Gravatar mtklein2015-05-21
| | | | | | | | Before I get going on fixing Plus, it's nice to clear out the dead cruft. BUG=skia:3852 Review URL: https://codereview.chromium.org/1150833003
* Sk4px: Difference and ExclusionGravatar mtklein2015-05-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This will cause minor (off-by-one) diffs due to a little lost precision: colortype_xfermodes mixed_xfermodes xfermodes2 xfermodeimagefilter xfermodes3 xfermodes Desktop: Xfermode_Difference_aa 9.77ms -> 7.32ms 0.75x Xfermode_Exclusion_aa 8.49ms -> 6.21ms 0.73x Xfermode_Difference 17ms -> 7.54ms 0.44x Xfermode_Exclusion 13.5ms -> 5.09ms 0.38x N7: Xfermode_Difference_aa 32.2ms -> 27.6ms 0.86x Xfermode_Difference 43.9ms -> 32ms 0.73x Xfermode_Exclusion_aa 40.5ms -> 26.7ms 0.66x Xfermode_Exclusion 71.5ms -> 23.9ms 0.33x This wraps up the xfermodes implemented in Sk4f. BUG=skia: Review URL: https://codereview.chromium.org/1141213002
* add Min to SkNi, specialized for u8 and u16 on SSE and NEONGravatar mtklein2015-05-14
| | | | | | | | | | | 0x8001 / 0x7fff don't seem to work, but we were close: 0x8000 does. I plan to use this to implement the Difference xfermode, and it seems generally handy. BUG=skia: Review URL: https://codereview.chromium.org/1133933004
* Sk4px: alphas() and Load[24]Alphas()Gravatar mtklein2015-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alphas() extracts the 4 alphas from an existing Sk4px as another Sk4px. LoadNAlphas() constructs an Sk4px from N packed alphas. In both cases, we end up with 4x repeated alphas aligned with their pixels. alphas() A0 R0 G0 B0 A1 R1 G1 B1 A2 R2 G2 B2 A3 R3 G3 B3 -> A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 Load4Alphas() A0 A1 A2 A3 -> A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 Load2Alphas() A0 A1 -> A0 A0 A0 A0 A1 A1 A1 A1 0 0 0 0 0 0 0 0 This is a 5-10% speedup for AA on Intel, and wash on ARM. AA is still mostly dominated by the final lerp. alphas() isn't used yet, but it's similar enough to Load[24]Alphas() that it was easier to write all at once. BUG=skia: Review URL: https://codereview.chromium.org/1138333003
* Turn on Sk4px xfermodes when we have NEON too.Gravatar mtklein2015-05-13
| | | | | | | | | | | For SSE, Sk4px is better than Sk4f is better than SkXfermodes_opts_SSE2 (where implemented). For NEON, Sk4px is better than SkXfermodes_opts_arm_neon is better than Sk4f (where implemented). This is a 1.6-1.9x speedup for Plus,Modulate, and Screen for NEON. BUG=skia: Review URL: https://codereview.chromium.org/1128053004
* Plus xfermode using Sk4px.Gravatar mtklein2015-05-12
| | | | | | | | | | | | | | | | | | | Xfermode_Plus runs 4-5x faster. We expect mixed_xfermodes to have a small diff. This is because kFoldCoverageIntoSrcAlpha was incorrectly set to true. This implementation handily beats the Sk4f impl, the portable impl, and the existing SSE2 impl. Reading the SkXfermodes_opts_SSE2.cpp file, I'm pretty confident that we'll be able to beat all SSE2 impls. I believe this impl will beat or match the existing NEON impl too, but that may not be true for more complicated xfermodes. They can take advantage of transposing ARGBARGB... to AAAARRRR.... cheaply and I haven't figured out an abstraction for that yet that doesn't screw SSE. Adds: - MapDstSrc() to Sk4px - saturatedAdd() to SkNi (only implemented as far as it's used). - div255Narrow() BUG=skia: Review URL: https://codereview.chromium.org/1138893002
* Sk4pxGravatar mtklein2015-05-12
| | | | | | | | | | Xfermode_SrcOver: SSE: 2.08ms -> 2.03ms (~2% faster) NEON: my N5 is noisy, but there appears to be no perf change BUG=skia: Review URL: https://codereview.chromium.org/1132273004
* We don't use boxBlurY.Gravatar mtklein2015-05-07
| | | | | | | | Also noticed nobody sets SK_DISABLE_BLUR_DIVISION_OPTIMIZATION. BUG=skia: Review URL: https://codereview.chromium.org/1134513003
* Really use SSE4 (and SSSE3) in SkBlurImage_SSE4Gravatar mtklein2015-05-06
| | | | | | | | | | | | We don't seem to be making good use of the available instruction set. SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and SSSE3 gave us an easy way to do the reverse. This should be bit-perfect and about a 10% speedup. BUG=skia: Review URL: https://codereview.chromium.org/1123263003
* De-proc Color32Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed. Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it. Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together. BUG=skia: Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b Committed: https://skia.googlesource.com/skia/+/d65dc0cedd5b50dd407b6ff8fdc39123f11511cc CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Mips-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1104183004
* Revert of De-proc Color32 (patchset #5 id:80001 of ↵Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1104183004/) Reason for revert: duh Original issue's description: > De-proc Color32 > > Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, > which is no longer needed. > > Seems handy to have SkTypes include the relevant intrinsics when > we know we've got them, but I'm not married to it. > > Locally this looks like a pointlessly small perf win, but I'm mostly > keen to get all the code together. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b > > Committed: https://skia.googlesource.com/skia/+/d65dc0cedd5b50dd407b6ff8fdc39123f11511cc TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1102363006
* De-proc Color32Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed. Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it. Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together. BUG=skia: Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b Review URL: https://codereview.chromium.org/1104183004
* Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARMGravatar mtklein2015-04-27
| | | | | | | | | | | | This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. BUG=skia: Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297 CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1109913002
* Revert of De-proc Color32 (patchset #4 id:60001 of ↵Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1104183004/) Reason for revert: MIPS Original issue's description: > De-proc Color32 > > Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, > which is no longer needed. > > Seems handy to have SkTypes include the relevant intrinsics when > we know we've got them, but I'm not married to it. > > Locally this looks like a pointlessly small perf win, but I'm mostly > keen to get all the code together. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1108163002
* De-proc Color32Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed. Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it. Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together. BUG=skia: Review URL: https://codereview.chromium.org/1104183004
* Revert of Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision ↵Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | | | | on ARM (patchset #2 id:20001 of https://codereview.chromium.org/1109913002/) Reason for revert: arm64 typos Original issue's description: > Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARM > > This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1105233003
* Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARMGravatar mtklein2015-04-27
| | | | | | | | This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. BUG=skia: Review URL: https://codereview.chromium.org/1109913002
* Mike's radial gradient CL with better float -> int.Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) This looks quite launchable. radial_gradient3, min of 100 samples: N5: 985µs -> 946µs MBP: 395µs -> 279µs On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot,Build-Ubuntu-GCC-Arm7-Release-Android_NoNeon-Trybot Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec Review URL: https://codereview.chromium.org/1109643002
* Revert of Mike's radial gradient CL with better float -> int. (patchset #7 ↵Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | id:120001 of https://codereview.chromium.org/1109643002/) Reason for revert: compile failures. Original issue's description: > Mike's radial gradient CL with better float -> int. > > patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) > > This looks quite launchable. radial_gradient3, min of 100 samples: > N5: 985µs -> 946µs > MBP: 395µs -> 279µs > > On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? > > BUG=skia: > > CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot > > Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec TBR=reed@google.com,robertphillips@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1109883003
* Mike's radial gradient CL with better float -> int.Gravatar mtklein2015-04-27
| | | | | | | | | | | | | | | | patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) This looks quite launchable. radial_gradient3, min of 100 samples: N5: 985µs -> 946µs MBP: 395µs -> 279µs On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot Review URL: https://codereview.chromium.org/1109643002
* Update more directories under src/ to follow C++11 style rule for ↵Gravatar tfarina2015-04-27
| | | | | | | | | | | | | | | | | | | | | {virtual,override}. The Google style guide states that only one of {virtual,override,final} should be used for each declaration, since override implies virtual and final implies both virtual and override. The entries were found using the following command line: $ find src/ -iname "*.h" -o -iname "*.cpp" | xargs pcregrep -M "[^\n/]+virtual\ [^;{]+\ [a-zA-Z0-9_]+\([^;{]+\ override[ \n]*[;{]" The regex was a courtesy of nick@chromium.org BUG=None R=mtklein@google.com NOPRESUBMIT=true Review URL: https://codereview.chromium.org/1086143003
* Revert of Convert Color32 code to perfect blend. (patchset #6 id:100001 of ↵Gravatar mtklein2015-04-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1098913002/) Reason for revert: Xfermode_SrcOver not looking encouraging. Up to 50% regressions. https://perf.skia.org/#3242 Original issue's description: > Convert Color32 code to perfect blend. > > Before we commit to blend_256_round_alt, let's make sure blend_perfect is > really slower in practice (i.e. regresses on perf.skia.org). > > blend_perfect is really the most desirable algorithm if we can afford it. Not > only is it correct, but it's easy to think about and break into correct pieces: > for instance, its div255() doesn't require any coordination with the multiply. > > This looks like a 30% hit according to microbenches. That said, microbenches > said my previous change would be a 20-25% perf improvement, but it didn't end > up showing a significant effect at a high level. > > As for correctness, I see a bunch of off-by-1 compared to blend_256_round_alt > (exactly what we'd expect), and one off-by-3 in a GM that looks like it has a > bunch of overdraw. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/61221e7f87a99765b0e034020e06bb018e2a08c2 TBR=reed@google.com,fmalita@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1083923006
* Convert Color32 code to perfect blend.Gravatar mtklein2015-04-20
| | | | | | | | | | | | | | | | | | | | | Before we commit to blend_256_round_alt, let's make sure blend_perfect is really slower in practice (i.e. regresses on perf.skia.org). blend_perfect is really the most desirable algorithm if we can afford it. Not only is it correct, but it's easy to think about and break into correct pieces: for instance, its div255() doesn't require any coordination with the multiply. This looks like a 30% hit according to microbenches. That said, microbenches said my previous change would be a 20-25% perf improvement, but it didn't end up showing a significant effect at a high level. As for correctness, I see a bunch of off-by-1 compared to blend_256_round_alt (exactly what we'd expect), and one off-by-3 in a GM that looks like it has a bunch of overdraw. BUG=skia: Review URL: https://codereview.chromium.org/1098913002
* Rework SSE and NEON Color32 algorithms to be more correct and faster.Gravatar mtklein2015-04-17
| | | | | | | | | | | | | | | | | | | | This algorithm changes the blend math, guarded by SK_LEGACY_COLOR32_MATH. The new math is more correct: it's never off by more than 1, and correct in all the interesting 0x00 and 0xFF edge cases, where the old math was never off by more than 2, and not always correct on the edges. If you look at tests/BlendTest.cpp, the old code was using the `blend_256_plus1_trunc` algorithm, while the new code uses `blend_256_round_alt`. Neither uses `blend_perfect`, which is about ~35% slower than `blend_256_round_alt`. This will require an unfathomable number of rebaselines, first to Skia, then to Blink when I remove the guard. I plan to follow up with some integer SIMD abstractions that can unify these two implementations into a single algorithm. This was originally what I was working on here, but the correctness gains seem to be quite compelling. The only places these two algorithms really differ greatly now is the kernel function, and even there they can really both be expressed abstractly as: - multiply 8-bits and 8-bits producing 16-bits - add 16-bits to 16-bits, returning the top 8 bits. All the constants are the same, except SSE is a little faster to keep 8 16-bit inverse alphas, NEON's a little faster to keep 8 8-bit inverse alphas. I may need to take this small speed win back to unify the two. We should expect a ~25% speedup on Intel (mostly from unrolling to 8 pixels) and a ~20% speedup on ARM (mostly from using vaddhn to add `color`, round, and narrow back down to 8-bit all into one instruction. (I am probably missing several more related bugs here.) BUG=skia:3738,skia:420,chromium:111470 Review URL: https://codereview.chromium.org/1092433002
* Sk4h and Sk8h for SSEGravatar mtklein2015-04-14
| | | | | | | | | | These will underly the SkPMFloat-like class for uint16_t components. Sk4h will back a single-pixel version, and Sk8h any larger number than that. BUG=skia: Review URL: https://codereview.chromium.org/1088883005
* Rename SkNi to SkNb.Gravatar mtklein2015-04-14
| | | | | | | | | | | | | | | | As used today, SkNi is used in bool-y contexts. This keeps that, but under a new name, SkNb. This makes room for a new SkNi that's focused on integer-y things like loads, stores, arithmetic, etc. The main reason to split these is that we want different specializations for each use case: for bools, it's important for us to specialize 32- and 64-bit to support efficient float- and double- comparisons, but for integer work we're more likely to be looking at 8- and 16- bit lanes. Keeping these use cases siloed helps me manage the compexity of the backend NEON and SSE code. BUG=skia: Review URL: https://codereview.chromium.org/1083123002
* Replace NEON assembly memset16 and memset32 with intrinsic versions.Gravatar mtklein2015-04-10
| | | | | | | | | | | According to bench/MemsetBench.cpp, I've got them somewhere between 10% slower and a percent or two faster than the old assembly. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot Review URL: https://codereview.chromium.org/1075003002
* Remove ARM assembly memsets.Gravatar mtklein2015-04-09
| | | | | | | | | | | Step 1 of a zillion in the quest for NEON on iOS, and step 1 of a different zillion in the Great Assembly Purge. ios, arm, arm64, arm_v7, arm_v7_neon all build. BUG=skia: Review URL: https://codereview.chromium.org/1072063002