aboutsummaryrefslogtreecommitdiffhomepage
path: root/src/core/Sk4px.h
Commit message (Collapse)AuthorAge
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1196713004
* Revert of Implement four more xfermodes with Sk4px. (patchset #16 id:290001 ↵Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of https://codereview.chromium.org/1196713004/) Reason for revert: 64-bit ARM build failures. Original issue's description: > Implement four more xfermodes with Sk4px. > > HardLight, Overlay, Darken, and Lighten are all > ~2x faster with SSE, ~25% faster with NEON. > > This covers all previously-implemented NEON xfermodes. > 3 previous SSE xfermodes remain. Those need division > and sqrt, so I'm planning on using SkPMFloat for them. > It'll help the readability and NEON speed if I move that > into [0,1] space first. > > The main new concept here is c.thenElse(t,e), which behaves like > (c ? t : e) except, of course, both t and e are evaluated. This allows > us to emulate conditionals with vectors. > > This also removes the concept of SkNb. Instead of a standalone bool > vector, each SkNi or SkNf will just return their own types for > comparisons. Turns out to be a lot more manageable this way. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1205703008
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Review URL: https://codereview.chromium.org/1196713004
* Update some Sk4px APIs.Gravatar mtklein2015-06-22
| | | | | | | | | | | | | | | Mostly this is about ergonomics, making it easier to do good operations and hard / impossible to do bad ones. - SkAlpha / SkPMColor constructors become static factories. - Remove div255TruncNarrow(), rename div255RoundNarrow() to div255(). In practice we always want to round, and the narrowing to 8-bit is contextually obvious. - Rename fastMulDiv255Round() approxMulDiv255() to stress it's approximate-ness over its speed. Drop Round for the same reason as above... we should always round. - Add operator overloads so we don't have to keep throwing in seemingly-random Sk4px() or Sk4px::Wide() casts. - use operator*() for 8-bit x 8-bit -> 16-bit math. It's always what we want, and there's generally no 8x8->8 alternative. - MapFoo can take a const Func&. Don't think it makes a big difference, but nice to do. BUG=skia: Review URL: https://codereview.chromium.org/1202013002
* Everyone gets a namespace {}.Gravatar mtklein2015-05-22
| | | | | | | | | | | | | | | | | | If we include Sk4px.h, SkPMFloat.h, or SkNx.h into files with different SIMD flags, that could cause different definitions of the same method. Normally that's moot, because all the code inlines, but in Debug it tends not to. So in Debug, the linker picks one definition for us. That breaks _someone_. Wrapping everything in a namespace {} keeps the definitions separate. Tested locally, it fixes this bug. BUG=skia:3861 This code is not yet enabled in Chrome, so shouldn't affect the roll. NOTREECHECKS=true Review URL: https://codereview.chromium.org/1154523004
* sk4px the rest of the easy xfermodes.Gravatar mtklein2015-05-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds and uses fastMulDiv255Round() where possible, which approximates x*y/255 as (x*y+x)/256. Seems like a sizeable speedup, as seen below on Exclusion, Screen, and Modulate. The existing NEON code uses this approximation for {Src,Dst}x{In,Out,Over}, and without it we'd regress speed there. This will require rebaselines whether or not we use this approximation: the x86 bots change if we do, the ARM bots change if we don't. None of the diffs are significant. Desktop: Xfermode_Screen_aa 5.82ms -> 5.54ms 0.95x Xfermode_Modulate_aa 5.67ms -> 5.36ms 0.95x Xfermode_Exclusion_aa 6.18ms -> 5.81ms 0.94x Xfermode_Exclusion 5.03ms -> 4.24ms 0.84x Xfermode_Screen 4.51ms -> 3.59ms 0.8x Xfermode_Modulate 4.2ms -> 3.19ms 0.76x Xfermode_DstOver 6.73ms -> 3.88ms 0.58x Xfermode_SrcOut 6.47ms -> 3.48ms 0.54x Xfermode_SrcIn 6.46ms -> 3.46ms 0.54x Xfermode_DstOut 6.49ms -> 3.41ms 0.52x Xfermode_DstIn 6.5ms -> 3.32ms 0.51x Xfermode_Src_aa 9.53ms -> 4.75ms 0.5x Xfermode_Clear_aa 9.65ms -> 4.8ms 0.5x Xfermode_DstIn_aa 11.5ms -> 5.57ms 0.49x Xfermode_DstOver_aa 11.6ms -> 5.63ms 0.49x Xfermode_SrcOut_aa 11.6ms -> 5.5ms 0.47x Xfermode_SrcIn_aa 11.7ms -> 5.51ms 0.47x Xfermode_DstOut_aa 11.7ms -> 5.4ms 0.46x N7 performance is close enough to 1x that I'm not sure whether this is a net win, net loss, or truly neutral. I figure the bots will show that. I experimented with another approximation, (x*(255-y))/255 ≈ (x*(256-y))/256. This was inconclusive, so I'm leaving it out for now. The remaining modes are the complicated conditional ones. BUG=skia: Review URL: https://codereview.chromium.org/1141953004
* Sk4px: Difference and ExclusionGravatar mtklein2015-05-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This will cause minor (off-by-one) diffs due to a little lost precision: colortype_xfermodes mixed_xfermodes xfermodes2 xfermodeimagefilter xfermodes3 xfermodes Desktop: Xfermode_Difference_aa 9.77ms -> 7.32ms 0.75x Xfermode_Exclusion_aa 8.49ms -> 6.21ms 0.73x Xfermode_Difference 17ms -> 7.54ms 0.44x Xfermode_Exclusion 13.5ms -> 5.09ms 0.38x N7: Xfermode_Difference_aa 32.2ms -> 27.6ms 0.86x Xfermode_Difference 43.9ms -> 32ms 0.73x Xfermode_Exclusion_aa 40.5ms -> 26.7ms 0.66x Xfermode_Exclusion 71.5ms -> 23.9ms 0.33x This wraps up the xfermodes implemented in Sk4f. BUG=skia: Review URL: https://codereview.chromium.org/1141213002
* Sk4px: SrcATop, DstATop, Xor, MultiplyGravatar mtklein2015-05-13
| | | | | | | | | | SSE runs 2-3x faster (than 4f), NEON runs 1.2-1.4x faster (than existing NEON). Small diffs on {aarectmodes, imagefilters_xfermodes, hairmodes, mixed_xfermodes} only on AA edges due to precision drop. BUG=skia: Review URL: https://codereview.chromium.org/1132853005
* Sk4px: alphas() and Load[24]Alphas()Gravatar mtklein2015-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alphas() extracts the 4 alphas from an existing Sk4px as another Sk4px. LoadNAlphas() constructs an Sk4px from N packed alphas. In both cases, we end up with 4x repeated alphas aligned with their pixels. alphas() A0 R0 G0 B0 A1 R1 G1 B1 A2 R2 G2 B2 A3 R3 G3 B3 -> A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 Load4Alphas() A0 A1 A2 A3 -> A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 Load2Alphas() A0 A1 -> A0 A0 A0 A0 A1 A1 A1 A1 0 0 0 0 0 0 0 0 This is a 5-10% speedup for AA on Intel, and wash on ARM. AA is still mostly dominated by the final lerp. alphas() isn't used yet, but it's similar enough to Load[24]Alphas() that it was easier to write all at once. BUG=skia: Review URL: https://codereview.chromium.org/1138333003
* Plus xfermode using Sk4px.Gravatar mtklein2015-05-12
| | | | | | | | | | | | | | | | | | | Xfermode_Plus runs 4-5x faster. We expect mixed_xfermodes to have a small diff. This is because kFoldCoverageIntoSrcAlpha was incorrectly set to true. This implementation handily beats the Sk4f impl, the portable impl, and the existing SSE2 impl. Reading the SkXfermodes_opts_SSE2.cpp file, I'm pretty confident that we'll be able to beat all SSE2 impls. I believe this impl will beat or match the existing NEON impl too, but that may not be true for more complicated xfermodes. They can take advantage of transposing ARGBARGB... to AAAARRRR.... cheaply and I haven't figured out an abstraction for that yet that doesn't screw SSE. Adds: - MapDstSrc() to Sk4px - saturatedAdd() to SkNi (only implemented as far as it's used). - div255Narrow() BUG=skia: Review URL: https://codereview.chromium.org/1138893002
* Sk4pxGravatar mtklein2015-05-12
Xfermode_SrcOver: SSE: 2.08ms -> 2.03ms (~2% faster) NEON: my N5 is noisy, but there appears to be no perf change BUG=skia: Review URL: https://codereview.chromium.org/1132273004