aboutsummaryrefslogtreecommitdiffhomepage
path: root/src/core/Sk4pxXfermode.h
Commit message (Collapse)AuthorAge
* Port SkXfermode opts to SkOpts.hGravatar mtklein2015-07-31
| | | | | | | | | | | | | Renames Sk4pxXfermode.h to SkXfermode_opts.h, and refactors it a tiny bit internally. This moves xfermode optimization from being "compile-time everywhere but NEON" to simply "runtime everywhere". I don't anticipate any effect on perf or correctness. BUG=skia:4117 Review URL: https://codereview.chromium.org/1264543006
* 565 support for SIMD xfermodesGravatar mtklein2015-07-22
| | | | | | | | | | | | | | | | | | | | | | This uses the most basic approach possible: - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors. - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially. Clearly, we can optimize these loads and stores. That's a TODO. The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling. The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster. This will cause minor GM diffs, but I don't think any layout test changes. BUG=skia: Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0 Committed: https://skia.googlesource.com/skia/+/860dcaa2ddfdadc050af4f943a84a9d499315066 Review URL: https://codereview.chromium.org/1245673002
* Revert of 565 support for SIMD xfermodes (patchset #4 id:60001 of ↵Gravatar mtklein2015-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1245673002/) Reason for revert: NEON 565 gold images have gone ugly. This is what I get for writing and testing SSE and just writing NEON. E.g. colortype_xfermodes, dstreadshuffle, bigbitmaprect, pictures, textbloblooper, aaxfermodes (only Plus) Original issue's description: > 565 support for SIMD xfermodes > > This uses the most basic approach possible: > - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors. > - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially. > > Clearly, we can optimize these loads and stores. That's a TODO. > > The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling. > > The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster. > > This will cause minor GM diffs, but I don't think any layout test changes. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0 > > Committed: https://skia.googlesource.com/skia/+/860dcaa2ddfdadc050af4f943a84a9d499315066 TBR=msarett@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1248893004
* 565 support for SIMD xfermodesGravatar mtklein2015-07-21
| | | | | | | | | | | | | | | | | | | | This uses the most basic approach possible: - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors. - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially. Clearly, we can optimize these loads and stores. That's a TODO. The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling. The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster. This will cause minor GM diffs, but I don't think any layout test changes. BUG=skia: Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0 Review URL: https://codereview.chromium.org/1245673002
* De-templatize Sk4pxXfermode code a bit.Gravatar mtklein2015-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This deduplicates a few pieces of code: - we end up with one copy of each xfer32() driver loop instead of one per xfermode; - we end up with two* copies of each xfermode implementation instead of ten**. * For a given Mode: Mode() itself and xfer_aa<Mode>(). ** From unrolling: twice at a stride of 8, once at 4, once at 2, and once at 1, then all again for when we have AA. This decreases the size of SkXfermode.o from 1.5M to 620K on x86-64 and from 1.3M to 680K on ARMv7+NEON. If we wanted to, we could eliminate the xfer_aa<Mode>() copy by tagging each Mode() function as __attribute__((noinline)) or its equivalent. This would result in another ~100K space savings. Performance is affected in proportion to the original xfermode speed: fast modes like Plus take the largest proportional hit, and slow modes like HardLight or SoftLight see essentially no hit at all. This adds SK_VECTORCALL to help keep this code fast on ARMv7 and Windows. I've looked at the ARMv7 generated code... it looks good, even pretty. For compatibility with SK_VECTORCALL, we now pass the vector-sized arguments by value instead of by reference. Some refactoring now allows us to declare each mode as just a static function instead of a struct, which simplifies things. TBR=reed@google.com No public API changes. BUG=skia: Committed: https://skia.googlesource.com/skia/+/e617e1525916d7ee684142728c0905828caf49da CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm7-Debug-Android_NoNeon-Trybot Review URL: https://codereview.chromium.org/1242743004
* Revert of De-templatize Sk4pxXfermode code a bit. (patchset #2 id:20001 of ↵Gravatar mtklein2015-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1242743004/) Reason for revert: http://build.chromium.org/p/client.skia.compile/builders/Build-Ubuntu-GCC-Arm7-Debug-Android_NoNeon/builds/1168/steps/build%20most/logs/stdio Original issue's description: > De-templatize Sk4pxXfermode code a bit. > > This deduplicates a few pieces of code: > - we end up with one copy of each xfer32() driver loop instead of one per xfermode; > - we end up with two* copies of each xfermode implementation instead of ten**. > > * For a given Mode: Mode() itself and xfer_aa<Mode>(). > ** From unrolling: twice at a stride of 8, once at 4, once at 2, and once at 1, then all again for when we have AA. > > This decreases the size of SkXfermode.o from 1.5M to 620K on x86-64 and from 1.3M to 680K on ARMv7+NEON. > > If we wanted to, we could eliminate the xfer_aa<Mode>() copy by tagging each Mode() function as __attribute__((noinline)) or its equivalent. This would result in another ~100K space savings. > > Performance is affected in proportion to the original xfermode speed: > fast modes like Plus take the largest proportional hit, and slow modes > like HardLight or SoftLight see essentially no hit at all. > > This adds SK_VECTORCALL to help keep this code fast on ARMv7 and Windows. I've looked at the ARMv7 generated code... it looks good, even pretty. > > For compatibility with SK_VECTORCALL, we now pass the vector-sized arguments by value instead of by reference. Some refactoring now allows us to declare each mode as just a static function instead of a struct, which simplifies things. > > TBR=reed@google.com > No public API changes. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/e617e1525916d7ee684142728c0905828caf49da TBR=msarett@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1245273005
* De-templatize Sk4pxXfermode code a bit.Gravatar mtklein2015-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This deduplicates a few pieces of code: - we end up with one copy of each xfer32() driver loop instead of one per xfermode; - we end up with two* copies of each xfermode implementation instead of ten**. * For a given Mode: Mode() itself and xfer_aa<Mode>(). ** From unrolling: twice at a stride of 8, once at 4, once at 2, and once at 1, then all again for when we have AA. This decreases the size of SkXfermode.o from 1.5M to 620K on x86-64 and from 1.3M to 680K on ARMv7+NEON. If we wanted to, we could eliminate the xfer_aa<Mode>() copy by tagging each Mode() function as __attribute__((noinline)) or its equivalent. This would result in another ~100K space savings. Performance is affected in proportion to the original xfermode speed: fast modes like Plus take the largest proportional hit, and slow modes like HardLight or SoftLight see essentially no hit at all. This adds SK_VECTORCALL to help keep this code fast on ARMv7 and Windows. I've looked at the ARMv7 generated code... it looks good, even pretty. For compatibility with SK_VECTORCALL, we now pass the vector-sized arguments by value instead of by reference. Some refactoring now allows us to declare each mode as just a static function instead of a struct, which simplifies things. TBR=reed@google.com No public API changes. BUG=skia: Review URL: https://codereview.chromium.org/1242743004
* Revert of 565 support for SIMD xfermodes (patchset #3 id:40001 of ↵Gravatar mtklein2015-07-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1245673002/) Reason for revert: 942930d (included in this roll) introduced a 140 kB sizes regression in libskia.so. Please investigate and reland if this regression is necessary. Original issue's description: > 565 support for SIMD xfermodes > > This uses the most basic approach possible: > - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors. > - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially. > > Clearly, we can optimize these loads and stores. That's a TODO. > > The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling. > > The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster. > > This will cause minor GM diffs, but I don't think any layout test changes. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/942930dcaa51f66d82cdaf46ae62efebd16c8cd0 TBR=msarett@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1242973004
* 565 support for SIMD xfermodesGravatar mtklein2015-07-20
| | | | | | | | | | | | | | | | | | This uses the most basic approach possible: - to load an Sk4px from 565, convert to SkPMColors on the stack serially then load those SkPMColors. - to store an Sk4px to 565, store to SkPMColors on the stack then convert to 565 serially. Clearly, we can optimize these loads and stores. That's a TODO. The code using SkPMFloat is the same idea but a little more long-term viable, as we're only operating on one pixel at a time anyway. We could probably write 565 <-> SkPMFloat methods, but I'd rather not until it's really compelling. The speedups are varied but similar across SSE and NEON: a few uninteresting, many 50% faster, some 2x faster, and SoftLight ~4x faster. This will cause minor GM diffs, but I don't think any layout test changes. BUG=skia: Review URL: https://codereview.chromium.org/1245673002
* Clean up dead xfermode opts code.Gravatar mtklein2015-07-20
| | | | | | | | | | | | | Now that SK_SUPPORT_LEGACY_XFERMODES is unused, tons of code becomes dead. Nothing is needed in opts/ anymore for x86. We still do runtime NEON detection, which just duplicates Sk4pxXfermode. TBR=reed@google.com BUG=skia: Review URL: https://codereview.chromium.org/1230023011
* Fix buggy blend modes.Gravatar mtklein2015-07-14
| | | | | | | | | | | | The problem turns out to be not over- or underflow like I thought, but that I used different math for x*y/255 for alphas and colors, sometimes resulting in colors where alpha was one less than the maximum color component, which is not a valid SkPMColor. To be safe, I've switched over all four of these similar complex modes to use exact math everywhere. They now match the byte-by-byte code in SkXfermode.cpp exactly. This will slow down Darken and Lighten by about 2x. I plan to follow up with a CL to see if I can eek out some speed there, and another CL to add asserts that Sk4px code always writes valid SkPMColors. BUG=skia:4052 Review URL: https://codereview.chromium.org/1241683003
* 3-15% speedup to HardLight / Overlay xfermodes.Gravatar mtklein2015-07-14
| | | | | | | | | | | | | | | While investigating my bug (skia:4052) I saw this TODO and figured it'd make me feel better about an otherwise unsuccessful investigation. This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls to 2 cheap comparisons and 1 expensive div255(). NEON speeds up by a more modest ~3%. BUG=skia: Review URL: https://codereview.chromium.org/1230663005
* SoftLight with SkPMFloatGravatar mtklein2015-06-29
| | | | | | | | | | | | | | | | | | SSE speeds up about 4.5x over existing integer SSE, NEON speeds up about 3x over serial integer code. We expect 1-2 bit component diffs in the usual GMs. Still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll now try to lift in Chrome. BUG=skia: Committed: https://skia.googlesource.com/skia/+/3e47d49b46b3ab62071218ef3dd44642c9713e04 CQ_EXTRA_TRYBOTS=client.skia:Test-ChromeOS-GCC-Daisy-CPU-NEON-Arm7-Debug-Trybot Review URL: https://codereview.chromium.org/1221493002
* Edges matter, part 2.Gravatar mtklein2015-06-29
| | | | | | | | | | | | | Affected modes: lighten, hard-light, overlay (== hard-light). This fixes a couple places where I used < when I should have used <=, or swapped the logic as I've done here. Caught by layout tests; our tests should be unchanged. https://storage.googleapis.com/chromium-layout-test-archives/linux_blink_rel/68935/layout-test-results/css3/blending/background-blend-mode-crossfade-image-gradient-diffs.html BUG=skia: Review URL: https://codereview.chromium.org/1217013003
* Revert of SoftLight with SkPMFloat (patchset #6 id:100001 of ↵Gravatar mtklein2015-06-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://codereview.chromium.org/1221493002/) Reason for revert: xfermodes and xfermodes2 show major diffs on Nexus 5 and Daisy (both ARMv7 w/NEON). Nexus 9 and SSE all look fine... Original issue's description: > SoftLight with SkPMFloat > > SSE speeds up about 4.5x over existing integer SSE, > NEON speeds up about 3x over serial integer code. > > We expect 1-2 bit component diffs in the usual GMs. > > Still guarded by SK_SUPPORT_LEGACY_XFERMODES, > which I'll now try to lift in Chrome. > > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/3e47d49b46b3ab62071218ef3dd44642c9713e04 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1221683002
* SoftLight with SkPMFloatGravatar mtklein2015-06-29
| | | | | | | | | | | | | | SSE speeds up about 4.5x over existing integer SSE, NEON speeds up about 3x over serial integer code. We expect 1-2 bit component diffs in the usual GMs. Still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll now try to lift in Chrome. BUG=skia: Review URL: https://codereview.chromium.org/1221493002
* Color dodge and burn with SkPMFloat.Gravatar mtklein2015-06-26
| | | | | | | | | | | | | | | | | Both 25-35% faster with SSE. With NEON, Burn measures as a ~10% regression, Dodge a huge 2.9x improvement. The Burn regression is somewhat artificial: we're drawing random colored rects onto an opaque white dst, so we're heavily biased toward the (d==da) fast path in the serial code. In the vector code there's no short-circuiting and we always pay a fixed cost for ColorBurn regardless of src or dst content. Dodge's fast paths, in contrast, only trigger when (s==sa) or (d==0), neither of which happens any more than randomly in our benchmark. I don't think (d==0) should happen at all. Similarly, the (s==0) Burn fast path is really only going to happen as often as SkRandom allows. In practice, the existing Burn benchmark is hitting its fast path 100% of the time. So I actually feel really great that this only dings the benchmark by 10%. Chrome's still guarded by SK_SUPPORT_LEGACY_XFERMODES, which I'll lift after finishing the last xfermode, SoftLight. BUG=skia: Review URL: https://codereview.chromium.org/1214443002
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1196713004
* Revert of Implement four more xfermodes with Sk4px. (patchset #16 id:290001 ↵Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of https://codereview.chromium.org/1196713004/) Reason for revert: 64-bit ARM build failures. Original issue's description: > Implement four more xfermodes with Sk4px. > > HardLight, Overlay, Darken, and Lighten are all > ~2x faster with SSE, ~25% faster with NEON. > > This covers all previously-implemented NEON xfermodes. > 3 previous SSE xfermodes remain. Those need division > and sqrt, so I'm planning on using SkPMFloat for them. > It'll help the readability and NEON speed if I move that > into [0,1] space first. > > The main new concept here is c.thenElse(t,e), which behaves like > (c ? t : e) except, of course, both t and e are evaluated. This allows > us to emulate conditionals with vectors. > > This also removes the concept of SkNb. Instead of a standalone bool > vector, each SkNi or SkNf will just return their own types for > comparisons. Turns out to be a lot more manageable this way. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/b9d4163bebab0f5639f9c5928bb5fc15f472dddc TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1205703008
* Implement four more xfermodes with Sk4px.Gravatar mtklein2015-06-24
| | | | | | | | | | | | | | | | | | | | | | | HardLight, Overlay, Darken, and Lighten are all ~2x faster with SSE, ~25% faster with NEON. This covers all previously-implemented NEON xfermodes. 3 previous SSE xfermodes remain. Those need division and sqrt, so I'm planning on using SkPMFloat for them. It'll help the readability and NEON speed if I move that into [0,1] space first. The main new concept here is c.thenElse(t,e), which behaves like (c ? t : e) except, of course, both t and e are evaluated. This allows us to emulate conditionals with vectors. This also removes the concept of SkNb. Instead of a standalone bool vector, each SkNi or SkNf will just return their own types for comparisons. Turns out to be a lot more manageable this way. BUG=skia: Review URL: https://codereview.chromium.org/1196713004
* Update some Sk4px APIs.Gravatar mtklein2015-06-22
| | | | | | | | | | | | | | | Mostly this is about ergonomics, making it easier to do good operations and hard / impossible to do bad ones. - SkAlpha / SkPMColor constructors become static factories. - Remove div255TruncNarrow(), rename div255RoundNarrow() to div255(). In practice we always want to round, and the narrowing to 8-bit is contextually obvious. - Rename fastMulDiv255Round() approxMulDiv255() to stress it's approximate-ness over its speed. Drop Round for the same reason as above... we should always round. - Add operator overloads so we don't have to keep throwing in seemingly-random Sk4px() or Sk4px::Wide() casts. - use operator*() for 8-bit x 8-bit -> 16-bit math. It's always what we want, and there's generally no 8x8->8 alternative. - MapFoo can take a const Func&. Don't think it makes a big difference, but nice to do. BUG=skia: Review URL: https://codereview.chromium.org/1202013002
* Simpler version of Plus w/ AA. ~25% faster too.Gravatar mtklein2015-05-22
| | | | | | | | BUG=skia:3852 TBR=fmalita@chromium.org Review URL: https://codereview.chromium.org/1150693003
* Move Sk4px Xfermode code to a header so we can use it twice.Gravatar mtklein2015-05-22
- Once in SkXfermode as usual to pick up compile-time SSE and NEON - Once in SkXfermode_arm_neon to pick up run-time NEON This allows us to start cleaning up SkXfermode_arm_neon as we've done for SkXfermode_SSE2. I'm saving this catharsis for a day when I need it. The Sk4px xfermodes are generally faster than the existing NEON procs, so this should also have the side effect of a perf win there. This means our new Plus-AA code works for runtime NEON too. BUG=skia:3852 Review URL: https://codereview.chromium.org/1150313003