aboutsummaryrefslogtreecommitdiffhomepage
path: root/src/jumper
Commit message (Collapse)AuthorAge
...
* centralize SI, Ctx, and load_and_inc()Gravatar Mike Klein2017-09-15
| | | | | | | | | | | | | | | | | We've got independent definitions of SI, LazyCtx/Ctx, and load_and_inc() in _stages.cpp and _lowp.cpp. It's a good time to centralize them, taking _stages.cpp's SI and load_and_inc(), and _lowp's Ctx. SI and load_and_inc() are uninterestingly different. But using _lowp's Ctx will let us get its prettier typed stage definitions into _stages.cpp, but that is not not done here. This is a pure refactor with no generated code changes. Change-Id: I53260b0fdc71a77bf9e3ed6f3df3a2a4cbd2392b Reviewed-on: https://skia-review.googlesource.com/47181 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* make most of SkColorPriv.h privateGravatar Cary Clark2017-09-15
| | | | | | | | | | | | | created new file src/core/SkColorData.h for internal consumption. Note that many of the functions there are unused as well. Bug: skia: 6898 R: reed@google.com Change-Id: I25bfd5a9c21f53558c4ca65a77eb5d322d897c6d Reviewed-on: https://skia-review.googlesource.com/46848 Commit-Queue: Cary Clark <caryclark@google.com> Reviewed-by: Mike Reed <reed@google.com>
* clean up SK_JUMPER_LEGACY_8BITGravatar Mike Klein2017-09-15
| | | | | | | Change-Id: I4d4093fcfc839f6e7468b7d9f89bb903186ab68d Reviewed-on: https://skia-review.googlesource.com/46761 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* fix android rollGravatar Mike Klein2017-09-14
| | | | | | | | | | | | | | | | Guarding loads of 8-15 with defined(__AVX2__) should prevent errors like these: external/skia/src/jumper/SkJumper_stages_lowp.cpp:287:46: error: 'memcpy' called with size bigger than buffer case 12: memcpy(&v, ptr, 12*sizeof(T)); break; The loads of 8-15 were of course unreachable, given the &(N-1) == &7. Change-Id: Ifcb5c177c6909e1df55cb564779a4d6610ff7b32 Reviewed-on: https://skia-review.googlesource.com/46521 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* grand unifried lowp stagesGravatar Mike Klein2017-09-14
| | | | | | | | | | | | | | | | | | | | | | | | | I have text_16_AA_FF -> 8888 (forcing RP) faster than head now on my laptop. I'm feeling confident that we can make this perform well. After looking at performance a bit more today, it looks like everything is within what I'd consider comparable in performance, especially on ARM. On x86-64 it looks like big bulk blits get a little slower and small mask blits get a little faster. Quality looks good, and maybe improved for 565. There are fewer platform-specific differences now in _lowp, and I think they're few enough now that we could even consider completing the unification by folding the 8-bit and float code together. Rename "div255()" to "rebias()", slap on a few coats of paint... Guarded for Chrome with SK_JUMPER_LEGACY_LOWP. Change-Id: I36309c07cf736f3cb31952cca66030ad56026318 Reviewed-on: https://skia-review.googlesource.com/45982 Reviewed-by: Herb Derby <herb@google.com> Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* update SkJumper stages to clang 5Gravatar Mike Klein2017-09-08
| | | | | | | | | | | | | | | | | | | | To continue building stages, update Clang and update your GN args: $ brew update $ brew upgrade llvm $ find . | grep args.gn | xargs sed -ie 's/clang-4.0/clang-5.0/g' Some interesting codegen changes I noticed: - ARMv7: generally better register assignment, tighter code - ARMv7: dropped the 128-bit alignment hint when loading and storing dst "registers", unclear why. - HSW: now clearing the destination register before vgatherdps, to break a dependency on the previous value Change-Id: I4f804a4cbfcde530fad5ed535438174e852a9593 Reviewed-on: https://skia-review.googlesource.com/44241 Reviewed-by: Florin Malita <fmalita@chromium.org> Reviewed-by: Herb Derby <herb@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* clean up SK_JUMPER_LEGACY_X86_8BITGravatar Mike Klein2017-09-06
| | | | | | | Change-Id: I26c22c085efd70b65de927a9a8a041d03c170f2a Reviewed-on: https://skia-review.googlesource.com/42760 Commit-Queue: Florin Malita <fmalita@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org>
* merge 0,1,2,3,... and 0.5fGravatar Mike Klein2017-09-05
| | | | | | | | | | Because floats are fun, the compiler cannot merge x + 0.5f + [0,1,2,3,4...] into x + [0.5,1.5,2.5,3.5,4.5,...]. But we can. Change-Id: I03b46c1ea0653877f35f6c888f29371b5f73d813 Reviewed-on: https://skia-review.googlesource.com/42480 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* add 8bit stages for load/store 565Gravatar Mike Reed2017-08-31
| | | | | | | | | | | approx 2.5x faster on arm64 for sprite 8888 --> 565 blits Bug: skia: Change-Id: I524f993fee16196385dc07cbec39ef378b1301e5 Reviewed-on: https://skia-review.googlesource.com/41162 Reviewed-by: Florin Malita <fmalita@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Mike Reed <reed@google.com>
* 32-bit x86 8-bit stagesGravatar Mike Klein2017-08-30
| | | | | | | | | | | Shouldn't be anything tricky here. Guarded by SK_JUMPER_LEGACY_X86_8BIT for (Win) layout tests. Change-Id: I7580c7c18d1721f1301904c049ea2e59e9bda5d9 Reviewed-on: https://skia-review.googlesource.com/40692 Reviewed-by: Herb Derby <herb@google.com> Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* remove 8-bit Params structGravatar Mike Klein2017-08-30
| | | | | | | | | | | | | I'm not sure why I wrote this to use a Params struct originally, but we should have plenty of registers in _8bit to pass everything directly and avoid the stack. Even once we enable the 8-bit pipeline on 32-bit x86, we'll have 4 general purpose registers and 4 vector registers to use, precisely what we're using here. Change-Id: I3e51ab73186edcdcb8bfaa6cc99d9516db7c032a Reviewed-on: https://skia-review.googlesource.com/40771 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* improve ARMv7 8-bit codegenGravatar Mike Klein2017-08-30
| | | | | | | | | | | | | | We need to make two changes to keep all values in registers: 1) pass raw U8 values instead of V structs that wrap them 2) switch to aapcs-vfp, which allows exactly 8x U8 arguments Code generation goes from ridiculous looking to lovely. Change-Id: Ibd53bdd86345b59bd987a1f79205645d80c5cbc3 Reviewed-on: https://skia-review.googlesource.com/40021 Commit-Queue: Mike Klein <mtklein@google.com> Reviewed-by: Florin Malita <fmalita@chromium.org>
* fast NEON divide-by-255Gravatar Mike Klein2017-08-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | We can approximate (xy + 127) / 255 with (xy + 255) / 256. On ARM this divide-by-255 is a single instruction, one of the two we use today to do a perfect divide-by-255 (#if 0). This cuts div-255 in half, or a full mul-div-255 by a third. The U16(255) constant can even be created in a single instruction without hitting memory, which is as good as it gets. Here's a nice little example: 0000000000000000 <sk_premul_8bit>: 0: f8408404 ldr x4, [x0], #8 // Load the next stage. 4: 2e23c000 umull v0.8h, v0.8b, v3.8b // r = r * a 8: 6f02e6b0 movi v16.2d, #0xff00ff00ff00ff // create U16(255) c: 2e23c021 umull v1.8h, v1.8b, v3.8b // g = g * a 10: 2e23c042 umull v2.8h, v2.8b, v3.8b // b = b * a 14: 0e304000 addhn v0.8b, v0.8h, v16.8h // r = div255(r) 18: 0e304021 addhn v1.8b, v1.8h, v16.8h // g = div255(g) 1c: 0e304042 addhn v2.8b, v2.8h, v16.8h // b = div255(b) 20: d61f0080 br x4 // JUMP! Change-Id: I4224ed3844abf6c67d9e42b67444a60f4aee8f08 Reviewed-on: https://skia-review.googlesource.com/40121 Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org>
* no more need for a constants pointerGravatar Mike Klein2017-08-29
| | | | | | | | | | | | | | | The only reason we were keeping SkJumper_constants around is that it was hard to get float/integer iota vectors on arm64 without relocations. Now that we're compiling arm64 normally as part of Skia, we don't have to worry about relocations. This means we can kill the struct and stop passing around that pointer. Change-Id: I013c6a735947f3db2bc87f2bfa38b7520d2e2fce Reviewed-on: https://skia-review.googlesource.com/40200 Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Herb Derby <herb@google.com>
* better NEON 8-bit stagesGravatar Mike Klein2017-08-29
| | | | | | | | | | | | | | | | Our interlaced approach works pretty well for x86, but on ARM we're a lot better off deinterlacing in loads and reinterlacing in stores. This leaves the stages mostly looking like the float stages, and cuts out some awkward parts from the code generation. Diffs are all invisible. Performance is noticeably better for some blend modes like Overlay. Change-Id: Ie599e823602bfd14552de78df44a621aea66e1a2 Reviewed-on: https://skia-review.googlesource.com/40100 Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org>
* use NEON 8-bit stages on ARMv7 tooGravatar Mike Klein2017-08-29
| | | | | | | | | | | | | | | | | We don't really use anything very ARMv8 specific in the 8-bit NEON stages, so we can just naturally extend what we're doing to ARMv7 too. Note that unlike the float stages, we're not requiring VFPv4 either, just NEON. VFPv4 is for FMA and F16<->F32 conversion, both of which are unnecessary for the integer pipeline. GMs and perf improvement are similar to the previous ARMv8 change. Change-Id: Id618801ea1920564c1deee144a640a4133c4505f Reviewed-on: https://skia-review.googlesource.com/39840 Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org> Reviewed-by: Herb Derby <herb@google.com>
* Revert "Revert "8-bit jumper on armv8""Gravatar Mike Klein2017-08-29
| | | | | | | | | | | | | | | | | This reverts commit 6d13575108299951ecdfba6d85c915fcec2bc028. Now with guards for "errors" like this: external/skia/src/jumper/SkJumper_stages_8bit.cpp:240:50: error: 'memcpy' called with size bigger than buffer case 12: memcpy(&v, src, 12*sizeof(T)); break; This code is unreachable and generally removed by Clang's optimizer anyway... as far as I can tell the code generation diff is arbitrary. Change-Id: I6216567caaa6166f71258bd25343a09e93892a10 Reviewed-on: https://skia-review.googlesource.com/39961 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* Revert "8-bit jumper on armv8"Gravatar Derek Sollenberger2017-08-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 08133583d5e1cdfdcc41b4bb078fcfb64137f058. Reason for revert: Blocking Android Autoroller on compile error. Original change's description: > 8-bit jumper on armv8 > > The GM diffs are all minor and what you'd expect. > > I did a quick performance sanity check, which also looks fine. > > $ out/ok bench rp filter:search=Modulate > [blendmode_rect_Modulate] 30.2ms @0 32ms @95 32ms @100 > [blendmode_mask_Modulate] 12.6ms @0 12.6ms @95 14.5ms @100 > ~~~> > [blendmode_rect_Modulate] 11.2ms @0 11.7ms @95 12.4ms @100 > [blendmode_mask_Modulate] 10.5ms @0 23.6ms @95 23.9ms @100 > > This isn't even really the fastest we can make 8-bit go on ARMv8; > it's actually much more natural to work de-interlaced there. Lots > of room to follow up. > > Change-Id: I86b1099f6742bcb0b8b4fa153e85eaba9567cbf7 > Reviewed-on: https://skia-review.googlesource.com/39740 > Reviewed-by: Florin Malita <fmalita@chromium.org> > Commit-Queue: Mike Klein <mtklein@chromium.org> TBR=mtklein@chromium.org,herb@google.com,fmalita@chromium.org,reed@google.com Change-Id: I71425d8b7fbb66be5cb50025871dd81358111da4 No-Presubmit: true No-Tree-Checks: true No-Try: true Reviewed-on: https://skia-review.googlesource.com/39980 Reviewed-by: Derek Sollenberger <djsollen@google.com> Commit-Queue: Derek Sollenberger <djsollen@google.com>
* 8-bit jumper on armv8Gravatar Mike Klein2017-08-28
| | | | | | | | | | | | | | | | | | | | | | The GM diffs are all minor and what you'd expect. I did a quick performance sanity check, which also looks fine. $ out/ok bench rp filter:search=Modulate [blendmode_rect_Modulate] 30.2ms @0 32ms @95 32ms @100 [blendmode_mask_Modulate] 12.6ms @0 12.6ms @95 14.5ms @100 ~~~> [blendmode_rect_Modulate] 11.2ms @0 11.7ms @95 12.4ms @100 [blendmode_mask_Modulate] 10.5ms @0 23.6ms @95 23.9ms @100 This isn't even really the fastest we can make 8-bit go on ARMv8; it's actually much more natural to work de-interlaced there. Lots of room to follow up. Change-Id: I86b1099f6742bcb0b8b4fa153e85eaba9567cbf7 Reviewed-on: https://skia-review.googlesource.com/39740 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* -Fast bot fixes for AVX+ codeGravatar Mike Klein2017-08-28
| | | | | | | | | | | | | | | | | | | 1) Replace a couple commas with semicolons. 2) Make sure to zero a couple vectors. 1) has no effect on code generation. 2) does add a bunch of self-vxorps, but they're cheap and we already do the equivalent for <AVX SSE code, and they're in not very performance-critical routines. We could circle back and guard these with !defined(JUMPER_IS_OFFLINE) if we really need the vectors to start uninitialized for speed. CQ_INCLUDE_TRYBOTS=skia.primary:Build-Debian9-Clang-x86_64-Release-Fast Change-Id: I1a13f3eb28d664dbc345d71c3adbc62be5ff7c45 Reviewed-on: https://skia-review.googlesource.com/39661 Reviewed-by: Mike Reed <reed@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* remove aarch64 offline compilationGravatar Mike Klein2017-08-28
| | | | | | | | | | | | | | | The baseline compiled into Skia is now pretty much identical. Minor diffs due to the offline code using -ffp-contract=fast, and the baseline not. Explicit calls to fma() are still FMAs, but we're no longer letting the compiler uncover FMAs we didn't explicitly call out. If this goes well, we should be able to turn on the 8-bit pipeline. Change-Id: I8f73157cfce7373574c20f6435fe86b46477afa9 Reviewed-on: https://skia-review.googlesource.com/39520 Reviewed-by: Herb Derby <herb@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* split up JUMPER defineGravatar Mike Klein2017-08-28
| | | | | | | | | | | | | | | | | | | | | | | | | Whether JUMPER is defined is starting to get a little overloaded: - are we compiling offline (defined) or as part of Skia (!defined)? - are we using Clang vector extensions (defined) or scalars (!defined)? This splits JUMPER into these two separate concerns: - JUMPER_IS_OFFLINE - JUMPER_IS_SCALAR, JUMPER_IS_NEON, JUMPER_IS_AVX2, etc. The upshot is that we'll now use Clang vector extensions when available for our "portable" baseline. On x86-64 and ARMv8 compiled by Clang, we're guaranteed to pick up SSE2 and NEON respectively. Our -Fast bot should even get all the way to AVX2. Another CL will do some refactoring in SkJumper to remove the redundant copies of guaranteed vector code on x86-64 and ARMv8. I didn't want to do that here yet to demonstrate that there is zero effect on the .S files from this CL. Change-Id: Ib5e8f00b35e8721b2cc7180e294840ffaf9dddce Reviewed-on: https://skia-review.googlesource.com/39500 Reviewed-by: Herb Derby <herb@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* rework plus blend modeGravatar Mike Klein2017-08-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The most interesting parts of this are how plus interacts with partial coverage. Plus needs its clamp to happen after the lerp. Luckily, some of its math folds away: d' = clamp[ d*(1-c) + (s+d)*c ] == clamp[ d - dc + sc + dc ] == clamp[ d + sc ] What's nice there is that coverage can be folded into the src term. This suggests that we can re-write the plus stage to clamp internally (and thus, be viable for 8-bit) if we always pre-scale with coverage. We don't have a way to pre-scale with 565 coverage until now, but it's only a step or two away from there. We can use the alternate formulation we derived for alpha for lerp_565, calculating the alpha coverage from red, green, and blue coverages _and_ the values of src and dst alpha. While we already pre-scale srcover today for 8-bit or constant coverage, we cannot do the same for 565. When evaluating the expression d' = s + (1-a)d we need the a term to be pre-scaled with red's coverage when calculating dr', with blue's when calculating db', etc. Essentially we need to carry around a bunch of extra values, and we've got no way to do that. So instead, we'll just carefully pre-scale plus with any coverage, and keep post-lerping srcover when we have 565 coverage. Change-Id: I7a7a52eec7d482e1b98bb8a01ea0a3d5e67bef65 Reviewed-on: https://skia-review.googlesource.com/38300 Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org>
* Remove SK_SUPPORT_LEGACY_RP_BLENDS-guarded codeGravatar Florin Malita2017-08-24
| | | | | | | | | | The flag is no longer used. Change-Id: I39156ef5683538263c2302f2fe3ba779e55dbc47 Reviewed-on: https://skia-review.googlesource.com/38360 Commit-Queue: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org>
* merge object files before parsing into assemblyGravatar Mike Klein2017-08-24
| | | | | | | | | | | | | This extra ld pass can merge all our many redundant constants, both within an instruction set and across them. This should save a bunch of code size on x86-64, with no other impact. It cuts 12K off my local build of ok. Change-Id: Ib2bb4adf88564aca45e55ee53dcf6584265c7dbe Reviewed-on: https://skia-review.googlesource.com/37940 Commit-Queue: Mike Klein <mtklein@google.com> Reviewed-by: Florin Malita <fmalita@chromium.org>
* Document missing 8bit blend stagesGravatar Florin Malita2017-08-23
| | | | | | | Change-Id: Id626f954fe45546a015a1bd423f19cca5f8967a9 Reviewed-on: https://skia-review.googlesource.com/37861 Reviewed-by: Mike Klein <mtklein@google.com> Commit-Queue: Mike Klein <mtklein@google.com>
* ColorBurn/ColorDodge stage tweaksGravatar Florin Malita2017-08-23
| | | | | | | | | | | | | | | | | | | Minor speedup. Before: 10212.01 ? blendmode_rect_ColorBurn 8888 9216.78 ? blendmode_rect_ColorDodge 8888 After: 9635.44 ? blendmode_rect_ColorBurn 8888 8820.22 ? blendmode_rect_ColorDodge 8888 Change-Id: I9e8a9aa21e2370de3174c31821fb0676260d2643 Reviewed-on: https://skia-review.googlesource.com/37620 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Florin Malita <fmalita@chromium.org>
* remove disabled mask load and store codeGravatar Mike Klein2017-08-22
| | | | | | | | | | Things ran slower when we attempted to turn it on, and we've already removed the analog in SkJumper_stages.cpp. Change-Id: I61afa38990bf54d1bff2b1902f09a14df4e17da9 Reviewed-on: https://skia-review.googlesource.com/37080 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* rename confusing lowp guardGravatar Mike Klein2017-08-15
| | | | | | | Change-Id: I346429015e5f902b0a35663e140bb9a025c4220e Reviewed-on: https://skia-review.googlesource.com/34680 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* Lowp overlay, hardlight stagesGravatar Florin Malita2017-08-14
| | | | | | | | | | | | | | | | | | | | Before: micros bench 7669.09 ? blendmode_rect_HardLight 8888 8707.13 ? blendmode_rect_Overlay 8888 After: micros bench 6679.60 ? blendmode_rect_HardLight 8888 6789.57 ? blendmode_rect_Overlay 8888 Change-Id: I52f389253fa07dafe18e572af550af7387264a16 Reviewed-on: https://skia-review.googlesource.com/34280 Commit-Queue: Florin Malita <fmalita@chromium.org> Reviewed-by: Mike Klein <mtklein@google.com>
* we never define BLEND_MODEGravatar Mike Klein2017-08-14
| | | | | | | | | Change-Id: I88f3e56971e9844ab2ff74edb0718e6b6e9c6559 Reviewed-on: https://skia-review.googlesource.com/34260 Reviewed-by: Mike Klein <mtklein@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org> Commit-Queue: Florin Malita <fmalita@chromium.org>
* Simplify difference and exclusion.Gravatar Mike Klein2017-08-14
| | | | | | | | | | | | | | | | | | | | | | | We can fold through some math in these two modes. $ out/ok bench:samples=100 rp filter:search="Difference|Exclusion" serial Before: [blendmode_rect_Exclusion] 4.94ms @0 6.13ms @99 6.25ms @100 [blendmode_mask_Exclusion] 10.9ms @0 12.8ms @99 12.9ms @100 [blendmode_rect_Difference] 5.56ms @0 6.79ms @99 6.8ms @100 [blendmode_mask_Difference] 11.4ms @0 13.8ms @99 14.1ms @100 After: [blendmode_rect_Exclusion] 3.5ms @0 4.12ms @99 4.59ms @100 [blendmode_mask_Exclusion] 9.27ms @0 11.2ms @99 11.6ms @100 [blendmode_rect_Difference] 5.37ms @0 6.58ms @99 6.6ms @100 [blendmode_mask_Difference] 11ms @0 12.1ms @99 12.6ms @100 Change-Id: I03f32368244d4f979cfee83723fd78dfbc7d5fc1 Reviewed-on: https://skia-review.googlesource.com/33980 Commit-Queue: Florin Malita <fmalita@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org>
* lowp: lighten, difference, exclusionGravatar Florin Malita2017-08-14
| | | | | | | Change-Id: I5773cf831c7e41a932bee1f2c6830085fb7db025 Reviewed-on: https://skia-review.googlesource.com/33764 Commit-Queue: Florin Malita <fmalita@chromium.org> Reviewed-by: Mike Klein <mtklein@google.com>
* Guard lowp changesGravatar Florin Malita2017-08-11
| | | | | | | | | | Chromium uses the lowp code, we have to stage the changes. TBR= Change-Id: I45e97a51eca285c9afc71926bbf736a03d0d146c Reviewed-on: https://skia-review.googlesource.com/33765 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Florin Malita <fmalita@chromium.org>
* Lowp darken stageGravatar Florin Malita2017-08-11
| | | | | | | Change-Id: I4bf618ad8728541fcef3fc1c6aa5b3ca106d50dc Reviewed-on: https://skia-review.googlesource.com/33583 Commit-Queue: Florin Malita <fmalita@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org>
* remove mask load() and store()Gravatar Mike Klein2017-08-11
| | | | | | | | | | | | | | | They appear to be slower than the generic load() and store() now. [blendmode_mask_Hue] 14.7ms @0 15.6ms @95 39.6ms @100 [blendmode_rect_Hue] 31.5ms @0 37.6ms @95 39.5ms @100 ~~> [blendmode_mask_Hue] 14.7ms @0 15.2ms @95 39.5ms @100 [blendmode_rect_Hue] 30.5ms @0 32.6ms @95 37.8ms @100 Change-Id: I674b75087b8139debead71f3016631bcb0cb0047 Reviewed-on: https://skia-review.googlesource.com/33800 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* double pump 8-bit stagesGravatar Mike Klein2017-08-11
| | | | | | | | | | | | | This basically unrolls all loops, handling twice as many pixels in a stride. We now pass around 4 native registers instead of just 2. I've temporarily disabled AVX2 mask loads and stores. It shouldn't be hard to turn them back on, but I'd want to test on AVX2 hardware first. Change-Id: I0907070f086a0650167456c149a479c1d96b8a2d Reviewed-on: https://skia-review.googlesource.com/33361 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* Replace interp() with clut_{3,4}D stages.Gravatar Mike Klein2017-08-10
| | | | | | | | | | | | | | | I tried to follow exactly the same strategy as a start. (Though I did fix the off-by-one dimensions.) It does rather look like we only need 3D and 4D now that I've looked at the call sites. Looks like about a 20% speedup. Change-Id: I8b1af64750ad1750716ee1ab0767e64591c7206a Reviewed-on: https://skia-review.googlesource.com/32842 Commit-Queue: Mike Klein <mtklein@google.com> Reviewed-by: Brian Osman <brianosman@google.com>
* add gamma stageGravatar Mike Klein2017-08-09
| | | | | | | | | | | | | | | | Until now we've been using 3 separate parametric stages to apply gamma to r,g,b. That works fine, but is kind of unnecessarily slow, and again less clear in a stack trace than seeing "gamma". The new bench runs in about 60% of the time the old one does on my Trashcan. BUG=skia:6939 Change-Id: I079698d3009b081f1c23a2e27fc26e373b439610 Reviewed-on: https://skia-review.googlesource.com/32721 Reviewed-by: Mike Reed <reed@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* add an invert stage for inverse CMYK -> CMYKGravatar Mike Klein2017-08-08
| | | | | | | | | | | | | | | | | This will be faster, but maybe more importantly it helps make debugging a stack trace clearer. It's confusing to see a "parametric transfer function" stages followed by a table transfer function stages... This leads to a little bit of cleanup in SkColorSpaceXform_A2B. I am uncertain whether we still need parametric_a. I need to do some more tracing through the code before I'd say it's impossible to reach in addTransferFn(). Change-Id: I52e85019f92d012a3086fc94cf64ae6c9307ea94 Reviewed-on: https://skia-review.googlesource.com/32040 Reviewed-by: Brian Osman <brianosman@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* clamp_1 is also a no-op with 8-bit lowpGravatar Mike Klein2017-08-04
| | | | | | | Change-Id: Ifef97d8f28c88c4ee3f7701aac6e383940ed5275 Reviewed-on: https://skia-review.googlesource.com/31020 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* 15-bit lowp is dead, long live 8-bit lowpGravatar Mike Klein2017-08-04
| | | | | | | Change-Id: Icc4b06094aeba3af99b534746f66286d776ef78a Reviewed-on: https://skia-review.googlesource.com/30920 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* same 16->8 bit packing trick for SSE2/SSE4.1Gravatar Mike Klein2017-08-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It's funny how now that I'm on a machine that doesn't support AVX2, it's suddenly important for me that pack() is optimized for SSE! This is basically the same as this morning, without any weird AVX2 pack ordering issues. This replaces something like movdqa 2300(%rip), %xmm0 pshufb %xmm0, %xmm3 pshufb %xmm0, %xmm2 punpcklqdq %xmm3, %xmm2 (This is SSE4.1; the SSE2 version is worse.) with psrlw $8, %xmm3 psrlw $8, %xmm2 packuswb %xmm3, %xmm2 (SSE2 and SSE4.1 both.) It's always nice to not need to load a shuffle mask out of memory. Change-Id: I56fb30b31fcedc0ee84a4a71c483a597c8dc1622 Reviewed-on: https://skia-review.googlesource.com/30583 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* Store float and byte constant colors.Gravatar Mike Klein2017-08-03
| | | | | | | | | This makes loading them much simpler in 8-bit mode. Change-Id: I35ff34ebd0b93425c4e39e055bf4ade8cf8561e1 Reviewed-on: https://skia-review.googlesource.com/30621 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* _very_ minor srcover speedupGravatar Mike Klein2017-08-03
| | | | | | | | | | | | | | | | | | | | This is a consistent, very small speedup for srcover. SkRasterPipeline_run Before: 30.4057ns After: 30.1089ns i.e. a 1% speedup on the bench, maybe 3-4% improvment in srcover itself. The only reason I'd send this out now is that this will slightly change some pixels, so it's a good thing to sneak in before rebaselining. It's possible that other blend modes would benefit from the same, but I've only looked at srcover (and I've also changed dstover so that it doesn't look funny). Change-Id: Ic056ca0912d76648d43a78e0052176fd0f7934f1 Reviewed-on: https://skia-review.googlesource.com/30281 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* improve HSW 16->8 bit packGravatar Mike Klein2017-08-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __builtin_convertvector(..., U8x4) is producing a fairly long sequence of code to convert U16x4 to U8x4 on HSW: vextracti128 $0x1,%ymm2,%xmm3 vmovdqa 0x1848(%rip),%xmm4 vpshufb %xmm4,%xmm3,%xmm3 vpshufb %xmm4,%xmm2,%xmm2 vpunpcklqdq %xmm3,%xmm2,%xmm2 vextracti128 $0x1,%ymm0,%xmm3 vpshufb %xmm4,%xmm3,%xmm3 vpshufb %xmm4,%xmm0,%xmm0 vpunpcklqdq %xmm3,%xmm0,%xmm0 vinserti128 $0x1,%xmm2,%ymm0,%ymm0 We can do much better with _mm256_packus_epi16: vinserti128 $0x1,%xmm0,%ymm2,%ymm3 vperm2i128 $0x31,%ymm0,%ymm2,%ymm0 vpackuswb %ymm0,%ymm3,%ymm0 vpackuswb packs the values in a somewhat surprising order, which the first two instructions get us lined up for. This is a pretty noticeable speedup, 7-8% on some benchmarks. The same sort of change could be made for SSE2 and SSE4.1 also using _mm_packus_epi16, but the difference for that change is much less dramatic. Might as well stick to focusing on HSW. Change-Id: I0d6765bd67e0d024d658a61d19e6f6826b4d392c Reviewed-on: https://skia-review.googlesource.com/30420 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* 8-bit hackingGravatar Mike Klein2017-08-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | I think we can replace a lot of legacy code with an SkRasterPipeline backend that works in 8-bit and stays interlaced. Think of this as a "lowerp" replacement for lowp. I'm having some trouble getting ARMv8 working. ARMv7 should be fine, but I want to turn it on separately from x86. I haven't looked at 32-bit x86 yet, but that's also on the todo list. Open questions to follow up on: - is it better to fold every multiply back down to 8-bit (as seen here), or to allow intermediates to accumulate in 16-bit and divide by 255 when done/needed? - is it better pass tightly packed 8-bit vectors between stages (as seen here), or to keep the 8-bit values unpacked in 16-bit lanes? - should we make V wider than 1 register? GMs look good. All diffs invisible and plausibly due to the 15->8 bit precision drop. A quick bench run showed this running in about 0.75x the time of the existing lowp backend. Change-Id: I24aa46ff1d19c0b9b8dc192d5b1821cab0b8843c Reviewed-on: https://skia-review.googlesource.com/29886 Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-by: Florin Malita <fmalita@chromium.org>
* clamp to 0 in repeat and mirror image tilersGravatar Mike Klein2017-08-01
| | | | | | | | | | | | | | | | | | | If we were doing this math with real numbers or even just doubles, these clamps wouldn't be necessary. But we're favoring speed over accuracy here when we emulate fmod() and some of those inaccuracies end up with values outside the [0,tile) range, negative! To keep the spirit of fast over 100% accurate, I've just added a safety clamp to 0. The case in the unit test now returns 0 where it should really return something like 7 or 8, but at least we won't try to read _way_ outside the image buffer. BUG=chromium:749260 Change-Id: Ifc5cfe69798beccbb2a16547510158576e06eb3a Reviewed-on: https://skia-review.googlesource.com/29580 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>
* use new Stage ABI for ARMv7 tooGravatar Mike Klein2017-07-29
| | | | | | | | | | | | | | | | | | | | | | | | | ARMv7 can pass 16 floats as function arguments. We've been slicing that as 8 2-float vectors. This CL switches to 4 4-float vectors. We'll now operate on 4 pixels at a time instead of 2, at the expense of keeping the d-vectors (mostly used for blending) on the stack. It'll be interesting to see how this plays out performance-wise. One nice side effect is now both ARMv7 and ARMv8 use 4-float NEON vectors. Most of the code is now shared, with just a couple checks to use new instructions added in ARMv8. It looks like we do see a ~15% win: $ bin/droid out/monobench SkRasterPipeline_srgb 200 Before: 644.029ns After: 547.301ns ARMv8: 453.838ns (just for reference) Change-Id: I184ff29a36499e3cdb4c284809d40880b02c2236 Reviewed-on: https://skia-review.googlesource.com/27701 Reviewed-by: Mike Reed <reed@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org>
* rearrange SkJumper registers on 32-bit x86Gravatar Mike Klein2017-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are not many registers on 32-bit x86, and we're using most to pass Stage function arguments. This means few are available as temporaries, and we're forced to hit the stack all the time. xmm registers are the most egregious example: we use all 8 registers pass data, leaving none free as temporaries. This CL cuts things down pretty dramatically, from passing 5 general purpose and 8 xmm registers to 2 general purpose and 4 xmm registers. One of the two general purpose registers is a pointer to space on the stack where we store all those other values. Every stage function needs to use the program pointer, so that stays in a general purpose register. Almost every stage uses the r,g,b,a vectors, so they stay in xmm registers. The rest (destination x,y, the tail mask, a pointer to tricky constants, and the dr,dg,db,da vectors) now live on the stack. The generated code is about 20K smaller and runs about 20% faster. $ out/monobench SkRasterPipeline_srgb 200 Before: 358.784ns After: 282.563ns Change-Id: Icc117af95c1a81c41109984b32e0841022f0d1a6 Reviewed-on: https://skia-review.googlesource.com/27620 Reviewed-by: Florin Malita <fmalita@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org>