path: root/src/jumper
Commit message (Author, Age)
...
* Remove SK_SUPPORT_LEGACY_RP_BLENDS-guarded code (Florin Malita, 2017-08-24)

  The flag is no longer used.

  Change-Id: I39156ef5683538263c2302f2fe3ba779e55dbc47
  Reviewed-on: https://skia-review.googlesource.com/38360
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* merge object files before parsing into assembly (Mike Klein, 2017-08-24)

  This extra ld pass can merge all our many redundant constants, both
  within an instruction set and across them.

  This should save a bunch of code size on x86-64, with no other impact.
  It cuts 12K off my local build of ok.

  Change-Id: Ib2bb4adf88564aca45e55ee53dcf6584265c7dbe
  Reviewed-on: https://skia-review.googlesource.com/37940
  Commit-Queue: Mike Klein <mtklein@google.com>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
* Document missing 8bit blend stages (Florin Malita, 2017-08-23)

  Change-Id: Id626f954fe45546a015a1bd423f19cca5f8967a9
  Reviewed-on: https://skia-review.googlesource.com/37861
  Reviewed-by: Mike Klein <mtklein@google.com>
  Commit-Queue: Mike Klein <mtklein@google.com>
* ColorBurn/ColorDodge stage tweaks (Florin Malita, 2017-08-23)

  Minor speedup.

  Before:
    10212.01 ? blendmode_rect_ColorBurn 8888
     9216.78 ? blendmode_rect_ColorDodge 8888

  After:
     9635.44 ? blendmode_rect_ColorBurn 8888
     8820.22 ? blendmode_rect_ColorDodge 8888

  Change-Id: I9e8a9aa21e2370de3174c31821fb0676260d2643
  Reviewed-on: https://skia-review.googlesource.com/37620
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Florin Malita <fmalita@chromium.org>
* remove disabled mask load and store code (Mike Klein, 2017-08-22)

  Things ran slower when we attempted to turn it on, and we've already
  removed the analog in SkJumper_stages.cpp.

  Change-Id: I61afa38990bf54d1bff2b1902f09a14df4e17da9
  Reviewed-on: https://skia-review.googlesource.com/37080
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* rename confusing lowp guard (Mike Klein, 2017-08-15)

  Change-Id: I346429015e5f902b0a35663e140bb9a025c4220e
  Reviewed-on: https://skia-review.googlesource.com/34680
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* Lowp overlay, hardlight stages (Florin Malita, 2017-08-14)

  Before:
    micros   bench
    7669.09  ? blendmode_rect_HardLight 8888
    8707.13  ? blendmode_rect_Overlay 8888

  After:
    micros   bench
    6679.60  ? blendmode_rect_HardLight 8888
    6789.57  ? blendmode_rect_Overlay 8888

  Change-Id: I52f389253fa07dafe18e572af550af7387264a16
  Reviewed-on: https://skia-review.googlesource.com/34280
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Mike Klein <mtklein@google.com>
* we never define BLEND_MODE (Mike Klein, 2017-08-14)

  Change-Id: I88f3e56971e9844ab2ff74edb0718e6b6e9c6559
  Reviewed-on: https://skia-review.googlesource.com/34260
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Florin Malita <fmalita@chromium.org>
* Simplify difference and exclusion. (Mike Klein, 2017-08-14)

  We can fold through some math in these two modes.

  $ out/ok bench:samples=100 rp filter:search="Difference|Exclusion" serial

  Before:
    [blendmode_rect_Exclusion]   4.94ms @0  6.13ms @99  6.25ms @100
    [blendmode_mask_Exclusion]   10.9ms @0  12.8ms @99  12.9ms @100
    [blendmode_rect_Difference]  5.56ms @0  6.79ms @99  6.8ms @100
    [blendmode_mask_Difference]  11.4ms @0  13.8ms @99  14.1ms @100

  After:
    [blendmode_rect_Exclusion]   3.5ms @0   4.12ms @99  4.59ms @100
    [blendmode_mask_Exclusion]   9.27ms @0  11.2ms @99  11.6ms @100
    [blendmode_rect_Difference]  5.37ms @0  6.58ms @99  6.6ms @100
    [blendmode_mask_Difference]  11ms @0    12.1ms @99  12.6ms @100

  Change-Id: I03f32368244d4f979cfee83723fd78dfbc7d5fc1
  Reviewed-on: https://skia-review.googlesource.com/33980
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
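
  For reference, a minimal sketch of the two blend modes on premultiplied float colors. It uses the standard compositing formulas, not necessarily the folded form this commit landed; `Premul` and the function names are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>

// Hypothetical premultiplied RGBA color, channels in [0, 1].
struct Premul { float r, g, b, a; };

// Exclusion, per color channel: s + d - 2*s*d.
inline float exclusion_channel(float s, float d) {
    return s + d - 2.0f * s * d;
}

// Difference, per color channel: s + d - 2*min(s*da, d*sa).
inline float difference_channel(float s, float sa, float d, float da) {
    return s + d - 2.0f * std::min(s * da, d * sa);
}

// Both modes use the usual source-over alpha: sa + da - sa*da.
inline float blend_alpha(float sa, float da) {
    return sa + da - sa * da;
}

inline Premul exclusion(Premul s, Premul d) {
    return { exclusion_channel(s.r, d.r),
             exclusion_channel(s.g, d.g),
             exclusion_channel(s.b, d.b),
             blend_alpha(s.a, d.a) };
}

inline Premul difference(Premul s, Premul d) {
    return { difference_channel(s.r, s.a, d.r, d.a),
             difference_channel(s.g, s.a, d.g, d.a),
             difference_channel(s.b, s.a, d.b, d.a),
             blend_alpha(s.a, d.a) };
}
```

  Blending an opaque mid-gray with itself gives mid-gray under exclusion and black under difference, which is a quick sanity check for either formula.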
* lowp: lighten, difference, exclusion (Florin Malita, 2017-08-14)

  Change-Id: I5773cf831c7e41a932bee1f2c6830085fb7db025
  Reviewed-on: https://skia-review.googlesource.com/33764
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Mike Klein <mtklein@google.com>
* Guard lowp changes (Florin Malita, 2017-08-11)

  Chromium uses the lowp code, we have to stage the changes.

  TBR=

  Change-Id: I45e97a51eca285c9afc71926bbf736a03d0d146c
  Reviewed-on: https://skia-review.googlesource.com/33765
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Florin Malita <fmalita@chromium.org>
* Lowp darken stage (Florin Malita, 2017-08-11)

  Change-Id: I4bf618ad8728541fcef3fc1c6aa5b3ca106d50dc
  Reviewed-on: https://skia-review.googlesource.com/33583
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* remove mask load() and store() (Mike Klein, 2017-08-11)

  They appear to be slower than the generic load() and store() now.

    [blendmode_mask_Hue]  14.7ms @0  15.6ms @95  39.6ms @100
    [blendmode_rect_Hue]  31.5ms @0  37.6ms @95  39.5ms @100
  ~~>
    [blendmode_mask_Hue]  14.7ms @0  15.2ms @95  39.5ms @100
    [blendmode_rect_Hue]  30.5ms @0  32.6ms @95  37.8ms @100

  Change-Id: I674b75087b8139debead71f3016631bcb0cb0047
  Reviewed-on: https://skia-review.googlesource.com/33800
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* double pump 8-bit stages (Mike Klein, 2017-08-11)

  This basically unrolls all loops, handling twice as many pixels in a
  stride. We now pass around 4 native registers instead of just 2.

  I've temporarily disabled AVX2 mask loads and stores. It shouldn't be
  hard to turn them back on, but I'd want to test on AVX2 hardware first.

  Change-Id: I0907070f086a0650167456c149a479c1d96b8a2d
  Reviewed-on: https://skia-review.googlesource.com/33361
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* Replace interp() with clut_{3,4}D stages. (Mike Klein, 2017-08-10)

  I tried to follow exactly the same strategy as a start.
  (Though I did fix the off-by-one dimensions.)

  It does rather look like we only need 3D and 4D now that I've looked at
  the call sites.

  Looks like about a 20% speedup.

  Change-Id: I8b1af64750ad1750716ee1ab0767e64591c7206a
  Reviewed-on: https://skia-review.googlesource.com/32842
  Commit-Queue: Mike Klein <mtklein@google.com>
  Reviewed-by: Brian Osman <brianosman@google.com>
* add gamma stage (Mike Klein, 2017-08-09)

  Until now we've been using 3 separate parametric stages to apply gamma
  to r,g,b. That works fine, but is kind of unnecessarily slow, and again
  less clear in a stack trace than seeing "gamma".

  The new bench runs in about 60% of the time the old one does on my
  Trashcan.

  BUG=skia:6939

  Change-Id: I079698d3009b081f1c23a2e27fc26e373b439610
  Reviewed-on: https://skia-review.googlesource.com/32721
  Reviewed-by: Mike Reed <reed@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* add an invert stage for inverse CMYK -> CMYK (Mike Klein, 2017-08-08)

  This will be faster, but maybe more importantly it helps make debugging
  a stack trace clearer. It's confusing to see a "parametric transfer
  function" stage followed by a table transfer function stage...

  This leads to a little bit of cleanup in SkColorSpaceXform_A2B.

  I am uncertain whether we still need parametric_a. I need to do some
  more tracing through the code before I'd say it's impossible to reach
  in addTransferFn().

  Change-Id: I52e85019f92d012a3086fc94cf64ae6c9307ea94
  Reviewed-on: https://skia-review.googlesource.com/32040
  Reviewed-by: Brian Osman <brianosman@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* clamp_1 is also a no-op with 8-bit lowp (Mike Klein, 2017-08-04)

  Change-Id: Ifef97d8f28c88c4ee3f7701aac6e383940ed5275
  Reviewed-on: https://skia-review.googlesource.com/31020
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* 15-bit lowp is dead, long live 8-bit lowp (Mike Klein, 2017-08-04)

  Change-Id: Icc4b06094aeba3af99b534746f66286d776ef78a
  Reviewed-on: https://skia-review.googlesource.com/30920
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* same 16->8 bit packing trick for SSE2/SSE4.1 (Mike Klein, 2017-08-03)

  It's funny how now that I'm on a machine that doesn't support AVX2,
  it's suddenly important for me that pack() is optimized for SSE!

  This is basically the same as this morning, without any weird AVX2
  pack ordering issues.

  This replaces something like

      movdqa      2300(%rip), %xmm0
      pshufb      %xmm0, %xmm3
      pshufb      %xmm0, %xmm2
      punpcklqdq  %xmm3, %xmm2

  (This is SSE4.1; the SSE2 version is worse.)

  with

      psrlw       $8, %xmm3
      psrlw       $8, %xmm2
      packuswb    %xmm3, %xmm2

  (SSE2 and SSE4.1 both.)

  It's always nice to not need to load a shuffle mask out of memory.

  Change-Id: I56fb30b31fcedc0ee84a4a71c483a597c8dc1622
  Reviewed-on: https://skia-review.googlesource.com/30583
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
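
  The shift-and-pack sequence in this commit corresponds roughly to the intrinsics sketch below. It is an illustration only: `pack_high_bytes` is a hypothetical name, the real pack() in the tree is not shown on this page, and a scalar fallback is included as an assumption so the sketch also builds on non-x86 targets.

```cpp
#include <cstdint>
#if defined(__SSE2__)
    #include <emmintrin.h>
#endif

// Shift each of 16 16-bit lanes (two groups of 8) right by 8 and pack the
// results into 16 bytes. After the shift every lane is <= 255, so the
// unsigned saturation in packuswb never changes a value.
inline void pack_high_bytes(const uint16_t lo[8], const uint16_t hi[8],
                            uint8_t out[16]) {
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lo));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hi));
    a = _mm_srli_epi16(a, 8);                       // psrlw $8
    b = _mm_srli_epi16(b, 8);                       // psrlw $8
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_packus_epi16(a, b));       // packuswb
#else
    // Portable fallback with the same result.
    for (int i = 0; i < 8; i++) {
        out[i]     = static_cast<uint8_t>(lo[i] >> 8);
        out[i + 8] = static_cast<uint8_t>(hi[i] >> 8);
    }
#endif
}
```

  No shuffle mask is loaded from memory here, which is the point the commit message makes about the new sequence.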
* Store float and byte constant colors. (Mike Klein, 2017-08-03)

  This makes loading them much simpler in 8-bit mode.

  Change-Id: I35ff34ebd0b93425c4e39e055bf4ade8cf8561e1
  Reviewed-on: https://skia-review.googlesource.com/30621
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* _very_ minor srcover speedup (Mike Klein, 2017-08-03)

  This is a consistent, very small speedup for srcover.

  SkRasterPipeline_run
    Before: 30.4057ns
    After:  30.1089ns

  i.e. a 1% speedup on the bench, maybe 3-4% improvement in srcover itself.

  The only reason I'd send this out now is that this will slightly change
  some pixels, so it's a good thing to sneak in before rebaselining.

  It's possible that other blend modes would benefit from the same, but
  I've only looked at srcover (and I've also changed dstover so that it
  doesn't look funny).

  Change-Id: Ic056ca0912d76648d43a78e0052176fd0f7934f1
  Reviewed-on: https://skia-review.googlesource.com/30281
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* improve HSW 16->8 bit pack (Mike Klein, 2017-08-03)

  __builtin_convertvector(..., U8x4) is producing a fairly long sequence
  of code to convert U16x4 to U8x4 on HSW:

      vextracti128  $0x1,%ymm2,%xmm3
      vmovdqa       0x1848(%rip),%xmm4
      vpshufb       %xmm4,%xmm3,%xmm3
      vpshufb       %xmm4,%xmm2,%xmm2
      vpunpcklqdq   %xmm3,%xmm2,%xmm2
      vextracti128  $0x1,%ymm0,%xmm3
      vpshufb       %xmm4,%xmm3,%xmm3
      vpshufb       %xmm4,%xmm0,%xmm0
      vpunpcklqdq   %xmm3,%xmm0,%xmm0
      vinserti128   $0x1,%xmm2,%ymm0,%ymm0

  We can do much better with _mm256_packus_epi16:

      vinserti128   $0x1,%xmm0,%ymm2,%ymm3
      vperm2i128    $0x31,%ymm0,%ymm2,%ymm0
      vpackuswb     %ymm0,%ymm3,%ymm0

  vpackuswb packs the values in a somewhat surprising order, which the
  first two instructions get us lined up for.

  This is a pretty noticeable speedup, 7-8% on some benchmarks.

  The same sort of change could be made for SSE2 and SSE4.1 also using
  _mm_packus_epi16, but the difference for that change is much less
  dramatic. Might as well stick to focusing on HSW.

  Change-Id: I0d6765bd67e0d024d658a61d19e6f6826b4d392c
  Reviewed-on: https://skia-review.googlesource.com/30420
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* 8-bit hacking (Mike Klein, 2017-08-03)

  I think we can replace a lot of legacy code with an SkRasterPipeline
  backend that works in 8-bit and stays interlaced. Think of this as a
  "lowerp" replacement for lowp.

  I'm having some trouble getting ARMv8 working. ARMv7 should be fine,
  but I want to turn it on separately from x86. I haven't looked at
  32-bit x86 yet, but that's also on the todo list.

  Open questions to follow up on:
    - is it better to fold every multiply back down to 8-bit (as seen
      here), or to allow intermediates to accumulate in 16-bit and divide
      by 255 when done/needed?
    - is it better to pass tightly packed 8-bit vectors between stages
      (as seen here), or to keep the 8-bit values unpacked in 16-bit
      lanes?
    - should we make V wider than 1 register?

  GMs look good. All diffs invisible and plausibly due to the 15->8 bit
  precision drop.

  A quick bench run showed this running in about 0.75x the time of the
  existing lowp backend.

  Change-Id: I24aa46ff1d19c0b9b8dc192d5b1821cab0b8843c
  Reviewed-on: https://skia-review.googlesource.com/29886
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
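
  The first open question in this commit (fold each multiply to 8 bits, or accumulate wider and divide once) can be made concrete with a scalar sketch. Everything here is hypothetical: the names are invented for illustration, and the rounding used is a well-known exact shift-based divide by 255, which may differ from what the stages in this CL actually do.

```cpp
#include <cstdint>

// Exact rounded x/255 for x in [0, 255*255], using only adds and shifts.
inline uint8_t div255(uint32_t x) {
    uint32_t t = x + 128;
    return static_cast<uint8_t>((t + (t >> 8)) >> 8);
}

// Option 1: fold every product straight back down to 8 bits.
inline uint8_t mul8(uint8_t a, uint8_t b) {
    return div255(static_cast<uint32_t>(a) * b);
}

// srcover on one premultiplied channel: s + d*(255 - sa)/255.
// Assumes s <= sa (premultiplied input), so the sum cannot exceed 255.
inline uint8_t srcover8(uint8_t s, uint8_t sa, uint8_t d) {
    return static_cast<uint8_t>(s + mul8(d, static_cast<uint8_t>(255 - sa)));
}

// lerp with each product folded to 8 bits before the add (two roundings).
inline uint8_t lerp8_folded(uint8_t from, uint8_t to, uint8_t t) {
    return static_cast<uint8_t>(mul8(from, static_cast<uint8_t>(255 - t)) +
                                mul8(to, t));
}

// Option 2: let the products accumulate in a wider type and divide by 255
// once at the end (one rounding). The accumulator stays <= 255*255.
inline uint8_t lerp8_wide(uint8_t from, uint8_t to, uint8_t t) {
    uint32_t acc = static_cast<uint32_t>(from) * (255u - t) +
                   static_cast<uint32_t>(to) * t;
    return div255(acc);
}
```

  The folded form keeps every intermediate in 8 bits at the cost of an extra rounding step per product; the wide form rounds once but needs 16-bit lanes for the accumulator.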
* clamp to 0 in repeat and mirror image tilers (Mike Klein, 2017-08-01)

  If we were doing this math with real numbers or even just doubles,
  these clamps wouldn't be necessary. But we're favoring speed over
  accuracy here when we emulate fmod() and some of those inaccuracies end
  up with values outside the [0,tile) range, negative!

  To keep the spirit of fast over 100% accurate, I've just added a safety
  clamp to 0. The case in the unit test now returns 0 where it should
  really return something like 7 or 8, but at least we won't try to read
  _way_ outside the image buffer.

  BUG=chromium:749260

  Change-Id: Ifc5cfe69798beccbb2a16547510158576e06eb3a
  Reviewed-on: https://skia-review.googlesource.com/29580
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
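
  A minimal scalar sketch of the idea, assuming the fast fmod() emulation has the common `x - floor(x/tile)*tile` shape with a multiply by the reciprocal. The function names are hypothetical and the actual stage code is not shown on this page.

```cpp
#include <algorithm>
#include <cmath>

// Fast approximation of fmod(x, tile) for tiling. Multiplying by the
// reciprocal instead of dividing is quicker, but float rounding means the
// result is not guaranteed to land inside [0, tile).
inline float repeat_unclamped(float x, float tile) {
    return x - std::floor(x * (1.0f / tile)) * tile;
}

// Same computation with the safety clamp from this commit: anything that
// rounds to a negative coordinate is forced to 0, so a later gather cannot
// index before the start of the image buffer.
inline float repeat_clamped(float x, float tile) {
    return std::max(repeat_unclamped(x, tile), 0.0f);
}
```

  For well-behaved inputs the clamp changes nothing: `repeat_clamped(5.0f, 4.0f)` is 1 and `repeat_clamped(-1.0f, 4.0f)` is 3.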
* use new Stage ABI for ARMv7 too (Mike Klein, 2017-07-29)

  ARMv7 can pass 16 floats as function arguments. We've been slicing that
  as 8 2-float vectors. This CL switches to 4 4-float vectors.

  We'll now operate on 4 pixels at a time instead of 2, at the expense of
  keeping the d-vectors (mostly used for blending) on the stack. It'll be
  interesting to see how this plays out performance-wise.

  One nice side effect is now both ARMv7 and ARMv8 use 4-float NEON
  vectors. Most of the code is now shared, with just a couple checks to
  use new instructions added in ARMv8.

  It looks like we do see a ~15% win:

  $ bin/droid out/monobench SkRasterPipeline_srgb 200
    Before: 644.029ns
    After:  547.301ns
    ARMv8:  453.838ns (just for reference)

  Change-Id: I184ff29a36499e3cdb4c284809d40880b02c2236
  Reviewed-on: https://skia-review.googlesource.com/27701
  Reviewed-by: Mike Reed <reed@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* rearrange SkJumper registers on 32-bit x86 (Mike Klein, 2017-07-27)

  There are not many registers on 32-bit x86, and we're using most to
  pass Stage function arguments. This means few are available as
  temporaries, and we're forced to hit the stack all the time. xmm
  registers are the most egregious example: we use all 8 registers to
  pass data, leaving none free as temporaries.

  This CL cuts things down pretty dramatically, from passing 5 general
  purpose and 8 xmm registers to 2 general purpose and 4 xmm registers.
  One of the two general purpose registers is a pointer to space on the
  stack where we store all those other values.

  Every stage function needs to use the program pointer, so that stays in
  a general purpose register. Almost every stage uses the r,g,b,a
  vectors, so they stay in xmm registers. The rest (destination x,y, the
  tail mask, a pointer to tricky constants, and the dr,dg,db,da vectors)
  now live on the stack.

  The generated code is about 20K smaller and runs about 20% faster.

  $ out/monobench SkRasterPipeline_srgb 200
    Before: 358.784ns
    After:  282.563ns

  Change-Id: Icc117af95c1a81c41109984b32e0841022f0d1a6
  Reviewed-on: https://skia-review.googlesource.com/27620
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* clean up SK_SUPPORT_LEGACY_WIN32_JUMPER (Mike Klein, 2017-07-27)

  Change-Id: Icae3c6ce80a0bef097ea1010a4d065cc9d5a4c88
  Reviewed-on: https://skia-review.googlesource.com/27560
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* convert over to 2d-mode (Mike Klein, 2017-07-20)

  [√] convert all stages to use SkJumper_MemoryCtx / be 2d-compatible
  [√] convert compile to 2d also, remove 1d run/compile
  [√] convert all call sites
  [√] no diffs

  Change-Id: I3b806eb8fe0c3ec043359616409f7cd1211a1e43
  Reviewed-on: https://skia-review.googlesource.com/24263
  Commit-Queue: Mike Klein <mtklein@google.com>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
* Target arm64-apple-ios for aarch64 stages. (Mike Klein, 2017-07-20)

  This should avoid use of x18.

  BUG=skia:6873

  Change-Id: Iffafe0a48784b03942325517a999ad9bb44c1f99
  Reviewed-on: https://skia-review.googlesource.com/25180
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* experimental: draw into unpremul (Mike Reed, 2017-07-19)

  raster-only

  Bug: skia:
  Change-Id: I3af19f031083c9cc258f73ba6a2f6020bb15f110
  Reviewed-on: https://skia-review.googlesource.com/24400
  Commit-Queue: Mike Reed <reed@google.com>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* remove gather_i8, unify memory-touching contexts (Mike Klein, 2017-07-18)

  gather_i8 is now unused, so we can remove it. That in turn makes the
  ctable field of SkJumper_GatherCtx unused.

  After removing ctable, SkJumper_GatherCtx and SkJumper_PtrStride look
  identical, so I've now fused them into SkJumper_MemoryCtx, which will
  eventually be used by everything loading from, gathering from, or
  storing to memory.

  Change-Id: Ia882d2dbd54c9fcf9a8250a1ce83304389dd284a
  Reviewed-on: https://skia-review.googlesource.com/24085
  Reviewed-by: Mike Reed <reed@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* start on raster pipeline 2d mode (Mike Klein, 2017-07-18)

  - Add run_2d(x,y,w,h) and start_pipeline_2d().
  - Add and test a 2d-compatible store_8888_2d stage.

  Change-Id: Ib9c225d1b8cb40471ae4333df1d06eec4d506f8a
  Reviewed-on: https://skia-review.googlesource.com/24401
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
* minor fixes to start_pipeline_lowp (Mike Klein, 2017-07-18)

  - in _lowp.cpp, JUMPER is always defined, so no need to check.
  - the return type of this function has been void for a while.

  Change-Id: I5271e8dab784f46c7ffa9cfba6eb55b5e399b537
  Reviewed-on: https://skia-review.googlesource.com/24326
  Reviewed-by: Mike Reed <reed@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* add 32-bit Windows SkJumper backend (Mike Klein, 2017-07-17)

  The most interesting part of this is getting the call to start_pipeline
  to work. From there it should be just like the other x86 backend.

  The 32-bit calling conventions are the same across Linux/Mac and
  Windows, so that's nice. The tricky bit is that Linux and Mac align the
  stack to 16 bytes, while Windows only to 4. I think this
  force_align_arg_pointer attribute on start_pipeline does the trick.

  This needs a guard for layout tests.

  CQ_INCLUDE_TRYBOTS=skia.primary:Test-Win2k8-MSVC-GCE-CPU-AVX2-x86-Debug;master.tryserver.blink:win10_blink_rel,win7_blink_rel;master.tryserver.chromium.win:win_chromium_rel_ng

  Change-Id: Ia74d22e5a4ce5483c9817b8a8f89dd21885bbd14
  Reviewed-on: https://skia-review.googlesource.com/20968
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Mike Reed <reed@google.com>
* add stages for black and white colors (Mike Reed, 2017-07-06)

  histogram of test skps:
    black: 1/7
    white: 2/7
    other: 4/7

  Bug: skia:
  Change-Id: I3a092899d31ce87837e66e5c8ea9ec5e0f239361
  Reviewed-on: https://skia-review.googlesource.com/21408
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Reed <reed@google.com>
* optimize for diff matrix types (Mike Reed, 2017-07-05)

  Bug: skia:
  Change-Id: I671e07c5bbb9e4ced92303c9959143324f7a6bdc
  Reviewed-on: https://skia-review.googlesource.com/21523
  Commit-Queue: Mike Reed <reed@google.com>
  Reviewed-by: Herb Derby <herb@google.com>
* 2pt conical stage for focal-point-outside case (Florin Malita, 2017-06-29)

  A couple of annoyances here:

  1) the prev vector_scale stage is not usable for masking, as NaN values
     can propagate through
     => switch to actual masking

  2) for the outside case, we must select the min root when the gradient
     is flipped
     => split into two templated stages (_min, _max)

  (I'm not convinced that we need to flip the gradient for RP at all; we
  can investigate later)

  Change-Id: I0283812d613a53124f2987d1aea1f26e4533655e
  Reviewed-on: https://skia-review.googlesource.com/21162
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Florin Malita <fmalita@chromium.org>
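
  Point 1 of this commit is easy to reproduce in isolation: multiplying by a 0/1 scale does not clear a NaN, while a bitwise mask does. The sketch below is a scalar illustration with hypothetical names; `r0` and `dr` in particular are assumed parameters for a radius of the form r0 + t*dr, not identifiers from the Skia stages.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Masking by scaling: NaN * 0 is still NaN under IEEE 754, so a degenerate
// lane leaks through.
inline float mask_by_scale(float v, bool keep) {
    return v * (keep ? 1.0f : 0.0f);
}

// Actual masking: AND the float's bits with all-ones or all-zeros. A
// rejected lane becomes +0.0f regardless of what it held before.
inline float mask_by_bits(float v, bool keep) {
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    bits &= keep ? 0xFFFFFFFFu : 0u;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// A lane is degenerate when t is NaN or the interpolated radius is
// negative. The explicit NaN check matters: comparisons with NaN are false.
inline bool is_degenerate(float t, float r0, float dr) {
    return std::isnan(t) || (r0 + t * dr) < 0.0f;
}
```

  With a NaN input and `keep == false`, `mask_by_scale` still returns NaN while `mask_by_bits` returns 0, which is the transparent-black value the degenerate lanes are supposed to produce.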
* 2pt conical stage for focal-pt-on-edge case (Florin Malita, 2017-06-28)

  When the focal point is on the edge of the end circle, the quadratic
  equation devolves to linear. Add a stage to handle this case.

  As a complication, this case can produce "degenerate" values:

  1) t == NaN
  2) R(t) < 0

  For these, we're supposed to draw transparent black - which means
  overwriting the color from the gradient stage. To support this, build
  a 0/1 vector mask in the context, and apply it post-gradient-stage.

  Change-Id: Ice4e3243abfd8c784bb810f6c310aed7a4ac7dc8
  Reviewed-on: https://skia-review.googlesource.com/21111
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Mike Klein <mtklein@google.com>
* build regenerating SkJumper stages into GN (Mike Klein, 2017-06-28)

  I _think_ this makes it so changes to _stages.cpp or _lowp.cpp get
  noticed, regenerated, and baked into Skia all in the same Ninja
  invocation.

  Now you just need to set up the tools we use in GN:

      skia_jumper_clang = ...
      skia_jumper_objdump = ...
      skia_jumper_ccache = ...

  Change-Id: I09fb54d965644ff6e5825056fb0be2c7cab2ea92
  Reviewed-on: https://skia-review.googlesource.com/21140
  Reviewed-by: Florin Malita <fmalita@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* 2ptconical stage (Florin Malita, 2017-06-28)

  Initial impl, for the well-behaved case (focal point inside).

  MBP numbers -

  Before:
    3365.87 ! gradient_conical_clamp_shallow srgb
    3590.88 ! gradient_conical_clamp_shallow_dither srgb
    3376.91 ! gradient_conical_clamp_3color srgb
    3351.64 ! gradient_conical_clamp_hicolor srgb
    3379.35 ! gradient_conical_clamp srgb

  After:
     648.93 ! gradient_conical_clamp_shallow srgb
     665.12 ! gradient_conical_clamp_shallow_dither srgb
     773.98 ! gradient_conical_clamp_3color srgb
    1175.35 ! gradient_conical_clamp_hicolor srgb
     619.17 ! gradient_conical_clamp srgb

  Change-Id: I07b22a758363e1f340a6041bca53bdef74229eb9
  Reviewed-on: https://skia-review.googlesource.com/20906
  Commit-Queue: Florin Malita <fmalita@chromium.org>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* be more explicit about not expecting 32-bit x86 jumper backend on windows (Mike Klein, 2017-06-28)

  Looks like Clang/Win is defining __i386__, but we're not linking in
  stage functions (they don't exist yet for Windows).

  Change-Id: I78fdd3e1d89020bc6c64bc1cd5dfb3fbca720b2e
  Reviewed-on: https://skia-review.googlesource.com/21103
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@google.com>
  Reviewed-by: Mike Klein <mtklein@google.com>
* add bgra as 1st class format (Mike Klein, 2017-06-27)

  This is a start to eliminating swap_rb as a stage.

  I've just hit the main hot spots here. Going to look into the ~dozen
  other spots to see how they should work next.

  Change-Id: I26fb46a042facf7bd6fff3b47c9fcee86d7142fd
  Reviewed-on: https://skia-review.googlesource.com/20982
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Mike Reed <reed@google.com>
* remove unused "swap" stage (Mike Klein, 2017-06-27)

  Change-Id: I25619f010f8ac6441529cfe8dff2d8c42d7400cf
  Reviewed-on: https://skia-review.googlesource.com/20988
  Reviewed-by: Mike Reed <reed@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* specialize loaders for dst registers, to avoid move/swap stages (Mike Reed, 2017-06-27)

  Bug: skia:
  Change-Id: I75d82ef2226c5f116b7de2208c4e914739414b6d
  Reviewed-on: https://skia-review.googlesource.com/20984
  Commit-Queue: Mike Reed <reed@google.com>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* try not zeroing registers in start_pipeline (Mike Klein, 2017-06-27)

  Generally stages take care of state setup themselves, either with
  seed_shader, constant_color, a load, etc. I think these zeros may be
  unnecessarily cautious.

  This can't make anything draw more correctly, but it could make things
    - draw wrong
    - draw more slowly
    - draw more quickly
  so it's an interesting thing to try and keep an eye on.

  Change-Id: I7e5ea3cd79e55a65e1dbd214601e147ba3815b87
  Reviewed-on: https://skia-review.googlesource.com/20976
  Reviewed-by: Mike Reed <reed@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* add dumbest possible 32-bit SkJumper backend (Mike Klein, 2017-06-27)

  Everything uses a ton of stack, nothing tail calls, and for now this is
  non-Windows only. But, it does run faster than the portable serial
  code.

  On my trashcan, running `monobench SkRasterPipeline_compile`:
    - Normal 64-bit AVX build: 43.6ns
    - Before this CL, 32-bit:  707.9ns
    - This CL:                 147.5ns

  Change-Id: I4a8929570ace47193ed8925c58b70bb22d6b1447
  Reviewed-on: https://skia-review.googlesource.com/20964
  Reviewed-by: Mike Reed <reed@google.com>
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* add _hsw lowp backend (Mike Klein, 2017-06-27)

  CQ_INCLUDE_TRYBOTS=skia.primary:Build-Ubuntu-Clang-x86_64-Debug-MSAN

  Change-Id: Id53279c17589b3434629bb644358ee238af8649f
  Reviewed-on: https://skia-review.googlesource.com/20269
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
  Reviewed-by: Mike Reed <reed@google.com>
* somewhat less silly tail loads and stores (Mike Klein, 2017-06-26)

  No reason to keep going one at a time when we know there are generally
  better ways to handle loading a power-of-two number of low lanes.

  This strategy scales up too, with quick answers for 8 (one 8 byte
  load), 12 (one 8 byte, one 4 byte), etc.

  $ ninja -C out monobench; and out/monobench SkRasterPipeline_compile 300
    Before: 46.946ns
    After:  43.341ns

  (This happens to be _lowp. Expect similar small speedups elsewhere.)

  Change-Id: I08f87769ea3c9f06ad13d2b1d5326e542b9b63a8
  Reviewed-on: https://skia-review.googlesource.com/20903
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
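
  One way to picture the power-of-two strategy: split the tail count into its binary digits and issue one copy per set bit, largest first. This is a hedged sketch, not the code from the CL; `load_tail`, the 8-lane width, and the 32-bit pixel type are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load the low `tail` pixels (1..7) of an 8-lane vector and zero the rest.
// Instead of copying one pixel at a time, copy a block of 4, then 2, then 1
// according to the bits of `tail`, so at most three copies are needed.
inline void load_tail(const uint32_t* src, size_t tail, uint32_t dst[8]) {
    std::memset(dst, 0, 8 * sizeof(uint32_t));
    size_t i = 0;
    if (tail & 4) {                       // one 16-byte copy
        std::memcpy(dst + i, src + i, 4 * sizeof(uint32_t));
        i += 4;
    }
    if (tail & 2) {                       // one 8-byte copy
        std::memcpy(dst + i, src + i, 2 * sizeof(uint32_t));
        i += 2;
    }
    if (tail & 1) {                       // one 4-byte copy
        dst[i] = src[i];
    }
}
```

  A tail of 6, for example, becomes one 4-pixel copy followed by one 2-pixel copy, and never reads past `src[5]`.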
* lean more on the compiler in lowp stages (Mike Klein, 2017-06-26)

  This refactors {from,to}_{byte,8888} to lean a bit more on the
  compiler, and to share code between the two. The algorithm is not
  exactly the same, but it's comparable, and the results of course are
  identical.

  This new algorithm is a lot easier to generalize to AVX2, and parallels
  the full-precision {from,to}_{byte,8888} functions in _stages.cpp.

  Change-Id: I31ea90d65967bf4ede2497d1e2197cb0e7648bf8
  Reviewed-on: https://skia-review.googlesource.com/20828
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>