| Commit message (Collapse) | Author | Age |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This lets them pick up runtime CPU specializations. Here I've plugged in SSE4.1. This is still one of the N prelude CLs to full 8-at-a-time AVX.
I've moved the union of the stages used by SkRasterPipelineBench and SkRasterPipelineBlitter to SkOpts... they'll all be used by the blitter eventually. Picking up SSE4.1 specialization here (even still just 4 pixels at a time) is a significant speedup, especially to store_srgb(), so much that it's no longer really interesting to compare against the fused-but-default-instruction-set version in the bench. So that's gone now.
That left the SkRasterPipeline unit test as the only other user of the EasyFn simplified interface to SkRasterPipeline. So I converted that back down to the bare-metal interface, and EasyFn and its friends became SkRasterPipeline_opts.h exclusive abbreviations (now called Kernel_Sk4f). This isn't really unexpected: SkXfermode also wanted to build up its own little abstractions, and once you build your own abstraction, the value of an additional EasyFn-like layer plummets to negative.
For simplicity I've left the SkXfermode stages alone, except srcover() which was always part of the blitter. No particular reason except keeping the churn down while I hack. These _can_ be in SkOpts, but don't have to be until we go 8-at-a-time.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2752
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Change-Id: I3b476b18232a1598d8977e425be2150059ab71dc
Reviewed-on: https://skia-review.googlesource.com/2752
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a less generally applicable trick than I have previously hoped. The need to thread through contexts into each stage really means you can only include one context-dependent stage in each fused batch.
We can still manually fuse these, of course, as you can see in SkRasterPipelineBench. It's just that we can't really write a generic compile-time template to do it except for context-free stages. And since we can't write a generic version, and I have only this one specific use case right now, I've kept it quite specific to that use case.
This does work pretty well for this use case, though. Here's the fused clamp-then-store-565:
+0x00 pushq %rbp
+0x01 movq %rsp, %rbp
+0x04 movq 8(%rdi), %rax
+0x08 xorps %xmm4, %xmm4
+0x0b maxps %xmm4, %xmm3
+0x0e maxps %xmm4, %xmm0
+0x11 maxps %xmm4, %xmm1
+0x14 maxps %xmm4, %xmm2
+0x17 minps 4262818(%rip), %xmm3
+0x1e minps %xmm3, %xmm0
+0x21 minps %xmm3, %xmm1
+0x24 minps %xmm3, %xmm2
+0x27 movaps 4965378(%rip), %xmm3
+0x2e mulps %xmm3, %xmm0
+0x31 cvtps2dq %xmm0, %xmm0
+0x35 pslld $11, %xmm0
+0x3a mulps 4965375(%rip), %xmm1
+0x41 cvtps2dq %xmm1, %xmm1
+0x45 pslld $5, %xmm1
+0x4a mulps %xmm3, %xmm2
+0x4d cvtps2dq %xmm2, %xmm2
+0x51 orpd %xmm0, %xmm2
+0x55 orpd %xmm1, %xmm2
+0x59 pshufb 4474510(%rip), %xmm2
+0x62 movq %xmm2, (%rax,%rsi,2)
+0x67 popq %rbp
+0x68 retq
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2745
Change-Id: Ia7d66aecc6cbff154158d2600d7874feed1a76f6
Reviewed-on: https://skia-review.googlesource.com/2745
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We used to step at a 4-pixel stride as long as possible, then run up to 3 times, one pixel at a time. Now replace those 1-at-a-time runs with a single tail stamp if there are 1-3 remaining pixels.
This style is simply more efficient: e.g. we'll blend and lerp once for 3 pixels instead of 3 times. This should make short blits significantly more efficient. It's also more future-oriented... AVX+ on Intel and SVE on ARM support masked loads and stores, so we can do the entire tail in one direct step.
This also makes it possible to re-arrange the code a bit to encapsulate each stage better. I think generally this code reads more clearly than the old code, but YMMV. I've arranged things so you write one function, but it's compiled into two specializations, one for tail=0 (Body) and one for tail>0 (Tail). It's pretty tidy.
For now I've just burned a register to pass around tail. It's 2 bits now, maybe soon 3 with AVX, and capped at 4 for even the craziest new toys, so there are plenty of places we can pack it if we want to get clever.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2717
Change-Id: I45852a3e5d4c5b5e9315302c46601aee0d32265f
Reviewed-on: https://skia-review.googlesource.com/2717
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Today if you use the simple SK_RASTER_STAGE interface to build a pipeline, each stage you add calls into a next stage. The last stage you add calls into a special backstop stage JustReturn that, well, just returns, ending the pipeline.
This adds last(), which cuts that last stage off the pipeline. Instead, the stage you add using last() returns directly, ending the pipeline itself without jumping into JustReturn.
This reduces the overhead of using the pipelined version of SkRasterPipelineBench from ~25% to ~20% on my desktop.
Also, add docs.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2713
Change-Id: I11469378e2765c6e34db52eb3eef648d6612da3f
Reviewed-on: https://skia-review.googlesource.com/2713
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I think we convinced ourselves that denorms, while a good chunk of half floats,
cover a rather small fraction of the representable range, which is always
close enough to zero to flush.
This makes both paths of the conversion to or from float considerably simpler.
These functions now work for zero-or-normal half floats (excluding infinite, NaN).
I'm not aware of a term for this class so I've called them "ordinary".
A handful of GMs and SKPs draw differently in --config f16, but all imperceptibly.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2256023002
Review-Url: https://codereview.chromium.org/2256023002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most visibly this adds a macro SK_RASTER_STAGE that cuts down on the boilerplate of defining a raster pipeline stage function.
Most interestingly, SK_RASTER_STAGE doesn't define a SkRasterPipeline::Fn, but rather a new type EasyFn. This function is always static and inlined, and the details of interacting with the SkRasterPipeline::Stage are taken care of for you: ctx is just passed as a void*, and st->next() is always called. All EasyFns have to do is take care of the meat of the work: update r,g,b, etc. and read and write from their context.
The really neat new feature here is that you can either add EasyFns to a pipeline with the new append() functions, _or_ call them directly yourself. This lets you use the same set of pieces to build either a pipelined version of the function or a custom, fused version. The bench shows this off.
On my desktop, the pipeline version of the bench takes about 25% more time to run than the fused one.
The old approach to creating stages still works fine. I haven't updated SkXfermode.cpp or SkArithmeticMode.cpp because they seemed just as clear using Fn directly as they would have using EasyFn.
If this looks okay to you I will rework the comments in SkRasterPipeline to explain SK_RASTER_STAGE and EasyFn a bit as I've done here in the CL description.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2195853002
Review-Url: https://codereview.chromium.org/2195853002
|
|
|
|
|
|
|
|
|
|
| |
This lets us get at logical >> in a nicely principled way.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2197683002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review-Url: https://codereview.chromium.org/2197683002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Should feel very similar to Sk4h_store4:
NEON uses its native instruction, SSE unpacks manually.
Since we'll have our F16s in 4 Sk4h by the time we're done here,
this also extracts an Sk4h->Sk4f routine from the old uint64_t->Sk4f one.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2184753002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review-Url: https://codereview.chromium.org/2184753002
|
|
|
|
|
|
|
|
|
| |
3 minor diffs to a couple GMs from fixed transfermodes (arithmetic and plus).
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2181033002
Review-Url: https://codereview.chromium.org/2181033002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This clamps to [0,1] premul just before every store to memory.
By making the clamp a stage itself, this design makes it easy to move the clamp
around, to replace it with a debug-only assert-we're-clamped stage for certain
formats, clamp in more places, programatically not clamp, etc. etc.
Before this change, clamping was a little haphazard: store_srgb clamped
R, G and B to [0,1], but not A, and didn't clamp the colors to A. 565
didn't clamp at all.
6 GMs draw subtly differently in sRGB, I think because we've started clamping
colors to alpha to enforce premultiplication better. No changes for 565.
My hope is that now no other stage need ever concern itself with clamping.
So we don't double-clamp, I've added a _noclamp version of sk_linear_to_srgb()
that simply asserts a clamp isn't necessary. This happens to expose the Sk4f
_needs_trunc version that might be useful for power users (*cough* Matt *cough*).
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2178793002
Review-Url: https://codereview.chromium.org/2178793002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is an experiment / demo to have our 565 backend fold into
SkRasterPipelineBlitter as it grows more powerful. I plan to follow up with
the same for the other 8888 format.
Blur mask filters look significantly different (better) after this change.
We keep the full 13-14-13 bits of precision for mask blits, where the old code
uses 11-11-10 bit intermediates.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2172343002
Review-Url: https://codereview.chromium.org/2172343002
|
|
|
|
|
|
|
|
|
|
|
| |
This is now pixel-exact with the existing sRGB SW impl as far as I've tested.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2146413002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug-ASAN-Trybot
Committed: https://skia.googlesource.com/skia/+/3011e337693a9786f62d8de9ac4b239ad6dbdaca
Review-Url: https://codereview.chromium.org/2146413002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://codereview.chromium.org/2146413002/ )
Reason for revert:
Leaking the blitter
https://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug-ASAN/builds/7908/steps/test_skia%20on%20Ubuntu/logs/stdio
Original issue's description:
> Add SkRasterPipeline blitter.
>
> This is now pixel-exact with the existing sRGB SW impl as far as I've tested.
>
> BUG=skia:
> GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2146413002
> CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/3011e337693a9786f62d8de9ac4b239ad6dbdaca
TBR=reed@google.com,mtklein@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review-Url: https://codereview.chromium.org/2178523002
|
|
This is now pixel-exact with the existing sRGB SW impl as far as I've tested.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2146413002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review-Url: https://codereview.chromium.org/2146413002
|