| Commit message | Author | Age |
... | |
| |
The flag is no longer used.
Change-Id: I39156ef5683538263c2302f2fe3ba779e55dbc47
Reviewed-on: https://skia-review.googlesource.com/38360
Commit-Queue: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
This extra ld pass can merge all our many redundant constants,
both within an instruction set and across them.
This should save a bunch of code size on x86-64, with no other impact.
It cuts 12K off my local build of ok.
Change-Id: Ib2bb4adf88564aca45e55ee53dcf6584265c7dbe
Reviewed-on: https://skia-review.googlesource.com/37940
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
Change-Id: Id626f954fe45546a015a1bd423f19cca5f8967a9
Reviewed-on: https://skia-review.googlesource.com/37861
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
| |
Minor speedup.
Before:
10212.01 ? blendmode_rect_ColorBurn 8888
9216.78 ? blendmode_rect_ColorDodge 8888
After:
9635.44 ? blendmode_rect_ColorBurn 8888
8820.22 ? blendmode_rect_ColorDodge 8888
Change-Id: I9e8a9aa21e2370de3174c31821fb0676260d2643
Reviewed-on: https://skia-review.googlesource.com/37620
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Florin Malita <fmalita@chromium.org>
| |
Things ran slower when we attempted to turn it on,
and we've already removed the analog in SkJumper_stages.cpp.
Change-Id: I61afa38990bf54d1bff2b1902f09a14df4e17da9
Reviewed-on: https://skia-review.googlesource.com/37080
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I346429015e5f902b0a35663e140bb9a025c4220e
Reviewed-on: https://skia-review.googlesource.com/34680
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Before:
micros bench
7669.09 ? blendmode_rect_HardLight 8888
8707.13 ? blendmode_rect_Overlay 8888
After:
micros bench
6679.60 ? blendmode_rect_HardLight 8888
6789.57 ? blendmode_rect_Overlay 8888
Change-Id: I52f389253fa07dafe18e572af550af7387264a16
Reviewed-on: https://skia-review.googlesource.com/34280
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@google.com>
| |
Change-Id: I88f3e56971e9844ab2ff74edb0718e6b6e9c6559
Reviewed-on: https://skia-review.googlesource.com/34260
Reviewed-by: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
Commit-Queue: Florin Malita <fmalita@chromium.org>
| |
We can fold through some math in these two modes.
$ out/ok bench:samples=100 rp filter:search="Difference|Exclusion" serial
Before:
[blendmode_rect_Exclusion] 4.94ms @0 6.13ms @99 6.25ms @100
[blendmode_mask_Exclusion] 10.9ms @0 12.8ms @99 12.9ms @100
[blendmode_rect_Difference] 5.56ms @0 6.79ms @99 6.8ms @100
[blendmode_mask_Difference] 11.4ms @0 13.8ms @99 14.1ms @100
After:
[blendmode_rect_Exclusion] 3.5ms @0 4.12ms @99 4.59ms @100
[blendmode_mask_Exclusion] 9.27ms @0 11.2ms @99 11.6ms @100
[blendmode_rect_Difference] 5.37ms @0 6.58ms @99 6.6ms @100
[blendmode_mask_Difference] 11ms @0 12.1ms @99 12.6ms @100
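As an illustration of the kind of folding meant here (a sketch, not the code from this change): for premultiplied colors, the general separable-blend form of Exclusion collapses algebraically to s + d - 2*s*d, because the da and sa terms cancel. The function names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// General premultiplied form of a separable blend:
//   s*(1-da) + d*(1-sa) + blend term,
// where the Exclusion blend term is s*da + d*sa - 2*s*d.
static float exclusion_unfolded(float s, float sa, float d, float da) {
    return s*(1 - da) + d*(1 - sa) + (s*da + d*sa - 2*s*d);
}

// After cancelling the s*da and d*sa terms, alpha drops out entirely.
static float exclusion_folded(float s, float d) {
    return s + d - 2*s*d;
}
```

The folded form needs neither source nor destination alpha, which is presumably where the savings come from.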
Change-Id: I03f32368244d4f979cfee83723fd78dfbc7d5fc1
Reviewed-on: https://skia-review.googlesource.com/33980
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
Change-Id: I5773cf831c7e41a932bee1f2c6830085fb7db025
Reviewed-on: https://skia-review.googlesource.com/33764
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@google.com>
| |
Chromium uses the lowp code, so we have to stage the changes.
TBR=
Change-Id: I45e97a51eca285c9afc71926bbf736a03d0d146c
Reviewed-on: https://skia-review.googlesource.com/33765
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Florin Malita <fmalita@chromium.org>
| |
Change-Id: I4bf618ad8728541fcef3fc1c6aa5b3ca106d50dc
Reviewed-on: https://skia-review.googlesource.com/33583
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
They appear to be slower than the generic load() and store() now.
[blendmode_mask_Hue] 14.7ms @0 15.6ms @95 39.6ms @100
[blendmode_rect_Hue] 31.5ms @0 37.6ms @95 39.5ms @100
~~>
[blendmode_mask_Hue] 14.7ms @0 15.2ms @95 39.5ms @100
[blendmode_rect_Hue] 30.5ms @0 32.6ms @95 37.8ms @100
Change-Id: I674b75087b8139debead71f3016631bcb0cb0047
Reviewed-on: https://skia-review.googlesource.com/33800
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This basically unrolls all loops, handling twice as many pixels in a
stride. We now pass around 4 native registers instead of just 2.
I've temporarily disabled AVX2 mask loads and stores. It shouldn't be
hard to turn them back on, but I'd want to test on AVX2 hardware first.
Change-Id: I0907070f086a0650167456c149a479c1d96b8a2d
Reviewed-on: https://skia-review.googlesource.com/33361
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
I tried to follow exactly the same strategy as a start.
(Though I did fix the off-by-one dimensions.)
It does rather look like we only need 3D and 4D now
that I've looked at the call sites.
Looks like about a 20% speedup.
Change-Id: I8b1af64750ad1750716ee1ab0767e64591c7206a
Reviewed-on: https://skia-review.googlesource.com/32842
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
| |
Until now we've been using 3 separate parametric stages to apply
gamma to r,g,b. That works fine, but is kind of unnecessarily
slow, and again less clear in a stack trace than seeing "gamma".
The new bench runs in about 60% of the time the old one does
on my Trashcan.
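A rough sketch of the difference, assuming the usual 7-parameter transfer function form; the struct and function names are hypothetical and scalars stand in for vectors. Pure gamma is the special case a=1 with every other coefficient 0, so one stage can apply a single pow() to all three channels.

```cpp
#include <cassert>
#include <cmath>

// Assumed parametric form: (a*x + b)^g + e for x >= d, else c*x + f.
struct Parametric { float g, a, b, c, d, e, f; };

static float parametric(float x, const Parametric& p) {
    return x < p.d ? p.c*x + p.f
                   : std::pow(p.a*x + p.b, p.g) + p.e;
}

struct RGB { float r, g, b; };

// One "gamma" stage: the same exponent applied to r, g and b in one pass.
static RGB gamma_stage(RGB c, float g) {
    return { std::pow(c.r, g), std::pow(c.g, g), std::pow(c.b, g) };
}
```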
BUG=skia:6939
Change-Id: I079698d3009b081f1c23a2e27fc26e373b439610
Reviewed-on: https://skia-review.googlesource.com/32721
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This will be faster, but maybe more importantly it helps make debugging
a stack trace clearer. It's confusing to see a "parametric transfer
function" stage followed by a table transfer function stage...
This leads to a little bit of cleanup in SkColorSpaceXform_A2B.
I am uncertain whether we still need parametric_a. I need to do some
more tracing through the code before I'd say it's impossible to reach in
addTransferFn().
Change-Id: I52e85019f92d012a3086fc94cf64ae6c9307ea94
Reviewed-on: https://skia-review.googlesource.com/32040
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: Ifef97d8f28c88c4ee3f7701aac6e383940ed5275
Reviewed-on: https://skia-review.googlesource.com/31020
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: Icc4b06094aeba3af99b534746f66286d776ef78a
Reviewed-on: https://skia-review.googlesource.com/30920
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
It's funny how now that I'm on a machine that doesn't support AVX2,
it's suddenly important for me that pack() is optimized for SSE!
This is basically the same as this morning, without any weird AVX2
pack ordering issues. This replaces something like
movdqa 2300(%rip), %xmm0
pshufb %xmm0, %xmm3
pshufb %xmm0, %xmm2
punpcklqdq %xmm3, %xmm2
(This is SSE4.1; the SSE2 version is worse.)
with
psrlw $8, %xmm3
psrlw $8, %xmm2
packuswb %xmm3, %xmm2
(SSE2 and SSE4.1 both.)
It's always nice to not need to load a shuffle mask out of memory.
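For reference, the shorter sequence can be written with SSE2 intrinsics roughly as below. This is a sketch rather than the pack() from this change; the function name is made up, and a scalar fallback is included so it also builds on non-x86 targets.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
    #include <emmintrin.h>
#endif

// Keep the high byte of each 16-bit lane: 8 lanes from lo, then 8 from hi.
static void pack_high_bytes(const uint16_t lo[8], const uint16_t hi[8],
                            uint8_t out[16]) {
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lo));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hi));
    a = _mm_srli_epi16(a, 8);                        // psrlw $8
    b = _mm_srli_epi16(b, 8);                        // psrlw $8
    // After the shift every lane is <= 255, so the saturation never kicks in.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
                     _mm_packus_epi16(a, b));        // packuswb
#else
    for (int i = 0; i < 8; i++) {
        out[i]     = static_cast<uint8_t>(lo[i] >> 8);
        out[i + 8] = static_cast<uint8_t>(hi[i] >> 8);
    }
#endif
}
```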
Change-Id: I56fb30b31fcedc0ee84a4a71c483a597c8dc1622
Reviewed-on: https://skia-review.googlesource.com/30583
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This makes loading them much simpler in 8-bit mode.
Change-Id: I35ff34ebd0b93425c4e39e055bf4ade8cf8561e1
Reviewed-on: https://skia-review.googlesource.com/30621
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This is a consistent, very small speedup for srcover.
SkRasterPipeline_run
Before: 30.4057ns
After: 30.1089ns
i.e. a 1% speedup on the bench, maybe 3-4% improvement in srcover itself.
The only reason I'd send this out now is that this will slightly change
some pixels, so it's a good thing to sneak in before rebaselining.
It's possible that other blend modes would benefit from the same, but
I've only looked at srcover (and I've also changed dstover so that it
doesn't look funny).
Change-Id: Ic056ca0912d76648d43a78e0052176fd0f7934f1
Reviewed-on: https://skia-review.googlesource.com/30281
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
__builtin_convertvector(..., U8x4) is producing a fairly long
sequence of code to convert U16x4 to U8x4 on HSW:
vextracti128 $0x1,%ymm2,%xmm3
vmovdqa 0x1848(%rip),%xmm4
vpshufb %xmm4,%xmm3,%xmm3
vpshufb %xmm4,%xmm2,%xmm2
vpunpcklqdq %xmm3,%xmm2,%xmm2
vextracti128 $0x1,%ymm0,%xmm3
vpshufb %xmm4,%xmm3,%xmm3
vpshufb %xmm4,%xmm0,%xmm0
vpunpcklqdq %xmm3,%xmm0,%xmm0
vinserti128 $0x1,%xmm2,%ymm0,%ymm0
We can do much better with _mm256_packus_epi16:
vinserti128 $0x1,%xmm0,%ymm2,%ymm3
vperm2i128 $0x31,%ymm0,%ymm2,%ymm0
vpackuswb %ymm0,%ymm3,%ymm0
vpackuswb packs the values in a somewhat surprising order,
which the first two instructions get us lined up for.
This is a pretty noticeable speedup, 7-8% on some benchmarks.
The same sort of change could be made for SSE2 and SSE4.1 also
using _mm_packus_epi16, but the difference for that change is
much less dramatic. Might as well stick to focusing on HSW.
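The "surprising order" is that the 256-bit pack works within each 128-bit half independently. Below is a portable scalar model of that behavior and of the half-swapping fix, so it runs without AVX2 hardware. It is a model only, with hypothetical names, and not the intrinsics code from this change.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using I16x16 = std::array<int16_t, 16>;  // stands in for a ymm of 16-bit lanes
using U8x32  = std::array<uint8_t, 32>;  // stands in for a ymm of bytes

// Signed 16-bit to unsigned 8-bit saturation, as packuswb does per lane.
static uint8_t sat8(int16_t v) {
    return static_cast<uint8_t>(v < 0 ? 0 : v > 255 ? 255 : v);
}

// Model of the 256-bit vpackuswb: result is [a.lo, b.lo, a.hi, b.hi],
// not the [a.lo, a.hi, b.lo, b.hi] one might expect.
static U8x32 packus_256(const I16x16& a, const I16x16& b) {
    U8x32 r{};
    for (int i = 0; i < 8; i++) {
        r[i]      = sat8(a[i]);
        r[i + 8]  = sat8(b[i]);
        r[i + 16] = sat8(a[i + 8]);
        r[i + 24] = sat8(b[i + 8]);
    }
    return r;
}

// Regroup the 128-bit halves first so the pack comes out in order.
static U8x32 pack_in_order(const I16x16& a, const I16x16& b) {
    I16x16 lo_halves{}, hi_halves{};
    for (int i = 0; i < 8; i++) {
        lo_halves[i] = a[i];     lo_halves[i + 8] = b[i];      // like vinserti128
        hi_halves[i] = a[i + 8]; hi_halves[i + 8] = b[i + 8];  // like vperm2i128 $0x31
    }
    return packus_256(lo_halves, hi_halves);
}
```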
Change-Id: I0d6765bd67e0d024d658a61d19e6f6826b4d392c
Reviewed-on: https://skia-review.googlesource.com/30420
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
I think we can replace a lot of legacy code with an SkRasterPipeline
backend that works in 8-bit and stays interlaced. Think of this as a
"lowerp" replacement for lowp.
I'm having some trouble getting ARMv8 working.
ARMv7 should be fine, but I want to turn it on separately from x86.
I haven't looked at 32-bit x86 yet, but that's also on the todo list.
Open questions to follow up on:
- is it better to fold every multiply back down to 8-bit
(as seen here), or to allow intermediates to accumulate
in 16-bit and divide by 255 when done/needed?
- is it better to pass tightly packed 8-bit vectors between stages (as
seen here), or to keep the 8-bit values unpacked in 16-bit lanes?
- should we make V wider than 1 register?
GMs look good. All diffs invisible and plausibly due to the 15->8 bit
precision drop. A quick bench run showed this running in about 0.75x
the time of the existing lowp backend.
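To make the first open question concrete, here is a sketch comparing one way of folding a multiply straight back to 8 bits against the exact rounded divide by 255. The shift-based formula is an assumption chosen for illustration; this message does not say which approximation the backend uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Exact rounded product of two unorm8 values.
static uint8_t mul_exact(uint8_t x, uint8_t y) {
    return static_cast<uint8_t>((x*y + 127) / 255);
}

// Fold back down to 8 bits with a shift instead of a divide by 255.
// (Assumed approximation, for illustration only.)
static uint8_t mul_fold(uint8_t x, uint8_t y) {
    return static_cast<uint8_t>((x*y + x) >> 8);
}

// Largest difference between the two over all 65536 input pairs.
static int max_error() {
    int worst = 0;
    for (int x = 0; x < 256; x++) {
        for (int y = 0; y < 256; y++) {
            int e = std::abs(mul_exact(static_cast<uint8_t>(x), static_cast<uint8_t>(y)) -
                             mul_fold (static_cast<uint8_t>(x), static_cast<uint8_t>(y)));
            worst = std::max(worst, e);
        }
    }
    return worst;
}
```

With this formula a multiply by 255 is an exact identity and the result never differs from the exact one by more than 1, though such errors can add up over several stages, which is the tradeoff the question is about.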
Change-Id: I24aa46ff1d19c0b9b8dc192d5b1821cab0b8843c
Reviewed-on: https://skia-review.googlesource.com/29886
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
If we were doing this math with real numbers or even just doubles, these
clamps wouldn't be necessary. But we're favoring speed over accuracy
here when we emulate fmod() and some of those inaccuracies end up with
values outside the [0,tile) range, sometimes even negative!
To keep the spirit of fast over 100% accurate, I've just added a safety
clamp to 0. The case in the unit test now returns 0 where it should
really return something like 7 or 8, but at least we won't try to read
_way_ outside the image buffer.
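A minimal sketch of the idea, assuming a reciprocal-multiply emulation of fmod(); the real stage may compute it differently and these function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Fast, approximate fmod(x, tile): multiply by a reciprocal instead of
// dividing. Rounding in x*inv can push the result slightly out of range.
static float repeat_fast(float x, float tile) {
    float inv = 1.0f / tile;
    return x - std::floor(x * inv) * tile;
}

// Same thing with the safety clamp: never return a negative coordinate.
static float repeat_clamped(float x, float tile) {
    return std::max(0.0f, repeat_fast(x, tile));
}
```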
BUG=chromium:749260
Change-Id: Ifc5cfe69798beccbb2a16547510158576e06eb3a
Reviewed-on: https://skia-review.googlesource.com/29580
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
ARMv7 can pass 16 floats as function arguments. We've been slicing that
as 8 2-float vectors. This CL switches to 4 4-float vectors.
We'll now operate on 4 pixels at a time instead of 2, at the expense of
keeping the d-vectors (mostly used for blending) on the stack. It'll
be interesting to see how this plays out performance-wise.
One nice side effect is now both ARMv7 and ARMv8 use 4-float NEON
vectors. Most of the code is now shared, with just a couple checks
to use new instructions added in ARMv8.
It looks like we do see a ~15% win:
$ bin/droid out/monobench SkRasterPipeline_srgb 200
Before: 644.029ns
After: 547.301ns
ARMv8: 453.838ns (just for reference)
Change-Id: I184ff29a36499e3cdb4c284809d40880b02c2236
Reviewed-on: https://skia-review.googlesource.com/27701
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
There are not many registers on 32-bit x86, and we're using most to pass
Stage function arguments. This means few are available as temporaries,
and we're forced to hit the stack all the time. xmm registers are the
most egregious example: we use all 8 registers to pass data, leaving none
free as temporaries.
This CL cuts things down pretty dramatically, from passing 5 general
purpose and 8 xmm registers to 2 general purpose and 4 xmm registers.
One of the two general purpose registers is a pointer to space on the
stack where we store all those other values.
Every stage function needs to use the program pointer, so that stays in
a general purpose register. Almost every stage uses the r,g,b,a
vectors, so they stay in xmm registers. The rest (destination x,y, the
tail mask, a pointer to tricky constants, and the dr,dg,db,da vectors)
now live on the stack.
The generated code is about 20K smaller and runs about 20% faster.
$ out/monobench SkRasterPipeline_srgb 200
Before: 358.784ns
After: 282.563ns
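The split can be pictured roughly like this. It is a sketch with hypothetical names; scalars stand in for xmm vectors, and a real stage would tail-call the next stage rather than return.

```cpp
#include <cassert>
#include <cstddef>

// Values nearly every stage touches: passed directly (in xmm registers).
struct Hot { float r, g, b, a; };

// Everything else: kept in stack space and reached through one pointer.
struct Cold {
    size_t x, y, tail;
    const void* constants;
    float dr, dg, db, da;
};

// A stage receives two pointers plus the hot colors.
static Hot srcover(void** program, Cold* cold, Hot c) {
    (void)program;  // a real stage would read its context and next stage here
    float inv = 1.0f - c.a;
    return { c.r + cold->dr*inv, c.g + cold->dg*inv,
             c.b + cold->db*inv, c.a + cold->da*inv };
}
```

Only blending stages like this one pay for loading the d-vectors from memory; stages that never look at them no longer tie up registers carrying them along.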
Change-Id: Icc117af95c1a81c41109984b32e0841022f0d1a6
Reviewed-on: https://skia-review.googlesource.com/27620
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: Icae3c6ce80a0bef097ea1010a4d065cc9d5a4c88
Reviewed-on: https://skia-review.googlesource.com/27560
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
[√] convert all stages to use SkJumper_MemoryCtx / be 2d-compatible
[√] convert compile to 2d also, remove 1d run/compile
[√] convert all call sites
[√] no diffs
Change-Id: I3b806eb8fe0c3ec043359616409f7cd1211a1e43
Reviewed-on: https://skia-review.googlesource.com/24263
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
This should avoid use of x18.
BUG=skia:6873
Change-Id: Iffafe0a48784b03942325517a999ad9bb44c1f99
Reviewed-on: https://skia-review.googlesource.com/25180
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
raster-only
Bug: skia:
Change-Id: I3af19f031083c9cc258f73ba6a2f6020bb15f110
Reviewed-on: https://skia-review.googlesource.com/24400
Commit-Queue: Mike Reed <reed@google.com>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
gather_i8 is now unused, so we can remove it.
That in turn makes the ctable field of SkJumper_GatherCtx unused.
After removing ctable, SkJumper_GatherCtx and SkJumper_PtrStride look
identical, so I've now fused them into SkJumper_MemoryCtx, which will
eventually be used by everything loading from, gathering from, or
storing to memory.
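A sketch of what such a fused context might look like and how a stage could address a pixel with it. The field names, the stride being counted in pixels, and the helper are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: a base pointer plus a row stride measured in pixels.
struct MemoryCtx {
    void* pixels;
    int   stride;
};

// Address of pixel (x,y) for a given pixel type T (hypothetical helper).
template <typename T>
static T* ptr_at_xy(const MemoryCtx* ctx, int x, int y) {
    return static_cast<T*>(ctx->pixels) + y*ctx->stride + x;
}
```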
Change-Id: Ia882d2dbd54c9fcf9a8250a1ce83304389dd284a
Reviewed-on: https://skia-review.googlesource.com/24085
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
- Add run_2d(x,y,w,h) and start_pipeline_2d().
- Add and test a 2d-compatible store_8888_2d stage.
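Conceptually, a 2D run is a loop over rows of the existing 1D run. A minimal sketch under that assumption follows; the signatures, the Image struct and the row function are hypothetical and not the API added here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A row function handles n pixels starting at (x,y).
using RowFn = void (*)(size_t x, size_t y, size_t n, void* ctx);

// Run the row function over the w-by-h rectangle whose top-left is (x,y).
static void run_2d(size_t x, size_t y, size_t w, size_t h, RowFn row, void* ctx) {
    for (size_t dy = 0; dy < h; dy++) {
        row(x, y + dy, w, ctx);
    }
}

struct Image { uint32_t* pixels; size_t stride; };  // stride in pixels

// Stand-in for a 2d-compatible 8888 store: writes opaque white.
static void store_row(size_t x, size_t y, size_t n, void* ctx) {
    auto* img = static_cast<Image*>(ctx);
    uint32_t* dst = img->pixels + y*img->stride + x;
    for (size_t i = 0; i < n; i++) {
        dst[i] = 0xFFFFFFFFu;
    }
}
```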
Change-Id: Ib9c225d1b8cb40471ae4333df1d06eec4d506f8a
Reviewed-on: https://skia-review.googlesource.com/24401
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
- in _lowp.cpp, JUMPER is always defined, so no need to check.
- the return type of this function has been void for a while.
Change-Id: I5271e8dab784f46c7ffa9cfba6eb55b5e399b537
Reviewed-on: https://skia-review.googlesource.com/24326
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
The most interesting part of this is getting the call to start_pipeline
to work. From there it should be just like the other x86 backend.
The 32-bit calling conventions are the same across Linux/Mac and
Windows, so that's nice. The tricky bit is that Linux and Mac
align the stack to 16 bytes, while Windows only to 4. I think
this force_align_arg_pointer attribute on start_pipeline does the trick.
This needs a guard for layout tests.
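For reference, the attribute is applied to the entry point roughly as follows. This is a sketch with made-up macro and function names; the attribute is a GCC/Clang extension for x86, so it is guarded here and expands to nothing elsewhere.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__i386__) && (defined(__GNUC__) || defined(__clang__))
    // Realign the stack on entry, whatever alignment the caller provided.
    #define ALIGN_STACK_ON_ENTRY __attribute__((force_align_arg_pointer))
#else
    #define ALIGN_STACK_ON_ENTRY
#endif

// Hypothetical stand-in for start_pipeline: everything called from here
// can then assume a 16-byte aligned stack for its vector spills.
ALIGN_STACK_ON_ENTRY
static bool entry_point_sketch() {
    alignas(16) float lanes[4] = {0, 1, 2, 3};
    return reinterpret_cast<uintptr_t>(lanes) % 16 == 0;
}
```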
CQ_INCLUDE_TRYBOTS=skia.primary:Test-Win2k8-MSVC-GCE-CPU-AVX2-x86-Debug;master.tryserver.blink:win10_blink_rel,win7_blink_rel;master.tryserver.chromium.win:win_chromium_rel_ng
Change-Id: Ia74d22e5a4ce5483c9817b8a8f89dd21885bbd14
Reviewed-on: https://skia-review.googlesource.com/20968
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Reed <reed@google.com>
| |
histogram of test skps:
black: 1/7
white: 2/7
other: 4/7
Bug: skia:
Change-Id: I3a092899d31ce87837e66e5c8ea9ec5e0f239361
Reviewed-on: https://skia-review.googlesource.com/21408
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Reed <reed@google.com>
| |
Bug: skia:
Change-Id: I671e07c5bbb9e4ced92303c9959143324f7a6bdc
Reviewed-on: https://skia-review.googlesource.com/21523
Commit-Queue: Mike Reed <reed@google.com>
Reviewed-by: Herb Derby <herb@google.com>
| |
A couple of annoyances here:
1) the prev vector_scale stage is not usable for masking, as NaN values can propagate through
=> switch to actual masking
2) for the outside case, we must select the min root when the gradient is flipped
=> split into two templated stages (_min, _max)
(I'm not convinced that we need to flip the gradient for RP at all; we can investigate later)
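Point 1 comes down to NaN * 0 still being NaN, while a bitwise AND with an all-zero mask always yields 0. A scalar sketch of the two approaches, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Scaling by a 0/1 float: a NaN input survives, since NaN * 0 is NaN.
static float scale_by(float v, float zero_or_one) {
    return v * zero_or_one;
}

// Actual masking: AND the bits with 0 or 0xFFFFFFFF. NaN bits are cleared
// just like any others.
static float mask_by(float v, uint32_t all_or_nothing) {
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    bits &= all_or_nothing;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}
```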
Change-Id: I0283812d613a53124f2987d1aea1f26e4533655e
Reviewed-on: https://skia-review.googlesource.com/21162
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Florin Malita <fmalita@chromium.org>
| |
When the focal point is on the edge of the end circle, the quadratic
equation devolves to linear. Add a stage to handle this case.
As a complication, this case can produce "degenerate" values:
1) t == NaN
2) R(t) < 0
For these, we're supposed to draw transparent black - which means
overwriting the color from the gradient stage. To support this, build
a 0/1 vector mask in the context, and apply it post-gradient-stage.
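In scalar terms, the linear case and the degenerate test might look like the sketch below. The names are hypothetical, and R(t) is assumed to vary linearly in t between the two radii; the actual stage works on vectors and stores the mask in its context.

```cpp
#include <cassert>
#include <cmath>

// With the t^2 coefficient gone, a*t^2 + b*t + c = 0 reduces to b*t + c = 0.
static float solve_linear(float b, float c) {
    return -c / b;  // yields NaN when b and c are both 0
}

// 1 for well-defined pixels, 0 for degenerate ones (t is NaN or R(t) < 0).
// Assumes R(t) = r0 + t*dr.
static float degenerate_mask(float t, float r0, float dr) {
    float rt = r0 + t*dr;
    return (std::isnan(t) || rt < 0) ? 0.0f : 1.0f;
}
```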
Change-Id: Ice4e3243abfd8c784bb810f6c310aed7a4ac7dc8
Reviewed-on: https://skia-review.googlesource.com/21111
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@google.com>
| |
I _think_ this makes it so changes to _stages.cpp or _lowp.cpp get
noticed, regenerated, and baked into Skia all in the same Ninja
invocation.
Now you just need to set up the tools we use in GN:
skia_jumper_clang = ...
skia_jumper_objdump = ...
skia_jumper_ccache = ...
Change-Id: I09fb54d965644ff6e5825056fb0be2c7cab2ea92
Reviewed-on: https://skia-review.googlesource.com/21140
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Initial impl, for the well-behaved case (focal point inside).
MBP numbers -
Before:
3365.87 ! gradient_conical_clamp_shallow srgb
3590.88 ! gradient_conical_clamp_shallow_dither srgb
3376.91 ! gradient_conical_clamp_3color srgb
3351.64 ! gradient_conical_clamp_hicolor srgb
3379.35 ! gradient_conical_clamp srgb
After:
648.93 ! gradient_conical_clamp_shallow srgb
665.12 ! gradient_conical_clamp_shallow_dither srgb
773.98 ! gradient_conical_clamp_3color srgb
1175.35 ! gradient_conical_clamp_hicolor srgb
619.17 ! gradient_conical_clamp srgb
Change-Id: I07b22a758363e1f340a6041bca53bdef74229eb9
Reviewed-on: https://skia-review.googlesource.com/20906
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
Looks like Clang/Win is defining __i386__, but we're not linking in
stage functions (they don't exist yet for Windows).
Change-Id: I78fdd3e1d89020bc6c64bc1cd5dfb3fbca720b2e
Reviewed-on: https://skia-review.googlesource.com/21103
Commit-Queue: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Mike Klein <mtklein@google.com>
| |
This is a start to eliminating swap_rb as a stage.
I've just hit the main hot spots here. Going to look into
the ~dozen other spots to see how they should work next.
Change-Id: I26fb46a042facf7bd6fff3b47c9fcee86d7142fd
Reviewed-on: https://skia-review.googlesource.com/20982
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Reed <reed@google.com>
| |
Change-Id: I25619f010f8ac6441529cfe8dff2d8c42d7400cf
Reviewed-on: https://skia-review.googlesource.com/20988
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Bug: skia:
Change-Id: I75d82ef2226c5f116b7de2208c4e914739414b6d
Reviewed-on: https://skia-review.googlesource.com/20984
Commit-Queue: Mike Reed <reed@google.com>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
Generally stages take care of state setup themselves, either with
seed_shader, constant_color, a load, etc. I think these zeros may
be unnecessarily cautious.
This can't make anything draw more correctly, but it could make things
- draw wrong
- draw more slowly
- draw more quickly
so it's an interesting thing to try and keep an eye on.
Change-Id: I7e5ea3cd79e55a65e1dbd214601e147ba3815b87
Reviewed-on: https://skia-review.googlesource.com/20976
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Everything uses a ton of stack, nothing tail calls, and for now this is
non-Windows only. But, it does run faster than the portable serial code.
On my trashcan, running `monobench SkRasterPipeline_compile`:
- Normal 64-bit AVX build: 43.6ns
- Before this CL, 32-bit: 707.9ns
- This CL: 147.5ns
Change-Id: I4a8929570ace47193ed8925c58b70bb22d6b1447
Reviewed-on: https://skia-review.googlesource.com/20964
Reviewed-by: Mike Reed <reed@google.com>
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
CQ_INCLUDE_TRYBOTS=skia.primary:Build-Ubuntu-Clang-x86_64-Debug-MSAN
Change-Id: Id53279c17589b3434629bb644358ee238af8649f
Reviewed-on: https://skia-review.googlesource.com/20269
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Mike Reed <reed@google.com>
| |
No reason to keep going one at a time when we know there are generally
better ways to handle loading a power-of-two number of low lanes.
This strategy scales up too, with quick answers for 8 (one 8 byte load),
12 (one 8 byte, one 4 byte), etc.
$ ninja -C out monobench; and out/monobench SkRasterPipeline_compile 300
Before: 46.946ns
After: 43.341ns
(This happens to be _lowp. Expect similar small speedups elsewhere.)
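One way to picture the strategy is a sketch on byte lanes with a hypothetical helper; the real code works on vector registers and may break the tail down differently. The tail count splits into power-of-two chunks, each moved with a single wider copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load the low `tail` lanes (1..7) of an 8-lane vector, zeroing the rest.
// At most three copies are needed instead of up to seven single-lane loads.
static void load_tail(const uint8_t* src, size_t tail, uint8_t dst[8]) {
    std::memset(dst, 0, 8);
    size_t i = 0;
    if (tail & 4) { std::memcpy(dst + i, src + i, 4); i += 4; }  // one 4-lane load
    if (tail & 2) { std::memcpy(dst + i, src + i, 2); i += 2; }  // one 2-lane load
    if (tail & 1) { dst[i] = src[i]; }                           // last odd lane
}
```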
Change-Id: I08f87769ea3c9f06ad13d2b1d5326e542b9b63a8
Reviewed-on: https://skia-review.googlesource.com/20903
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This refactors {from,to}_{byte,8888} to lean a bit more on the compiler,
and to share code between the two. The algorithm is not exactly the
same, but it's comparable, and the results of course are identical.
This new algorithm is a lot easier to generalize to AVX2, and parallels
the full-precision {from,to}_{byte,8888} functions in _stages.cpp.
Change-Id: I31ea90d65967bf4ede2497d1e2197cb0e7648bf8
Reviewed-on: https://skia-review.googlesource.com/20828
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
|