| Commit message | Author | Age |
| |
Chrome generally uses BGRA buffers, so srcover_rgba_8888 isn't really
doing them any good. Probably a good idea to cover both kN32 options
any time we specialize like this?
There's one small diff, so I've lazily guarded this by
SK_LEGACY_LOWP_STAGES, which I want to rebaseline today anyway.
Change-Id: Ice672aa01a3fc83be0798580d6730a54df075478
Reviewed-on: https://skia-review.googlesource.com/63301
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Reed <reed@google.com>
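For context on the two kN32 options mentioned above: RGBA and BGRA 8888 pixels differ only in which byte carries red and which carries blue. A minimal sketch, assuming a pixel packed into a uint32_t; swap_rb is a hypothetical helper for illustration, not a Skia function.

```cpp
#include <cstdint>

// Swap the red and blue bytes of a packed 8888 pixel, leaving green and alpha alone.
// Applied to an RGBA pixel this gives the BGRA encoding of the same color, and vice versa.
inline uint32_t swap_rb(uint32_t px) {
    return (px & 0xFF00FF00u)            // keep bytes 1 and 3 (green, alpha)
         | ((px & 0x000000FFu) << 16)    // byte 0 moves to byte 2
         | ((px & 0x00FF0000u) >> 16);   // byte 2 moves to byte 0
}
```

A stage specialized for one layout can serve the other if the swap is folded into its load and store, which is roughly why covering both layouts is cheap.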
| |
This fills out a couple more matrix and gather stages.
Deletes a not particularly important unit test that was using a
scale matrix in a weird, non-lowp compatible way.
This will require guards for Blink layout tests.
Change-Id: I54cb228ff541f771e8f4758f07d26c5161d48af3
Reviewed-on: https://skia-review.googlesource.com/62520
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This is a no-op refactor.
It's just always surprised me that the matrix_scale_translate
stage expects [tx ty sx sy], when scales precede the translates
in the names and in both normal row-major and column-major matrix
layouts.
This switches to [sx sy tx ty], scale then translate.
Change-Id: I2d88701121ae8013facd5a28bb0ff520211db5a6
Reviewed-on: https://skia-review.googlesource.com/62541
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
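The [sx sy tx ty] layout described in the commit above can be sketched as follows. Point and scale_translate are hypothetical names for illustration; only the ordering of the four floats comes from the commit message.

```cpp
// Hypothetical sketch of a scale-then-translate step reading [sx, sy, tx, ty].
struct Point { float x, y; };

inline Point scale_translate(Point p, const float m[4]) {
    const float sx = m[0], sy = m[1];   // scales come first...
    const float tx = m[2], ty = m[3];   // ...then translates
    return { p.x * sx + tx, p.y * sy + ty };
}
```

For example, with m = {2, 3, 10, 20} the point (1, 1) maps to (12, 23).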
| |
We're going to want to assign types to the stages depending on their
inputs and outputs:
GG: x,y -> x,y
GP: x,y -> r,g,b,a
PP: r,g,b,a -> r,g,b,a
(There are a couple other degenerate cases here, where a stage ignores
its inputs or creates no outputs, but we can always just pretend their
null input or output is one type or the other arbitrarily.)
The GG stages will be pretty much entirely float code, and the GP stages
a mix of float math and byte stuff.
Since we've chosen U16 to match our register size in _lowp land,
we'll unpack each F register across two of those for transport between
stages. This is a notional, free operation in both directions.
Change-Id: I605311d0dc327a1a3a9d688173d9498c1658e715
Reviewed-on: https://skia-review.googlesource.com/60800
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
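The idea of carrying one F value across two U16 slots can be illustrated in scalar form. This is only a sketch: in vector registers the split is a reinterpretation rather than real work, and U16Pair, split_float and join_float are hypothetical names.

```cpp
#include <cstdint>
#include <cstring>

struct U16Pair { uint16_t lo, hi; };

// View the 32 bits of a float as two 16-bit halves.
inline U16Pair split_float(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return { static_cast<uint16_t>(bits & 0xFFFFu),
             static_cast<uint16_t>(bits >> 16) };
}

// Reassemble the original float from its two halves; no bits are lost.
inline float join_float(U16Pair p) {
    const uint32_t bits = (static_cast<uint32_t>(p.hi) << 16) | p.lo;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

The round trip is bit-exact, which is what lets GG and GP stages hand float coordinates through U16-typed registers.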
| |
As this array grows longer it causes troublesome code generation
when we're compiling offline, but it's easy as an argument.
Change-Id: I53526443f534f29d3bff17c3aec24a9e916c9b86
Reviewed-on: https://skia-review.googlesource.com/60564
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
| |
It's properly 16 today because of HSW/lowp stages handling 16 pixels at
a time, but it hasn't yet had an effect on lowp so we didn't notice.
As we add lowp shader stages this will start to matter,
so might as well bump it up to 16 now.
(One day _skx lowp stages could bump this up to 32.)
Change-Id: Idd8185c08e12dc657389a35bf659662c9670f98a
Reviewed-on: https://skia-review.googlesource.com/60565
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
There are non-zero values of a that make infinite 1.0f/a.
Let's just check for the real thing we care about, that
scale is finite.
Bug: skia:7123
Change-Id: If97574c9f3f2f0b73c749d0bea9aa19e6114f4d1
Reviewed-on: https://skia-review.googlesource.com/58460
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
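A small sketch of the check described above, assuming IEEE-754 floats without flush-to-zero. finite_reciprocal is a hypothetical helper; the point is that testing a != 0 is not enough, because a tiny non-zero a still overflows 1.0f/a to infinity.

```cpp
#include <cmath>

// Writes 1/a to *scale and returns true only when that reciprocal is finite.
// Rejects a == 0, NaN, and non-zero values small enough that 1/a overflows.
inline bool finite_reciprocal(float a, float* scale) {
    const float s = 1.0f / a;
    if (!std::isfinite(s)) {
        return false;
    }
    *scale = s;
    return true;
}
```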
| |
Today gradient mirror and repeat don't explicitly clamp. They work fine for
normal float values, but blow up with inputs like infinity and NaN, and
those aren't hard to construct with a combination of a funky matrix and
some squaring for xy -> radius.
So explicitly clamp in each of the three matrix tilers.
This should fix the fuzz at the associated bug.
Bug: skia:7093
Change-Id: Idd44e3c7a1ed95e2b1ace8eb953b62eddeb4e00e
Reviewed-on: https://skia-review.googlesource.com/55702
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
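The explicit clamp described above might look like this in scalar form. This is a sketch under assumptions, not the stage code: clamp01, repeat_t and mirror_t are hypothetical names, and the tilers are written for a unit interval.

```cpp
#include <cmath>

// Clamp to [0,1]. NaN fails the first comparison, so it maps to 0.
inline float clamp01(float v) {
    if (!(v >= 0.0f)) return 0.0f;
    return v > 1.0f ? 1.0f : v;
}

// Repeat: keep the fractional part. inf - floor(inf) is NaN, which the clamp catches.
inline float repeat_t(float t) {
    return clamp01(t - std::floor(t));
}

// Mirror: a triangle wave with period 2 (rises over 0..1, falls over 1..2).
inline float mirror_t(float t) {
    const float u = t - 2.0f * std::floor(t * 0.5f);   // in [0,2) for finite t
    return clamp01(u > 1.0f ? 2.0f - u : u);
}
```

Ordinary inputs pass through unchanged, while infinity and NaN land on a value that is safe to use as a gradient position.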
| |
This is something I came up with while writing _lowp.cpp.
This should all be a logical no-op, but there are some code generation
changes. I'm not exactly sure why.
Change-Id: Iaad36b5298b37fe26ebd375a147a48852f98e1e4
Reviewed-on: https://skia-review.googlesource.com/52003
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
| |
The lowp start_pipeline() always zeros, and with floats we always zero
when compiled as part of Skia, so this just makes the offline float
consistent with the others.
It's getting confusing to think about which code zeros and which
doesn't, and it'd be nicer to be able to rely on zeros.
This should change code generation only to the start_pipelines in
the .S files.
Change-Id: I1178b83c01e609e40dc7912d8d56df8e36eb339d
Reviewed-on: https://skia-review.googlesource.com/52001
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
We look at t to create a mask in mask_2pt_conical_degenerates to be
applied later to the colors after the normal gradient stages have run.
But if t itself is NaN, that will wreak havoc in the normal gradient
stages. So in addition to building the mask to kill off degenerate
colors, let's also set degenerate t to zero, which should be a safe
value.
This fixes the fuzz mentioned in this bug.
BUG=skia:7078
Change-Id: I8301450c707bdbf941abd0339959f9e60d46d955
Reviewed-on: https://skia-review.googlesource.com/52763
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
All three image tile modes go through exclusive_clamp() and then a
gather today, so we can move the work of exclusive_clamp() into each
gather_ stage, eliminating the need for clamp_{x,y} stages.
Luckily, we've got a convenient place to bottleneck this, ptr_and_ix(),
which works out the pointer and vector of indices to load for gathers.
This deletes the SkRasterPipeline_repeat_tiling unit test, which no
longer quite makes sense. It tested that repeat_x does that clamp,
but that's now done automatically outside that stage.
Change-Id: I24637ef60921bec7aa00082984c0c6a49dd86ca9
Reviewed-on: https://skia-review.googlesource.com/50260
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Reed <reed@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
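Folding an exclusive clamp into the index computation, as the commit above describes for ptr_and_ix(), can be sketched like this. The function names and signatures here are hypothetical stand-ins, and the real stage works on vectors rather than single floats.

```cpp
#include <cmath>
#include <cstdint>

// Clamp v into [0, limit): the upper bound is the largest float strictly below limit.
inline float exclusive_clamp(float v, float limit) {
    const float hi = std::nextafter(limit, 0.0f);
    if (!(v >= 0.0f)) return 0.0f;    // negatives and NaN
    return v > hi ? hi : v;
}

// Hypothetical stand-in for the index half of ptr_and_ix():
// clamp each coordinate, truncate to an integer, then form a row-major index.
inline int32_t gather_index(float x, float y, int width, int height, int stride) {
    const int32_t ix = static_cast<int32_t>(exclusive_clamp(x, static_cast<float>(width)));
    const int32_t iy = static_cast<int32_t>(exclusive_clamp(y, static_cast<float>(height)));
    return iy * stride + ix;
}
```

Because every gather goes through the same helper, no coordinate can index past the last row or column regardless of which tile mode produced it.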
| |
This makes loading into 16-bit channels more natural in _lowp.cpp.
Update a unit test to stop using out-of-range "colors".
Change-Id: I494687aac87948b60a40de447aa1527cf7167b2d
Cq-Include-Trybots: skia.primary:Test-Debian9-Clang-GCE-CPU-AVX2-x86_64-Release-UBSAN_float_cast_overflow
Reviewed-on: https://skia-review.googlesource.com/47580
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
- load_565 allows 565-src sprite blits
- scale_565 / lerp_565 allow subpixel text
- luminance_to_alpha is a color filter, and lets us write grey 8
And update CachedDecodingPixelRefTest with a yet more robust color.
Change-Id: I8af499c43f0f28093744d9c2993af553e36c9526
Reviewed-on: https://skia-review.googlesource.com/47021
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Reed <reed@google.com>
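For readers unfamiliar with the stages listed above, here is a scalar sketch of what a 565 load and a luminance computation do. It assumes red in the top 5 bits, green in the middle 6 and blue in the low 5, and uses Rec. 709 luma weights as an assumption; the names are hypothetical.

```cpp
#include <cstdint>

struct RGB { float r, g, b; };

// Unpack a 16-bit 565 pixel to floats in [0,1].
inline RGB unpack_565(uint16_t px) {
    return { static_cast<float>((px >> 11) & 31) / 31.0f,
             static_cast<float>((px >>  5) & 63) / 63.0f,
             static_cast<float>( px        & 31) / 31.0f };
}

// Weighted sum of the color channels (Rec. 709 weights, assumed here).
// A luminance_to_alpha-style filter would write this value into alpha.
inline float luminance(RGB c) {
    return 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b;
}
```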
| |
This reverts commit d286bfbd96f8b7ccf1cbce74f07d2f3917dbec30.
Reason for revert:
../../../src/core/SkRasterPipeline.cpp:98:34: runtime error: 4.87906e+09 is outside the range of representable values of type 'unsigned short'
Excellent new bot!
Original change's description:
> Bump stored lowp uniform color to 16-bit storage.
>
> This makes loading into 16-bit channels more natural in _lowp.cpp.
>
> Change-Id: I1ed393873654060ef52f4632d670465528006bbd
> Reviewed-on: https://skia-review.googlesource.com/47261
> Reviewed-by: Mike Reed <reed@google.com>
> Commit-Queue: Mike Klein <mtklein@chromium.org>
TBR=mtklein@chromium.org,reed@google.com
Change-Id: Ia65645c1261a7b31588c4ddaf2b1b3b327d265b0
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://skia-review.googlesource.com/47540
Reviewed-by: Mike Klein <mtklein@google.com>
Commit-Queue: Mike Klein <mtklein@google.com>
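The runtime error quoted above comes from converting a float that does not fit in unsigned short, which is undefined behavior in C++. One way to avoid it is to clamp before the cast; to_u16_saturated is a hypothetical helper, not the fix that eventually landed.

```cpp
#include <cstdint>

// Convert a float to uint16_t with rounding, saturating instead of overflowing.
inline uint16_t to_u16_saturated(float v) {
    if (!(v > 0.0f)) return 0;            // negatives, zero, and NaN
    if (v >= 65535.0f) return 65535;      // anything too large pins to the max
    return static_cast<uint16_t>(v + 0.5f);
}
```

With this helper the value from the error message, 4.87906e+09, becomes 65535 rather than tripping the sanitizer.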
| |
This makes loading into 16-bit channels more natural in _lowp.cpp.
Change-Id: I1ed393873654060ef52f4632d670465528006bbd
Reviewed-on: https://skia-review.googlesource.com/47261
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I4d4093fcfc839f6e7468b7d9f89bb903186ab68d
Reviewed-on: https://skia-review.googlesource.com/46761
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Guarding loads of 8-15 with defined(__AVX2__) should prevent errors
like these:
external/skia/src/jumper/SkJumper_stages_lowp.cpp:287:46: error:
'memcpy' called with size bigger than buffer
case 12: memcpy(&v, ptr, 12*sizeof(T)); break;
The loads of 8-15 were of course unreachable, given the &(N-1) == &7.
Change-Id: Ifcb5c177c6909e1df55cb564779a4d6610ff7b32
Reviewed-on: https://skia-review.googlesource.com/46521
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
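The commit above guards the unreachable cases with a preprocessor check. A different way to sidestep the warning, shown here only as an illustrative alternative, is to derive the copy size from the masked tail so that no branch ever names a size larger than the buffer. load_tail is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstring>

// Load `tail` elements into a zeroed N-lane buffer. The tail is masked with N-1,
// so the copy size can never exceed N; a masked tail of 0 means a full vector.
template <typename T, std::size_t N>
void load_tail(const T* ptr, std::size_t tail, T (&v)[N]) {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::memset(v, 0, sizeof v);
    const std::size_t masked = tail & (N - 1);
    const std::size_t count  = masked ? masked : N;
    std::memcpy(v, ptr, count * sizeof(T));
}
```

With N = 8, a nominal tail of 12 is treated as 4, matching the &(N-1) == &7 observation in the commit message.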
| |
I have text_16_AA_FF -> 8888 (forcing RP) faster than head now on my
laptop. I'm feeling confident that we can make this perform well.
After looking at performance a bit more today, it looks like everything
is within what I'd consider comparable in performance, especially on
ARM. On x86-64 it looks like big bulk blits get a little slower and
small mask blits get a little faster.
Quality looks good, and maybe improved for 565.
There are fewer platform-specific differences now in _lowp, and I think
they're few enough now that we could even consider completing the
unification by folding the 8-bit and float code together. Rename
"div255()" to "rebias()", slap on a few coats of paint...
Guarded for Chrome with SK_JUMPER_LEGACY_LOWP.
Change-Id: I36309c07cf736f3cb31952cca66030ad56026318
Reviewed-on: https://skia-review.googlesource.com/45982
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
To continue building stages, update Clang and update your GN args:
$ brew update
$ brew upgrade llvm
$ find . | grep args.gn | xargs sed -ie 's/clang-4.0/clang-5.0/g'
Some interesting codegen changes I noticed:
- ARMv7: generally better register assignment, tighter code
- ARMv7: dropped the 128-bit alignment hint when loading and storing dst "registers",
unclear why.
- HSW: now clearing the destination register before vgatherdps,
to break a dependency on the previous value
Change-Id: I4f804a4cbfcde530fad5ed535438174e852a9593
Reviewed-on: https://skia-review.googlesource.com/44241
Reviewed-by: Florin Malita <fmalita@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Because floats are fun, the compiler cannot merge x + 0.5f +
[0,1,2,3,4...] into x + [0.5,1.5,2.5,3.5,4.5,...]. But we can.
Change-Id: I03b46c1ea0653877f35f6c888f29371b5f73d813
Reviewed-on: https://skia-review.googlesource.com/42480
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
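The manual merge described above can be sketched as two versions of the same per-lane computation. kLanes and both function names are hypothetical; the lane count of 4 is just for illustration.

```cpp
constexpr int kLanes = 4;

// Two additions per lane: (x + 0.5f) + i.
inline void centers_two_adds(float x, float out[kLanes]) {
    for (int i = 0; i < kLanes; i++) {
        out[i] = (x + 0.5f) + static_cast<float>(i);
    }
}

// One addition per lane against precomputed constants: x + (i + 0.5f).
inline void centers_one_add(float x, float out[kLanes]) {
    static const float kIotaPlusHalf[kLanes] = {0.5f, 1.5f, 2.5f, 3.5f};
    for (int i = 0; i < kLanes; i++) {
        out[i] = x + kIotaPlusHalf[i];
    }
}
```

Float addition is not associative, so for large enough x the two versions can round differently; that is why a compiler in strict floating-point mode will not make this change on its own. For pixel-sized coordinates they agree.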
| |
approx 2.5x faster on arm64 for sprite 8888 --> 565 blits
Bug: skia:
Change-Id: I524f993fee16196385dc07cbec39ef378b1301e5
Reviewed-on: https://skia-review.googlesource.com/41162
Reviewed-by: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Reed <reed@google.com>
| |
Shouldn't be anything tricky here.
Guarded by SK_JUMPER_LEGACY_X86_8BIT for (Win) layout tests.
Change-Id: I7580c7c18d1721f1301904c049ea2e59e9bda5d9
Reviewed-on: https://skia-review.googlesource.com/40692
Reviewed-by: Herb Derby <herb@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
I'm not sure why I wrote this to use a Params struct originally, but we
should have plenty of registers in _8bit to pass everything directly and
avoid the stack. Even once we enable the 8-bit pipeline on 32-bit x86,
we'll have 4 general purpose registers and 4 vector registers to use,
precisely what we're using here.
Change-Id: I3e51ab73186edcdcb8bfaa6cc99d9516db7c032a
Reviewed-on: https://skia-review.googlesource.com/40771
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
The only reason we were keeping SkJumper_constants around is that it was
hard to get float/integer iota vectors on arm64 without relocations.
Now that we're compiling arm64 normally as part of Skia, we don't have
to worry about relocations.
This means we can kill the struct and stop passing around that pointer.
Change-Id: I013c6a735947f3db2bc87f2bfa38b7520d2e2fce
Reviewed-on: https://skia-review.googlesource.com/40200
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
| |
This reverts commit 6d13575108299951ecdfba6d85c915fcec2bc028.
Now with guards for "errors" like this:
external/skia/src/jumper/SkJumper_stages_8bit.cpp:240:50: error:
'memcpy' called with size bigger than buffer
case 12: memcpy(&v, src, 12*sizeof(T)); break;
This code is unreachable and generally removed by Clang's optimizer
anyway... as far as I can tell the code generation diff is arbitrary.
Change-Id: I6216567caaa6166f71258bd25343a09e93892a10
Reviewed-on: https://skia-review.googlesource.com/39961
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
1) Replace a couple commas with semicolons.
2) Make sure to zero a couple vectors.
1) has no effect on code generation.
2) does add a bunch of self-vxorps, but they're cheap, we already do
the equivalent for pre-AVX SSE code, and they're in routines that are
not very performance-critical. We could circle back and guard these
with !defined(JUMPER_IS_OFFLINE) if we really need the vectors to start
uninitialized for speed.
CQ_INCLUDE_TRYBOTS=skia.primary:Build-Debian9-Clang-x86_64-Release-Fast
Change-Id: I1a13f3eb28d664dbc345d71c3adbc62be5ff7c45
Reviewed-on: https://skia-review.googlesource.com/39661
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
The most interesting parts of this are how plus interacts with partial
coverage. Plus needs its clamp to happen after the lerp.
Luckily, some of its math folds away:
d' = clamp[ d*(1-c) + (s+d)*c ] ==
clamp[ d - dc + sc + dc ] ==
clamp[ d + sc ]
What's nice there is that coverage can be folded into the src term.
This suggests that we can re-write the plus stage to clamp internally
(and thus, be viable for 8-bit) if we always pre-scale with coverage.
We don't have a way to pre-scale with 565 coverage until now, but
it's only a step or two away from there. We can use the alternate
formulation we derived for alpha for lerp_565, calculating the alpha
coverage from red, green, and blue coverages _and_ the values of src
and dst alpha.
While we already pre-scale srcover today for 8-bit or constant coverage,
we cannot do the same for 565. When evaluating the expression
d' = s + (1-a)d
we need the a term to be pre-scaled with red's coverage when calculating
dr', with blue's when calculating db', etc. Essentially we need to
carry around a bunch of extra values, and we've got no way to do that.
So instead, we'll just carefully pre-scale plus with any coverage, and
keep post-lerping srcover when we have 565 coverage.
Change-Id: I7a7a52eec7d482e1b98bb8a01ea0a3d5e67bef65
Reviewed-on: https://skia-review.googlesource.com/38300
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
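The folding derived above for plus, clamp[ d*(1-c) + (s+d)*c ] == clamp[ d + sc ], can be checked numerically with a scalar sketch. The function names are hypothetical and each function handles a single channel.

```cpp
#include <algorithm>

inline float clamp01(float v) { return std::min(std::max(v, 0.0f), 1.0f); }

// Unclamped plus, then lerp with coverage c, then clamp.
inline float plus_then_lerp(float s, float d, float c) {
    return clamp01(d * (1.0f - c) + (s + d) * c);
}

// Pre-scale src by coverage, then a plus that clamps internally.
inline float plus_prescaled(float s, float d, float c) {
    return clamp01(d + s * c);
}
```

For s = 0.75, d = 0.5, c = 0.5 both forms give 0.875, so coverage can be folded into the src term before the blend.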
| |
This extra ld pass can merge all our many redundant constants,
both within an instruction set and across them.
This should save a bunch of code size on x86-64, with no other impact.
It cuts 12K off my local build of ok.
Change-Id: Ib2bb4adf88564aca45e55ee53dcf6584265c7dbe
Reviewed-on: https://skia-review.googlesource.com/37940
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
Minor speedup.
Before:
10212.01 ? blendmode_rect_ColorBurn 8888
9216.78 ? blendmode_rect_ColorDodge 8888
After:
9635.44 ? blendmode_rect_ColorBurn 8888
8820.22 ? blendmode_rect_ColorDodge 8888
Change-Id: I9e8a9aa21e2370de3174c31821fb0676260d2643
Reviewed-on: https://skia-review.googlesource.com/37620
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Florin Malita <fmalita@chromium.org>
| |
Before:
micros bench
7669.09 ? blendmode_rect_HardLight 8888
8707.13 ? blendmode_rect_Overlay 8888
After:
micros bench
6679.60 ? blendmode_rect_HardLight 8888
6789.57 ? blendmode_rect_Overlay 8888
Change-Id: I52f389253fa07dafe18e572af550af7387264a16
Reviewed-on: https://skia-review.googlesource.com/34280
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@google.com>
| |
We can fold through some math in these two modes.
$ out/ok bench:samples=100 rp filter:search="Difference|Exclusion" serial
Before:
[blendmode_rect_Exclusion] 4.94ms @0 6.13ms @99 6.25ms @100
[blendmode_mask_Exclusion] 10.9ms @0 12.8ms @99 12.9ms @100
[blendmode_rect_Difference] 5.56ms @0 6.79ms @99 6.8ms @100
[blendmode_mask_Difference] 11.4ms @0 13.8ms @99 14.1ms @100
After:
[blendmode_rect_Exclusion] 3.5ms @0 4.12ms @99 4.59ms @100
[blendmode_mask_Exclusion] 9.27ms @0 11.2ms @99 11.6ms @100
[blendmode_rect_Difference] 5.37ms @0 6.58ms @99 6.6ms @100
[blendmode_mask_Difference] 11ms @0 12.1ms @99 12.6ms @100
Change-Id: I03f32368244d4f979cfee83723fd78dfbc7d5fc1
Reviewed-on: https://skia-review.googlesource.com/33980
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
Change-Id: I5773cf831c7e41a932bee1f2c6830085fb7db025
Reviewed-on: https://skia-review.googlesource.com/33764
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@google.com>
| |
Change-Id: I4bf618ad8728541fcef3fc1c6aa5b3ca106d50dc
Reviewed-on: https://skia-review.googlesource.com/33583
Commit-Queue: Florin Malita <fmalita@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
They appear to be slower than the generic load() and store() now.
[blendmode_mask_Hue] 14.7ms @0 15.6ms @95 39.6ms @100
[blendmode_rect_Hue] 31.5ms @0 37.6ms @95 39.5ms @100
~~>
[blendmode_mask_Hue] 14.7ms @0 15.2ms @95 39.5ms @100
[blendmode_rect_Hue] 30.5ms @0 32.6ms @95 37.8ms @100
Change-Id: I674b75087b8139debead71f3016631bcb0cb0047
Reviewed-on: https://skia-review.googlesource.com/33800
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This basically unrolls all loops, handling twice as many pixels in a
stride. We now pass around 4 native registers instead of just 2.
I've temporarily disabled AVX2 mask loads and stores. It shouldn't be
hard to turn them back on, but I'd want to test on AVX2 hardware first.
Change-Id: I0907070f086a0650167456c149a479c1d96b8a2d
Reviewed-on: https://skia-review.googlesource.com/33361
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
I tried to follow exactly the same strategy as a start.
(Though I did fix the off-by-one dimensions.)
It does rather look like we only need 3D and 4D now
that I've looked at the call sites.
Looks like about a 20% speedup.
Change-Id: I8b1af64750ad1750716ee1ab0767e64591c7206a
Reviewed-on: https://skia-review.googlesource.com/32842
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Brian Osman <brianosman@google.com>
| |
Until now we've been using 3 separate parametric stages to apply
gamma to r,g,b. That works fine, but is kind of unnecessarily
slow, and again less clear in a stack trace than seeing "gamma".
The new bench runs in about 60% of the time the old one does
on my Trashcan.
BUG=skia:6939
Change-Id: I079698d3009b081f1c23a2e27fc26e373b439610
Reviewed-on: https://skia-review.googlesource.com/32721
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
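To show why a dedicated gamma stage is simpler than three parametric ones, here is a scalar sketch. It assumes the usual ICC-style seven-parameter curve for the parametric form; ParametricFn, eval_parametric and apply_gamma are hypothetical names, not the stage implementations.

```cpp
#include <cmath>

// Assumed seven-parameter curve: y = c*x + f for x < d, else (a*x + b)^g + e.
struct ParametricFn { float g, a, b, c, d, e, f; };

inline float eval_parametric(const ParametricFn& tf, float x) {
    return x < tf.d ? tf.c * x + tf.f
                    : std::pow(tf.a * x + tf.b, tf.g) + tf.e;
}

struct RGB { float r, g, b; };

// A pure gamma needs one parameter and can cover all three channels at once.
inline RGB apply_gamma(RGB c, float gamma) {
    return { std::pow(c.r, gamma), std::pow(c.g, gamma), std::pow(c.b, gamma) };
}
```

A pure gamma is the special case a = 1 with every other parameter zero, so the general form spends a compare, a multiply and two adds per channel that a gamma-only stage can skip.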
| |
This will be faster, but maybe more importantly it helps make debugging
a stack trace clearer. It's confusing to see "parametric transfer
function" stages followed by "table transfer function" stages...
This leads to a little bit of cleanup in SkColorSpaceXform_A2B.
I am uncertain whether we still need parametric_a. I need to do some
more tracing through the code before I'd say it's impossible to reach in
addTransferFn().
Change-Id: I52e85019f92d012a3086fc94cf64ae6c9307ea94
Reviewed-on: https://skia-review.googlesource.com/32040
Reviewed-by: Brian Osman <brianosman@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: Icc4b06094aeba3af99b534746f66286d776ef78a
Reviewed-on: https://skia-review.googlesource.com/30920
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
It's funny how now that I'm on a machine that doesn't support AVX2,
it's suddenly important for me that pack() is optimized for SSE!
This is basically the same as this morning, without any weird AVX2
pack ordering issues. This replaces something like
movdqa 2300(%rip), %xmm0
pshufb %xmm0, %xmm3
pshufb %xmm0, %xmm2
punpcklqdq %xmm3, %xmm2
(This is SSE4.1; the SSE2 version is worse.)
with
psrlw $8, %xmm3
psrlw $8, %xmm2
packuswb %xmm3, %xmm2
(SSE2 and SSE4.1 both.)
It's always nice to not need to load a shuffle mask out of memory.
Change-Id: I56fb30b31fcedc0ee84a4a71c483a597c8dc1622
Reviewed-on: https://skia-review.googlesource.com/30583
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
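The psrlw/packuswb sequence quoted above corresponds roughly to the following intrinsics. This is a sketch, not the pack() from the CL: pack_high_bytes is a hypothetical name, the SSE2 path assumes an x86 target, and a scalar fallback with the same result is included for other targets.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Take the high byte of each of sixteen 16-bit lanes (two groups of 8)
// and store them as sixteen consecutive bytes.
inline void pack_high_bytes(const uint16_t lo[8], const uint16_t hi[8], uint8_t out[16]) {
#if defined(__SSE2__)
    // psrlw $8 on each input, then a single packuswb; no shuffle mask is loaded.
    const __m128i a = _mm_srli_epi16(_mm_loadu_si128(reinterpret_cast<const __m128i*>(lo)), 8);
    const __m128i b = _mm_srli_epi16(_mm_loadu_si128(reinterpret_cast<const __m128i*>(hi)), 8);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_packus_epi16(a, b));
#else
    for (int i = 0; i < 8; i++) {
        out[i]     = static_cast<uint8_t>(lo[i] >> 8);
        out[i + 8] = static_cast<uint8_t>(hi[i] >> 8);
    }
#endif
}
```

After the shift every lane is at most 255, so the saturation in packuswb never changes a value.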
| |
This makes loading them much simpler in 8-bit mode.
Change-Id: I35ff34ebd0b93425c4e39e055bf4ade8cf8561e1
Reviewed-on: https://skia-review.googlesource.com/30621
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This is a consistent, very small speedup for srcover.
SkRasterPipeline_run
Before: 30.4057ns
After: 30.1089ns
i.e. a 1% speedup on the bench, maybe 3-4% improvement in srcover itself.
The only reason I'd send this out now is that this will slightly change
some pixels, so it's a good thing to sneak in before rebaselining.
It's possible that other blend modes would benefit from the same, but
I've only looked at srcover (and I've also changed dstover so that it
doesn't look funny).
Change-Id: Ic056ca0912d76648d43a78e0052176fd0f7934f1
Reviewed-on: https://skia-review.googlesource.com/30281
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
__builtin_convertvector(..., U8x4) is producing a fairly long
sequence of code to convert U16x4 to U8x4 on HSW:
vextracti128 $0x1,%ymm2,%xmm3
vmovdqa 0x1848(%rip),%xmm4
vpshufb %xmm4,%xmm3,%xmm3
vpshufb %xmm4,%xmm2,%xmm2
vpunpcklqdq %xmm3,%xmm2,%xmm2
vextracti128 $0x1,%ymm0,%xmm3
vpshufb %xmm4,%xmm3,%xmm3
vpshufb %xmm4,%xmm0,%xmm0
vpunpcklqdq %xmm3,%xmm0,%xmm0
vinserti128 $0x1,%xmm2,%ymm0,%ymm0
We can do much better with _mm256_packus_epi16:
vinserti128 $0x1,%xmm0,%ymm2,%ymm3
vperm2i128 $0x31,%ymm0,%ymm2,%ymm0
vpackuswb %ymm0,%ymm3,%ymm0
vpackuswb packs the values in a somewhat surprising order,
which the first two instructions get us lined up for.
This is a pretty noticeable speedup, 7-8% on some benchmarks.
The same sort of change could be made for SSE2 and SSE4.1 also
using _mm_packus_epi16, but the difference for that change is
much less dramatic. Might as well stick to focusing on HSW.
Change-Id: I0d6765bd67e0d024d658a61d19e6f6826b4d392c
Reviewed-on: https://skia-review.googlesource.com/30420
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
I think we can replace a lot of legacy code with an SkRasterPipeline
backend that works in 8-bit and stays interlaced. Think of this as a
"lowerp" replacement for lowp.
I'm having some trouble getting ARMv8 working.
ARMv7 should be fine, but I want to turn it on separately from x86.
I haven't looked at 32-bit x86 yet, but that's also on the todo list.
Open questions to follow up on:
- is it better to fold every multiply back down to 8-bit
(as seen here), or to allow intermediates to accumulate
in 16-bit and divide by 255 when done/needed?
- is it better to pass tightly packed 8-bit vectors between stages (as
seen here), or to keep the 8-bit values unpacked in 16-bit lanes?
- should we make V wider than 1 register?
GMs look good. All diffs invisible and plausibly due to the 15->8 bit
precision drop. A quick bench run showed this running in about 0.75x
the time of the existing lowp backend.
Change-Id: I24aa46ff1d19c0b9b8dc192d5b1821cab0b8843c
Reviewed-on: https://skia-review.googlesource.com/29886
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Florin Malita <fmalita@chromium.org>
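The first open question above, folding each multiply back down to 8 bits, hinges on dividing a 16-bit product by 255. One common exact formulation using only adds and shifts is sketched below; it is not necessarily what this CL does, and the names are hypothetical.

```cpp
#include <cstdint>

// Exact round(v / 255) for v in [0, 255*255], with no division.
inline uint8_t div255_round(uint32_t v) {
    v += 128;
    return static_cast<uint8_t>((v + (v >> 8)) >> 8);
}

// Multiply two 8-bit unorm values and fold the product straight back to 8 bits.
inline uint8_t mul_unorm8(uint8_t x, uint8_t y) {
    return div255_round(static_cast<uint32_t>(x) * y);
}
```

Folding after every multiply keeps values tightly packed at the cost of a rounding step each time; accumulating in 16 bits defers that cost but needs wider lanes between stages.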
| |
If we were doing this math with real numbers or even just doubles, these
clamps wouldn't be necessary. But we're favoring speed over accuracy
here when we emulate fmod() and some of those inaccuracies end up with
values outside the [0,tile) range, even negative ones!
To keep the spirit of fast over 100% accurate, I've just added a safety
clamp to 0. The case in the unit test now returns 0 where it should
really return something like 7 or 8, but at least we won't try to read
_way_ outside the image buffer.
BUG=chromium:749260
Change-Id: Ifc5cfe69798beccbb2a16547510158576e06eb3a
Reviewed-on: https://skia-review.googlesource.com/29580
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
There are not many registers on 32-bit x86, and we're using most to pass
Stage function arguments. This means few are available as temporaries,
and we're forced to hit the stack all the time. xmm registers are the
most egregious example: we use all 8 registers to pass data, leaving none
free as temporaries.
This CL cuts things down pretty dramatically, from passing 5 general
purpose and 8 xmm registers to 2 general purpose and 4 xmm registers.
One of the two general purpose registers is a pointer to space on the
stack where we store all those other values.
Every stage function needs to use the program pointer, so that stays in
a general purpose register. Almost every stage uses the r,g,b,a
vectors, so they stay in xmm registers. The rest (destination x,y, the
tail mask, a pointer to tricky constants, and the dr,dg,db,da vectors)
now live on the stack.
The generated code is about 20K smaller and runs about 20% faster.
$ out/monobench SkRasterPipeline_srgb 200
Before: 358.784ns
After: 282.563ns
Change-Id: Icc117af95c1a81c41109984b32e0841022f0d1a6
Reviewed-on: https://skia-review.googlesource.com/27620
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
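The calling convention described above, with a program pointer and r,g,b,a passed directly and everything else behind one pointer, can be sketched in portable C++. All names here are hypothetical, scalars stand in for the xmm vectors, and the real code is far more involved.

```cpp
// Values most stages don't touch live in memory, reached through one pointer.
struct Params {
    int   x, y;
    float dr, dg, db, da;
};

struct Stage;
using StageFn = void (*)(Params*, const Stage*, float r, float g, float b, float a);

// One program entry: the stage function plus its context pointer.
struct Stage {
    StageFn fn;
    void*   ctx;
};

// Each stage ends by calling the next program entry with the same argument shape.
inline void call_next(Params* p, const Stage* st, float r, float g, float b, float a) {
    st[1].fn(p, st + 1, r, g, b, a);
}

inline void load_src(Params* p, const Stage* st, float, float, float, float) {
    const float* c = static_cast<const float*>(st->ctx);
    call_next(p, st, c[0], c[1], c[2], c[3]);
}

// srcover reads the destination color through Params instead of from registers.
inline void srcover(Params* p, const Stage* st, float r, float g, float b, float a) {
    const float inv = 1.0f - a;
    call_next(p, st, r + p->dr * inv, g + p->dg * inv, b + p->db * inv, a + p->da * inv);
}

inline void store(Params*, const Stage* st, float r, float g, float b, float a) {
    float* out = static_cast<float*>(st->ctx);
    out[0] = r; out[1] = g; out[2] = b; out[3] = a;
}

inline void run(const Stage* program, Params* p) {
    program[0].fn(p, program, 0.0f, 0.0f, 0.0f, 0.0f);
}
```

Only six arguments cross each call, so stages that never look at dr,dg,db,da or the coordinates pay nothing to carry them.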
| |
[√] convert all stages to use SkJumper_MemoryCtx / be 2d-compatible
[√] convert compile to 2d also, remove 1d run/compile
[√] convert all call sites
[√] no diffs
Change-Id: I3b806eb8fe0c3ec043359616409f7cd1211a1e43
Reviewed-on: https://skia-review.googlesource.com/24263
Commit-Queue: Mike Klein <mtklein@google.com>
Reviewed-by: Florin Malita <fmalita@chromium.org>
| |
raster-only
Bug: skia:
Change-Id: I3af19f031083c9cc258f73ba6a2f6020bb15f110
Reviewed-on: https://skia-review.googlesource.com/24400
Commit-Queue: Mike Reed <reed@google.com>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
gather_i8 is now unused, so we can remove it.
That in turn makes the ctable field of SkJumper_GatherCtx unused.
After removing ctable, SkJumper_GatherCtx and SkJumper_PtrStride look
identical, so I've now fused them into SkJumper_MemoryCtx, which will
eventually be used by everything loading from, gathering from, or
storing to memory.
Change-Id: Ia882d2dbd54c9fcf9a8250a1ce83304389dd284a
Reviewed-on: https://skia-review.googlesource.com/24085
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
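A fused context of the kind described above needs only a base pointer and a row stride. The sketch below is an assumption about its shape, using the hypothetical names MemoryCtx and ptr_at_xy and a stride measured in pixels.

```cpp
#include <cstddef>

// Hypothetical fused context: where the pixels start and how far apart rows are.
struct MemoryCtx {
    void* pixels;
    int   stride;   // in pixels, not bytes (an assumption of this sketch)
};

// Address of pixel (x, y) for a given pixel type T.
template <typename T>
inline T* ptr_at_xy(const MemoryCtx* ctx, int x, int y) {
    return static_cast<T*>(ctx->pixels)
         + static_cast<std::ptrdiff_t>(y) * ctx->stride
         + x;
}
```

Loads, stores and gathers can all share this one struct because each needs just the base and the stride; the pixel type is supplied by the stage itself.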