| Commit message (Collapse) | Author | Age |
| |
Change-Id: I70bd64d114a2460534bcb51d356e13d9bc3b8603
Reviewed-on: https://skia-review.googlesource.com/11491
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I71d85ffe29bc11678ff1e696fa4a2c93d0b4fcbe
Reviewed-on: https://skia-review.googlesource.com/11446
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Its alignment (sometimes 4, sometimes 16) has proven to be error-prone.
This also means we don't really need LazyCtx::load().
I think I only had it there to make sure we were doing unaligned loads
of F4; the better way is to just never declare the data as aligned...
The generated code isn't quite as good, but I can live with it.
Change-Id: I5d57a580ca12c94ca84a5e8b72a66cf8d0c829eb
Reviewed-on: https://skia-review.googlesource.com/11406
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Nothing too tricky here.
Change-Id: I2a10548efc75a6fd875fcb242790880d9b9a28fd
Reviewed-on: https://skia-review.googlesource.com/11388
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Matt Sarett <msarett@google.com>
| |
Change-Id: I2d58538ab071b217d8dbbf2d802493d9045eabf2
Reviewed-on: https://skia-review.googlesource.com/11384
Reviewed-by: Matt Sarett <msarett@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Pretty much the same deal as the last CL going the other direction:
split store_f16 into to_half() and store4(). Platforms that had fused
strategies here get a little less optimal, but the code's easier to
follow, maintain, and reuse.
Also adds widen_cast() to encapsulate the fairly common pattern of
expanding one of our logical vector types (e.g. 8-byte U16) up to the
width of the physical vector type (e.g. 16-byte __m128i). This
operation is deeply understood by Clang, and often is a no-op.
I could make bit_cast() do this, but it seems clearer to have two names.
Change-Id: I7ba5bb4746acfcaa6d486379f67e07baee3820b2
Reviewed-on: https://skia-review.googlesource.com/11204
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
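The widen_cast() idea described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the CL: the U16x4 and U16x8 struct names are hypothetical stand-ins for an 8-byte logical vector and a 16-byte physical vector, and zero-filling the upper bytes is an assumption made here so the result is deterministic.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins: an 8-byte "logical" vector and a 16-byte
// "physical" vector (the role __m128i plays in the commit message).
struct U16x4 { uint16_t lane[4]; };
struct U16x8 { uint16_t lane[8]; };

// Copy the bytes of a narrower value into the low bytes of a wider one.
// The upper bytes are zeroed here (an assumption of this sketch).
template <typename Dst, typename Src>
Dst widen_cast(const Src& src) {
    static_assert(sizeof(Dst) > sizeof(Src), "widen_cast only widens");
    Dst dst{};                             // value-initialize: all bytes zero
    std::memcpy(&dst, &src, sizeof(Src));  // fill the low sizeof(Src) bytes
    return dst;
}
```

Calling widen_cast<U16x8>(v) on a U16x4 leaves the four original lanes in the low half of the result. Keeping this separate from a same-size bit_cast() makes the size change visible at the call site, which matches the "two names" reasoning in the message.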
| |
load_f16 gets slightly worse codegen for ARMv7, SSE2, SSE4.1, and AVX
from splitting it apart compared to the previous fused versions. But
the stage code becomes much simpler.
I'm happy to make those trades until someone complains.
load4() will be useful on its own to implement a couple other stages.
Everything draws the same. I intend to follow up with more of the
same sort of refactoring, but this was a tricky enough change that I want
to do them in small steps.
Change-Id: Ib4aa86a58d000f2d7916937cd4f22dc2bd135a49
Reviewed-on: https://skia-review.googlesource.com/11186
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
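As a rough picture of what a standalone load4() might do, here is a scalar sketch that de-interleaves four channels from packed pixel data. The commit above only names the function; the signature, the 4-lane width, and the U16xN type below are assumptions for illustration, and real backends would use platform vector instructions instead of a loop.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kN = 4;               // hypothetical number of lanes
struct U16xN { uint16_t lane[kN]; };   // hypothetical logical vector type

// Read kN interleaved pixels (r,g,b,a, r,g,b,a, ...) and split them
// into one vector per channel.
inline void load4(const uint16_t* src, U16xN* r, U16xN* g, U16xN* b, U16xN* a) {
    for (size_t i = 0; i < kN; i++) {
        r->lane[i] = src[4 * i + 0];
        g->lane[i] = src[4 * i + 1];
        b->lane[i] = src[4 * i + 2];
        a->lane[i] = src[4 * i + 3];
    }
}
```

With the de-interleave factored out like this, a stage such as load_f16 reduces to load4() followed by a separate half-to-float conversion, which is the split the message describes.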
| |
Change-Id: I217d7b562f5fa443978044e17469ba757c061209
Reviewed-on: https://skia-review.googlesource.com/10971
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
On Linux and Mac there's always a red zone of 128 bytes of stack space
for us to use without touching the stack pointer. We'd been generating
stage code as if that's not there because it's not there on Windows.
We have a separate .S file for Windows anyway, so there's no need to
ignore the red zone when we know it's there.
Change-Id: I81a7841020bb8aad68bf35feac851727ef1d0758
Reviewed-on: https://skia-review.googlesource.com/10965
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
| |
Change-Id: I39538c90cd4c68691c3956e3f51616b77e4c90d1
Reviewed-on: https://skia-review.googlesource.com/10961
Reviewed-by: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I0c48ec80eee8b7c7e9fb980efa8ed1dad5ad9768
Reviewed-on: https://skia-review.googlesource.com/10924
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I17ce08a7ec62ef8ffe8ae567079d669a87ef9a9c
Reviewed-on: https://skia-review.googlesource.com/10921
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I5ff3599448d027fcac43a53e98a801ce672ce5ee
Reviewed-on: https://skia-review.googlesource.com/10861
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
Chromium Mac bots are getting tripped up by stages being visible.
.hidden and .private_extern are the equivalents of -fvisibility=hidden for ELF and Mach-O.
CQ_INCLUDE_TRYBOTS=skia.primary:Build-Mac-Clang-arm-Debug-iOS
Change-Id: I8dbb04f514eead4ab480664f2674db4b57611b84
Reviewed-on: https://skia-review.googlesource.com/10622
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
I went with the unified-in-one-.cpp approach mostly to make it easy to
roll out SkJumper. I no longer see any difficulty rolling out the
assembly files, and it's possible the unified .cpp approach just makes
things harder.
Let's see if it's any easier to get Chrome's official build to work with
normal assembly files. It's not going to be a problem to roll out.
This is a partial revert of https://skia-review.googlesource.com/c/9336.
CQ_INCLUDE_TRYBOTS=skia.primary:Test-Win2k8-MSVC-GCE-CPU-AVX2-x86_64-Debug,Test-Mac-Clang-MacMini6.2-CPU-AVX-x86_64-Debug,Test-Ubuntu-Clang-GCE-CPU-AVX2-x86_64-Debug,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug
Change-Id: Idfdbd2d322452b44bc0adaf6dc299cc7649bc51e
Reviewed-on: https://skia-review.googlesource.com/10561
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This technique lets us generate a single source file, use the C++
preprocessor, and avoid the pain of working with assemblers.
By using the section attribute or declspec allocate, we can put these
data arrays into the .text section, making them ordinary code.
This is like the previous solution, except it should actually run.
CQ_INCLUDE_TRYBOTS=skia.primary:Test-Win2k8-MSVC-GCE-CPU-AVX2-x86_64-Debug,Test-Mac-Clang-MacMini6.2-CPU-AVX-x86_64-Debug,Test-Ubuntu-Clang-GCE-CPU-AVX2-x86_64-Debug,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug
Change-Id: Ide7675f6cf32eb4831ff02906acbdc3faaeaa684
Reviewed-on: https://skia-review.googlesource.com/9336
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I4bc6d1a8787c540fd1a29274650d34392e56651c
Reviewed-on: https://skia-review.googlesource.com/9223
Reviewed-by: Mike Klein <mtklein@chromium.org>
| |
SkRasterPipeline_f16: 63 -> 58 (8888+f16 loads, f16 store)
SkRasterPipeline_srgb: 96 -> 84 (2x 8888 loads, 8888 store)
PS3 has a simpler way to build the mask, in a uint64_t.
Timing is still roughly the same.
Change-Id: Ie278611dff02281e5a0f3a57185050bbe852bff0
Reviewed-on: https://skia-review.googlesource.com/9165
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
| |
This makes stages that don't use a context pointer look a little
cleaner, especially on ARM. No interesting speed difference on x86.
What do you think?
Change-Id: I445472be2aa8a7c3bc8cba443fa477a3628118ba
Reviewed-on: https://skia-review.googlesource.com/9155
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This allows %rip addressing as long as it's not going into a data
section. This lets us use switch tables, avoiding loops and stack.
On HSW,
SkRasterPipeline_f16: 90 -> 63
SkRasterPipeline_srgb: 170 -> 97
Change-Id: I3ca2e4ff819b70beea78be75579f9d80c06979e8
Reviewed-on: https://skia-review.googlesource.com/9146
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
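To illustrate how a switch table can replace a loop when handling a partial vector, here is a hedged C++ sketch of a tail load. It is not the code from the CL: the U32x8 type, the load_tail() name, and the convention that tail == 0 means "all 8 lanes are valid" are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kLanes = 8;
struct U32x8 { uint32_t lane[kLanes]; };   // hypothetical 8-lane vector

// Load `tail` valid elements (1..7), or all 8 when tail == 0 (assumed
// convention). Unloaded lanes stay zero. Each case falls through to the
// next, so a compiler can lower this to one indirect jump into a
// straight-line run of loads, with no loop counter.
inline U32x8 load_tail(const uint32_t* src, size_t tail) {
    U32x8 v = {};
    switch (tail & (kLanes - 1)) {
        case 0: v.lane[7] = src[7];  // fall through
        case 7: v.lane[6] = src[6];  // fall through
        case 6: v.lane[5] = src[5];  // fall through
        case 5: v.lane[4] = src[4];  // fall through
        case 4: v.lane[3] = src[3];  // fall through
        case 3: v.lane[2] = src[2];  // fall through
        case 2: v.lane[1] = src[1];  // fall through
        case 1: v.lane[0] = src[0];
    }
    return v;
}
```

A jump table like this is typically addressed relative to the instruction pointer on x86-64, which is presumably why permitting %rip addressing matters for this approach.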
| |
We have plenty of general-purpose registers to spare on x86-64,
so the cheapest thing to do is use one to hold the usual 'tail'.
Speedups on HSW:
SkRasterPipeline_srgb: 292 -> 170
SkRasterPipeline_f16: 122 -> 90
There's plenty more room to improve here, e.g. using mask loads and
stores, but this seems to be enough to get things working reasonably.
BUG=skia:6289
Change-Id: I8c0ed325391822e9f36636500350205e93942111
Reviewed-on: https://skia-review.googlesource.com/9110
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Today we use mad() to get FMAs where possible.
-ffp-contract=fast lets the compiler generate them if it spots an opportunity.
It looks like it's found a mix of FMAs and FMSs.
I will follow up by seeing if we can relax the use of mad().
Quick experiments say no, but less quick experiments may say otherwise.
Change-Id: I5228811cfbf11cccc0d715672a464fd1e1cea3b0
Reviewed-on: https://skia-review.googlesource.com/9136
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
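The difference between an explicit mad() and compiler-driven contraction can be sketched like this. The body of mad() here is an assumption (the commit only says it is used to get FMAs where possible), and mul_add_contractible() is a hypothetical name for the plain expression form.

```cpp
#include <cmath>

// Explicit fused multiply-add: computes f*m + a with a single rounding.
// Assumed shape of a mad() helper; the real one may differ per platform.
inline float mad(float f, float m, float a) {
    return std::fma(f, m, a);
}

// Plain expression. With -ffp-contract=fast the compiler is allowed,
// but not required, to fuse this into an FMA (or an FMS when the
// addend is negated).
inline float mul_add_contractible(float f, float m, float a) {
    return f * m + a;
}
```

The two forms agree whenever the product is exactly representable, for example 2*3 + 4. They can differ in the last bit otherwise, since the fused form skips the intermediate rounding.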
| |
Mostly I think this will help me handle the AVX tails better.
But there are some wins here already, particularly in AVX and ARM code.
Change-Id: Ie79b4c2c4ab455277c313f15d360cbf8e4bb7836
Reviewed-on: https://skia-review.googlesource.com/9126
Reviewed-by: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I2c63e0996e4689950f8f3b82da0fb07941c26044
Reviewed-on: https://skia-review.googlesource.com/8952
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Slight changes to clamp to make it look more like the other two.
Mirror gets a fun new SSE/AVX abs() that requires no constants:
abs(v) = v & (0-v)
Change-Id: Iab4a61e39a7d28b47d9a10e7283df58b5e5a034e
Reviewed-on: https://skia-review.googlesource.com/8950
Reviewed-by: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
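The constant-free abs() identity from the commit above, abs(v) = v & (0-v), works because v and 0-v differ only in the sign bit, so ANDing their bit patterns clears that bit and keeps everything else. A scalar sketch follows; the helper names to_bits(), from_bits(), and abs_no_constants() are made up for this example, and the SSE/AVX version would apply the same AND to whole vectors.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer, and back.
static inline uint32_t to_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
static inline float from_bits(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// abs(v) = v & (0 - v): exactly one of v and (0 - v) has its sign bit
// clear, and all other bits match, so the AND yields |v|.
static inline float abs_no_constants(float v) {
    float negated = 0.0f - v;
    return from_bits(to_bits(v) & to_bits(negated));
}
```

The usual alternative is to AND with a 0x7fffffff mask, which needs that constant loaded from memory or materialized in a register. The subtraction form trades the constant for one extra arithmetic instruction.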
| |
Change-Id: I123caaee0bb8e3967c0a1f2acf1d80bcf0f41758
Reviewed-on: https://skia-review.googlesource.com/8944
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I6057ba3e9243641fecbc6b78f6f83ee3265ad3d4
Reviewed-on: https://skia-review.googlesource.com/8941
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: Icbd41e3dde9b39a61ccbe8e7622334ae53e5212a
Reviewed-on: https://skia-review.googlesource.com/8922
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
As far as I can tell, this draws identically to the SSE4.1 backend.
Change-Id: Id650db59a84d779b84d45f42e60321732e28d803
Reviewed-on: https://skia-review.googlesource.com/8913
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Decimal byte encoding makes more horizontal space for comments,
which are the only thing you really want to read.
No code change here.
Change-Id: I674d78c898976063b0d89b747af41c62dc294303
Reviewed-on: https://skia-review.googlesource.com/8899
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
AVX is a nice little halfway point between SSE4.1 and HSW, in terms
of instructions available, performance, and availability.
Intel chips have had AVX since ~2011, compared to ~2013 for HSW and
~2007 for SSE4.1. Like HSW it's got 8-wide 256-bit float vectors,
but integer (and double) operations are essentially still only 128-bit.
It also doesn't have F16 conversion or FMA instructions.
It doesn't look like this is going to be a burden to maintain, and only
adds a few KB of code size. In exchange, we now run 8x wide on 45% to
70% of x86 machines, depending on the OS.
In my brief testing, speed eerily resembles exact geometric progression:
SSE4.1: 1x speed (baseline)
AVX: ~sqrt(2)x speed
HSW: ~2x speed
This adds all the basic plumbing for AVX but leaves it disabled.
I'll flip it on once I've implemented the f16 TODOs.
Change-Id: I1c378dabb8a06386646371bf78ade9e9432b006f
Reviewed-on: https://skia-review.googlesource.com/8898
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
swap_rb is a big limiting factor on Windows and Linux.
set_rgb just happened to be nearby and easy.
Change-Id: Ic529c7578eeb278476821090127fa8fb1f70c04f
Reviewed-on: https://skia-review.googlesource.com/8859
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Going to start filling these in in biggest-bang-for-the-buck order.
lerp_u8 (i.e. text drawing) is number 1 right now.
Change-Id: If58eaf8ddbb93a6b954c3700fa1a476dca94a809
Reviewed-on: https://skia-review.googlesource.com/8856
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
Change-Id: I723ae1ecaebf43e84bf47163e44e7899faf31c8a
Reviewed-on: https://skia-review.googlesource.com/8824
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
| |
This should be a big win on Windows, but I haven't timed there yet.
On my Mac, it's a solid 2% speedup.
PS1 was insufficiently ambitious, but here it is for posterity:
No need to vzeroupper twice on Windows.
On Windows start_pipeline() will vzeroupper,
so no need to do it in just_return().
Change-Id: I099320b95da85900a60ce96fdb7a216a36db1858
Reviewed-on: https://skia-review.googlesource.com/8821
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Herb Derby <herb@google.com>
| |
- Compile stages with -DWIN to pick up MS-specific start_pipeline().
- Add SkJumper_generated_win.S with MS-specific assembly.
- Add a minimal asm tool to our GN Windows toolchain.
The SkRasterPipeline_f16 benchmark runs ~4x faster on my desktop.
Change-Id: Ia45afb4ecb6a055e2c0e43f0f54f59e081c23b7f
Reviewed-on: https://skia-review.googlesource.com/8778
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
|
Change-Id: Ie356b062372af3516a437d27bafa20d98e28edd6
Reviewed-on: https://skia-review.googlesource.com/8678
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
|