path: root/src/jumper/SkJumper_generated.S
Commit message (Author, Age)
* jumper, gather_8888 (Mike Klein, 2017-04-06)
  Change-Id: I70bd64d114a2460534bcb51d356e13d9bc3b8603
  Reviewed-on: https://skia-review.googlesource.com/11491
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, add load_f32() (Mike Klein, 2017-04-06)
  Change-Id: I71d85ffe29bc11678ff1e696fa4a2c93d0b4fcbe
  Reviewed-on: https://skia-review.googlesource.com/11446
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, kill off F4 (Mike Klein, 2017-04-06)
  Its alignment (sometimes 4, sometimes 16) has proven to be error-prone.
  This also means we don't really need LazyCtx::load(). I think I only had it there to make sure we were doing unaligned loads of F4; the better way is to just never declare the data as aligned...
  The generated code isn't quite as good, but I can live with it.
  Change-Id: I5d57a580ca12c94ca84a5e8b72a66cf8d0c829eb
  Reviewed-on: https://skia-review.googlesource.com/11406
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, to_2dot2 and from_2dot2 (Mike Klein, 2017-04-05)
  Nothing too tricky here.
  Change-Id: I2a10548efc75a6fd875fcb242790880d9b9a28fd
  Reviewed-on: https://skia-review.googlesource.com/11388
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Matt Sarett <msarett@google.com>
* jumper, load_u16_be and store_u16_be (Mike Klein, 2017-04-05)
  Change-Id: I2d58538ab071b217d8dbbf2d802493d9045eabf2
  Reviewed-on: https://skia-review.googlesource.com/11384
  Reviewed-by: Matt Sarett <msarett@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, split store_f16 into to_half, store4 (Mike Klein, 2017-04-04)
  Pretty much the same deal as the last CL going the other direction: split store_f16 into to_half() and store4(). Platforms that had fused strategies here get a little less optimal, but the code's easier to follow, maintain, and reuse.
  Also adds widen_cast() to encapsulate the fairly common pattern of expanding one of our logical vector types (e.g. 8-byte U16) up to the width of the physical vector type (e.g. 16-byte __m128i). This operation is deeply understood by Clang, and often is a no-op. I could make bit_cast() do this, but it seems clearer to have two names.
  Change-Id: I7ba5bb4746acfcaa6d486379f67e07baee3820b2
  Reviewed-on: https://skia-review.googlesource.com/11204
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
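The widen_cast() pattern described in this entry can be sketched roughly as follows. This is a minimal illustration, not the code from the CL: U16x4 and U16x8 are hypothetical stand-ins for an 8-byte logical vector and a 16-byte physical vector, and zero-filling the upper bytes is an assumption made here so the result is deterministic.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins: an 8-byte logical vector and a 16-byte physical one.
struct U16x4 { uint16_t lane[4]; };
struct U16x8 { uint16_t lane[8]; };

// Copy a narrow value into the low bytes of a wider type.
// Assumption: the upper bytes are zero-filled here; a real implementation
// might leave them unspecified.
template <typename Dst, typename Src>
Dst widen_cast(const Src& src) {
    static_assert(sizeof(Dst) > sizeof(Src), "widen_cast must widen");
    Dst dst{};                             // value-initialize: all bytes zero
    std::memcpy(&dst, &src, sizeof(Src));  // fill only the low sizeof(Src) bytes
    return dst;
}
```

Keeping this separate from a same-size bit_cast() means the size relationship is checked at compile time in both helpers, which matches the entry's preference for two names.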
* jumper, factor out load4() and from_half() (Mike Klein, 2017-04-04)
  load_f16 gets slightly worse codegen for ARMv7, SSE2, SSE4.1, and AVX from splitting it apart compared to the previous fused versions. But the stage code becomes much simpler. I'm happy to make those trades until someone complains.
  load4() will be useful on its own to implement a couple other stages.
  Everything draws the same. I intend to follow up with more of the same sort of refactoring, but this was a tricky enough change that I want to do them in small steps.
  Change-Id: Ib4aa86a58d000f2d7916937cd4f22dc2bd135a49
  Reviewed-on: https://skia-review.googlesource.com/11186
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, a couple simple loads and stores (Mike Klein, 2017-03-31)
  Change-Id: I217d7b562f5fa443978044e17469ba757c061209
  Reviewed-on: https://skia-review.googlesource.com/10971
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, only ignore red zone on Windows (Mike Klein, 2017-03-31)
  On Linux and Mac there's always a red zone of 128 bytes of stack space for us to use without touching the stack pointer. We'd been generating stage code as if that's not there because it's not there on Windows.
  We have a separate .S file for Windows anyway, so there's no need to ignore the red zone when we know it's there.
  Change-Id: I81a7841020bb8aad68bf35feac851727ef1d0758
  Reviewed-on: https://skia-review.googlesource.com/10965
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
* jumper, caught up on blend modes (Mike Klein, 2017-03-31)
  Change-Id: I39538c90cd4c68691c3956e3f51616b77e4c90d1
  Reviewed-on: https://skia-review.googlesource.com/10961
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, another batch of blend modes (Mike Klein, 2017-03-31)
  Change-Id: I0c48ec80eee8b7c7e9fb980efa8ed1dad5ad9768
  Reviewed-on: https://skia-review.googlesource.com/10924
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, more blend modes (Mike Klein, 2017-03-31)
  Change-Id: I17ce08a7ec62ef8ffe8ae567079d669a87ef9a9c
  Reviewed-on: https://skia-review.googlesource.com/10921
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* tell Google3 we do not need executable stack (Mike Klein, 2017-03-31)
  Change-Id: I5ff3599448d027fcac43a53e98a801ce672ce5ee
  Reviewed-on: https://skia-review.googlesource.com/10861
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* Don't export stage symbols. (Mike Klein, 2017-03-30)
  Chromium Mac bots are getting tripped up by stages being visible.
  .hidden and .private_extern are the -fvisibility=hidden equivalents for ELF and Mach-O.
  CQ_INCLUDE_TRYBOTS=skia.primary:Build-Mac-Clang-arm-Debug-iOS
  Change-Id: I8dbb04f514eead4ab480664f2674db4b57611b84
  Reviewed-on: https://skia-review.googlesource.com/10622
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* jumper, revert to generating .S files (Mike Klein, 2017-03-29)
  I went with the unified-in-one-.cpp approach mostly to make it easy to roll out SkJumper. I no longer see any difficulty rolling out the assembly files, and it's possible the unified .cpp approach just makes things harder.
  Let's see if it's any easier to get Chrome's official build to work with normal assembly files. It's not going to be a problem to roll out.
  This is a partial revert of https://skia-review.googlesource.com/c/9336.
  CQ_INCLUDE_TRYBOTS=skia.primary:Test-Win2k8-MSVC-GCE-CPU-AVX2-x86_64-Debug,Test-Mac-Clang-MacMini6.2-CPU-AVX-x86_64-Debug,Test-Ubuntu-Clang-GCE-CPU-AVX2-x86_64-Debug,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug
  Change-Id: Idfdbd2d322452b44bc0adaf6dc299cc7649bc51e
  Reviewed-on: https://skia-review.googlesource.com/10561
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* Back to code as data arrays, this time in .text. (Mike Klein, 2017-03-07)
  This technique lets us generate a single source file, use the C++ preprocessor, and avoid the pain of working with assemblers.
  By using the section attribute or declspec allocate, we can put these data arrays into the .text section, making them ordinary code. This is like the previous solution, except it should actually run.
  CQ_INCLUDE_TRYBOTS=skia.primary:Test-Win2k8-MSVC-GCE-CPU-AVX2-x86_64-Debug,Test-Mac-Clang-MacMini6.2-CPU-AVX-x86_64-Debug,Test-Ubuntu-Clang-GCE-CPU-AVX2-x86_64-Debug,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug
  Change-Id: Ide7675f6cf32eb4831ff02906acbdc3faaeaa684
  Reviewed-on: https://skia-review.googlesource.com/9336
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: store_f32 (Mike Klein, 2017-03-03)
  Change-Id: I4bc6d1a8787c540fd1a29274650d34392e56651c
  Reviewed-on: https://skia-review.googlesource.com/9223
  Reviewed-by: Mike Klein <mtklein@chromium.org>
* SkJumper: use AVX2 mask loads and stores for U32 (Mike Klein, 2017-03-02)
  SkRasterPipeline_f16: 63 -> 58 (8888+f16 loads, f16 store)
  SkRasterPipeline_srgb: 96 -> 84 (2x 8888 loads, 8888 store)
  PS3 has a simpler way to build the mask, in a uint64_t. Timing is still roughly the same.
  Change-Id: Ie278611dff02281e5a0f3a57185050bbe852bff0
  Reviewed-on: https://skia-review.googlesource.com/9165
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
* SkJumper: skip null contexts (Mike Klein, 2017-03-02)
  This makes stages that don't use a context pointer look a little cleaner, especially on ARM. No interesting speed difference on x86.
  What do you think?
  Change-Id: I445472be2aa8a7c3bc8cba443fa477a3628118ba
  Reviewed-on: https://skia-review.googlesource.com/9155
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: be more precise by rejecting data sections. (Mike Klein, 2017-03-02)
  This allows %rip addressing as long as it's not going into a data section. This lets us use switch tables, avoiding loops and stack.
  On HSW,
    SkRasterPipeline_f16: 90 -> 63
    SkRasterPipeline_srgb: 170 -> 97
  Change-Id: I3ca2e4ff819b70beea78be75579f9d80c06979e8
  Reviewed-on: https://skia-review.googlesource.com/9146
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: handle the <kStride tail in AVX+ mode. (Mike Klein, 2017-03-02)
  We have plenty of general-purpose registers to spare on x86-64, so the cheapest thing to do is use one to hold the usual 'tail'.
  Speedups on HSW:
    SkRasterPipeline_srgb: 292 -> 170
    SkRasterPipeline_f16: 122 -> 90
  There's plenty more room to improve here, e.g. using mask loads and stores, but this seems to be enough to get things working reasonably.
  BUG=skia:6289
  Change-Id: I8c0ed325391822e9f36636500350205e93942111
  Reviewed-on: https://skia-review.googlesource.com/9110
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
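The general shape of running full strides and then one partial 'tail' can be sketched as below. This is an illustrative scalar model, not the generated AVX code: the stage function, the kStride value of 8, and the convention that tail == 0 means "a full stride" are all assumptions made for the example.

```cpp
#include <cstddef>

// Assumed vector width for illustration (8 floats, as in a 256-bit register).
constexpr size_t kStride = 8;

// Hypothetical stage: doubles each pixel value it is asked to process.
// Assumed convention: tail == 0 means a full kStride, otherwise only
// `tail` elements (0 < tail < kStride) are valid.
static void stage(float* p, size_t tail) {
    size_t n = tail ? tail : kStride;
    for (size_t i = 0; i < n; i++) {
        p[i] *= 2.0f;
    }
}

// Run the stage over n elements: whole strides first, then one tail call.
static void run(float* p, size_t n) {
    size_t x = 0;
    for (; x + kStride <= n; x += kStride) {
        stage(p + x, 0);
    }
    if (size_t tail = n - x) {
        stage(p + x, tail);
    }
}
```

Passing the tail as an ordinary argument mirrors the idea of dedicating a spare general-purpose register to it, so full-stride calls pay almost nothing for the extra case.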
* SkJumper: allow the compiler to generate FMAs (Mike Klein, 2017-03-02)
  Today we use mad() to get FMAs where possible. -ffp-contract=fast lets the compiler generate them if it spots an opportunity. It looks like it's found a mix of FMAs and FMSs.
  I will follow up by seeing if we can relax the use of mad(). Quick experiments say no, but less quick experiments may say otherwise.
  Change-Id: I5228811cfbf11cccc0d715672a464fd1e1cea3b0
  Reviewed-on: https://skia-review.googlesource.com/9136
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: upgrade to Clang 3.9 (Mike Klein, 2017-03-01)
  Mostly I think this will help me handle the AVX tails better. But there are some wins here already, particularly in AVX and ARM code.
  Change-Id: Ie79b4c2c4ab455277c313f15d360cbf8e4bb7836
  Reviewed-on: https://skia-review.googlesource.com/9126
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: perspective matrix (Mike Klein, 2017-02-24)
  Change-Id: I2c63e0996e4689950f8f3b82da0fb07941c26044
  Reviewed-on: https://skia-review.googlesource.com/8952
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: tiling modes (Mike Klein, 2017-02-24)
  Slight changes to clamp to make it look more like the other two.
  Mirror gets a fun new SSE/AVX abs() that requires no constants:
    abs(v) = v & (0-v)
  Change-Id: Iab4a61e39a7d28b47d9a10e7283df58b5e5a034e
  Reviewed-on: https://skia-review.googlesource.com/8950
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
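The abs() identity in this entry works because v and 0-v have the same exponent and mantissa bits and opposite sign bits, so a bitwise AND clears the sign. A scalar sketch of the same idea follows; the helper names are hypothetical, and the real stage operates on SSE/AVX vectors rather than a single float.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer, and back.
static uint32_t to_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
static float from_bits(uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// abs(v) = v & (0 - v): the two operands differ only in the sign bit,
// so ANDing their bit patterns leaves the magnitude with the sign cleared.
// No sign-mask constant needs to be loaded.
static float abs_via_and(float v) {
    return from_bits(to_bits(v) & to_bits(0.0f - v));
}
```

This sketch assumes ordinary IEEE 754 float behavior; NaN inputs are not considered here.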
* SkJumper: a8 (Mike Klein, 2017-02-24)
  Change-Id: I123caaee0bb8e3967c0a1f2acf1d80bcf0f41758
  Reviewed-on: https://skia-review.googlesource.com/8944
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: scales and lerps (Mike Klein, 2017-02-24)
  Change-Id: I6057ba3e9243641fecbc6b78f6f83ee3265ad3d4
  Reviewed-on: https://skia-review.googlesource.com/8941
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: 565 (Mike Klein, 2017-02-23)
  Change-Id: Icbd41e3dde9b39a61ccbe8e7622334ae53e5212a
  Reviewed-on: https://skia-review.googlesource.com/8922
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: fill in AVX f16 stages, turn on AVX (Mike Klein, 2017-02-23)
  As far as I can tell, this draws identically to the SSE4.1 backend.
  Change-Id: Id650db59a84d779b84d45f42e60321732e28d803
  Reviewed-on: https://skia-review.googlesource.com/8913
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: reformat .S files (Mike Klein, 2017-02-23)
  Decimal byte encoding makes more horizontal space for comments, which are the only thing you really want to read.
  No code change here.
  Change-Id: I674d78c898976063b0d89b747af41c62dc294303
  Reviewed-on: https://skia-review.googlesource.com/8899
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* Add AVX to the SkJumper mix. (Mike Klein, 2017-02-23)
  AVX is a nice little halfway point between SSE4.1 and HSW, in terms of instructions available, performance, and availability.
  Intel chips have had AVX since ~2011, compared to ~2013 for HSW and ~2007 for SSE4.1. Like HSW it's got 8-wide 256-bit float vectors, but integer (and double) operations are essentially still only 128-bit. It also doesn't have F16 conversion or FMA instructions.
  It doesn't look like this is going to be a burden to maintain, and only adds a few KB of code size. In exchange, we now run 8x wide on 45% to 70% of x86 machines, depending on the OS.
  In my brief testing, speed eerily resembles an exact geometric progression:
    SSE4.1: 1x speed (baseline)
    AVX: ~sqrt(2)x speed
    HSW: ~2x speed
  This adds all the basic plumbing for AVX but leaves it disabled. I'll flip it on once I've implemented the f16 TODOs.
  Change-Id: I1c378dabb8a06386646371bf78ade9e9432b006f
  Reviewed-on: https://skia-review.googlesource.com/8898
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: set_rgb and swap_rb (Mike Klein, 2017-02-22)
  swap_rb is a big limiting factor on Windows and Linux. set_rgb just happened to be nearby and easy.
  Change-Id: Ic529c7578eeb278476821090127fa8fb1f70c04f
  Reviewed-on: https://skia-review.googlesource.com/8859
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: implement lerp_u8 (Mike Klein, 2017-02-22)
  Going to start filling these in in biggest-bang-for-the-buck order. lerp_u8 (i.e. text drawing) is number 1 right now.
  Change-Id: If58eaf8ddbb93a6b954c3700fa1a476dca94a809
  Reviewed-on: https://skia-review.googlesource.com/8856
  Reviewed-by: Herb Derby <herb@google.com>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* Whoops, forgot to re-run build_stages. (Mike Klein, 2017-02-21)
  Change-Id: I723ae1ecaebf43e84bf47163e44e7899faf31c8a
  Reviewed-on: https://skia-review.googlesource.com/8824
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* Move looping logic into start_pipeline(). (Mike Klein, 2017-02-21)
  This should be a big win on Windows, but I haven't timed there yet. On my Mac, it's a solid 2% speedup.
  PS1 was insufficiently ambitious, but here it is for posterity:
    No need to vzeroupper twice on Windows.
    On Windows start_pipeline() will vzeroupper, so no need to do it in just_return().
  Change-Id: I099320b95da85900a60ce96fdb7a216a36db1858
  Reviewed-on: https://skia-review.googlesource.com/8821
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Herb Derby <herb@google.com>
* SkJumper: Windows (Mike Klein, 2017-02-21)
  - Compile stages with -DWIN to pick up MS-specific start_pipeline().
  - Add SkJumper_generated_win.S with MS-specific assembly.
  - Add a minimal asm tool to our GN Windows toolchain.
  The SkRasterPipeline_f16 benchmark runs ~4x faster on my desktop.
  Change-Id: Ia45afb4ecb6a055e2c0e43f0f54f59e081c23b7f
  Reviewed-on: https://skia-review.googlesource.com/8778
  Reviewed-by: Mike Klein <mtklein@chromium.org>
  Commit-Queue: Mike Klein <mtklein@chromium.org>
* SkJumper: aarch64 and armv7 (Mike Klein, 2017-02-18)
  Change-Id: Ie356b062372af3516a437d27bafa20d98e28edd6
  Reviewed-on: https://skia-review.googlesource.com/8678
  Commit-Queue: Mike Klein <mtklein@chromium.org>
  Reviewed-by: Mike Klein <mtklein@chromium.org>