trim another instruction off SkRasterPipeline overhead - skia


author	Mike Klein <mtklein@chromium.org>	2016-12-29 11:06:34 -0500
committer	Skia Commit-Bot <skia-commit-bot@chromium.org>	2017-01-03 18:13:21 +0000
commit	e61c40707e70a2be9e32227a929173864f7895e1 (patch)
tree	a12726f4f2f2d1d3788f7cf99a8477ceae8191dd /PRESUBMIT.py
parent	7551898f8eba322acb04c74ae12aae1ed3548105 (diff)

trim another instruction off SkRasterPipeline overhead

The overhead of a stage today is 3 x86 instructions, typically looking something like this:

    - movq   (%rdi), %rax     // Load the next stage function pointer.
    - addq   $0x10, %rdi      // Step our progress ahead 16 bytes to that next stage.
    - jmpq   *%rax            // Transfer control to that stage.

But if we make sure the pointer's in esi/rsi, we can use lodsd/lodsq to do those first two steps in one instruction:

    - lodsq  (%rsi), %rax     (≈ movq (%rsi), %rax; addq $0x8, %rsi).
    - jmpq   *%rax

This CL rearranges things so that we can take advantage of this and generally trim off an instruction of overhead.

Instead of a vector of {Fn, ctx} pairs, we'll flatten it down into a single interlaced program vector of void*, basically just omitting any null context pointers.

We pass the pointer to program as the second argument to Fn, putting it in rsi.

These two changes together make getting the next Fn to call or the current context the same cheap lodsq instruction, encapsulated as load_and_increment().

Here's how the simple "modulate" blend stage changes:

    vmulps %ymm4, %ymm0, %ymm0
    vmulps %ymm5, %ymm1, %ymm1
    vmulps %ymm6, %ymm2, %ymm2
    vmulps %ymm7, %ymm3, %ymm3
    movq   (%rdi), %rax
    addq   $0x10, %rdi
    jmpq   *%rax

        ~~~~~~~~>

    vmulps %ymm4, %ymm0, %ymm0
    vmulps %ymm5, %ymm1, %ymm1
    vmulps %ymm6, %ymm2, %ymm2
    vmulps %ymm7, %ymm3, %ymm3
    lodsq  (%rsi), %rax
    jmpq   *%rax

This does make getting the current context a one-time, destructive operation. Stages have switched from referring to ctx as a void* directly to using ctx() as a thunk that returns a void*. No stage so far has ever referred to ctx twice, and it all appears to inline, so this seems harmless.

"matrix_2x3" is a good example of what stages that use context pointers end up looking like:

    lodsq        (%rsi), %rax
    vbroadcastss (%rax), %ymm9
    vbroadcastss 0x8(%rax), %ymm10
    vbroadcastss 0x10(%rax), %ymm8
    vfmadd231ps  %ymm10, %ymm1, %ymm8
    vfmadd231ps  %ymm9, %ymm0, %ymm8
    vbroadcastss 0x4(%rax), %ymm10
    vbroadcastss 0xc(%rax), %ymm11
    vbroadcastss 0x14(%rax), %ymm9
    vfmadd231ps  %ymm11, %ymm1, %ymm9
    vfmadd231ps  %ymm10, %ymm0, %ymm9
    lodsq        (%rsi), %rax
    vmovaps      %ymm8, %ymm0
    vmovaps      %ymm9, %ymm1
    jmpq         *%rax

We can't do this with MSVC, as there's no intrinsic for it I can find, and they disallow inline assembly, and rsi is not used to pass arguments to functions there anyway.

ARM doesn't need it... it does this in two instructions naturally anyway.

We could do this for 32-bit x86 but I'd just rather focus on x86-64.

It's unclear to me that this makes things any faster, but it doesn't appear to make things any slower, and I think it makes both the code and disassembly simpler.

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Change-Id: Ia7b543a6718c75a33095371924003c5402b3445a
Reviewed-on: https://skia-review.googlesource.com/6271
Reviewed-by: Herb Derby <herb@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>

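The flattened program vector and load_and_increment() described in the commit message can be sketched in portable C++. This is only an illustration, not the code from this CL: the stage signature, the stage names square, scale and done, and the helpers call_next, append and run_pipeline are all hypothetical, a single float stands in for the ymm registers, and a plain pointer bump stands in for lodsq. It also assumes a platform where function pointers round-trip through void*, as on x86-64 with GCC or Clang.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stage signature for illustration only. A single float stands in
// for the ymm registers, and `program` plays the role of the pointer in rsi.
using Fn = float (*)(float x, void** program);

// Portable stand-in for lodsq: read the pointer at *program, then step
// program ahead by one pointer (8 bytes on x86-64).
static inline void* load_and_increment(void*** program) {
    void* p = **program;
    *program += 1;
    return p;
}

// Fetch the next stage from the program and transfer control to it.
static float call_next(float x, void** program) {
    auto fn = reinterpret_cast<Fn>(load_and_increment(&program));
    return fn(x, program);
}

// A stage with no context: only its Fn is stored in the program.
static float square(float x, void** program) {
    return call_next(x * x, program);
}

// A stage with a context: its ctx pointer sits right after its Fn, so reading
// it is the same load_and_increment(), and it can only be read once.
static float scale(float x, void** program) {
    auto k = static_cast<const float*>(load_and_increment(&program));
    return call_next(x * *k, program);
}

// Terminator, so the last real stage has something to jump to.
static float done(float x, void**) { return x; }

// Append a stage, omitting null context pointers from the flattened program.
static void append(std::vector<void*>* program, Fn fn, void* ctx = nullptr) {
    assert(fn != nullptr);
    program->push_back(reinterpret_cast<void*>(fn));
    if (ctx) { program->push_back(ctx); }
}

// Run a copy of the program on x, starting from its first stage.
static float run_pipeline(std::vector<void*> program, float x) {
    program.push_back(reinterpret_cast<void*>(&done));
    return call_next(x, program.data());
}
```

Appending scale with a pointer to 3.0f and then square yields a three-entry program {scale, &k, square}, and running it on 2.0f computes (2 * 3)^2. The stage that has no context costs one slot instead of two, which is the point of interlacing.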
Diffstat (limited to 'PRESUBMIT.py')

0 files changed, 0 insertions, 0 deletions

