diff options
author | 2017-08-03 00:04:12 -0400 | |
---|---|---|
committer | 2017-08-03 13:24:46 +0000 | |
commit | e7f89fc257a5ddd83a314e7bbdd23cb17a461ae5 (patch) | |
tree | 77b99ba0f8714c42f56f1638c39d68fc2fd5e9f9 /gm/gradients.cpp | |
parent | 698edfecef121d8575eee6af207ce8a9525032ee (diff) |
improve HSW 16->8 bit pack
__builtin_convertvector(..., U8x4) is producing a fairly long
sequence of code to convert U16x4 to U8x4 on HSW:
vextracti128 $0x1,%ymm2,%xmm3
vmovdqa 0x1848(%rip),%xmm4
vpshufb %xmm4,%xmm3,%xmm3
vpshufb %xmm4,%xmm2,%xmm2
vpunpcklqdq %xmm3,%xmm2,%xmm2
vextracti128 $0x1,%ymm0,%xmm3
vpshufb %xmm4,%xmm3,%xmm3
vpshufb %xmm4,%xmm0,%xmm0
vpunpcklqdq %xmm3,%xmm0,%xmm0
vinserti128 $0x1,%xmm2,%ymm0,%ymm0
We can do much better with _mm256_packus_epi16:
vinserti128 $0x1,%xmm0,%ymm2,%ymm3
vperm2i128 $0x31,%ymm0,%ymm2,%ymm0
vpackuswb %ymm0,%ymm3,%ymm0
vpackuswb packs the values in a somewhat surprising order,
which the first two instructions get us lined up for.
This is a pretty noticeable speedup, 7-8% on some benchmarks.
The same sort of change could be made for SSE2 and SSE4.1 also
using _mm_packus_epi16, but the difference for that change is
much less dramatic. Might as well stick to focusing on HSW.
Change-Id: I0d6765bd67e0d024d658a61d19e6f6826b4d392c
Reviewed-on: https://skia-review.googlesource.com/30420
Reviewed-by: Florin Malita <fmalita@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
Diffstat (limited to 'gm/gradients.cpp')
0 files changed, 0 insertions, 0 deletions