3-15% speedup to HardLight / Overlay xfermodes.

While investigating my bug (skia:4052) I saw this TODO and figured it'd make me feel better about an otherwise unsuccessful investigation.

This speeds up HardLight and Overlay (same code) by about 15% with SSE, mostly by rewriting the logic from 1 cheap comparison and 2 expensive div255() calls to 2 cheap comparisons and 1 expensive div255(). NEON speeds up by a more modest ~3%.

BUG=skia:

Review URL: https://codereview.chromium.org/1230663005
author: mtklein <mtklein@chromium.org> 2015-07-14 10:54:19 -0700
committer: Commit bot <commit-bot@chromium.org> 2015-07-14 10:54:19 -0700
commit: 4be181e304d2b280c6801bd13369cfba236d1a66 (patch)
tree: ae0510f8a6504c3333582fa004e961a8771a2d99 /src/opts/Sk4px_NEON.h
parent: a5517e2b190a8083b38964972b031c13e99f1012 (diff)
1 files changed, 6 insertions, 0 deletions
diff --git a/src/opts/Sk4px_NEON.h b/src/opts/Sk4px_NEON.h
index 9401864697..cd6dea9979 100644
--- a/src/opts/Sk4px_NEON.h
+++ b/src/opts/Sk4px_NEON.h
@@ -40,6 +40,12 @@ inline Sk4px::Wide Sk4px::widenHi() const {
                  vshll_n_u8(vget_high_u8(this->fVec), 8));
 }
 
+inline Sk4px::Wide Sk4px::widenLoHi() const {
+    auto zipped = vzipq_u8(this->fVec, this->fVec);
+    return Sk16h((uint16x8_t)zipped.val[0],
+                 (uint16x8_t)zipped.val[1]);
+}
+
 inline Sk4px::Wide Sk4px::mulWiden(const Sk16b& other) const {
     return Sk16h(vmull_u8(vget_low_u8 (this->fVec), vget_low_u8 (other.fVec)),
                  vmull_u8(vget_high_u8(this->fVec), vget_high_u8(other.fVec)));
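
For readers without NEON hardware, the effect of the added widenLoHi() can be sketched in portable scalar C++. vzipq_u8(v, v) interleaves a vector with itself (b0,b0,b1,b1,...), so reinterpreting the result as 16-bit lanes puts each byte in both halves of its lane. A 0xFF byte becomes 0xFFFF and a 0x00 byte stays 0x0000, which is presumably why it is useful for turning a byte-wide comparison mask into a 16-bit one; that motivation is an assumption, not something the diff states. The name widen_lo_hi and the std::array types below are hypothetical stand-ins, not Skia's API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for the NEON Sk4px::widenLoHi() above; not Skia code.
// Zipping a 16-byte vector with itself yields b0,b0,b1,b1,...  Viewed as
// little-endian 16-bit lanes, lane i is (b[i] << 8) | b[i], i.e. b[i] * 257.
std::array<uint16_t, 16> widen_lo_hi(const std::array<uint8_t, 16>& bytes) {
    std::array<uint16_t, 16> wide{};
    for (std::size_t i = 0; i < bytes.size(); i++) {
        // Replicate the byte into the low and high half of its 16-bit lane.
        wide[i] = static_cast<uint16_t>((bytes[i] << 8) | bytes[i]);
    }
    return wide;
}
```

Note the contrast with the widenHi() context lines above the hunk, where vshll_n_u8(..., 8) places each byte only in the high half of its lane (x << 8), leaving the low half zero.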