better NEON div255

We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dgm&master=false BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot Review URL: https://codereview.chromium.org/1502843002
author: mtklein <mtklein@chromium.org> 2015-12-07 08:21:11 -0800
committer: Commit bot <commit-bot@chromium.org> 2015-12-07 08:21:11 -0800
commit: 9d344069c5d45a936823d3f84999898383a026cd (patch)
tree: 7bd971814ae504d4d5df3dc9aee7a182cd78f42c /src
parent: a544eda5ddea215037b1ada6ba5cfc98f6c8ee15 (diff)
1 files changed, 3 insertions, 3 deletions
diff --git a/src/opts/Sk4px_NEON.h b/src/opts/Sk4px_NEON.h
index c27bb13764..c8abff24b1 100644
--- a/src/opts/Sk4px_NEON.h
+++ b/src/opts/Sk4px_NEON.h
@@ -58,9 +58,9 @@ inline Sk4px Sk4px::Wide::addNarrowHi(const Sk16h& other) const {
 }
 
 inline Sk4px Sk4px::Wide::div255() const {
-    // Calculated as ((x+128) + ((x+128)>>8)) >> 8.
-    auto v = *this + Sk16h(128);
-    return v.addNarrowHi(v>>8);
+    // Calculated as (x + (x+128)>>8 +128) >> 8.  The 'r' in each instruction provides each +128.
+    return Sk16b(vcombine_u8(vraddhn_u16(this->fLo.fVec, vrshrq_n_u16(this->fLo.fVec, 8)),
+                             vraddhn_u16(this->fHi.fVec, vrshrq_n_u16(this->fHi.fVec, 8))));
 }
 
 inline Sk4px Sk4px::alphas() const {
author	mtklein <mtklein@chromium.org>	2015-12-07 08:21:11 -0800
committer	Commit bot <commit-bot@chromium.org>	2015-12-07 08:21:11 -0800
commit	9d344069c5d45a936823d3f84999898383a026cd (patch)
tree	7bd971814ae504d4d5df3dc9aee7a182cd78f42c /src
parent	a544eda5ddea215037b1ada6ba5cfc98f6c8ee15 (diff)