Rework SSE and NEON Color32 algorithms to be more correct and faster.

This change alters the blend math, guarded by SK_SUPPORT_LEGACY_COLOR32_MATH. The new math is more correct: it's never off by more than 1, and it is correct in all the interesting 0x00 and 0xFF edge cases. The old math was never off by more than 2, and was not always correct on the edges.

If you look at tests/BlendTest.cpp, the old code was using the `blend_256_plus1_trunc` algorithm, while the new code uses `blend_256_round_alt`. Neither uses `blend_perfect`, which is about 35% slower than `blend_256_round_alt`.

This will require an unfathomable number of rebaselines, first to Skia, then to Blink when I remove the guard.

I plan to follow up with some integer SIMD abstractions that can unify these two implementations into a single algorithm. This was originally what I was working on here, but the correctness gains seem to be quite compelling. The only place these two algorithms really differ greatly now is the kernel function, and even there they can both be expressed abstractly as:
  - multiply 8-bits and 8-bits producing 16-bits
  - add 16-bits to 16-bits, returning the top 8 bits.
All the constants are the same, except that SSE is a little faster keeping 8 16-bit inverse alphas, while NEON is a little faster keeping 8 8-bit inverse alphas. I may need to take this small speed win back to unify the two.

We should expect a ~25% speedup on Intel (mostly from unrolling to 8 pixels) and a ~20% speedup on ARM (mostly from using vaddhn to add `color`, round, and narrow back down to 8-bit all in one instruction).

(I am probably missing several more related bugs here.)

BUG=skia:3738,skia:420,chromium:111470

Review URL: https://codereview.chromium.org/1092433002
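
The two-step kernel described in the message (an 8-bit by 8-bit multiply producing 16 bits, then a 16-bit add that keeps the top 8 bits) can be sketched for a single channel in scalar C++. This is an illustrative sketch, not the code in tests/BlendTest.cpp: the names `blend_exact_sketch`, `blend_round_alt_sketch`, and `max_abs_error_sketch` are hypothetical, and the inputs are assumed to be premultiplied so that `color <= alpha`.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical reference: exact "color over src" for one premultiplied
// channel, with the division by 255 rounded to nearest.
static inline int blend_exact_sketch(int src, int color, int alpha) {
    return color + (src * (255 - alpha) + 127) / 255;
}

// Sketch of the kernel shape from the message, for one channel.
// Assumes premultiplied input (color <= alpha), so the 16-bit sum cannot overflow.
static inline uint8_t blend_round_alt_sketch(uint8_t src, uint8_t color, uint8_t alpha) {
    unsigned invA = 255u - alpha;
    invA += invA >> 7;                                 // stretch 0..255 to 0..256
    unsigned product = src * invA;                     // 8-bit x 8-bit (ish) -> 16-bit
    unsigned sum = product + (color << 8) + 128u;      // add color and rounding term in 16 bits
    return static_cast<uint8_t>(sum >> 8);             // keep the top 8 bits
}

// Exhaustively measures the largest difference from the exact result
// over all premultiplied (color <= alpha) inputs.
static inline int max_abs_error_sketch() {
    int worst = 0;
    for (int alpha = 0; alpha <= 255; ++alpha) {
        for (int color = 0; color <= alpha; ++color) {
            for (int src = 0; src <= 255; ++src) {
                int approx = blend_round_alt_sketch(static_cast<uint8_t>(src),
                                                    static_cast<uint8_t>(color),
                                                    static_cast<uint8_t>(alpha));
                int err = std::abs(approx - blend_exact_sketch(src, color, alpha));
                if (err > worst) worst = err;
            }
        }
    }
    return worst;
}
```

Under these assumptions, alpha 0 leaves `src` unchanged (the inverse alpha becomes exactly 256) and alpha 255 returns `color` (the inverse alpha becomes 0), which is the edge-case behavior the message calls out.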
author: mtklein <mtklein@chromium.org> 2015-04-17 11:00:54 -0700
committer: Commit bot <commit-bot@chromium.org> 2015-04-17 11:00:55 -0700
commit: afe2ffb8ba5e7362a2ee6f4e1540c9ab22df2c1e (patch)
tree: 7416e7410276c509dd66730abbd6173eb1992f95 /src/core/SkBlitRow_D32.cpp
parent: 9d911d5a9323bda1e4a77c46a0c28708dcc2ad38 (diff)
1 files changed, 28 insertions, 18 deletions
diff --git a/src/core/SkBlitRow_D32.cpp b/src/core/SkBlitRow_D32.cpp
index 509eeeb1a0..ac01e427bf 100644
--- a/src/core/SkBlitRow_D32.cpp
+++ b/src/core/SkBlitRow_D32.cpp
@@ -140,27 +140,37 @@ SkBlitRow::Proc32 SkBlitRow::ColorProcFactory() {
     return proc;
 }
 
+#define SK_SUPPORT_LEGACY_COLOR32_MATHx
+
+// Color32 and its SIMD specializations use the blend_256_round_alt algorithm
+// from tests/BlendTest.cpp.  It's not quite perfect, but it's never wrong in the
+// interesting edge cases, and it's quite a bit faster than blend_perfect.
+//
+// blend_256_round_alt is our currently blessed algorithm.  Please use it or an analogous one.
 void SkBlitRow::Color32(SkPMColor* SK_RESTRICT dst,
                         const SkPMColor* SK_RESTRICT src,
                         int count, SkPMColor color) {
-    if (count > 0) {
-        if (0 == color) {
-            if (src != dst) {
-                memcpy(dst, src, count * sizeof(SkPMColor));
-            }
-            return;
-        }
-        unsigned colorA = SkGetPackedA32(color);
-        if (255 == colorA) {
-            sk_memset32(dst, color, count);
-        } else {
-            unsigned scale = 256 - SkAlpha255To256(colorA);
-            do {
-                *dst = color + SkAlphaMulQ(*src, scale);
-                src += 1;
-                dst += 1;
-            } while (--count);
-        }
+    switch (SkGetPackedA32(color)) {
+        case   0: memmove(dst, src, count * sizeof(SkPMColor)); return;
+        case 255: sk_memset32(dst, color, count);               return;
+    }
+
+    unsigned invA = 255 - SkGetPackedA32(color);
+#ifdef SK_SUPPORT_LEGACY_COLOR32_MATH  // blend_256_plus1_trunc, busted
+    unsigned round = 0;
+#else                          // blend_256_round_alt, good
+    invA += invA >> 7;
+    unsigned round = (128 << 16) + (128 << 0);
+#endif
+
+    while (count --> 0) {
+        // Our math is 16-bit, so we can do a little bit of SIMD in 32-bit registers.
+        const uint32_t mask = 0x00FF00FF;
+        uint32_t rb = (((*src >> 0) & mask) * invA + round) >> 8,  // _r_b
+                 ag = (((*src >> 8) & mask) * invA + round) >> 0;  // a_g_
+        *dst = color + ((rb & mask) | (ag & ~mask));
+        src++;
+        dst++;
     }
 }
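
The new loop body blends two channels at a time by treating a 32-bit pixel as two 16-bit lanes. A standalone sketch of that trick, next to a per-channel reference, might look like the following. The function names are hypothetical, and the sketch assumes alpha lives in the top byte of the pixel (in Skia the position depends on the platform's shift constants) and that `color` is premultiplied.

```cpp
#include <cstdint>

// Sketch: blend one pixel using two 16-bit lanes inside a 32-bit register.
// Assumption: alpha is in bits 24..31 and color is premultiplied.
static inline uint32_t color32_pixel_sketch(uint32_t src, uint32_t color) {
    unsigned invA = 255u - (color >> 24);
    invA += invA >> 7;                                  // 0..255 -> 0..256
    const uint32_t round = (128u << 16) | 128u;         // +128 in each 16-bit lane
    const uint32_t mask  = 0x00FF00FFu;                 // low byte of each lane
    // Each lane holds at most 255 * 256 + 128 = 65408, so lanes never carry into each other.
    uint32_t rb = (((src >> 0) & mask) * invA + round) >> 8;  // bytes 0 and 2, results in low bytes
    uint32_t ag = (((src >> 8) & mask) * invA + round) >> 0;  // bytes 1 and 3, results in high bytes
    return color + ((rb & mask) | (ag & ~mask));
}

// Reference: the same math applied one 8-bit channel at a time.
static inline uint32_t color32_pixel_reference(uint32_t src, uint32_t color) {
    unsigned invA = 255u - (color >> 24);
    invA += invA >> 7;
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src   >> shift) & 0xFFu;
        uint32_t c = (color >> shift) & 0xFFu;
        out |= ((c + ((s * invA + 128u) >> 8)) & 0xFFu) << shift;
    }
    return out;
}
```

In this sketch the alpha 0 and alpha 255 cases fall out of the arithmetic, whereas the patched function handles them up front with `memmove` and `sk_memset32`.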