Rearrange SkRasterPipeline scanline tail handling.

We used to step at a 4-pixel stride as long as possible, then run up to 3 times, one pixel at a time. Now replace those 1-at-a-time runs with a single tail stamp if there are 1-3 remaining pixels.

This style is simply more efficient: e.g. we'll blend and lerp once for 3 pixels instead of 3 times. This should make short blits significantly more efficient. It's also more future-oriented... AVX+ on Intel and SVE on ARM support masked loads and stores, so we can do the entire tail in one direct step.

This also makes it possible to re-arrange the code a bit to encapsulate each stage better. I think generally this code reads more clearly than the old code, but YMMV. I've arranged things so you write one function, but it's compiled into two specializations, one for tail=0 (Body) and one for tail>0 (Tail). It's pretty tidy.

For now I've just burned a register to pass around tail. It's 2 bits now, maybe soon 3 with AVX, and capped at 4 for even the craziest new toys, so there are plenty of places we can pack it if we want to get clever.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2717

Change-Id: I45852a3e5d4c5b5e9315302c46601aee0d32265f
Reviewed-on: https://skia-review.googlesource.com/2717
Reviewed-by: Mike Reed <reed@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
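The body/tail split described in the message can be sketched roughly as follows. This is a simplified illustration, not the actual SkRasterPipeline code: `halve_stage` and `run_scanline` are hypothetical names, plain `float` arrays stand in for the Sk4f registers, and the `bool` template parameter is just one assumed way to get the two specializations (Body for tail=0, Tail for tail>0) from a single function body.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stage, written once and compiled twice.
// Body (kIsTail == false) always processes 4 pixels and ignores `tail`.
// Tail (kIsTail == true) processes only `tail` pixels, where 1 <= tail <= 3.
template <bool kIsTail>
static void halve_stage(float* px, size_t x, size_t tail) {
    if (kIsTail) {
        assert(tail >= 1 && tail <= 3);
    }
    const size_t n = kIsTail ? tail : 4;  // a compile-time 4 in the Body specialization
    for (size_t i = 0; i < n; i++) {
        px[x + i] *= 0.5f;
    }
}

// Hypothetical driver: step at a 4-pixel stride while possible, then handle
// any 1-3 leftover pixels with a single tail call instead of 1-3 single-pixel calls.
// Returns the number of stage invocations so the saving is visible.
static size_t run_scanline(float* px, size_t n) {
    size_t calls = 0;
    size_t x = 0;
    while (n - x >= 4) {
        halve_stage<false>(px, x, 0);   // Body: tail == 0
        x += 4;
        calls++;
    }
    const size_t tail = n - x;          // 0..3 pixels remain
    if (tail > 0) {
        halve_stage<true>(px, x, tail); // Tail: one stamp for all leftovers
        calls++;
    }
    return calls;
}
```

Under these assumptions a 7-pixel run costs two stage invocations (one body, one tail) where the old 1-at-a-time scheme would have needed four.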
author: Mike Klein <mtklein@chromium.org> 2016-09-28 10:43:53 -0400
committer: Skia Commit-Bot <skia-commit-bot@chromium.org> 2016-09-28 15:28:24 +0000
commit: c8dd6bc3e7a4b01c848ba15b808ea6ffdf249b06 (patch)
tree: e58beeb3fd18659cf8ebfa11c11349c17675a46c /src/core/SkXfermode.cpp
parent: b37eb0e968c5082e021244d4baf9b7721e3f360a (diff)
1 files changed, 4 insertions, 4 deletions
diff --git a/src/core/SkXfermode.cpp b/src/core/SkXfermode.cpp
index 3e7b8bc7c2..2717fab7e9 100644
--- a/src/core/SkXfermode.cpp
+++ b/src/core/SkXfermode.cpp
@@ -1437,14 +1437,14 @@ static Sk4f inv(const Sk4f& x) { return 1.0f - x; }
 
 // Most of these modes apply the same logic kernel to each channel.
 template <Sk4f kernel(const Sk4f& s, const Sk4f& sa, const Sk4f& d, const Sk4f& da)>
-static void SK_VECTORCALL rgba(SkRasterPipeline::Stage* st, size_t x,
+static void SK_VECTORCALL rgba(SkRasterPipeline::Stage* st, size_t x, size_t tail,
                                Sk4f  r, Sk4f  g, Sk4f  b, Sk4f  a,
                                Sk4f dr, Sk4f dg, Sk4f db, Sk4f da) {
     r = kernel(r,a,dr,da);
     g = kernel(g,a,dg,da);
     b = kernel(b,a,db,da);
     a = kernel(a,a,da,da);
-    st->next(x, r,g,b,a, dr,dg,db,da);
+    st->next(x,tail, r,g,b,a, dr,dg,db,da);
 }
 
 #define KERNEL(name) static Sk4f name(const Sk4f& s, const Sk4f& sa, const Sk4f& d, const Sk4f& da)
@@ -1468,14 +1468,14 @@ KERNEL(xor_)     { return s*inv(da) + d*inv(sa); }
 // Most of the rest apply the same logic to each color channel, and srcover's logic to alpha.
 // (darken and lighten can actually go either way, but they're a little faster this way.)
 template <Sk4f kernel(const Sk4f& s, const Sk4f& sa, const Sk4f& d, const Sk4f& da)>
-static void SK_VECTORCALL rgb_srcover(SkRasterPipeline::Stage* st, size_t x,
+static void SK_VECTORCALL rgb_srcover(SkRasterPipeline::Stage* st, size_t x, size_t tail,
                                       Sk4f  r, Sk4f  g, Sk4f  b, Sk4f  a,
                                       Sk4f dr, Sk4f dg, Sk4f db, Sk4f da) {
     r = kernel(r,a,dr,da);
     g = kernel(g,a,dg,da);
     b = kernel(b,a,db,da);
     a = a + da*inv(a);
-    st->next(x, r,g,b,a, dr,dg,db,da);
+    st->next(x,tail, r,g,b,a, dr,dg,db,da);
 }
 
 KERNEL(colorburn) {
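
The hunks above only thread `tail` through the blend stages and forward it to `st->next(...)`; the math still runs on all four lanes. A minimal sketch of why that is enough, assuming (as an illustration only) that stages sit in a contiguous array and that only the load and store ends of the chain need to honor `tail`. The `Stage`, `scale`, `store`, and `run_chain` definitions here are hypothetical stand-ins, not the real SkRasterPipeline types, and a single 4-lane `float` array replaces the eight Sk4f registers.

```cpp
#include <cstddef>

struct Stage;
using StageFn = void (*)(const Stage*, size_t x, size_t tail, float v[4]);

// Hypothetical stage record: a function plus its context pointer.
struct Stage {
    StageFn fn;
    void*   ctx;
    // Assumption: the next stage is simply the next array element.
    void next(size_t x, size_t tail, float v[4]) const {
        (this + 1)->fn(this + 1, x, tail, v);
    }
};

// A math-only stage, analogous to rgba<kernel> in the diff: it computes on
// all 4 lanes regardless of tail and just passes tail along.
static void scale(const Stage* st, size_t x, size_t tail, float v[4]) {
    const float k = *static_cast<const float*>(st->ctx);
    for (int i = 0; i < 4; i++) {
        v[i] *= k;  // unused lanes hold don't-care values
    }
    st->next(x, tail, v);
}

// The final stage honors tail: tail == 0 means a full 4-pixel body step,
// tail > 0 means only that many pixels may be written.
static void store(const Stage* st, size_t x, size_t tail, float v[4]) {
    float* dst = static_cast<float*>(st->ctx);
    const size_t n = tail ? tail : 4;
    for (size_t i = 0; i < n; i++) {
        dst[x + i] = v[i];
    }
}

// Load up to 4 pixels from src (again honoring tail) and start the chain.
static void run_chain(const Stage* first, size_t x, size_t tail, const float* src) {
    float v[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    const size_t n = tail ? tail : 4;
    for (size_t i = 0; i < n; i++) {
        v[i] = src[x + i];
    }
    first->fn(first, x, tail, v);
}
```

With a two-stage chain `{scale, store}`, a 7-pixel row would be handled as `run_chain(stages, 0, 0, src)` followed by `run_chain(stages, 4, 3, src)`, and the pixel just past the end of the row is never touched.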