SkRasterPipeline: 8x pipelines

Bench runtime changes:
    sRGB: 7194 -> 3735 = 1.93x faster
    F16:  6531 -> 2559 = 2.55x faster

Instead of building 4x and 1-3x pipelines and then maybe 8x and 1-7x, build either the short ones or the long ones, but not both. If we just take care to use a compatible run_pipeline(), there's some cross-module type disagreement but everything works out in the end.

Oddly, a few places that looked like they'd be faster using SkNx_fma() or Sk4f_round()/Sk8f_round() are actually faster the long way, e.g. multiply, add 0.5, truncate. Curious! In all the other places you see here that I've used SkNx_fma(), it's been a significant speedup.

This folds in a couple refactors and cleanups that I've been meaning to do. Hope you don't mind... I find the new code considerably easier to read than the old code.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2990
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Change-Id: I1c82e5755d8e44cc0b9c6673d04b117f85d71a3a
Reviewed-on: https://skia-review.googlesource.com/2990
Reviewed-by: Matt Sarett <msarett@google.com>
Commit-Queue: Mike Klein <mtklein@chromium.org>
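
For readers unfamiliar with the "multiply, add 0.5, truncate" idiom mentioned in the message, here is a minimal scalar sketch. It assumes the typical use is converting a float in [0, 1] to an 8-bit channel, which the message does not state; both function names are hypothetical and are not Skia APIs.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper, not part of Skia.
// "The long way": scale, add 0.5, then let the float-to-int cast truncate.
// Only valid for x in [0, 1], so the biased value stays non-negative.
static inline uint8_t unorm_to_byte_long_way(float x) {
    return static_cast<uint8_t>(x * 255.0f + 0.5f);
}

// Hypothetical helper, not part of Skia.
// The same conversion using a library rounding call, for comparison.
// std::lrintf rounds according to the current floating-point rounding mode.
static inline uint8_t unorm_to_byte_round(float x) {
    return static_cast<uint8_t>(std::lrintf(x * 255.0f));
}
```

The two can differ on exact .5 ties, because the rounding call follows the current rounding mode while the biased truncation always rounds ties up. Which one is faster depends on the target, as the message notes.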
author: Mike Klein <mtklein@chromium.org> 2016-10-06 15:06:38 -0400
committer: Skia Commit-Bot <skia-commit-bot@chromium.org> 2016-10-07 12:52:29 +0000
commit: 1aebdaee0e2aa4324509fd3ad4c40c21703ae4a2 (patch)
tree: c5ffae6c59217f3d228891177e1d50d7f784801a /src/core/SkHalf.h
parent: 2766cc567d5c939730fadd2d865e4bdf05477263 (diff)
1 files changed, 29 insertions, 0 deletions
diff --git a/src/core/SkHalf.h b/src/core/SkHalf.h
index dd978a2347..e71cb8750a 100644
--- a/src/core/SkHalf.h
+++ b/src/core/SkHalf.h
@@ -11,6 +11,10 @@
 #include "SkNx.h"
 #include "SkTypes.h"
 
+#if !defined(_MSC_VER) && SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_AVX2
+    #include <x86intrin.h>
+#endif
+
 // 16-bit floating point value
 // format is 1 bit sign, 5 bits exponent, 10 bits mantissa
 // only used for storage
@@ -85,4 +89,29 @@ static inline Sk4h SkFloatToHalf_finite_ftz(const Sk4f& fs) {
 #endif
 }
 
+static inline Sk8f SkHalfToFloat_finite_ftz(const Sk8h& hs) {
+#if !defined(SKNX_NO_SIMD) && SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_AVX2
+    return _mm256_cvtph_ps(hs.fVec);
+
+#else
+    uint64_t parts[2];
+    hs.store(parts);
+    return SkNx_join(SkHalfToFloat_finite_ftz(parts[0]),
+                     SkHalfToFloat_finite_ftz(parts[1]));
+
+#endif
+}
+
+static inline Sk8h SkFloatToHalf_finite_ftz(const Sk8f& fs) {
+#if !defined(SKNX_NO_SIMD) && SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_AVX2
+    return _mm256_cvtps_ph(fs.fVec, _MM_FROUND_CUR_DIRECTION);
+
+#else
+    uint64_t parts[2];
+    SkFloatToHalf_finite_ftz(fs.fLo).store(parts+0);
+    SkFloatToHalf_finite_ftz(fs.fHi).store(parts+1);
+    return Sk8h::Load(parts);
+#endif
+}
+
 #endif
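
To illustrate what each lane of the 8-wide conversions above has to compute, the following is a portable scalar sketch of a "finite, flush-to-zero" half/float conversion. It is not the Skia implementation: the function names are hypothetical, inputs are assumed to be finite (no infinity or NaN), and the float-to-half direction assumes the value fits in half range and truncates the mantissa instead of rounding.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not the Skia code.
// Half layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
// Assumes h is finite; half denormals are flushed to signed zero.
static inline float half_to_float_finite_ftz(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000) << 16;
    uint32_t em   = h & 0x7fff;                 // exponent and mantissa
    uint32_t bits = sign;                       // zero or denormal -> signed zero
    if (em >= 0x0400) {
        // Move the mantissa from 10 to 23 bits and rebias the exponent 15 -> 127.
        bits = sign | ((em << 13) + (static_cast<uint32_t>(127 - 15) << 23));
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical helper, not the Skia code.
// Assumes f is finite and its magnitude is at most 65504 (the largest half).
// Values below the smallest normal half (2^-14) are flushed to signed zero.
// The low 13 mantissa bits are truncated, not rounded.
static inline uint16_t float_to_half_finite_ftz(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 16) & 0x8000;
    uint32_t em   = bits & 0x7fffffff;
    if (em < 0x38800000) {                      // |f| < 2^-14
        return static_cast<uint16_t>(sign);
    }
    return static_cast<uint16_t>(sign | ((em >> 13) - (static_cast<uint32_t>(127 - 15) << 10)));
}
```

The AVX2 branch in the diff delegates the same job to `_mm256_cvtph_ps` and `_mm256_cvtps_ph`, which convert eight lanes at once; the hardware path rounds according to the rounding argument, so results may differ from this truncating sketch in the last mantissa bit.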