Skia's New Approach to SIMD
===========================

Most hot software paths in Skia are implemented with processor-specific SIMD instructions. For graphics performance, the parallelism from SIMD is essential: there is simply no realistic way to eke the same performance out of portable C++ code as we can from the SSE family of instruction sets on x86, from NEON on ARM, or from MIPS32's DSP instructions. Depending on the particular code path and the math involved, we see 2, 4, 8, or even ~16x performance increases over portable code when really exploiting processor-specific SIMD instructions.

But the SIMD code we've piled up over the years has some serious problems. It's often quite low-level, with poor factoring leading to verbose, bug-prone, and difficult-to-read code. SIMD intrinsic types and functions take a good long while to get used to reading, let alone writing, and assembly is generally just a complete non-starter. SIMD coverage of Skia methods is not dense: a particular drawing routine might be specialized for NEON but not for SSE, or might have a MIPS DSP implementation but no NEON. Even when we have full instruction set coverage, the implementations of these specialized routines may not produce identical results, either when compared with each other or with our portable fallback code. The SIMD implementations are often simply incorrect, but the code is so fragile and difficult to understand that we can't fix it. There are long-lived bugs in our tracker involving crashes and buffer under- and overflows that we simply cannot fix because no one on the team understands the code involved. And finally, to top it all off, the code isn't always even really that fast.

This all needs to change. I want Skia developers to be able to write correct, clear, and fast code, and in software rendering, SIMD is the only way to get "fast".
This document outlines a new vision for how Skia will use SIMD instructions with no compromises, writing clear code _once_ that runs quickly on all the platforms we support.

The Plan
--------

We're going to wrap low-level platform-specific intrinsics with zero-cost abstractions whose interfaces match Skia's higher-level-but-still-quite-low-level use cases. Skia code will be written against this interface _once_, and will then compile to efficient SSE, NEON, or portable code (MIPS is quite TBD; for now, group it conceptually under portable code) via platform-specific backends. The key here is to find the right sweet spot of abstraction: one that lets us express the graphical concepts we want in Skia while giving each platform-specific backend the flexibility to implement those concepts as efficiently as possible.

While Skia uses a mix of float, 32-bit, 16-bit, and 8-bit integer SIMD instructions, 32-bit integers fall quite behind the rest in usage. Since we tend to operate on 8888 ARGB values, 8-bit SIMD tends to be the most natural and fastest approach, but when multiplication gets involved (essentially all the time), 16-bit SIMD inevitably gets tangled in there. For some operations like division, square roots, or math with high range or precision requirements, we expand our 8-bit pixel components up to floats, and working with a single pixel as a 4-float vector becomes most natural. This plan focuses on how we'll deal with these majority cases: floats, and 8- and 16-bit integers.

`SkNf` for floats
-----------------

Wrapping floats with an API that allows efficient implementation on SSE and NEON is by far the easiest task involved here. Both SSE and NEON naturally work with 128-bit vectors of 4 floats, and they have a near 1-to-1 correspondence between operations. Indeed, the correspondence is so close that it's tempting to solve this problem by picking one set of intrinsics, e.g.
NEON, and just `#define`ing portable and SSE implementations of NEON:

    #define float32x4_t __m128
    #define vmulq_f32   _mm_mul_ps
    #define vaddq_f32   _mm_add_ps
    #define vld1q_f32   _mm_loadu_ps
    #define vst1q_f32   _mm_storeu_ps
    ...

This temptation starts to break down when you notice:

- there are operations that don't quite correspond, e.g. `_mm_movemask_ps`; and
- math written with either SSE or NEON intrinsics is still very hard to read; and
- sometimes we want to work with 4 floats, but sometimes 2, maybe even 8, etc.

So we use a wrapper class `SkNf`, parameterized on N, how many floats the vector contains, constrained at compile time to be a power of 2. `SkNf` provides all the methods you'd expect on a vector of N floats: loading and storing from float arrays, all the usual arithmetic operators, min and max, low- and high-precision reciprocal and sqrt, all the usual comparison operators, and a `.thenElse()` method acting as a non-branching ternary `?:` operator. To support Skia's main graphic needs, `SkNf` can also load and store from a vector of N _bytes_, converting up to float when loading and rounding down to [0,255] when storing.

As a convenience, `SkNf` has two default implementations: `SkNf<1>` performs all these operations on a single float, and the generic `SkNf<N>` simply recurses onto two half-size `SkNf<N/2>`. This allows our different backends to inject specializations where most natural: the portable backend does nothing, so all `SkNf` recurse down to the default `SkNf<1>`; the NEON backend specializes `SkNf<2>` with `float32x2_t` and 64-bit SIMD methods, and `SkNf<4>` with `float32x4_t` and 128-bit SIMD methods; the SSE backend specializes both `SkNf<4>` and `SkNf<2>` to use the full or lower half of an `__m128` vector, respectively. A future AVX backend could simply drop in an `SkNf<8>` specialization.

Our most common float use cases are working with 2D coordinates and with 4-float-component pixels.
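The recursive default described above is the heart of the design, so here is a minimal sketch of the pattern. This is an illustrative stand-in, not Skia's actual `SkNf` header: the `Vec` name and its tiny API are invented, and only one operator is shown.

```cpp
#include <cstddef>

// Sketch of an SkNf-style recursive vector wrapper (illustrative only).
// The generic case holds two half-size vectors; a backend would replace
// e.g. Vec<4> with a specialization holding an __m128 or float32x4_t.
template <int N>
struct Vec {
    Vec<N/2> lo, hi;

    static Vec Load(const float* p) {
        return { Vec<N/2>::Load(p), Vec<N/2>::Load(p + N/2) };
    }
    void store(float* p) const {
        lo.store(p);
        hi.store(p + N/2);
    }
    Vec operator*(const Vec& o) const {
        return { lo * o.lo, hi * o.hi };  // Recurse componentwise.
    }
};

// Base case: one plain float. The portable backend stops here, so
// portable code is scalar C++ the compiler can still auto-vectorize.
template <>
struct Vec<1> {
    float val;
    static Vec Load(const float* p) { return { *p }; }
    void store(float* p) const { *p = val; }
    Vec operator*(const Vec& o) const { return { val * o.val }; }
};
```

Client code only ever sees `Vec<4>::Load`, `operator*`, and `store`; whether that compiles to one SIMD multiply or four scalar ones is entirely the backend's business.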
Since these two use cases are so common, we've made simple typedefs for them, `Sk2f` and `Sk4f`, along with versions reminding you that `SkNf` works with vectors of `SkScalar` (a Skia-specific float typedef) too: `Sk2s` and `Sk4s`.

`SkNf` in practice
------------------

To date we have implemented several parts of Skia using `Sk4f`:

1. `SkColorMatrixFilter`
2. `SkRadialGradient`
3. `SkColorCubeFilter`
4. Three complicated `SkXfermode` subclasses: `ColorBurn`, `ColorDodge`, and `SoftLight`

In all these cases, we have been able to write a single implementation that produces the same results cross-platform. The first three of those `Sk4f` sites are entirely newly vectorized, and run much faster than the previous portable implementations. The three `Sk4f` xfermodes replaced portable, SSE, and NEON implementations which all produced different results, and the `Sk4f` versions are all faster than their predecessors.

`SkColorCubeFilter` stands out as a particularly good example of how and why to use `Sk4f` over custom platform-specific intrinsics. Starting from some portable code and a rather slow SSE-only sketch, a Google Chromium dev, an Intel contributor, and I worked together to write an `Sk4f` version that's more than twice as fast as the original, and runs fast on _both_ x86 and ARM.

`SkPx` for 8- and 16-bit fixed point math
-----------------------------------------

Building an abstraction layer over 8- and 16-bit fixed point math has proven to be quite a challenge. In fixed point, NEON and SSE again have some overlap, and they could probably be implemented in terms of each other if you were willing to sacrifice performance on SSE in favor of NEON, or vice versa. But unlike with floats, where `SkNf` is really a pretty thin veneer over very similar operations, to really get the best performance out of each fixed point instruction set you need to work in rather different idioms.
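One concrete source of those differing idioms is the divide by 255 that renormalizes every 8x8-to-16-bit multiply of color components. In scalar C++ the two common formulations look like this; the function names are ours for illustration, and which variant maps to the cheapest instruction sequence differs between SSE and NEON.

```cpp
#include <cstdint>

// Multiplying two 8-bit components yields a 16-bit intermediate in
// [0, 255*255] that must be scaled back by 255 (not 256) to stay
// normalized. This is exact rounded division by 255, using only
// adds and shifts (a well-known bit trick):
static uint8_t div255_exact(uint16_t x) {
    x += 128;                    // round to nearest
    return (x + (x >> 8)) >> 8;  // == round(x / 255.0) for x in [0, 255*255]
}

// A cheaper approximation some SIMD paths prefer: one add, one shift.
// It is off by at most 1 from the exact result over the same range.
static uint8_t div255_approx(uint16_t x) {
    return (x + 255) >> 8;
}
```

Both agree at the endpoints (0 stays 0, 255*255 maps to 255), which is what keeps opaque pixels opaque and transparent pixels transparent; the approximation trades a tiny bias in between for speed.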
`SkPx`, our latest approach to 8- and 16-bit SIMD (there have been alpha `Sk16b` and beta `Sk4px` predecessors), tries to abstract over those idioms so that Skia developers can again write one piece of clear graphics code that different backends can translate into their native intrinsics idiomatically.

`SkPx` is really a family of three related types:

1. `SkPx` itself represents between 1 and `SkPx::N` 8888 ARGB pixels, where `SkPx::N` is a backend-specific compile-time power of 2.
2. `SkPx::Wide` represents those same pixels, but with 16 bits of space per component.
3. `SkPx::Alpha` represents the alpha channels of those same pixels.

`SkPx`, `Wide`, and `Alpha` create a somewhat complicated algebra of operations, entirely motivated by the graphical operations we need to perform. Here are some examples:

    SkPx::LoadN(const uint32_t*)   -> SkPx   // Load a full cruising-speed SkPx.
    SkPx::Load(n, const uint32_t*) -> SkPx   // For the 0<n<N tail.

    SkPx + SkPx -> SkPx
    SkPx - SkPx -> SkPx
    SkPx.saturatedAdd(SkPx) -> SkPx

    SkPx.alpha() -> Alpha                    // Extract alpha channels.
    Alpha::LoadN(const uint8_t*)   -> Alpha  // Like SkPx loads, in 8-bit steps.
    Alpha::Load(n, const uint8_t*) -> Alpha

    SkPx.widenLo()   -> Wide   // argb -> 0a0r0g0b
    SkPx.widenHi()   -> Wide   // argb -> a0r0g0b0
    SkPx.widenLoHi() -> Wide   // argb -> aarrggbb

    Wide + Wide  -> Wide
    Wide - Wide  -> Wide
    Wide << bits -> Wide
    Wide >> bits -> Wide

    SkPx * Alpha  -> Wide   // 8 x 8  -> 16 bit
    Wide.div255() -> SkPx   // 16 bit -> 8 bit

    // A faster approximation of (SkPx * Alpha).div255().
    SkPx.approxMulDiv255(Alpha) -> SkPx

We allow each `SkPx` backend to choose how it physically represents `SkPx`, `SkPx::Wide`, and `SkPx::Alpha`, and to choose any power of 2 as its `SkPx::N` sweet spot. Code working with `SkPx` typically runs a loop like this:

    while (n >= SkPx::N) {
        // Apply some_function() to SkPx::N pixels.
        some_function(SkPx::LoadN(src), SkPx::LoadN(dst)).storeN(dst);
        src += SkPx::N;
        dst += SkPx::N;
        n   -= SkPx::N;
    }
    if (n > 0) {
        // Finish up the tail of 0<n<N pixels.
        some_function(SkPx::Load(n, src), SkPx::Load(n, dst)).store(n, dst);
    }
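To make the shape of such client code concrete, here is a toy single-pixel "portable backend" plus a classic source-over blend written against this style of algebra. It is an illustrative sketch, not Skia's actual `SkPx`: the names mimic the listing above, `N` is fixed at 1, and the fused multiply-and-divide uses the exact divide for clarity rather than the faster approximation.

```cpp
#include <algorithm>
#include <cstdint>

namespace toy {

static uint8_t div255(uint16_t x) {  // exact round(x / 255.0)
    x += 128;
    return (x + (x >> 8)) >> 8;
}

struct Alpha { uint8_t a; };

struct Px {
    uint8_t a, r, g, b;  // one 8888 ARGB pixel (this backend's N == 1)

    static Px Load(const uint32_t* p) {
        return { uint8_t(*p >> 24), uint8_t(*p >> 16),
                 uint8_t(*p >>  8), uint8_t(*p) };
    }
    void store(uint32_t* p) const {
        *p = uint32_t(a) << 24 | uint32_t(r) << 16 | uint32_t(g) << 8 | b;
    }

    Alpha alpha() const { return { a }; }  // Extract the alpha channel.

    // Fused (Px * Alpha).div255(), playing the role of approxMulDiv255().
    Px mulDiv255(Alpha al) const {
        return { div255(uint16_t(a * al.a)), div255(uint16_t(r * al.a)),
                 div255(uint16_t(g * al.a)), div255(uint16_t(b * al.a)) };
    }

    Px saturatedAdd(Px o) const {
        auto sat = [](int x) { return uint8_t(std::min(x, 255)); };
        return { sat(a + o.a), sat(r + o.r), sat(g + o.g), sat(b + o.b) };
    }
};

// Source-over blending for premultiplied pixels, written once against
// the algebra:  dst' = src + dst * (255 - src.alpha) / 255
static Px srcover(Px s, Px d) {
    return s.saturatedAdd(d.mulDiv255(Alpha{ uint8_t(255 - s.alpha().a) }));
}

}  // namespace toy
```

The point of the design is that `srcover()` would look exactly like this no matter which backend is active; only `Load`/`store`, the component widths, and `N` change underneath it.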