skia - 2D graphics library

	Commit message (Collapse)	Author	Age
*	Sk4px: Difference and Exclusion	mtklein	2015-05-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This will cause minor (off-by-one) diffs due to a little lost precision: colortype_xfermodes mixed_xfermodes xfermodes2 xfermodeimagefilter xfermodes3 xfermodes Desktop: Xfermode_Difference_aa 9.77ms -> 7.32ms 0.75x Xfermode_Exclusion_aa 8.49ms -> 6.21ms 0.73x Xfermode_Difference 17ms -> 7.54ms 0.44x Xfermode_Exclusion 13.5ms -> 5.09ms 0.38x N7: Xfermode_Difference_aa 32.2ms -> 27.6ms 0.86x Xfermode_Difference 43.9ms -> 32ms 0.73x Xfermode_Exclusion_aa 40.5ms -> 26.7ms 0.66x Xfermode_Exclusion 71.5ms -> 23.9ms 0.33x This wraps up the xfermodes implemented in Sk4f. BUG=skia: Review URL: https://codereview.chromium.org/1141213002
*	add Min to SkNi, specialized for u8 and u16 on SSE and NEON	mtklein	2015-05-14
\| \| \| \| \| \| \| \| \| \| \|	0x8001 / 0x7fff don't seem to work, but we were close: 0x8000 does. I plan to use this to implement the Difference xfermode, and it seems generally handy. BUG=skia: Review URL: https://codereview.chromium.org/1133933004
*	Sk4px: alphas() and Load[24]Alphas()	mtklein	2015-05-13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	alphas() extracts the 4 alphas from an existing Sk4px as another Sk4px. LoadNAlphas() constructs an Sk4px from N packed alphas. In both cases, we end up with 4x repeated alphas aligned with their pixels. alphas() A0 R0 G0 B0 A1 R1 G1 B1 A2 R2 G2 B2 A3 R3 G3 B3 -> A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 Load4Alphas() A0 A1 A2 A3 -> A0 A0 A0 A0 A1 A1 A1 A1 A2 A2 A2 A2 A3 A3 A3 A3 Load2Alphas() A0 A1 -> A0 A0 A0 A0 A1 A1 A1 A1 0 0 0 0 0 0 0 0 This is a 5-10% speedup for AA on Intel, and wash on ARM. AA is still mostly dominated by the final lerp. alphas() isn't used yet, but it's similar enough to Load[24]Alphas() that it was easier to write all at once. BUG=skia: Review URL: https://codereview.chromium.org/1138333003
*	Turn on Sk4px xfermodes when we have NEON too.	mtklein	2015-05-13
\| \| \| \| \| \| \| \| \| \| \|	For SSE, Sk4px is better than Sk4f is better than SkXfermodes_opts_SSE2 (where implemented). For NEON, Sk4px is better than SkXfermodes_opts_arm_neon is better than Sk4f (where implemented). This is a 1.6-1.9x speedup for Plus,Modulate, and Screen for NEON. BUG=skia: Review URL: https://codereview.chromium.org/1128053004
*	Plus xfermode using Sk4px.	mtklein	2015-05-12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Xfermode_Plus runs 4-5x faster. We expect mixed_xfermodes to have a small diff. This is because kFoldCoverageIntoSrcAlpha was incorrectly set to true. This implementation handily beats the Sk4f impl, the portable impl, and the existing SSE2 impl. Reading the SkXfermodes_opts_SSE2.cpp file, I'm pretty confident that we'll be able to beat all SSE2 impls. I believe this impl will beat or match the existing NEON impl too, but that may not be true for more complicated xfermodes. They can take advantage of transposing ARGBARGB... to AAAARRRR.... cheaply and I haven't figured out an abstraction for that yet that doesn't screw SSE. Adds: - MapDstSrc() to Sk4px - saturatedAdd() to SkNi (only implemented as far as it's used). - div255Narrow() BUG=skia: Review URL: https://codereview.chromium.org/1138893002
*	Sk4px	mtklein	2015-05-12
\| \| \| \| \| \| \| \| \| \|	Xfermode_SrcOver: SSE: 2.08ms -> 2.03ms (~2% faster) NEON: my N5 is noisy, but there appears to be no perf change BUG=skia: Review URL: https://codereview.chromium.org/1132273004
*	We don't use boxBlurY.	mtklein	2015-05-07
\| \| \| \| \| \| \| \|	Also noticed nobody sets SK_DISABLE_BLUR_DIVISION_OPTIMIZATION. BUG=skia: Review URL: https://codereview.chromium.org/1134513003
*	Really use SSE4 (and SSSE3) in SkBlurImage_SSE4	mtklein	2015-05-06
\| \| \| \| \| \| \| \| \| \| \| \|	We don't seem to be making good use of the available instruction set. SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and SSSE3 gave us an easy way to do the reverse. This should be bit-perfect and about a 10% speedup. BUG=skia: Review URL: https://codereview.chromium.org/1123263003
*	De-proc Color32	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed. Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it. Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together. BUG=skia: Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b Committed: https://skia.googlesource.com/skia/+/d65dc0cedd5b50dd407b6ff8fdc39123f11511cc CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Mips-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1104183004
*	Revert of De-proc Color32 (patchset #5 id:80001 of ↵	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://codereview.chromium.org/1104183004/) Reason for revert: duh Original issue's description: > De-proc Color32 > > Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, > which is no longer needed. > > Seems handy to have SkTypes include the relevant intrinsics when > we know we've got them, but I'm not married to it. > > Locally this looks like a pointlessly small perf win, but I'm mostly > keen to get all the code together. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b > > Committed: https://skia.googlesource.com/skia/+/d65dc0cedd5b50dd407b6ff8fdc39123f11511cc TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1102363006
*	De-proc Color32	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed. Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it. Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together. BUG=skia: Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b Review URL: https://codereview.chromium.org/1104183004
*	Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARM	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \|	This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. BUG=skia: Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297 CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1109913002
*	Revert of De-proc Color32 (patchset #4 id:60001 of ↵	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://codereview.chromium.org/1104183004/) Reason for revert: MIPS Original issue's description: > De-proc Color32 > > Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, > which is no longer needed. > > Seems handy to have SkTypes include the relevant intrinsics when > we know we've got them, but I'm not married to it. > > Locally this looks like a pointlessly small perf win, but I'm mostly > keen to get all the code together. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1108163002
*	De-proc Color32	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also strips SK_SUPPORT_LEGACY_COLOR32_MATH, which is no longer needed. Seems handy to have SkTypes include the relevant intrinsics when we know we've got them, but I'm not married to it. Locally this looks like a pointlessly small perf win, but I'm mostly keen to get all the code together. BUG=skia: Review URL: https://codereview.chromium.org/1104183004
*	Revert of Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision ↵	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	on ARM (patchset #2 id:20001 of https://codereview.chromium.org/1109913002/) Reason for revert: arm64 typos Original issue's description: > Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARM > > This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1105233003
*	Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARM	mtklein	2015-04-27
\| \| \| \| \| \| \| \|	This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after. BUG=skia: Review URL: https://codereview.chromium.org/1109913002
*	Mike's radial gradient CL with better float -> int.	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) This looks quite launchable. radial_gradient3, min of 100 samples: N5: 985µs -> 946µs MBP: 395µs -> 279µs On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot,Build-Ubuntu-GCC-Arm7-Release-Android_NoNeon-Trybot Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec Review URL: https://codereview.chromium.org/1109643002
*	Revert of Mike's radial gradient CL with better float -> int. (patchset #7 ↵	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	id:120001 of https://codereview.chromium.org/1109643002/) Reason for revert: compile failures. Original issue's description: > Mike's radial gradient CL with better float -> int. > > patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) > > This looks quite launchable. radial_gradient3, min of 100 samples: > N5: 985µs -> 946µs > MBP: 395µs -> 279µs > > On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? > > BUG=skia: > > CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot > > Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec TBR=reed@google.com,robertphillips@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1109883003
*	Mike's radial gradient CL with better float -> int.	mtklein	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001) This looks quite launchable. radial_gradient3, min of 100 samples: N5: 985µs -> 946µs MBP: 395µs -> 279µs On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table? BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot Review URL: https://codereview.chromium.org/1109643002
*	Update more directories under src/ to follow C++11 style rule for ↵	tfarina	2015-04-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	{virtual,override}. The Google style guide states that only one of {virtual,override,final} should be used for each declaration, since override implies virtual and final implies both virtual and override. The entries were found using the following command line: $ find src/ -iname ".h" -o -iname ".cpp" \| xargs pcregrep -M "[^\n/]+virtual\ [^;{]+\ [a-zA-Z0-9_]+\([^;{]+\ override[ \n]*[;{]" The regex was a courtesy of nick@chromium.org BUG=None R=mtklein@google.com NOPRESUBMIT=true Review URL: https://codereview.chromium.org/1086143003
*	Revert of Convert Color32 code to perfect blend. (patchset #6 id:100001 of ↵	mtklein	2015-04-21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://codereview.chromium.org/1098913002/) Reason for revert: Xfermode_SrcOver not looking encouraging. Up to 50% regressions. https://perf.skia.org/#3242 Original issue's description: > Convert Color32 code to perfect blend. > > Before we commit to blend_256_round_alt, let's make sure blend_perfect is > really slower in practice (i.e. regresses on perf.skia.org). > > blend_perfect is really the most desirable algorithm if we can afford it. Not > only is it correct, but it's easy to think about and break into correct pieces: > for instance, its div255() doesn't require any coordination with the multiply. > > This looks like a 30% hit according to microbenches. That said, microbenches > said my previous change would be a 20-25% perf improvement, but it didn't end > up showing a significant effect at a high level. > > As for correctness, I see a bunch of off-by-1 compared to blend_256_round_alt > (exactly what we'd expect), and one off-by-3 in a GM that looks like it has a > bunch of overdraw. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/61221e7f87a99765b0e034020e06bb018e2a08c2 TBR=reed@google.com,fmalita@chromium.org,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1083923006
*	Convert Color32 code to perfect blend.	mtklein	2015-04-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before we commit to blend_256_round_alt, let's make sure blend_perfect is really slower in practice (i.e. regresses on perf.skia.org). blend_perfect is really the most desirable algorithm if we can afford it. Not only is it correct, but it's easy to think about and break into correct pieces: for instance, its div255() doesn't require any coordination with the multiply. This looks like a 30% hit according to microbenches. That said, microbenches said my previous change would be a 20-25% perf improvement, but it didn't end up showing a significant effect at a high level. As for correctness, I see a bunch of off-by-1 compared to blend_256_round_alt (exactly what we'd expect), and one off-by-3 in a GM that looks like it has a bunch of overdraw. BUG=skia: Review URL: https://codereview.chromium.org/1098913002
*	Rework SSE and NEON Color32 algorithms to be more correct and faster.	mtklein	2015-04-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This algorithm changes the blend math, guarded by SK_LEGACY_COLOR32_MATH. The new math is more correct: it's never off by more than 1, and correct in all the interesting 0x00 and 0xFF edge cases, where the old math was never off by more than 2, and not always correct on the edges. If you look at tests/BlendTest.cpp, the old code was using the `blend_256_plus1_trunc` algorithm, while the new code uses `blend_256_round_alt`. Neither uses `blend_perfect`, which is about ~35% slower than `blend_256_round_alt`. This will require an unfathomable number of rebaselines, first to Skia, then to Blink when I remove the guard. I plan to follow up with some integer SIMD abstractions that can unify these two implementations into a single algorithm. This was originally what I was working on here, but the correctness gains seem to be quite compelling. The only places these two algorithms really differ greatly now is the kernel function, and even there they can really both be expressed abstractly as: - multiply 8-bits and 8-bits producing 16-bits - add 16-bits to 16-bits, returning the top 8 bits. All the constants are the same, except SSE is a little faster to keep 8 16-bit inverse alphas, NEON's a little faster to keep 8 8-bit inverse alphas. I may need to take this small speed win back to unify the two. We should expect a ~25% speedup on Intel (mostly from unrolling to 8 pixels) and a ~20% speedup on ARM (mostly from using vaddhn to add `color`, round, and narrow back down to 8-bit all into one instruction. (I am probably missing several more related bugs here.) BUG=skia:3738,skia:420,chromium:111470 Review URL: https://codereview.chromium.org/1092433002
*	Sk4h and Sk8h for SSE	mtklein	2015-04-14
\| \| \| \| \| \| \| \| \| \|	These will underly the SkPMFloat-like class for uint16_t components. Sk4h will back a single-pixel version, and Sk8h any larger number than that. BUG=skia: Review URL: https://codereview.chromium.org/1088883005
*	Rename SkNi to SkNb.	mtklein	2015-04-14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As used today, SkNi is used in bool-y contexts. This keeps that, but under a new name, SkNb. This makes room for a new SkNi that's focused on integer-y things like loads, stores, arithmetic, etc. The main reason to split these is that we want different specializations for each use case: for bools, it's important for us to specialize 32- and 64-bit to support efficient float- and double- comparisons, but for integer work we're more likely to be looking at 8- and 16- bit lanes. Keeping these use cases siloed helps me manage the compexity of the backend NEON and SSE code. BUG=skia: Review URL: https://codereview.chromium.org/1083123002
*	Replace NEON assembly memset16 and memset32 with intrinsic versions.	mtklein	2015-04-10
\| \| \| \| \| \| \| \| \| \| \|	According to bench/MemsetBench.cpp, I've got them somewhere between 10% slower and a percent or two faster than the old assembly. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot Review URL: https://codereview.chromium.org/1075003002
*	Remove ARM assembly memsets.	mtklein	2015-04-09
\| \| \| \| \| \| \| \| \| \| \|	Step 1 of a zillion in the quest for NEON on iOS, and step 1 of a different zillion in the Great Assembly Purge. ios, arm, arm64, arm_v7, arm_v7_neon all build. BUG=skia: Review URL: https://codereview.chromium.org/1072063002
*	Code's more readable when SkPMFloat is an Sk4f.	mtklein	2015-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \|	#floats BUG=skia: BUG=skia:3592 Committed: https://skia.googlesource.com/skia/+/6b5dab889579f1cc9e1b5278f4ecdc4c63fe78c9 CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot Review URL: https://codereview.chromium.org/1061603002
*	Revert of Code's more readable when SkPMFloat is an Sk4f. (patchset #3 ↵	mtklein	2015-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	id:40001 of https://codereview.chromium.org/1061603002/) Reason for revert: missed some neon code Original issue's description: > Code's more readable when SkPMFloat is an Sk4f. > #floats > > BUG=skia: > BUG=skia:3592 > > Committed: https://skia.googlesource.com/skia/+/6b5dab889579f1cc9e1b5278f4ecdc4c63fe78c9 TBR=reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1056143004
*	Code's more readable when SkPMFloat is an Sk4f.	mtklein	2015-04-03
\| \| \| \| \| \| \| \| \|	#floats BUG=skia: BUG=skia:3592 Review URL: https://codereview.chromium.org/1061603002
*	New names for SkPMFloat methods.	mtklein	2015-04-03
\| \| \| \| \| \|	BUG=skia: Review URL: https://codereview.chromium.org/1055123002
*	Use switch operator[](int) to kth<int>() so we can use vget_lane.	mtklein	2015-04-03
\| \| \| \| \| \| \| \| \|	#floats BUG=skia: BUG=skia:3592 Review URL: https://codereview.chromium.org/1059743002
*	I suspect S32A_D565_Opaque_neon for Daisy problems.	Mike Klein	2015-04-02
\| \| \| \| \| \| \| \| \| \| \|	I don't see any color-order handling logic in the 32-bit code. BUG=skia:1843 CQ_EXCLUDE_TRYBOTS=client.skia.compile:Build-Win-MSVC-x86-Debug-Trybot,Build-Win-MSVC-x86_64-Debug-Trybot R=mtklein@google.com Review URL: https://codereview.chromium.org/1051683003
*	SkPMFloat: fewer internal this->isValid() assertions.	mtklein	2015-04-02
\| \| \| \| \| \| \| \| \| \| \| \| \|	Each of these conversion functions now only asserts is output is valid. For SkPMColor -> SkPMFloat, we assert isValid(). For SkPMFloat -> SkPMColor, we SkPMColorAssert. #floats BUG=skia: BUG=skia:3592 Review URL: https://codereview.chromium.org/1055093002
*	back to Sk4f for SkPMColor	mtklein	2015-03-31
\| \| \| \| \| \| \| \| \|	#floats BUG=skia: BUG=skia:3592 Review URL: https://codereview.chromium.org/1047823002
*	Refactor Sk2x<T> + Sk4x<T> into SkNf<N,T> and SkNi<N,T>	mtklein	2015-03-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The primary feature this delivers is SkNf and SkNd for arbitrary power-of-two N. Non-specialized types or types larger than 128 bits should now Just Work (and we can drop in a specialization to make them faster). Sk4s is now just a typedef for SkNf<4, SkScalar>; Sk4d is SkNf<4, double>, Sk2f SkNf<2, float>, etc. This also makes implementing new specializations easier and more encapsulated. We're now using template specialization, which means the specialized versions don't have to leak out so much from SkNx_sse.h and SkNx_neon.h. This design leaves us room to grow up, e.g to SkNf<8, SkScalar> == Sk8s, and to grown down too, to things like SkNi<8, uint16_t> == Sk8h. To simplify things, I've stripped away most APIs (swizzles, casts, reinterpret_casts) that no one's using yet. I will happily add them back if they seem useful. You shouldn't feel bad about using any of the typedef Sk4s, Sk4f, Sk4d, Sk2s, Sk2f, Sk2d, Sk4i, etc. Here's how you should feel: - Sk4f, Sk4s, Sk2d: feel awesome - Sk2f, Sk2s, Sk4d: feel pretty good No public API changes. TBR=reed@google.com BUG=skia:3592 Review URL: https://codereview.chromium.org/1048593002
*	SkPMFloat::trunc()	mtklein	2015-03-26
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add and test trunc(), which is what get() used to be before rounding. Using trunc() is a ~40% speedup on our linear gradient bench. #neon #floats BUG=skia:3592 #n5 #n9 CQ_INCLUDE_TRYBOTS=client.skia.android:Test-Android-Nexus5-Adreno330-Arm7-Debug-Trybot;client.skia.android:Test-Android-Nexus9-TegraK1-Arm64-Release-Trybot Review URL: https://codereview.chromium.org/1032243002
*	C++11 override should now be supported by all of {bots,Chrome,Android,Mozilla}	mtklein	2015-03-25
\| \| \| \| \| \| \| \| \|	NOPRESUBMIT=true BUG=skia: DOCS_PREVIEW= https://skia.org/?cl=1037793002 Review URL: https://codereview.chromium.org/1037793002
*	Update 4-at-a-time APIs.	mtklein	2015-03-25
\| \| \| \| \| \| \| \| \| \| \|	There is no reason to require the 4 SkPMFloats (registers) to be adjacent. The only potential win in loads and stores comes from the SkPMColors being adjacent. Makes no difference to existing bench. BUG=skia: Review URL: https://codereview.chromium.org/1035583002
*	aacc + bbdd	mtklein	2015-03-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	SkMatrix::mapPts() using aacc/bbdd was always worse than using badc(): - On Intel, it was faster than exisiting swizzle, but badc() is 10% faster still (one pshufd instead of two). - On ARM, existing swizzle < badc() < aacc()+bbdd(), even though aacc() then bbdd() is really a single vtrn instruction. I will revert SkMatrix.cpp before submitting. Just thought you might like to look. Will think more and try to gear up Instruments on ARM. BUG=skia: Review URL: https://codereview.chromium.org/1012573003
*	Start fresh on swizzles	mtklein	2015-03-23
\| \| \| \| \| \| \| \| \|	This removes all the existing Sk4x swizzles and adds badc(), which is both fast on all implementations and currently useful. BUG=skia: Review URL: https://codereview.chromium.org/997353005
*	Replace _mm_cvtps_epi32(x) with _mm_cvttps_epi32(_mm_add_ps(0.5f), x).	mtklein	2015-03-23
\| \| \| \| \| \| \| \| \| \| \| \| \|	We don't have control over which way _mm_cvtps_epi32 rounds. - This makes the SSE SkPMFloat rounding consistent with _neon and _none. - Sk4f::cast<Sk4i>() is closer to (int)float's behavior. (Correct when >=0). Add tests that would fail at head. BUG=skia: Review URL: https://codereview.chromium.org/1029163002
*	Sk2x::invert() and Sk2x::approxInvert()	mtklein	2015-03-20
\| \| \| \| \| \|	BUG=skia: Review URL: https://codereview.chromium.org/1024993002
*	Add divide to Sk2x, use native vdiv and vsqrt on ARM 64.	mtklein	2015-03-20
\| \| \| \| \| \| \| \| \| \|	Tests pass on N7 + N9. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.7-Clang-Arm7-Debug-iOS-Trybot,Build-Ubuntu-GCC-Arm64-Release-Android-Trybot Review URL: https://codereview.chromium.org/1027753003
*	Specialize Sk2d for ARM64	mtklein	2015-03-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The implementation is nearly identical to Sk2f, with these changes: - float32x2_t -> float64x2_t - vfoo -> vfooq - one extra Newton's method step in sqrt(). Also, generally fix NEON detection to be defined(SK_ARM_HAS_NEON). SK_ARM_HAS_NEON is not being set on ARM64 bots right now (nor does the compiler seem to set __ARM_NEON__), so this CL fixes everything up. BUG=skia: Committed: https://skia.googlesource.com/skia/+/e57b5cab261a243dcbefa74c91c896c28959bf09 CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.7-Clang-Arm7-Debug-iOS-Trybot,Build-Ubuntu-GCC-Arm64-Release-Android-Trybot Review URL: https://codereview.chromium.org/1020963002
*	Replace SSE optimization of Color32A_D565	henrik.smiding	2015-03-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adds an SSE2 version of the Color32A_D565 function, to replace the existing SSE4 version. Also does some minor cleanup. Performance improvement in the following Skia benchmarks. Measured on Atom Silvermont: Xfermode_SrcOver - x3 luma_colorfilter_large - x4.6 luma_colorfilter_small - x2 tablebench - ~15% chart_bw - ~10% Measured on Corei7 Haswell: luma_colorfilter_large running SSE2 - x2 luma_colorfilter_large running SSE4 - x2.3 Also improves performance in WPS Office application and 2D subtest of 0xbenchmark on Android. Signed-off-by: Henrik Smiding <henrik.smiding@intel.com> Review URL: https://codereview.chromium.org/923523002
*	Revert of Specialize Sk2d for ARM64 (patchset #3 id:40001 of ↵	mtklein	2015-03-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	https://codereview.chromium.org/1020963002/) Reason for revert: https://uberchromegw.corp.google.com/i/client.skia.compile/builders/Build-Mac10.7-Clang-Arm7-Debug-iOS/builds/2441/steps/build%20most/logs/stdio https://uberchromegw.corp.google.com/i/client.skia.compile/builders/Build-Mac10.7-Clang-Arm7-Release-iOS/builds/2424/steps/build%20most/logs/stdio https://uberchromegw.corp.google.com/i/client.skia.compile/builders/Build-Ubuntu-GCC-Arm64-Release-Android/builds/8/steps/build%20most/logs/stdio Original issue's description: > Specialize Sk2d for ARM64 > > The implementation is nearly identical to Sk2f, with these changes: > - float32x2_t -> float64x2_t > - vfoo -> vfooq > - one extra Newton's method step in sqrt(). > > Also, generally fix NEON detection to be defined(SK_ARM_HAS_NEON). > SK_ARM_HAS_NEON is not being set on ARM64 bots right now (nor does the compiler > seem to set __ARM_NEON__), so this CL fixes everything up. > > BUG=skia: > > Committed: https://skia.googlesource.com/skia/+/e57b5cab261a243dcbefa74c91c896c28959bf09 TBR=msarett@google.com,reed@google.com,mtklein@chromium.org NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true BUG=skia: Review URL: https://codereview.chromium.org/1028523003
*	Specialize Sk2d for ARM64	mtklein	2015-03-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The implementation is nearly identical to Sk2f, with these changes: - float32x2_t -> float64x2_t - vfoo -> vfooq - one extra Newton's method step in sqrt(). Also, generally fix NEON detection to be defined(SK_ARM_HAS_NEON). SK_ARM_HAS_NEON is not being set on ARM64 bots right now (nor does the compiler seem to set __ARM_NEON__), so this CL fixes everything up. BUG=skia: Review URL: https://codereview.chromium.org/1020963002
*	Sk2x for NEON	mtklein	2015-03-19
\| \| \| \| \| \| \| \| \| \| \|	Also decreases the precision of Sk4f::rsqrt() for speed, keeping Sk4f::sqrt() the same: instead of doing two estimation steps in rsqrt(), do one there and one more in sqrt(). Tests pass on my Nexus 7. float64x2_t is still a TODO for when I get a hold of a Nexus 9. BUG=skia: Review URL: https://codereview.chromium.org/1018423003
*	Sk2x	mtklein	2015-03-19
\| \| \| \| \| \| \| \|	This adds an API, an SSE impl, a portable impl, and some tests for Sk2f/Sk2d/Sk2s. BUG=skia: Review URL: https://codereview.chromium.org/1025463002