| Commit message (Collapse) | Author | Age |
... | |
| |
https://codereview.chromium.org/1104183004/)
Reason for revert:
duh
Original issue's description:
> De-proc Color32
>
> Also strips SK_SUPPORT_LEGACY_COLOR32_MATH,
> which is no longer needed.
>
> Seems handy to have SkTypes include the relevant intrinsics when
> we know we've got them, but I'm not married to it.
>
> Locally this looks like a pointlessly small perf win, but I'm mostly
> keen to get all the code together.
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b
>
> Committed: https://skia.googlesource.com/skia/+/d65dc0cedd5b50dd407b6ff8fdc39123f11511cc
TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1102363006
| |
Also strips SK_SUPPORT_LEGACY_COLOR32_MATH,
which is no longer needed.
Seems handy to have SkTypes include the relevant intrinsics when
we know we've got them, but I'm not married to it.
Locally this looks like a pointlessly small perf win, but I'm mostly
keen to get all the code together.
BUG=skia:
Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b
Review URL: https://codereview.chromium.org/1104183004
| |
This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after.
BUG=skia:
Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297
CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot
Review URL: https://codereview.chromium.org/1109913002
| |
https://codereview.chromium.org/1104183004/)
Reason for revert:
MIPS
Original issue's description:
> De-proc Color32
>
> Also strips SK_SUPPORT_LEGACY_COLOR32_MATH,
> which is no longer needed.
>
> Seems handy to have SkTypes include the relevant intrinsics when
> we know we've got them, but I'm not married to it.
>
> Locally this looks like a pointlessly small perf win, but I'm mostly
> keen to get all the code together.
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/376e9bc206b69d9190f38dfebb132a8769bbd72b
TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1108163002
| |
Also strips SK_SUPPORT_LEGACY_COLOR32_MATH,
which is no longer needed.
Seems handy to have SkTypes include the relevant intrinsics when
we know we've got them, but I'm not married to it.
Locally this looks like a pointlessly small perf win, but I'm mostly
keen to get all the code together.
BUG=skia:
Review URL: https://codereview.chromium.org/1104183004
| |
on ARM (patchset #2 id:20001 of https://codereview.chromium.org/1109913002/)
Reason for revert:
arm64 typos
Original issue's description:
> Split rsqrt into rsqrt{0,1,2}, with increasing cost and precision on ARM
>
> This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after.
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/9de16283fdc8cc0d31a84f503578d0ecea4e8297
TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1105233003
| |
This is a logical no-op. Everything was using the equivalent of rsqrt1() before, and is now after.
BUG=skia:
Review URL: https://codereview.chromium.org/1109913002
| |
patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001)
This looks quite launchable. radial_gradient3, min of 100 samples:
N5: 985µs -> 946µs
MBP: 395µs -> 279µs
On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table?
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot,Build-Ubuntu-GCC-Arm7-Release-Android_NoNeon-Trybot
Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec
Review URL: https://codereview.chromium.org/1109643002
| |
id:120001 of https://codereview.chromium.org/1109643002/)
Reason for revert:
compile failures.
Original issue's description:
> Mike's radial gradient CL with better float -> int.
>
> patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001)
>
> This looks quite launchable. radial_gradient3, min of 100 samples:
> N5: 985µs -> 946µs
> MBP: 395µs -> 279µs
>
> On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table?
>
> BUG=skia:
>
> CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/abf6c5cf95e921fae59efb487480e5b5081cf0ec
TBR=reed@google.com,robertphillips@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1109883003
| |
patch from issue 1072303005 at patchset 40001 (http://crrev.com/1072303005#ps40001)
This looks quite launchable. radial_gradient3, min of 100 samples:
N5: 985µs -> 946µs
MBP: 395µs -> 279µs
On my MBP, most of the meat looks like it's now in reading the cache and writing to dst one color at a time. Is that something we could do in float math rather than with a lookup table?
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot
Review URL: https://codereview.chromium.org/1109643002
| |
{virtual,override}.
The Google style guide states that only one of {virtual,override,final}
should be used for each declaration, since override implies virtual and
final implies both virtual and override.
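As a small illustration of the rule, here is a sketch with hypothetical classes (Shape, Polygon and Triangle are made up for this example and are not Skia types). Each declaration carries exactly one of the three keywords.

```cpp
#include <type_traits>

// Hypothetical hierarchy, only to show one keyword per declaration.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;            // introduces the function: virtual only
};

struct Polygon : Shape {
    int sides() const override { return 5; }  // overrides it: override only
};

struct Triangle final : Polygon {
    int sides() const final { return 3; }     // last override: final only
};

// override and final already imply virtual, so nothing is lost by dropping it.
static_assert(std::is_polymorphic<Triangle>::value, "still a virtual hierarchy");
```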
The entries were found using the following command line:
$ find src/ -iname "*.h" -o -iname "*.cpp" | xargs pcregrep -M
"[^\n/]+virtual\ [^;{]+\ [a-zA-Z0-9_]+\([^;{]+\ override[ \n]*[;{]"
The regex was a courtesy of nick@chromium.org
BUG=None
R=mtklein@google.com
NOPRESUBMIT=true
Review URL: https://codereview.chromium.org/1086143003
| |
https://codereview.chromium.org/1098913002/)
Reason for revert:
Xfermode_SrcOver not looking encouraging. Up to 50% regressions.
https://perf.skia.org/#3242
Original issue's description:
> Convert Color32 code to perfect blend.
>
> Before we commit to blend_256_round_alt, let's make sure blend_perfect is
> really slower in practice (i.e. regresses on perf.skia.org).
>
> blend_perfect is really the most desirable algorithm if we can afford it. Not
> only is it correct, but it's easy to think about and break into correct pieces:
> for instance, its div255() doesn't require any coordination with the multiply.
>
> This looks like a 30% hit according to microbenches. That said, microbenches
> said my previous change would be a 20-25% perf improvement, but it didn't end
> up showing a significant effect at a high level.
>
> As for correctness, I see a bunch of off-by-1 compared to blend_256_round_alt
> (exactly what we'd expect), and one off-by-3 in a GM that looks like it has a
> bunch of overdraw.
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/61221e7f87a99765b0e034020e06bb018e2a08c2
TBR=reed@google.com,fmalita@chromium.org,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1083923006
| |
Before we commit to blend_256_round_alt, let's make sure blend_perfect is
really slower in practice (i.e. regresses on perf.skia.org).
blend_perfect is really the most desirable algorithm if we can afford it. Not
only is it correct, but it's easy to think about and break into correct pieces:
for instance, its div255() doesn't require any coordination with the multiply.
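To make the div255() point concrete, here is a minimal sketch of an exact rounded division by 255. The names div255_round and blend_perfect_sketch are made up for this example, and the body is the well-known add-128-and-fold trick rather than code copied from the CL. It only needs the plain 16-bit product, so the multiply does not have to pre-bias or rescale anything.

```cpp
#include <cstdint>

// Exact round(x / 255) for any product of two bytes (x in 0..65025).
constexpr uint8_t div255_round(uint32_t x) {
    uint32_t biased = x + 128;
    return static_cast<uint8_t>((biased + (biased >> 8)) >> 8);
}

// Hypothetical stand-in for blend_perfect, for premultiplied src:
// src + round(dst * (255 - srcAlpha) / 255).
constexpr uint8_t blend_perfect_sketch(uint8_t dst, uint8_t src, uint8_t srcAlpha) {
    return static_cast<uint8_t>(src + div255_round(dst * (255u - srcAlpha)));
}
```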
This looks like a 30% hit according to microbenches. That said, microbenches
said my previous change would be a 20-25% perf improvement, but it didn't end
up showing a significant effect at a high level.
As for correctness, I see a bunch of off-by-1 compared to blend_256_round_alt
(exactly what we'd expect), and one off-by-3 in a GM that looks like it has a
bunch of overdraw.
BUG=skia:
Review URL: https://codereview.chromium.org/1098913002
| |
This algorithm changes the blend math, guarded by SK_LEGACY_COLOR32_MATH. The new math is more correct: it's never off by more than 1, and correct in all the interesting 0x00 and 0xFF edge cases, where the old math was never off by more than 2, and not always correct on the edges.
If you look at tests/BlendTest.cpp, the old code was using the `blend_256_plus1_trunc` algorithm, while the new code uses `blend_256_round_alt`. Neither uses `blend_perfect`, which is about 35% slower than `blend_256_round_alt`.
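For readers without tests/BlendTest.cpp at hand, here is a sketch of the three algorithms for a single premultiplied channel. The function names come from the message above, but the bodies are assumptions reconstructed from those names and should be checked against the test file.

```cpp
#include <cstdint>

// Old math (assumed): inverse alpha on a 0..256 scale via srcAlpha + 1,
// then a truncating shift.
constexpr uint8_t blend_256_plus1_trunc(uint8_t dst, uint8_t src, uint8_t srcAlpha) {
    return static_cast<uint8_t>(src + ((dst * (256u - (srcAlpha + 1u))) >> 8));
}

// New math (assumed): srcAlpha + (srcAlpha >> 7) maps 0 -> 0 and 255 -> 256,
// and adding 128 before the shift rounds instead of truncating.
constexpr uint8_t blend_256_round_alt(uint8_t dst, uint8_t src, uint8_t srcAlpha) {
    return static_cast<uint8_t>(
        src + ((dst * (256u - (srcAlpha + (srcAlpha >> 7))) + 128u) >> 8));
}

// Reference: exact rounded division by 255.
constexpr uint8_t blend_perfect(uint8_t dst, uint8_t src, uint8_t srcAlpha) {
    return static_cast<uint8_t>(src + (dst * (255u - srcAlpha) + 127u) / 255u);
}
```

Under these assumed bodies, the 0x00 edge case shows the difference: with srcAlpha == 0 and dst == 255, the old math returns 254, while the new math and the reference both return 255.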
This will require an unfathomable number of rebaselines, first to Skia, then to Blink when I remove the guard.
I plan to follow up with some integer SIMD abstractions that can unify these two implementations into a single algorithm. This was originally what I was working on here, but the correctness gains seem to be quite compelling. The only places these two algorithms really differ greatly now is the kernel function, and even there they can really both be expressed abstractly as:
- multiply 8-bits and 8-bits producing 16-bits
- add 16-bits to 16-bits, returning the top 8 bits.
All the constants are the same, except SSE is a little faster to keep 8 16-bit inverse alphas, NEON's a little faster to keep 8 8-bit inverse alphas. I may need to take this small speed win back to unify the two.
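The two abstract steps can be sketched in scalar form as below. This is an illustration, not the SSE or NEON code. It makes three assumptions: the helper names are made up, src is premultiplied (src <= srcAlpha, which keeps the 16-bit sum from overflowing), and srcAlpha == 0 is handled by an early-out so the inverse alpha fits in 8 bits.

```cpp
#include <cstdint>

// Step 1: multiply 8 bits by 8 bits, producing 16 bits.
constexpr uint16_t mul_8x8_to_16(uint8_t a, uint8_t b) {
    return static_cast<uint16_t>(a * b);
}

// Step 2: add 16 bits to 16 bits, returning the top 8 bits.
constexpr uint8_t add_return_top8(uint16_t a, uint16_t b) {
    return static_cast<uint8_t>(static_cast<uint16_t>(a + b) >> 8);
}

constexpr uint8_t blend_kernel(uint8_t dst, uint8_t src, uint8_t srcAlpha) {
    // Assumed early-out: with premultiplied input, alpha 0 leaves dst as is.
    if (srcAlpha == 0) return dst;
    // Inverse alpha on a 0..256 scale; at most 255 once alpha 0 is excluded.
    uint8_t invAlpha = static_cast<uint8_t>(256 - (srcAlpha + (srcAlpha >> 7)));
    // src moved to the top byte, plus 128 so that taking the top 8 bits rounds.
    uint16_t color16 = static_cast<uint16_t>(src * 256 + 128);
    return add_return_top8(mul_8x8_to_16(dst, invAlpha), color16);
}
```

Folding the rounding constant into the widened color is what lets one add-and-narrow step finish the blend.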
We should expect a ~25% speedup on Intel (mostly from unrolling to 8 pixels) and a ~20% speedup on ARM (mostly from using vaddhn to add `color`, round, and narrow back down to 8-bit all in one instruction).
(I am probably missing several more related bugs here.)
BUG=skia:3738,skia:420,chromium:111470
Review URL: https://codereview.chromium.org/1092433002
| |
These will underlie the SkPMFloat-like class for uint16_t components.
Sk4h will back a single-pixel version, and Sk8h any larger number than that.
BUG=skia:
Review URL: https://codereview.chromium.org/1088883005
| |
As used today, SkNi is used in bool-y contexts. This keeps that, but under a
new name, SkNb. This makes room for a new SkNi that's focused on integer-y
things like loads, stores, arithmetic, etc.
The main reason to split these is that we want different specializations for
each use case: for bools, it's important for us to specialize 32- and 64-bit to
support efficient float- and double- comparisons, but for integer work we're
more likely to be looking at 8- and 16- bit lanes. Keeping these use cases
siloed helps me manage the complexity of the backend NEON and SSE code.
BUG=skia:
Review URL: https://codereview.chromium.org/1083123002
| |
According to bench/MemsetBench.cpp, I've got them somewhere between 10% slower
and a percent or two faster than the old assembly.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Debug-Trybot
Review URL: https://codereview.chromium.org/1075003002
| |
Step 1 of a zillion in the quest for NEON on iOS,
and step 1 of a different zillion in the Great Assembly Purge.
ios, arm, arm64, arm_v7, arm_v7_neon all build.
BUG=skia:
Review URL: https://codereview.chromium.org/1072063002
| |
#floats
BUG=skia:
BUG=skia:3592
Committed: https://skia.googlesource.com/skia/+/6b5dab889579f1cc9e1b5278f4ecdc4c63fe78c9
CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Ubuntu-GCC-Arm64-Debug-Android-Trybot
Review URL: https://codereview.chromium.org/1061603002
| |
id:40001 of https://codereview.chromium.org/1061603002/)
Reason for revert:
missed some neon code
Original issue's description:
> Code's more readable when SkPMFloat is an Sk4f.
> #floats
>
> BUG=skia:
> BUG=skia:3592
>
> Committed: https://skia.googlesource.com/skia/+/6b5dab889579f1cc9e1b5278f4ecdc4c63fe78c9
TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1056143004
| |
#floats
BUG=skia:
BUG=skia:3592
Review URL: https://codereview.chromium.org/1061603002
| |
BUG=skia:
Review URL: https://codereview.chromium.org/1055123002
| |
#floats
BUG=skia:
BUG=skia:3592
Review URL: https://codereview.chromium.org/1059743002
| |
I don't see any color-order handling logic in the 32-bit code.
BUG=skia:1843
CQ_EXCLUDE_TRYBOTS=client.skia.compile:Build-Win-MSVC-x86-Debug-Trybot,Build-Win-MSVC-x86_64-Debug-Trybot
R=mtklein@google.com
Review URL: https://codereview.chromium.org/1051683003
| |
Each of these conversion functions now only asserts its output is valid.
For SkPMColor -> SkPMFloat, we assert isValid().
For SkPMFloat -> SkPMColor, we SkPMColorAssert.
#floats
BUG=skia:
BUG=skia:3592
Review URL: https://codereview.chromium.org/1055093002
| |
#floats
BUG=skia:
BUG=skia:3592
Review URL: https://codereview.chromium.org/1047823002
| |
The primary feature this delivers is SkNf and SkNd for arbitrary power-of-two N. Non-specialized types or types larger than 128 bits should now Just Work (and we can drop in a specialization to make them faster). Sk4s is now just a typedef for SkNf<4, SkScalar>; Sk4d is SkNf<4, double>, Sk2f SkNf<2, float>, etc.
This also makes implementing new specializations easier and more encapsulated. We're now using template specialization, which means the specialized versions don't have to leak out so much from SkNx_sse.h and SkNx_neon.h.
This design leaves us room to grow up, e.g. to SkNf<8, SkScalar> == Sk8s, and to grow down too, to things like SkNi<8, uint16_t> == Sk8h.
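A rough sketch of how "arbitrary power-of-two N" plus drop-in specializations can fit together. The names Nf, lo, hi, val, lane, make2 and make4 are hypothetical and this is not the SkNx headers. The generic template is two halves, and N == 1 bottoms out in a scalar.

```cpp
// Generic case: an N-wide vector is two N/2-wide halves, so any power-of-two
// N works without a hand-written implementation.
template <int N, typename T>
struct Nf {
    static_assert(N > 1 && (N & (N - 1)) == 0, "N must be a power of two");
    Nf<N / 2, T> lo, hi;

    constexpr Nf(const Nf<N / 2, T>& l, const Nf<N / 2, T>& h) : lo(l), hi(h) {}
    constexpr Nf operator+(const Nf& o) const { return Nf(lo + o.lo, hi + o.hi); }
    constexpr T lane(int i) const { return i < N / 2 ? lo.lane(i) : hi.lane(i - N / 2); }
};

// Base case: a single lane. A faster backend would add further
// specializations (for example for 4 floats) wrapping a native register.
template <typename T>
struct Nf<1, T> {
    T val;
    constexpr explicit Nf(T v) : val(v) {}
    constexpr Nf operator+(const Nf& o) const { return Nf(val + o.val); }
    constexpr T lane(int) const { return val; }
};

// Hypothetical helpers to keep construction readable.
constexpr Nf<2, float> make2(float a, float b) {
    return Nf<2, float>(Nf<1, float>(a), Nf<1, float>(b));
}
constexpr Nf<4, float> make4(float a, float b, float c, float d) {
    return Nf<4, float>(make2(a, b), make2(c, d));
}
```

Callers only see Nf<N, T>, so adding a specialization later changes speed without changing call sites.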
To simplify things, I've stripped away most APIs (swizzles, casts, reinterpret_casts) that no one's using yet. I will happily add them back if they seem useful.
You shouldn't feel bad about using any of the typedefs Sk4s, Sk4f, Sk4d, Sk2s, Sk2f, Sk2d, Sk4i, etc. Here's how you should feel:
- Sk4f, Sk4s, Sk2d: feel awesome
- Sk2f, Sk2s, Sk4d: feel pretty good
No public API changes.
TBR=reed@google.com
BUG=skia:3592
Review URL: https://codereview.chromium.org/1048593002
| |
Add and test trunc(), which is what get() used to be before rounding.
Using trunc() is a ~40% speedup on our linear gradient bench.
#neon #floats
BUG=skia:3592
#n5
#n9
CQ_INCLUDE_TRYBOTS=client.skia.android:Test-Android-Nexus5-Adreno330-Arm7-Debug-Trybot;client.skia.android:Test-Android-Nexus9-TegraK1-Arm64-Release-Trybot
Review URL: https://codereview.chromium.org/1032243002
| |
NOPRESUBMIT=true
BUG=skia:
DOCS_PREVIEW= https://skia.org/?cl=1037793002
Review URL: https://codereview.chromium.org/1037793002
| |
There is no reason to require the 4 SkPMFloats (registers) to be adjacent.
The only potential win in loads and stores comes from the SkPMColors being adjacent.
Makes no difference to existing bench.
BUG=skia:
Review URL: https://codereview.chromium.org/1035583002
| |
SkMatrix::mapPts() using aacc/bbdd was always worse than using badc():
- On Intel, it was faster than existing swizzle, but badc() is 10% faster still (one pshufd instead of two).
- On ARM, existing swizzle < badc() < aacc()+bbdd(), even though aacc() then bbdd() is really a single vtrn instruction.
I will revert SkMatrix.cpp before submitting. Just thought you might like to look.
Will think more and try to gear up Instruments on ARM.
BUG=skia:
Review URL: https://codereview.chromium.org/1012573003
| |
This removes all the existing Sk4x swizzles and adds badc(), which is
both fast on all implementations and currently useful.
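For reference, a scalar sketch of what a swizzle named badc() presumably does, assuming the name spells out the lane order: {a, b, c, d} becomes {b, a, d, c}. Vec4 is a hypothetical stand-in, not Sk4x.

```cpp
// Hypothetical 4-lane value type, only to show the lane permutation.
struct Vec4 {
    float v[4];

    // Swap lanes within each pair: {a, b, c, d} -> {b, a, d, c}.
    constexpr Vec4 badc() const { return Vec4{{v[1], v[0], v[3], v[2]}}; }
};
```

Applying it twice gives back the original order, which makes it convenient for pairing x/y style data.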
BUG=skia:
Review URL: https://codereview.chromium.org/997353005
| |
We don't have control over which way _mm_cvtps_epi32 rounds.
- This makes the SSE SkPMFloat rounding consistent with _neon and _none.
- Sk4f::cast<Sk4i>() is closer to (int)float's behavior. (Correct when >=0).
Add tests that would fail at head.
BUG=skia:
Review URL: https://codereview.chromium.org/1029163002
| |
BUG=skia:
Review URL: https://codereview.chromium.org/1024993002
| |
Tests pass on N7 + N9.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.7-Clang-Arm7-Debug-iOS-Trybot,Build-Ubuntu-GCC-Arm64-Release-Android-Trybot
Review URL: https://codereview.chromium.org/1027753003
| |
The implementation is nearly identical to Sk2f, with these changes:
- float32x2_t -> float64x2_t
- vfoo -> vfooq
- one extra Newton's method step in sqrt().
Also, generally fix NEON detection to be defined(SK_ARM_HAS_NEON).
SK_ARM_HAS_NEON is not being set on ARM64 bots right now (nor does the compiler
seem to set __ARM_NEON__), so this CL fixes everything up.
BUG=skia:
Committed: https://skia.googlesource.com/skia/+/e57b5cab261a243dcbefa74c91c896c28959bf09
CQ_EXTRA_TRYBOTS=client.skia.compile:Build-Mac10.7-Clang-Arm7-Debug-iOS-Trybot,Build-Ubuntu-GCC-Arm64-Release-Android-Trybot
Review URL: https://codereview.chromium.org/1020963002
| |
Adds an SSE2 version of the Color32A_D565 function, to replace
the existing SSE4 version. Also does some minor cleanup.
Performance improvement in the following Skia benchmarks.
Measured on Atom Silvermont:
Xfermode_SrcOver - x3
luma_colorfilter_large - x4.6
luma_colorfilter_small - x2
tablebench - ~15%
chart_bw - ~10%
Measured on Corei7 Haswell:
luma_colorfilter_large running SSE2 - x2
luma_colorfilter_large running SSE4 - x2.3
Also improves performance in WPS Office application and 2D subtest of 0xbenchmark on Android.
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Review URL: https://codereview.chromium.org/923523002
| |
https://codereview.chromium.org/1020963002/)
Reason for revert:
https://uberchromegw.corp.google.com/i/client.skia.compile/builders/Build-Mac10.7-Clang-Arm7-Debug-iOS/builds/2441/steps/build%20most/logs/stdio
https://uberchromegw.corp.google.com/i/client.skia.compile/builders/Build-Mac10.7-Clang-Arm7-Release-iOS/builds/2424/steps/build%20most/logs/stdio
https://uberchromegw.corp.google.com/i/client.skia.compile/builders/Build-Ubuntu-GCC-Arm64-Release-Android/builds/8/steps/build%20most/logs/stdio
Original issue's description:
> Specialize Sk2d for ARM64
>
> The implementation is nearly identical to Sk2f, with these changes:
> - float32x2_t -> float64x2_t
> - vfoo -> vfooq
> - one extra Newton's method step in sqrt().
>
> Also, generally fix NEON detection to be defined(SK_ARM_HAS_NEON).
> SK_ARM_HAS_NEON is not being set on ARM64 bots right now (nor does the compiler
> seem to set __ARM_NEON__), so this CL fixes everything up.
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/e57b5cab261a243dcbefa74c91c896c28959bf09
TBR=msarett@google.com,reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1028523003
| |
The implementation is nearly identical to Sk2f, with these changes:
- float32x2_t -> float64x2_t
- vfoo -> vfooq
- one extra Newton's method step in sqrt().
Also, generally fix NEON detection to be defined(SK_ARM_HAS_NEON).
SK_ARM_HAS_NEON is not being set on ARM64 bots right now (nor does the compiler
seem to set __ARM_NEON__), so this CL fixes everything up.
BUG=skia:
Review URL: https://codereview.chromium.org/1020963002
| |
Also decreases the precision of Sk4f::rsqrt() for speed, keeping Sk4f::sqrt() the same:
instead of doing two estimation steps in rsqrt(), do one there and one more in sqrt().
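The split of estimation steps can be sketched in portable scalar code. This is an illustration under assumptions, not the NEON implementation: the function names are made up, and the hardware reciprocal-square-root estimate is passed in as a parameter because the intrinsic is not portable.

```cpp
// One Newton's method step for y ~= 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
constexpr float rsqrt_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

// Cheaper rsqrt: refine the estimate once.
constexpr float rsqrt_sketch(float x, float estimate) {
    return rsqrt_step(x, estimate);
}

// sqrt keeps its precision by doing one more step, then sqrt(x) = x * rsqrt(x).
constexpr float sqrt_sketch(float x, float estimate) {
    return x * rsqrt_step(x, rsqrt_sketch(x, estimate));
}

constexpr float abs_err(float got, float want) {
    return got < want ? want - got : got - want;
}
```

Starting from a deliberately poor estimate of 0.4 for 1/sqrt(4), one step lands near 0.472 and a second near 0.498, so each extra step buys precision at the cost of a few more multiplies.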
Tests pass on my Nexus 7. float64x2_t is still a TODO for when I get a hold of a Nexus 9.
BUG=skia:
Review URL: https://codereview.chromium.org/1018423003
| |
This adds an API, an SSE impl, a portable impl, and some tests for Sk2f/Sk2d/Sk2s.
BUG=skia:
Review URL: https://codereview.chromium.org/1025463002
| |
No real changes here, just moving files around:
- move impl files into src/opts
- rename _portable _none
BUG=skia:
Review URL: https://codereview.chromium.org/1021713004
| |
BUG=skia:
Review URL: https://codereview.chromium.org/1021583002
| |
Review URL: https://codereview.chromium.org/1020563002
| |
A store/load pair like this is a redundant no-op:
store simd_register_a, memory_address
load memory_address, simd_register_a
Everyone seems to be good at removing those when using SSE, but GCC and Clang
are pretty terrible at this for NEON. We end up issuing both redundant
commands, usually to and from the stack. That's slow. Let's not do that.
This CL unions the native SIMD register type into SkPMFloat, so that we can
assign to and from it directly, which is generating a lot better NEON code. On
my Nexus 5, the benchmarks improve from 36ns to 23ns.
SSE is just as fast either way, but I paralleled the NEON code for consistency.
It's a little terser. And because it needed the platform headers anyway, I
moved all includes into SkPMFloat.h, again only for consistency.
I'd union in Sk4f too to make its conversion methods a little clearer,
but MSVC won't let me (it has a copy constructor... they're apparently not up
to speed with C++11 unrestricted unions).
BUG=skia:
Review URL: https://codereview.chromium.org/1015083004
| |
clone (+rebase) of https://codereview.chromium.org/1009183002/
BUG=skia:
TBR=scroggo@google.com
Review URL: https://codereview.chromium.org/1014533004
| |
Clamping 4 at a time is now about 15% faster than 1 at a time with SSSE3.
Clamping 4 at a time is now about 20% faster with SSE2,
and this applies to non-clamping too (we still just clamp there).
In all cases, 4 at a time is never worse than 1 at a time,
and not clamping is never slower than clamping.
Here's all the bench results, with the numbers for portable code as a fun point
of reference:
SSSE3:
maxrss loops min median mean max stddev samples config bench
10M 2291 4.66ns 4.66ns 4.66ns 4.68ns 0% ▆█▁▁▁▇▁▇▁▃ nonrendering SkPMFloat_get_1x
10M 2040 5.29ns 5.3ns 5.3ns 5.32ns 0% ▃▆▃▃▁▁▆▃▃█ nonrendering SkPMFloat_clamp_1x
10M 7175 4.62ns 4.62ns 4.62ns 4.63ns 0% ▁▄▃████▃▄▇ nonrendering SkPMFloat_get_4x
10M 5801 4.89ns 4.89ns 4.89ns 4.91ns 0% █▂▄▃▁▃▄█▁▁ nonrendering SkPMFloat_clamp_4x
SSE2:
maxrss loops min median mean max stddev samples config bench
10M 1601 6.02ns 6.05ns 6.04ns 6.08ns 0% █▅▄▅▄▂▁▂▂▂ nonrendering SkPMFloat_get_1x
10M 2918 6.05ns 6.06ns 6.05ns 6.06ns 0% ▂▇▁▇▇▁▇█▇▂ nonrendering SkPMFloat_clamp_1x
10M 3569 5.43ns 5.45ns 5.44ns 5.45ns 0% ▄█▂██▇▁▁▇▇ nonrendering SkPMFloat_get_4x
10M 4168 5.43ns 5.43ns 5.43ns 5.44ns 0% █▄▇▁▇▄▁▁▁▁ nonrendering SkPMFloat_clamp_4x
Portable:
maxrss loops min median mean max stddev samples config bench
10M 500 27.8ns 28.1ns 28ns 28.2ns 0% ▃█▆▃▇▃▆▁▇▂ nonrendering SkPMFloat_get_1x
10M 770 40.1ns 40.2ns 40.2ns 40.3ns 0% ▅▁▃▂▆▄█▂▅▂ nonrendering SkPMFloat_clamp_1x
10M 1269 28.4ns 28.8ns 29.1ns 32.7ns 4% ▂▂▂█▂▁▁▂▁▁ nonrendering SkPMFloat_get_4x
10M 1439 40.2ns 40.4ns 40.4ns 40.5ns 0% ▆▆▆█▁▆▅█▅▆ nonrendering SkPMFloat_clamp_4x
SkPMFloat_neon.h is still one big TODO as far as 4-at-a-time APIs go.
BUG=skia:
Review URL: https://codereview.chromium.org/982123002
| |
Instead of set(SkPMColor), add a constructor SkPMFloat(SkPMColor).
Replace setA(), setR(), etc. with a 4 float constructor.
And, promise to stick to SkPMColor order.
BUG=skia:
Review URL: https://codereview.chromium.org/977773002
| |
With SSSE3, we can use the Swiss Army Knife byte shuffler pshufb,
a.k.a. _mm_shuffle_epi8(), to jump directly between 32 and 128 bits.
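To show why one byte shuffle is enough, here is a scalar model of pshufb. The selection rule is the documented behavior of _mm_shuffle_epi8; the two masks are illustrative choices of mine, not taken from the CL. Bytes16, shuffle_bytes, kWidenMask and kNarrowMask are hypothetical names.

```cpp
#include <cstdint>

struct Bytes16 { uint8_t b[16]; };

// Scalar model of pshufb: an output byte is zero when its mask byte has the
// high bit set, otherwise it is src[mask & 15].
constexpr Bytes16 shuffle_bytes(const Bytes16& src, const Bytes16& mask) {
    Bytes16 out = {};
    for (int i = 0; i < 16; i++) {
        out.b[i] = (mask.b[i] & 0x80) ? 0 : src.b[mask.b[i] & 0x0F];
    }
    return out;
}

// 32 -> 128 bits: spread the 4 bytes of one pixel into 4 zero-extended
// 32-bit lanes (little-endian), ready for an int -> float conversion.
constexpr Bytes16 kWidenMask = {{0, 0x80, 0x80, 0x80, 1, 0x80, 0x80, 0x80,
                                 2, 0x80, 0x80, 0x80, 3, 0x80, 0x80, 0x80}};

// 128 -> 32 bits: gather the low byte of each lane back into one pixel.
constexpr Bytes16 kNarrowMask = {{0, 4, 8, 12, 0x80, 0x80, 0x80, 0x80,
                                  0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80}};
```

Without a byte shuffle, the same widening would take two unpack steps (8 to 16 bits, then 16 to 32), which is presumably the cost being saved here.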
In microbench isolation, this looks like an additional 10-15% speedup:
SkPMFloat_get: 2.35ns -> 1.98ns
SkPMFloat_clamp: 2.35ns -> 2.18ns
Before this CL, get() and clamp() were identical code. The _get benchmark improves because both set() and get() become faster; the _clamp benchmark shows the improvement from set() getting faster with clamp() staying the same.
BUG=skia:
Review URL: https://codereview.chromium.org/976493002
| |
SSE rounds for free (that was a happy accident: they also have a truncating version).
NEON does not, nor obviously the portable code, so they add 0.5 before truncating.
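A minimal sketch of the portable approach, with hypothetical helper names. On SSE the two behaviors correspond to _mm_cvtps_epi32 (rounds using the current rounding mode, nearest-even by default) and _mm_cvttps_epi32 (truncates).

```cpp
#include <cstdint>

// Plain float -> int conversion truncates toward zero.
constexpr uint8_t trunc_to_byte(float x) {
    return static_cast<uint8_t>(static_cast<int>(x));
}

// Rounding by adding 0.5 before truncating. Only valid for the non-negative
// values a color component holds.
constexpr uint8_t round_to_byte(float x) {
    return static_cast<uint8_t>(static_cast<int>(x + 0.5f));
}
```

The two approaches differ only on exact ties: adding 0.5 turns 2.5 into 3, while round-to-nearest-even gives 2.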
NOPRESUBMIT=true
BUG=skia:
Review URL: https://codereview.chromium.org/974643002
|