| Commit message | Author | Age |
|
Improves decode performance for RGBA encoded PNGs.
Swizzle Time on Nexus 9 (with clang):
SwapPremul 0.44x
Premul 0.44x
Decode Time On Nexus 9 (with clang):
ZeroInit Decodes 0.85x
Regular Decodes 0.86x
Swizzle Time on Nexus 6P (with clang):
SwapPremul 0.14x
Premul 0.14x
Decode Time On Nexus 6P (with clang):
ZeroInit Decodes 0.93x
Regular Decodes 0.95x
Notes:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).
A profile on Nexus 9 shows that the premultiplication step of PNG
decoding is now ~5% of decode time (down from ~20%).
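For readers unfamiliar with the step being timed, here is a minimal scalar sketch of premultiplication, with an optional red/blue swap to mirror the "SwapPremul" row. The names `Pixel`, `mul_div255`, and `premul` are hypothetical, and this is not the vectorized Skia routine the numbers above measure.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-bit-per-channel pixel; not a Skia type.
struct Pixel { uint8_t r, g, b, a; };

// Rounded (c * a) / 255 for 8-bit inputs.
inline uint8_t mul_div255(uint8_t c, uint8_t a) {
    return static_cast<uint8_t>((c * a + 127) / 255);
}

// Premultiply the color channels by alpha. swap_rb == true also swaps
// red and blue, which is what a "SwapPremul" style routine would do.
inline Pixel premul(Pixel p, bool swap_rb) {
    Pixel out{mul_div255(p.r, p.a), mul_div255(p.g, p.a), mul_div255(p.b, p.a), p.a};
    // Premultiplied invariant: no color channel exceeds alpha.
    assert(out.r <= out.a && out.g <= out.a && out.b <= out.a);
    if (swap_rb) { uint8_t t = out.r; out.r = out.b; out.b = t; }
    return out;
}
```

A fully transparent pixel premultiplies to all zeros, which is why a zero-initialized destination can skip those writes.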
BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1577703006
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1577703006
|
This gets rid of those unsightly lambdas,
and makes the file more consistent both with itself and with Sk4px.
BUG=skia:4765
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1569373002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1569373002
|
It seems that MSVC + __vectorcall don't play well together,
so back ourselves out into a situation where we don't need it.
- Inline transfermode functions. This removes the need for SK_VECTORCALL.
- Remove 565 destination specializations.
Blending into 565 is not speed-critical enough to merit the code bloat.
- Removing 565 specializations means a bunch of Sk4px code is now dead.
8888 xfermodes generally speed up a bit from inlining, smoothly ranging from no change down to 0.65x for the fastest functions like Plus or Modulate.
565 xfermodes generally slow down because we're doing 565 -> 8888 and 8888->565 conversion serially[1] and using the stack, smoothly ranging from no change up to 2x slower for the fastest functions like Plus and Modulate.
[1] the 565->8888 conversion is actually being autovectorized
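To make the 565 -> 8888 and 8888 -> 565 round trip concrete, here is a minimal scalar sketch. Expanding by bit replication is a common choice and is an assumption here, not necessarily what Skia does; `RGB8`, `unpack_565`, and `pack_565` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unpacked color; not a Skia type.
struct RGB8 { uint8_t r, g, b; };

// Expand RGB565 to 8 bits per channel by replicating the high bits
// of each field into the newly created low bits.
inline RGB8 unpack_565(uint16_t p) {
    unsigned r = (p >> 11) & 0x1F, g = (p >> 5) & 0x3F, b = p & 0x1F;
    return RGB8{ static_cast<uint8_t>((r << 3) | (r >> 2)),
                 static_cast<uint8_t>((g << 2) | (g >> 4)),
                 static_cast<uint8_t>((b << 3) | (b >> 2)) };
}

// Pack back down to RGB565 by dropping the low bits of each channel.
inline uint16_t pack_565(RGB8 c) {
    return static_cast<uint16_t>(((c.r >> 3) << 11) | ((c.g >> 2) << 5) | (c.b >> 3));
}

// Checks that unpacking and repacking returns every 565 value unchanged.
inline bool roundtrip_565_is_lossless() {
    for (unsigned p = 0; p <= 0xFFFF; ++p)
        if (pack_565(unpack_565(static_cast<uint16_t>(p))) != p) return false;
    return true;
}
```

A blend into a 565 destination would then be unpack, blend as 8888, pack, which is the serial path the timing note describes.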
BUG=skia:4765,skia:4776
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1565223002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
No public API changes.
TBR=reed@google.com
Review URL: https://codereview.chromium.org/1565223002
|
This is dead after removing shadeSpan16().
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1553233004
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1553233004
|
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1556003003
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1556003003
|
https://codereview.chromium.org/1550893002/ )
Reason for revert:
Can't use on XP. :(
Original issue's description:
> Try using std::call_once
>
> Now that we've got std library support, perhaps we should start using it.
> This CL acts as a little canary, and may help fix the linked bug.
>
> I'm not really sure what's going on in the linked bug, but using
> std::call_once over homegrown atomics has to be the right answer...
>
> BUG=chromium:418041
> GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1550893002
> CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
>
> Going to land this ahead of review while the tree is quiet to see how it rolls.
> TBR=herb@google.com
>
> Committed: https://skia.googlesource.com/skia/+/8895b72f789e5dc8bb99cb9727875439005fc919
TBR=herb@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=chromium:418041
Review URL: https://codereview.chromium.org/1552333003
|
Now that we've got std library support, perhaps we should start using it.
This CL acts as a little canary, and may help fix the linked bug.
I'm not really sure what's going on in the linked bug, but using
std::call_once over homegrown atomics has to be the right answer...
BUG=chromium:418041
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1550893002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Going to land this ahead of review while the tree is quiet to see how it rolls.
TBR=herb@google.com
Review URL: https://codereview.chromium.org/1550893002
|
This piece of code is already 64-bit only, so we don't need to think about ARMv7.
Hopefully this shuts up the warnings. They were harmless.
If this doesn't work (it's a relatively new modifier, so maybe some compilers barf), an alternative is to cast count to a size_t.
BUG=skia:4686
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1527123003
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1527123003
|
This is mainly warmup for an AVX2 version.
The machine I'm typing this on just doesn't support AVX2.
This strategy should translate easily down to SSSE3 and SSE2.
Xfermode_SrcOver: 2.73ms -> 2.62ms (0.96x) (That's Color32.)
Xfermode_SrcOver_aa: 3.48ms -> 3.09ms (0.89x) (That's BlitMask_D32_A8.)
AA text blits (text_16_AA_{88,FF,WT,BK}) show speedups in the range of 5 to 20%.
Unlike previous versions of this code, all the div255() are exactly (x+127)/255.
This won't fix any major bugs, but it does correct our bias in the middle.
There will be many diffs, all minor.
I've punted for now on pmaddubsw for lerping. I do intend to try that,
but I want this (relatively simple) code as my basis for comparison.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1526883004
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1526883004
|
Given the autovectorization we've seen, I wouldn't expect big speedups
from this, but it does give us a point of control over what's going on.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1526923003
|
- one base case and one N=1 case instead of two each (or three with doubles)
- use SkNx_cast instead of FromBytes/toBytes
- 4-at-a-time Sk4f::ToBytes becomes a special standalone Sk4f_ToBytes
If I did everything right, this'll be perf- and pixel- neutral.
https://gold.skia.org/search2?issue=1526523003&unt=true&query=source_type%3Dgm&master=false
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1526523003
|
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1521623003
|
We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
1) x += 128
2) shift x right 8 bits
3) add x and x>>8 together, then shift right 8 more bits
Now do it as two instructions:
1) shift (x+128) right 8 bits
2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits
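The identity behind both instruction sequences can be checked exhaustively in scalar code. The sketch below is only an illustration: the function names are hypothetical, and the range [0, 255*255] is assumed because x is a product of two bytes.

```cpp
#include <cassert>
#include <cstdint>

// Reference: rounded division by 255.
inline uint32_t div255_exact(uint32_t x) { return (x + 127) / 255; }

// Shift-and-add form, valid for x in [0, 255*255].
inline uint32_t div255_shifts(uint32_t x) {
    assert(x <= 255u * 255u);
    uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

// Compare the two forms over every possible product of two bytes.
inline bool div255_forms_agree() {
    for (uint32_t x = 0; x <= 255u * 255u; ++x)
        if (div255_exact(x) != div255_shifts(x)) return false;
    return true;
}
```

Because the two forms agree everywhere on that range, regrouping the additions and shifts changes only the instruction count, not the pixels.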
On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost.
This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines.
This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dgm&master=false
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
Review URL: https://codereview.chromium.org/1502843002
|
Also remove the SK_SUPPORT_LEGACY_LINEAR_GRADIENT_TABLE guard since it is no
longer used in Chromium.
BUG=chromium:563492
R=reed@google.com,mtklein@google.com
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1489233005
|
This is a big speedup for float -> byte. E.g. gradient_linear_clamp_3color:
x86-64 147µs -> 103µs (Broadwell MBP)
arm64 2.03ms -> 648µs (Galaxy S6)
armv7 1.12ms -> 489µs (Galaxy S6, same device!)
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot
Review URL: https://codereview.chromium.org/1483953002
|
SkNx_cast() can cast between any of our vector types,
provided they have the same number of elements.
Any types should work with the default implementation,
and we can drop in specializations as needed, like the
SSE and NEON Sk4f -> Sk4i I included here as an example.
To make this work, I made some internal name changes:
SkNi<N,T> -> SkNx<N, T>
SkNf<N> -> SkNx<N, float>
User aliases (Sk4f, Sk16b, etc.) stay the same.
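The shape of such a cast can be sketched as a function template with a lane-by-lane default plus an explicit specialization for one type pair. `VecN` and `vec_cast` are hypothetical stand-ins, not the real SkNx or SkNx_cast, and the specialization body here is still scalar.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical N-lane vector; a stand-in for SkNx<N, T>, not the real class.
template <int N, typename T>
struct VecN { T lanes[N]; };

// Default cast: convert lane by lane. Source and destination share N by
// construction, so mismatched element counts simply do not compile.
template <typename Dst, int N, typename Src>
VecN<N, Dst> vec_cast(const VecN<N, Src>& src) {
    VecN<N, Dst> dst{};
    for (int i = 0; i < N; ++i) dst.lanes[i] = static_cast<Dst>(src.lanes[i]);
    return dst;
}

// Drop-in specialization for 4 floats -> 4 ints. A SIMD build could
// replace this body with a single truncating convert instruction.
template <>
inline VecN<4, int32_t> vec_cast<int32_t, 4, float>(const VecN<4, float>& src) {
    return VecN<4, int32_t>{{ static_cast<int32_t>(src.lanes[0]),
                              static_cast<int32_t>(src.lanes[1]),
                              static_cast<int32_t>(src.lanes[2]),
                              static_cast<int32_t>(src.lanes[3]) }};
}
```

Callers name only the destination element type, for example `vec_cast<int32_t>(v)`; the lane count and source type are deduced.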
We can land this first (it's PS1) if that makes things easier.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1464623002
|
Generally this was a performance win, even on devices without AVX due
to unrolling, but on ARM+NEON it looks like that unrolling hurt a bit.
while (...) { blend a pixel }
~~~>
while (...) { blend two pixels }
if (n % 2) { blend last pixel }
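Spelled out as compilable code, the two loop shapes look roughly like this. `blend_one` is a hypothetical placeholder for the real per-pixel blend, chosen only so the example runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-pixel operation standing in for "blend a pixel".
inline uint8_t blend_one(uint8_t dst, uint8_t src) {
    return static_cast<uint8_t>((dst + src + 1) / 2);
}

// One pixel per iteration.
inline void blend_simple(uint8_t* dst, const uint8_t* src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = blend_one(dst[i], src[i]);
}

// Two pixels per iteration, then the leftover pixel when n is odd.
inline void blend_unrolled(uint8_t* dst, const uint8_t* src, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        dst[i]     = blend_one(dst[i],     src[i]);
        dst[i + 1] = blend_one(dst[i + 1], src[i + 1]);
    }
    if (n % 2) dst[i] = blend_one(dst[i], src[i]);
}
```

Both versions produce identical output; whether the unrolled one is faster depends on the target, as the ARM+NEON observation above shows.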
BUG=chromium:555278
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1465483002
|
SkPx has triggered a bunch of small (2-9%) regressions on NEON devices.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1462783002
|
_mm_mulhi_epu16 makes the (...*257)>>16 part simple.
This seems to speed up every transfermode that uses div255(),
in the 7-25% range.
It even appears to obviate the need for approxMulDiv255() on SSE.
I'm not sure about NEON yet, so I'll keep approxMulDiv255() for now.
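A scalar model shows why a multiply-high fits here. The sketch assumes the elided operand in "(...*257)>>16" is x+128, which the message does not spell out; `mulhi_u16` models a single 16-bit lane of `_mm_mulhi_epu16`, and the other names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of _mm_mulhi_epu16: the high 16 bits of an
// unsigned 16x16 -> 32 bit product.
inline uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return static_cast<uint16_t>((static_cast<uint32_t>(a) * b) >> 16);
}

// div255 written as ((x + 128) * 257) >> 16, for x in [0, 255*255].
// The (x + 128) operand is an assumption about the elided expression.
inline uint16_t div255_mulhi(uint16_t x) {
    assert(x <= 255u * 255u);
    return mulhi_u16(static_cast<uint16_t>(x + 128), 257);
}

// Check the multiply-high form against rounded division over the full range.
inline bool div255_mulhi_is_exact() {
    for (uint32_t x = 0; x <= 255u * 255u; ++x)
        if (div255_mulhi(static_cast<uint16_t>(x)) != (x + 127) / 255) return false;
    return true;
}
```

Under that assumption the add, multiply, and shift collapse into one add and one multiply-high per vector, with no loss of exactness.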
Should be no pixels change:
https://gold.skia.org/search2?issue=1452903004&unt=true&query=source_type%3Dgm&master=false
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1452903004
|
- extract alpha from a pixel: 5 1-cycle ops to 4 1-cycle ops
- load alphas: drop 4 unnecessary ops
Should be no pixel diffs.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1447273004
|
Xfermode_ColorDodge_aa 10.3ms -> 7.85ms 0.76x
Xfermode_SoftLight_aa 13.8ms -> 10.2ms 0.74x
Xfermode_ColorBurn_aa 10.7ms -> 7.82ms 0.73x
Xfermode_SoftLight 33.6ms -> 23.2ms 0.69x
Xfermode_ColorDodge 25ms -> 16.5ms 0.66x
Xfermode_ColorBurn 26.1ms -> 16.6ms 0.63x
Ought to be no pixel diffs:
https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dgm&master=false
Incidental stuff:
I made the SkNx(T) constructors implicit to make writing math expressions simpler.
This allows us to write expressions like
Sk4f v;
...
v = v*4;
rather than
Sk4f v;
...
v = v * Sk4f(4);
As written it only works when the constant is on the right-hand side,
so expressions like `(Sk4f(1) - da)` have to stay for now. I plan on
following up with a CL that lets those become `(1 - da)` too.
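One way to get exactly this right-hand-side-only behavior is an implicit converting constructor combined with member operators. That combination is an assumption about the cause, and `Vec4` below is a hypothetical stand-in rather than the real Sk4f.

```cpp
#include <cassert>

// Hypothetical 4-float vector; a stand-in for Sk4f, not the real class.
struct Vec4 {
    float v[4];

    // Implicit (non-explicit) splat constructor: a bare scalar converts to Vec4.
    Vec4(float s) : v{s, s, s, s} {}
    Vec4(float a, float b, float c, float d) : v{a, b, c, d} {}

    // Member operators: the left operand must already be a Vec4, while the
    // right operand may be converted implicitly. So `x * 4` compiles but
    // `4 * x` does not.
    Vec4 operator*(const Vec4& o) const {
        return Vec4(v[0] * o.v[0], v[1] * o.v[1], v[2] * o.v[2], v[3] * o.v[3]);
    }
    Vec4 operator-(const Vec4& o) const {
        return Vec4(v[0] - o.v[0], v[1] - o.v[1], v[2] - o.v[2], v[3] - o.v[3]);
    }
};
```

Non-member operators taking two `Vec4` arguments would let the conversion apply to either side, which is one route to making `1 - da` work.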
BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1432903002
|
This is a pure refactor. No behavior change.
I'm just getting tired of typing out the names...
BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1436513002
|
- remove float -> int conversion, keeping float -> byte
- remove support for doubles
I was thinking of specializing Sk8f for AVX. This will help keep the complexity down.
This may cause minor diffs in radial gradients: toBytes() rounds where castTrunc() truncated. But I don't see any diffs in Gold.
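The difference between the two conversions is at most one code value per channel. A minimal sketch, assuming rounding is done by adding 0.5 before truncating (the message does not say how toBytes() rounds), with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Truncating float -> byte conversion, for f in [0, 255].
inline uint8_t to_byte_trunc(float f) {
    assert(f >= 0.0f && f <= 255.0f);
    return static_cast<uint8_t>(f);
}

// Rounding float -> byte conversion, for f in [0, 255].
// Adding 0.5 before truncating is an assumed rounding method.
inline uint8_t to_byte_round(float f) {
    assert(f >= 0.0f && f <= 255.0f);
    return static_cast<uint8_t>(f + 0.5f);
}
```

Inputs whose fractional part is below one half convert identically either way, which is consistent with few or no visible diffs.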
https://gold.skia.org/search2?issue=1411563008&unt=true&query=source_type%3Dgm&master=false
BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1411563008
|
SkPx is like Sk4px, except each platform implementation of SkPx can declare
a different sweet spot of N pixels, with extra loads and stores to handle the
ragged edge of 0<n<N pixels.
In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
we can now use NEON's transposing loads and stores, and _none is just 1.
This makes operations involving alpha considerably more efficient on NEON,
as alpha is its own distinct 8x8 bit plane that's easy to toss around.
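A scalar model of a transposing load makes the "alpha is its own plane" point concrete. The sketch is similar in spirit to NEON's `vld4_u8` and `vst4_u8` but uses plain loops; `Planes8` and the function names are hypothetical, and RGBA byte order is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical planar view of 8 pixels; each array is one 8x8-bit plane.
struct Planes8 { uint8_t r[8], g[8], b[8], a[8]; };

// De-interleaving ("transposing") load of 8 RGBA pixels.
inline Planes8 load8_deinterleaved(const uint8_t* rgba) {
    Planes8 p{};
    for (int i = 0; i < 8; ++i) {
        p.r[i] = rgba[4 * i + 0];
        p.g[i] = rgba[4 * i + 1];
        p.b[i] = rgba[4 * i + 2];
        p.a[i] = rgba[4 * i + 3];
    }
    return p;
}

// Matching interleaving store.
inline void store8_interleaved(const Planes8& p, uint8_t* rgba) {
    for (int i = 0; i < 8; ++i) {
        rgba[4 * i + 0] = p.r[i];
        rgba[4 * i + 1] = p.g[i];
        rgba[4 * i + 2] = p.b[i];
        rgba[4 * i + 3] = p.a[i];
    }
}

// With alpha in its own plane, scaling each channel by alpha is a plain
// lane-wise loop with no shuffling to broadcast alpha.
inline Planes8 premul8(Planes8 p) {
    for (int i = 0; i < 8; ++i) {
        p.r[i] = static_cast<uint8_t>((p.r[i] * p.a[i] + 127) / 255);
        p.g[i] = static_cast<uint8_t>((p.g[i] * p.a[i] + 127) / 255);
        p.b[i] = static_cast<uint8_t>((p.b[i] * p.a[i] + 127) / 255);
    }
    return p;
}
```

In an interleaved 4-pixel layout the same operation first has to replicate each alpha across its pixel's lanes, which is the extra work the planar form avoids.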
This incorporates a few other improvements I've been wanting:
- no requirement that we're dealing with SkPMColor. SkColor works too.
- no anonymous namespace hack to differentiate implementations.
Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
The NEON code looks very similar to the old NEON code, as intended.
No .skp or GM diffs on my laptop. Don't expect any.
I intend this to replace Sk4px. Plan after landing:
- port SkXfermode_opts.h
- port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
SkOpts code)
- delete all Sk4px-related code
- clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
leaving Sk2f, Sk4f (and Sk2s, Sk4s).
- find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
at a time.
In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
BUG=skia:4117
Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot
Committed: https://skia.googlesource.com/skia/+/a7627dc5cc2bf5d9a95d883d20c40d477ecadadf
Review URL: https://codereview.chromium.org/1317233005
|
https://codereview.chromium.org/1317233005/ )
Reason for revert:
master-skia unhappy:
https://android-build.storage.googleapis.com/builds/git_master-skia-linux-volantis-userdebug/2404853/e6c439e806fb0bd0f872a3d7a5cf0637d4ad11bfaa89e9bc18b651dc65f0a36b/logs/build_error.log?GoogleAccessId=701025073339-mqn0q2nvir9iurm6q5d00tdv7blbgvjr%40developer.gserviceaccount.com&Signature=WOqQO7xHkv83SmC4h5tNUIp%2BREaYULqK11hNTWlhj1XXo0NAOQd7GNSIHl775uRRZpBw2LkHeb2Ups3LsgRPrldqymposFtDa%2BUEW0Jv2NWAr%2F1Cqt6lwWsfknvJLN9NiEGfpCCye3Q%2FEYx9bU1ozMBG6h2DRHJUMRS%2FjstkJg0%3D&Expires=1446838937
Original issue's description:
> SkPx: new approach to fixed-point SIMD
>
> SkPx is like Sk4px, except each platform implementation of SkPx can declare
> a different sweet spot of N pixels, with extra loads and stores to handle the
> ragged edge of 0<n<N pixels.
>
> In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
> we can now use NEON's transposing loads and stores, and _none is just 1.
> This makes operations involving alpha considerably more efficient on NEON,
> as alpha is its own distinct 8x8 bit plane that's easy to toss around.
>
> This incorporates a few other improvements I've been wanting:
> - no requirement that we're dealing with SkPMColor. SkColor works too.
> - no anonymous namespace hack to differentiate implementations.
>
> Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
> The NEON code looks very similar to the old NEON code, as intended.
> No .skp or GM diffs on my laptop. Don't expect any.
>
> I intend this to replace Sk4px. Plan after landing:
> - port SkXfermode_opts.h
> - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
> SkOpts code)
> - delete all Sk4px-related code
> - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
> leaving Sk2f, Sk4f (and Sk2s, Sk4s).
> - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
> at a time.
>
> In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
>
> BUG=skia:4117
>
> Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9
>
> CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/a7627dc5cc2bf5d9a95d883d20c40d477ecadadf
TBR=msarett@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:4117
Review URL: https://codereview.chromium.org/1409843005
|
SkPx is like Sk4px, except each platform implementation of SkPx can declare
a different sweet spot of N pixels, with extra loads and stores to handle the
ragged edge of 0<n<N pixels.
In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
we can now use NEON's transposing loads and stores, and _none is just 1.
This makes operations involving alpha considerably more efficient on NEON,
as alpha is its own distinct 8x8 bit plane that's easy to toss around.
This incorporates a few other improvements I've been wanting:
- no requirement that we're dealing with SkPMColor. SkColor works too.
- no anonymous namespace hack to differentiate implementations.
Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
The NEON code looks very similar to the old NEON code, as intended.
No .skp or GM diffs on my laptop. Don't expect any.
I intend this to replace Sk4px. Plan after landing:
- port SkXfermode_opts.h
- port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
SkOpts code)
- delete all Sk4px-related code
- clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
leaving Sk2f, Sk4f (and Sk2s, Sk4s).
- find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
at a time.
In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
BUG=skia:4117
Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.compile:Build-Mac10.8-Clang-Arm7-Debug-Android-Trybot
Review URL: https://codereview.chromium.org/1317233005
|
SkBlurImageFilter can currently only process a source image
which is larger than or equal to the destination rect. If
the source image (or crop rect) is smaller, it is padded
out to dest size with transparent black via the 6-param
version of applyCropRect().
Fixing this requires modifying all the flavours of RGBA
box_blur() to accept a src crop rect.
BUG=skia:4502, skia:4526
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Committed: https://skia.googlesource.com/skia/+/1b82ceb737c73327412f2e8a91748481e1aec9e4
Review URL: https://codereview.chromium.org/1415653003
|
#16 id:400001 of https://codereview.chromium.org/1415653003/ )
Reason for revert:
ASAN failures (see https://codereview.chromium.org/1415653003/)
Original issue's description:
> Make SkBlurImageFilter capable of cropping during blur (raster path)
>
> SkBlurImageFilter can currently only process a source image
> which is larger than or equal to the destination rect. If
> the source image (or crop rect) is smaller, it is padded
> out to dest size with transparent black via the 6-param
> version of applyCropRect().
>
> Fixing this requires modifying all the flavours of RGBA
> box_blur() to accept a src crop rect.
>
> BUG=skia:4502, skia:4526
> CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
>
> Committed: https://skia.googlesource.com/skia/+/1b82ceb737c73327412f2e8a91748481e1aec9e4
TBR=mtklein@google.com,reed@google.com
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:4502, skia:4526
Review URL: https://codereview.chromium.org/1428053002
|
SkBlurImageFilter can currently only process a source image
which is larger than or equal to the destination rect. If
the source image (or crop rect) is smaller, it is padded
out to dest size with transparent black via the 6-param
version of applyCropRect().
Fixing this requires modifying all the flavours of RGBA
box_blur() to accept a src crop rect.
BUG=skia:4502, skia:4526
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1415653003
|
Stop leaning so hard on the branch predictor, and pull the conditionals
out of the loops for box_blur_double() (NEON). This is conceptually
the same change as https://codereview.chromium.org/1426583004/ for
the NEON double-pixel loop.
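The general idea of hoisting edge tests out of a sliding-window loop can be sketched in one dimension. This is a scalar illustration on integers with hypothetical names, not the NEON box_blur_double() code, and it computes an unnormalized box sum to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sliding-window box sum with the edge tests inside the loop.
inline std::vector<int> box_sum_branchy(const std::vector<int>& src, int radius) {
    assert(radius >= 0);
    const int n = static_cast<int>(src.size());
    std::vector<int> dst(src.size());
    int sum = 0;
    for (int i = 0; i < std::min(radius, n); ++i) sum += src[i];  // window for x = -1
    for (int x = 0; x < n; ++x) {
        if (x + radius < n)      sum += src[x + radius];      // pixel entering the window
        if (x - radius - 1 >= 0) sum -= src[x - radius - 1];  // pixel leaving the window
        dst[x] = sum;
    }
    return dst;
}

// Same result with the edge tests hoisted into three branch-free loops.
inline std::vector<int> box_sum_split(const std::vector<int>& src, int radius) {
    assert(radius >= 0);
    const int n = static_cast<int>(src.size());
    if (n < 2 * radius + 1) return box_sum_branchy(src, radius);  // narrow rows: keep it simple
    std::vector<int> dst(src.size());
    int sum = 0;
    for (int i = 0; i < radius; ++i) sum += src[i];
    int x = 0;
    for (; x <= radius; ++x) {        // leading edge: pixels only enter
        sum += src[x + radius];
        dst[x] = sum;
    }
    for (; x < n - radius; ++x) {     // interior: one enters, one leaves
        sum += src[x + radius] - src[x - radius - 1];
        dst[x] = sum;
    }
    for (; x < n; ++x) {              // trailing edge: pixels only leave
        sum -= src[x - radius - 1];
        dst[x] = sum;
    }
    return dst;
}
```

The interior loop, where most pixels fall on wide rows, no longer contains any data-dependent branches.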
R=mtklein@google.com
BUG=skia:4526
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1412793009
|
This gives ~15% improvement on blur_image on Linux Z620,
and should allow me to implement cropping without
incurring a perf hit.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1426583004
|
no public API changes
TBR=reed@google.com
BUG=skia:
Review URL: https://codereview.chromium.org/1419573011
|
Refactor box_blur() into a single driver function which
SSE*, NEON and generic code paths can use. I've used macros
to do this in order to keep debug performance reasonable,
but it's fairly ugly. I'm open to other suggestions.
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Review URL: https://codereview.chromium.org/1408003007
|
https://codereview.chromium.org/1317233005/ )
Reason for revert:
http://build.chromium.org/p/client.skia.compile/builders/Build-Mac10.8-Clang-Arm7-Debug-Android/builds/4627
Original issue's description:
> SkPx: new approach to fixed-point SIMD
>
> SkPx is like Sk4px, except each platform implementation of SkPx can declare
> a different sweet spot of N pixels, with extra loads and stores to handle the
> ragged edge of 0<n<N pixels.
>
> In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
> we can now use NEON's transposing loads and stores, and _none is just 1.
> This makes operations involving alpha considerably more efficient on NEON,
> as alpha is its own distinct 8x8 bit plane that's easy to toss around.
>
> This incorporates a few other improvements I've been wanting:
> - no requirement that we're dealing with SkPMColor. SkColor works too.
> - no anonymous namespace hack to differentiate implementations.
>
> Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
> The NEON code looks very similar to the old NEON code, as intended.
> No .skp or GM diffs on my laptop. Don't expect any.
>
> I intend this to replace Sk4px. Plan after landing:
> - port SkXfermode_opts.h
> - port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
> SkOpts code)
> - delete all Sk4px-related code
> - clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
> leaving Sk2f, Sk4f (and Sk2s, Sk4s).
> - find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
> at a time.
>
> In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
>
> BUG=skia:4117
>
> Committed: https://skia.googlesource.com/skia/+/82c93b45ed6ac0b628adb8375389c202d1f586f9
TBR=mtklein@google.com,msarett@google.com
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:4117
Review URL: https://codereview.chromium.org/1336423002
|
SkPx is like Sk4px, except each platform implementation of SkPx can declare
a different sweet spot of N pixels, with extra loads and stores to handle the
ragged edge of 0<n<N pixels.
In this case, _sse's sweet spot remains 4 pixels. _neon jumps up to 8 so
we can now use NEON's transposing loads and stores, and _none is just 1.
This makes operations involving alpha considerably more efficient on NEON,
as alpha is its own distinct 8x8 bit plane that's easy to toss around.
This incorporates a few other improvements I've been wanting:
- no requirement that we're dealing with SkPMColor. SkColor works too.
- no anonymous namespace hack to differentiate implementations.
Codegen and perf look good on Clang/x86-64 and GCC/ARMv7.
The NEON code looks very similar to the old NEON code, as intended.
No .skp or GM diffs on my laptop. Don't expect any.
I intend this to replace Sk4px. Plan after landing:
- port SkXfermode_opts.h
- port Color32 in SkBlitRow_D32.cpp (and move to SkBlitRow_opts.h like other
SkOpts code)
- delete all Sk4px-related code
- clean up evolutionary dead ends in SkNx (Sk16b, Sk16h, Sk4i, Sk4d, etc.)
leaving Sk2f, Sk4f (and Sk2s, Sk4s).
- find a machine with AVX2 to work on, write SkPx_avx2.h handling 8 pixels
at a time.
In the end we'll have Sk4f for float pixels, SkPx for fixed-point pixels.
BUG=skia:4117
Review URL: https://codereview.chromium.org/1317233005
|
id:40001 of https://codereview.chromium.org/1333983002/ )
Reason for revert:
Unexpected perf impact, and a whole bunch of new images in gold (mostly invisibly different).
Original issue's description:
> use new shuffle to speed up affine matrix mappts
>
> sse: 25 -> 18
> neon: 95 -> 86
>
> BUG=skia:
>
> Committed: https://skia.googlesource.com/skia/+/e70afc9f48d00828ee6b707899a8ff542b0e8b98
TBR=reed@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1335003002
|
sse: 25 -> 18
neon: 95 -> 86
BUG=skia:
Review URL: https://codereview.chromium.org/1333983002
|
This allows us to express shuffles more directly in code while also giving us a
convenient point to platform-specify particular shuffles for particular types.
No specializations yet. Everyone just uses the (pretty good) default option.
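A shuffle with its index list as template arguments might look like the sketch below. `VecN` and `shuffle` are hypothetical stand-ins, and only the generic default is shown; a platform-specific version would specialize on a particular index pattern and type.

```cpp
#include <cassert>

// Hypothetical N-lane vector; a stand-in for SkNx, not the real class.
template <int N, typename T>
struct VecN { T lanes[N]; };

// Generic shuffle: output lane i takes input lane Ix[i]. The indices are
// compile-time constants, so the pattern is visible in the call itself.
template <int... Ix, int N, typename T>
VecN<sizeof...(Ix), T> shuffle(const VecN<N, T>& v) {
    return VecN<sizeof...(Ix), T>{{ v.lanes[Ix]... }};
}
```

For example, `shuffle<1,0,3,2>(v)` expresses an ABCD -> BADC swap directly, and the output width follows from the number of indices, so `shuffle<0,0>(v)` yields a 2-lane vector.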
BUG=skia:
Review URL: https://codereview.chromium.org/1301413006
|
No changes to the code, just moved around.
This will have the effect of enabling vectorized code on ARMv7.
Should be no effect on ARMv8 or x86, which would have been vectorized already.
nanobench --match mappoints changes on Nexus 5 (ARMv7):
_affine: 132 -> 95
_scale: 118 -> 47
_trans: 60 -> 37
A teaser:
We should next look at the ABCD->BADC shuffle we've noted that we need in _affine. A quick hack showed doing that optimally is another ~35% speedup on x86. Got to figure out how to do it best on ARM though: that same quick hack was a 2x slowdown there. Good reason to resurrect that SkNx_shuffle() CL!
(I believe the answers are vrev64q_f32(v) and _mm_shuffle_ps(v,v, _MM_SHUFFLE(2,3,0,1)), but we should probably find out in another CL.)
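To see where the ABCD -> BADC shuffle comes from, here is a scalar sketch of mapping two points at once, laid out the way a 4-lane version would be. The `Affine` struct, its field names, and `map_two_points` are hypothetical and are not the SkMatrix API.

```cpp
#include <cassert>

// Hypothetical affine matrix: x' = sx*x + kx*y + tx, y' = ky*x + sy*y + ty.
struct Affine { float sx, kx, tx, ky, sy, ty; };

// Map two points stored as [x0, y0, x1, y1]. Each output lane needs its own
// coordinate times a scale plus the *other* coordinate times a skew, hence
// the ABCD -> BADC swap of the input.
inline void map_two_points(const Affine& m, const float in[4], float out[4]) {
    const float scale[4]   = { m.sx, m.sy, m.sx, m.sy };
    const float skew[4]    = { m.kx, m.ky, m.kx, m.ky };
    const float trans[4]   = { m.tx, m.ty, m.tx, m.ty };
    const float swapped[4] = { in[1], in[0], in[3], in[2] };  // ABCD -> BADC
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * scale[i] + swapped[i] * skew[i] + trans[i];
}
```

Scale-only and translate-only matrices have zero skew, so they never need the swap.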
BUG=skia:4117
Review URL: https://codereview.chromium.org/1320673014
|
This was a pre-SkOpts attempt that we can bring under its wing now.
This should be a perf no-op, deo volente.
BUG=skia:4117
Review URL: https://codereview.chromium.org/1314863006
|
This gives SkOncePtr a non-trivial destructor that uses std::default_delete
by default. This is overrideable, as seen in SkColorTable.
SK_DECLARE_STATIC_ONCE_PTR still just leaves its pointers hanging at EOP.
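The ownership pattern described here, a lazily created object destroyed by a deleter that defaults to std::default_delete, can be sketched as follows. This is a hypothetical `OncePtr`, not Skia's SkOncePtr. In particular it publishes with a compare-and-swap, so racing threads may each call `create()` and the losers delete their copy, which may differ from the real class's guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical lazily initialized owning pointer; a sketch only.
template <typename T, typename Delete = std::default_delete<T>>
class OncePtr {
public:
    OncePtr() : fPtr(nullptr) {}
    OncePtr(const OncePtr&) = delete;
    OncePtr& operator=(const OncePtr&) = delete;

    // Non-trivial destructor: whatever was created is destroyed with Delete.
    ~OncePtr() {
        if (T* p = fPtr.load(std::memory_order_acquire)) Delete()(p);
    }

    // Returns the shared object, calling create() if none is published yet.
    template <typename F>
    T* get(const F& create) {
        T* p = fPtr.load(std::memory_order_acquire);
        if (!p) {
            T* fresh = create();
            if (fPtr.compare_exchange_strong(p, fresh,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                p = fresh;
            } else {
                Delete()(fresh);  // another thread won; p now holds its pointer
            }
        }
        return p;
    }

private:
    std::atomic<T*> fPtr;
};

// Example of overriding the deleter: counts deletions before freeing.
struct CountingDelete {
    static int& count() { static int n = 0; return n; }
    void operator()(int* p) const { ++count(); delete p; }
};
```

A statically declared instance that should never run its destructor, as with the macro mentioned above, would need a separate type with a trivial destructor.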
BUG=skia:
No public API changes.
TBR=reed@google.com
Committed: https://skia.googlesource.com/skia/+/a1254acdb344174e761f5061c820559dab64a74c
Review URL: https://codereview.chromium.org/1322933005
|
https://codereview.chromium.org/1322933005/ )
Reason for revert:
Breaks Chrome roll.
obj/skia/ext/skia_chrome.skia_memory_dump_provider.o
does not have -I include/private on its include path, but transitively includes SkMessageBus.h.
Original issue's description:
> Port uses of SkLazyPtr to SkOncePtr.
>
> This gives SkOncePtr a non-trivial destructor that uses std::default_delete
> by default. This is overrideable, as seen in SkColorTable.
>
> SK_DECLARE_STATIC_ONCE_PTR still just leaves its pointers hanging at EOP.
>
> BUG=skia:
>
> No public API changes.
> TBR=reed@google.com
>
> Committed: https://skia.googlesource.com/skia/+/a1254acdb344174e761f5061c820559dab64a74c
TBR=herb@google.com,mtklein@chromium.org
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG=skia:
Review URL: https://codereview.chromium.org/1334523002
|
This gives SkOncePtr a non-trivial destructor that uses std::default_delete
by default. This is overrideable, as seen in SkColorTable.
SK_DECLARE_STATIC_ONCE_PTR still just leaves its pointers hanging at EOP.
BUG=skia:
No public API changes.
TBR=reed@google.com
Review URL: https://codereview.chromium.org/1322933005
|
As you'll see from the BUG line, we have a strong indication that the new Sk4px
methods regress some devices. This restores the old code as literally as possible
while still fitting in the SkOpts framework.
This is ideally temporary breathing room.
We should get an early indication of whether those bugs will improve by watching https://perf.skia.org/#4004
BUG=skia:4117,525844,519596,524149
Review URL: https://codereview.chromium.org/1312763009
|
(Sk4f(float) is statically initializable, unlike the old SkPMFloat(SkPMColor).)
BUG=skia:4117
Review URL: https://codereview.chromium.org/1317593007
|
BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot
Review URL: https://codereview.chromium.org/1312053004
|
This switches over SkXfermodes_opts.h and SkColorMatrixFilter to use Sk4f,
and converts the SkPMFloat benches to Sk4f benches.
No pixels should change here, and no code beyond the Sk4f_ benches should change speed.
The benches are faster than the old versions.
BUG=skia:4117
Review URL: https://codereview.chromium.org/1324743002
|
This lets us avoid conversions to [0.0, 1.0] space and rounding that aren't necessary
for SkColorCubeFilter_opts.h.
Dropping rounding on the way back to bytes means we'll see a bunch of off-by-1 diffs.
Rough perf effect:
SSSE3: 110 -> 93 (~15%)
NEON: 465 -> 375 (~20%)
This is the beginning of the end for SkPMFloat as an entity distinct from Sk4f.
I've kept it for now so I can convert sites one by one and think about how things
that really want to keep PM color order will work.
BUG=skia:4117
Review URL: https://codereview.chromium.org/1319413003
|
DOCS_PREVIEW= https://skia.org/?cl=1316233002
Review URL: https://codereview.chromium.org/1316233002
|
SkPMFloat(0) and SkPMFloat(0,0,0,0) end up with the same value,
but the first goes through math to get there. The second is a lot more
transparent to the compiler, and should compile all the way down to
just `xorps xmmN,xmmN` or even be optimized away.
Didn't measure any additional benefit from hoisting the zero outside
the loop and writing `SkPMFloat color = zero;`.
Perf win is <2%.
BUG=skia:
Review URL: https://codereview.chromium.org/1314763007
|