| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
| |
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1662
Merge 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b into cba31a956209e68e4d4049e8a9bc03b1fd67320a
Merging this change closes #1662
COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1662 from pps83:crc-add 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b
PiperOrigin-RevId: 631470883
Change-Id: I4a72be643ed341ddf0e0007418ab4a613a03db4b
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1653
CRC32_u64 returns uint32_t, no need to cast returned result to uint32_t
Merge 90e7b063f39c6b1559a21832d764e500e1cdd40c into 9a61b00dde4031f17ed4fa4bdc0e0e9ad8859846
Merging this change closes #1653
COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1653 from pps83:CRC32_u64-cast 90e7b063f39c6b1559a21832d764e500e1cdd40c
PiperOrigin-RevId: 626462347
Change-Id: I748a2da5fcc66eb6aa07aaf0fbc7eca927fcbb16
|
|
|
|
|
|
|
| |
This removes redundant vector-vector moves and results in Extend being up to 3% faster.
PiperOrigin-RevId: 621948170
Change-Id: Id82816aa6e294d34140ff591103cb20feac79d9a
|
|
|
|
|
|
|
|
|
|
| |
This change will allow the AVX version of non-temporal memcpy to be compiled
even if the compiler isn't run with AVX support. This allows runtime dispatch
to select the AVX implementation for CPUs that are known to be compatible with
AVX instructions.
PiperOrigin-RevId: 619594422
Change-Id: Ia7d92404ef8d10d152030b29b71948ed954f28f5
|
|
|
|
|
| |
PiperOrigin-RevId: 615504707
Change-Id: Ia0e8211bd3c3d28fd0715c8f296ec50f6a700757
|
|
|
|
|
| |
PiperOrigin-RevId: 615160537
Change-Id: I29070c898104c55e6563eed0eef7397441bef1d7
|
|
|
|
|
| |
PiperOrigin-RevId: 615017130
Change-Id: I73277de8ece31d6a35b47dbdb205b473324b74a2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1617
The intrinsics used aren't available on `x86_64` processors while running in 32-bit mode. See:
- list of 64-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170)
- list of 32-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170)
- list of predefined MSVC macros (https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170)
The error message in question:
```console
F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(145,32): error C3861: '_mm_crc32_u64': identifier not found
return static_cast<uint32_t>(_mm_crc32_u64(crc, v));
^
F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(193,50): error C3861: '_mm_cvtsi128_si64': identifier not found
inline int64_t V128_Low64(const V128 l) { return _mm_cvtsi128_si64(l); }
```
Merge 06f5832108a2b01e0a900db51e1c870f7069a1f2 into 797501d12ea767dabdc8d36674e083869e62ee7d
Merging this change closes #1617
COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1617 from sthalik:pr/fix-msvc-32-bit-avx 06f5832108a2b01e0a900db51e1c870f7069a1f2
PiperOrigin-RevId: 607483370
Change-Id: Id2a6f6dd33c2707fe7ffe134e7335916f3fb9da3
|
|
|
|
|
|
|
| |
Note that this only changes how we allocate the empty state, and reference countings of `empty` stay the same.
PiperOrigin-RevId: 599526339
Change-Id: I2c6aaf875c144c947e17fe8f69692b1195b55dd7
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
According to https://stackoverflow.com/a/68939636 it is safe to use
__m128i instead.
https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170 also uses this type instead
__m128i_u is just __m128i with a looser alignment requirement, but
simply calling _mm_loadu_si128() instead of _mm_load_si128() is enough to
tell the compiler when a pointer is unaligned.
Fixes #1552
PiperOrigin-RevId: 576931936
Change-Id: I7c3530001149b360c12a1786c7e1832754d0e35c
|
|
|
|
|
| |
PiperOrigin-RevId: 571430428
Change-Id: I4777c37c5287d26a75f37fe059324ac218878f0e
|
|
|
|
|
|
|
| |
Siryn's crc32 instruction seems to have latency 3 and throughput 1, which makes the optimal ratio of pmull and crc streams close to that of tested x86 machines. Up to +120% faster for large inputs.
PiperOrigin-RevId: 568645559
Change-Id: I86b85b1b2a5d4fb3680c516c4c9044238b20fe61
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a temporary workaround for an apparent compiler bug with pmull(2) instructions. The current hot loop looks like this:
mov w14, #0xef02,
lsl x15, x15, #6,
mov x13, xzr,
movk w14, #0x740e, lsl #16,
sub x15, x15, #0x40,
ldr q4, [x16, #0x4e0],
_LOOP_START:
add x16, x9, x13,
add x17, x12, x13,
fmov d19, x14, <--------- This is Loop invariant and expensive
add x13, x13, #0x40,
cmp x15, x13,
prfm pldl1keep, [x16, #0x140],
prfm pldl1keep, [x17, #0x140],
ldp x18, x0, [x16, #0x40],
crc32cx w10, w10, x18,
ldp x2, x18, [x16, #0x50],
crc32cx w10, w10, x0,
crc32cx w10, w10, x2,
ldp x0, x2, [x16, #0x60],
crc32cx w10, w10, x18,
ldp x18, x16, [x16, #0x70],
pmull2 v5.1q, v1.2d, v4.2d,
pmull2 v6.1q, v0.2d, v4.2d,
pmull2 v7.1q, v2.2d, v4.2d,
pmull2 v16.1q, v3.2d, v4.2d,
ldp q17, q18, [x17, #0x40],
crc32cx w10, w10, x0,
pmull v1.1q, v1.1d, v19.1d,
crc32cx w10, w10, x2,
pmull v0.1q, v0.1d, v19.1d,
crc32cx w10, w10, x18,
pmull v2.1q, v2.1d, v19.1d,
crc32cx w10, w10, x16,
pmull v3.1q, v3.1d, v19.1d,
ldp q20, q21, [x17, #0x60],
eor v1.16b, v17.16b, v1.16b,
eor v0.16b, v18.16b, v0.16b,
eor v1.16b, v1.16b, v5.16b,
eor v2.16b, v20.16b, v2.16b,
eor v0.16b, v0.16b, v6.16b,
eor v3.16b, v21.16b, v3.16b,
eor v2.16b, v2.16b, v7.16b,
eor v3.16b, v3.16b, v16.16b,
b.ne _LOOP_START
There is a redundant fmov that moves the same constant into a Neon register every loop iteration to be used in the PMULL instructions. The PMULL2 instructions already have this constant loaded into Neon registers. After this change, both the PMULL and PMULL2 instructions use the values in q4, and they are not reloaded every iteration. This fmov was expensive because it contends for execution units with crc32cx instructions. This is up to 20% faster for large inputs.
PiperOrigin-RevId: 567391972
Change-Id: I4c8e49750cfa5cc5730c3bb713bd9fd67657804a
|
|
|
|
|
| |
PiperOrigin-RevId: 565662176
Change-Id: I18d5d9eb444b0090e3f4ab8f66ad214a67344268
|
|
|
|
|
|
|
| |
This is a rename only with no other changes.
PiperOrigin-RevId: 563428969
Change-Id: Iefc184bf9a233cb72649bc20b8555f6b662cac6d
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This CL rolls forward a previous change which we rolled back temporarily due to
compilation errors on x86 when PCLMUL intrinsics were unavailable.
*** Original change description ***
This change replaces inline x86 intrinsics with generic versions that compile
for both x86 and ARM depending on the target arch.
This change does not enable the accelerated crc memcpy engine on ARM. That will
be done in a subsequent change after the optimal number of vector and integer
regions for different CPUs is determined.
***
PiperOrigin-RevId: 563416413
Change-Id: Iee630a15ed83c26659adb0e8a03d3f3d3a46d688
|
|
|
|
|
|
|
|
| |
In some configurations this change causes compilation errors. We will roll this
forward again after those issue are addressed.
PiperOrigin-RevId: 562810916
Change-Id: I45b2a8d456273e9eff188f36da8f11323c4dfe66
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change replaces inline x86 intrinsics with generic versions that compile
for both x86 and ARM depending on the target arch.
This change does not enable the accelerated crc memcpy engine on ARM. That will
be done in a subsequent change after the optimal number of vector and integer
regions for different CPUs is determined.
PiperOrigin-RevId: 562785420
Change-Id: I8ba4aa8de17587cedd92532f03767059a481f159
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug does not affect any users currently since AcceleratedCrcMemcpyEngine
is never configured with a single region currently.
Before this CL, if the number of regions for the AcceleratedCrcMemcpyEngine was
set to one, the CRC for the sole region would be incorrectly concatenated onto
itself and corrupted.
PiperOrigin-RevId: 561663848
Change-Id: Ibfc596306ab07db906d2e3ecf6eea3f6cb9f1b2b
|
|
|
|
|
| |
PiperOrigin-RevId: 561444259
Change-Id: I205ba9f11f4d41163ce74ae9cfa417fe500ccab3
|
|
|
|
|
| |
PiperOrigin-RevId: 561119886
Change-Id: Ia1483fdb237f4b211068c7ad1f780ab3e6b81eca
|
|
|
|
|
| |
PiperOrigin-RevId: 561108037
Change-Id: Idff65e288384cb55ce69f789db2d9374ae781d3d
|
|
|
|
|
|
|
|
|
| |
Using the non-temporal AVX engine for unknown CPU types looks like a mistake to
me, and the default built into the switch case is to use the fallback engine. I
don't think this is causing issues now, but it might once we add ARM support.
PiperOrigin-RevId: 561097994
Change-Id: I7f0edd447017c09acd49e4ea11476e32740d630a
|
|
|
|
|
| |
PiperOrigin-RevId: 554936252
Change-Id: Idb2ffbbc11aa6c98414fdd1ec38873d4687ab5e7
|
|
|
|
|
|
|
| |
implementation.
PiperOrigin-RevId: 539749773
Change-Id: Iec83431ffd360a077b153cea00427580ae287d1f
|
|
|
|
|
|
|
|
|
|
|
|
| |
Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1452
__cpuid is declared in intrin.h, but is excluded on non-Windows platforms.
We add this declaration to compensate.
Fixes #1358
PiperOrigin-RevId: 534449804
Change-Id: I91027f79d8d52c4da428d5c3a53e2cec00825c13
|
|
|
|
|
| |
PiperOrigin-RevId: 534213948
Change-Id: I56b897060b9afe9d3d338756c80e52f421653b55
|
|
|
|
|
| |
PiperOrigin-RevId: 534179290
Change-Id: I9ad24518cc6a336fbaf602269fb01319491c8b60
|
| |
|
|\
| |
| |
| |
| | |
PiperOrigin-RevId: 527066823
Change-Id: Ifa1e9a43c7490b34f9f4dbfa12d3acbed6b49777
|
|/ |
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://google.github.io/styleguide/cppguide.html#Designated_initializers
recommends using designated initializers as does https://abseil.io/tips/172,
but apparently they are a non-standard extension prior to C++20.
For maximum compatibility, avoid using them here.
Fixes #1413
PiperOrigin-RevId: 516892890
Change-Id: Id7b7857891e39eb52132c3edf70e5bf4973755af
|
|
|
|
| |
This shows that these are member functions that do not modify a class's data.
|
|\
| |
| |
| |
| | |
PiperOrigin-RevId: 511271203
Change-Id: I1ed352e06265b705b62d401a50b4699d01f7f1d7
|
| |
| |
| |
| | |
These make the changed constructors match closer to the other ones that are default.
|
|/
|
|
| |
This also helps a lot with dealing with conversions and data structure creation under the hood.
|
|
|
|
|
| |
PiperOrigin-RevId: 507790741
Change-Id: I347357f9a2d698510f29b7d1b065ef73f9289292
|
|
|
|
|
| |
PiperOrigin-RevId: 505184961
Change-Id: I64482558a76abda6896bec4b2d323833b6cd7edf
|
|
|
|
|
| |
PiperOrigin-RevId: 501464530
Change-Id: I5a0929a2b88c1c158b1696634a65ffda9c4b8590
|
|
|
|
|
|
|
|
| |
This also ensures that there is only one definition of
GetArchSpecificEngines by moving the condition to a common place.
PiperOrigin-RevId: 500038304
Change-Id: If0c55d701dfdc11a1a9c8c1b34eb220435529ffb
|
|
|
|
|
|
|
|
| |
32-bit builds with SSE 4.2 do exist, and these builds do not work
without this patch.
PiperOrigin-RevId: 499498979
Change-Id: I0ade09068804655652c07d0f1ef13554464a1558
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We already prefetch in case of large inputs, do the same
for medium sized inputs as well. This is mostly neutral
for performance in most cases, so this also adds a new
bench with working size >> cache size to ensure that we
are seeing performance benefits of prefetch. Main benefits
are on AMD with hardware prefetchers turned off:
AMD prefetchers on:
name old time/op new time/op delta
BM_Calculate/0 2.43ns ± 1% 2.43ns ± 1% ~ (p=0.814 n=40+40)
BM_Calculate/1 2.50ns ± 2% 2.50ns ± 2% ~ (p=0.745 n=39+39)
BM_Calculate/100 9.17ns ± 1% 9.17ns ± 2% ~ (p=0.747 n=40+40)
BM_Calculate/10000 474ns ± 1% 474ns ± 2% ~ (p=0.749 n=40+40)
BM_Calculate/500000 22.8µs ± 1% 22.9µs ± 2% ~ (p=0.298 n=39+40)
BM_Extend/0 1.38ns ± 1% 1.38ns ± 1% ~ (p=0.651 n=40+40)
BM_Extend/1 1.53ns ± 2% 1.53ns ± 1% ~ (p=0.957 n=40+39)
BM_Extend/100 9.48ns ± 1% 9.48ns ± 2% ~ (p=1.000 n=40+40)
BM_Extend/10000 474ns ± 2% 474ns ± 1% ~ (p=0.928 n=40+40)
BM_Extend/500000 22.8µs ± 1% 22.9µs ± 2% ~ (p=0.331 n=40+40)
BM_Extend/100000000 4.79ms ± 1% 4.79ms ± 1% ~ (p=0.753 n=38+38)
BM_ExtendCacheMiss/10 25.5ms ± 2% 25.5ms ± 2% ~ (p=0.988 n=38+40)
BM_ExtendCacheMiss/100 23.1ms ± 2% 23.1ms ± 2% ~ (p=0.792 n=40+40)
BM_ExtendCacheMiss/1000 37.2ms ± 1% 28.6ms ± 2% -23.00% (p=0.000 n=38+40)
BM_ExtendCacheMiss/100000 7.77ms ± 2% 7.74ms ± 2% -0.45% (p=0.006 n=40+40)
AMD prefetchers off:
name old time/op new time/op delta
BM_Calculate/0 2.43ns ± 2% 2.43ns ± 2% ~ (p=0.351 n=40+39)
BM_Calculate/1 2.51ns ± 2% 2.51ns ± 1% ~ (p=0.535 n=40+40)
BM_Calculate/100 9.18ns ± 2% 9.15ns ± 2% ~ (p=0.120 n=38+39)
BM_Calculate/10000 475ns ± 2% 475ns ± 2% ~ (p=0.852 n=40+40)
BM_Calculate/500000 22.9µs ± 2% 22.8µs ± 2% ~ (p=0.396 n=40+40)
BM_Extend/0 1.38ns ± 2% 1.38ns ± 2% ~ (p=0.466 n=40+40)
BM_Extend/1 1.53ns ± 2% 1.53ns ± 2% ~ (p=0.914 n=40+39)
BM_Extend/100 9.49ns ± 2% 9.49ns ± 2% ~ (p=0.802 n=40+40)
BM_Extend/10000 475ns ± 2% 474ns ± 1% ~ (p=0.589 n=40+40)
BM_Extend/500000 22.8µs ± 2% 22.8µs ± 2% ~ (p=0.872 n=39+40)
BM_Extend/100000000 10.0ms ± 3% 10.0ms ± 4% ~ (p=0.355 n=40+40)
BM_ExtendCacheMiss/10 196ms ± 2% 196ms ± 2% ~ (p=0.698 n=40+40)
BM_ExtendCacheMiss/100 129ms ± 1% 129ms ± 1% ~ (p=0.602 n=36+37)
BM_ExtendCacheMiss/1000 88.6ms ± 1% 57.2ms ± 1% -35.49% (p=0.000 n=36+38)
BM_ExtendCacheMiss/100000 14.9ms ± 1% 14.9ms ± 1% ~ (p=0.888 n=39+40)
Intel skylake:
BM_Calculate/0 2.49ns ± 2% 2.44ns ± 4% -2.15% (p=0.001 n=31+34)
BM_Calculate/1 3.04ns ± 2% 2.98ns ± 9% -1.95% (p=0.003 n=31+35)
BM_Calculate/100 8.64ns ± 3% 8.53ns ± 5% ~ (p=0.065 n=31+35)
BM_Calculate/10000 290ns ± 3% 285ns ± 7% -1.80% (p=0.004 n=28+34)
BM_Calculate/500000 11.8µs ± 2% 11.6µs ± 8% -1.59% (p=0.003 n=26+34)
BM_Extend/0 1.56ns ± 1% 1.52ns ± 3% -2.44% (p=0.000 n=26+35)
BM_Extend/1 1.88ns ± 3% 1.83ns ± 6% -2.17% (p=0.001 n=27+35)
BM_Extend/100 9.31ns ± 3% 9.13ns ± 7% -1.92% (p=0.000 n=33+38)
BM_Extend/10000 290ns ± 3% 283ns ± 3% -2.45% (p=0.000 n=32+38)
BM_Extend/500000 11.8µs ± 2% 11.5µs ± 8% -1.80% (p=0.001 n=35+37)
BM_Extend/100000000 6.39ms ±10% 6.11ms ± 8% -4.34% (p=0.000 n=40+40)
BM_ExtendCacheMiss/10 36.2ms ± 7% 35.8ms ±14% ~ (p=0.281 n=33+37)
BM_ExtendCacheMiss/100 26.9ms ±15% 25.9ms ±12% -3.93% (p=0.000 n=40+40)
BM_ExtendCacheMiss/1000 23.8ms ± 5% 23.4ms ± 5% -1.68% (p=0.001 n=39+40)
BM_ExtendCacheMiss/100000 10.1ms ± 5% 10.0ms ± 4% ~ (p=0.051 n=39+39)
PiperOrigin-RevId: 495119444
Change-Id: I67bcf3b0282b5e1c43122de2837a24c16b8aded7
|
|
|
|
|
| |
PiperOrigin-RevId: 494749165
Change-Id: I8d855be9c508a9fdfb5f60e87471c0947057ecc9
|
|
|
|
|
| |
PiperOrigin-RevId: 494587777
Change-Id: I41504edca6fcf750d52602fa84a33bc7fe5fbb48
|
|\
| |
| |
| |
| | |
PiperOrigin-RevId: 493386604
Change-Id: I289cb38b4a3da5760ab7ef3976d402d165d7e10f
|
|/ |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
coverage of the accelerated CRC implementation and some differences
bewteen the internal and external implementation.
This change adds CI coverage to the
linux_clang-latest_libstdcxx_bazel.sh script assuming this script
always runs on machines of at least the Intel Haswell generation.
Fixes include:
* Remove the use of the deprecated xor operator on crc32c_t
* Remove #pragma unroll_completely, which isn't known by GCC or Clang:
https://godbolt.org/z/97j4vbacs
* Fixes for -Wsign-compare, -Wsign-conversion and -Wshorten-64-to-32
PiperOrigin-RevId: 491965029
Change-Id: Ic5e1f3a20f69fcd35fe81ebef63443ad26bf7931
|
|
|
|
|
|
|
|
|
|
| |
std::array has a special-case to allow this
https://en.cppreference.com/w/cpp/container/array
Fixes #1332
PiperOrigin-RevId: 491703960
Change-Id: Ib83a1f0865448314e463e8ebf39ae3b842f762ea
|
|
|
|
|
| |
PiperOrigin-RevId: 491681300
Change-Id: I4ecdd3bf359cda7592b6c392a2fbb61b8394f71b
|
|
|
|
|
|
|
|
|
| |
The motivation is to explicitly remove and document dangerous
operations like adding crc32c_t to a set, because equality is not
enough to guarantee uniqueness.
PiperOrigin-RevId: 491656425
Change-Id: I7b4dadc1a59ea9861e6ec7a929d64b5746467832
|