Commit message
- Move the colamd implementation into its own namespace to avoid polluting the internal namespace with Ok, Status, etc.
- Fix a signed/unsigned warning
- Turn some ugly free functions into member functions
const definitions
COLAMD_DEAD to prevent conflicts with other libraries / code.
casting, which broke the build with -march=native on Haswell/Skylake.
arguments to log1p such that log1p(inf) = inf.
than -1. Fix packet op accordingly.
|
| |
| |
| |
| | |
half to Core/arch/Default and move arch-specific packet ops to their respective sub-directories.
Asynchronous parallelFor in Eigen ThreadPoolDevice
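The idea behind an asynchronous parallel-for can be sketched in plain C++ (hypothetical names and signatures, not Eigen's actual ThreadPoolDevice API): split the range into blocks, hand each block to a worker thread, and invoke a completion callback when the last block finishes, so the caller does not block.

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <memory>
#include <thread>

// Illustrative sketch only: partition [0, n) into blocks, run f on each block
// in a worker thread, and call done() once the last block completes. The
// caller returns immediately instead of waiting for all work to finish.
void parallel_for_async(long n, long block_size,
                        std::function<void(long, long)> f,
                        std::function<void()> done) {
  long num_blocks = (n + block_size - 1) / block_size;
  auto pending = std::make_shared<std::atomic<long>>(num_blocks);
  for (long b = 0; b < num_blocks; ++b) {
    long begin = b * block_size;
    long end = std::min(n, begin + block_size);
    std::thread([=] {
      f(begin, end);
      if (pending->fetch_sub(1) == 1) done();  // last block signals completion
    }).detach();
  }
}
```

The shared atomic counter is what replaces the barrier a blocking parallelFor would wait on: whichever worker decrements it to zero fires the callback.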
|
| | |
| | |
| | |
| | |
| | | |
Newlib in the Native Client SDK does not provide the ::random function.
Implement get_random_seed for NaCl using ::rand, similarly to the Windows version.
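Such a fallback can be sketched as follows (a hedged illustration, not Eigen's exact code): where ::random is unavailable, mix the current time with a couple of ::rand outputs to derive a 64-bit seed.

```cpp
#include <cstdint>
#include <cstdlib>
#include <ctime>

// Sketch of a fallback seed source for platforms whose C library lacks
// ::random (e.g. Newlib in the NaCl SDK): XOR the wall-clock time with two
// ::rand outputs, similar in spirit to the Windows fallback mentioned above.
inline uint64_t get_random_seed_fallback() {
  uint64_t hi = static_cast<uint64_t>(std::rand());
  uint64_t lo = static_cast<uint64_t>(std::rand());
  return static_cast<uint64_t>(std::time(nullptr)) ^ ((hi << 32) | lo);
}
```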
|
| |/ |
|
| |\
| | |
| | |
| | | |
Fixes for Altivec/VSX and compilation with clang on PowerPC
Implement vectorized versions of log1p and expm1 in Eigen using Kahan's formulas, and change the scalar implementations to properly handle infinite arguments.
This actually fixes an issue in the unit test packetmath_2 with pcmp_eq when it is compiled with clang. When pcmp_eq(Packet4f,Packet4f) is used instead of pcmp_eq(Packet2d,Packet2d), the unit test fails due to NaNs in the reference vector.
Implement vectorized versions of log1p and expm1 in Eigen using Kahan's formulas, and change the scalar implementations to properly handle infinite arguments.
Depending on instruction set, significant speedups are observed for the vectorized path:
log1p wall time is reduced 60-93% (2.5x - 15x speedup)
expm1 wall time is reduced 0-85% (1x - 7x speedup)
The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly.
Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here: https://bitbucket.org/snippets/rmlarsen/MXBkpM
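Kahan's trick can be sketched in scalar C++ (hypothetical helper names; the Eigen packet versions vectorize the same algebra): compute log(1+x) as log(u)·x/(u−1) with u = 1+x, which cancels the rounding error made when forming u, plus an explicit branch so that infinite arguments propagate instead of producing NaN.

```cpp
#include <cmath>
#include <limits>

// Kahan's formula for log(1+x): let u = 1+x (rounded). For tiny x, u == 1
// and log1p(x) ~= x; otherwise log(u) * x / (u - 1) corrects the rounding
// error introduced when forming u. The isinf check makes log1p(inf) = inf.
double log1p_kahan(double x) {
  double u = 1.0 + x;
  if (u == 1.0) return x;         // |x| below the rounding threshold
  if (std::isinf(u)) return u;    // propagate +infinity
  return std::log(u) * x / (u - 1.0);
}

// Kahan's formula for exp(x)-1: let u = exp(x). For tiny x, u == 1 and
// expm1(x) ~= x; otherwise (u - 1) * x / log(u) corrects the rounding error.
double expm1_kahan(double x) {
  double u = std::exp(x);
  if (u == 1.0) return x;
  if (std::isinf(u)) return u;    // expm1(+inf) = +inf
  double um1 = u - 1.0;
  return (um1 == -1.0) ? -1.0     // x -> -inf: exp(x)-1 -> -1
                       : um1 * x / std::log(u);
}
```

The extra isinf branch is exactly the "properly handle infinite arguments" change the commit describes, and it is also why the scalar path pays the 20-30% penalty quoted above.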
The vec_vsx_ld/vec_vsx_st builtins were wrongly used for aligned load/store. In fact, they perform unaligned memory access and, even when the address is 16-byte aligned, they are much slower (at least 2x) than their aligned counterparts.
For double/Packet2d, vec_xl/vec_xst should be preferred over vec_ld/vec_st, although the latter work when cast to float/Packet4f.
Also silences some weird warnings thrown by some GCC versions; such warnings are not thrown by Clang.
If no offset is given, then it should be zero.
Also passes the full address to the vec_vsx_ld/st builtins.
Removes the useless _EIGEN_ALIGNED_PTR & _EIGEN_MASK_ALIGNMENT.
Removes unnecessary casts.
Ignoring -Wc11-extensions warnings thrown by clang at Altivec/PacketMath.h
each other.
Add specializations for complex types, since std::log1p and std::expm1 do not support complex arguments.
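Since the standard library's log1p/expm1 are real-only, a complex specialization can fall back to the non-fused forms. The helpers below are an illustrative sketch (hypothetical names, not Eigen's exact specializations); they give up the extra accuracy near zero that the fused real-valued forms provide.

```cpp
#include <complex>

// std::log1p has no std::complex overload, so compute log(1+z) directly.
template <typename T>
std::complex<T> log1p_complex(const std::complex<T>& z) {
  return std::log(std::complex<T>(T(1) + z.real(), z.imag()));
}

// Likewise, std::expm1 has no complex overload: compute exp(z) - 1 directly.
template <typename T>
std::complex<T> expm1_complex(const std::complex<T>& z) {
  return std::exp(z) - std::complex<T>(T(1), T(0));
}
```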
does not support this.
blocks if the strides are known to be 1. Provides up to 20-25% speedup of the TF cross entropy op with AVX.
A few benchmark numbers:
name                  old time/op  new time/op  delta
BM_Xent_16_10000_cpu  448µs ± 3%   389µs ± 2%   -13.21% (p=0.008 n=5+5)
BM_Xent_32_10000_cpu  575µs ± 6%   454µs ± 3%   -21.00% (p=0.008 n=5+5)
BM_Xent_64_10000_cpu  933µs ± 4%   712µs ± 1%   -23.71% (p=0.008 n=5+5)
https://bitbucket.org/eigen/eigen/pull-requests/662.
The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU:
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------------
BM_Multinomial_gpu_1_100000_4 128173 231326 2922 1.610G items/s
VS
Benchmark Time(ns) CPU(ns) Iterations
----------------------------------------------------------------------
BM_Multinomial_gpu_1_100000_4 146683 246914 2719 1.509G items/s
intended to be part of the code.