eigen - C++ library for linear algebra

	Commit message (Collapse)	Author	Age
*	[SYCL] Rebasing the SYCL support branch on top of the Einge upstream master ↵	Mehdi Goli	2019-11-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	branch. * Unifying all loadLocalTile from lhs and rhs to an extract_block function. * Adding get_tensor operation which was missing in TensorContractionMapper. * Adding the -D method missing from cmake for Disable_Skinny Contraction operation. * Wrapping all the indices in TensorScanSycl into Scan parameter struct. * Fixing typo in Device SYCL * Unifying load to private register for tall/skinny no shared * Unifying load to vector tile for tensor-vector/vector-tensor operation * Removing all the LHS/RHS class for extracting data from global * Removing Outputfunction from TensorContractionSkinnyNoshared. * Combining the local memory version of tall/skinny and normal tensor contraction into one kernel. * Combining the no-local memory version of tall/skinny and normal tensor contraction into one kernel. * Combining General Tensor-Vector and VectorTensor contraction into one kernel. * Making double buffering optional for Tensor contraction when local memory is version is used. * Modifying benchmark to accept custom Reduction Sizes * Disabling AVX optimization for SYCL backend on the host to allow SSE optimization to the host * Adding Test for SYCL * Modifying SYCL CMake
*	STYLE: Remove CMake-language block-end command arguments	Hans Johnson	2019-10-31
\| \| \| \| \| \|	Ancient versions of CMake required else(), endif(), and similar block termination commands to have arguments matching the command starting the block. This is no longer the preferred style.
*	bug #1281: fix AutoDiffScalar's make_coherent for nested expression of ↵	Gael Guennebaud	2019-11-14
\| \| \| \|	constant ADs.
*	Remove legacy block evaluation support	Eugene Zhulenev	2019-11-12
\|
*	Fix data race in css11_tensor_notification test.	Rasmus Munk Larsen	2019-11-08
\|
*	Add block evaluation V2 to TensorAsyncExecutor.	Rasmus Munk Larsen	2019-10-22
\| \| \| \|	Add async evaluation to a number of ops.
*	Drop support for c++03 in Eigen tensor. Get rid of some code used to emulate ↵	Rasmus Munk Larsen	2019-10-18
\| \| \| \|	c++11 functionality with older compilers.
*	Cleanup Tensor block destination and materialized block storage allocation	Eugene Zhulenev	2019-10-16
\|
*	TensorBroadcasting support for random/uniform blocks	Eugene Zhulenev	2019-10-16
\|
*	Block evaluation for TensorGenerator/TensorReverse/TensorShuffling	Eugene Zhulenev	2019-10-14
\|
*	Block evaluation for TensorGenerator + TensorReverse + fixed bug in tensor ↵	Eugene Zhulenev	2019-10-10
\| \| \| \|	reverse op
*	Block evaluation for TensorChipping + fixed bugs in TensorPadding and ↵	Eugene Zhulenev	2019-10-09
\| \| \| \|	TensorSlicing
*	Implement c++03 compatible fix for changeset ↵	Gael Guennebaud	2019-10-09
\| \| \| \|	7a43af1a335da2c0489b4119a33ee1cbff0c15d6
*	Fix compilation of FFTW unit test	Gael Guennebaud	2019-10-08
\|
*	Add block evaluation to TensorEvalTo and fix few small bugs	Eugene Zhulenev	2019-10-07
\|
*	Fix compilation warnings and errors with clang in TensorBlockV2 code and tests	Eugene Zhulenev	2019-10-04
\|
*	Add block evaluation to TensorReshaping/TensorCasting/TensorPadding/TensorSelect	Eugene Zhulenev	2019-10-02
\|
*	Fix cxx11_tensor_block_io test	Eugene Zhulenev	2019-09-25
\|
*	Fix compilation warnings and errors with clang in TensorBlockV2	Eugene Zhulenev	2019-09-25
\|
*	Add new TensorBlock api implementation + tests	Eugene Zhulenev	2019-09-24
\|
*	Tensor block evaluation V2 support for unary/binary/broadcsting	Eugene Zhulenev	2019-09-24
\|
*	Add support for asynchronous evaluation of tensor casting expressions.	Rasmus Munk Larsen	2019-09-19
\|
*	Merging eigen/eigen.	Srinivas Vasudevan	2019-09-16
\|\
* \|	Add Bessel functions to SpecialFunctions.	Srinivas Vasudevan	2019-09-14
\|/ \| \| \| \| \| \| \| \|	- Split SpecialFunctions files in to a separate BesselFunctions file. In particular add: - Modified bessel functions of the second kind k0, k1, k0e, k1e - Bessel functions of the first kind j0, j1 - Bessel functions of the second kind y0, y1
*	Fix for the HIP build+test errors introduced by the ndtri support.	Deven Desai	2019-09-06
\| \| \| \| \| \| \|	The fixes needed are * adding EIGEN_DEVICE_FUNC attribute to a couple of funcs (else HIPCC will error out when non-device funcs are called from global/device funcs) * switching to using ::<math_func> instead std::<math_func> (only for HIPCC) in cases where the std::<math_func> is not recognized as a device func by HIPCC * removing an errant "j" from a testcase (don't know how that made it in to begin with!)
*	Update ThreadLocal to use separate Initialize/Release callables	Eugene Zhulenev	2019-09-10
\|
*	ThreadLocal container that does not rely on thread local storage	Eugene Zhulenev	2019-09-09
\|
*	PR 681: Add ndtri function, the inverse of the normal distribution function.	Srinivas Vasudevan	2019-08-12
\|
*	Allow move-only done callback in TensorAsyncDevice	Eugene Zhulenev	2019-09-03
\|
*	Add test for const TensorMap underlying data mutation	Eugene Zhulenev	2019-09-03
\|
*	evalSubExprsIfNeededAsync + async TensorContractionThreadPool	Eugene Zhulenev	2019-08-30
\|
*	Asynchronous expression evaluation with TensorAsyncDevice	Eugene Zhulenev	2019-08-30
\|
*	Const correctness in TensorMap<const Tensor<T, ...>> expressions	Eugene Zhulenev	2019-08-28
\|
*	Remove XSMM support from Tensor module	Eugene Zhulenev	2019-08-19
\|
*	Disable tests for contraction with output kernels when using libxsmm, which ↵	Rasmus Munk Larsen	2019-08-07
\| \| \| \|	does not support this.
*	Merge with Eigen head	Eugene Zhulenev	2019-06-28
\|\
* \|	Add block access to TensorReverseOp and make sure that TensorForcedEval uses ↵	Eugene Zhulenev	2019-06-28
\| \| \| \| \| \| \| \|	block access when preferred
\| *	[SYCL] This PR adds the minimum modifications to the Eigen unsupported ↵	Mehdi Goli	2019-06-28
\|/ \| \| \| \| \| \| \| \| \|	module required to run it on devices supporting SYCL. * Abstracting the pointer type so that both SYCL memory and pointer can be captured. * Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class. * Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node. * Adding SYCL macro for controlling loop unrolling. * Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
*	Minor build improvements	tra	2019-05-31
\| \| \| \| \| \| \| \|	* Allow specifying multiple GPU architectures. E.g.: cmake -DEIGEN_CUDA_COMPUTE_ARCH="60;70" * Pass CUDA SDK path to clang. Without it it will default to /usr/local/cuda which may not be the right location, if cmake was invoked with -DCUDA_TOOLKIT_ROOT_DIR=/some/other/CUDA/path
*	Merged in rmlarsen/eigen_threadpool (pull request PR-640)	Rasmus Larsen	2019-05-13
\|\ \| \| \| \| \| \| \| \| \| \|	Fix deadlocks in thread pool. Approved-by: Eugene Zhulenev <ezhulenev@google.com>
* \|	bug #1707: Fix deprecation warnings, or disable warnings when testing ↵	Christoph Hertzberg	2019-05-10
\| \| \| \| \| \| \| \|	deprecated functions
\| *	A) fix deadlocks in thread pool caused by EventCount	Rasmus Munk Larsen	2019-05-08
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixed 2 deadlocks caused by sloppiness in the EventCount logic. Both most likely were introduced by cl/236729920 which includes the new EventCount algorithm: https://github.com/eigenteam/eigen-git-mirror/commit/01da8caf003990967e42a2b9dc3869f154569538 bug #1 (Prewait): Prewait must not consume existing signals. Consider the following scenario. There are 2 thread pool threads (1 and 2) and 1 external thread (3). RunQueue is empty. Thread 1 checks the queue, calls Prewait, checks RunQueue again and now is going to call CommitWait. Thread 2 checks the queue and now is going to call Prewait. Thread 3 submits 2 tasks, EventCount signals is set to 1 because only 1 waiter is registered the second signal is discarded). Now thread 2 resumes and calls Prewait and takes away the signal. Thread 1 resumes and calls CommitWait, there are no pending signals anymore, so it blocks. As the result we have 2 tasks, but only 1 thread is running. bug #2 (CancelWait): CancelWait must not take away a signal if it's not sure that the signal was meant for this thread. When one thread blocks and another submits a new task concurrently, the EventCount protocol guarantees only the following properties (similar to the Dekker's algorithm): (a) the registered waiter notices presence of the new task and does not block (b) the signaler notices presence of the waiters and wakes it (c) both the waiter notices presence of the new task and signaler notices presence of the waiter [it's only that both of them do not notice each other must not be possible, because it would lead to a deadlock] CancelWait is called for cases (a) and (c). For case (c) it is OK to take the notification signal away, but it's not OK for (a) because nobody queued a signals for us and we take away a signal meant for somebody else. Consider: Thread 1 calls Prewait, checks RunQueue, it's empty, now it's going to call CommitWait. Thread 3 submits 2 tasks, EventCount signals is set to 1 because only 1 waiter is registered the second signal is discarded). Thread 2 calls Prewait, checks RunQueue, discovers the tasks, calls CancelWait and consumes the pending signal (meant for thread 1). Now Thread 1 resumes and calls CommitWait, since there are no signals it blocks. As the result we have 2 tasks, but only 1 thread is running. Both deadlocks are only a problem if the tasks require parallelism. Most computational tasks do not require parallelism, i.e. a single thread will run task 1, finish it and then dequeue and run task 2. This fix undoes some of the sloppiness in the EventCount that was meant to reduce CPU consumption by idle threads, because we now have more threads running in these corner cases. But we still don't have pthread_yield's and maybe the strictness introduced by this change will actually help to reduce tail latency because we will have threads running when we actually need them running. B) fix deadlock in thread pool caused by RunQueue This fixed a deadlock caused by sloppiness in the RunQueue logic. Most likely this was introduced with the non-blocking thread pool. The deadlock only affects workloads that require parallelism. Most computational tasks don't require parallelism. PopBack must not fail spuriously. If it does, it can effectively lead to single thread consuming several wake up signals. Consider 2 worker threads are blocked. External thread submits a task. One of the threads is woken. It tries to steal the task, but fails due to a spurious failure in PopBack (external thread submits another task and holds the lock). The thread executes blocking protocol again (it won't block because NonEmptyQueueIndex is precise and the thread will discover pending work, but it has called PrepareWait). Now external thread submits another task and signals EventCount again. The signal is consumed by the first thread again. But now we have 2 tasks pending but only 1 worker thread running. It may be possible to fix this in a different way: make EventCount::CancelWait forward wakeup signal to a blocked thread rather then consuming it. But this looks more complex and I am not 100% that it will fix the bug. It's also possible to have 2 versions of PopBack: one will do try_to_lock and another won't. Then worker threads could first opportunistically check all queues with try_to_lock, and only use the blocking version before blocking. But let's first fix the bug with the simpler change.
*	Block evaluation for TensorGeneratorOp	Eugene Zhulenev	2019-03-05
\|
*	Do not create Tensor<const T> in cxx11_tensor_forced_eval test	Eugene Zhulenev	2019-03-05
\|
*	Add tiled evaluation for TensorForcedEvalOp	Eugene Zhulenev	2019-03-04
\|
*	Improve EventCount used by the non-blocking threadpool.	Rasmus Munk Larsen	2019-02-22
\| \| \| \| \| \| \| \| \| \|	The current algorithm requires threads to commit/cancel waiting in order they called Prewait. Spinning caused by that serialization can consume lots of CPU time on some workloads. Restructure the algorithm to not require that serialization and remove spin waits from Commit/CancelWait. Note: this reduces max number of threads from 2^16 to 2^14 to leave more space for ABA counter (which is now 22 bits). Implementation details are explained in comments.
*	Avoid `I` as an identifier, since it may clash with the C-header complex.h	Christoph Hertzberg	2019-01-25
\|
*	Fix flaky test for tensor fft.	Rasmus Munk Larsen	2019-01-16
\|
*	bug #1654: fix compilation with cuda and no c++11	Gael Guennebaud	2019-01-09
\|
*	Various fixes in polynomial solver and its unit tests:	Gael Guennebaud	2018-12-09
\| \| \| \| \| \|	- cleanup noise in imaginary part of real roots - take into account the magnitude of the derivative to check roots. - use <= instead of < at appropriate places