| Commit message | Author | Age |
This change complements the existing `InstantiateOptions::executor_type`
option, which takes precedence over the attr if both are provided. It
enables the choice of executor to be separated from both the calling
op implementation and the function definition, which simplifies the
use of custom executors in operations that take a function as an attr
(e.g., `tf.data` and the functional control-flow ops).
PiperOrigin-RevId: 216532778
call for better xprof tracing. Also annotate synchronous op execution with the session-run id (or step_id) as metadata, leveraging the support introduced in cl/215985561.
This should enable highlighting the duration of a Session::Run and all the ops that ran in it, for visualizing latency regressions in the case of CPU inference.
PiperOrigin-RevId: 216284682
Doesn't attempt to deal with cases where we might have already generated
the FunctionDef for the parent function, since in that case we cannot
easily modify the forward pass.
PiperOrigin-RevId: 216243224
PiperOrigin-RevId: 216187878
Enable GPU tests for cond_v2.
PiperOrigin-RevId: 215956220
PiperOrigin-RevId: 215946205
attr values that are not overridden (e.g., transpose_a in the matmul op).
This is required for backward compatibility (a binary built via an older version
of TF should still run on a newer version of TF, where some ops may have added
attrs).
For non-eager graph building, the default attr values of graph ops are added by
tensorflow::AddDefaultsToNodeDef().
We ran into this issue when running the same S4TF test cases via eager APIs --
some tests failed due to "missing attrs", but they are fixed by this patch.
PiperOrigin-RevId: 215927271
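The default-filling behavior above can be sketched in Python. This is an illustrative stand-in, not the real `tensorflow::AddDefaultsToNodeDef()` signature; the `add_defaults` helper and the attr dictionaries are hypothetical.

```python
# Hypothetical illustration of default-attr filling, mirroring what
# tensorflow::AddDefaultsToNodeDef() does for graph ops: any attr the
# caller leaves unset is populated from the op definition's default.

def add_defaults(node_attrs, op_def_defaults):
    """Return node attrs with missing entries filled from op-def defaults."""
    merged = dict(op_def_defaults)  # start from the defaults
    merged.update(node_attrs)       # caller-provided values win
    return merged

# An older binary may omit newly added attrs such as transpose_a; the
# runtime fills them in so the op still validates on a newer TF.
matmul_defaults = {"transpose_a": False, "transpose_b": False}
attrs = add_defaults({"transpose_b": True}, matmul_defaults)
```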
An environment variable (TF_EAGER_ENABLE_SMALL_TENSOR_CPU_PINNING) is provided to turn this off if necessary (it is on by default).
PiperOrigin-RevId: 215821915
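Given the default-on behavior, the pinning heuristic can be disabled with the environment variable named above. A minimal sketch; the accepted value string `"false"` is an assumption (the change does not show the parsing), and the variable must be set before TensorFlow initializes its eager context.

```python
# Sketch: disabling small-tensor CPU pinning via the environment
# variable introduced in this change. The value "false" is an assumed
# boolean-env spelling; set it before importing TensorFlow.
import os

os.environ["TF_EAGER_ENABLE_SMALL_TENSOR_CPU_PINNING"] = "false"
# import tensorflow as tf  # import only after the variable is set
```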
Switch or Merge node.".
PiperOrigin-RevId: 215772272
in a lambda
UNLOCK_FUNCTION(ir->out_mu) annotates that the lock is held on entry,
so try_lock() should not be called.
PiperOrigin-RevId: 215769341
Avoids LOG(ERROR) spam when the Executor is unable to find a CPU kernel.
PiperOrigin-RevId: 215738481
In the process, properly place nodes on devices in the collective graph key
test.
PiperOrigin-RevId: 215616146
PiperOrigin-RevId: 215560522
PiperOrigin-RevId: 215292521
Prior to this change, the lowering pass assumed that the If op
functions would be available in the If op's graph. If the If op is
defined in a defun and then called via eager execution, the functions
will be in the eager context, but not in the defun's graph. This
change makes the lowering pass correctly use the function library
passed in by the caller via GraphOptimizationPassOptions.
PiperOrigin-RevId: 215271990
MKL is disabled, and with some minor changes
the duration of a single RunInternal() call from RunHandlerPool. It is used for
running inter-op closures with a global scheduler (to be added in the future) to
improve both median and tail latency (for use cases like CPU inference).
In the case that global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214992852
variable TF_DISABLE_MKL=1
PiperOrigin-RevId: 214853860
PiperOrigin-RevId: 214853846
PiperOrigin-RevId: 214821528
the duration of a single RunInternal() call from RunHandlerPool.
We want to leverage this abstraction to improve cross-session inter-op
parallelism for lower-latency inference in the future.
In the case that global pools aren't used, this change should be a no-op.
PiperOrigin-RevId: 214818187
Before this change, a CollectiveOp user was required to specify subdiv_offsets
for the RingReduce algorithm. During ring reduction, we created chunks of the
tensor to exchange between devices. If the chunks were too large, or if the
hardware supported multiple data exchanges in parallel, the user could further
subdivide the chunk by specifying more than one subdiv offset. Each subdiv
offset corresponded to another subdivision of the chunk, so effectively the
total number of tensor chunks is the number of devices times the number of
subdivs.
After this change, we can dynamically infer the number of subdivisions based on
a target chunk size. In ring_reducer.cc, we start with 1 subdiv, and keep
increasing until the chunk size is less than MAX_CHUNK_SIZE. Currently,
MAX_CHUNK_SIZE is set at 4 MB, although it may make sense to change this based
on specific hardware.
As a part of this change, a user can now provide an empty subdiv_offsets list.
If empty, we dynamically add subdivisions based on the above algorithm. If
non-empty, we take the user-specified subdivisions.
PiperOrigin-RevId: 214815959
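The inference loop described above can be sketched as follows. This is an illustrative Python version, not the actual ring_reducer.cc code; the real implementation may step through candidate subdiv counts differently.

```python
# Illustrative sketch of inferring the number of subdivisions: start at
# 1 and keep increasing until the per-chunk size drops below
# MAX_CHUNK_SIZE (4 MB in this change).
MAX_CHUNK_SIZE = 4 * 1024 * 1024  # bytes

def infer_num_subdivs(tensor_bytes, num_devices):
    """Smallest subdiv count whose chunk size is under MAX_CHUNK_SIZE."""
    num_subdivs = 1
    # Total chunks = num_devices * num_subdivs, so each chunk holds
    # tensor_bytes / (num_devices * num_subdivs) bytes.
    while tensor_bytes / (num_devices * num_subdivs) >= MAX_CHUNK_SIZE:
        num_subdivs += 1
    return num_subdivs

# A 64 MB tensor reduced across 4 devices needs 5 subdivisions here:
# 64 MB / (4 * 5) = 3.2 MB per chunk, the first value under 4 MB.
```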
PiperOrigin-RevId: 214802032
PiperOrigin-RevId: 214723970
functionalization.
If we want to evaluate the SymbolicGradient op in constant folding, we need to construct a Device object and attach it to the FunctionLibraryRuntime. In the graph rewriting pass, we do not have a Device object created yet; it will only be created in XlaCompiler.
PiperOrigin-RevId: 214702943
Make shape inference lazy in optimizers that may not trigger.
PiperOrigin-RevId: 214669034
PiperOrigin-RevId: 214557082
PiperOrigin-RevId: 214380876
All devices implement the same tracing logic in an override of `Device::Compute()`. However, that logic does not have access to the cached `NodeItem::kernel_is_expensive` bit for the kernel, so it must make a virtual call to `OpKernel::IsExpensive()`. By inlining the logic into `ExecutorState::Process()`, we avoid making an unnecessary virtual call on each kernel invocation (when a trace controller is attached).
PiperOrigin-RevId: 214332492
This change switches `tf.contrib.data.Optional` to use a `Structure` class to represent
the structure of its value, instead of `output_types`, `output_shapes`, and `output_classes` properties. It adds support for nesting `Optional` objects and representing their structure.
This change also modifies the `Structure` class: `Structure.is_compatible_with(x)` now takes another `Structure` as the `x` argument, instead of a value. This makes it easier to work with nested structures (where we might not have a value readily available), and better matches the interface of other `is_compatible_with()` methods (e.g. in `tf.TensorShape` and `tf.DType`).
Finally, in the process of making this change, I observed possible crash failures when a DT_VARIANT tensor containing another DT_VARIANT tensor is copied between CPU and GPU. This change "fixes" the immediate problem by raising an UnimplementedError, but more work will be necessary to support the full range of use cases.
PiperOrigin-RevId: 214198993
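The structure-to-structure compatibility check described above can be illustrated with a minimal sketch. The `TensorStructure` class below is hypothetical, not the actual `tf.data` `Structure` class (which also handles nesting, sparse tensors, and more); it only shows the shape of the new API, in the spirit of `tf.TensorShape.is_compatible_with`.

```python
# Hypothetical minimal structure class illustrating the new
# Structure-vs-Structure compatibility check.

class TensorStructure:
    def __init__(self, dtype, shape):
        self.dtype = dtype
        self.shape = shape  # list of dims; None means unknown

    def is_compatible_with(self, other):
        """Takes another TensorStructure, not a value."""
        if not isinstance(other, TensorStructure):
            return False
        if self.dtype != other.dtype:
            return False
        if len(self.shape) != len(other.shape):
            return False
        # None acts as a wildcard dimension, as in tf.TensorShape.
        return all(a is None or b is None or a == b
                   for a, b in zip(self.shape, other.shape))

s1 = TensorStructure("float32", [None, 3])
s2 = TensorStructure("float32", [5, 3])
```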
GPU). This avoids many unnecessary CPU<->GPU memcpys and syncs.
PiperOrigin-RevId: 214108484
In `ExecutorState::PropagateOutputs()`, each time a loop enter node is
processed, the node's attrs are consulted to determine if it is a
"constant" or "non-constant" enter node. This entails a call to the
protobuf library, followed by multiple string comparisons to find the
attribute in the Node's NodeDef's attr map. The value of this property
never changes after the executor is first constructed, so in this
change we move it to a cached field on the `NodeItem` struct, and use
that value.
PiperOrigin-RevId: 214047449
Thanks @alextp for finding the bug!
PiperOrigin-RevId: 213999971
PiperOrigin-RevId: 213906379
PiperOrigin-RevId: 213875284
PiperOrigin-RevId: 213844688
PiperOrigin-RevId: 213770000
TF_FORCE_GPU_ALLOW_GROWTH environment variable.
PiperOrigin-RevId: 213728460
ROCmSoftwarePlatform:upstream-staging-gpu-common-runtime-1
PiperOrigin-RevId: 213653830
lightweight statistics collector for tf.data performance modeling.
PiperOrigin-RevId: 213566889
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU I/O and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA-specific CPU devices. It also should fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
and hence can be detected at that time).
PiperOrigin-RevId: 213505655
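The constructor-time visitor requirement can be sketched as follows. These are hypothetical Python classes, not the actual C++ SubAllocator interface; the point is only the design: visitors are fixed at construction, so every region allocated afterwards is guaranteed to be visited.

```python
# Hypothetical sketch of the SubAllocator design described above:
# alloc visitors are supplied at construction time, so none can be
# added after the first region has been handed out.

class SubAllocator:
    def __init__(self, alloc_visitors):
        # Visitors are fixed here, before any allocation can occur.
        self.alloc_visitors = list(alloc_visitors)

    def alloc(self, num_bytes):
        region = bytearray(num_bytes)  # stand-in for a real reservation
        # Each visitor sees every large region, e.g. to register it
        # for DMA or RDMA access.
        for visit in self.alloc_visitors:
            visit(region, num_bytes)
        return region

registered = []
sub = SubAllocator([lambda ptr, n: registered.append(n)])
buf = sub.alloc(1 << 20)  # one large region, visited once
```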
This fixes #22274.
Signed-off-by: Bairen Yi <byi@connect.ust.hk>
PiperOrigin-RevId: 213394522
PiperOrigin-RevId: 213377426
The visitor pattern is used to allow pre-registration of memory for
DMA access, e.g. for fast GPU/CPU I/O and for RDMA networking. The
VisitableAllocator interface was introduced to support this use some
time ago, prior to SubAllocators. Memory registration works best if
it's done infrequently, on large pieces of memory, rather than on
every piece that's dynamically allocated/freed. This usage pattern
fits the SubAllocator better than a general Allocator. This change
moves memory allocation visitor access to SubAllocator and eliminates
the VisitableAllocator subclass of Allocator.
This change also more rigorously enforces the requirement that all
Visitors be declared prior to memory allocation beginning. This is
accomplished by requiring that Visitors be provided to the SubAllocator
constructor.
This refactoring will ease an upcoming CL introducing
NUMA-specific CPU devices. It also should fix some performance
pitfalls (e.g. accidental use of PoolAllocator) introduced by an
earlier refactoring of ProcessState that was also in preparation for
NUMA. It restores the default use of the cpu_allocator() value (i.e.
no SubAllocator) by model executions that don't use allocation
visitors (since visitor registration must precede the first allocation,
and hence can be detected at that time).
PiperOrigin-RevId: 213371553