path: root/tensorflow/python/training/monitored_session.py
Commit message | Author | Age
* Wait for shared resources to initialize before initializing local resources. | A. Unique TensorFlower | 2018-10-08
  Shared resources are functionally very similar to global variables and are initialized at the same time. However, workers wait only for global variables to be initialized, so there is a race condition in which a shared resource is sometimes not yet ready.
  PiperOrigin-RevId: 216208679
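The fix amounts to making a worker wait for two conditions before it runs local initialization. Below is a minimal plain-Python sketch of that ordering: the worker polls until both global variables and shared resources report ready. `wait_until_ready`, the check names, and the timeout and poll parameters are hypothetical. This is not TensorFlow's implementation.

```python
import time


def wait_until_ready(readiness_checks, timeout_secs=30.0, poll_secs=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Blocks until every readiness check returns True, else raises TimeoutError.

    `readiness_checks` maps a name (hypothetical, e.g. "global_variables" or
    "shared_resources") to a zero-argument callable returning a bool.
    Local resources should only be initialized after this returns.
    """
    deadline = clock() + timeout_secs
    while True:
        # Evaluate every check on each pass so a late shared resource is caught.
        not_ready = [name for name, check in readiness_checks.items() if not check()]
        if not not_ready:
            return
        if clock() >= deadline:
            raise TimeoutError("Still waiting for: " + ", ".join(sorted(not_ready)))
        sleep(poll_secs)
```

`clock` and `sleep` are passed in as parameters so the loop can be exercised without real waiting.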
* Update error message upon a preemption error to highlight a potential gRPC failure and suggest increasing the number of parameter servers. | A. Unique TensorFlower | 2018-09-21
  PiperOrigin-RevId: 214077622
* Add a warning when `tf.train.start_queue_runners()` is called with no queue runners defined. | Derek Murray | 2018-08-30
  This complements the deprecation warning that is printed when that function is called, and provides an actionable hint that the user can delete the call.
  PiperOrigin-RevId: 211012334
* tfdbg: Add adjustable limit to total bytes dumped to disk | Shanqing Cai | 2018-08-28
  RELNOTES: tfdbg: Limit the total disk space occupied by dumped tensor data to 100 GBytes. Add environment variable `TFDBG_DISK_BYTES_LIMIT` to allow adjustment of this upper limit.
  PiperOrigin-RevId: 210648585
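The change pairs a default cap with an environment-variable override, plus a running byte budget that dumps must fit into. A sketch of that pattern follows. The environment variable name comes from the release note. Treating "100 GBytes" as `100 * 1024**3` bytes is an assumption. `disk_bytes_limit` and `DumpBudget` are hypothetical helpers, not tfdbg's code.

```python
import os

# Assumption: "100 GBytes" is read as 100 * 1024**3 bytes; the exact constant
# tfdbg uses is not stated in the commit message.
_DEFAULT_DISK_BYTES_LIMIT = 100 * 1024 ** 3


def disk_bytes_limit(environ=os.environ):
    """Returns the TFDBG_DISK_BYTES_LIMIT override if set, else the default."""
    value = environ.get("TFDBG_DISK_BYTES_LIMIT")
    return int(value) if value else _DEFAULT_DISK_BYTES_LIMIT


class DumpBudget:
    """Tracks bytes dumped so far and refuses dumps that would exceed the limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_reserve(self, num_bytes):
        if self.used_bytes + num_bytes > self.limit_bytes:
            return False  # Caller should skip dumping this tensor.
        self.used_bytes += num_bytes
        return True
```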
* Use distribution strategy to configure distribute coordinator. | Yuefeng Zhou | 2018-08-16
  Add session_creator and a couple of properties to the worker context, which are then used to configure monitored sessions.
  PiperOrigin-RevId: 209026599
* Automated rollback of commit 568727eed199dba04e37f500265b50f96fed455e | Nick Felt | 2018-07-24
  PiperOrigin-RevId: 205875586
* Add v2 summary support to Estimator.train() and MonitoredSession hooks | Nick Felt | 2018-07-24
  This change makes Estimator.train() support v2 summaries (tf.contrib.summary.*) out of the box, matching the support for v1 summaries. Estimator.train() now handles the boilerplate needed to initialize a file writer and enable summary writing every N steps. It also ensures that its own automatically exported summaries (for loss and global_step/sec) are written to the same underlying events file.
  As part of this change, tf.train.SummarySaverHook, tf.train.CheckpointSaverHook, tf.train.StepCounterHook, and tf.train.ProfilerHook have also been adapted to write summaries using the v2 summary system (via a compatibility layer) instead of FileWriterCache.
  A few additional smaller changes:
  - The 'session' parameter to FileWriter() can now be a callable returning a tf.Session instance.
  - tf.contrib.summary.record_summaries_if() is introduced. It takes a boolean tensor for direct control of tf.contrib.summary.should_record_summaries().
  - EstimatorSpec.train_op may now be a tf.Operation or any Tensor-equivalent object, rather than just a tf.Tensor.
  PiperOrigin-RevId: 205843986
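Two mechanisms described above can be sketched in plain Python: deciding on which steps to record summaries ("every N steps"), and accepting either a session or a callable that returns one. Both functions are hypothetical illustrations, not TensorFlow APIs. Recording on step 0 is an assumption of this sketch. In the actual change the per-step boolean would be a tensor passed to `tf.contrib.summary.record_summaries_if()`.

```python
def is_summary_step(global_step, every_n_steps):
    """Returns True on the steps where summaries would be written.

    Assumption: summaries are written on step 0 and every `every_n_steps`
    steps after that; a missing or non-positive interval disables them.
    """
    if not every_n_steps or every_n_steps <= 0:
        return False
    return global_step % every_n_steps == 0


def resolve_session(session_or_fn):
    """Accepts a session object or a zero-argument callable that returns one."""
    return session_or_fn() if callable(session_or_fn) else session_or_fn
```

One plausible reason to accept a callable is that a writer can then be configured before the session it will use actually exists.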
* Provide the ability to specify, in tf.train.MonitoredTrainingSession(), a separate summary directory. | A. Unique TensorFlower | 2018-06-14
  When set, summary_dir is passed as the output directory to StepCounterHook and SummarySaverHook. When unset, the behavior is unchanged and checkpoint_dir is used instead.
  PiperOrigin-RevId: 200526130
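The fallback rule fits in a few lines. Below is a sketch of the directory that would be handed to the two hooks. `summary_output_dir` is a hypothetical helper. Using `or` also treats an empty string as "unset", which is an assumption about edge-case behavior.

```python
def summary_output_dir(checkpoint_dir, summary_dir=None):
    """Picks the output directory for StepCounterHook and SummarySaverHook.

    `summary_dir` wins when set; otherwise fall back to `checkpoint_dir`,
    which was the only behavior before this change.
    """
    return summary_dir or checkpoint_dir
```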
* Move fn_args utility into core TensorFlow from Estimator. | Michael Case | 2018-05-11
  Working on untangling TF/Estimator dependencies: some core TF code depends on Estimator by using the fn_args utility function within Estimator.
  PiperOrigin-RevId: 196277612
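A utility like fn_args reports which argument names a callable accepts, so callers can check, for example, whether a `model_fn` takes `params`. The sketch below uses the standard `inspect` module. It returns positional parameter names and drops those already bound by `functools.partial`. It is an illustration under those assumptions, not TensorFlow's code.

```python
import functools
import inspect


def fn_args(fn):
    """Returns a tuple of the positional argument names `fn` still accepts.

    Arguments bound by functools.partial (positionally or by keyword) are
    removed; bound methods do not report `self`.
    """
    if isinstance(fn, functools.partial):
        remaining = fn_args(fn.func)[len(fn.args):]
        return tuple(a for a in remaining if a not in (fn.keywords or {}))
    params = inspect.signature(fn).parameters.values()
    return tuple(p.name for p in params
                 if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD))
```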
* Merge changes from github. | Patrick Nguyen | 2018-05-01
  PiperOrigin-RevId: 194997009
* Automated g4 rollback of changelist 190858242 | Jianwei Xie | 2018-03-29
  PiperOrigin-RevId: 190953197
* Automated g4 rollback of changelist 190835392 | Anna R | 2018-03-28
  PiperOrigin-RevId: 190858242
* Merge changes from github. | Jianwei Xie | 2018-03-28
  PiperOrigin-RevId: 190835392
* Adding tf_export decorators/calls to TensorFlow functions and constants. | Anna R | 2018-01-31
  PiperOrigin-RevId: 183936100
* Adds some logging around model creation. | A. Unique TensorFlower | 2018-01-12
  When debugging slow startups, it is useful to be able to determine the following, which I was not able to get from the current logging:
  - When model construction with Estimator happens, and how long it takes
  - When the init op runs, and how long it takes
  PiperOrigin-RevId: 181768945
* Initialize local_resources during session initialization. | A. Unique TensorFlower | 2017-12-11
  PiperOrigin-RevId: 178694869
* Mark Supervisor deprecated. Please use MonitoredTrainingSession instead. | Martin Wicke | 2017-11-28
  Fixes #6263.
  PiperOrigin-RevId: 177230053
* Plumb worker max_wait_secs arguments up to tf.contrib.train.train. | A. Unique TensorFlower | 2017-11-16
  PiperOrigin-RevId: 175991159
* Fix tensorflow.org rendering of the example code for run_step_fn. | Igor Saprykin | 2017-11-10
  The Python code wasn't indented correctly.
  PiperOrigin-RevId: 175067065
* Add more recovery functionality to MonitoredSession.run_step_fn. | Igor Saprykin | 2017-10-31
  The current implementation wouldn't recover from one of the `_PREEMPTION_ERRORS` raised during a fetch through the raw session that is made available to the step_fn. This changelist presents a way to map the desired functionality onto the hierarchy _MonitoredSession > (possibly!) _RecoverableSession > _CoordinatedSession > _HookedSession.
  PiperOrigin-RevId: 174053865
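The recovery layer's job is to catch a preemption error, build a fresh inner session, and try again. That retry applies both to ordinary `run()` calls and to a `step_fn` that uses the raw session directly. The classes below are local, hypothetical stand-ins (`PreemptedError` for one of the preemption errors, `RecoverableSession` for the recoverable layer). The `max_recoveries` cap is this sketch's own addition, and none of this is TensorFlow's implementation.

```python
class PreemptedError(Exception):
    """Hypothetical stand-in for one of the _PREEMPTION_ERRORS."""


class RecoverableSession:
    """Re-creates the wrapped session and retries when a preemption error occurs.

    `session_factory` is a zero-argument callable returning an object with a
    `run(fetches)` method (standing in for the coordinated/hooked layers).
    """

    def __init__(self, session_factory, max_recoveries=3):
        self._factory = session_factory
        self._max_recoveries = max_recoveries  # Assumption: bounded retries.
        self._sess = session_factory()
        self.recoveries = 0

    def run_step_fn(self, step_fn):
        """Calls step_fn(raw_session), recovering from preemption errors."""
        while True:
            try:
                return step_fn(self._sess)
            except PreemptedError:
                if self.recoveries >= self._max_recoveries:
                    raise
                self.recoveries += 1
                self._sess = self._factory()  # Start over with a fresh session.

    def run(self, fetches):
        # Ordinary runs share the same recovery path as step functions.
        return self.run_step_fn(lambda sess: sess.run(fetches))
```

Routing `run()` through `run_step_fn()` keeps a single retry path. This mirrors the commit's goal that fetches made through the raw session get the same recovery as ordinary runs.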
* Should not affect public. | A. Unique TensorFlower | 2017-10-25
  PiperOrigin-RevId: 173467560
* Add a way to run ops using a step function to MonitoredSession. | Igor Saprykin | 2017-10-24
  With this method, users have access to a raw Session while still getting the benefit of MonitoredSession's recoverable behavior.
  PiperOrigin-RevId: 173334319
* Propagate the original stack trace when exceptions caught by MonitoredSession are re-raised. | A. Unique TensorFlower | 2017-09-06
  PiperOrigin-RevId: 167781071
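The underlying technique is to capture `sys.exc_info()` where the exception is caught and later re-raise the same exception object with its original traceback attached, so the trace still points at the code that failed. `ErrorHolder` is a hypothetical helper. In Python 3 an exception object already carries `__traceback__`, so `with_traceback` mainly makes the intent explicit.

```python
import sys


class ErrorHolder:
    """Stores an exception caught in one place and re-raises it later."""

    def __init__(self):
        self._exc_info = None

    def capture(self):
        """Call from inside an `except` block to remember the active exception."""
        self._exc_info = sys.exc_info()

    def reraise_if_any(self):
        """Re-raises the stored exception with its original traceback."""
        if self._exc_info is not None:
            _, value, tb = self._exc_info
            raise value.with_traceback(tb)
```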
* Merge changes from github. | Andrew Harp | 2017-08-21
  END_PUBLIC
  --- Commit 575bd01d4 authored by Vijay Vasudevan<vrv@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove /replica:0 declaration in device functions and allow them to be freely bound based on cluster names present. When more than one value matches, it will choose the first lexicographically available device that matches the specification, which in practice will do pretty much the same thing as hardcoding /replica:0. PiperOrigin-RevId: 165766815
  --- Commit d685bbc54 authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Benchmarks with backprop enabled (and removes overhead).
    Time per iteration in microseconds (us). The np.array, Tensor and MatMul [2, 2] rows ran 30000 iterations; the MatMul [100, 784] rows ran 1000 iterations.
    | Benchmark                               | Before | If backprop is enabled | After  |
    |-----------------------------------------|--------|------------------------|--------|
    | np.array([[3]])                         | 1.50   | 0.83                   | 0.86   |
    | Tensor([[3]])                           | 16.30  | 15.21                  | 15.19  |
    | MatMul [2, 2]: np.dot                   | 0.61   | 0.63                   | 0.60   |
    | MatMul [2, 2]: tf.matmul                | 60.53  | 76.31                  | 64.51  |
    | MatMul [2, 2]: gen_math_ops.mat_mul     | 25.72  | 38.66                  | 28.34  |
    | MatMul [2, 2]: TFE_Py_Execute           | 2.82   | 2.31                   | 2.38   |
    | MatMul [2, 2]: defun(tf.matmul)         | 45.70  | 51.96                  | 48.50  |
    | MatMul [100, 784]: np.dot               | 383.32 | 378.34                 | 475.27 |
    | MatMul [100, 784]: tf.matmul            | 350.35 | 352.09                 | 399.50 |
    | MatMul [100, 784]: gen_math_ops.mat_mul | 315.97 | 364.28                 | 307.80 |
    | MatMul [100, 784]: TFE_Py_Execute       | 249.42 | 350.68                 | 272.83 |
    | MatMul [100, 784]: defun(tf.matmul)     | 280.95 | 377.19                 | 350.06 |
    PiperOrigin-RevId: 165765641
  --- Commit d902babbd authored by David Majnemer<majnemer@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Algebraic simplifier incorrectly transformed convolutions into bitcasts PiperOrigin-RevId: 165765575
  --- Commit 8e78e10ef authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: disable test temporarily PiperOrigin-RevId: 165763204
  --- Commit a271c37db authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Small improvements to the arithmetic optimizer PiperOrigin-RevId: 165760972
  --- Commit b6409594d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Convert some tests to cover both eager and graph. PiperOrigin-RevId: 165760364
  --- Commit 5ead76420 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Reduce XLA compile time by ~7% for a convolutional image model:
    * Added CompactPointerSet<T>, which is optimized for set size <= 1.
    * Changed expensive CHECKs to DCHECKs in buffer_assignment.cc.
    * Reserve space in DFS state array before starting DFS.
    * Use unsigned arithmetic in DFS state maintenance.
    * HloInstruction:
      - Moved frequently used fields to start for better cache locality.
      - Use InlinedVector instead of vector for operand array.
      - Use InlinedVector instead of vector for DFS stack.
    * Pre-compute "is array" and "is tuple" for LogicalBuffer.
    * PointsToSet:
      - Combine two ShapeTrees into one.
      - Use CompactPointerSet instead of std::set to hold sources.
      - Use CompactPointerSet instead of std::set to hold flattened buffers.
    * ShapeTree: use unique_ptr instead of optional for shape storage (reduces size and destruction overhead).
    * Add proper const qualifiers to some FlatSet iterator methods.
    Co-author=jeff
    PiperOrigin-RevId: 165759117
  --- Commit a0544b0b8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make TPU symbols more easily accessible from contrib. PiperOrigin-RevId: 165753322
  --- Commit cdc08afbb authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Slightly relax numeric tolerance for single precision tests of matrix_solve_ls (and tighten it for double precision). PiperOrigin-RevId: 165750936
  --- Commit eebcc861a authored by Jianwei Xie<xiejw@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixed the race condition between multi eval step increments. PiperOrigin-RevId: 165750595
  --- Commit bbc0b8471 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 165748384
  --- Commit 65f87c967 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Change device string in RecvNodeDescriptor in VirtualScheduler from const reference to const, as the RecvNodeDescriptor (and cached_recv_nodes map) outlives the device string from the NodeDef. PiperOrigin-RevId: 165748244
  --- Commit 57b0276cf authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 165747467
  --- Commit 64e54423b authored by Derek Murray<mrry@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf.contrib.data] Fix nested dictionary handling in dataset elements. Backports recent changes to the core version of the nest.py library. Fixes #12372. PiperOrigin-RevId: 165746517
  --- Commit 378463ae8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Make tf.eye accept Python integer shapes and avoid generating unnecessary shape handling ops. Clean up test and add tests with placeholders. PiperOrigin-RevId: 165746090
  --- Commit 109ecf823 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add support for complex in matrix_solve_ls_op. Split into separate files for each data type to speed up build. PiperOrigin-RevId: 165744539
  --- Commit 51441302d authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Internal change. PiperOrigin-RevId: 165737455
  --- Commit d0cb32c2a authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Docstring for ResourceVariable. PiperOrigin-RevId: 165735441
  --- Commit 32f4c5b6e authored by Chris Leary<leary@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add IsFinite op in tf2xla. PiperOrigin-RevId: 165734702
  --- Commit 5f5c3eb0a authored by Mark Daoust<markdaoust@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Move "supervisor.md" from programmer's guide to api_guides. PiperOrigin-RevId: 165732026
  --- Commit d001b58de authored by Derek Murray<mrry@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf.contrib.data] Fix handling of multi-output tf.py_func() in Dataset.map(). If the `map_func` returns a list of tensors, the current code will attempt to stack it into a single tensor and raise an unintuitive error. Some multi-output ops (such as `tf.py_func()`) return lists of typically-not-stackable tensors. This change treats lists returned from `map_func` as tuples; users who were relying on this auto-stacking behavior should manually call `tf.stack()` (or `tf.convert_to_tensor()`) on the list being returned. Fixes #12396. PiperOrigin-RevId: 165731970
  --- Commit e6c60fb36 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix flakiness; sometimes the op takes milliseconds to run. PiperOrigin-RevId: 165728705
  --- Commit 360bff8ae authored by Ali Yahya<alive@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Makes tape.watch() work with ResourceVariables. To this end, also adds a property, `device`, to TensorNode. PiperOrigin-RevId: 165726368
  --- Commit 80bd004cd authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Implements SVDF model for keyword spotting tutorial. PiperOrigin-RevId: 165725938
  --- Commit aaabf6b90 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix bug: Using a ComputationDataHandle from the wrong ComputationBuilder. PiperOrigin-RevId: 165724017
  --- Commit 107d165d9 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use 2-arg TraceMe constructor to prevent unnecessary StrCat computation when tracing is disabled. PiperOrigin-RevId: 165722280
  --- Commit 7d01f89cc authored by Pete Warden<petewarden@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Android demo app for speech recognition PiperOrigin-RevId: 165714459
  --- Commit a6729325a authored by Alexandre Passos<apassos@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Deletes convert_n_to_eager_tensor. Moves convert_to_eager_tensor to constant_op. PiperOrigin-RevId: 165704074
  --- Commit 573b303ac authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: BUILD cleanup in tensorflow/core/kernels PiperOrigin-RevId: 165688864
  --- Commit 711be6adc authored by Derek Murray<mrry@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: `Dataset.from_generator()` constructs a dataset from a Python generator. With this change, it becomes possible to use a Python generator as the source dataset for a `tf.contrib.data` input pipeline. This enables easier integration with non-TensorFlow data sources. The generator can yield a nested structure of NumPy arrays, or values convertible to NumPy arrays. This addresses a concern raised in issue #7951. PiperOrigin-RevId: 165663857
  --- Commit 00594ecdd authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: New landing page and leftnav for Programmer's Guide. PiperOrigin-RevId: 165660897
  --- Commit 7359fec79 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Implement Batchnorm Inference by expanding it into smaller ops.
    1. Add batch norm inference support in batchnorm_rewriter
    2. Connect xla's batchnorm inference to tf's FusedBatchNorm
    RELNOTES: n/a
    PiperOrigin-RevId: 165655351
  --- Commit f0da8bf56 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [Rematerialization] Reconsider rematerializing operations with control dependencies. We added conservative logic to not rematerialize operations with control dependencies, since the rematerialized operations could result in undesired ordering. However, we now realize that when we remat an operation, we also copy its dependencies, which guarantees that the rematerialized operation has the same constraints as the original operation. PiperOrigin-RevId: 165654629
  --- Commit a1225879c authored by Chris Leary<leary@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Propagate error code in computation replay tool. PiperOrigin-RevId: 165654497
  --- Commit 513def0bb authored by Benoit Steiner<bsteiner@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixed BuildOpInfoWithoutDevice PiperOrigin-RevId: 165653933
  --- Commit d7e425f0b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix linear algebra benchmarks. PiperOrigin-RevId: 165653891
  --- Commit 465c40819 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix the shape information propagation for Enter op. PiperOrigin-RevId: 165653579
  --- Commit c0198fd8d authored by Derek Murray<derek.murray@gmail.com> Committed by gunan<gunan@google.com>: [CMake] Add missing dependencies on boosted_trees protos and other fixes (#12315)
    * [CMake] Add missing dependencies
    * Avoid rebuilding boosted_trees protos for Python.
    * Add GPU implementation ZeroInitializerOp to the CMake build.
  --- Commit 641943fd7 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update ops-related pbtxt files. PiperOrigin-RevId: 165652758
  --- Commit e31346452 authored by Jonathan Hseu<jhseu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: TPUEstimator: Fix the outfeed thread join. PiperOrigin-RevId: 165651781
  --- Commit 565a9d350 authored by Vijay Vasudevan<vrv@google.com> Committed by Andrew Harp<andrewharp@users.noreply.github.com>: Add missing 'type' keyword to ArgumentParser add_argument (#12275) Fixes #12210
  --- Commit 19a55725a authored by Rohan Jain<rohanj@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Allowing functions to run across devices. This change expands the ProcessFunctionLibraryRuntime library to Instantiate and Run functions on different devices. When a FunctionLibraryRuntime encounters a function with a target that is another device, it delegates Instantiate() and Run() calls to the ProcessFunctionLibraryRuntime. This change also moves the table_ containing all function instantiations to the PFLR instead of the FunctionLibraryRuntime. PiperOrigin-RevId: 165651194
  --- Commit 8c0853db7 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a test for negative and zero pow() input. PiperOrigin-RevId: 165650096
  --- Commit a3c4e980e authored by Pete Warden<petewarden@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fixed input shape for freezing audio graphs PiperOrigin-RevId: 165649546
  --- Commit 9b9e5989d authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Add a call_logit_fn utility for logit_fn's, similar to Estimator's _call_model_fn. PiperOrigin-RevId: 165649388
  --- Commit 4ff1f4442 authored by Amit Patankar<amitpatankar@google.com> Committed by Amit Patankar<amitpatankar@google.com>: Remove the script as well if building tf_nightly.
  --- Commit 373d78987 authored by Amit Patankar<amitpatankar@google.com> Committed by Amit Patankar<amitpatankar@google.com>: Adding the break.
  --- Commit 0139ac983 authored by Amit Patankar<amitpatankar@google.com> Committed by Amit Patankar<amitpatankar@google.com>: Remove tensorboard as a required package if we are building tf_nightly.
  --- Commit a92bd5d5c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: BEGIN_PUBLIC Automated g4 rollback of changelist 165630063
  PiperOrigin-RevId: 165957821
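The first commit in this merge describes a rule: when several devices match a partial device specification, the lexicographically first match is chosen. Here is a sketch of that selection rule. It assumes device names are plain `/key:value` segments. `parse_device` and `pick_device` are hypothetical helpers, not TensorFlow's device-name utilities.

```python
def parse_device(name):
    """Parses '/job:worker/replica:0/task:1/device:GPU:0' into a dict of fields."""
    fields = {}
    for part in name.strip("/").split("/"):
        key, _, value = part.partition(":")  # "device:GPU:0" -> ("device", "GPU:0")
        fields[key] = value
    return fields


def pick_device(spec, available):
    """Returns the lexicographically first available device matching every field in spec."""
    wanted = parse_device(spec)
    matches = [d for d in available
               if all(parse_device(d).get(k) == v for k, v in wanted.items())]
    if not matches:
        raise ValueError("No device matches " + spec)
    return min(matches)
```

With plain string comparison, `replica:10` sorts before `replica:2`. The commit does not say whether the real code compares numeric fields numerically.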
* Recover MonitoredSession when the Coordinator is requested to stop with one of the _PREEMPTION_ERRORS. | Igor Saprykin | 2017-08-10
  When SyncReplicasOptimizer is used, a preemption in the Coordinator may result in two cases:
  Case 1) the session gets silently marked as complete;
  Case 2) the session gets stuck.
  This CL aims to solve, and verify solutions for, both of these problems. Fix 1 changes the should_stop logic. Fix 2 changes the CoordinatedSession.run() logic.
  SyncReplicasOptimizer runs a separate set of threads using a Coordinator instance. Those threads do FIFOQueue.enqueue; the main thread does a blocking FIFOQueue.dequeue. The `sync_token_q` FIFOQueue is on the parameter servers. When one of the PS instances gets preempted, an AbortedError causes the Coordinator to stop via request_stop(ex). That by itself changes the state of MonitoredSession.should_stop() to True (Fix 1).
  Results of the blocking Dequeue operation are sent to the chief worker via Recv. What happens next depends on the number of tokens in `sync_token_q`. If there are enough for the next call to Dequeue to return, the low-level "tf session run() call" returns, and the next iteration of the `while not MonitoredSession.should_stop()` loop decides that training is complete (Case 1).
  If there are not enough tokens in `sync_token_q`, the blocking Dequeue keeps waiting for them. The graph execution gets stuck, and the whole session is garbage collected after 10 minutes (Case 2). We decided to fix that by re-creating a session after it gets garbage collected (Fix 2). An alternative was to try to cancel the pending Dequeue operation, but it is not clear that this is the right thing to do, and it is also not easy.
  PiperOrigin-RevId: 164888390
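Fix 1 can be sketched as a should_stop check that looks at why the Coordinator stopped. A stop caused by a preemption error is treated as "recover", while other stop requests still mean "stop training". `Coordinator`, `AbortedError`, `UnavailableError` and `session_should_stop` are local stand-ins, not the TensorFlow classes. The two-error tuple is this sketch's assumption about what `_PREEMPTION_ERRORS` contains.

```python
class AbortedError(Exception):
    """Local stand-in for tf.errors.AbortedError."""


class UnavailableError(Exception):
    """Local stand-in for tf.errors.UnavailableError."""


# Assumption: preemption errors are modeled as these two local classes.
PREEMPTION_ERRORS = (AbortedError, UnavailableError)


class Coordinator:
    """Minimal stand-in that records whether, and why, a stop was requested."""

    def __init__(self):
        self._stop_requested = False
        self._exception = None

    def request_stop(self, ex=None):
        self._stop_requested = True
        self._exception = ex

    def should_stop(self):
        return self._stop_requested

    @property
    def exception(self):
        return self._exception


def session_should_stop(coord, hooks_requested_stop=False):
    """Treats a coordinator stop caused by preemption as recoverable, so the
    training loop does not mistake it for completion (Case 1 above)."""
    if hooks_requested_stop:
        return True
    if coord.should_stop():
        return not isinstance(coord.exception, PREEMPTION_ERRORS)
    return False
```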
* Merge changes from github. | Benoit Steiner | 2017-08-04
  END_PUBLIC
  --- Commit cf375f067 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: Adds cudnn_rnn_ops_op_lib and cudnn_rnn_kernels to contrib_ops_op_lib and contrib_kernels respectively. PiperOrigin-RevId: 164170971
  --- Commit 95ec58e27 authored by Asim Shankar<ashankar@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: C API: Make TF_TensorFromTensor return an error instead of just logging it. PiperOrigin-RevId: 164167582
  --- Commit 15175c870 authored by Jonathan Hseu<jhseu@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: Build fixes.
    - Allow var_list as a positional argument in CrossShardOptimizer.
    - Set the number of shards to 1 when not running on TPU, to allow evaluate() and predict() on CPU/GPU to work.
    PiperOrigin-RevId: 164161640
  --- Commit bd3e894f7 authored by Yao Zhang<yaozhang@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: Support freeze mode for fused batch norm. PiperOrigin-RevId: 164149032
  --- Commit e6b6b84c0 authored by Asim Shankar<ashankar@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: C API: TF_Tensors will always be in host memory. This change undoes some experimentation in commit 22651083406ca01ac9d481e3367a3510d25f88cd and restores TF_Tensor behavior to what it was prior to that change. PiperOrigin-RevId: 164146670
  --- Commit 8bf3f88f7 authored by Peter Hawkins<phawkins@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [TF:XLA] Add _XLASend and _XLARecv TF ops that wrap the XLA Send/Recv HLO ops. PiperOrigin-RevId: 164124764
  --- Commit 626d3200f authored by Peter Hawkins<phawkins@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: [XLA] Add test blacklist mechanism for XLA C++ unit tests. PiperOrigin-RevId: 164124423
  --- Commit 359cc5f5e authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: Document dict ordering in nest and make it consistent with sonnet. PiperOrigin-RevId: 164114335
  --- Commit 05813b531 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: Go: Update generated wrapper functions for TensorFlow ops. PiperOrigin-RevId: 164089206
  --- Commit c451f465d authored by Anna R<annarev@google.com> Committed by Benoit Steiner<benoitsteiner@users.noreply.github.com>: BEGIN_PUBLIC Automated g4 rollback of changelist 164078808
  PiperOrigin-RevId: 164318935
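The nest commit in this merge concerns the order in which dictionary values are flattened; `tf.nest` documents this as sorted-key order. The sketch below is a small flatten that follows that rule. It assumes only lists, tuples and dicts count as nested structures, whereas the real `nest` also handles namedtuples and other types. `flatten` here is hypothetical, not the library function.

```python
def flatten(structure):
    """Flattens nested lists/tuples/dicts into a list of leaves.

    Dict values are visited in sorted-key order, so two dicts with the same
    keys always flatten consistently regardless of insertion order.
    """
    if isinstance(structure, dict):
        return [leaf for key in sorted(structure) for leaf in flatten(structure[key])]
    if isinstance(structure, (list, tuple)):
        return [leaf for item in structure for leaf in flatten(item)]
    return [structure]
```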
* Merge changes from github.Gravatar Vijay Vasudevan2017-07-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | END_PUBLIC I dropped the following commit because it doesn't compile. I will follow up with Andrew to fix it or revert it. Commit 003deb88b authored by osdamv<osdamv@gmail.com> Committed by Vijay Vasudevan<vrv@google.com>: Refactor and implementation of the camera API 1, it fixes #8736 (#10771) List of commits in this CL: --- Commit 446450369 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Use identity of param variable in cudnn_rnn.RNNParamsSaveable instead of parameter variable directly. The RNNParamsSaveable is usually used in a graph which also has a saver for the cudnn param variable itself, if the same op is used for both, fails with a two savers for same op error. PiperOrigin-RevId: 163431826 --- Commit d629a8316 authored by RJ Ryan<rjryan@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Increase bound on tf.contrib.signal.inverse_stft gradient error to avoid flakiness on macOS. 
PiperOrigin-RevId: 163426631 --- Commit 253bcbb71 authored by Kay Zhu<kayzhu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Use HloEvaluator for convolution in reference_util. Also Speed up HloEvaluator's HandleConvolution in non-opt build, by moving calls to HloInstruction::shape() out of the inner loop. PiperOrigin-RevId: 163416183 --- Commit 569a00e68 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update API to traffic in unique_ptrs rather than owning raw pointers PiperOrigin-RevId: 163414320 --- Commit 31a77bc77 authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Java: Update release to 1.3.0-rc1 PiperOrigin-RevId: 163413736 --- Commit 1ebbf4325 authored by Jonathan Hseu<vomjom@vomjom.net> Committed by GitHub<noreply@github.com>: Add missing grpc dependency (#11828) --- Commit 905abb1f9 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Test asserts should have `expected` first. PiperOrigin-RevId: 163409348 --- Commit d5cc143e2 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Increase timeout to deflake the test. PiperOrigin-RevId: 163407824 --- Commit ce1c7f02a authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Properly include logging header in xla_internal_test_main PiperOrigin-RevId: 163405986 --- Commit 22241cd42 authored by joetoth<joetoth@gmail.com> Committed by Vijay Vasudevan<vrv@google.com>: External leveldb link changed (#11833) table_format.txt was renamed to table_format.md --- Commit 6b7314de4 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Consolidating the code to fill the partition's function library into one place. 
Previously, Partition() and MasterSession::RegisterPartition() both filled in the partitioned graph's function library. PiperOrigin-RevId: 163400992
--- Commit 28373cfe7 authored by Frank Chen<frankchn@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Adds preliminary support for Cloud TPUs with Cluster Resolvers. This aims to give users a better experience when specifying one or multiple Cloud TPUs for their training jobs by allowing them to use names rather than IP addresses. PiperOrigin-RevId: 163393443
--- Commit e5353c941 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Don't prune nodes that have reference inputs. PiperOrigin-RevId: 163390862
--- Commit 226510834 authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: C API: Groundwork for experimenting with TF_Tensor in device memory. TF_Tensor objects are always backed by host memory. This commit lays the groundwork for allowing TF_Tensor objects to refer to tensor data on device (e.g., GPU) memory. PiperOrigin-RevId: 163388079
--- Commit 613bf1c7c authored by Yuefeng Zhou<yuefengz@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: fix asan test failure in SingleMachineTest::ReleaseMemoryAfterDestruction. PiperOrigin-RevId: 163386941
--- Commit 4653d37a3 authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Change type to appease GPU builds. PiperOrigin-RevId: 163384927
--- Commit 9f131bd15 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Internal change PiperOrigin-RevId: 163378484
--- Commit 8bc0236c8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: PiperOrigin-RevId: 163366493
--- Commit 3b97f1f9b authored by Yangzihao Wang<yangzihao@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Change to only run one round of matmul benchmark. PiperOrigin-RevId: 163364341
--- Commit a4a3a3335 authored by Yun Peng<pcloudy@google.com> Committed by Vijay Vasudevan<vrv@google.com>: Fix ./configure on Windows (#11775) * Fix ./configure on Windows * Disable bitwise_ops_test on Windows
--- Commit ae3119d16 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Small changes to op framework. PiperOrigin-RevId: 163361071
--- Commit f40189d26 authored by qjivy<ji.qiu@spreadtrum.com> Committed by Vijay Vasudevan<vrv@google.com>: PR again: Enable building label_image with jpeg/gif/png decoder for Android. (#11475) * Enable building label_image with jpeg/gif/png decoder for Android. Add dependency "android_tesnorflow_image_op" to label_image, which is not overlapped with android_tensorflow_kernels. * Running buildifier to reformat the BUILD files for sanity check.
--- Commit 599165861 authored by KB Sriram<kbsriram@gmail.com> Committed by Vijay Vasudevan<vrv@google.com>: Add the Constant operator class (#11559) Create a custom operator class to create constants in the Graph, and introduce the Operator marker annotation to identify operator classes. Please see #7149 for the master tracking issue.
--- Commit 86ca3506f authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Further BUILD cleanup PiperOrigin-RevId: 163360750
--- Commit 376bb063b authored by Pete Warden<petewarden@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Look inside functions to see which node types are used. PiperOrigin-RevId: 163360375
--- Commit 2139e7d8b authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tf.contrib.data] map expects a nested structure. Fixes #11786 PiperOrigin-RevId: 163359134
--- Commit d09304fca authored by Jonathan Hseu<vomjom@vomjom.net> Committed by Vijay Vasudevan<vrv@google.com>: Upgrade gRPC (#11768) * BUILD rule modifications * More build fixes * Code changes * More code fixes * Working tests * CMake build * Fix pprof * Fix header includes * CMake fix test * Bazel clean * Fix verbs * More verbs fixes * bazel clean for XLA * Windows build fix test * Add openssl/rand.h * New cmake build command * --config Release
--- Commit 3cd828474 authored by David Norman<DavidNorman@users.noreply.github.com> Committed by Vijay Vasudevan<vrv@google.com>: Fix error with default python path selection (#11814) * Fix error with default python path selection * Move setting of environment var outside if / else
--- Commit ddd8e21b7 authored by Eli Bendersky<eliben@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Consolidate all similar main()s in tests into a single target. PiperOrigin-RevId: 163354724
--- Commit a36bca25b authored by Tayo Oguntebi<tayo@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove ShapeWithoutPadding() utility function, as it is no longer needed. PiperOrigin-RevId: 163353430
--- Commit b26f9cd44 authored by David Norman<DavidNorman@users.noreply.github.com> Committed by Vijay Vasudevan<vrv@google.com>: Ensure that the multi-instruction fuse can take shared inputs (#11748) * Ensure that the multi-instruction fuse can take shared inputs Note that the fuse action only works when the shared input / constant appears after all of its consumers in the list of instructions. * Add a comment describing the test
--- Commit 34cbf161d authored by Jiri Simsa<jsimsa@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update Dataset API documentation.
PiperOrigin-RevId: 163349457
--- Commit 2381ce5c3 authored by Abdullah Alrasheed<a.rasheed@tc-sa.com> Committed by Vijay Vasudevan<vrv@google.com>: DOC: Fix typo. (#11813) you could could be I/O bottlenecked. TO: you could be I/O bottlenecked.
--- Commit e4a5c5356 authored by Toby Boyd<tobyboyd@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: ["Variable", "VariableV2", "VarHandleOp"] is the default for ps_ops=None PiperOrigin-RevId: 163344629
--- Commit 722f6f361 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix TensorForest's saveable object names so loading a savedmodel works. PiperOrigin-RevId: 163332598
--- Commit cda80a785 authored by Eric Liu<ioeric@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [tpu profiler] Dump HLO graphs in profile responses to the log directory. PiperOrigin-RevId: 163318992
--- Commit cea9ef6f5 authored by horance<horance-liu@users.noreply.github.com> Committed by Vijay Vasudevan<vrv@google.com>: Refactoring device name utils (#11797) * remove duplicated code for full_name and legacy_name for DeviceNameUtils * replace tabs * Real->Device
--- Commit 1f7c0f917 authored by Kongsea<kongsea@gmail.com> Committed by Vijay Vasudevan<vrv@google.com>: Refine docstrings (#11800)
--- Commit dd1f0cddd authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Supports looking up devices by full name in either the canonical form or the legacy form. This makes DeviceSet behave the same as DeviceMgr's FindDevice method. PiperOrigin-RevId: 163300346
--- Commit 631a364cd authored by Kay Zhu<kayzhu@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add Reduce, DynamicSlice and DynamicSliceUpdate to HloEvaluator.
- Reduce is disabled explicitly for constant folding, as not all types of embedded computation can currently be supported by the evaluator.
- Added support to evaluate HloModule to HloEvaluator.
- Minor signature change to Evaluate().
PiperOrigin-RevId: 163299238
--- Commit a52470172 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Sets the incarnation number even when the attribute is set. PiperOrigin-RevId: 163299121
--- Commit a49fe0366 authored by Suharsh Sivakumar<suharshs@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove platform bridge for grpc_response_reader. PiperOrigin-RevId: 163295986
--- Commit 4404aa7cb authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add TODO comment explaining why the IsScalar check exists. PiperOrigin-RevId: 163292777
--- Commit 43036ac16 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove unnecessary break statements. PiperOrigin-RevId: 163291947
--- Commit fd5de4690 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Add regression test for a corner case using Reduce that currently fails with the GPU backend. PiperOrigin-RevId: 163287986
--- Commit 32e198f2d authored by Chris Leary<leary@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [TF:XLA] Add tf.cross support. See #11788 PiperOrigin-RevId: 163287731
--- Commit 88abddbc3 authored by Alan Yee<alyee@ucsd.edu> Committed by Vijay Vasudevan<vrv@google.com>: Update README.md (#11793) Remove the bad practice of sudo pip and use safer pip install commands
--- Commit 9b30dc3a8 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Remove final mentions of `get_shape` in docstring. PiperOrigin-RevId: 163282839
--- Commit 423c1eea0 authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: BREAKING CHANGE: Fix semantic error in how maybe_batch* handles sparse tensors. PiperOrigin-RevId: 163276613
--- Commit 6028c071b authored by Justin Lebar<jlebar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Highlight incoming/outgoing edges on hover in HLO graphviz dumps, and other improvements. Other improvements:
- Don't show tooltips for nodes and clusters. Previously we'd show a tooltip containing a pointer value expressed as decimal. Not so useful.
- Show tooltips on edges with the to/from node names.
- Fix bug wherein if we had
  - a node at the "edge" of the graph (so its operands aren't included unless they're referenced by another node),
  - with all of its operands included in the graph save one or more constants, and
  - those constants weren't referenced by any nodes not at the edge of the graph,
  we would incorrectly draw the node as "grayed out", indicating that one of its operands (namely, its constant operand) wasn't present in the graph. This is wrong because constants are inlined into their users, so they should always count as "displayed" for the purposes of determining whether a node is grayed out.
PiperOrigin-RevId: 163276108
--- Commit ce7a355bd authored by Joshua V. Dillon<jvdillon@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Update contrib/distributions/estimator_test build dependency. PiperOrigin-RevId: 163272464
--- Commit 1b8458a1c authored by A. Unique TensorFlower<gardener@tensorflow.org> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Shorten docstring line.
PiperOrigin-RevId: 163269709
--- Commit 69e323cc6 authored by Asim Shankar<ashankar@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Fix comment typo PiperOrigin-RevId: 163266376
--- Commit 08790e73d authored by Chris Leary<leary@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: [XLA] Fix a bug in cloning outfeeds, carried the wrong shape. PiperOrigin-RevId: 163265592
--- Commit 1bad826d6 authored by Yangzihao Wang<yangzihao@google.com> Committed by TensorFlower Gardener<gardener@tensorflow.org>: Rollback of GPU kernel implementation of transpose for tensors with one small dimension.
END_PUBLIC BEGIN_PUBLIC BEGIN_PUBLIC Automated g4 rollback of changelist 162525519 PiperOrigin-RevId: 163490703
* Allow for specifying a checkpoint path when creating SingularMonitoredSession. (A. Unique TensorFlower, 2017-07-21)
| | | | PiperOrigin-RevId: 162728280
* Fix typos in comments. (A. Unique TensorFlower, 2017-07-08)
| | | | PiperOrigin-RevId: 161305803
* Support providing only save_summaries_secs to MonitoredTrainingSession. (A. Unique TensorFlower, 2017-06-28)
| | | | PiperOrigin-RevId: 160382016
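The 2017-06-28 change concerns how two alternative summary schedules (by step count or by seconds) are resolved when the caller supplies only one of them. Below is a minimal pure-Python sketch of that resolution. The `USE_DEFAULT` sentinel, the 100-step fallback, and the name `resolve_summary_schedule` are all assumptions made for illustration and do not come from this log.

```python
# Hypothetical sentinel meaning "the caller did not pass this argument".
USE_DEFAULT = object()


def resolve_summary_schedule(save_summaries_steps=USE_DEFAULT,
                             save_summaries_secs=USE_DEFAULT):
    """Return (steps, secs); a None entry disables that trigger.

    Sketch only: the 100-step fallback is an assumption.
    """
    steps_given = save_summaries_steps is not USE_DEFAULT
    secs_given = save_summaries_secs is not USE_DEFAULT
    if not steps_given and not secs_given:
        # Nothing supplied: fall back to a step-based schedule.
        return 100, None
    if not secs_given:
        # Only steps supplied: no time-based trigger.
        return save_summaries_steps, None
    if not steps_given:
        # Only secs supplied: drop the step default instead of keeping
        # both triggers active at once.
        return None, save_summaries_secs
    # Both supplied: pass them through unchanged.
    return save_summaries_steps, save_summaries_secs


print(resolve_summary_schedule(save_summaries_secs=60))  # (None, 60)
```

A sentinel is used instead of `None` so that "not passed" can be told apart from an explicit `None`, which a caller might pass to disable a trigger.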
* Add a warning to documentation of MonitoredSession. (Mustafa Ispir, 2017-06-01)
| | | | PiperOrigin-RevId: 157728225
* Fix StopAtStepHook with num_steps when multiple steps are executed in a single session.run(). (Jonathan Hseu, 2017-05-26)
| | | | | | PiperOrigin-RevId: 157277945
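The StopAtStepHook fix (2017-05-26) addresses a specific failure. With `num_steps`, the stop target is the starting global step plus `num_steps`, and if one `session.run()` advances the global step by several steps, an exact-equality check can jump right over the target. Below is a pure-Python sketch of the `>=` comparison that avoids this. `StopAtStepSketch` is a hypothetical class that receives the global step directly rather than reading a global-step tensor from a session, and it is not TensorFlow's implementation.

```python
class StopAtStepSketch:
    """Hypothetical stop-at-step hook that is handed the global step directly."""

    def __init__(self, num_steps):
        self._num_steps = num_steps
        self._last_step = None
        self.stop_requested = False

    def after_create_session(self, global_step):
        # num_steps is relative to the step at which this session starts.
        self._last_step = global_step + self._num_steps

    def after_run(self, global_step):
        # ">=" rather than "==": a single run() may advance the step by
        # more than one, jumping over _last_step without ever equaling it.
        if global_step >= self._last_step:
            self.stop_requested = True


hook = StopAtStepSketch(num_steps=10)
hook.after_create_session(global_step=5)  # target step is 15
for step in (8, 11, 14, 17):  # each run() advances the step by 3
    hook.after_run(step)
    if hook.stop_requested:
        break
print(step)  # 17; an "== 15" check would never have fired
```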
* Fix a bug in Scaffold's copy_from_scaffold argument because tf.Tensor cannot be converted to python bool (A. Unique TensorFlower, 2017-05-17)
| | | | | | PiperOrigin-RevId: 156375789
* Add copy_from_scaffold parameter in the Scaffold constructor. (A. Unique TensorFlower, 2017-05-15)
| | | | | | This allows creating a new Scaffold instance from an existing one by copying the fields of the original scaffold and replacing those that are provided in the constructor. PiperOrigin-RevId: 156099660
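The two `copy_from_scaffold` commits (2017-05-15 and 2017-05-17) fit together. The first copies any unspecified fields from another scaffold. The second fixes a bug involving `tf.Tensor` values that cannot be converted to a Python bool. One plausible reading is that the merge must test `is None` rather than truthiness, and the pure-Python sketch below illustrates that reading. `ScaffoldSketch` (with only two fields) and `FakeTensor` are hypothetical stand-ins, not the real `tf.train.Scaffold` or `tf.Tensor`.

```python
class FakeTensor:
    """Mimics tf.Tensor, which refuses to be converted to a Python bool."""

    def __bool__(self):
        raise TypeError("Using a tensor as a Python bool is not allowed.")


class ScaffoldSketch:
    """Hypothetical two-field stand-in for tf.train.Scaffold."""

    def __init__(self, init_op=None, ready_op=None, copy_from_scaffold=None):
        if copy_from_scaffold is not None:
            # Test for None explicitly: "init_op or other.init_op" would
            # call bool() on a tensor and raise TypeError.
            if init_op is None:
                init_op = copy_from_scaffold.init_op
            if ready_op is None:
                ready_op = copy_from_scaffold.ready_op
        self.init_op = init_op
        self.ready_op = ready_op


base = ScaffoldSketch(init_op=FakeTensor(), ready_op=FakeTensor())
new_ready = FakeTensor()
derived = ScaffoldSketch(ready_op=new_ready, copy_from_scaffold=base)
# derived keeps base.init_op but uses the replacement ready_op.
```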
* Merge changes from GitHub. (Dan Ringwalt, 2017-05-05)
| | | | Change: 155209832
* Organize the lookup table ops into their own lookup_ops.cc file instead of data_flow_ops.cc (Yutaka Leon, 2017-05-04)
| | | | | | Change: 155119120
* Time out after 30 minutes when waiting for the session to be ready. (A. Unique TensorFlower, 2017-04-26)
| | | | Change: 154362697
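The 30-minute limit (2017-04-26) caps what is essentially a polling loop. Below is a minimal sketch of a deadline-bounded wait, assuming a caller-supplied `is_ready` predicate and a 30-second poll interval. The interval is an assumption (the log states only the 30-minute cap), and `wait_until_ready` and `FakeClock` are hypothetical names.

```python
import time


def wait_until_ready(is_ready, timeout_secs=30 * 60, poll_secs=30,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True; give up after timeout_secs.

    Hypothetical helper; clock and sleep are injectable so it can be tested.
    """
    deadline = clock() + timeout_secs
    while not is_ready():
        if clock() >= deadline:
            raise TimeoutError(
                "Session was not ready after %d secs." % timeout_secs)
        sleep(poll_secs)
    return True


class FakeClock:
    """Manual clock: sleep() advances simulated time instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, secs):
        self.now += secs


clock = FakeClock()
answers = iter([False, False, True])  # ready on the third poll
wait_until_ready(lambda: next(answers), clock=clock, sleep=clock.sleep)
print(clock.now)  # 60.0 simulated seconds

never_clock = FakeClock()
try:
    wait_until_ready(lambda: False, clock=never_clock, sleep=never_clock.sleep)
except TimeoutError:
    print("gave up at", never_clock.now)  # gave up at 1800.0
```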
* Log steps/sec every 100 steps in MonitoredSession, as before. (Lukasz Kaiser, 2017-04-06)
| | | | Change: 152465320
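Logging steps/sec every 100 steps (2017-04-06) comes down to measuring the wall time elapsed between two step counts. Below is a pure-Python sketch using a hypothetical `StepRateLogger` that is handed the step and a timestamp explicitly. In TensorFlow a step-counter hook plays this role, and its exact behavior is not described in this log.

```python
class StepRateLogger:
    """Hypothetical steps/sec reporter that fires every `every_n_steps`."""

    def __init__(self, every_n_steps=100):
        self.every_n_steps = every_n_steps
        self._last_step = None
        self._last_time = None
        self.reports = []

    def record(self, step, now):
        if self._last_step is None:
            # The first observation only sets the baseline.
            self._last_step, self._last_time = step, now
            return
        if step - self._last_step >= self.every_n_steps:
            # Average rate over the window since the last report.
            rate = (step - self._last_step) / (now - self._last_time)
            self.reports.append((step, rate))
            self._last_step, self._last_time = step, now


logger = StepRateLogger(every_n_steps=100)
for step in range(0, 301):
    logger.record(step, now=step * 0.5)  # pretend each step takes 0.5 s
print(logger.reports)  # [(100, 2.0), (200, 2.0), (300, 2.0)]
```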
* Add default saver option to CheckpointSaverHook and improve docstrings. (A. Unique TensorFlower, 2017-03-31)
| | | | Change: 151839983
* Decrease volume of spammy logs. (A. Unique TensorFlower, 2017-03-28)
| | | | | Remove unused _monitored_train. Change: 151526705
* Fix some documentation formatting errors. (Patrick Nguyen, 2017-03-21)
| | | | Change: 150841749
* Fix lint issues introduced by my pull from GitHub. (Dandelion Mané, 2017-03-13)
| | | | Change: 149985352
* Merge changes from GitHub. (Dandelion Mané, 2017-03-10)
| | | | Change: 149800363
* Improved wording of MonitoredSession preemption error logging. (A. Unique TensorFlower, 2017-03-02)
| | | | Change: 149025947
* Fix spelling errors. (Patrick Nguyen, 2017-02-27)
| | | | Change: 148678164
* Fix recovery from failed PS tasks in the gRPC implementation for MonitoredSession. (Jonathan Hseu, 2017-02-22)
| | | | | | | | | | Also satisfy the constraint that close() doesn't throw in _WrappedSession, as mentioned in its docs. gRPC internally throws UnavailableError, which we convert and propagate. Fixes #7767 and #6780. Change: 148269127
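The PS-recovery fix (2017-02-22) combines two behaviors. A failed parameter-server task surfaces as `UnavailableError` from gRPC, and closing the wrapped session must not itself raise. Below is a minimal pure-Python sketch of a retry-and-recreate wrapper, assuming a caller-supplied session factory. `RecoverableSessionSketch`, the local `UnavailableError` class, and the retry limit are hypothetical stand-ins, not TensorFlow's classes.

```python
class UnavailableError(Exception):
    """Local stand-in for a transient RPC failure (e.g. a restarted PS task)."""


class RecoverableSessionSketch:
    """Hypothetical wrapper that recreates its session after UnavailableError."""

    def __init__(self, create_session, max_retries=3):
        self._create_session = create_session
        self._max_retries = max_retries
        self._sess = create_session()

    def run(self, fetch):
        failures = 0
        while True:
            try:
                return self._sess.run(fetch)
            except UnavailableError:
                failures += 1
                if failures > self._max_retries:
                    raise
                # Drop the broken session and start a fresh one.
                self.close()
                self._sess = self._create_session()

    def close(self):
        # close() itself must not throw, even if the connection is gone.
        try:
            self._sess.close()
        except UnavailableError:
            pass


failures_left = [2]  # the first two run() calls hit a "restarted PS task"


class FlakySession:
    def run(self, fetch):
        if failures_left[0] > 0:
            failures_left[0] -= 1
            raise UnavailableError("PS task restarted")
        return fetch

    def close(self):
        raise UnavailableError("connection already gone")


created = []


def make_session():
    created.append(FlakySession())
    return created[-1]


sess = RecoverableSessionSketch(make_session)
result = sess.run("train_op")
sess.close()  # the close error is swallowed, so this does not raise
print(result, len(created))  # train_op 3
```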
* Add an option to disable the error for non-stopped threads in Coordinator.join(). (Mustafa Ispir, 2017-02-13)
| | | | | | | There are queue runner threads which are stuck in session.run. MonitoredSession calls session.close after calling coordinator.join. Session.close will kill those stuck threads. Added argument 'stop_grace_period' to MonitoredSession and its variants. It will be used in unit tests to reduce timeouts. Change: 147398900
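The 2017-02-13 change describes a sequence: ask threads to stop, wait a bounded grace period, optionally tolerate threads that are still blocked, then close the session to release them. Below is a sketch of that sequence using the standard `threading` module. `join_with_grace` is a hypothetical helper, and its parameter names and 120-second default are assumptions chosen to echo the commit message, not a reproduction of `Coordinator.join()`.

```python
import threading
import time


def join_with_grace(threads, stop_event, stop_grace_period_secs=120,
                    ignore_live_threads=False):
    """Signal stop, wait up to the grace period, then report stragglers."""
    stop_event.set()
    deadline = time.monotonic() + stop_grace_period_secs
    for t in threads:
        # Every join shares one overall deadline.
        t.join(max(0.0, deadline - time.monotonic()))
    stragglers = [t.name for t in threads if t.is_alive()]
    if stragglers and not ignore_live_threads:
        raise RuntimeError("Threads did not stop: %s" % stragglers)
    return stragglers


stop = threading.Event()
unblock = threading.Event()  # plays the role of session.close()

# "polite" honors the stop signal; "stuck" ignores it, like a thread
# blocked inside session.run.
polite = threading.Thread(target=stop.wait, name="polite", daemon=True)
stuck = threading.Thread(target=unblock.wait, name="stuck", daemon=True)
for t in (polite, stuck):
    t.start()

leftover = join_with_grace([polite, stuck], stop,
                           stop_grace_period_secs=0.2,
                           ignore_live_threads=True)
unblock.set()  # "closing the session" releases the stuck thread
stuck.join()
print(leftover)  # ['stuck']
```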
* Added summary-save-sec option to the MonitoredTrainingSession, which is commonly used in example models. (Mustafa Ispir, 2017-01-17)
| | | | | | Change: 144721588
* Add tf.tables_initializer as a replacement for tf.initialize_all_tables, and update callers in tensorflow. (A. Unique TensorFlower, 2017-01-15)
| | | | | | | Deprecated tf.initialize_all_tables. Change: 144575599
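The 2017-01-15 change replaces one public function with another while keeping the old name working, which is the usual deprecation-alias pattern. Below is a pure-Python sketch of that pattern, assuming a hypothetical `deprecated_alias` helper and a placeholder `tables_initializer` that returns a string rather than a graph op. TensorFlow has its own deprecation utilities, and they are not reproduced here.

```python
import functools
import warnings


def tables_initializer(name="init_all_tables"):
    """Hypothetical placeholder: returns a description instead of an op."""
    return "group(table initializers, name=%r)" % name


def deprecated_alias(new_func, old_name):
    """Wrap new_func so calls through the old name emit a DeprecationWarning."""
    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn("%s is deprecated; use %s instead."
                      % (old_name, new_func.__name__),
                      DeprecationWarning, stacklevel=2)
        return new_func(*args, **kwargs)
    return wrapper


initialize_all_tables = deprecated_alias(tables_initializer,
                                         "initialize_all_tables")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    op = initialize_all_tables()

print(caught[0].message)
```

Because the alias forwards to the new function, old callers get identical results and a warning, while updated callers use the new name directly.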
* Run SyncReplicasOptimizer with MonitoredSession. (Mustafa Ispir, 2017-01-12)
| | | | | | | | | | | User code will look as follows:
    opt = tf.train.SyncReplicasOptimizer(...)
    train_op = opt.minimize(total_loss, global_step=global_step)
    sync_rep_hook = opt.make_session_run_hook(is_chief)
    with tf.train.MonitoredTrainingSession(
        master=master, is_chief=is_chief, hooks=[sync_rep_hook]) as mon_sess:
      while not mon_sess.should_stop():
        mon_sess.run(train_op)
    Change: 144353039