Add support for more than one device-to-device copy stream per GPU device.
This is an experimental feature that will have no effect unless
copy operations explicitly request a stream other than 0, which
currently does not occur anywhere in a standard build.
Eventually it may be of benefit in the presence of multiple
bi-directional concurrent data copies.
PiperOrigin-RevId: 202354513
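
The mechanism can be pictured as a small per-device pool of copy streams indexed by an integer, with index 0 remaining the default. The sketch below is illustrative only (hypothetical names, not the actual TensorFlow classes) of how a non-zero stream index might be requested once copy operations start asking for one.

  // Illustrative sketch only -- hypothetical names, not the actual TensorFlow
  // implementation. Shows the general shape of a per-GPU pool of
  // device-to-device copy streams in which index 0 stays the default, so
  // existing copy paths are unaffected until a caller asks for another stream.
  #include <vector>

  struct GpuStream {};  // stand-in for a platform stream handle

  class DeviceToDeviceStreamPool {
   public:
    explicit DeviceToDeviceStreamPool(int num_streams)
        : streams_(num_streams > 0 ? num_streams : 1) {}

    // Copies that do not request a stream get index 0 (today's behavior);
    // concurrent bi-directional copies could ask for distinct indices.
    GpuStream* Get(int stream_index = 0) {
      if (stream_index < 0 ||
          stream_index >= static_cast<int>(streams_.size())) {
        stream_index = 0;
      }
      return &streams_[stream_index];
    }

   private:
    std::vector<GpuStream> streams_;
  };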
Without this change, a shared resource (e.g. an Iterator) could not be
created in one session `s1` and then used in a later session `s2` after
`s1` was closed, because the iterator might indirectly capture devices
from the previous session and use them after they had been freed when the
`WorkerSession` was deleted.
The current change only affects the singleton "legacy" WorkerSession,
which is never deleted, but this is necessary to switch all sessions
to use separate WorkerSession objects.
PiperOrigin-RevId: 193541426
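
A minimal stand-in sketch of the lifetime hazard, assuming hypothetical types rather than the real TensorFlow classes: a resource that only holds a raw pointer into the creating session's device manager dangles once that WorkerSession is deleted, whereas shared ownership keeps the devices alive for a later session.

  // Stand-in types only; the point is the ownership difference, not the API.
  #include <memory>
  #include <string>
  #include <vector>

  struct Device { std::string name; };
  struct DeviceMgrLike { std::vector<Device> devices; };

  // A shared resource (e.g. an iterator) created in session s1, reused in s2.
  struct IteratorResource {
    // Shared ownership keeps the device manager (and the devices the iterator
    // captured) alive even after the WorkerSession that created it is gone.
    // A raw DeviceMgrLike* here would dangle once s1 is closed.
    std::shared_ptr<DeviceMgrLike> device_mgr;
  };

  struct WorkerSessionLike {
    std::shared_ptr<DeviceMgrLike> device_mgr =
        std::make_shared<DeviceMgrLike>();
  };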
Replace calls to deprecated methods with their tensorflow::str_util
equivalents. This will allow the deprecated methods to be removed.
PiperOrigin-RevId: 191350894
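
For illustration, a small before/after in the spirit of that cleanup; it assumes the tensorflow::str_util helpers (e.g. StartsWith/EndsWith) and the header paths from the TensorFlow tree of this era, which may have since moved.

  // Example of the rewrite pattern: prefer the str_util free functions over
  // the deprecated StringPiece member methods (starts_with()/ends_with()).
  #include "tensorflow/core/lib/core/stringpiece.h"
  #include "tensorflow/core/lib/strings/str_util.h"

  bool IsGpuDeviceName(tensorflow::StringPiece name) {
    // Before: return name.ends_with("GPU:0");
    return tensorflow::str_util::EndsWith(name, "GPU:0");
  }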
Adds a test in the variant_op_copy_test.
Modifies the base GPUDevice to use this registry if it sees a singleton variant.
Modifies the rendezvous manager to do the same.
PiperOrigin-RevId: 170908757
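
As a rough, self-contained sketch (hypothetical simplified types; the real TensorFlow registry is richer and also keyed by copy direction), the registry maps a variant's type name to a device-copy function that the device or rendezvous manager can look up when it encounters a singleton variant tensor.

  #include <functional>
  #include <string>
  #include <unordered_map>
  #include <utility>

  struct VariantValue {};  // stand-in for tensorflow::Variant

  // Copies a variant's payload to/from device memory; returns success.
  using VariantCopyFn = std::function<bool(const VariantValue&, VariantValue*)>;

  class VariantDeviceCopyRegistry {
   public:
    void Register(const std::string& type_name, VariantCopyFn fn) {
      fns_[type_name] = std::move(fn);
    }
    // Null if no copy function was registered for this variant type.
    const VariantCopyFn* Find(const std::string& type_name) const {
      auto it = fns_.find(type_name);
      return it == fns_.end() ? nullptr : &it->second;
    }

   private:
    std::unordered_map<std::string, VariantCopyFn> fns_;
  };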
PiperOrigin-RevId: 167401527
PiperOrigin-RevId: 162782660
The base_rendezvous_mgr handles transferring a tensor using DMAs to non-"host"
devices such as GPUs in the SameWorkerRecvDone function. This function would
use the worker_env's device_mgr to obtain a pointer to the relevant Device
(using the LookupDevice call). In the ClusterSpec-propagation world, using the
environment's device_set means that the devices are not renamed, often
resulting in devices not being found.
This change updates BaseRemoteRendezvous to use the WorkerSession stored when
the BaseRemoteRendezvous is initialized. The WorkerSession has a pointer to a
DeviceMgr that contains the appropriately renamed devices for the given
session the Rendezvous is associated with.
Note: because we have a fast-path host-device-only copy, the original bug does
not show up when using 2 CPU devices. I have added a test to ensure that
transferring between 2 CPU devices works in a ClusterSpec propagation session,
but note that this test does not actually reproduce the motivating bug.
In the process of writing a test for the original bug, I discovered another
latent bug in ClusterSpec propagation where if there were 2 CPU devices
(i.e. due to explicit server configuration to have 2 CPU devices), a DCHECK
could be triggered. The Master::CreateSession would call
`device_set->set_client_device` multiple times (once for each CPU device).
PiperOrigin-RevId: 162680163
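
A self-contained sketch of the lookup change, using stand-in types rather than the real TensorFlow classes: the rendezvous remembers the WorkerSession it was initialized with and resolves device names through that session's device manager, where the ClusterSpec-renamed devices actually live.

  #include <string>
  #include <unordered_map>

  struct Device { std::string name; };

  struct DeviceMgrLike {
    std::unordered_map<std::string, Device*> by_name;  // per-session, renamed
    Device* Lookup(const std::string& name) const {
      auto it = by_name.find(name);
      return it == by_name.end() ? nullptr : it->second;
    }
  };

  struct WorkerSessionLike { DeviceMgrLike* device_mgr = nullptr; };

  class RemoteRendezvousLike {
   public:
    void Initialize(WorkerSessionLike* session) { session_ = session; }

    // SameWorkerRecvDone-style path: consult the session's DeviceMgr so that
    // devices renamed by ClusterSpec propagation are found, instead of the
    // process-wide worker_env device manager.
    Device* LookupForCopy(const std::string& device_name) const {
      return session_ ? session_->device_mgr->Lookup(device_name) : nullptr;
    }

   private:
    WorkerSessionLike* session_ = nullptr;
  };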
ClusterSpec propagation is a capability upgrade for TensorFlow that should make
it much easier to (1) build distributed TensorFlow clusters, and (2) handle
node failures. The ClusterSpec propagation capability allows TensorFlow workers
to be booted independently of each other, and with no knowledge about others.
The client can then construct a ClusterDef (ClusterSpec) and send it
to the TF master at session creation. The master in turn propagates the
ClusterDef to all of the workers.
Change: 155159972
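
A hedged client-side sketch of that flow in C++, based on the ClusterDef/JobDef protos, the ConfigProto cluster_def field, and the NewSession API from the TensorFlow tree of this era; the hostnames are placeholders and the header paths should be treated as approximate.

  #include <memory>

  #include "tensorflow/core/protobuf/cluster.pb.h"
  #include "tensorflow/core/protobuf/config.pb.h"
  #include "tensorflow/core/public/session.h"
  #include "tensorflow/core/public/session_options.h"

  std::unique_ptr<tensorflow::Session> CreateClusterSession() {
    // The client describes the independently booted workers.
    tensorflow::ClusterDef cluster;
    tensorflow::JobDef* job = cluster.add_job();
    job->set_name("worker");
    (*job->mutable_tasks())[0] = "worker0.example.com:2222";
    (*job->mutable_tasks())[1] = "worker1.example.com:2222";

    // The ClusterDef is sent to the master at session creation; the master
    // then propagates it to every worker listed above.
    tensorflow::SessionOptions options;
    options.target = "grpc://worker0.example.com:2222";
    *options.config.mutable_cluster_def() = cluster;

    return std::unique_ptr<tensorflow::Session>(
        tensorflow::NewSession(options));
  }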
State in workers is currently splayed across graph_mgr, rendezvous_mgr, and
additional components. This has made it difficult to ensure proper
cleanup and shutdown of the worker components.
In addition to paving the way for more reliable shutdown, this CL also sets
up the beginnings of ClusterSpec propagation.
ClusterSpec propagation is a capability upgrade for TensorFlow that should make
it much easier to (1) build distributed TensorFlow clusters, and (2) handle
node failures. After the ClusterSpec propagation capability is fully
implemented, the TensorFlow workers can be booted independently of each other,
and with no knowledge about others. A client can then query a central cluster
scheduler or other API to find all of the workers, and then send the
ClusterDef (ClusterSpec) to the TF master, which then propagates that along to
all of the workers.
This change is only the first of a sequence to fully implement ClusterSpec
propagation in TensorFlow.
Change: 151229111
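
As an illustration of the consolidation, a stand-in sketch (hypothetical types, not the actual TensorFlow WorkerSession) of a per-session container that owns the state previously splayed across the worker components, so it can be torn down as a unit.

  #include <memory>
  #include <string>

  struct DeviceMgrLike {};
  struct GraphMgrLike {};
  struct RendezvousMgrLike {};

  struct WorkerSessionLike {
    std::string session_handle;                    // identifies the session
    std::unique_ptr<DeviceMgrLike> device_mgr;     // per-session device view
    std::unique_ptr<GraphMgrLike> graph_mgr;       // registered graphs
    std::unique_ptr<RendezvousMgrLike> rendezvous_mgr;  // pending sends/recvs
    // Destroying this object releases all per-session worker state at once,
    // which is what makes reliable cleanup and shutdown tractable.
  };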
Change: 135518672
For example:
- Unused variables/functions/typedefs
- Thread-safety annotations
Change: 133986225
- Use std::move when assigning a std::function to reduce some simple allocations.
- Use std::bind to avoid copying a std::function in lambda statements.
Change: 129654152
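
A self-contained illustration of the two patterns above, assuming a C++11-era codebase where lambdas cannot move-capture (which is why std::bind is used to move the std::function rather than copy it):

  #include <functional>
  #include <utility>

  class Dispatcher {
   public:
    // Moving the std::function into the member avoids copying its captured
    // state -- one fewer allocation when the callable has non-trivial captures.
    void SetDoneCallback(std::function<void(int)> done) {
      done_ = std::move(done);
    }

    // A lambda capturing done_ by value would copy the std::function and its
    // captured state; std::bind(std::move(done_), status) moves it instead.
    void Finish(int status) {
      std::function<void()> task = std::bind(std::move(done_), status);
      task();  // done_ has been consumed; Finish is one-shot
    }

   private:
    std::function<void(int)> done_;
  };

  int main() {
    Dispatcher d;
    d.SetDoneCallback([](int s) { (void)s; });
    d.Finish(0);
  }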
Reduce the number of memory allocations done by models distributed across many devices. A small
microbenchmark model that runs two banks (A and B) of 30 nodes with a
30x30 full shuffle between them, where each of the nodes in A and in B
run with one node on each of the 30 devices (so 30*29+30+30, or ~930
separate RPCs) was showing ~111,000 allocations per iteration of the graph.
With the changes here, this is now down to ~64,300 allocations per iteration.
Changes include:
o DeviceContext::CopyDeviceTensorToCPU and related helper routines:
use StringPiece instead of const string& for the tensor name (avoids
creating a string in some cases where the caller only has a
StringPiece available).
o Change some Rendezvous and BaseRemoteRendezvous interfaces to
take a 'const Rendezvous::ParsedKey& key', rather than 'const string& key'.
In many cases, the callers were already having to parse the key
into a ParsedKey, and so we were doing the parsing multiple times at
different levels as we processed receiving or sending of a tensor. This
reduces the number of times that we parse a key as it flows from a Send
node through to a Recv node on another worker.
o Changed Rendezvous::ParsedKey so that it makes a copy of the underlying
full key, and then uses StringPiece objects to point into this copy for
the src_device, dst_device, and edge_name pieces. This turns 3 string
allocations into 1 per Rendezvous::ParseKey call.
o Added new StringPiece Rendezvous::ParsedKey::FullKey() accessor to
return a StringPiece for the underlying full key, and used that in a
few places (mostly logging) where that is useful.
o In many places, used std::move(function_variable) when assigning to
an instance variable. This eliminates a very large number of excess
std::function allocations/initializations (~56000 of the baseline
allocations were related to std::function setup or cloning, and this
is now down to ~11000 after this cl).
o In the RPC-based remote workers (StubbyRemoteWorker and
GrpcRemoteWorker), changed the code path in RecvTensorAsync to avoid
creation of a std::function with 6 arguments unless necessary. There
are three cases now handled separately:
(a) We're not logging, and we didn't make a copy of the request that we
need to free: just use the passed in 'StatusCallback done' object
directly, without creating a wrapper std::function object at all
(b) We're not logging, but we made a copy of the request that we
need to free: we create a simple wrapper std::function that
invokes the passed in 'done' callback, and then frees the
req_copy request copy object.
(c) We're logging: we create the std::function object with all the
necessary state to log when the recv has finished.
o Changed DeviceMgr::LookupDevice to take a StringPiece, rather than a
const string&, and changed the hash table to use StringPiece keys.
This allows clients that just have a StringPiece device name in their
hand to avoid a string creation to lookup the Device* object.
o Changed ExecutorState to use a specialized TaggedNodeReadyQueue that
internally uses a gtl::InlinedVector<TaggedNode, 16>, rather than
using a std::deque<TaggedNode> for keeping track of nodes ready to
execute. This is faster because it avoids allocations entirely if the
ready node queue doesn't get bigger than 16, and inlined vectors are
generally faster than std::deque, at a minor risk of using more memory
if this queue grows to very large numbers of ready nodes (mostly imaginable
only in pathological graphs).
o In ExecutorState::Process, allocated a single ExecutorState::AsyncState
object to keep track of all the state we need to preserve for an asynchronously
executed node, rather than keeping this state implicitly via a very large
number of arguments to a lambda function.
o Added a new std::atomic<bool> status_is_ok_ in
BaseRemoteRendezvous. This allows us to avoid acquiring the lock when
we just want to check if the status is non-OK in
BaseRemoteRendezvous::Send and BaseRemoteRendezvous::ValidateDevices.
o In GraphMgr::RunAllDone, changed assignment of args.runner to avoid
one extra level of std::function indirection (binding the function directly
to the ThreadPool::Schedule routine, rather than creating an intermediate
lambda function that invokes this inside the body of the lambda).
o Added freelist of RpcRecvTensorCall objects in
third_party/tensorflow/core/distributed_runtime/rpc/rpc_rendezvous_mgr.cc
o Changed third_party/tensorflow/core/framework/rendezvous.cc to keep the
hashtable of Item* objects keyed by uint64 (hash of the tensor name), rather
than the full-string tensor name. Collisions in the 64-bit hash space
should basically never happen.
o Sped up DeviceNameUtils::ParseFullName by optimizing for the common
ordering of parts of /job, /replica, /task, /device. The parsing code
was general enough to handle any order, but did so by comparing the
prefixes 4, 3, 2, and 1 times, respectively, rather than 1, 1, 1, and 1 times.
o Sped up DeviceNameUtils::SplitDeviceName to avoid extra string copies.
Change: 125991891
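
One of the items above, the ParsedKey layout, is easy to picture with a self-contained sketch (C++17 stand-in types, not the real tensorflow::Rendezvous::ParsedKey): one owned copy of the full key, with the src/dst/edge pieces stored as views into that copy, so a parse costs one string allocation instead of three.

  #include <string>
  #include <string_view>

  class ParsedKeyLike {
   public:
    // One owned copy of the full key; the key parser would fill in offsets.
    explicit ParsedKeyLike(std::string full_key) : buf_(std::move(full_key)) {}

    void SetPieces(size_t src_pos, size_t src_len, size_t dst_pos,
                   size_t dst_len, size_t edge_pos, size_t edge_len) {
      std::string_view all(buf_);
      src_device_ = all.substr(src_pos, src_len);
      dst_device_ = all.substr(dst_pos, dst_len);
      edge_name_ = all.substr(edge_pos, edge_len);
    }

    std::string_view FullKey() const { return buf_; }  // handy for logging
    std::string_view src_device() const { return src_device_; }
    std::string_view dst_device() const { return dst_device_; }
    std::string_view edge_name() const { return edge_name_; }

   private:
    std::string buf_;              // the only per-parse string allocation
    std::string_view src_device_;  // views into buf_, valid while buf_ lives
    std::string_view dst_device_;
    std::string_view edge_name_;
  };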
Change: 123900938
distributed_runtime to common_runtime in preparation for being shared by
DirectSession.
Change: 121339508
This includes a gRPC server (grpc_tensorflow_server) that can serve as both
the master of a distributed TensorFlow computation, and an individual worker
in the computation. The GrpcSession class is included to allow client programs
(including Python clients) to interact with a server.
See tensorflow/core/distributed_runtime/README.md for usage instructions.
This change partially addresses issue #23.
Change: 115634191
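
For reference, a hedged sketch of the client side described above: a C++ program reaches a running grpc_tensorflow_server by handing a "grpc://" target to NewSession, which selects the GrpcSession implementation; host and port here are placeholders.

  #include <memory>

  #include "tensorflow/core/public/session.h"
  #include "tensorflow/core/public/session_options.h"

  std::unique_ptr<tensorflow::Session> ConnectToCluster() {
    tensorflow::SessionOptions options;
    options.target = "grpc://localhost:2222";  // address of the running server
    return std::unique_ptr<tensorflow::Session>(
        tensorflow::NewSession(options));
  }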