| Commit message | Author | Age |
| |
* Use 'LABEL maintainer=' in Dockerfile
This fix is a follow-up to 13661 and replaces `MAINTAINER`
with `LABEL maintainer=` in Dockerfile. The keyword
`MAINTAINER` has long been deprecated in favor of `LABEL`,
which is much more flexible and is easily searchable through `docker inspect`.
This fix replaces the remaining `MAINTAINER` instructions with `LABEL`.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
* Additional `MAINTAINER` -> `LABEL`
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
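The commit message does not say how the replacement was carried out. As a minimal sketch, assuming each `MAINTAINER` instruction sits on its own line in the form `MAINTAINER Name <email>`, a `sed` filter along these lines would perform the rewrite. The function name `convert_maintainer` and the sample Dockerfile contents are hypothetical.

```shell
# Hypothetical helper (not from the commit): rewrite
#   MAINTAINER Name <email>
# as
#   LABEL maintainer="Name <email>"
# Reads a Dockerfile on stdin and writes the converted text to stdout.
convert_maintainer() {
  sed -E 's/^MAINTAINER[[:space:]]+(.*)$/LABEL maintainer="\1"/'
}

# Example input; the base image and maintainer are made up for illustration.
printf 'FROM ubuntu:16.04\nMAINTAINER Jane Doe <jane@example.com>\n' | convert_maintainer
```

Lines that do not start with `MAINTAINER` pass through unchanged, so the filter can be applied to a whole Dockerfile.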
| |
This makes it possible to test the OSS GRPC distributed runtime in
dist_test/remote_test.sh against a release build.
Usage example:
1. Build the server using a release whl file. (Obviously this means that
the Linux CPU PIP release build has to pass first.)
$ export DOCKER_VERSION_TAG="0.11.0rc1"
$ tensorflow/tools/dist_test/build_server.sh \
    tensorflow/tf_grpc_test_server:${DOCKER_VERSION_TAG} \
    http://ci.tensorflow.org/view/Release/job/release-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-${DOCKER_VERSION_TAG}-cp27-none-linux_x86_64.whl \
    --test
2. Run remote_test.sh:
$ export TF_DIST_DOCKER_NO_CACHE=1
$ export TF_DIST_SERVER_DOCKER_IMAGE="tensorflow/tf_grpc_test_server:${DOCKER_VERSION_TAG}"
$ export TF_DIST_GCLOUD_PROJECT="my-project"
$ export TF_DIST_GCLOUD_COMPUTE_ZONE="my-zone"
$ export TF_DIST_CONTAINER_CLUSTER="my-cluster"
$ export TF_DIST_GCLOUD_KEY_FILE="/path/to/my/key.json"
$ tensorflow/tools/dist_test/remote_test.sh \
    "http://ci.tensorflow.org/view/Release/job/release-matrix-cpu/TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-${DOCKER_VERSION_TAG}-cp27-none-linux_x86_64.whl"
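Both steps pass the same long wheel URL, which differs only in the version tag. As a sketch, the URL could be derived once from the tag so the two commands cannot drift apart. The helper name `whl_url` is hypothetical and is not part of the dist_test scripts; the URL pattern is copied from the usage example.

```shell
# Hypothetical helper: print the release wheel URL for a given version tag.
# The body runs in a subshell so its variables do not leak to the caller.
whl_url() (
  tag="$1"
  base="http://ci.tensorflow.org/view/Release/job/release-matrix-cpu"
  cfg="TF_BUILD_CONTAINER_TYPE=CPU,TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON2,label=cpu-slave"
  printf '%s/%s/lastSuccessfulBuild/artifact/pip_test/whl/tensorflow-%s-cp27-none-linux_x86_64.whl\n' \
    "$base" "$cfg" "$tag"
)

DOCKER_VERSION_TAG="0.11.0rc1"
WHL_URL="$(whl_url "$DOCKER_VERSION_TAG")"
echo "$WHL_URL"
```

`"$WHL_URL"` could then be passed to build_server.sh and remote_test.sh in place of the literal URL.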
| |
Change: 132313453
| |
Change: 130741928
| |
Change: 128958134
| |
Change: 126335170
| |
Change: 123889091
| |
Change: 121586635
| |
Change: 120185825
| |
Usage example: ./remote_test.sh --num-workers 3 --sync-replicas
Also changed:
1) In local and remote tests, let different workers contact separate GRPC
sessions.
2) In local and remote tests, add the ability to specify the number of
workers. Previously it was hard-coded to 2.
Usage example:
./remote_test.sh --num-workers 2 --sync-replicas
3) Using device setter in mnist_replica.py
Change: 119599547
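The commit message does not show how remote_test.sh handles these flags. A minimal sketch of parsing `--num-workers` and `--sync-replicas` in POSIX shell might look like the following. The function name `parse_args`, the variable names, and the default of 2 workers (chosen to match the previously hard-coded value) are assumptions.

```shell
# Assumed defaults: 2 workers (the previously hard-coded value), async training.
NUM_WORKERS=2
SYNC_REPLICAS=0

# Hypothetical parser; the real script's argument handling may differ.
parse_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --num-workers)
        # The flag needs a value after it.
        if [ "$#" -lt 2 ]; then
          echo "--num-workers requires a value" >&2
          return 1
        fi
        NUM_WORKERS="$2"
        shift 2
        ;;
      --sync-replicas)
        SYNC_REPLICAS=1
        shift
        ;;
      *)
        echo "Unknown argument: $1" >&2
        return 1
        ;;
    esac
  done
}

parse_args --num-workers 3 --sync-replicas
echo "workers=${NUM_WORKERS} sync_replicas=${SYNC_REPLICAS}"
```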
|
See README.md for detailed descriptions of the usage of the tools and tests in this changeset.
Three modes of testing are supported:
1) Launch a local Kubernetes (k8s) cluster and run the test suites on it
(See local_test.sh)
2) Launch a remote k8s cluster on Google Container Engine (GKE) and run the test suite on it
(See remote_test.sh)
3) Run the test suite on an existing k8s TensorFlow cluster
(Also see remote_test.sh)
Taking the remote test as an example, the following steps are performed:
1) Build a Docker image with gcloud and Kubernetes tools and the latest TensorFlow pip package installed (see Dockerfile)
2) Launch a Docker container based on that image (see test_distributed.sh)
3) From within the container, authenticate the gcloud user (with credential files mapped from outside the container), configure the k8s cluster, and launch a new k8s container cluster for TensorFlow workers
4) Generate a k8s (yaml) config file and use it to create a TensorFlow worker cluster consisting of a certain number of parameter servers (ps) and workers. The workers are exposed as external services with public IPs (see dist_test.sh)
5) Run a simple softmax MNIST model on multiple workers, with the model weights and biases located on the ps nodes. Train the models in parallel and observe the final validation cross entropy (see dist_mnist_test.sh)
Change: 117543657
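Step 4 depends on enumerating a configurable number of ps and worker tasks. As a rough sketch of that idea only, the helper below builds a comma-separated host list for a given task count. The function name `make_hosts`, the `tf-worker`/`tf-ps` name prefixes, and port 2222 are illustrative assumptions and are not taken from dist_test.sh or the generated yaml.

```shell
# Hypothetical helper: print "<prefix>0:<port>,<prefix>1:<port>,..." for
# <count> tasks. The body runs in a subshell to keep its variables local.
make_hosts() (
  prefix="$1"
  count="$2"
  port="$3"
  hosts=""
  i=0
  while [ "$i" -lt "$count" ]; do
    # Prepend a comma only when the list is already non-empty.
    hosts="${hosts:+${hosts},}${prefix}${i}:${port}"
    i=$((i + 1))
  done
  printf '%s\n' "$hosts"
)

# Assumed names and port, for illustration only.
make_hosts tf-worker 2 2222
make_hosts tf-ps 1 2222
```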
|