Commit log
Speed up MatMul on GPUs in the following cases:
a) The inner-most dimension of b is 1, i.e. the operation is (possibly a batch of) matrix-vector multiplication(s). This is accomplished by calling cuBLAS GEMV rather than GEMM, which speeds up large matrix-vector products by about 4x.
b) One or more dimensions are unknown at graph construction time, but the operation is in fact a single matrix*matrix or matrix*vector multiplication.
The following benchmark numbers, illustrating the improvements for matrix-vector
products, were measured on an NVIDIA Titan X (Maxwell) card.
Benchmark Base (ns) New (ns) Improvement
----------------------------------------------------------------------------
BM_Matmul_50_50_1_false_false_DT_FLOAT_gpu 18102 17056 +5.8%
BM_Matmul_50_50_1_true_false_DT_FLOAT_gpu 18108 16374 +9.6%
BM_Matmul_50_50_1_false_true_DT_FLOAT_gpu 18153 17173 +5.4%
BM_Matmul_50_50_1_true_true_DT_FLOAT_gpu 18150 15950 +12.1%
BM_Matmul_500_500_1_false_false_DT_FLOAT_gpu 64605 16874 +73.9%
BM_Matmul_500_500_1_true_false_DT_FLOAT_gpu 62810 17298 +72.5%
BM_Matmul_500_500_1_false_true_DT_FLOAT_gpu 60447 17014 +71.9%
BM_Matmul_500_500_1_true_true_DT_FLOAT_gpu 58443 16934 +71.0%
BM_Matmul_2000_2000_1_false_false_DT_FLOAT_gpu 343298 81898 +76.1%
BM_Matmul_2000_2000_1_true_false_DT_FLOAT_gpu 294738 63723 +78.4%
BM_Matmul_2000_2000_1_false_true_DT_FLOAT_gpu 300671 83650 +72.2%
BM_Matmul_2000_2000_1_true_true_DT_FLOAT_gpu 284540 63742 +77.6%
Change: 150456725
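The shape-based dispatch described in case a) can be sketched as follows. This is a hedged illustration in NumPy, not the actual TensorFlow kernel code; the function name is hypothetical, and the einsum stands in for the cuBLAS GEMV call.

```python
import numpy as np

def matmul_maybe_gemv(a, b):
    # Hypothetical sketch of the dispatch described above: when the
    # inner-most dimension of b is 1, the product is (possibly a batch
    # of) matrix-vector multiplications, so a GEMV-style path can be
    # taken instead of the general GEMM path.
    if b.shape[-1] == 1:
        # GEMV path: drop the trailing unit dimension, multiply each
        # matrix by its vector, then restore the column shape.
        return np.einsum('...ij,...j->...i', a, b[..., 0])[..., np.newaxis]
    # General GEMM path.
    return a @ b
```

Both branches produce the same values; the win on GPU comes from the specialized GEMV kernel, not from a mathematical difference.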
… imaginary part. Fix a bug in the test util function assertAllCloseAccordingToType, which wasn't picking up the right values for complex64.
Change: 149016078
Change: 142080137
Change: 139370036
This implements the same basic workaround as cl/128009436, namely to use the regular Eigen matmul kernel instead of the Eigen tensor contraction when not parallelizing the inner matrix products in BatchMatMul.
Run on rmlarsen3.mtv (12 X 3501 MHz CPUs); 2016-09-22T14:22:52.507929557-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark Base (ns) New (ns) Improvement
------------------------------------------------------------------
BM_BatchMatmul_1_1_1024_1024_false_false_DT_FLOAT_cpu 179300 122010 +32.0%
BM_BatchMatmul_2_1_1024_1024_false_false_DT_FLOAT_cpu 209037 162153 +22.4%
BM_BatchMatmul_8_1_1024_1024_false_false_DT_FLOAT_cpu 906019 946502 -4.5%
BM_BatchMatmul_32_1_1024_1024_false_false_DT_FLOAT_cpu 3814403 4473018 -17.3%
BM_BatchMatmul_1_10000_200_1_false_false_DT_FLOAT_cpu 322285 252677 +21.6%
BM_BatchMatmul_8_10000_200_1_false_false_DT_FLOAT_cpu 2370631 2028039 +14.5%
BM_BatchMatmul_32_10000_200_1_false_false_DT_FLOAT_cpu 8994979 12904697 -43.5%
BM_BatchMatmul_1_10000_200_1_true_false_DT_FLOAT_cpu 663253 223017 +66.4%
BM_BatchMatmul_8_10000_200_1_true_false_DT_FLOAT_cpu 5731654 2266151 +60.5%
BM_BatchMatmul_32_10000_200_1_true_false_DT_FLOAT_cpu 18692987 12063885 +35.5%
BM_BatchMatmul_1_10000_200_1_false_true_DT_FLOAT_cpu 318234 251075 +21.1%
BM_BatchMatmul_8_10000_200_1_false_true_DT_FLOAT_cpu 2355295 2032887 +13.7%
BM_BatchMatmul_32_10000_200_1_false_true_DT_FLOAT_cpu 8997442 11618660 -29.1%
BM_BatchMatmul_1_10000_200_1_true_true_DT_FLOAT_cpu 652865 225256 +65.5%
BM_BatchMatmul_8_10000_200_1_true_true_DT_FLOAT_cpu 5700875 2383607 +58.2%
BM_BatchMatmul_32_10000_200_1_true_true_DT_FLOAT_cpu 18957878 12451622 +34.3%
BM_BatchMatmul_1_1_200_10000_false_false_DT_FLOAT_cpu 288420 226135 +21.6%
BM_BatchMatmul_8_1_200_10000_false_false_DT_FLOAT_cpu 2155747 2406166 -11.6%
BM_BatchMatmul_32_1_200_10000_false_false_DT_FLOAT_cpu 10031700 12248817 -22.1%
BM_BatchMatmul_1_1_200_10000_true_false_DT_FLOAT_cpu 298456 226108 +24.2%
BM_BatchMatmul_8_1_200_10000_true_false_DT_FLOAT_cpu 2096256 2409435 -14.9%
BM_BatchMatmul_32_1_200_10000_true_false_DT_FLOAT_cpu 10259905 12408712 -20.9%
BM_BatchMatmul_1_1_200_10000_false_true_DT_FLOAT_cpu 1657311 254414 +84.6%
BM_BatchMatmul_8_1_200_10000_false_true_DT_FLOAT_cpu 5976722 2031486 +66.0%
BM_BatchMatmul_32_1_200_10000_false_true_DT_FLOAT_cpu 23514286 11622619 +50.6%
BM_BatchMatmul_1_1_200_10000_true_true_DT_FLOAT_cpu 1653482 250161 +84.9%
BM_BatchMatmul_8_1_200_10000_true_true_DT_FLOAT_cpu 5951562 2032097 +65.9%
BM_BatchMatmul_32_1_200_10000_true_true_DT_FLOAT_cpu 23587247 11633259 +50.7%
Change: 134814248
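The workaround described above can be sketched like this. A hedged illustration in NumPy under the stated assumption (the sequential path runs one plain 2-D matmul per batch slice rather than one tensor contraction over the whole batch); the function name is hypothetical and this is not the Eigen implementation itself.

```python
import numpy as np

def batch_matmul_sequential(x, y):
    # Sketch of the non-parallelized path: loop over the batch and run
    # a regular 2-D matrix multiply (GEMM) per slice, instead of a
    # single tensor contraction over the whole batch.
    out = np.empty((x.shape[0], x.shape[1], y.shape[2]), dtype=x.dtype)
    for i in range(x.shape[0]):
        out[i] = x[i] @ y[i]  # plain 2-D GEMM per batch element
    return out
```

As the benchmark table shows, the per-slice GEMM path wins for small batch counts, while large batches can still benefit from the contraction's parallelism.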
Clean up tests and extend coverage to all supported types.
Change: 133766358
Change: 132750089
Change: 127386123
Change: 123900456
Change: 114374558
Change 109321497
Move all images to images directory to make docs versioning easier
- adjust all paths in the docs to point to the new locations
- remove some now redundant section-order tags added for the old website
Change 109317807
Added a kernel op to compute the eigendecomposition of a self-adjoint matrix.
Added a new kernel op called self_adjoint_eig (and a batch_self_adjoint_eig) that
computes the eigendecomposition of a self-adjoint matrix. The return value is
the concatenation of the eigenvalues as a row vector, and the eigenvectors.
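The packed return format described above can be illustrated with NumPy's eigh. This is a hypothetical helper showing only the layout, not the kernel itself: for an n x n self-adjoint matrix, an (n+1) x n array whose first row holds the eigenvalues and whose remaining rows hold the eigenvectors.

```python
import numpy as np

def self_adjoint_eig_packed(a):
    # Illustration of the packed layout: eigenvalues as the first row,
    # eigenvectors stacked below them.
    w, v = np.linalg.eigh(a)
    return np.vstack([w[np.newaxis, :], v])
```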
Change 109310773
Change `_read32()` in the MNIST input example to return an int.
Currently we return a 1-D numpy array with 1 element. Numpy has
recently deprecated the ability to treat this as a scalar, and as a
result this tutorial fails. The fix returns the 0th element of the
array instead.
Change 109301269
Re-arrange TensorBoard demo files.
Change 109273589
add ci_build for ci.tensorflow.org
Change 109260293
Speed up NodeDef -> OpKernel process by not spending time generating
an error message for missing "_kernel" attr that will be thrown away.
Change 109257179
TensorFlow: make event_file_loader_test hermetic by using tempfile
instead of fixed filenames. Without this change, running
event_file_loader_test twice in the same client (locally)
causes it to fail, because it writes into the same file and appends
another event, instead of starting from scratch.
Change 109256464
Minor cleanup in TensorBoard server code
Change 109255382
Change to reduce critical section times in gpu_event_mgr.h:
(1) Call stream->ThenRecordEvent outside the EventMgr critical section
(2) Do memory deallocation outside the critical section
Speeds up one configuration of ptb_word_lm from 2924 words per
second (wps) to 3278 wps on my desktop machine with a Titan X.
Change 109254843
Fix use of uninitialized memory in test.
Change 109250995
python_config.sh needs a license header
Otherwise the license test fails.
Change 109249914
add ci_build for ci.tensorflow.org
Change 109249397
Fixes segfaults in reduce_sum (complex) on GPU.
Fixes #357
Change 109245652
add ci_build for ci.tensorflow.org
Base CL: 109321563
Change:
Clean up documentation for ReverseSequence
Change:
Updated several TensorFlow operations to use 32-bit indices on GPU.
Change:
Add attribute batch_dim to ReverseSequenceOp.
Change:
Fix error in convert_to_records.py. As reported in
https://github.com/tensorflow/tensorflow/issues/370
by AlexUnderMicrocontRoll.
Change:
Update TensorBoard README.
Change:
Fixes to boolean flags reported in
https://github.com/tensorflow/tensorflow/issues/379. Supports:
--bool_flag=True --> True
--bool_flag=False --> False
--bool_flag=gibberish --> False
--bool_flag --> True
--nobool_flag --> False
Fixes #379
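The flag semantics listed above can be sketched as a small parser. This is a hypothetical illustration, not the actual tf.flags source: an explicit value is true only if it is the string "True" (anything else, including gibberish, is false); a bare --flag is true and --noflag is false.

```python
def parse_bool_flag(arg, name):
    # Returns True/False if arg sets the named boolean flag, or None
    # if arg does not refer to this flag at all.
    if arg == '--' + name:
        return True          # bare --flag
    if arg == '--no' + name:
        return False         # --noflag negation
    prefix = '--' + name + '='
    if arg.startswith(prefix):
        # Only the exact string "True" parses as true.
        return arg[len(prefix):] == 'True'
    return None
```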
Change:
Update generated Op docs.
Change:
Enable local development of TensorBoard using gulp
Also make tf-tensorboard a regular component rather than special case
This is mostly accomplished by creating tfserve.js, which is a small server
with clever routing to load from bower_components/ and components/ using
the paths that work within google3.
Workflow: `gulp serve`
Change:
Add a full working code example to the tensorboard and summaries tutorial
Change:
Fix seq2seq_test when running on GPU.
The "proj_w" and "proj_b" variables were being created before the
`test_session()`'s device function took effect, which pushed the
placement algorithm into making an incorrect decision.
Change:
Add a sentence in TensorBoard README on how to serialize summary data to logs and provide link to the how-to tutorial on the TensorFlow website.
Change:
Add error-catching code if string_input_producer is supplied a null input.
Before this change, it would die with an opaque shape error from inside
the queue. This change catches (most) Python empty lists being
passed directly in, and at runtime detects null tensors.
Adds two tests for this to input_test.py
Change:
Speed up models that use the same variable multiple times in the case
where variables must be copied across devices:
- Have Variables wrap the Variable op in an Identity op when converted to Tensor.
This avoids multiple copies across devices if a variable is used multiple times
in a computation.
- Add Variable.mutable() to return the non-wrapped Variable op, for use when
assigning new values.
- Add an as_ref parameter to convert_to_tensor() to allow callers to specify
whether they plan to assign a new value to the result of the conversion. Make Variable
return the result of Variable.mutable() when as_ref is True.
- Make all ops that assign values to variables pass as_ref=True when converting
their arguments.
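The conversion rule described above can be sketched in plain Python. This is a hypothetical mock, not TensorFlow's Variable class: reads go through an Identity-style snapshot (which can be cached per device instead of copied per use), while assigning callers pass as_ref=True and get the variable itself.

```python
def identity(x):
    # Stand-in for the Identity op: a read-only snapshot of the value.
    return list(x)

class Variable:
    """Hypothetical sketch of the Tensor-conversion rule above."""

    def __init__(self, value):
        self._value = value  # stands in for the underlying Variable op

    def mutable(self):
        # Return the non-wrapped Variable op, used when assigning.
        return self._value

    def as_tensor(self, as_ref=False):
        # as_ref=True: caller plans to assign, so hand back the
        # variable itself; otherwise return an Identity-wrapped read.
        return self._value if as_ref else identity(self._value)
```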
Change:
Change to reduce critical section times in gpu_event_mgr.h:
(1) Call stream->ThenRecordEvent outside the EventMgr critical section
(2) Do memory deallocation outside the critical section
Speeds up one configuration of ptb_word_lm from 2924 words per
second (wps) to 3278 wps on my desktop machine with a Titan X.
Change:
Remove some colons that break the open source build
::tensorflow::StringPiece breaks for @raingo, see
https://github.com/tensorflow/tensorflow/issues/358.
tensorflow::StringPiece (without the leading colons)
seems to fix the problem.
Change:
Added a check that the inputs to Operation are a list, and made a defensive copy of the input. This is for cases where the input list is changed, such as in _add_input.
Change:
Use standard names for TensorFlow dtypes in the tutorial.
Change:
Add tests for tensor inputs.
Change:
Fix build after declaring more types for ops
Change:
Switch to 32 bit indexing to speedup convolutions and concatenations.
Change:
Add convert_image op to convert between types for images (similar to OpenCV's cvtScale).
Change:
Make cast work between numeric types (bool, uint8, int16, int32, int64, float, double).
Change:
Padding input data for odd number of paddings, so we can use cudnn anyway.
+ Fix total padding computation when padding==VALID.
+ This CL makes the Googlenet benchmark run 5x faster.
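The padding arithmetic involved can be sketched as follows. This is a hedged illustration of the standard SAME/VALID convolution padding rules, not the exact TensorFlow source: VALID uses no padding, SAME pads so the output has ceil(in_size / stride) positions, and an odd total is split asymmetrically so a symmetric-padding backend (e.g. cuDNN) can still be used after padding the input explicitly.

```python
import math

def total_padding(in_size, kernel, stride, padding):
    # Standard conv padding rules: VALID pads nothing; SAME pads just
    # enough for the kernel to cover ceil(in_size / stride) outputs.
    if padding == 'VALID':
        return 0
    out = math.ceil(in_size / stride)
    return max((out - 1) * stride + kernel - in_size, 0)

def split_padding(total):
    # An odd total puts the extra element on one side, so the padded
    # input becomes symmetric from the backend's point of view.
    before = total // 2
    return before, total - before
```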
Change:
Support IndexedSlices in ConcatGrad
Change:
* sampled softmax op uses one embedding lookup for positive and negative samples
* float64 support for sampled softmax
Change:
Move RNN code out of models.rnn (without breaking existing code). The API may still undergo minor changes, until full documentation is added.
Change:
Changed to use per-step stacks for the accumulators used in while-loop gradient computation. This addresses the problem caused by using concat without sufficient static shape information. It should also improve performance, since it avoids those expensive concats.
Change:
Update generated Op docs.
Change:
Improve error messages when the optimizer finds no variables to minimize or
when none of the variables has gradients.
Change:
Say that -1 isn't just for flattening in reshape docs
Also add scalar reshape (reshape(t, [])) as an example.
This fixes https://github.com/tensorflow/tensorflow/issues/281.
Change:
This is a test.
Base CL: 109118714
Changes:
* error message that refers to removed `DefaultSession` method.
* -Wnull-conversion warnings
* the "_start_time" attr for recvs when the flag "--brain_enable_scheduling_for_recvs" is set.
* typo in tutorial data download progress message.
* a typo ("however their installing"=>"however installing").
* typo, rename "TensorFlow Mechanics" to "How To" to be consistent with the website.
* a typo ("subtact"=>"subtract").
* protobuf examples in comments in tensorflow::Example.proto.
* formula formatting in MNIST beginner tutorial
* negative fraction-of-queue-full stats
* protobuf inclusion path so that Android demo will build under Blaze.
* small typo (moderatly > moderately)
* Session.run() to check that tensor arguments come from the session's graph.
* another six import
* seq2seq typo in bazel command
Base CL: 108349164
Changes:
- futurize --stage2 changes for Python 3 compatibility by @girving.
- Small updates to documentation by @vrv, schuster and others
- Account for failure of std::thread::hardware_concurrency by @ebrevdo.
- More changes for backwards-compatibility tests by Josh
- Updates to python op doc generation by Josh
- Added support for using the best-fit allocator via ConfigProto by @vrv.
- Rename LocalSession to DirectSession, since local was a bad name for
it.
- Enable tf.nn.moments() to work with tensors of unknown shape by @mrry.
GITHUB_ISSUE: 139
- Changes for Android build by Andrew.
Base CL: 107645181
TensorFlow is an open source software library for numerical computation
using data flow graphs.
Base CL: 107276108