author		2017-08-15 21:39:28 -0700
committer	2017-08-15 21:42:58 -0700
commit		2b51e0ba27af69c914a7523d9aae232de09e3206 (patch)
tree		892089b4eca69784760e397f9bf1a5dd64dc9f18 /tensorflow/core/profiler
parent		8041185b16c0ce9348bd79900f6682fc3976bfd7 (diff)
1. Support profiling nodes that appear in RunMetadata but not in GraphDef.
2. Add an API to allow easier profile retrieval. Currently in contrib.
PiperOrigin-RevId: 165399640
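The first change registers RunMetadata-only nodes into the profile unless they are known multi-device system nodes (see `CreateRunMetadataNode` in the `tfprof_stats.cc` diff below). A minimal Python sketch of just that filtering rule — the real implementation is C++; the function name here mirrors it, but this Python version is illustrative only:

```python
def should_create_run_metadata_node(name):
    # RecvTensor and _SOURCE may appear on multiple devices, which breaks
    # tfprof's one-canonical-device assumption; MEMCPY nodes are skipped too.
    if name in ("RecvTensor", "_SOURCE") or "MEMCPY" in name:
        return False
    # Anything else seen only in RunMetadata gets a synthetic node.
    return True
```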
Diffstat (limited to 'tensorflow/core/profiler')
-rw-r--r--	tensorflow/core/profiler/README.md	87
-rw-r--r--	tensorflow/core/profiler/g3doc/options.md	26
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_node.cc	7
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_show_test.cc	31
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_stats.cc	27
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_stats_test.cc	14
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_timeline_test.cc	4
7 files changed, 132 insertions, 64 deletions
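The new contrib API (`ProfileContext.add_auto_profiling`, shown in the README diff below) registers (view, options, steps) triples and fires profiling only at the chosen training steps. A small Python sketch of that scheduling idea — `AutoProfiler` is a hypothetical stand-in, not the contrib implementation:

```python
class AutoProfiler:
    """Sketch of step-triggered auto-profiling scheduling."""

    def __init__(self):
        self._registered = []  # list of (view, options, steps) triples

    def add_auto_profiling(self, view, options, steps):
        # Mirrors the README example: profile `view` with `options`
        # whenever the global step is in `steps`.
        self._registered.append((view, options, set(steps)))

    def views_to_profile(self, step):
        # Return the views whose registered steps include this step.
        return [view for view, _, steps in self._registered if step in steps]
```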
diff --git a/tensorflow/core/profiler/README.md b/tensorflow/core/profiler/README.md
index f755089fc7..40fb1f836e 100644
--- a/tensorflow/core/profiler/README.md
+++ b/tensorflow/core/profiler/README.md
@@ -1,12 +1,10 @@
 <h1>TensorFlow Profiler and Advisor</h1>
 * [Features](#features)
-* [Interfaces](#interfaces)
-* [Tutorials](#tutorials)
+* [Quick Start](#quick-start)
 * [Demo](#demo)
 * [Feature Request and Bug Report](#feature-request-and-bug-report)
-Contact for bug report and feature request (xpan@google.com)
 ### Features
@@ -20,43 +18,57 @@ Contact for bug report and feature request (xpan@google.com)
 * operation configuration check
 * distributed runtime check (Not OSS)
-### Interfaces
-
-* Python API
-* Command Line
-* Visualization
-* C++ API (Not public, contact us if needed.)
-
-tfprof provides 4 different views to organize the profiles.
-
- * code view: operations are grouped by Python codes that generate them.
- * op view: operations are grouped by operation type (E.g. MatMul, Conv2D).
- * scope view: operations are organized based on name scope hierarchies.
- * graph view: operations are organized based on input/output.
-
-tfprof provides options to help user select, filter and order statistics.
-See [Options](g3doc/options.md) for detail instructions.
+### Quick Start
+
+```python
+# Enable tracing and profile the time and memory information.
+builder = tf.profiler.ProfileOptionBuilder
+opts = builder(builder.time_and_memory()).order_by('micros').build()
+with tf.contrib.tfprof.ProfileContext() as pctx:
+  with tf.Session() as sess:
+    x = build_model()
+    sess.run(tf.global_variables_initializer())
+    for i in range(10000):
+      if i == 10:
+        pctx.capture_next_run_meta()
+        _ = sess.run(x)
+        tf.profiler.profile(sess.graph, run_meta=pctx.run_meta(), cmd='op', options=opts)
+      else:
+        _ = sess.run(x)
+```
+```python
+# When using high-level API, session is usually hidden.
+# Register profiling to ProfileContext that performs profiling automatically.
+builder = tf.profiler.ProfileOptionBuilder
+opts = builder(builder.time_and_memory()).order_by('micros').build()
+opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
+with tf.contrib.tfprof.ProfileContext() as pctx:
+  # Profile time and memory from step 100 to 109.
+  # Note: High level APIs can run sessions for queue, variable initialization
+  # or training at different steps. Try profile multiple steps.
+  pctx.add_auto_profiling('op', opts, range(100, 110))
+  # Profile model parameters once at step 10 (since it is static).
+  pctx.add_auto_profiling('scope', opts2, [10])
+  # High level API, such as slim, Estimator, etc.
+  train_loop()
+```
 ```
--max_depth 10
--min_bytes 0
--min_micros 0
--min_params 0
--min_float_ops 0
--min_occurrence 0
--step -1
--order_by name
--account_type_regexes .*
--start_name_regexes .*
--trim_name_regexes
--show_name_regexes .*
--hide_name_regexes
--account_displayed_op_only false
--select params
--output stdout:
+
+```python
+# Profiling from Python API is not interactive.
+# Dump the profiles to files and profile with interactive command line.
+with tf.contrib.tfprof.ProfileContext() as pctx:
+  pctx.add_auto_profile_dump('/tmp/profiles', [100])
+  train_loop()
+
+bazel-bin/tensorflow/core/profiler/profiler \
+    --graph_path=/tmp/profiles/graph.pbtxt \
+    --run_meta_path=/tmp/profiles/run_meta \
+    --op_log_path=/tmp/profiles/tfprof_log \
+tfprof> op -select micros,bytes,occurrence -order_by micros
 ```
-### Tutorials
+<b>Detail Tutorials</b>
 * [Python API](g3doc/python_api.md)
 * [Command Line Interface](g3doc/command_line.md)
@@ -64,6 +76,9 @@ See [Options](g3doc/options.md) for detail instructions.
 * [Profile Memory](g3doc/profile_memory.md)
 * [Profile Model Architecture](g3doc/profile_model_architecture.md)
 * [Auto Detect and Advise](g3doc/advise.md)
+
+<b>Detail Documentation</b>
+
+* [Options](g3doc/options.md)
 
 ## Demo
diff --git a/tensorflow/core/profiler/g3doc/options.md b/tensorflow/core/profiler/g3doc/options.md
index bdcc6b2bd8..15712d04c2 100644
--- a/tensorflow/core/profiler/g3doc/options.md
+++ b/tensorflow/core/profiler/g3doc/options.md
@@ -41,6 +41,32 @@ can vary. For example
 `-max_depth` in scope view means the depth of name scope <b>tree</b>.
 In op view, it means the length of operation <b>list</b>. In graph view,
 it means the number of hops in the <b>graph</b>.
+
+### Times
+
+Most machines have multi-core CPUs. Some have one or more accelerators.
+Each accelerator usually performs massive parallel processing. The profiler
+tracks the accumulated processing times. Hence, the accumulated processing
+time is likely larger than the time of each step.
+
+micros: This is the sum of cpu and accelerator times.
+accelerator_micros: This is the accelerator times.
+cpu_micros: This is the cpu times.
+
+### Memory
+
+Tensor memory is usually ref-counted. The memory is released when there is
+no more reference to it. It is difficult to track the release of memory.
+Currently, the profiler only tracks the allocation of memory. As a result, the
+accumulated memory request is usually larger than the peak memory of the overall
+model.
+
+bytes: The memory allocations requested by the operation.
+peak_bytes: The peak requested memory (not de-allocated) by the operation.
+residual_bytes: The memory requested by the operation and not de-allocated
+    when Compute finishes.
+output_bytes: The memory output by the operation. It's not necessarily requested
+    by the current operation. For example, it can be a tensor
+    forwarded from input to output, with in-place mutation.
+
 ###Docs
diff --git a/tensorflow/core/profiler/internal/tfprof_node.cc b/tensorflow/core/profiler/internal/tfprof_node.cc
index d4a784ffaa..19257968e6 100644
--- a/tensorflow/core/profiler/internal/tfprof_node.cc
+++ b/tensorflow/core/profiler/internal/tfprof_node.cc
@@ -25,7 +25,8 @@ bool CountAsAcceleratorTime(const string& device) {
 }
 
 bool CountAsCPUTime(const string& device) {
-  return RE2::FullMatch(device, ".*/(device:gpu|gpu|cpu|device:sycl):\\d+");
+  return RE2::FullMatch(device,
+                        ".*/(device:gpu|gpu|device:cpu|cpu|device:sycl):\\d+");
 }
 
 bool IsCanonicalDevice(const string& device) { return CountAsCPUTime(device); }
@@ -148,8 +149,8 @@ void TFGraphNode::AddStepStat(int64 step, const string& device,
     if (IsCanonicalDevice(dev)) {
       if (!canonical_device_.empty()) {
         if (canonical_device_ != dev) {
-          fprintf(stderr, "Unexpected: graph node changed device: %s->%s.\n",
-                  canonical_device_.c_str(), dev.c_str());
+          // TODO(xpan): Some RunMetadata node appears at multiple devices.
+          // Need to address it.
           return;
         }
       } else {
diff --git a/tensorflow/core/profiler/internal/tfprof_show_test.cc b/tensorflow/core/profiler/internal/tfprof_show_test.cc
index f2c8b662d0..24cf9afb17 100644
--- a/tensorflow/core/profiler/internal/tfprof_show_test.cc
+++ b/tensorflow/core/profiler/internal/tfprof_show_test.cc
@@ -110,13 +110,14 @@ TEST_F(TFProfShowTest, DumpAcceleratorAndCPUMicros) {
   TF_CHECK_OK(ReadFileToString(Env::Default(), dump_file, &dump_str));
   EXPECT_EQ(
       "node name | accelerator execution time | cpu execution "
-      "time\n_TFProfRoot (--/404us, --/4.50ms)\n Conv2D (226us/226us, "
-      "4.07ms/4.07ms)\n Conv2D_1 (178us/178us, 419us/419us)\n DW2 (0us/0us, "
-      "11us/11us)\n DW2/Assign (0us/0us, 0us/0us)\n DW2/Initializer "
-      "(0us/0us, 0us/0us)\n DW2/Initializer/random_normal (0us/0us, "
-      "0us/0us)\n DW2/Initializer/random_normal/RandomStandardNormal "
-      "(0us/0us, 0us/0us)\n DW2/Initializer/random_normal/mean "
-      "(0us/0us, 0us/0us)\n DW2/Initializer/random_normal/mul (0us/0us, "
+      "time\n_TFProfRoot (--/404us, --/4.54ms)\n Conv2D (226us/226us, "
+      "4.07ms/4.07ms)\n Conv2D_1 (178us/178us, 419us/419us)\n "
+      "_retval_Conv2D_1_0_0 (0us/0us, 41us/41us)\n DW2 (0us/0us, 11us/11us)\n "
+      " DW2/Assign (0us/0us, 0us/0us)\n DW2/Initializer (0us/0us, "
+      "0us/0us)\n DW2/Initializer/random_normal (0us/0us, 0us/0us)\n "
+      " DW2/Initializer/random_normal/RandomStandardNormal (0us/0us, "
+      "0us/0us)\n DW2/Initializer/random_normal/mean (0us/0us, "
+      "0us/0us)\n DW2/Initializer/random_normal/mul (0us/0us, "
       "0us/0us)\n DW2/Initializer/random_normal/shape (0us/0us, "
       "0us/0us)\n DW2/Initializer/random_normal/stddev (0us/0us, "
       "0us/0us)\n DW2/read (0us/0us, 0us/0us)\n DW (0us/0us, 2us/2us)\n "
@@ -156,27 +157,27 @@ TEST_F(TFProfShowTest, DumpOpMode) {
   EXPECT_EQ(
       "nodename|requestedbytes|totalexecutiontime|acceleratorexecutiontime|"
       "cpuexecutiontime|#parameters|#float_ops|opoccurrence(run|defined)|"
-      "inputshapes\nVariableV20B(0.00%,0.00%),13us(100.00%,0.27%),0us(100.00%,"
+      "inputshapes\nVariableV20B(0.00%,0.00%),13us(100.00%,0.26%),0us(100.00%,"
       "0.00%),13us(100.00%,0.29%),451params(100.00%,100.00%),0float_ops(100.00%"
       ",0.00%),2|3\n\ninput_type:\t(run*2|defined*3)\texec_time:13us\n\nAdd0B("
-      "0.00%,0.00%),0us(99.73%,0.00%),0us(100.00%,0.00%),0us(99.71%,0.00%),"
+      "0.00%,0.00%),0us(99.74%,0.00%),0us(100.00%,0.00%),0us(99.71%,0.00%),"
       "0params(0.00%,0.00%),0float_ops(100.00%,0.00%),0|3\n\ninput_type:0:1,"
       "\t1:1\t(run*0|defined*1)\texec_time:0us\ninput_type:0:2x2x6x12,\t1:1\t("
       "run*0|defined*1)\texec_time:0us\ninput_type:0:3x3x3x6,\t1:1\t(run*0|"
-      "defined*1)\texec_time:0us\n\nAssign0B(0.00%,0.00%),0us(99.73%,0.00%),"
+      "defined*1)\texec_time:0us\n\nAssign0B(0.00%,0.00%),0us(99.74%,0.00%),"
       "0us(100.00%,0.00%),0us(99.71%,0.00%),0params(0.00%,0.00%),0float_ops("
       "100.00%,0.00%),0|3\n\ninput_type:0:1,\t1:1\t(run*0|defined*1)\texec_"
       "time:0us\ninput_type:0:2x2x6x12,\t1:2x2x6x12\t(run*0|defined*1)\texec_"
       "time:0us\ninput_type:0:3x3x3x6,\t1:3x3x3x6\t(run*0|defined*1)\texec_"
-      "time:0us\n\nConst0B(0.00%,0.00%),2us(99.73%,0.04%),0us(100.00%,0.00%),"
+      "time:0us\n\nConst0B(0.00%,0.00%),2us(99.74%,0.04%),0us(100.00%,0.00%),"
       "2us(99.71%,0.04%),0params(0.00%,0.00%),0float_ops(100.00%,0.00%),1|"
      "10\n\ninput_type:\t(run*1|defined*10)\texec_time:2us\n\nConv2D14.59KB("
-      "100.00%,100.00%),4.89ms(99.69%,99.69%),404us(100.00%,100.00%),4.49ms(99."
-      "67%,99.67%),0params(0.00%,0.00%),10.44kfloat_ops(100.00%,100.00%),2|"
+      "100.00%,100.00%),4.89ms(99.70%,98.87%),404us(100.00%,100.00%),4.49ms(99."
+      "67%,98.77%),0params(0.00%,0.00%),10.44kfloat_ops(100.00%,100.00%),2|"
       "2\n\ninput_type:0:2x3x3x6,\t1:2x2x6x12\t(run*1|defined*1)\texec_time:"
       "597us\ninput_type:0:2x6x6x3,\t1:3x3x3x6\t(run*1|defined*1)\texec_time:4."
-      "29ms\n\nIdentity0B(0.00%,0.00%),0us(0.00%,0.00%),0us(0.00%,0.00%),0us(0."
-      "00%,0.00%),0params(0.00%,0.00%),0float_ops(0.00%,0.00%),0|3\n\ninput_"
+      "29ms\n\nIdentity0B(0.00%,0.00%),0us(0.83%,0.00%),0us(0.00%,0.00%),0us(0."
+      "90%,0.00%),0params(0.00%,0.00%),0float_ops(0.00%,0.00%),0|3\n\ninput_"
       "type:0:1\t(run*0|defined*1)\texec_time:0us\ninput_type:0:2x2x6x12\t(run*"
       "0|defined*1)\texec_time:0us\ninput_type:0:3x3x3x6\t(run*0|defined*1)"
       "\texec_time:0us\n\n",
diff --git a/tensorflow/core/profiler/internal/tfprof_stats.cc b/tensorflow/core/profiler/internal/tfprof_stats.cc
index ea3e7b487c..012c49525a 100644
--- a/tensorflow/core/profiler/internal/tfprof_stats.cc
+++ b/tensorflow/core/profiler/internal/tfprof_stats.cc
@@ -25,6 +25,22 @@ limitations under the License.
 namespace tensorflow {
 namespace tfprof {
+namespace {
+bool CreateRunMetadataNode(const string& name, NodeDef* def) {
+  // TODO(xpan): Better solution than blacklisting these 2 nodes. They
+  // actually cost some resources, maybe include them. Some nodes, such
+  // as _SOURCE appear in multiple devices, which breaks tfprof's assumption.
+  if (name == "RecvTensor" || name == "_SOURCE" ||
+      name.find("MEMCPY") != name.npos) {
+    return false;
+  }
+  def->set_name(name);
+  // TODO(xpan): Better operation type.
+  def->set_op("RunTimeOp");
+  return true;
+}
+}  // namespace
+
 TFStats::TFStats(std::unique_ptr<GraphDef> graph,
                  std::unique_ptr<RunMetadata> run_meta,
                  std::unique_ptr<OpLogProto> op_log,
@@ -191,7 +207,16 @@ void TFStats::AddRunMeta(int64 step, std::unique_ptr<RunMetadata> run_meta) {
         name = node_stat.node_name().substr(0, split_pos);
       }
       auto node = nodes_map_.find(name);
-      if (node != nodes_map_.end()) {
+      if (node == nodes_map_.end()) {
+        NodeDef def;
+        if (CreateRunMetadataNode(name, &def)) {
+          NodeDef* ndef = graph_->mutable_node()->Add();
+          ndef->MergeFrom(def);
+          nodes_map_[name] =
+              std::unique_ptr<TFGraphNode>(new TFGraphNode(ndef));
+          nodes_map_.at(name)->AddStepStat(step, dev_stat.device(), node_stat);
+        }
+      } else {
         node->second->AddStepStat(step, dev_stat.device(), node_stat);
       }
     }
diff --git a/tensorflow/core/profiler/internal/tfprof_stats_test.cc b/tensorflow/core/profiler/internal/tfprof_stats_test.cc
index e67c158521..0bea14fe66 100644
--- a/tensorflow/core/profiler/internal/tfprof_stats_test.cc
+++ b/tensorflow/core/profiler/internal/tfprof_stats_test.cc
@@ -137,7 +137,7 @@ TEST_F(TFProfStatsTest, TestGraph) {
   GraphNodeProto expected;
   CHECK(protobuf::TextFormat::ParseFromString(
-      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4904\ntotal_requested_bytes: "
+      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4945\ntotal_requested_bytes: "
       "14592\ntotal_parameters: 451\nchildren {\n name: "
       "\"DW/Initializer/random_normal/mul\"\n children {\n name: "
       "\"DW/Initializer/random_normal/RandomStandardNormal\"\n children {\n "
@@ -152,8 +152,8 @@ TEST_F(TFProfStatsTest, TestGraph) {
       "}\n input_shapes {\n key: 1\n value {\n dim {\n "
       "size: 1\n }\n }\n }\n total_definition_count: "
       "4\n}\ntotal_float_ops: 10440\ntotal_accelerator_exec_micros: "
-      "404\ntotal_cpu_exec_micros: 4500\ntotal_run_count: "
-      "5\ntotal_definition_count: 31\ntotal_peak_bytes: "
+      "404\ntotal_cpu_exec_micros: 4541\ntotal_run_count: "
+      "6\ntotal_definition_count: 32\ntotal_peak_bytes: "
       "9984\ntotal_residual_bytes: 1280\ntotal_output_bytes: 4864\n",
       &expected));
   EXPECT_EQ(expected.DebugString(), root.DebugString());
@@ -166,7 +166,7 @@ TEST_F(TFProfStatsTest, TestFloatOps) {
   GraphNodeProto expected;
   CHECK(protobuf::TextFormat::ParseFromString(
-      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4904\ntotal_requested_bytes: "
+      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4945\ntotal_requested_bytes: "
       "14592\ntotal_parameters: 451\nchildren {\n name: \"Conv2D\"\n "
       "exec_micros: 4292\n requested_bytes: 9472\n total_exec_micros: 4292\n "
       " total_requested_bytes: 9472\n devices: "
@@ -199,7 +199,7 @@ TEST_F(TFProfStatsTest, TestFloatOps) {
       "output_bytes: 512\n total_peak_bytes: 4096\n total_residual_bytes: "
       "512\n total_output_bytes: 512\n}\ntotal_float_ops: "
       "10440\ntotal_accelerator_exec_micros: 404\ntotal_cpu_exec_micros: "
-      "4500\ntotal_run_count: 5\ntotal_definition_count: 34\ntotal_peak_bytes: "
+      "4541\ntotal_run_count: 6\ntotal_definition_count: 35\ntotal_peak_bytes: "
       "9984\ntotal_residual_bytes: 1280\ntotal_output_bytes: 4864\n",
       &expected));
   EXPECT_EQ(expected.DebugString(), root.DebugString());
@@ -246,7 +246,7 @@ TEST_F(TFProfStatsTest, TestShowTensorValue) {
   const GraphNodeProto& root = tf_stats_->ShowGraphNode("scope", opts);
   GraphNodeProto expected;
   CHECK(protobuf::TextFormat::ParseFromString(
-      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4904\ntotal_requested_bytes: "
+      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4945\ntotal_requested_bytes: "
       "14592\ntotal_parameters: 451\nchildren {\n name: \"DW\"\n "
       "exec_micros: 2\n parameters: 162\n total_exec_micros: 2\n "
       "total_parameters: 162\n devices: "
@@ -336,7 +336,7 @@ TEST_F(TFProfStatsTest, TestShowTensorValue) {
       "total_run_count: 1\n total_definition_count: 10\n output_bytes: "
       "1280\n total_output_bytes: 1280\n}\ntotal_float_ops: "
       "10440\ntotal_accelerator_exec_micros: 404\ntotal_cpu_exec_micros: "
-      "4500\ntotal_run_count: 5\ntotal_definition_count: 34\ntotal_peak_bytes: "
+      "4541\ntotal_run_count: 6\ntotal_definition_count: 35\ntotal_peak_bytes: "
       "9984\ntotal_residual_bytes: 1280\ntotal_output_bytes: 4864\n",
       &expected));
   EXPECT_EQ(expected.DebugString(), root.DebugString());
diff --git a/tensorflow/core/profiler/internal/tfprof_timeline_test.cc b/tensorflow/core/profiler/internal/tfprof_timeline_test.cc
index 2fe3653ec2..5dd440e9a2 100644
--- a/tensorflow/core/profiler/internal/tfprof_timeline_test.cc
+++ b/tensorflow/core/profiler/internal/tfprof_timeline_test.cc
@@ -71,7 +71,7 @@ TEST_F(TFProfTimelineTest, GraphView) {
   string dump_str;
   TF_CHECK_OK(ReadFileToString(Env::Default(), dump_file, &dump_str));
-  EXPECT_EQ(16947107375505024864ull, Hash64(dump_str));
+  EXPECT_EQ(1754536562981488144ull, Hash64(dump_str));
 }
 
 TEST_F(TFProfTimelineTest, ScopeView) {
@@ -85,7 +85,7 @@ TEST_F(TFProfTimelineTest, ScopeView) {
   string dump_str;
   TF_CHECK_OK(ReadFileToString(Env::Default(), dump_file, &dump_str));
-  EXPECT_EQ(2710044785377031280ull, Hash64(dump_str));
+  EXPECT_EQ(17545174915963890413ull, Hash64(dump_str));
 }
 
 // TODO(xpan): tfprof_log is too large to include in testdata when adding
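The Times section added to options.md above defines `micros` as the sum of accelerator and CPU times, accumulated across devices that run in parallel, which is why the accumulated total can exceed wall-clock step time. A minimal sketch of that accounting, using hypothetical per-op timings (the op names echo the test data but the numbers are illustrative):

```python
# Hypothetical per-op timings in microseconds, mirroring the options.md
# definitions: micros = accelerator_micros + cpu_micros.
ops = {
    "Conv2D":   {"accelerator_micros": 226, "cpu_micros": 4070},
    "Conv2D_1": {"accelerator_micros": 178, "cpu_micros": 419},
}

# Accumulated processing time across all devices; because accelerator and
# CPU work overlap, this can be larger than the step's wall-clock time.
total_micros = sum(t["accelerator_micros"] + t["cpu_micros"]
                   for t in ops.values())
print(total_micros)  # → 4893
```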