author		2017-08-15 21:39:28 -0700
committer	2017-08-15 21:42:58 -0700
commit		2b51e0ba27af69c914a7523d9aae232de09e3206 (patch)
tree		892089b4eca69784760e397f9bf1a5dd64dc9f18 /tensorflow/core/profiler
parent		8041185b16c0ce9348bd79900f6682fc3976bfd7 (diff)
1. Support profiling nodes that appear in RunMetadata but not in GraphDef.
2. Add an API to allow easier profile retrieval. Currently in contrib.
PiperOrigin-RevId: 165399640
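The first change registers RunMetadata-only nodes into the profile unless they are known multi-device system nodes (see `CreateRunMetadataNode` in the `tfprof_stats.cc` diff below). A minimal Python sketch of just that filtering rule — the real implementation is C++; the function name here mirrors it, but this Python version is illustrative only:

```python
def should_create_run_metadata_node(name):
    # RecvTensor and _SOURCE may appear on multiple devices, which breaks
    # tfprof's one-canonical-device assumption; MEMCPY nodes are skipped too.
    if name in ("RecvTensor", "_SOURCE") or "MEMCPY" in name:
        return False
    # Anything else seen only in RunMetadata gets a synthetic node.
    return True
```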
Diffstat (limited to 'tensorflow/core/profiler')
-rw-r--r--	tensorflow/core/profiler/README.md	87
-rw-r--r--	tensorflow/core/profiler/g3doc/options.md	26
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_node.cc	7
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_show_test.cc	31
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_stats.cc	27
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_stats_test.cc	14
-rw-r--r--	tensorflow/core/profiler/internal/tfprof_timeline_test.cc	4
7 files changed, 132 insertions, 64 deletions
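The new contrib API (`ProfileContext.add_auto_profiling`, shown in the README diff below) registers (view, options, steps) triples and fires profiling only at the chosen training steps. A small Python sketch of that scheduling idea — `AutoProfiler` is a hypothetical stand-in, not the contrib implementation:

```python
class AutoProfiler:
    """Sketch of step-triggered auto-profiling scheduling."""

    def __init__(self):
        self._registered = []  # list of (view, options, steps) triples

    def add_auto_profiling(self, view, options, steps):
        # Mirrors the README example: profile `view` with `options`
        # whenever the global step is in `steps`.
        self._registered.append((view, options, set(steps)))

    def views_to_profile(self, step):
        # Return the views whose registered steps include this step.
        return [view for view, _, steps in self._registered if step in steps]
```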
diff --git a/tensorflow/core/profiler/README.md b/tensorflow/core/profiler/README.md
index f755089fc7..40fb1f836e 100644
--- a/tensorflow/core/profiler/README.md
+++ b/tensorflow/core/profiler/README.md
@@ -1,12 +1,10 @@
 <h1>TensorFlow Profiler and Advisor</h1>
 * [Features](#features)
-* [Interfaces](#interfaces)
-* [Tutorials](#tutorials)
+* [Quick Start](#quick-start)
 * [Demo](#demo)
 * [Feature Request and Bug Report](#feature-request-and-bug-report)
-Contact for bug report and feature request (xpan@google.com)
 ### Features
@@ -20,43 +18,57 @@ Contact for bug report and feature request (xpan@google.com)
 * operation configuration check
 * distributed runtime check (Not OSS)
-### Interfaces
-
-* Python API
-* Command Line
-* Visualization
-* C++ API (Not public, contact us if needed.)
-
-tfprof provides 4 different views to organize the profiles.
-
- * code view: operations are grouped by Python codes that generate them.
- * op view: operations are grouped by operation type (E.g. MatMul, Conv2D).
- * scope view: operations are organized based on name scope hierarchies.
- * graph view: operations are organized based on input/output.
-
-tfprof provides options to help user select, filter and order statistics.
-See [Options](g3doc/options.md) for detail instructions.
+### Quick Start
+
+```python
+# Enable tracing and profile the time and memory information.
+builder = tf.profiler.ProfileOptionBuilder
+opts = builder(builder.time_and_memory()).order_by('micros').build()
+with tf.contrib.tfprof.ProfileContext() as pctx:
+  with tf.Session() as sess:
+    x = build_model()
+    sess.run(tf.global_variables_initializer())
+    for i in range(10000):
+      if i == 10:
+        pctx.capture_next_run_meta()
+        _ = sess.run(x)
+        tf.profiler.profile(sess.graph, run_meta=pctx.run_meta(), cmd='op', options=opts)
+      else:
+        _ = sess.run(x)
+```
+```python
+# When using high-level API, session is usually hidden.
+# Register profiling to ProfileContext that performs profiling automatically.
+builder = tf.profiler.ProfileOptionBuilder
+opts = builder(builder.time_and_memory()).order_by('micros').build()
+opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
+with tf.contrib.tfprof.ProfileContext() as pctx:
+  # Profile time and memory from step 100 to 109.
+  # Note: High level APIs can run sessions for queue, variable initialization
+  # or training at different steps. Try profile multiple steps.
+  pctx.add_auto_profiling('op', opts, range(100, 110))
+  # Profile model parameters once at step 10 (since it is static).
+  pctx.add_auto_profiling('scope', opts2, [10])
+  # High level API, such as slim, Estimator, etc.
+  train_loop()
+```
 ```
--max_depth 10
--min_bytes 0
--min_micros 0
--min_params 0
--min_float_ops 0
--min_occurrence 0
--step -1
--order_by name
--account_type_regexes .*
--start_name_regexes .*
--trim_name_regexes
--show_name_regexes .*
--hide_name_regexes
--account_displayed_op_only false
--select params
--output stdout:
+
+```python
+# Profiling from Python API is not interactive.
+# Dump the profiles to files and profile with interactive command line.
+with tf.contrib.tfprof.ProfileContext() as pctx:
+  pctx.add_auto_profile_dump('/tmp/profiles', [100])
+  train_loop()
+
+bazel-bin/tensorflow/core/profiler/profiler \
+    --graph_path=/tmp/profiles/graph.pbtxt \
+    --run_meta_path=/tmp/profiles/run_meta \
+    --op_log_path=/tmp/profiles/tfprof_log \
+tfprof> op -select micros,bytes,occurrence -order_by micros
 ```
-### Tutorials
+<b>Detail Tutorials</b>
 * [Python API](g3doc/python_api.md)
 * [Command Line Interface](g3doc/command_line.md)
@@ -64,6 +76,9 @@ See [Options](g3doc/options.md) for detail instructions.
 * [Profile Memory](g3doc/profile_memory.md)
 * [Profile Model Architecture](g3doc/profile_model_architecture.md)
 * [Auto Detect and Advise](g3doc/advise.md)
+
+<b>Detail Documentation</b>
+
+* [Options](g3doc/options.md)
 
 ## Demo
diff --git a/tensorflow/core/profiler/g3doc/options.md b/tensorflow/core/profiler/g3doc/options.md
index bdcc6b2bd8..15712d04c2 100644
--- a/tensorflow/core/profiler/g3doc/options.md
+++ b/tensorflow/core/profiler/g3doc/options.md
@@ -41,6 +41,32 @@ can vary. For example
 `-max_depth` in scope view means the depth of name scope <b>tree</b>.
 In op view, it means the length of operation <b>list</b>. In graph view,
 it means the number of hops in the <b>graph</b>.
+
+### Times
+
+Most machines have multi-core CPUs. Some have one or more accelerators.
+Each accelerator usually performs massive parallel processing. The profiler
+tracks the accumulated processing times. Hence, the accumulated processing
+time is likely larger than the time of each step.
+
+micros: This is the sum of cpu and accelerator times.
+accelerator_micros: This is the accelerator times.
+cpu_micros: This is the cpu times.
+
+### Memory
+
+Tensor memory is usually ref-counted. The memory is released when there is
+no more reference to it. It is difficult to track the release of memory.
+Currently, the profiler only tracks the allocation of memory. As a result, the
+accumulated memory request is usually larger than the peak memory of the overall
+model.
+
+bytes: The memory allocations requested by the operation.
+peak_bytes: The peak requested memory (not de-allocated) by the operation.
+residual_bytes: The memory requested by the operation and not de-allocated
+    when Compute finishes.
+output_bytes: The memory output by the operation. It's not necessarily requested
+    by the current operation. For example, it can be a tensor
+    forwarded from input to output, with in-place mutation.
+
 ###Docs
diff --git a/tensorflow/core/profiler/internal/tfprof_node.cc b/tensorflow/core/profiler/internal/tfprof_node.cc
index d4a784ffaa..19257968e6 100644
--- a/tensorflow/core/profiler/internal/tfprof_node.cc
+++ b/tensorflow/core/profiler/internal/tfprof_node.cc
@@ -25,7 +25,8 @@ bool CountAsAcceleratorTime(const string& device) {
 }
 
 bool CountAsCPUTime(const string& device) {
-  return RE2::FullMatch(device, ".*/(device:gpu|gpu|cpu|device:sycl):\\d+");
+  return RE2::FullMatch(device,
+                        ".*/(device:gpu|gpu|device:cpu|cpu|device:sycl):\\d+");
 }
 
 bool IsCanonicalDevice(const string& device) { return CountAsCPUTime(device); }
@@ -148,8 +149,8 @@ void TFGraphNode::AddStepStat(int64 step, const string& device,
     if (IsCanonicalDevice(dev)) {
       if (!canonical_device_.empty()) {
         if (canonical_device_ != dev) {
-          fprintf(stderr, "Unexpected: graph node changed device: %s->%s.\n",
-                  canonical_device_.c_str(), dev.c_str());
+          // TODO(xpan): Some RunMetadata node appears at multiple devices.
+          // Need to address it.
           return;
         }
       } else {
diff --git a/tensorflow/core/profiler/internal/tfprof_show_test.cc b/tensorflow/core/profiler/internal/tfprof_show_test.cc
index f2c8b662d0..24cf9afb17 100644
--- a/tensorflow/core/profiler/internal/tfprof_show_test.cc
+++ b/tensorflow/core/profiler/internal/tfprof_show_test.cc
@@ -110,13 +110,14 @@ TEST_F(TFProfShowTest, DumpAcceleratorAndCPUMicros) {
   TF_CHECK_OK(ReadFileToString(Env::Default(), dump_file, &dump_str));
   EXPECT_EQ(
       "node name | accelerator execution time | cpu execution "
-      "time\n_TFProfRoot (--/404us, --/4.50ms)\n Conv2D (226us/226us, "
-      "4.07ms/4.07ms)\n Conv2D_1 (178us/178us, 419us/419us)\n DW2 (0us/0us, "
-      "11us/11us)\n DW2/Assign (0us/0us, 0us/0us)\n DW2/Initializer "
-      "(0us/0us, 0us/0us)\n DW2/Initializer/random_normal (0us/0us, "
-      "0us/0us)\n DW2/Initializer/random_normal/RandomStandardNormal "
-      "(0us/0us, 0us/0us)\n DW2/Initializer/random_normal/mean "
-      "(0us/0us, 0us/0us)\n DW2/Initializer/random_normal/mul (0us/0us, "
+      "time\n_TFProfRoot (--/404us, --/4.54ms)\n Conv2D (226us/226us, "
+      "4.07ms/4.07ms)\n Conv2D_1 (178us/178us, 419us/419us)\n "
+      "_retval_Conv2D_1_0_0 (0us/0us, 41us/41us)\n DW2 (0us/0us, 11us/11us)\n "
+      " DW2/Assign (0us/0us, 0us/0us)\n DW2/Initializer (0us/0us, "
+      "0us/0us)\n DW2/Initializer/random_normal (0us/0us, 0us/0us)\n "
+      " DW2/Initializer/random_normal/RandomStandardNormal (0us/0us, "
+      "0us/0us)\n DW2/Initializer/random_normal/mean (0us/0us, "
+      "0us/0us)\n DW2/Initializer/random_normal/mul (0us/0us, "
       "0us/0us)\n DW2/Initializer/random_normal/shape (0us/0us, "
       "0us/0us)\n DW2/Initializer/random_normal/stddev (0us/0us, "
       "0us/0us)\n DW2/read (0us/0us, 0us/0us)\n DW (0us/0us, 2us/2us)\n "
@@ -156,27 +157,27 @@ TEST_F(TFProfShowTest, DumpOpMode) {
   EXPECT_EQ(
       "nodename|requestedbytes|totalexecutiontime|acceleratorexecutiontime|"
       "cpuexecutiontime|#parameters|#float_ops|opoccurrence(run|defined)|"
-      "inputshapes\nVariableV20B(0.00%,0.00%),13us(100.00%,0.27%),0us(100.00%,"
+      "inputshapes\nVariableV20B(0.00%,0.00%),13us(100.00%,0.26%),0us(100.00%,"
       "0.00%),13us(100.00%,0.29%),451params(100.00%,100.00%),0float_ops(100.00%"
       ",0.00%),2|3\n\ninput_type:\t(run*2|defined*3)\texec_time:13us\n\nAdd0B("
-      "0.00%,0.00%),0us(99.73%,0.00%),0us(100.00%,0.00%),0us(99.71%,0.00%),"
+      "0.00%,0.00%),0us(99.74%,0.00%),0us(100.00%,0.00%),0us(99.71%,0.00%),"
       "0params(0.00%,0.00%),0float_ops(100.00%,0.00%),0|3\n\ninput_type:0:1,"
       "\t1:1\t(run*0|defined*1)\texec_time:0us\ninput_type:0:2x2x6x12,\t1:1\t("
       "run*0|defined*1)\texec_time:0us\ninput_type:0:3x3x3x6,\t1:1\t(run*0|"
-      "defined*1)\texec_time:0us\n\nAssign0B(0.00%,0.00%),0us(99.73%,0.00%),"
+      "defined*1)\texec_time:0us\n\nAssign0B(0.00%,0.00%),0us(99.74%,0.00%),"
       "0us(100.00%,0.00%),0us(99.71%,0.00%),0params(0.00%,0.00%),0float_ops("
       "100.00%,0.00%),0|3\n\ninput_type:0:1,\t1:1\t(run*0|defined*1)\texec_"
       "time:0us\ninput_type:0:2x2x6x12,\t1:2x2x6x12\t(run*0|defined*1)\texec_"
       "time:0us\ninput_type:0:3x3x3x6,\t1:3x3x3x6\t(run*0|defined*1)\texec_"
-      "time:0us\n\nConst0B(0.00%,0.00%),2us(99.73%,0.04%),0us(100.00%,0.00%),"
+      "time:0us\n\nConst0B(0.00%,0.00%),2us(99.74%,0.04%),0us(100.00%,0.00%),"
       "2us(99.71%,0.04%),0params(0.00%,0.00%),0float_ops(100.00%,0.00%),1|"
      "10\n\ninput_type:\t(run*1|defined*10)\texec_time:2us\n\nConv2D14.59KB("
-      "100.00%,100.00%),4.89ms(99.69%,99.69%),404us(100.00%,100.00%),4.49ms(99."
-      "67%,99.67%),0params(0.00%,0.00%),10.44kfloat_ops(100.00%,100.00%),2|"
+      "100.00%,100.00%),4.89ms(99.70%,98.87%),404us(100.00%,100.00%),4.49ms(99."
+      "67%,98.77%),0params(0.00%,0.00%),10.44kfloat_ops(100.00%,100.00%),2|"
       "2\n\ninput_type:0:2x3x3x6,\t1:2x2x6x12\t(run*1|defined*1)\texec_time:"
       "597us\ninput_type:0:2x6x6x3,\t1:3x3x3x6\t(run*1|defined*1)\texec_time:4."
-      "29ms\n\nIdentity0B(0.00%,0.00%),0us(0.00%,0.00%),0us(0.00%,0.00%),0us(0."
-      "00%,0.00%),0params(0.00%,0.00%),0float_ops(0.00%,0.00%),0|3\n\ninput_"
+      "29ms\n\nIdentity0B(0.00%,0.00%),0us(0.83%,0.00%),0us(0.00%,0.00%),0us(0."
+      "90%,0.00%),0params(0.00%,0.00%),0float_ops(0.00%,0.00%),0|3\n\ninput_"
       "type:0:1\t(run*0|defined*1)\texec_time:0us\ninput_type:0:2x2x6x12\t(run*"
       "0|defined*1)\texec_time:0us\ninput_type:0:3x3x3x6\t(run*0|defined*1)"
       "\texec_time:0us\n\n",
diff --git a/tensorflow/core/profiler/internal/tfprof_stats.cc b/tensorflow/core/profiler/internal/tfprof_stats.cc
index ea3e7b487c..012c49525a 100644
--- a/tensorflow/core/profiler/internal/tfprof_stats.cc
+++ b/tensorflow/core/profiler/internal/tfprof_stats.cc
@@ -25,6 +25,22 @@ limitations under the License.
 namespace tensorflow {
 namespace tfprof {
+namespace {
+bool CreateRunMetadataNode(const string& name, NodeDef* def) {
+  // TODO(xpan): Better solution than blacklisting these 2 nodes. They
+  // actually cost some resources, maybe include them. Some nodes, such
+  // as _SOURCE appear in multiple devices, which breaks tfprof's assumption.
+  if (name == "RecvTensor" || name == "_SOURCE" ||
+      name.find("MEMCPY") != name.npos) {
+    return false;
+  }
+  def->set_name(name);
+  // TODO(xpan): Better operation type.
+  def->set_op("RunTimeOp");
+  return true;
+}
+}  // namespace
+
 TFStats::TFStats(std::unique_ptr<GraphDef> graph,
                  std::unique_ptr<RunMetadata> run_meta,
                  std::unique_ptr<OpLogProto> op_log,
@@ -191,7 +207,16 @@ void TFStats::AddRunMeta(int64 step, std::unique_ptr<RunMetadata> run_meta) {
         name = node_stat.node_name().substr(0, split_pos);
       }
       auto node = nodes_map_.find(name);
-      if (node != nodes_map_.end()) {
+      if (node == nodes_map_.end()) {
+        NodeDef def;
+        if (CreateRunMetadataNode(name, &def)) {
+          NodeDef* ndef = graph_->mutable_node()->Add();
+          ndef->MergeFrom(def);
+          nodes_map_[name] =
+              std::unique_ptr<TFGraphNode>(new TFGraphNode(ndef));
+          nodes_map_.at(name)->AddStepStat(step, dev_stat.device(), node_stat);
+        }
+      } else {
         node->second->AddStepStat(step, dev_stat.device(), node_stat);
       }
     }
diff --git a/tensorflow/core/profiler/internal/tfprof_stats_test.cc b/tensorflow/core/profiler/internal/tfprof_stats_test.cc
index e67c158521..0bea14fe66 100644
--- a/tensorflow/core/profiler/internal/tfprof_stats_test.cc
+++ b/tensorflow/core/profiler/internal/tfprof_stats_test.cc
@@ -137,7 +137,7 @@ TEST_F(TFProfStatsTest, TestGraph) {
   GraphNodeProto expected;
   CHECK(protobuf::TextFormat::ParseFromString(
-      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4904\ntotal_requested_bytes: "
+      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4945\ntotal_requested_bytes: "
       "14592\ntotal_parameters: 451\nchildren {\n name: "
       "\"DW/Initializer/random_normal/mul\"\n children {\n name: "
       "\"DW/Initializer/random_normal/RandomStandardNormal\"\n children {\n "
@@ -152,8 +152,8 @@ TEST_F(TFProfStatsTest, TestGraph) {
       "}\n input_shapes {\n key: 1\n value {\n dim {\n "
       "size: 1\n }\n }\n }\n total_definition_count: "
       "4\n}\ntotal_float_ops: 10440\ntotal_accelerator_exec_micros: "
-      "404\ntotal_cpu_exec_micros: 4500\ntotal_run_count: "
-      "5\ntotal_definition_count: 31\ntotal_peak_bytes: "
+      "404\ntotal_cpu_exec_micros: 4541\ntotal_run_count: "
+      "6\ntotal_definition_count: 32\ntotal_peak_bytes: "
       "9984\ntotal_residual_bytes: 1280\ntotal_output_bytes: 4864\n",
       &expected));
   EXPECT_EQ(expected.DebugString(), root.DebugString());
@@ -166,7 +166,7 @@ TEST_F(TFProfStatsTest, TestFloatOps) {
   GraphNodeProto expected;
   CHECK(protobuf::TextFormat::ParseFromString(
-      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4904\ntotal_requested_bytes: "
+      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4945\ntotal_requested_bytes: "
       "14592\ntotal_parameters: 451\nchildren {\n name: \"Conv2D\"\n "
       "exec_micros: 4292\n requested_bytes: 9472\n total_exec_micros: 4292\n "
       " total_requested_bytes: 9472\n devices: "
@@ -199,7 +199,7 @@ TEST_F(TFProfStatsTest, TestFloatOps) {
       "output_bytes: 512\n total_peak_bytes: 4096\n total_residual_bytes: "
       "512\n total_output_bytes: 512\n}\ntotal_float_ops: "
       "10440\ntotal_accelerator_exec_micros: 404\ntotal_cpu_exec_micros: "
-      "4500\ntotal_run_count: 5\ntotal_definition_count: 34\ntotal_peak_bytes: "
+      "4541\ntotal_run_count: 6\ntotal_definition_count: 35\ntotal_peak_bytes: "
       "9984\ntotal_residual_bytes: 1280\ntotal_output_bytes: 4864\n",
       &expected));
   EXPECT_EQ(expected.DebugString(), root.DebugString());
@@ -246,7 +246,7 @@ TEST_F(TFProfStatsTest, TestShowTensorValue) {
   const GraphNodeProto& root = tf_stats_->ShowGraphNode("scope", opts);
   GraphNodeProto expected;
   CHECK(protobuf::TextFormat::ParseFromString(
-      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4904\ntotal_requested_bytes: "
+      "name: \"_TFProfRoot\"\ntotal_exec_micros: 4945\ntotal_requested_bytes: "
       "14592\ntotal_parameters: 451\nchildren {\n name: \"DW\"\n "
       "exec_micros: 2\n parameters: 162\n total_exec_micros: 2\n "
       "total_parameters: 162\n devices: "
@@ -336,7 +336,7 @@ TEST_F(TFProfStatsTest, TestShowTensorValue) {
       "total_run_count: 1\n total_definition_count: 10\n output_bytes: "
       "1280\n total_output_bytes: 1280\n}\ntotal_float_ops: "
       "10440\ntotal_accelerator_exec_micros: 404\ntotal_cpu_exec_micros: "
-      "4500\ntotal_run_count: 5\ntotal_definition_count: 34\ntotal_peak_bytes: "
+      "4541\ntotal_run_count: 6\ntotal_definition_count: 35\ntotal_peak_bytes: "
       "9984\ntotal_residual_bytes: 1280\ntotal_output_bytes: 4864\n",
       &expected));
   EXPECT_EQ(expected.DebugString(), root.DebugString());
diff --git a/tensorflow/core/profiler/internal/tfprof_timeline_test.cc b/tensorflow/core/profiler/internal/tfprof_timeline_test.cc
index 2fe3653ec2..5dd440e9a2 100644
--- a/tensorflow/core/profiler/internal/tfprof_timeline_test.cc
+++ b/tensorflow/core/profiler/internal/tfprof_timeline_test.cc
@@ -71,7 +71,7 @@ TEST_F(TFProfTimelineTest, GraphView) {
   string dump_str;
   TF_CHECK_OK(ReadFileToString(Env::Default(), dump_file, &dump_str));
-  EXPECT_EQ(16947107375505024864ull, Hash64(dump_str));
+  EXPECT_EQ(1754536562981488144ull, Hash64(dump_str));
 }
 
 TEST_F(TFProfTimelineTest, ScopeView) {
@@ -85,7 +85,7 @@ TEST_F(TFProfTimelineTest, ScopeView) {
   string dump_str;
   TF_CHECK_OK(ReadFileToString(Env::Default(), dump_file, &dump_str));
-  EXPECT_EQ(2710044785377031280ull, Hash64(dump_str));
+  EXPECT_EQ(17545174915963890413ull, Hash64(dump_str));
 }
 
 // TODO(xpan): tfprof_log is too large to include in testdata when adding
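The Times section added to options.md above defines `micros` as the sum of accelerator and CPU times, accumulated across devices that run in parallel, which is why the accumulated total can exceed wall-clock step time. A minimal sketch of that accounting, using hypothetical per-op timings (the op names echo the test data but the numbers are illustrative):

```python
# Hypothetical per-op timings in microseconds, mirroring the options.md
# definitions: micros = accelerator_micros + cpu_micros.
ops = {
    "Conv2D":   {"accelerator_micros": 226, "cpu_micros": 4070},
    "Conv2D_1": {"accelerator_micros": 178, "cpu_micros": 419},
}

# Accumulated processing time across all devices; because accelerator and
# CPU work overlap, this can be larger than the step's wall-clock time.
total_micros = sum(t["accelerator_micros"] + t["cpu_micros"]
                   for t in ops.values())
print(total_micros)  # → 4893
```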