# Supervisor: Training Helper for Days-Long Training Runs.

To train a model with TensorFlow you can simply run a training op a number of
times and save a checkpoint of the trained parameters when you're done.  This
works well for small models that can train in a few hours.

Larger models that require days of training, possibly across multiple replicas,
need a more robust training process that:

 * Handles shutdowns and crashes cleanly.
 * Can be resumed after a shutdown or a crash.
 * Can be monitored through TensorBoard.

To be able to resume training after a shutdown or a crash the training process
must save checkpoints regularly.  On restart, it must look for the most recent
checkpoint and load it before resuming training.

To be monitored through TensorBoard, the training process must run summary ops
regularly and append the returned values to an events file as explained in
@{$summaries_and_tensorboard$TensorBoard: Visualizing Learning}.
TensorBoard monitors events files and displays graphs reporting training
progress over time.

The @{tf.train.Supervisor} provides
a set of services that helps implement a robust training process.

This how-to shows how to use the supervisor directly.  Please also consider
using one of several frameworks built on top of the supervisor that provide
richer training loops and numerous customization options:
@{$python/contrib.learn$`tf.learn`} is a good choice.

Note that the supervisor is very helpful for training large models, but can
also be used for smaller models without any penalty.

## Very Simple Scenario

The simplest scenario for using a supervisor is to:

 * Create a `Supervisor` object, passing it the path to a directory in which to
   save checkpoints and summaries.

 * Ask the supervisor for a session with
   @{tf.train.Supervisor.managed_session}.

 * Use the session to execute a train op, checking at each step whether the
   supervisor has requested that training stop.

```python
  ...create graph...
  my_train_op = ...

  sv = tf.train.Supervisor(logdir="/my/training/directory")
  with sv.managed_session() as sess:
    for step in range(100000):
      if sv.should_stop():
        break
      sess.run(my_train_op)
```

### Started Services

In the very simple scenario, the `managed_session()` call starts a few
services, which run in their own threads and use the managed session to run
ops in your graph.

If your graph contains an integer variable named `global_step`, the services
use its value to measure the number of training steps executed.  See the @{$mechanics#training$MNIST training tutorial} for how to
create a `global_step` variable.
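
If your graph does not yet define such a variable, here is a minimal sketch of
one common way to create it; the loss tensor and optimizer shown are
illustrative placeholders, not part of this guide's example graph:

```python
  # A scalar, non-trainable integer variable named "global_step".
  global_step = tf.Variable(0, name="global_step", trainable=False)

  # Illustrative: most optimizers increment the global step for you
  # when it is passed to minimize().
  my_train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
      my_loss, global_step=global_step)
```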

 * _Checkpointing_ service: Saves a copy of the graph variables in the logdir.
   The checkpoint filename uses the value of the `global_step` variable if one
   was added to your graph.  Runs every 10 minutes by default.

 * _Summary_ service: Runs all the summary ops and appends their output to an
   @{$summaries_and_tensorboard$events file} in the logdir.  Runs
   every 2 minutes by default.

 * _Step counter_: Counts how many steps have been executed, by looking at
   changes in the `global_step` variable.  Appends a summary to the events file
   reporting the number of global steps per second.  The summary tag is
   "global_step/sec".  This also runs every 2 minutes by default.

 * _Queue Runners_: If any @{tf.train.QueueRunner} objects were added to the
   graph, the supervisor launches them in their own threads.

All time intervals can be changed when constructing the supervisor object.  See
the [supervisor reference](#supervisor_reference) for details.
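
For example, here is a sketch of a supervisor constructed with custom
intervals; the values are illustrative:

```python
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                     save_model_secs=120,      # checkpoint every 2 minutes
                     save_summaries_secs=30)   # run the summary service every 30 seconds
```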

### Checking for Stop

The check for the stop condition in the main training loop is essential.

Exceptions raised in the service threads are reported to the supervisor, which
then sets its `should_stop()` condition to true.  Other service threads notice
that condition and terminate properly.  The main training loop, within the
`managed_session()` block, must also check for the stop condition and
terminate.

Note that `managed_session()` takes care of catching exceptions raised from the
training loop to report them to the supervisor.  The main loop does not need to
do anything special about exceptions.  It only needs to check for the stop
condition.
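
Your own code can also trigger a stop by calling `request_stop()` on the
supervisor; the service threads and the main loop then wind down through the
same mechanism.  A minimal sketch, assuming a hypothetical `time_to_stop()`
predicate of your own:

```python
  with sv.managed_session() as sess:
    for step in range(100000):
      if sv.should_stop():
        break
      sess.run(my_train_op)
      if time_to_stop():    # hypothetical condition of your own
        sv.request_stop()   # should_stop() becomes true for all threads
```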

### Recovery

If the training program shuts down or crashes, its most recent checkpoint and
event files are left in the logdir.  When you restart the program,
`managed_session()` restores the graph from the most recent checkpoint and
resumes training where it stopped.

A new events file is created.  If you start TensorBoard and point it to the
logdir, it will know how to merge the contents of the two events files and will
show the training resuming at the last global step from the checkpoint.

## Larger Model Scenario

The very simple scenario is sufficient for most small to medium sized models.
Larger models may run out of memory when the summary service runs: the summary
ops are run in parallel with the main loop running the train op.  This can
cause memory usage to peak at up to two times the normal use.

For a larger model you can tell the supervisor to not run the summary service
and instead run it yourself in your main training loop: pass `summary_op=None`
when constructing the supervisor.

For example, this code runs the summary op every 100 steps in the training loop:

```python
  ...create graph...
  my_train_op = ...
  my_summary_op = tf.summary.merge_all()

  sv = tf.train.Supervisor(logdir="/my/training/directory",
                     summary_op=None) # Do not run the summary service
  with sv.managed_session() as sess:
    for step in range(100000):
      if sv.should_stop():
        break
      if step % 100 == 0:
        _, summ = sess.run([my_train_op, my_summary_op])
        sv.summary_computed(sess, summ)
      else:
        sess.run(my_train_op)
```

## Pre-trained Model Scenario

The `managed_session()` call takes care of initializing the model in the
session.  The model is restored from a checkpoint if one is available,
or initialized from scratch otherwise.

One common scenario is to initialize the model by loading a "pre-trained"
checkpoint that was saved while training a (usually slightly different) model
on a different dataset.

You can load a pre-trained checkpoint by passing an "init function" to the
supervisor.  This function is called only if the model needs to be initialized
from scratch, not when the model can be recovered from a checkpoint from the
logdir.

To load the pre-trained model, the init function needs a
@{tf.train.Saver} object, so you should create
a saver for this purpose.  This is usually a good idea because the new model
may contain variables that are not present in the pre-trained checkpoint: this
saver must restore only the pre-trained variables.  If you used the
default saver, you could get an error trying to restore all the variables of
the new model from the pre-trained checkpoint.

```python
  ...create graph...
  # Create a saver that restores only the pre-trained variables.
  pre_train_saver = tf.train.Saver([pre_train_var1, pre_train_var2])

  # Define an init function that loads the pretrained checkpoint.
  def load_pretrain(sess):
    pre_train_saver.restore(sess, "<path to pre-trained-checkpoint>")

  # Pass the init function to the supervisor.
  #
  # The init function is called _after_ the variables have been initialized
  # by running the init_op.
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                     init_fn=load_pretrain)
  with sv.managed_session() as sess:
    # Here sess was either initialized from the pre-trained-checkpoint or
    # recovered from a checkpoint saved in a previous run of this code.
    ...
```

## Running Your Own Services

Supervisor services, such as the checkpointing service, run in threads parallel
to the main training loop.  You sometimes want to add your own services, for
example to fetch different sets of summaries on a different schedule than the
usual summary service.

Use the @{tf.train.Supervisor.loop} method of
the supervisor for this purpose.  It repeatedly calls a function of your choice
on a timer until the supervisor stop condition becomes true, so it plays nicely
with the other services.

Example: Call `my_additional_summaries()` every 20 minutes:

```python

def my_additional_summaries(sv, sess):
 ...fetch and write summaries, see below...

...
  sv = tf.train.Supervisor(logdir="/my/training/directory")
  with sv.managed_session() as sess:
    # Call my_additional_summaries() every 1200s, or 20 minutes,
    # passing (sv, sess) as arguments.
    sv.loop(1200, my_additional_summaries, args=(sv, sess))
    ...main training loop...
```

## Writing Summaries

The supervisor always creates an events file in its logdir, as well as a
@{tf.summary.FileWriter} to append
events and summaries to that file.  If you want to write your own summaries, it
is a good idea to append them to that same events file: TensorBoard works best
when only one events file in a directory is being actively appended to.

The supervisor provides a helper function to append summaries:
@{tf.train.Supervisor.summary_computed}.
Just pass the function the output returned by a summary op.  Here is an
example of using that function to implement `my_additional_summaries()` from the
previous example:

```python
def my_additional_summaries(sv, sess):
  summaries = sess.run(my_additional_summary_op)
  sv.summary_computed(sess, summaries)
```

For more advanced usage, the supervisor provides access to its summary writer
through its
@{tf.train.Supervisor.summary_writer}
attribute.
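
For example, here is a sketch of appending a hand-built summary through that
writer; the tag, value, and step are illustrative:

```python
  # Build a summary protocol buffer by hand and append it to the
  # supervisor's events file.
  summ = tf.Summary(value=[tf.Summary.Value(tag="my_metric", simple_value=0.5)])
  sv.summary_writer.add_summary(summ, global_step=1234)
```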

## Supervisor Reference

The [Very Simple Scenario](#very_simple_scenario) and the [Larger Model
Scenario](#larger_model_scenario) show basic uses of a supervisor.  More
advanced scenarios can be constructed by using the many options provided by the
supervisor.

### Checkpointing: Where and When.

The `managed_session()` call launches the checkpointing service, which can be
configured by the following keyword arguments to the `Supervisor()`
constructor:

 * `logdir`: Path to a directory where the checkpointing service creates
   checkpoints.  The directory is created if needed.  Passing `None` disables
   the checkpointing and the summary services.

 * `checkpoint_basename`: Name of the checkpoint files to create, defaults to
   "model.ckpt".

   If the model contains a scalar integer variable named `global_step`, the
   value of that variable is appended to the checkpoint filename.

   For example, at global step 1234 the checkpoint filename is
   "model.ckpt-1234".

 * `save_model_secs`: Number of seconds between each checkpoint.  Defaults to
   600, or 10 minutes.

   When choosing a value, consider how much work you want to lose in case of a
   crash: you will never lose more than `save_model_secs` seconds of work.
   Setting this to 0 disables the checkpointing service.

 * `saver`: A @{tf.train.Saver} object to use
   for checkpointing.

   If you do not pass one, the supervisor creates one for you by calling
   `tf.train.Saver()`, which adds ops to save and restore all variables in your model.
   This is usually what you need.

Example: Use a custom Saver and checkpoint every 30 seconds.

```python
  ...create graph...
  my_saver = tf.train.Saver(<only some variables>)
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                     saver=my_saver,
                     save_model_secs=30)
  with sv.managed_session() as sess:
    ...training loop...
```

### Summaries: Where and When.

The `managed_session()` call launches the summary service, which fetches
summaries and reports the number of steps executed per second.  It can be
configured by the following keyword arguments to the `Supervisor()`
constructor:

 * `logdir`: Path to a directory where the summary service creates event files.
   The directory is created if needed.  Passing `None` disables the summary
   service as well as the checkpointing service.

 * `save_summaries_secs`: Number of seconds between each run of the summary
   service.  Defaults to 120, or 2 minutes.

   When choosing a value, consider how expensive your summaries are, and how
   much disk space they will occupy.  Pass 0 to disable the summary service.

 * `summary_op`: Op to use to fetch the summaries.

   If not specified, the supervisor uses the first op in the
   `tf.GraphKeys.SUMMARY_OP` @{tf.Graph.add_to_collection$graph collection}.  If
   the collection is empty the supervisor creates an op that aggregates all
   summaries in the graph using `tf.summary.merge_all()`.

   Passing `None` disables the summary service.

 * `global_step`: Tensor to use to count the global step.

   If not specified, the supervisor uses the first tensor in the
   `tf.GraphKeys.GLOBAL_STEP` @{tf.Graph.add_to_collection$graph collection}.  If
   the collection is empty, the supervisor looks for a scalar integer variable
   named `global_step` in the graph.

   If found, the global step tensor is used to measure the number of training
   steps executed.  Note that your training op is responsible for incrementing
   the global step value.
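
Example: a sketch that fetches an explicitly merged summary op every 5 minutes;
the interval is illustrative.

```python
  ...create graph...
  my_summary_op = tf.summary.merge_all()

  sv = tf.train.Supervisor(logdir="/my/training/directory",
                     summary_op=my_summary_op,
                     save_summaries_secs=300)
  with sv.managed_session() as sess:
    ...training loop...
```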

### Model Initialization and Recovery

The `managed_session()` call takes care of initializing or recovering a
session.  It returns a session with a fully initialized model, ready to run
ops.  If a checkpoint exists in the logdir when `managed_session()` is called,
the model is initialized by loading that checkpoint, otherwise it is
initialized by calling an init op and optionally an init function.

When no checkpoint is available, model initialization is controlled by the
following keyword arguments to the `Supervisor()` constructor:

 * `init_op`: Op to run to initialize the model.

   If not specified, the supervisor uses the first op in the
   `tf.GraphKeys.INIT_OP` collection.  If the collection is empty, the
   supervisor adds an op to initialize all the variables in the graph by
   calling `tf.global_variables_initializer()`.

   Pass `None` to not use an init op.

 * `init_fn`: Python function to call to initialize the model.

   If specified, called as `init_fn(sess)` where `sess` is the managed session.
   If an init op is also used, the init function is called _after_ the init op.

 * `local_init_op`: An additional op to initialize parts of the graph that are
   not saved in checkpoints, such as tables and
   @{tf.contrib.framework.local_variable$local variables}. The
   local init op is run _before_ the init op and the init function.

   If not specified, the supervisor uses the first op in the
   `tf.GraphKeys.LOCAL_INIT_OP` collection.  If the collection is empty the
   supervisor adds an op to initialize all the tables and local variables in
   the graph by calling `tf.tables_initializer()` and
   `tf.local_variables_initializer()`.

   Pass `None` to not use a local init op.

 * `ready_op`: Op to check if the model is initialized.

   After running the local init op, the init op, and the init function, the
   supervisor verifies that the model is fully initialized by running the ready
   op.  This is an op that returns an empty string if the model is initialized,
   or a description of what parts of the model are not initialized if not.

   If not specified, the supervisor uses the first op in the
   `tf.GraphKeys.READY_OP` collection.  If the collection is empty the
   supervisor creates a ready op that verifies that all variables are
   initialized by calling `tf.report_uninitialized_variables()`.

   Pass `None` to disable the ready op.  In that case the model is not
   checked after initialization.
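
Example: a sketch that passes the init op and ready op explicitly; the ops
shown here match the defaults the supervisor would otherwise create on its own.

```python
  ...create graph...
  sv = tf.train.Supervisor(logdir="/my/training/directory",
                     init_op=tf.global_variables_initializer(),
                     ready_op=tf.report_uninitialized_variables())
  with sv.managed_session() as sess:
    ...training loop...
```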

Checkpoint recovery is controlled by the following keyword arguments to the
`Supervisor()` constructor:

 * `logdir`: Path to a directory in which to look for checkpoints.  The
   checkpoint service saves a metadata file, named "checkpoint", in the
   checkpoint directory that indicates the path to the most recent checkpoint.

   This file is in text format.  In a pinch, you can edit it manually to
   recover from a different checkpoint than the most recent one.  A sketch of
   reading this file programmatically appears after this list.

 * `ready_op`: (see above).  The ready op is run before and after loading the
   checkpoint.  The first run checks if the model needs to be initialized and
   the second run verifies that the model is fully initialized.

 * `local_init_op`: (see above).  The local init op is run before running the
   ready op the first time, to initialize local variables and tables.

 * `saver`: (see above).  Saver object used to load the checkpoint.
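
As a sketch, the "checkpoint" metadata file mentioned above can also be read
programmatically with the standard checkpoint utilities, independently of the
supervisor:

```python
  # Path recorded in the "checkpoint" metadata file, or None if there is none.
  latest = tf.train.latest_checkpoint("/my/training/directory")

  # Full checkpoint state, including older checkpoints still on disk.
  state = tf.train.get_checkpoint_state("/my/training/directory")
```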