# How to run TensorFlow on S3

Tensorflow supports reading and writing data to S3. S3 is an object storage API which is nearly ubiquitous, and can help in situations where data must accessed by multiple actors, such as in distributed training.

This document guides you through the required setup, and provides examples on usage.

## Configuration

When reading or writing data on S3 with your TensorFlow program, the behavior
can be controlled by various environmental variables:

*   **AWS_REGION**: By default, regional endpoint is used for S3, with region
    controlled by `AWS_REGION`. If `AWS_REGION` is not specified, then
    `us-east-1` is used.
*   **S3_ENDPOINT**: The endpoint could be overridden explicitly with
    `S3_ENDPOINT` specified.
*   **S3_USE_HTTPS**: HTTPS is used to access S3 by default, unless
    `S3_USE_HTTPS=0`.
*   **S3_VERIFY_SSL**: If HTTPS is used, SSL verification could be disabled
    with `S3_VERIFY_SSL=0`.

To read or write objects in a bucket that is not publicly accessible,
AWS credentials must be provided through one of the following methods:

*   Set credentials in the AWS credentials profile file on the local system,
    located at: `~/.aws/credentials` on Linux, macOS, or Unix, or
    `C:\Users\USERNAME\.aws\credentials` on Windows.
*   Set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment
    variables.
*   If TensorFlow is deployed on an EC2 instance, specify an IAM role and then
    give the EC2 instance access to that role.

## Example Setup

Using the above information, we can configure Tensorflow to communicate to an S3 endpoint by setting the following environment variables:

```bash
AWS_ACCESS_KEY_ID=XXXXX                 # Credentials only needed if connecting to a private endpoint
AWS_SECRET_ACCESS_KEY=XXXXX
AWS_REGION=us-east-1                    # Region for the S3 bucket, this is not always needed. Default is us-east-1.
S3_ENDPOINT=s3.us-east-1.amazonaws.com  # The S3 API Endpoint to connect to. This is specified in a HOST:PORT format.
S3_USE_HTTPS=1                          # Whether or not to use HTTPS. Disable with 0.
S3_VERIFY_SSL=1                         # If HTTPS is used, controls if SSL should be enabled. Disable with 0.
```

## Usage

Once setup is completed, Tensorflow can interact with S3 in a variety of ways. Anywhere there is a Tensorflow IO function, an S3 URL can be used.

### Smoke Test

To test your setup, stat a file:

```python
from tensorflow.python.lib.io import file_io
print file_io.stat('s3://bucketname/path/')
```

You should see output similar to this:

```console
<tensorflow.python.pywrap_tensorflow_internal.FileStatistics; proxy of <Swig Object of type 'tensorflow::FileStatistics *' at 0x10c2171b0> >
```

### Reading Data

When [reading data](../api_guides/python/reading_data.md), change the file paths you use to read and write
data to an S3 path. For example:

```python
filenames = ["s3://bucketname/path/to/file1.tfrecord",
             "s3://bucketname/path/to/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
```

### Tensorflow Tools

Many Tensorflow tools, such as Tensorboard or model serving, can also take S3 URLS as arguments:

```bash
tensorboard --logdir s3://bucketname/path/to/model/
tensorflow_model_server --port=9000 --model_name=model --model_base_path=s3://bucketname/path/to/model/export/
```

This enables an end to end workflow using S3 for all data needs.

## S3 Endpoint Implementations

S3 was invented by Amazon, but the S3 API has spread in popularity and has several implementations. The following implementations have passed basic compatibility tests:

* [Amazon S3](https://aws.amazon.com/s3/)
* [Google Storage](https://cloud.google.com/storage/docs/interoperability)
* [Minio](https://www.minio.io/kubernetes.html)