
PipeMode all files under prefix #24

Open
gdj0nes opened this issue Oct 19, 2018 · 7 comments

Comments

@gdj0nes

gdj0nes commented Oct 19, 2018

Is there a way to access all files under a prefix using Pipe Mode with a single channel? If so, it would be nice to add an example to the documentation, since this is likely a common use case.

@owen-t
Contributor

owen-t commented Oct 19, 2018

Hi Gareth,

Yes, this is indeed possible and a great way to use Pipe Mode in SageMaker.

When you construct your CreateTrainingJob request, specify a Channel bound to an S3Uri with an S3DataType of "S3Prefix". All objects in that prefix will be transmitted through the pipe for that channel.

The following boto3 call demonstrates creating a "train" channel from an S3 prefix.

import boto3
sagemaker = boto3.client('sagemaker')

sagemaker.create_training_job(
    # ... Other parameters omitted for brevity
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3Uri": "s3://my-great-bucket/my/great/prefix",
                    "S3DataType": "S3Prefix"
                }
            }
        }
    ]
)

More information can be found on the CreateTrainingJob API page.

If your prefix contains gzip-compressed objects, set the CompressionType parameter to "Gzip". Doing so causes SageMaker to decompress each object while transmitting it into the channel's pipe.
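As a sketch, the "train" channel from the earlier boto3 example could be extended with CompressionType like so (the bucket and prefix are the same placeholders as above):

```python
# Sketch: the "train" channel entry extended with CompressionType so that
# SageMaker decompresses gzip objects before writing them into the pipe.
train_channel = {
    "ChannelName": "train",
    "DataSource": {
        "S3DataSource": {
            "S3Uri": "s3://my-great-bucket/my/great/prefix",
            "S3DataType": "S3Prefix",
        }
    },
    "CompressionType": "Gzip",  # valid values are "None" and "Gzip"
}
```

This dict would go in the InputDataConfig list of the create_training_job call shown above.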

A word of caution: SageMaker Pipe Mode will pass each object in this prefix to your training container. If, for example, you're training on tfrecord files, then ensure that this prefix only contains tfrecord files.

Warm regards,

Owen.

@owen-t
Contributor

owen-t commented Oct 19, 2018

If you're using the SageMaker Python SDK, then simply pass an S3 prefix URI to your estimator's fit method.
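For example, a minimal sketch of what gets passed to fit (the bucket, prefixes, and channel names here are placeholders, not from the thread): each key in the dict becomes a channel, and each value is a plain S3 prefix URI.

```python
# Sketch, assuming the SageMaker Python SDK; one S3Prefix channel per dict key.
inputs = {
    "train": "s3://my-great-bucket/my/great/prefix/train",
    "eval": "s3://my-great-bucket/my/great/prefix/eval",
}
# The estimator would be constructed with input_mode="Pipe", e.g.:
# estimator = sagemaker.estimator.Estimator(..., input_mode="Pipe")
# estimator.fit(inputs)
```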

@gdj0nes
Author

gdj0nes commented Oct 22, 2018

Thanks!

@owen-t
Contributor

owen-t commented Oct 22, 2018

I'm going to leave this issue open for the time being. The point about incorporating this into the docs is a good one.

@gdj0nes
Author

gdj0nes commented Nov 2, 2018

Could you provide a short example of using PipeModeDataset with multiple files and multiple channels, e.g. for training and evaluation?

@kalpitsmehta

Can you please provide an example of how to parse the Augmented Manifest file in the entry_point script to get the image data (from the URL) and the label?

@athewsey

It might be a bit off-topic from the issue, but since @kalpitsmehta asked and somebody +1'd:

If your channel is in Augmented Manifest mode, then I think, per the docs, you should receive the binary file contents (not the URL) for any manifest attribute ending in -ref.

For example if my manifest has source-ref (image URI) and labels (SM Ground Truth label data), I'll get alternating records of JPEG/PNG/whatever data, and label data.

So the first step is to batch your dataset by the number of manifest attributes, and then apply a map() to process the raw (binary) string tensors into the data types you actually want.

TensorFlow has decode_image() but sadly I haven't yet found any good functions for loading JSON data...

As a workaround (with a performance hit due to the Python GIL), we can use a tf.py_func() inside the mapper: inside the py_func you should be able to call json.loads() without a problem, and return whatever fragment of the data you want as a NumPy array.

I think something along these lines should do it:

import json

import numpy as np
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def py_parse_json(label):
    # Extract a numeric array from some subfield:
    return np.array(json.loads(label)["annotations"]["etc"])

def tf_parse_sample(fields):
    img = tf.io.decode_image(fields[0])
    # Pass a scalar Tout so py_func returns a single tensor, not a list:
    boxes = tf.py_func(py_parse_json, [fields[1]], tf.float64)
    boxes.set_shape([None, 5])  # TF can't infer the shape of py_func outputs
    return (img, boxes)

ds_train = PipeModeDataset(channel="train") \
        .repeat(args.epochs) \
        .batch(2) \
        .map(tf_parse_sample)
