
PipeMode all files under prefix #24

Open
gdj0nes opened this issue Oct 19, 2018 · 7 comments

Comments

@gdj0nes

gdj0nes commented Oct 19, 2018

Is there a way to access all files under a prefix using Pipe Mode with a single channel? If so, it would be nice to add an example to the documentation, since this is likely a common use case.

@owen-t
Contributor

owen-t commented Oct 19, 2018

Hi Gareth,

Yes, this is indeed possible and a great way to use Pipe Mode in SageMaker.

When you construct your CreateTrainingJob request, specify a Channel bound to an S3Uri with an S3DataType of "S3Prefix". All objects in that prefix will be transmitted through the pipe for that channel.

The following boto3 call demonstrates creating a "train" channel from an S3 prefix.

import boto3
sagemaker = boto3.client('sagemaker')

sagemaker.create_training_job(
    # ... Other parameters omitted for brevity
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3Uri": "s3://my-great-bucket/my/great/prefix",
                    "S3DataType": "S3Prefix"
                }
            }
        }
    ]
)

More information can be found on the CreateTrainingJob API page.

If your prefix contains gzip-compressed objects, set the CompressionType parameter to "Gzip". Doing so causes SageMaker to decompress each object while transmitting it into the channel's pipe.
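As a sketch, the "train" channel from the earlier boto3 example could be extended with CompressionType like so (the bucket and prefix are the same placeholders as above):

```python
# Sketch: the "train" channel entry extended with CompressionType so that
# SageMaker decompresses gzip objects before writing them into the pipe.
train_channel = {
    "ChannelName": "train",
    "DataSource": {
        "S3DataSource": {
            "S3Uri": "s3://my-great-bucket/my/great/prefix",
            "S3DataType": "S3Prefix",
        }
    },
    "CompressionType": "Gzip",  # valid values are "None" and "Gzip"
}
```

This dict would go in the InputDataConfig list of the create_training_job call shown above.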

A word of caution: SageMaker Pipe Mode will pass each object in this prefix to your training container. If, for example, you're training on tfrecord files, then ensure that this prefix only contains tfrecord files.

Warm regards,

Owen.

@owen-t
Contributor

owen-t commented Oct 19, 2018

If you're using the SageMaker Python SDK, then simply pass an S3 prefix URI to your estimator's fit method.
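For example, a minimal sketch of what gets passed to fit (the bucket, prefixes, and channel names here are placeholders, not from the thread): each key in the dict becomes a channel, and each value is a plain S3 prefix URI.

```python
# Sketch, assuming the SageMaker Python SDK; one S3Prefix channel per dict key.
inputs = {
    "train": "s3://my-great-bucket/my/great/prefix/train",
    "eval": "s3://my-great-bucket/my/great/prefix/eval",
}
# The estimator would be constructed with input_mode="Pipe", e.g.:
# estimator = sagemaker.estimator.Estimator(..., input_mode="Pipe")
# estimator.fit(inputs)
```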

@gdj0nes
Author

gdj0nes commented Oct 22, 2018

Thanks!

@owen-t
Contributor

owen-t commented Oct 22, 2018

I'm going to leave this issue open for the time being. The point about incorporating this into the docs is a good one.

@gdj0nes
Author

gdj0nes commented Nov 2, 2018

Could you provide a short example of using PipeModeDataset with multiple files and multiple channels, e.g. for training and evaluation?

@kalpitsmehta

Can you please provide an example of how to parse the Augmented Manifest file in the entry_point script to get the image data (from the URL) and the label?

@athewsey

It might be a bit off-topic from the issue, but since @kalpitsmehta asked and somebody +1'd:

If your channel is in Augmented Manifest mode, then I think, per the docs, you should receive the binary file contents (not the URL) for any manifest attribute ending in -ref.

For example if my manifest has source-ref (image URI) and labels (SM Ground Truth label data), I'll get alternating records of JPEG/PNG/whatever data, and label data.

So the first step is to batch your dataset by the number of manifest attributes, and then apply a map() to process the raw (binary) string tensors into the data types you actually want.

TensorFlow has decode_image() but sadly I haven't yet found any good functions for loading JSON data...

As a workaround (with a performance hit due to the Python GIL), we can use a tf.py_func() inside the mapper: inside the py_func you should be able to call json.loads() without a problem, and return whatever fragment of the data you want as a NumPy array.

I think something along these lines should do it:

import json

import numpy as np
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def py_parse_json(label):
    # Extract a numeric array from some subfield:
    return np.array(json.loads(label)["annotations"]["etc"])

def tf_parse_sample(fields):
    img = tf.io.decode_image(fields[0])
    # Pass a scalar Tout so py_func returns a single tensor, not a list:
    boxes = tf.py_func(py_parse_json, [fields[1]], tf.float64)
    boxes.set_shape([None, 5])  # TF can't infer the shape of py_func outputs
    return (img, boxes)

ds_train = PipeModeDataset(channel="train") \
        .repeat(args.epochs) \
        .batch(2) \
        .map(tf_parse_sample)
