PipeMode all files under prefix #24
Comments
Hi Gareth,

Yes, this is indeed possible and a great way to use Pipe Mode in SageMaker. When you construct your training job, point the channel's S3Uri at the prefix and set its S3DataType to "S3Prefix". The following boto call demonstrates creating a "train" channel from a prefix collection of S3 objects.

import boto3

sagemaker = boto3.client('sagemaker')
sagemaker.create_training_job(
    # ... Other parameters omitted for brevity
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3Uri": "s3://my-great-bucket/my/great/prefix",
                    "S3DataType": "S3Prefix"
                }
            }
        }
    ]
)

More information can be found on the CreateTrainingJob API page.

If your prefix contains ...

A word of caution: SageMaker Pipe Mode will pass each object in this prefix to your training container. If, for example, you're training on tfrecord files, then ensure that this prefix only contains tfrecord files.

Warm regards,
Owen.
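The word of caution above can be checked before launching a job. Here is a minimal sketch, assuming you have already collected the object keys under the prefix (for instance by paging through an S3 listing). The helper name `unexpected_keys` and the `.tfrecord` suffix are illustrative assumptions, not part of SageMaker or boto3.

```python
def unexpected_keys(keys, prefix, suffix=".tfrecord"):
    """Return the keys under `prefix` that do not end with `suffix`.

    `prefix` is matched the way S3 matches prefixes: as a plain string
    prefix of the key, not as a directory.
    """
    return [k for k in keys if k.startswith(prefix) and not k.endswith(suffix)]


# Hypothetical listing of a bucket's keys.
keys = [
    "my/great/prefix/part-0000.tfrecord",
    "my/great/prefix/part-0001.tfrecord",
    "my/great/prefix/_SUCCESS",
    "other/prefix/readme.txt",
]

# Anything printed here would also be streamed to the training container.
print(unexpected_keys(keys, "my/great/prefix"))
```

An empty result suggests the prefix holds only the file type you expect; anything else is worth moving or deleting before training.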
If you're using the SageMaker Python SDK, then simply pass an S3 prefix URI to your estimator's fit method.
Thanks!
I'm going to leave this issue open for the time being. The point about incorporating this into the docs is a good one.
Could you provide a short example of using PipeModeDataset with multiple files and multiple channels, i.e. for training and evaluation?
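The thread does not answer this directly, so the following is only a sketch of one plausible setup: one S3 prefix channel per dataset split, built in the same shape as the "train" channel shown earlier. The helper `s3_prefix_channel`, the "eval" channel name, and the bucket paths are hypothetical. The channel names passed here are the ones the training script would later hand to PipeModeDataset.

```python
def s3_prefix_channel(name, s3_uri):
    """Build one InputDataConfig entry that reads every object under an S3 prefix."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3Uri": s3_uri,
                "S3DataType": "S3Prefix",
            }
        },
    }


# Hypothetical bucket and prefixes, one channel per dataset split.
input_data_config = [
    s3_prefix_channel("train", "s3://my-great-bucket/my/great/prefix/train"),
    s3_prefix_channel("eval", "s3://my-great-bucket/my/great/prefix/eval"),
]

# In the training script, each channel would then get its own dataset, e.g.:
#   ds_train = PipeModeDataset(channel="train")
#   ds_eval = PipeModeDataset(channel="eval")
print([c["ChannelName"] for c in input_data_config])
```

The resulting list would be passed as InputDataConfig to create_training_job, with the job configured for Pipe input mode.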
Can you please provide an example of how to parse the Augmented Manifest file in the entry_point script so as to get image data (from the URL) and label?
It might be a bit off-topic from the issue, but since @kalpitsmehta asked and somebody +1'd:

If your channel is in Augmented Manifest mode, I think per the docs you should receive the binary file contents (not the URL) for any manifest attributes ending in "-ref". For example, if my manifest has two attributes per line, an image reference and a JSON label, I'd expect each sample to arrive as two consecutive records: the image bytes, then the label JSON.

So the first step is to batch your dataset by the number of manifest file attributes, and then you want to create a mapper function that parses each batch.

TensorFlow has decode_image() but sadly I haven't yet found any good functions for loading JSON data... As a workaround (with a performance hit due to the Python GIL), we can use a tf.py_func() inside the mapper. Inside the py_func you should be able to call json.loads() no problem, and just return whatever fragment of the data you want as a numpy array. I think something along these lines should do it?:

import json
import numpy as np
import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def py_parse_json(label):
    # Extracting a numeric array from some subfield:
    return np.array(json.loads(label)["annotations"]["etc"], dtype=np.float64)

def tf_parse_sample(fields):
    img = tf.io.decode_image(fields[0])
    # Pass Tout as a single dtype so py_func returns one tensor, not a list
    boxes = tf.py_func(py_parse_json, [fields[1]], tf.float64)
    boxes.set_shape([None, 5])  # TF doesn't know shape of py_func outputs
    return (img, boxes)

# args is assumed to come from the entry_point script's argument parser
ds_train = PipeModeDataset(channel="train") \
    .repeat(args.epochs) \
    .batch(2) \
    .map(tf_parse_sample)
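To make the batch-then-map idea easier to follow, here is a plain Python sketch with no TensorFlow involved. It assumes the record layout described in the comment above: image bytes and JSON label alternating, one pair per manifest line. The function names and the fake stream are illustrative only, and the "annotations"/"etc" keys are the same placeholders used in the snippet.

```python
import json


def group_records(records, fields_per_sample):
    """Group a flat record stream into tuples of `fields_per_sample` records.

    A trailing incomplete group is discarded.
    """
    sample = []
    for record in records:
        sample.append(record)
        if len(sample) == fields_per_sample:
            yield tuple(sample)
            sample = []


def parse_sample(fields):
    """Keep the raw image bytes and pull a numeric array out of the JSON label."""
    image_bytes, label_json = fields
    boxes = json.loads(label_json)["annotations"]["etc"]
    return image_bytes, boxes


# Fake stream: image bytes and JSON label alternate, one pair per manifest line.
stream = [
    b"<image-1 bytes>", b'{"annotations": {"etc": [[0, 1, 2, 3, 4]]}}',
    b"<image-2 bytes>", b'{"annotations": {"etc": [[5, 6, 7, 8, 9]]}}',
]

samples = [parse_sample(fields) for fields in group_records(stream, 2)]
print(samples[1][1])
```

Here `group_records` plays the role of `.batch(2)` and `parse_sample` the role of the mapper; in the real pipeline the image bytes would go through decode_image() instead of being passed along untouched.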
Is there a way to access all files under a prefix using PipeMode through a channel? If so, it would be nice to add an example to the documentation, since this is likely a common use case.