In this blog we will attempt to optimize both the cost and the performance of a data processing pipeline, more specifically its data movement step. The pipeline was originally created using Azure Data Factory or Synapse Pipelines.
A company needs to process new files arriving from its customers, and each customer has their own container. Files initially land in a 'Raw Source' Storage Account and are then copied and processed into a different Storage Account through the following layers: Bronze -> Silver -> Gold.
We will use 1k, 2k, 5k and 10k files per iteration. There will also be a validation iteration that replaces each batch of 1k files with a single zipped (stored, not compressed) file, so 1, 2, 5 and 10 files respectively. We want to see the impact of copying a single file versus many individual files on the copy step.
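For the validation iterations, the archives can be produced in 'store' mode so that only the file count changes, not the payload size. Below is a minimal sketch of how such an archive could be built before uploading it to the Raw Source container; the folder and archive names are placeholders:

```python
import zipfile
from pathlib import Path

def bundle_batch(source_dir: str, archive_path: str) -> None:
    """Bundle every file in source_dir into a single ZIP in 'store' mode
    (ZIP_STORED = no compression), so the total payload size stays roughly
    the same and only the number of objects to copy changes."""
    files = sorted(Path(source_dir).glob("*"))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for f in files:
            zf.write(f, arcname=f.name)

# e.g. turn a 1k-file batch into one archive before landing it in 'Raw Source'
bundle_batch("./batch_0001", "./batch_0001.zip")
```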
We will attempt the optimization with:
- different languages/frameworks: AzCopy, REST API, .NET SDK, Copy Activity, Spark.
- different compute: Azure IR and a Self-Hosted Integration Runtime (SHIR). For the SHIR scenario, we will start and stop the SHIR host VM with web calls to save on VM costs (see the sketch after this list).
- a redesign of the pipeline itself, to see whether we can preserve the processing outcome while reducing cost.
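On the SHIR start/stop point above: the web calls can target the Azure Compute REST API for the VM that hosts the SHIR. Here is a minimal sketch in Python, assuming the SHIR runs on a single host VM; the subscription, resource group and VM names are placeholders, and in Data Factory or Synapse the same two calls would typically be issued from Web activities using a managed identity:

```python
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
VM_NAME = "<shir-vm-name>"              # placeholder
API_VERSION = "2023-03-01"

def _vm_action(action: str) -> None:
    """POST a start/deallocate request for the SHIR host VM via the Azure Compute REST API."""
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Compute"
        f"/virtualMachines/{VM_NAME}/{action}?api-version={API_VERSION}"
    )
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()  # a 202 Accepted response means the operation was queued

def start_shir_vm() -> None:
    _vm_action("start")

def stop_shir_vm() -> None:
    _vm_action("deallocate")  # deallocate (not just power off) so VM compute billing stops
```

Deallocating the VM, rather than shutting it down from inside the OS, is what actually stops the compute billing while the SHIR is idle.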