how to improve performance of list_blobs_segmented? #329

yxiang92128 opened this issue Feb 25, 2020 · 6 comments
yxiang92128 commented Feb 25, 2020

@JinmingHu-MSFT

Is there a way to improve the performance of list_blobs_segmented, either by passing certain options or by using an entirely different function, to list a container incrementally, 1000 objects per iteration? Currently the listing takes 3X longer than S3 with the same number of objects in the bucket. See the code snippet I currently have below:

    do
    {
        num_in_progress = 0;

        azure::storage::list_blob_item_segment result;

        // Azure supports a prefix filter as an argument, which is handy
        result = container.list_blobs_segmented(utility::string_t(prefix), true, azure::storage::blob_listing_details::none, max_return, token, azure::storage::blob_request_options(), azure::storage::operation_context());

        // remember the continuation token for the next iteration
        token = result.continuation_token();

        for (auto& item : result.results())
        {
          if (item.is_blob())
          {
             // convert the Windows FILETIME timestamp (100-ns intervals since 1601)
             // to Unix epoch time; epoch_offset accounts for the difference
             // between the Windows (1601) and Unix (1970) epochs
             long unsigned int input = item.as_blob().properties().last_modified().to_interval();
             long unsigned int linuxtime_millisecs = input / 10000 - epoch_offset;

             num_in_progress++;
          }
          else
          {
             ucout << _XPLATSTR("Directory: ") << item.as_directory().uri().primary_uri().to_string() << std::endl;
          }
        }

        num += num_in_progress;

      // when max_return is 0 we keep looping until every item has been fetched;
      // otherwise we keep the continuation token and return whatever number of
      // items this call to list_blobs_segmented produced
    } while (!token.empty() && max_return == 0);
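For reference, the timestamp conversion inside the loop boils down to the standalone sketch below. It assumes the offset is expressed in milliseconds to match the division by 10,000; the Windows FILETIME epoch (1601-01-01) precedes the Unix epoch by 11,644,473,600 seconds.

    #include <cstdint>

    // The Windows FILETIME epoch (1601-01-01) precedes the Unix epoch (1970-01-01)
    // by 11,644,473,600 seconds; to_interval() returns 100-nanosecond ticks.
    constexpr uint64_t kEpochOffsetMs = 11644473600ULL * 1000ULL;

    // Convert a FILETIME-style tick count to Unix time in milliseconds.
    inline uint64_t filetime_ticks_to_unix_ms(uint64_t ticks_100ns)
    {
        return ticks_100ns / 10000ULL - kEpochOffsetMs; // 10,000 ticks per millisecond
    }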

Any ideas for potential improvements to the above code?

Thanks,

Yang

@Jinming-Hu
Member

Hi @yxiang92128, I'd like to know how you measured the elapsed time. Did you measure the total end-to-end time, or just the local processing time excluding network round-trips?

I think the network accounts for most of the end-to-end time, so if the latency from your test client to the AWS server and to the Azure server differs, the comparison doesn't tell us much.
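One way to separate the two (a minimal sketch, not an official recipe; it reuses the container, prefix, and max_return variables from the snippet above) is to time the list_blobs_segmented call and the local per-item work independently:

    #include <chrono>
    #include <iostream>

    // Accumulate SDK/network time and local processing time separately,
    // then compare the two totals once the listing loop has finished.
    azure::storage::continuation_token token;
    std::chrono::milliseconds sdk_time{0}, local_time{0};

    do
    {
        auto t0 = std::chrono::steady_clock::now();
        azure::storage::list_blob_item_segment result = container.list_blobs_segmented(
            utility::string_t(prefix), true, azure::storage::blob_listing_details::none,
            max_return, token, azure::storage::blob_request_options(),
            azure::storage::operation_context());
        auto t1 = std::chrono::steady_clock::now();

        token = result.continuation_token();
        // ... per-item processing over result.results(), as in the original loop ...
        auto t2 = std::chrono::steady_clock::now();

        sdk_time   += std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
        local_time += std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    } while (!token.empty());

    std::cout << "SDK + network: " << sdk_time.count() << " ms, "
              << "local processing: " << local_time.count() << " ms" << std::endl;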

@yxiang92128
Author

I measured the total time for the same number of objects to come back in the listing. I am just wondering whether I did something suboptimal in the code above.
Thanks.

@Jinming-Hu
Member

@yxiang92128 I wouldn't consider that a valid test, because the network round-trip time accounts for most of the total time. If the latency to one server is very low and to the other is very high, it's entirely plausible to see a several-fold difference in total time.

Can you also share the latency to both servers?

@yxiang92128
Author

Yeah, I understand the network round-trip time varies between systems. I only wanted to confirm that there is nothing in my code I could change to improve the latency of the "list" operation.

thanks

@Jinming-Hu
Member

Jinming-Hu commented Feb 27, 2020

@yxiang92128 I think your code is fine. It's concise and straightforward; I can't find anything that could be optimized further.

@jamwhy

jamwhy commented Feb 25, 2021

@JinmingHu-MSFT @yxiang92128

I found that a list_blobs_segmented call takes 10 to 20 seconds for 1000 items; I have tested this over a number of iterations. Listing the same directory with AzCopy takes about 1.5 seconds. Do you have any insight into why there is an order-of-magnitude difference between list_blobs_segmented and AzCopy?

UPDATE: The problem is the XML parsing and related work in set_postprocess_response. The HTTP request returns 5000 items in about 5 seconds, but executing set_postprocess_response in cloud_blob_container.cpp (around line 477) takes another 85 seconds.
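For anyone who wants to reproduce this from the caller's side, here is a rough sketch (it assumes the same container and prefix variables as the snippet earlier in the thread). It only captures the combined request-plus-parse time of one 5000-item segment; isolating set_postprocess_response itself requires adding timers inside the SDK source, as described above.

    #include <chrono>
    #include <iostream>

    // Time a single 5000-item segment end to end from the caller's side.
    // The gap between this figure and the raw HTTP time has to be measured
    // inside the SDK (around set_postprocess_response), as noted above.
    azure::storage::continuation_token token;
    auto start = std::chrono::steady_clock::now();

    azure::storage::list_blob_item_segment segment = container.list_blobs_segmented(
        utility::string_t(prefix), true, azure::storage::blob_listing_details::none,
        5000, token, azure::storage::blob_request_options(),
        azure::storage::operation_context());

    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << segment.results().size() << " items listed in "
              << elapsed.count() << " ms" << std::endl;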
