how to improve performance of list_blobs_segmented? #329

yxiang92128 opened this issue Feb 25, 2020 · 6 comments
yxiang92128 commented Feb 25, 2020

@JinmingHu-MSFT

Is there a way to improve the performance of list_blobs_segmented, either by passing certain options or by using an entirely different function, to list a container incrementally, 1000 objects per iteration? Currently the listing takes 3X longer than S3 with the same number of objects in the bucket. See the code snippet I currently have below:

    do
    {
        num_in_progress = 0;

        azure::storage::list_blob_item_segment result;

        // Azure supports a prefix filter as an argument, which is handy
        result = container.list_blobs_segmented(utility::string_t(prefix), true, azure::storage::blob_listing_details::none, max_return, token, azure::storage::blob_request_options(), azure::storage::operation_context());

        // remember the continuation token for the next iteration
        token = result.continuation_token();

        for (auto& item : result.results())
        {
          if (item.is_blob())
          {
             // convert the Windows FILETIME timestamp (100-ns intervals since 1601)
             // to Unix epoch time; epoch_offset accounts for the difference
             // between the Windows (1601) and Unix (1970) epochs
             long unsigned int input = item.as_blob().properties().last_modified().to_interval();
             long unsigned int linuxtime_millisecs = input / 10000 - epoch_offset;

             num_in_progress++;
          }
          else
          {
             ucout << _XPLATSTR("Directory: ") << item.as_directory().uri().primary_uri().to_string() << std::endl;
          }
        }

        num += num_in_progress;

      // when max_return is 0 we keep looping until every item has been fetched;
      // otherwise we keep the continuation token and return whatever number of
      // items this call to list_blobs_segmented produced
    } while (!token.empty() && max_return == 0);
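For reference, the timestamp conversion inside the loop boils down to the standalone sketch below. It assumes the offset is expressed in milliseconds to match the division by 10,000; the Windows FILETIME epoch (1601-01-01) precedes the Unix epoch by 11,644,473,600 seconds.

    #include <cstdint>

    // The Windows FILETIME epoch (1601-01-01) precedes the Unix epoch (1970-01-01)
    // by 11,644,473,600 seconds; to_interval() returns 100-nanosecond ticks.
    constexpr uint64_t kEpochOffsetMs = 11644473600ULL * 1000ULL;

    // Convert a FILETIME-style tick count to Unix time in milliseconds.
    inline uint64_t filetime_ticks_to_unix_ms(uint64_t ticks_100ns)
    {
        return ticks_100ns / 10000ULL - kEpochOffsetMs; // 10,000 ticks per millisecond
    }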

Any ideas for potential improvements to the above code?

Thanks,

Yang

@Jinming-Hu
Member

Hi @yxiang92128, I'd like to know how you measured the elapsed time. Did you measure the total end-to-end time, or just the local processing time excluding network round-trips?

I think the network accounts for most of the end-to-end time, so if the latency from your test client to the AWS server and to the Azure server differs, the comparison doesn't tell us much.
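One way to separate the two (a minimal sketch, not an official recipe; it reuses the container, prefix, and max_return variables from the snippet above) is to time the list_blobs_segmented call and the local per-item work independently:

    #include <chrono>
    #include <iostream>

    // Accumulate SDK/network time and local processing time separately,
    // then compare the two totals once the listing loop has finished.
    azure::storage::continuation_token token;
    std::chrono::milliseconds sdk_time{0}, local_time{0};

    do
    {
        auto t0 = std::chrono::steady_clock::now();
        azure::storage::list_blob_item_segment result = container.list_blobs_segmented(
            utility::string_t(prefix), true, azure::storage::blob_listing_details::none,
            max_return, token, azure::storage::blob_request_options(),
            azure::storage::operation_context());
        auto t1 = std::chrono::steady_clock::now();

        token = result.continuation_token();
        // ... per-item processing over result.results(), as in the original loop ...
        auto t2 = std::chrono::steady_clock::now();

        sdk_time   += std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
        local_time += std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    } while (!token.empty());

    std::cout << "SDK + network: " << sdk_time.count() << " ms, "
              << "local processing: " << local_time.count() << " ms" << std::endl;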

@yxiang92128
Author

I measured the total time for the same number of objects to come back in the listing. I am just wondering whether I did something suboptimal in the code above.
Thanks.

@Jinming-Hu
Member

@yxiang92128 I wouldn't consider that a valid test, because the network round-trip time accounts for most of the total time. If the latency to one server is very low and to the other is very high, it's entirely plausible to see a several-fold difference in total time.

Can you also share the latency to both servers?

@yxiang92128
Author

Yeah, I understand the network round-trip time varies between systems. I only wanted to confirm that there is nothing in my code I could change to improve the latency of the "list" operation.

thanks

@Jinming-Hu
Member

Jinming-Hu commented Feb 27, 2020

@yxiang92128 I think your code is fine. It's concise and straightforward; I can't find anything that could be optimized further.

@jamwhy

jamwhy commented Feb 25, 2021

@JinmingHu-MSFT @yxiang92128

I found that a list_blobs_segmented call takes 10 to 20 seconds for 1000 items; I have tested this over a number of iterations. Listing the same directory with AzCopy takes about 1.5 seconds. Do you have any insight into why there is an order-of-magnitude difference between list_blobs_segmented and AzCopy?

UPDATE: The problem is the XML parsing and related work in set_postprocess_response. The HTTP request returns 5000 items in about 5 seconds, but executing set_postprocess_response in cloud_blob_container.cpp (around line 477) takes another 85 seconds.
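For anyone who wants to reproduce this from the caller's side, here is a rough sketch (it assumes the same container and prefix variables as the snippet earlier in the thread). It only captures the combined request-plus-parse time of one 5000-item segment; isolating set_postprocess_response itself requires adding timers inside the SDK source, as described above.

    #include <chrono>
    #include <iostream>

    // Time a single 5000-item segment end to end from the caller's side.
    // The gap between this figure and the raw HTTP time has to be measured
    // inside the SDK (around set_postprocess_response), as noted above.
    azure::storage::continuation_token token;
    auto start = std::chrono::steady_clock::now();

    azure::storage::list_blob_item_segment segment = container.list_blobs_segmented(
        utility::string_t(prefix), true, azure::storage::blob_listing_details::none,
        5000, token, azure::storage::blob_request_options(),
        azure::storage::operation_context());

    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << segment.results().size() << " items listed in "
              << elapsed.count() << " ms" << std::endl;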
