Question/Feature Request: download_to_byte_array? #289
Hi,

Currently there are only download_to_stream and download_to_file APIs, and download_to_stream allocates memory on the heap during the download, which increases the per-thread footprint of our application. We have preallocated buffers (uint8 byte arrays) that we would prefer to use to receive the downloaded bytes, and we wonder if there is a method we can rely on to fetch block blob contents directly into such a buffer by passing its pointer to the SDK. If not, is there a workaround you could propose?

Thanks again,
Yang
Hi, currently we don't support this feature, mainly because of a limitation in cpprestsdk. In cpprestsdk, buffers are managed with std::shared_ptr to better maintain their lifetimes, and a user-provided buffer is passed by value into cpprestsdk, so there will be a copy. After some research into the cpprestsdk source code, I came up with a workaround: we can create a derived class that accepts a reference instead of a copy as its parameter. As long as we make sure the container lives throughout the HTTP request/response, everything should work just fine. Here is a demonstration; you may want to make changes to fit your needs.
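As a stand-in for the derived-class demonstration, cpprestsdk's stock `rawptr_stream` gives the same zero-copy behavior, since it wraps caller-owned memory instead of copying it (the `blob` object and buffer size below are placeholders, not from the original snippet). Use it like this:

```cpp
#include <was/blob.h>              // azure::storage::cloud_block_blob
#include <cpprest/rawptrstream.h>  // concurrency::streams::rawptr_stream

// Caller-owned, preallocated buffer; it must stay alive until the download completes
// and must be at least as large as the data being downloaded.
std::vector<uint8_t> buffer(16 * 1024 * 1024);

// Wrap the raw memory in an ostream; downloaded bytes land in the buffer directly.
auto os = concurrency::streams::rawptr_stream<uint8_t>::open_ostream(buffer.data(), buffer.size());

blob.download_to_stream(os);
os.close().wait();
```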
The cpprestsdk will still annoyingly allocate and free some small buffers (1M~4M) repeatedly during download. I didn't study it further, but I found that disabling parallel download and checksum validation eliminates that.
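Both knobs live on `blob_request_options`; a sketch of how that might look (the `blob` and `os` variables carry over from the snippet above):

```cpp
azure::storage::blob_request_options options;
options.set_parallelism_factor(1);                 // one range at a time: no parallel download
options.set_disable_content_md5_validation(true);  // skip MD5 checksum validation

blob.download_to_stream(os, azure::storage::access_condition(), options,
                        azure::storage::operation_context());
```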
Thanks. We will give it a try.
@JinmingHu-MSFT With std::iostream, we were able to do the following, and we are hoping we can do something similar with the Azure SDK / cpprestsdk. Thanks for helping out. Yang
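Presumably the std::iostream pattern being referred to looks something like this (the file name, `in_buffer`, and `in_buffer_size` are hypothetical):

```cpp
#include <fstream>

// Read directly into a preallocated buffer; the stream itself
// does not allocate a data buffer on the heap.
std::ifstream file("data.bin", std::ios::binary);
file.read(reinterpret_cast<char*>(in_buffer), in_buffer_size);
std::streamsize bytes_read = file.gcount();
```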
@yxiang92128 Hi, I'm glad I can help. Did you try it with the workaround from my previous reply? By the way, I don't think it's possible to completely avoid malloc; there will always be allocations of a few dozen bytes. But avoiding big allocations (say, larger than 500K) should be possible, I think.
@JinmingHu-MSFT
Hi Yang, give this a try.
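A sketch of the suggested usage, reconstructed around `download_range_to_stream` over a preallocated buffer (`in_buffer`, the offset, and the length are placeholders, not the original snippet). Use it like this:

```cpp
#include <was/blob.h>
#include <cpprest/rawptrstream.h>

// Preallocated, caller-owned buffer; must stay alive until the download finishes.
static uint8_t in_buffer[4 * 1024 * 1024];

auto os = concurrency::streams::rawptr_stream<uint8_t>::open_ostream(in_buffer, sizeof(in_buffer));

azure::storage::blob_request_options options;
options.set_parallelism_factor(1); // single-threaded, so ranges land in the buffer in order

blob.download_range_to_stream(os, /* offset */ 0, /* length */ sizeof(in_buffer),
                              azure::storage::access_condition(), options,
                              azure::storage::operation_context());
os.close().wait();
```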
Since you mentioned you are calling C++ code from C, please make sure you handle exceptions properly. Tell me if you have any further questions.
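Since exceptions must never propagate across the C boundary, a hedged sketch of what an exception-safe wrapper could look like (the function name and return convention are made up for illustration):

```cpp
#include <was/blob.h>
#include <cpprest/rawptrstream.h>

// Hypothetical C-callable wrapper: returns 0 on success, -1 on failure.
extern "C" int download_blob_range(azure::storage::cloud_block_blob* blob,
                                   uint8_t* buffer, size_t size,
                                   int64_t offset, int64_t length)
{
    try
    {
        auto os = concurrency::streams::rawptr_stream<uint8_t>::open_ostream(buffer, size);
        blob->download_range_to_stream(os, offset, length);
        os.close().wait();
        return 0;
    }
    catch (const azure::storage::storage_exception&)
    {
        return -1; // service/storage error (e.g. blob not found)
    }
    catch (...)
    {
        return -1; // anything else (network failure, bad_alloc, ...)
    }
}
```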
@JinmingHu-MSFT We are experimenting with your code snippet and will keep you updated. Thanks so much. Have a great weekend.
@JinmingHu-MSFT Would this approach also work with multithreaded (parallel) download? Thanks again, Yang
@yxiang92128 This doesn't look easy once multithreaded download is taken into account, but I instinctively feel it's doable. I need to study the cpprestsdk source code further; I'll get back to you later.
@JinmingHu-MSFT I don't think setting parallelism_factor to anything other than 1 works with the raw buffer you've implemented for the download_range_to_stream call, as you mentioned before. If I set parallelism_factor to 2, for instance, and download a chunk from the middle of a blob, the content that comes back does not match the original segment.
Yeah, if we only consider single-thread download, then I think the raw-buffer approach should work just fine.
@JinmingHu-MSFT We are testing the raw buffer thoroughly in our application, and here is another issue I've found. Is there something special about blobs in "archive" status as far as download is concerned? Does the download need to fetch the first 233 bytes, or does the buffer need to be at least that big? I understand it should throw an exception, because a blob in archive state does NOT support download, but it should not crash just because the download buffer isn't at least 233 bytes long (anything less than 233 causes a crash, that is).

As a workaround, I currently check the blob tier status and, if it is "archive", I do not proceed with the download. But I would like to understand why the crash occurs and its potential impact on other types of blobs.

If I allocated in_buffer to be 233 bytes long with a download size of 128 bytes, the correct exception was thrown and there was no crash/coredump. However, if I allocated in_buffer to be 232 bytes long (one less than before) with the same 128-byte download size, it crashed.
@yxiang92128 Hi, this is expected behavior. If you use an HTTP network traffic capture tool, you can see that the HTTP response for downloading an archive blob looks like this:
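An illustrative capture (the status line, request ID, and timestamp are placeholders; the service's error code for an archived blob is BlobArchived):

```
HTTP/1.1 409 Conflict
Content-Type: application/xml
Content-Length: 233

<?xml version="1.0" encoding="utf-8"?>
<Error><Code>BlobArchived</Code><Message>This operation is not permitted on an archived blob.
RequestId:00000000-0000-0000-0000-000000000000
Time:2020-01-01T00:00:00.0000000Z</Message></Error>
```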
The response body is 233 bytes long. Internally, the C++ SDK puts the HTTP response body into the buffer, whether it's blob data or an error message. So in your case, you should provide a buffer that is big enough to hold the blob data or the error message, whichever is bigger. I know this is not easy, since the sizes of the various error messages are hard to predict. The workaround I can think of is to slightly modify the buffer's write function:

```cpp
size_t write(const _CharType* ptr, size_t count)
{
    if (!this->can_write() || (count == 0)) return 0;

    // Clamp the copy to the space remaining in the fixed-size buffer.
    size_t write_size = std::min(count, m_size - m_current_position);

    // Copy the data
    std::copy(ptr, ptr + write_size, m_data + m_current_position);

    // Update write head and satisfy pending reads if any
    update_current_position(m_current_position + write_size);

    // Report the full count so the SDK keeps consuming the response,
    // silently truncating whatever doesn't fit.
    return count;
}
```

In this way, if you only care about the HTTP response status code and do not care about the error message at all, you can just provide a 0-sized buffer. If the HTTP body is larger than the buffer, it's truncated. An exception might still be thrown when the SDK tries to parse the truncated response body for the detailed error reason, but it should be catchable. I'm not 100% sure about this, though, so use it only after thorough testing and at your own risk.
@JinmingHu-MSFT I wonder why it actually worked! Thanks as always! Yang
We're going to close this issue because of inactivity; feel free to reopen it if you have any further questions.
@yxiang92128 Please use the download-to-buffer API in the new (Track2) SDK; see the referenced thread. Note there's a small behavior difference when the size of the buffer doesn't match the size of the blob.
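For reference, the Track2 C++ SDK can download straight into a caller-owned buffer; a sketch (the connection string and names are placeholders):

```cpp
#include <azure/storage/blobs.hpp>

auto client = Azure::Storage::Blobs::BlobClient::CreateFromConnectionString(
    connection_string, container_name, blob_name);

std::vector<uint8_t> buffer(16 * 1024 * 1024);

// Writes directly into the caller's buffer; fails if the buffer is too small.
client.DownloadTo(buffer.data(), buffer.size());
```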
@Jinming-Hu Looks like we will have to stick with the original SDK for a while without switching to Track2. We are now in the phase where write operations are being fully implemented, including upload_chunk as well. Many thanks for any potential hint. Yang
@yxiang92128 What is your use scenario? For example, the blob size, the parallelism of upload jobs, and the kind of stream you pass to upload_from_stream.
16M blobs, up to 256 threads uploading simultaneously, and the stream contents are in Parquet format.
@yxiang92128 Can you give me an example of how you call upload_from_stream?
Yes, but we can avoid some copies.
It's in a raw buffer, so I converted it as follows; let me know if I can do better to avoid copies:
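Presumably the conversion looked something like this (`raw_buf` and `raw_len` are placeholders for the existing buffer):

```cpp
#include <cpprest/containerstream.h>

// raw_buf / raw_len: the existing raw buffer holding the data to upload
std::vector<uint8_t> vec(raw_buf, raw_buf + raw_len);  // copy #1: raw buffer -> vector

// open_istream takes the container by value: copy #2 into the stream's buffer
auto is = concurrency::streams::bytestream::open_istream(vec);

blob.upload_from_stream(is);
```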
Thanks! Yang
So the original data is in the raw buffer, the vector construction makes one copy, and I believe there's another copy inside the stream. I'm not sure if there are more copies.
@yxiang92128 Can you try this?
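Judging from the follow-up messages, the suggestion was a `rawptr_stream`-based upload, roughly:

```cpp
#include <cpprest/rawptrstream.h>

// Wrap the existing raw buffer without copying it;
// raw_buf must stay alive until the upload completes.
auto is = concurrency::streams::rawptr_stream<uint8_t>::open_istream(raw_buf, raw_len);

blob.upload_from_stream(is);
is.close().wait();
```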
You have to ensure the raw buffer stays alive until the upload completes.
AZ_blob.cpp:392:19: error: ‘rawptr_stream’ is not a member of ‘Concurrency::streams’

Any other workaround?
@yxiang92128 Did you #include <cpprest/rawptrstream.h>?
That worked. I will make a patch and release to test. Thanks!! Yang
@yxiang92128 Why? Where's that copy?
@yxiang92128 We can check by uploading a huge blob, like an 8GB one. If the process uses 16GB of memory, then there's an extra copy.
Would parallelism_factor matter in this case? Do I need to set it to "1" only?
I don't think so, because rawptr_stream is seekable. It would be a different story for a non-seekable stream.
Got it! Thanks. |