
BadZipFile error when ran on AWS lambda #3759

Open
pastram-i opened this issue Oct 29, 2024 · 5 comments
@pastram-i

pastram-i commented Oct 29, 2024

My custom image works as expected when run locally against a test.docx from an s3 path.

But when I upload the image to lambda, I get the error BadZipFile: Bad magic number for central directory from the partition call (from unstructured.partition.auto import partition) - even though the file isn't a zip, and it's still the same test.docx from s3.

Example code below:

from typing import TYPE_CHECKING, Any, Callable, List, Optional, Union
import botocore
import os
import tempfile
#langchain unstructured loader docs = https://python.langchain.com/api_reference/_modules/langchain_community/document_loaders/unstructured.html#UnstructuredBaseLoader
from langchain_community.document_loaders.unstructured import UnstructuredBaseLoader

#langchain s3 loader docs = https://python.langchain.com/api_reference/_modules/langchain_community/document_loaders/s3_file.html#S3FileLoader
class S3FileLoader(UnstructuredBaseLoader):
    def __init__(
        self,
        bucket: str,
        key: str,
        *,
        region_name: Optional[str] = None,
        api_version: Optional[str] = None,
        use_ssl: Optional[bool] = True,
        verify: Union[str, bool, None] = None,
        endpoint_url: Optional[str] = None,
        aws_access_key_id: Optional[str] = None,
        aws_secret_access_key: Optional[str] = None,
        aws_session_token: Optional[str] = None,
        boto_config: Optional[botocore.client.Config] = None,
        mode: str = "single",
        post_processors: Optional[List[Callable]] = None,
        **unstructured_kwargs: Any,
    ):
        super().__init__(mode, post_processors, **unstructured_kwargs)
        self.bucket = bucket
        self.key = key
        self.region_name = region_name
        self.api_version = api_version
        self.use_ssl = use_ssl
        self.verify = verify
        self.endpoint_url = endpoint_url
        self.aws_access_key_id = aws_access_key_id
        self.aws_secret_access_key = aws_secret_access_key
        self.aws_session_token = aws_session_token
        self.boto_config = boto_config

    def _get_elements(self) -> List:
        from unstructured.partition.auto import partition

        import boto3
        s3 = boto3.client(
            "s3",
            region_name=self.region_name,
            api_version=self.api_version,
            use_ssl=self.use_ssl,
            verify=self.verify,
            endpoint_url=self.endpoint_url,
            aws_access_key_id=self.aws_access_key_id,
            aws_secret_access_key=self.aws_secret_access_key,
            aws_session_token=self.aws_session_token,
            config=self.boto_config,
        )

        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = f"{temp_dir}/{self.key}"

            os.makedirs(os.path.dirname(file_path), exist_ok=True)

            s3.download_file(self.bucket, self.key, file_path)
            #for logging purposes
            print(file_path)
            #Error provided here on the return (BadZipFile: Bad magic number for central directory)
            return partition(filename=file_path, **self.unstructured_kwargs)

bucket = <redacted>
key = <redacted>
loader = S3FileLoader(bucket, key)

The file path printed for logging inside the lambda:

/tmp/tmp1ewdwsvg/5fd0ca9a-4673-5363-b2e9-416b2256b741/1110f0d7-8d77-5568-9209-6cde8d271661/test.docx

@scanny
Collaborator

scanny commented Oct 30, 2024

@pastram-i the DOCX format is in fact a Zip archive that contains the XML files (and images etc.) that define the document.

So it's entirely plausible to get a Zip-related error when trying to partition one.

I'd be inclined to suspect some kind of corruption has occurred in the S3 round-trip. Can you possibly do a SHA1 or MD5 hash check on the before and after to see if the file was changed in some way? I believe the central directory on a Zip archive is at the end of the file (to make appending efficient), so my first guess would be truncation of some sort, leaving the file-type identifier at the very top of the file in place.
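
As an aside, one way to approximate the "before" hash without re-downloading the original is to compare the local file's MD5 against the object's S3 ETag, which for single-part uploads is the hex MD5 of the stored bytes (multipart uploads use a different ETag format, so this check only works for single-part objects). This is a rough sketch, not something verified in this thread; bucket, key, and path names are placeholders:

import hashlib

import boto3

# Placeholder values for illustration only - substitute the real bucket/key/path.
BUCKET = "my-bucket"
KEY = "path/to/test.docx"
LOCAL_PATH = "/tmp/test.docx"

s3 = boto3.client("s3")

# For single-part uploads the ETag is the hex MD5 of the object, wrapped in double quotes.
remote_md5 = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"].strip('"')

# MD5 of the file as it exists on disk after the download.
with open(LOCAL_PATH, "rb") as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()

print("remote:", remote_md5, "local:", local_md5, "match:", remote_md5 == local_md5)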

@pastram-i
Author

Thanks for the response @scanny -

@pastram-i the DOCX format is in fact a Zip archive that contains the XML files (and images etc.) that define the document.

So it's entirely plausible to get a Zip-related error when trying to partition one.

Yeah - shortly after posting this, I did notice this comment that mentions that docx == zip.

I'd be inclined to suspect some kind of corruption has occurred in the S3 round-trip.

It would be weird that the corruption wouldn't happen in the local Docker image run, but does in the lambda image run - unless the corruption isn't in the trip itself, but in the saving of the file in the lambda file system?

To test for this though, I tried to use bytes instead but still got the same result.

        import io
        with io.BytesIO() as file_obj:
            s3.download_fileobj(self.bucket, self.key, file_obj)
            file_obj.seek(0)
            return partition(file=file_obj, **self.unstructured_kwargs)
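
For what it's worth, the buffer could also be inspected right before that return - a sketch only, assuming import hashlib sits alongside the io import above:

            data = file_obj.getvalue()                       # entire downloaded payload; position is unaffected
            print("downloaded bytes:", len(data))            # how many bytes actually arrived
            print("sha1:", hashlib.sha1(data).hexdigest())   # compare against a hash of the original file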

Can you possibly do a SHA1 or MD5 hash check on the before and after to see if the file was changed in some way?

I'll be honest here - I'm not sure how I'd be able to do a hash check remotely from s3 before retrieval, to compare to the after? The below is the closest I can think of, but let me know if I'm missing the goal here.

import hashlib
....
        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = f"{temp_dir}/{self.key}"
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            before_hash = self._calculate_hash(file_path)
            s3.download_file(self.bucket, self.key, file_path)
            after_hash = self._calculate_hash(file_path)

            if before_hash != after_hash:
                print("File has been modified during download")
            else:
                print("File has not been modified during download")
            return partition(filename=file_path, **self.unstructured_kwargs)

    def _calculate_hash(self, file_path):
        with open(file_path, 'rb') as file:
            file_hash = hashlib.sha1()
            while True:
                data = file.read(4096)
                if not data:
                    break
                file_hash.update(data)
            return file_hash.hexdigest()

Which is honestly just ending with a FileNotFoundError: [Errno 2] No such file or directory, which is odd since nothing in the handling of the file changed when doing this... lol

@scanny
Collaborator

scanny commented Oct 30, 2024

Hmm. Dunno. I would observe though that if you had a partial Zip archive, like you took one of 100k bytes and just truncated it at 50k bytes, this behavior would be plausible. The identification as a Zip archive is based on the first few bytes of the file and the central directory is at the end of the file.

It looks like it's getting as far as identifying the file as a Zip, and failing in the disambiguation code that reads the archive contents to figure out more specifically what flavor of Zip it is (DOCX, PPTX, XLSX, etc.).

But I don't have any good ideas about what's happening here. I think you need to find some means of observing what's happening, like writing to logs or something.
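
For example, a few lines of logging right after the download would show whether the bytes on disk still look like an intact zip. This is a rough sketch; file_path here is a placeholder for wherever the lambda downloaded the file:

import os
import zipfile

file_path = "/tmp/test.docx"  # placeholder for the downloaded file

# A valid DOCX (zip) starts with b"PK\x03\x04"; the end-of-central-directory
# record lives in the last few dozen bytes of the archive.
with open(file_path, "rb") as f:
    head = f.read(4)

print("size on disk:", os.path.getsize(file_path))
print("first bytes:", head)
print("looks like a zip:", zipfile.is_zipfile(file_path))

# On an intact file this lists the DOCX parts (word/document.xml, etc.);
# on a truncated one it raises BadZipFile, matching the error from partition().
try:
    print(zipfile.ZipFile(file_path).namelist()[:5])
except zipfile.BadZipFile as exc:
    print("BadZipFile:", exc)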

Btw the locally computed hash should be identical to the remotely computed one after the download, assuming you use the same hash type (SHA1 would be my first choice, which it looks like you're using). That's the way Git works, so no need to try to do the "before" in the lambda code.

@pastram-i
Author

Btw the locally computed hash should be identical to the remotely computed one after the download, assuming you use the same hash type (SHA1 would be my first choice, which it looks like you're using). That's the way Git works, so no need to try to do the "before" in the lambda code.

Good catch - I guess I didn't consider this.

I think you're on to something. The "absolute local" and the "image local" (that gets from s3, but runs on my machine) hashes do match. However, the lambda hash does not. So I guess there is some sort of corruption happening here..

Though, I'm thinking I might sidebar this lambda deployment method either way. To get past this error and work on the rest of the process, I passed a .pdf instead. That appears to process fine, but my use case provides the file through an AWS API Gateway, which has a 30s cutoff, and I reach a timeout on the API before the lambda can process the file...

Currently exploring other deployment options that would fit our use better.

@scanny
Collaborator

scanny commented Oct 31, 2024

If the SHAs don't match, next thing to look at is the length. If the lambda version is longer, maybe there's a wrapper in there somewhere. If it's shorter, well then something is definitely going wrong up there :) Maybe the call doesn't wait for the download to finish or something, dunno much about AWS Lambda.
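
A minimal sketch of that length check, reusing the placeholder bucket/key/path names from the earlier snippet - compare the ContentLength S3 reports for the object against the size of the file the lambda actually wrote:

import os

import boto3

BUCKET = "my-bucket"          # placeholders - substitute the real values
KEY = "path/to/test.docx"
LOCAL_PATH = "/tmp/test.docx"

s3 = boto3.client("s3")

remote_size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
local_size = os.path.getsize(LOCAL_PATH)

print("remote:", remote_size, "local:", local_size, "match:", remote_size == local_size)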
