bug/execution gets stuck #3756

Open
jjovalle99 opened this issue Oct 25, 2024 · 1 comment
Labels
bug Something isn't working


Hi,

We tried to parse about 5,000 documents using the Unstructured Serverless API. The code doesn't raise any error, but execution appears to be stuck: it hasn't made any progress in about 12 hours. Here are the last lines of the logs:

INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
2024-10-15 22:39:29,883 MainProcess INFO  partition finished in 3970.023798094s, attributes: file_id=e35763b61a2b
INFO: partition finished in 3970.023798094s, attributes: file_id=e35763b61a2b
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
2024-10-15 22:39:44,163 MainProcess INFO  partition finished in 3985.031034183s, attributes: file_id=15f63456b456
INFO: partition finished in 3985.031034183s, attributes: file_id=15f63456b456
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned set #6, elements added to the final result.
INFO: Successfully partitioned set #7, elements added to the final result.
INFO: Successfully partitioned set #8, elements added to the final result.
INFO: Successfully partitioned set #9, elements added to the final result.
INFO: Successfully partitioned set #10, elements added to the final result.
INFO: Successfully partitioned set #11, elements added to the final result.
2024-10-15 22:39:55,225 MainProcess INFO  partition finished in 3995.593149836s, attributes: file_id=258eaf3414a5
INFO: partition finished in 3995.593149836s, attributes: file_id=258eaf3414a5
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned set #6, elements added to the final result.
INFO: Successfully partitioned set #7, elements added to the final result.
INFO: Successfully partitioned set #8, elements added to the final result.
INFO: Successfully partitioned set #9, elements added to the final result.
INFO: Successfully partitioned set #10, elements added to the final result.
INFO: Successfully partitioned set #11, elements added to the final result.
INFO: Successfully partitioned set #12, elements added to the final result.
INFO: Successfully partitioned set #13, elements added to the final result.
INFO: Successfully partitioned set #14, elements added to the final result.
INFO: Successfully partitioned set #15, elements added to the final result.
2024-10-15 22:39:56,256 MainProcess INFO  partition finished in 3996.584588173s, attributes: file_id=b7f5de59f8b6
INFO: partition finished in 3996.584588173s, attributes: file_id=b7f5de59f8b6

As you can see, the last log entries are from last night. Any idea why this is happening? This is the code I am using:

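# Imports omitted from the original snippet. The unstructured_ingest paths below
# assume the v2 package layout, and Settings is assumed to live in src.settings.
from pydantic import BaseModel
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.connectors.fsspec.gcs import (
    GcsAccessConfig,
    GcsConnectionConfig,
    GcsDownloaderConfig,
    GcsIndexerConfig,
    GcsUploaderConfig,
)
from unstructured_ingest.v2.processes.filter import FiltererConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

from src.settings import Settings


class DocumentParser(BaseModel):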
    def create_pipeline(self, settings: Settings) -> Pipeline:
        connection_config: GcsConnectionConfig = GcsConnectionConfig(
            access_config=GcsAccessConfig(service_account_key=settings.gcp.SERVICE_ACCOUNT_FILE),
        )

        return Pipeline.from_configs(
            context=ProcessorConfig(),
            indexer_config=GcsIndexerConfig(remote_url=f"gs://{settings.gcp.INPUT_BUCKET}", recursive=True),
            downloader_config=GcsDownloaderConfig(),
            source_connection_config=connection_config,
            filterer_config=FiltererConfig(
                file_glob=[
                    "*.pdf",
                ],
            ),
            partitioner_config=PartitionerConfig(
                strategy="hi_res",
                partition_by_api=True,
                api_key=settings.unstructured.UNSTRUCTURED_API_KEY,
                partition_endpoint=settings.unstructured.UNSTRUCTURED_API_URL,
                additional_partition_args={
                    "split_pdf_page": True,
                    "split_pdf_allow_failed": True,
                    "split_pdf_concurrency_level": 15,
                    "extract_image_block_types": ["Image", "Table"],
                },
            ),
            chunker_config=ChunkerConfig(
                chunking_strategy="by_similarity",
                chunk_by_api=True,
                chunk_api_key=settings.unstructured.UNSTRUCTURED_API_KEY,
                chunking_endpoint=settings.unstructured.UNSTRUCTURED_API_URL,
                chunk_max_characters=1024,
            ),
            uploader_config=GcsUploaderConfig(
                remote_url=f"gs://{settings.gcp.OUTPUT_BUCKET}",
            ),
            destination_connection_config=connection_config,
        )

    def run(self, settings: Settings) -> None:
        # create_pipeline() requires settings, so pass them through.
        pipeline: Pipeline = self.create_pipeline(settings=settings)
        pipeline.run()


if __name__ == "__main__":
    from src.settings import settings

    document_parser: DocumentParser = DocumentParser()
    pipeline: Pipeline = document_parser.create_pipeline(settings=settings)
    pipeline.run()
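
Since the run produces no error, one way to narrow down where it is blocked might be to dump every thread's stack periodically while the pipeline runs. The following is only a sketch using Python's standard `faulthandler` module; `run_with_stack_dumps`, `every_s`, and `dump_path` are hypothetical names that are not part of the code above, and it only covers threads in the main process (the logs mention `MainProcess`, so any worker processes would need the same treatment separately).

```python
import faulthandler
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_stack_dumps(fn: Callable[[], T], every_s: float, dump_path: Path) -> T:
    """Run fn() and, while it runs, append all thread stacks to dump_path every every_s seconds."""
    with dump_path.open("a") as dump_file:
        # faulthandler writes straight to the file descriptor from a watchdog
        # thread, so dumps appear even if the main thread is blocked.
        faulthandler.dump_traceback_later(every_s, repeat=True, file=dump_file)
        try:
            return fn()
        finally:
            # Stop the watchdog before the file is closed.
            faulthandler.cancel_dump_traceback_later()
```

With this helper, the last line of the script could become `run_with_stack_dumps(pipeline.run, every_s=600, dump_path=Path("stacks.log"))`. If consecutive dumps show the same frames, that would point to where the process is waiting.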