[Bug] pdf2parquet is now failing ci/cd builds #552

Open
1 of 2 tasks
daw3rd opened this issue Aug 28, 2024 · 6 comments
@daw3rd
Member

daw3rd commented Aug 28, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

CI/CD builds are now failing for pdf2parquet in at least two unrelated PRs, and I can reproduce the failure locally on my Mac M1.

  1. https://github.com/IBM/data-prep-kit/actions/runs/10594844975/job/29359345957?pr=545
  2. https://github.com/IBM/data-prep-kit/actions/runs/10599667381/job/29378015278?pr=548

Reproduction script

cd transforms/language/pdf2parquet/python
make test-src

Anything else

E [21] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: A unified framework for document layout analysis combining vision, semantics and relations, 2021.
E
E [22] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD, pages 774-782. ACM, 2018.
E
E [23] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data , 6(1):60, 2019."]]
E num_pages: [[9]]
E num_tables: [[5]]
E num_doc_elements: [[147]]
E ext: [["pdf"]]
E hash: [["313bb7ef50bea94a1ef5ae4417f45923cb4ac383d49ba781c874eb9bfbc06be0"]]
E size: [[41244]]
E source_filename: [["2206.01062.pdf"]]
E assert <pyarrow.lib....\n 148\n ]\n] == <pyarrow.lib....\n 147\n ]\n]
E
E Use -v to get more diff

../../../../../data-processing-lib/python/src/data_processing/test_support/abstract_test.py:135: AssertionError
------------------------------------------------------------------------- Captured log call -------------------------------------------------------------------------
INFO pdf2parquet_transform:pdf2parquet_transform.py:286 pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': False, 'do_ocr': False}
INFO data_processing.runtime.execution_configuration:execution_configuration.py:80 pipeline id pipeline_id
INFO data_processing.runtime.execution_configuration:execution_configuration.py:83 code location None
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:195 data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquet1ug5awwy
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:211 data factory data_ max_files -1, n_sample -1
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:225 data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:46 orchestrator pdf2parquet started at 2024-08-28 14:28:22
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:64 Number of files is 2, source profile {'max_file_size': 4.401191711425781, 'min_file_size': 4.110984802246094, 'total_file_size': 8.512176513671875}
INFO pdf2parquet_transform:pdf2parquet_transform.py:88 Initializing models
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 1 files (50.0%) in 0.193 min
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2206.00785v1.pdf'
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2305.03393v1.pdf'
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 2 files (100.0%) in 0.599 min
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:170 Done processing 2 files, waiting for flush() completion.
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:174 done flushing in 0.0 sec
INFO data_processing.runtime.pure_python.transform_launcher:transform_launcher.py:88 Completed execution in 0.663 min, execution result 0
WARNING data_processing.test_support.abstract_test:abstract_test.py:214 Differences in metadata.json being ignored for now.
INFO data_processing.test_support.abstract_test:abstract_test.py:261 Copying file with difference: /tmp/pdf2parquet1ug5awwy/2206.01062.parquet to /tmp/2206.01062.parquet
========================================================================= warnings summary ==========================================================================
test/test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0]
test/test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0]
/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== short test summary info ======================================================================
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetNoTableTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_md_no_table-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
======================================================== 3 failed, 1 passed, 2 warnings in 202.35s (0:03:22) ========================================================
make: *** [.defaults.test-src] Error 1
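
To narrow down which columns differ between a produced file and its expected counterpart, a minimal sketch follows. It is not the comparison that `abstract_test.py` performs; `differing_columns` is a hypothetical helper, and it assumes each table has already been loaded into a plain column-name-to-values mapping (for example via `pyarrow.parquet.read_table(path).to_pydict()`). The sample values only echo the 148 vs. 147 mismatch in the assertion above; which side is the actual output is an assumption.

```python
from typing import Any, Dict, Iterable, List, Mapping, Tuple


def differing_columns(
    actual: Mapping[str, List[Any]],
    expected: Mapping[str, List[Any]],
    ignore: Iterable[str] = (),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {column: (actual_values, expected_values)} for every column that differs.

    A column missing on one side is reported with None on that side.
    """
    skipped = set(ignore)
    diffs: Dict[str, Tuple[Any, Any]] = {}
    for name in sorted(set(actual) | set(expected)):
        if name in skipped:
            continue
        a, e = actual.get(name), expected.get(name)
        if a != e:
            diffs[name] = (a, e)
    return diffs


# Illustrative values only (assumption: 148 is the newly produced count).
produced = {"num_pages": [9], "num_tables": [5], "num_doc_elements": [148]}
reference = {"num_pages": [9], "num_tables": [5], "num_doc_elements": [147]}
mismatches = differing_columns(produced, reference)
```

With real files, the two mappings could come from the copied output (`/tmp/2206.01062.parquet` in the log above) and the matching file under `test-data/expected`, assuming pyarrow is installed in the venv.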

OS

Ubuntu, MacOS (limited support)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@daw3rd daw3rd added the bug Something isn't working label Aug 28, 2024
daw3rd added a commit that referenced this issue Aug 28, 2024
daw3rd added a commit that referenced this issue Aug 28, 2024
@dolfim-ibm
Member

@daw3rd this is now solved, right?

@Bytes-Explorer
Collaborator

@daw3rd Is this solved? If not, pls provide error msg for @dolfim-ibm to continue investigating.

@daw3rd
Member Author

daw3rd commented Sep 12, 2024

@dolfim-ibm this is still failing when run locally on a Mac M1.
Again:

cd transforms/language/pdf2parquet/python
make test-src
...
test_pdf2parquet.py s
test_pdf2parquet_python.py Using temporary output path /tmp/pdf2parquetmy6c416f
12:35:18 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'double_precision': 0}
12:35:18 INFO - pipeline id pipeline_id
12:35:18 INFO - code location None
12:35:18 INFO - data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquetmy6c416f
12:35:18 INFO - data factory data_ max_files -1, n_sample -1
12:35:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
12:35:18 INFO - orchestrator pdf2parquet started at 2024-09-12 12:35:18
12:35:18 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}
12:35:18 INFO - Initializing models
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.49k/3.49k [00:00<00:00, 67.1MB/s]
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 15.45it/s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1058, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 996, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1475, in connect
    self.sock = self._context.wrap_socket(self.sock,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 517, in wrap_socket
    return self.sslsocket_class._create(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1104, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1382, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)

...
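
One way to check whether the `CERTIFICATE_VERIFY_FAILED` error comes from the local certificate store rather than the network is to print where this interpreter looks for CA certificates. The sketch below uses only the standard library; `describe_ca_config` is a hypothetical helper name, not part of data-prep-kit. The traceback paths point to a python.org framework build, and as far as I know those builds ship an `Install Certificates.command` script that has to be run once to install a CA bundle, so a missing CA file here would be consistent with that. This is an assumption, not a confirmed diagnosis.

```python
import os
import ssl
from typing import Any, Dict


def describe_ca_config() -> Dict[str, Any]:
    """Collect the CA certificate locations this Python's ssl module will use."""
    paths = ssl.get_default_verify_paths()
    cafile = paths.openssl_cafile
    return {
        # Resolved locations (None when the file or directory does not exist).
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Compiled-in default location, whether or not it exists on disk.
        "openssl_cafile": cafile,
        "openssl_cafile_exists": bool(cafile) and os.path.exists(cafile),
        # Environment override that OpenSSL honours, if set.
        "SSL_CERT_FILE": os.environ.get("SSL_CERT_FILE"),
    }


if __name__ == "__main__":
    for key, value in describe_ca_config().items():
        print(f"{key}: {value}")
```

If `cafile` and `capath` both come back as `None` and `SSL_CERT_FILE` is unset, the interpreter has no default CA bundle to verify against, which would explain the failure independently of the connection.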

@dolfim-ibm
Member

@daw3rd This looks like a temporary network issue with your connection. Can you please verify again?

@blublinsky
Collaborator

It works for me. It has to be a network glitch.

@Bytes-Explorer
Collaborator

@daw3rd Can you pls try again and let the team know?
