[Bug] pdf2parquet is now failing ci/cd builds #552
Labels
bug
Something isn't working
Comments
daw3rd added two commits that referenced this issue on Aug 28, 2024
Signed-off-by: David Wood <[email protected]>
@daw3rd this is now solved, right?
@daw3rd Is this solved? If not, pls provide error msg for @dolfim-ibm to continue investigating.
@dolfim-ibm this is still failing running locally on mac m1
@daw3rd This looks like a temporary network issue of your connection. Can you please verify again?
It works for me. Has to be a network glitch
@daw3rd Can you pls try again and let the team know?
Search before asking
Component
Transforms/Other
What happened + What you expected to happen
CI/CD builds are now failing for pdf2parquet in at least two unrelated PRs, and I can reproduce the failure locally on my Mac M1.
Reproduction script
Anything else
E [21] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. Vsr: A unified framework for document layout analysis combining vision, semantics and relations, 2021.
E
E [22] Peter W J Staar, Michele Dolfi, Christoph Auer, and Costas Bekas. Corpus conversion service: A machine learning platform to ingest documents at scale. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD, pages 774-782. ACM, 2018.
E
E [23] Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data , 6(1):60, 2019."]]
E num_pages: [[9]]
E num_tables: [[5]]
E num_doc_elements: [[147]]
E ext: [["pdf"]]
E hash: [["313bb7ef50bea94a1ef5ae4417f45923cb4ac383d49ba781c874eb9bfbc06be0"]]
E size: [[41244]]
E source_filename: [["2206.01062.pdf"]]
E assert <pyarrow.lib....\n 148\n ]\n] == <pyarrow.lib....\n 147\n ]\n]
E
E Use -v to get more diff
../../../../../data-processing-lib/python/src/data_processing/test_support/abstract_test.py:135: AssertionError
------------------------------------------------------------------------- Captured log call -------------------------------------------------------------------------
INFO pdf2parquet_transform:pdf2parquet_transform.py:286 pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': False, 'do_ocr': False}
INFO data_processing.runtime.execution_configuration:execution_configuration.py:80 pipeline id pipeline_id
INFO data_processing.runtime.execution_configuration:execution_configuration.py:83 code location None
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:195 data factory data_ is using local data access: input_folder - /Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input output_folder - /tmp/pdf2parquet1ug5awwy
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:211 data factory data_ max_files -1, n_sample -1
INFO data_processing.data_access.data_access_factory_base17f66004-ebd0-4524-8132-d2a68d6e87bd:data_access_factory.py:225 data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.zip'], files to checkpoint ['.parquet']
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:46 orchestrator pdf2parquet started at 2024-08-28 14:28:22
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:64 Number of files is 2, source profile {'max_file_size': 4.401191711425781, 'min_file_size': 4.110984802246094, 'total_file_size': 8.512176513671875}
INFO pdf2parquet_transform:pdf2parquet_transform.py:88 Initializing models
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 1 files (50.0%) in 0.193 min
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2206.00785v1.pdf'
INFO pdf2parquet_transform:pdf2parquet_transform.py:186 Processing archive_doc_filename='2305.03393v1.pdf'
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:166 Completed 2 files (100.0%) in 0.599 min
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:170 Done processing 2 files, waiting for flush() completion.
INFO data_processing.runtime.pure_python.transform_orchestrator:transform_orchestrator.py:174 done flushing in 0.0 sec
INFO data_processing.runtime.pure_python.transform_launcher:transform_launcher.py:88 Completed execution in 0.663 min, execution result 0
WARNING data_processing.test_support.abstract_test:abstract_test.py:214 Differences in metadata.json being ignored for now.
INFO data_processing.test_support.abstract_test:abstract_test.py:261 Copying file with difference: /tmp/pdf2parquet1ug5awwy/2206.01062.parquet to /tmp/2206.01062.parquet
========================================================================= warnings summary ==========================================================================
test/test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0]
test/test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0]
/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/venv/lib/python3.11/site-packages/torch/nn/modules/transformer.py:307: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== short test summary info ======================================================================
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2JsonParquetTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_json-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
FAILED test_pdf2parquet_python.py::TestPythonPdf2ParquetNoTableTransform::test_transform[launcher0-cli_params0-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/input-/Users/dawood/git/data-prep-kit/transforms/language/pdf2parquet/python/test-data/expected_md_no_table-ignore_columns0] - AssertionError: Row 0 of table 0 are not equal
======================================================== 3 failed, 1 passed, 2 warnings in 202.35s (0:03:22) ========================================================
make: *** [.defaults.test-src] Error 1
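The assertion above reports only "Row 0 of table 0 are not equal", while the dump suggests that `num_doc_elements` (148 vs 147) is the column that differs. A minimal sketch for narrowing a mismatch down to specific columns and rows might look like the following. `diff_columns` is a hypothetical helper, not part of data-prep-kit, and the values are illustrative ones copied from the log; the log does not say which side of the comparison is the generated output, so the two tables are simply named `left` and `right`.

```python
from typing import Any, Dict, FrozenSet, List


def diff_columns(
    left: Dict[str, List[Any]],
    right: Dict[str, List[Any]],
    ignore: FrozenSet[str] = frozenset(),
) -> Dict[str, List[int]]:
    """Return {column: [row indices that differ]} for two column-oriented tables.

    A column present on only one side is reported with an empty index list.
    """
    diffs: Dict[str, List[int]] = {}
    for name in sorted(set(left) | set(right)):
        if name in ignore:
            continue
        a = left.get(name)
        b = right.get(name)
        if a is None or b is None:
            diffs[name] = []  # column missing on one side
            continue
        n = max(len(a), len(b))
        # A row differs if it is missing on either side or the values are unequal.
        rows = [i for i in range(n) if i >= len(a) or i >= len(b) or a[i] != b[i]]
        if rows:
            diffs[name] = rows
    return diffs


# Illustrative values taken from the log above (hypothetical reconstruction).
left = {"num_pages": [9], "num_tables": [5], "num_doc_elements": [148]}
right = {"num_pages": [9], "num_tables": [5], "num_doc_elements": [147]}
print(diff_columns(left, right))  # {'num_doc_elements': [0]}
```

With pyarrow installed, the two dictionaries could be obtained from the files named in the log, for example via `pyarrow.parquet.read_table(path).to_pydict()`, assuming the copied file in /tmp and the matching file under test-data/expected are the pair being compared.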
OS
Ubuntu, MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?