Return image data from confluence #72

Open
ML-Abdula opened this issue Jun 24, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@ML-Abdula

from unstructured.ingest.connector.confluence import ConfluenceAccessConfig, SimpleConfluenceConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import ConfluenceRunner

if __name__ == "__main__":
    runner = ConfluenceRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="confluence-ingest-output",
            num_processes=2,
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(
            strategy="hi_res",
            pdf_infer_table_structure=True,
            metadata_exclude=["filename", "file_directory", "metadata.data_source.date_processed"],
        ),
        connector_config=SimpleConfluenceConfig(
            access_config=ConfluenceAccessConfig(
                api_token="api-key",
            ),
            user_email="my-email",
            url="url",
        ),
    )
    runner.run()

This returns a list of JSON elements with hierarchy, but even with strategy="hi_res" and pdf_infer_table_structure=True I'm unable to access any image data. All I get is textual data. I do need the text, but in my use case I'm also looking for the images from the same document.

@ML-Abdula ML-Abdula added the enhancement New feature or request label Jun 24, 2024
@ML-Abdula

2024-06-24 08:14:06,670 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/confluence/d78233987c
2024-06-24 08:14:06,674 MainProcess INFO     running pipeline: DocFactory -> Reader -> Partitioner -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "confluence-ingest-output2", "num_processes": 2, "raise_on_error": false}
2024-06-24 08:14:06,789 MainProcess INFO     Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "confluence-ingest-output2", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/confluence/d78233987c", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"user_email": "[emial], "access_config": {"api_token": "*******"}, "url": "*******", "max_num_of_spaces": 500, "max_num_of_docs_from_each_space": 100, "spaces": []}, "_confluence": null}
2024-06-24 08:14:21,820 MainProcess INFO     processing 155 docs via 2 processes
2024-06-24 08:14:21,879 MainProcess INFO     Calling Reader with 155 docs
2024-06-24 08:14:21,880 MainProcess INFO     Running source node to download data associated with ingest docs
2024-06-24 08:14:57,880 MainProcess INFO     Calling Partitioner with 155 docs
2024-06-24 08:14:57,882 MainProcess INFO     Running partition node to extract content from json files. Config: {"pdf_infer_table_structure": true, "strategy": "hi_res", "ocr_languages": null, "encoding": null, "additional_partition_args": {}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": ["filename", "file_directory", "metadata.data_source.date_processed"], "metadata_include": [], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": false, "api_key": "*******", "hi_res_model_name": null}, partition kwargs: {}]
2024-06-24 08:14:57,888 MainProcess INFO     Creating /root/.cache/unstructured/ingest/pipeline/partitioned
2024-06-24 08:15:00,732 MainProcess INFO     Calling Copier with 155 docs
2024-06-24 08:15:00,734 MainProcess INFO     Running copy node to move content to desired output location

@ML-Abdula ML-Abdula changed the title Returns image data from confluence Return image data from confluence Jun 24, 2024
@ML-Abdula

@christinestraub @scanny could either of you help me with this?

@christinestraub

> This returns a list of json with hierarchy but even with hi_res and pdf_infer_table_structure=True I'm unable to access any image data. All I get is textual data which is required but in my use case I'm also looking for images from same document

@ML-Abdula Do you mean you're unable to get any elements with category "Image" in the returned json? Can you please share the document you're trying to process?
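One quick way to answer that question is to tally the "type" field of every element in the output directory. The following is a minimal sketch, not part of the library: it assumes each output file is a JSON list of element dicts (as described above), and `count_element_types` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def count_element_types(output_dir):
    """Tally the "type" field of every element in the JSON files under output_dir.

    Assumption: each file holds a JSON list of element dicts.
    """
    counts = Counter()
    for path in Path(output_dir).rglob("*.json"):
        with path.open(encoding="utf-8") as f:
            elements = json.load(f)
        for element in elements:
            # Elements without a "type" key are grouped under a placeholder label.
            counts[element.get("type", "<missing>")] += 1
    return counts
```

Calling `count_element_types("confluence-ingest-output")` and looking up `"Image"` in the result would show whether any Image elements were produced at all.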

@scanny

scanny commented Jun 25, 2024

@ML-Abdula Confluence content is web pages, right? So Confluence "documents" would go to partition_html().

HTML does not embed images; rather, it contains <img src=...> "links" to images. partition_html() does not currently traverse those links to download images. Pretty sure the reason for that is the security risk inherent in downloading arbitrary image files.

So I think that explains why no Image elements are present in the output for the Confluence connector. You could suggest an enhancement. Perhaps there's a way to let you download the images yourself or perhaps to identify trusted zones or something. That should be in a separate issue though so it can be discussed independently.
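As a rough illustration of the "download the images yourself" idea, the sketch below collects the image URLs referenced by a page using only the Python standard library. It assumes you already have the page HTML as a string and know the page's base URL; `ImageSrcCollector` and `collect_image_urls` are hypothetical names, not part of unstructured, and actually fetching the files (which would likely need your Confluence credentials) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Turn relative paths into absolute URLs; absolute URLs pass through.
            self.image_urls.append(urljoin(self.base_url, src))


def collect_image_urls(html, base_url):
    """Return the absolute image URLs referenced in html, in document order."""
    collector = ImageSrcCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Keeping the download step separate leaves it up to the caller to decide which hosts to trust before fetching anything.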

@MthwRobinson MthwRobinson transferred this issue from Unstructured-IO/unstructured Aug 26, 2024