[Feature] need an example of using doc_quality plugin with installed pypi packages #575

Open
1 of 2 tasks
sujee opened this issue Sep 6, 2024 · 4 comments
Labels: enhancement, med priority

Comments

@sujee
Contributor

sujee commented Sep 6, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

The current sample code looks for bad_word_filepath in the project directory (it assumes the code is run from the source tree).

The file is currently located at: transforms/language/doc_quality/ray/ldnoobw/en/

We need an example showing how to use this transform with the installed PyPI packages.

doc_quality_basedir = os.path.join(rootdir, "transforms", "language", "doc_quality", "python")
worker_options = {"num_cpus" : MY_CONFIG.RAY_NUM_CPUS}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": MY_CONFIG.RAY_RUNTIME_WORKERS,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    # doc quality configuration
    text_lang_cli_param: "en",
    doc_content_column_cli_param: "contents",
    bad_word_filepath_cli_param: os.path.join(doc_quality_basedir, "ldnoobw", "en"),
}

I have the following packages installed:

data_prep_toolkit                0.2.1.dev2
data_prep_toolkit_ray            0.2.1.dev2
data_prep_toolkit_transforms     0.2.1.dev2
data_prep_toolkit_transforms_ray 0.2.1.dev2

Reproduction script

https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_ray.ipynb

Step 7

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
sujee added the "bug" label on Sep 6, 2024
sujee changed the title from "[Bug] need workable example of doc_quality plugin using installed pypi packages" to "[Bug] need an example of using doc_quality plugin with installed pypi packages" on Sep 6, 2024
daw3rd added the "enhancement" label and removed the "bug" label on Sep 12, 2024
daw3rd changed the title from "[Bug] need an example of using doc_quality plugin with installed pypi packages" to "[Feature] need an example of using doc_quality plugin with installed pypi packages" on Sep 12, 2024
@dtsuzuku-ibm
Collaborator

dtsuzuku-ibm commented Sep 20, 2024

I might be misunderstanding something, but if the request is to include the bad-words file in the PyPI package, that sounds odd to me.
Since the bad-words file is something the user of doc_quality should prepare, it seems natural to me that the user specifies the path to a bad-words file in their own project directory.

@sujee
Contributor Author

sujee commented Oct 29, 2024

There is no need to publish the bad-words files to PyPI.
But can we give a URL to an accessible bad-words file? We could point to our example on GitHub (https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/doc_quality/ray/ldnoobw) or to any other open-source one.
That way the user can download it and use it locally.

@shahrokhDaijavad
Member

@sujee I am not sure whether you are just asking a question or whether you want @dtsuzuku-ibm to make changes in his code. If it is the former (e.g., you want to use this transform in a Colab notebook and have no access to the local directory), you can specify the file path as a parameter and use what we have in the ldnoobw directory of our repo. The files in this directory are all publicly available, i.e., they are open source, so downloading them from our repo or from other open-source URLs makes no difference. If you are suggesting a code change, can you be more specific? Thanks.


@sujee
Contributor Author

sujee commented Oct 29, 2024

No code change is necessary, just to be clear :-)

I will work on an example showcasing:

  1. Downloading the bad-words files from a location (ours or any other source).
  2. Using them with the transform.

For (1), are there published bad-words files we can access?
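
The two steps listed above could be sketched roughly as follows. This is only a minimal illustration, not the project's documented approach: the helper names `fetch_bad_words_file` and `load_bad_words` are hypothetical, and `EXAMPLE_BAD_WORDS_URL` is an assumed raw-content form of the repository directory linked in this thread, which should be verified before use. The sketch uses only the Python standard library.

```python
import shutil
import urllib.request
from pathlib import Path

# ASSUMPTION: raw-content form of the ldnoobw directory linked in this thread,
# pointing at the "en" file. Verify that it resolves before relying on it.
EXAMPLE_BAD_WORDS_URL = (
    "https://raw.githubusercontent.com/IBM/data-prep-kit/dev/"
    "transforms/language/doc_quality/ray/ldnoobw/en"
)


def fetch_bad_words_file(url: str, dest_dir: str, filename: str = "en") -> str:
    """Download a bad-words file into dest_dir (once) and return its local path."""
    dest = Path(dest_dir) / filename
    if dest.exists():
        # Already downloaded on a previous run; reuse the local copy.
        return str(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return str(dest)


def load_bad_words(path: str) -> set[str]:
    """Read one word per line, skipping blanks, as a quick sanity check."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}
```

Calling `fetch_bad_words_file(EXAMPLE_BAD_WORDS_URL, "./ldnoobw")` would return a local path, which could then be supplied as the `bad_word_filepath_cli_param` value in the `params` dictionary from the issue description, in place of the path built from `doc_quality_basedir`.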
