Enhance Code2Parquet module to handle non-code text as well #520

shahrokhDaijavad · 2024-08-19T21:46:23Z

Search before asking

I searched the issues and found no similar issues.

Component

Other

Feature

The current version of Code2Parquet takes zip files whose members are all code files. We need it to accept plain-text (non-code) files as input as well.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

daw3rd · 2024-08-21T12:09:00Z

What extensions are you proposing be handled as text? Can we be more specific about requirements?

blublinsky · 2024-08-21T12:11:16Z

I do not think it matters. All the files inside the zip are handled as text.
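
For illustration only, a minimal sketch of what "handled as text" could look like, assuming every archive member is decoded as UTF-8 regardless of its extension. The function name `read_zip_as_text` is hypothetical and is not taken from the code2parquet source.

```python
import io
import zipfile


def read_zip_as_text(zip_bytes: bytes) -> dict[str, str]:
    """Return {member path: decoded contents} for every file in the archive.

    Hypothetical helper: the extension is never inspected, so .py, .java,
    and .txt members all go through the same path.
    """
    contents = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            raw = archive.read(info)
            # Assumption: UTF-8 input; undecodable bytes are replaced
            # rather than aborting the whole archive.
            contents[info.filename] = raw.decode("utf-8", errors="replace")
    return contents
```

Under this assumption a binary member would still be decoded, just with replacement characters, so a real transform might want to filter or flag such files.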

shahrokhDaijavad · 2024-08-21T12:25:39Z

@daw3rd The requirement to "handle" general text (code and non-code) came from Nirmit last Friday, when I was showing him the new root README files. He said that, in addition to code, PDF, and HTML ingestion, we should have text too. When I discussed this with Boris, he convinced me that the best way to do this is to generalize code2parquet to handle arbitrary text, not just code.

daw3rd · 2024-08-21T17:49:07Z

That does not provide any more detail on the requirements.

Do you want to import .txt files from zips similarly to how we do .pdf contained in zip?
Are there other extensions?
In the case of code2parquet, we know it is code (.py, .java, .c, etc.). How do we know which kind we're importing? Or maybe we don't add any code-specific columns?

Why not merge HTML, text, and PDF handling into the same module? Probably because of the varying requirements. But having pdf2parquet and html2parquet without a text2parquet breaks the pattern. I'm not convinced adapting code2parquet is the right way to go.

shahrokhDaijavad · 2024-08-21T18:36:01Z

Answers and comments to your questions:

Yes
No
I don't think Boris has changed anything in what the original code2parquet does when the code_data parameter is true (the default), so the ext column holds the file extension extracted from the file path (.py, .java, ...). I assume that with the new code, when code_data is false, the ext column says .txt? (@blublinsky to confirm.)

When I discussed this with Boris, he mentioned the alternative option of creating an independent text2parquet, but he convinced himself (and me) that it meant duplicating almost everything that the code2parquet was doing, whereas adding a new flag to that code was much more efficient.
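
To make the flag idea concrete, here is a rough sketch of one way a single code path could serve both cases. It is not the actual code2parquet implementation. Only `code_data` and the `ext` column are named in this thread. The function `build_row` and the `document` and `contents` column names are hypothetical. What `ext` holds when `code_data` is false is still unconfirmed above, so the sketch simply omits the column in that case.

```python
import os


def build_row(path: str, text: str, code_data: bool = True) -> dict:
    """Build one output row for a file taken from the zip.

    Hypothetical sketch: "document" and "contents" are assumed column names.
    """
    row = {"document": path, "contents": text}
    if code_data:
        # Code-specific column: extension taken from the file path,
        # e.g. "src/main.py" -> ".py".
        row["ext"] = os.path.splitext(path)[1]
    return row


code_row = build_row("src/main.py", "print('hi')\n")
text_row = build_row("notes/readme.txt", "plain text\n", code_data=False)
```

Rows shaped like this could then be collected into a table (for example with `pyarrow.Table.from_pylist`) and written out as Parquet. Everything except the one conditional is shared between the two modes, which is the duplication argument made above.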

shahrokhDaijavad added the enhancement label Aug 19, 2024

shahrokhDaijavad assigned blublinsky Aug 19, 2024

blublinsky mentioned this issue Aug 21, 2024

refactoring code to parquet to zip2parquet #525

Open
