Enhance Code2Parquet module to handle non-code text as well #520
Comments
What extensions are you proposing be handled as text? Can we be more specific about the requirements?
I do not think it matters. All the files inside the zip are handled as text.
@daw3rd The requirement to "handle" general text (code and non-code) came from Nirmit last Friday, when I was showing him the new root README files. He said that in addition to code, PDF, and HTML ingestion, we should have text too. When I discussed this with Boris, he convinced me that the best way to do this is to generalize code2parquet to handle arbitrary text and not just code text.
That does not provide any more detail on the requirements.
Why not merge HTML, text, and PDF handling into the same module? Probably because of the varying requirements, but not having a text2parquet when we already have pdf2parquet and html2parquet breaks the pattern. I'm not convinced adapting code2parquet is the right way to go.
Answers and comments to your questions:
When I discussed this with Boris, he mentioned the alternative of creating an independent text2parquet, but he convinced himself (and me) that it would mean duplicating almost everything code2parquet already does, whereas adding a new flag to that code would be much more efficient.
Component
Other
Feature
The current version of Code2Parquet takes zip files made of text files that are all code. We need the ability to handle plain text (non-code) files as input as well.
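To make the request concrete, here is a minimal sketch of the flag-based approach discussed in the comments. It is not the actual Code2Parquet implementation: the function name `zip_to_rows`, the flag name `detect_language`, the column names, and the extension map are all assumptions made for illustration. It only shows how the same zip-reading path could serve both code and plain text, with language detection switched off for non-code input. Writing the rows out as Parquet is left out.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, List

# Hypothetical extension-to-language map; the module's real mapping is not shown in this issue.
CODE_EXTENSIONS = {".py": "Python", ".java": "Java", ".c": "C", ".go": "Go"}


def zip_to_rows(zip_bytes: bytes, detect_language: bool = True) -> List[Dict[str, str]]:
    """Turn every file in a zip archive into one row of text.

    `detect_language` is an assumed flag name. When it is False, files are
    treated as plain text and no programming-language column is added.
    """
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, they carry no text
            ext = PurePosixPath(info.filename).suffix.lower()
            row = {
                "title": info.filename,
                # Decode leniently so one badly encoded file does not abort the archive.
                "contents": archive.read(info).decode("utf-8", errors="replace"),
                "ext": ext,
            }
            if detect_language:
                row["programming_language"] = CODE_EXTENSIONS.get(ext, "unknown")
            rows.append(row)
    return rows
```

With `detect_language=False`, a zip of `.txt` or `.md` files yields rows with only `title`, `contents`, and `ext`, which is the non-code behavior this issue asks for; the default keeps the existing code-oriented output.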