Reading input from the input range (or file) #261

Open
p-mitana opened this issue Nov 14, 2018 · 10 comments

Comments

@p-mitana

I am trying to work with big files (SQL files ~9MB in size). I have a grammar that defines a single SQL instruction (sort of). I would like to parse the instructions from the input file one by one and avoid reading the entire file into memory with readText, but currently this seems to be impossible.

@veelo
Collaborator

veelo commented Nov 14, 2018

If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

@p-mitana
Author

That is a big "if". With SQL I can't reasonably do it, at least not unless I write a second lexer that splits instructions on semicolons that are not part of strings.

As parsing does not require the entire input at once (it looks char by char anyway), I believe that reading from an input range is an important feature for a parsing library.
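For reference, the separate splitter mentioned above could look roughly like the sketch below. It is written in Python only to keep it short; it is not part of Pegged, and `split_statements` is a hypothetical name. It assumes single-quoted strings in which a doubled quote (`''`) is the only escape, and it deliberately ignores comments, double-quoted identifiers and dialect-specific quoting, which is exactly the kind of duplicated lexing knowledge the comment objects to.

```python
def split_statements(chars):
    """Yield SQL statements from an iterable of characters.

    Splits on ';' outside single-quoted strings. A doubled quote ('')
    toggles the string state twice, so it is effectively kept inside the
    string. Comments and other quoting styles are NOT handled.
    """
    buf = []
    in_string = False
    for ch in chars:
        buf.append(ch)
        if ch == "'":
            in_string = not in_string
        elif ch == ";" and not in_string:
            # End of a statement: hand it out and start a fresh buffer.
            yield "".join(buf).strip()
            buf = []
    tail = "".join(buf).strip()
    if tail:
        # Trailing text without a terminating semicolon.
        yield tail
```

Because the function only iterates over characters, it can be fed lazily, for example with `iter(lambda: f.read(1), "")` on an open text file, so the whole file never has to be in memory at once.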

@veelo
Collaborator

veelo commented Nov 14, 2018

> As parsing does not require the entire input at once

But it does. A rule can only succeed once all its sub-rules succeed. The top rule cannot succeed before the entire input has been read.

@veelo
Collaborator

veelo commented Nov 14, 2018

I don't remember what SQL looks like, but if it is basically a list of instructions and the parser does not need to do much backtracking, you may be able to define your grammar in a way that input after the first instruction is discarded (Instruction .* eoi). Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction, parse that, then advance your moving window buffer by the parsed input length. This way you will process your file instruction by instruction.

@p-mitana
Author

It depends.

If I had a rule that parses the entire SQL file at once, then yes - it will succeed only if it reads all the instructions and EOI.

However, I can have a rule that does not end with EOI, such as an SQL instruction. It can succeed multiple times along one input, and it actually does. When I parse a long string, for example:

SELECT * FROM table1;
SELECT * FROM table2;

it will succeed and parse only the first instruction. After reading the first semicolon, the SQLInstruction rule will succeed, and so will all its sub-rules. Then I can cut the input off at the ParseTree's end property and parse again.

As the parser iterates over the string's characters until the root rule either succeeds or fails, without looking further than it needs to, it could just as well read characters from a range as it needs them. The only concern is the lookahead feature, but in this case requiring a ForwardRange and saving the range on lookahead could do the trick.
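To make the "save the range on lookahead" idea concrete, here is a minimal sketch in Python (not D, and not Pegged's API). `BufferedSource` and `looks_like` are hypothetical names. An integer position stands in for what a D ForwardRange would return from `save`, and characters are pulled from the underlying iterator only when the parser actually asks for them.

```python
class BufferedSource:
    """Pulls characters from an iterator only when the parser asks for them."""

    def __init__(self, chars):
        self._it = iter(chars)
        self._buf = []    # characters read so far and not yet discarded
        self.pos = 0      # current index into _buf
        self.pulled = 0   # total characters pulled from the iterator

    def peek(self):
        """Return the current character, or None at end of input."""
        while self.pos >= len(self._buf):
            try:
                self._buf.append(next(self._it))
                self.pulled += 1
            except StopIteration:
                return None
        return self._buf[self.pos]

    def advance(self):
        self.pos += 1

    def save(self):
        """Remember the current position (stand-in for ForwardRange save)."""
        return self.pos

    def restore(self, saved):
        """Backtrack to a saved position, e.g. after a lookahead."""
        self.pos = saved

    def commit(self):
        """Drop consumed input to free memory; older saves become invalid."""
        del self._buf[:self.pos]
        self.pos = 0


def looks_like(src, word):
    """Positive lookahead: test for `word` without consuming it."""
    saved = src.save()
    ok = True
    for expected in word:
        if src.peek() != expected:
            ok = False
            break
        src.advance()
    src.restore(saved)
    return ok
```

The buffer only has to hold input between the oldest live save point and the furthest character read, so calling `commit` after each successfully parsed instruction would keep memory use bounded by the longest instruction rather than by the file size.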

@p-mitana
Author

> Then you may be able to read a portion of your file that is guaranteed to be large enough for any instruction

Yes, I can do this of course. But I believe it is an overcomplication, as I need to either make an assumption about how long an instruction will be, make several parsing attempts if the instruction is longer than expected, or preparse the file and split the instructions from each other. Having the parser library read my data from a range instead of a string would remove this need entirely.

@veelo
Collaborator

veelo commented Nov 14, 2018

I see. I don't see an easy way to do this, though.

@veelo
Collaborator

veelo commented Nov 14, 2018

Do you know iopipe? https://www.youtube.com/watch?v=9fzttyj4JCs (I have no personal experience with it, though). If you get a parse error because the instruction is longer than your buffer, you could increase the buffer size and retry.
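The moving-window approach with "increase the buffer size and retry" can be sketched as follows. This is Python rather than D, it uses neither iopipe nor Pegged, and `parse_instruction` is a toy stand-in for the grammar's top rule: it simply treats everything up to the first `;` as one instruction. All names and the initial window size are assumptions made for illustration.

```python
import io


def parse_instruction(text):
    """Toy stand-in for the real parser.

    Returns the number of characters consumed by one instruction
    (up to and including the first ';'), or None if `text` does not
    contain a complete instruction.
    """
    end = text.find(";")
    return None if end < 0 else end + 1


def instructions(stream, window=16):
    """Yield instructions from a text stream using a moving window."""
    buf = ""
    eof = False
    while True:
        # Top up the buffer to the current window size.
        while not eof and len(buf) < window:
            chunk = stream.read(window - len(buf))
            if chunk:
                buf += chunk
            else:
                eof = True
        consumed = parse_instruction(buf)
        if consumed is not None:
            yield buf[:consumed].strip()
            buf = buf[consumed:]   # slide the window forward
        elif not eof:
            window *= 2            # instruction did not fit: grow and retry
        else:
            if buf.strip():
                raise ValueError("incomplete instruction at end of input")
            return
```

One limitation of this scheme is that a failed parse is ambiguous: the sketch cannot tell a genuine syntax error from an instruction that was merely cut off by the window, so on a real error it keeps growing the window until the end of the input is reached.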

@p-mitana
Author

I haven't heard about it yet. May be worth trying someday.

In the case of these SQL files, I will probably have to tackle the problem in a very different way, as it turned out that parsing them (in the future possibly many times bigger than they are now) may consume too much memory.

Anyway, thank you for the help, and I hope that this issue will make its way into pegged sometime :)

@denizzzka

> If you can split your input into instructions outside of the grammar, you can read your file however you like and let your parser parse each instruction individually.

In this case, line numbering in error messages will be broken

Hi from 2024 :-)
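One possible workaround for the line-numbering problem, if the input is split outside the grammar, is to track where each chunk starts in the file and translate chunk-relative error offsets back into file positions. The sketch below is in Python and is not something Pegged provides; `advance` is a hypothetical helper, and it assumes each chunk is handed to the parser exactly as it appears in the file (no stripped whitespace).

```python
def advance(pos, text):
    """Return the 1-based (line, column) reached after consuming `text`.

    `pos` is the 1-based (line, column) at which `text` starts.
    """
    line, col = pos
    newlines = text.count("\n")
    if newlines:
        # Column restarts after the last newline in the consumed text.
        return line + newlines, len(text) - text.rfind("\n")
    return line, col + len(text)
```

For a parse error at offset `k` inside a chunk that starts at `chunk_start`, the file position would be `advance(chunk_start, chunk[:k])`, and after each chunk the start of the next one is `advance(chunk_start, chunk)`.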


3 participants