Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch issue 70 #121

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

czaefferer
Copy link

This PR contains:

  • bugfix
  • feature
  • refactor
  • tests
  • documentation
  • metadata

Breaking Changes?

  • yes
  • no

Unescaped quotes that are not preceded or succeeded by a separator or Linebreak are handled like any other character.

Please Describe Your Changes

fixes issue #70

@czaefferer
Copy link
Author

Any news or thoughts about this?

@shellscape
Copy link
Collaborator

My apologies @czaefferer, I have not been able to find the time to review this properly. Hopefully soon.

@gnanaprakash

This comment has been minimized.

joe,sam,ja"n
joe,sam,"ja"n"
joe,"sa
"m",jan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test here should actually fail. each line in a csv file indicates a row, and a newline indicates the start of a new row. the result in the snapshot isn't accurate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you're right there, this is from the RFC 4180 (https://tools.ietf.org/html/rfc4180):

field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA

As I understand this, a newline within a quoted field does not indicate the start of a new line (I wish it would, that would have saved me so much trouble in the past).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm you may be right there. I'm giving things a big refresh in order to try and alleviate the pain of working through tokenizing things like this, remove some complication. hang tight and I'll follow up when I have that work done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of curious note is this bit:

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

That would seem to indicate that joe,sam,"ja"n" is invalid, would it not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be invalid. However, if a single double-quote is found which isn't followed by a separator or a newline, it can be identified as invalid, and could then be treated like any other character. This does indeed take the standard a bit more lax, but then again, so do a lot of other options like configurable separators, quote- and escape-characters. To me it would be totally fine if this behaviour would need to be enabled by configuration. But considering how often I see this intentionally used in real world files, I think this is a far better solution than to invalidate the whole record (which, given the alternative, would not be acceptable in my use-cases).

Right now however, the parser does neither invalidate rows nor treat misplaced double-quotes as regular text, usually it merges two rows, but I've also seen it merge 8K out of 40K rows due to a single misplaced double-quote.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it certainly doesn't follow the spec at present. I'm working on a proper tokenizer for this that'll make implementing parts of the spec far easier. I work with the PostCSS folks quite a bit and maintain the postcss-less, postcss-values-parser modules which both follow the PostCSS tokenizer model and it's amazing how much easier it is to work with versus an ad-hoc parser like what's in the module at present

@shellscape
Copy link
Collaborator

FWIW I confirmed that Excel does export as "thing with ""quote" for a cell that has a double quote.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants