Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch issue 70 #121

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 14 additions & 3 deletions index.js
Original file line number Diff line number Diff line change
Expand Up @@ -153,7 +153,7 @@ class CsvParser extends Transform {
}

for (let i = start; i < end; i++) {
const isStartingQuote = !isQuoted && buf[i] === this.quote
const isStartingQuote = !isQuoted && buf[i] === this.quote && (i === start || buf[i - 1] === comma)
const isEndingQuote = isQuoted && buf[i] === this.quote && i + 1 <= end && buf[i + 1] === comma
const isEscape = isQuoted && buf[i] === this.escape && i + 1 < end && buf[i + 1] === this.quote

Expand Down Expand Up @@ -222,8 +222,10 @@ class CsvParser extends Transform {
const bufLen = buf.length

for (let i = start; i < bufLen; i++) {
const prevChr = i > 0 ? buf[i - 1] : null
const chr = buf[i]
const nextChr = i + 1 < bufLen ? buf[i + 1] : null
const nextNextChr = i + 2 < bufLen ? buf[i + 2] : null

this._currentRowBytes++
if (this._currentRowBytes > this.maxRowBytes) {
Expand All @@ -237,10 +239,19 @@ class CsvParser extends Transform {
if (this._escaped) {
this._escaped = false
// non-escaped quote (quoting the cell)
continue
} else {
this._quoted = !this._quoted
// not in escape- or quote-mode, currently at start or previous char was separator or linebreak -> enter quote mode
if (!this._quoted && (prevChr === null || prevChr === this.separator || prevChr === nl || prevChr === this.newline)) {
this._quoted = true
continue
}
// in quote-mode but not escape-mode, next char is separator or linebreak -> leave quote mode
if (this._quoted && (nextChr === this.separator || (this.customNewline ? nextChr === this.newline : nextChr === nl || (nextChr === cr && nextNextChr === nl)))) {
this._quoted = false
continue
}
}
continue
}

if (!this._quoted) {
Expand Down
11 changes: 11 additions & 0 deletions test/data/unescaped_quotes.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
a,b,c
jo"e,sam,jan
"jo"e",sam,jan
joe,sa"m,jan
joe,"sa"m",jan
joe,sam,ja"n
joe,sam,"ja"n"
joe,"sa
"m",jan
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test here should actually fail. each line in a csv file indicates a row, and a newline indicates the start of a new row. the result in the snapshot isn't accurate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you're right there, this is from the RFC 4180 (https://tools.ietf.org/html/rfc4180):

field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA

As I understand this, a newline within a quoted field does not indicate the start of a new line (I wish it would, that would have saved me so much trouble in the past).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm you may be right there. I'm giving things a big refresh in order to try and alleviate the pain of working through tokenizing things like this, remove some complication. hang tight and I'll follow up when I have that work done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of curious note is this bit:

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

That would seem to indicate that joe,sam,"ja"n" is invalid, would it not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be invalid. However, if a single double-quote is found which isn't followed by a separator or a newline, it can be identified as invalid, and could then be treated like any other character. This does indeed take the standard a bit more lax, but then again, so do a lot of other options like configurable separators, quote- and escape-characters. To me it would be totally fine if this behaviour would need to be enabled by configuration. But considering how often I see this intentionally used in real world files, I think this is a far better solution than to invalidate the whole record (which, given the alternative, would not be acceptable in my use-cases).

Right now however, the parser does neither invalidate rows nor treat misplaced double-quotes as regular text, usually it merges two rows, but I've also seen it merge 8K out of 40K rows due to a single misplaced double-quote.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it certainly doesn't follow the spec at present. I'm working on a proper tokenizer for this that'll make implementing parts of the spec far easier. I work with the PostCSS folks quite a bit and maintain the postcss-less, postcss-values-parser modules which both follow the PostCSS tokenizer model and it's amazing how much easier it is to work with versus an ad-hoc parser like what's in the module at present

joe,crlf,"jan"
joe,sam,"ja"n"
4 changes: 2 additions & 2 deletions test/maxRowBytes.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,9 @@ const { collect } = require('./helpers/helper')
test.cb('optional row size limit', (t) => {
const verify = (err, lines) => {
t.is(err.message, 'Row exceeds the maximum size', 'strict row size')
t.is(lines.length, 4576, '4576 rows before error')
t.is(lines.length, 13, '13 rows before error')
t.end()
}

collect('max_row_size.csv', { maxRowBytes: 200 }, verify)
collect('max_row_size.csv', { maxRowBytes: 170 }, verify)
})
77 changes: 76 additions & 1 deletion test/snapshots/test.js.md
Original file line number Diff line number Diff line change
Expand Up @@ -504,4 +504,79 @@ Generated by [AVA](https://ava.li).

> Snapshot 1

[]
[

## cell with unescaped quotes

> first row

Row {
a: 'jo"e',
b: 'sam',
c: 'jan',
}

> second row

Row {
a: 'jo"e',
b: 'sam',
c: 'jan',
}

> third row

Row {
a: 'joe',
b: 'sa"m',
c: 'jan',
}

> fourth row

Row {
a: 'joe',
b: 'sa"m',
c: 'jan',
}

> fifth row

Row {
a: 'joe',
b: 'sam',
c: 'ja"n',
}

> sixth row

Row {
a: 'joe',
b: 'sam',
c: 'ja"n',
}

> seventh row

Row {
a: 'joe',
b: `sa␊
"m`,
c: 'jan',
}

> eighth row

Row {
a: 'joe',
b: 'crlf',
c: 'jan',
}

> ninth row

Row {
a: 'joe',
b: 'sam',
c: 'ja"n',
}
Binary file modified test/snapshots/test.js.snap
Binary file not shown.
20 changes: 20 additions & 0 deletions test/test.js
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,26 @@ test.cb('cell with newline', (t) => {
})
})

test.cb('cell with unescaped quotes', (t) => {
const verify = (err, lines) => {
// console.log(lines);
t.false(err, 'no err')
t.snapshot(lines[0], 'first row')
t.snapshot(lines[1], 'second row')
t.snapshot(lines[2], 'third row')
t.snapshot(lines[3], 'fourth row')
t.snapshot(lines[4], 'fifth row')
t.snapshot(lines[5], 'sixth row')
t.snapshot(lines[6], 'seventh row')
t.snapshot(lines[7], 'eighth row')
t.snapshot(lines[8], 'ninth row')
t.is(lines.length, 9, '9 rows')
t.end()
}

collect('unescaped_quotes.csv', verify)
})

test.cb('cell with escaped quote in quotes', (t) => {
const headers = bops.from('a\n')
const cell = bops.from('"ha ""ha"" ha"\n')
Expand Down