-
-
Notifications
You must be signed in to change notification settings - Fork 797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ByteSourceJsonBootstrapper uses CharBufferReader for byte[] inputs #1079
Conversation
if (n < 0L) { | ||
throw new IllegalArgumentException("number of characters to skip cannot be negative"); | ||
} | ||
int skipped = Math.min((int) n, this.charBuffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't cast long to int - there is an edge case where n is too big and the cast value could be nonsense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change this to:
int skipped = Math.min((int) n, this.charBuffer.remaining()); | |
int skipped = Math.min(Math.toIntExact(n), this.charBuffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion works for values of n greater than max-int. Your suggestion will raise an exception. If the API defines n to be long, I think we should respect the API and handle large longs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies - I had planned to suggest a change but then didn't. The suggestion is
Math.min((int) Math.min(n, Integer.MAX_VALUE), this.charBuffer.remaining());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a funny set of APIs, CharBuffer.remaining
returns an int so that suggests you can't have more the max-int chars. So maybe, your suggestion works fine. The skip API accepts longs but it appears that it not something we should worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the CharBuffer
is wrapping a char[]
so is practically limited to Integer.MAX_VALUE - 8
|
||
@Override | ||
public void mark(int readAheadLimit) { | ||
this.charBuffer.mark(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ignores the input which seems wrong - if we can't work out the right impl, then you should change the markSupported
to return false
public void close() { | ||
this.charBuffer.position(this.charBuffer.limit()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a new line at end of file?
assertArrayEquals("\0\0".toCharArray(), chars); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
files should end with new lines
import java.io.Reader; | ||
import java.nio.CharBuffer; | ||
|
||
public class CharBufferReader extends Reader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't do anything yet but since CharBufferReader is only used in one place - I think it would be a good idea to move it to the same package as ByteSourceJsonBootstrapper and to make it package private and final to discourage its use by other users. We should at least add javadoc to the class saying that this class is only for internal use of jackson-core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, wanted to figure out where all this might be needed then encapsulate.
private final CharBuffer charBuffer; | ||
|
||
public CharBufferReader(CharBuffer buffer) { | ||
this.charBuffer = buffer.duplicate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to duplicate the char buffer? If we strongly discourage the use of this class and we only have 1 usage of this class - that usage will not modify the input buffer. If we know the input buffer won't be modified, then theoretically, we don't need to duplicate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will refactor this when encapsulating, was being defensive.
Address PR comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @pjfanning ! I have updated to address some of your comments.
int size = _inputEnd - _inputPtr; | ||
if (size >= 0 && size <= 8192) { | ||
// [jackson-core#488] Avoid overhead of heap ByteBuffer allocated by InputStreamReader | ||
// when processing small inputs up to 8KiB. | ||
Charset charset = Charset.forName(enc.getJavaName()); | ||
return new CharBufferReader(charset.decode(ByteBuffer.wrap(_inputBuffer, _inputPtr, _inputEnd))); | ||
} | ||
in = new ByteArrayInputStream(_inputBuffer, _inputPtr, _inputEnd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to thoughts here on threshold.
This new path decoding from ByteBuffer
to CharBuffer
will allocate a char[]
based on the encoding Charset
, so I chose 8192
as that would be the break even point for InputStreamReader
's StreamDecoder
's 8192
byte heap ByteBuffer
.
// [jackson-core#488] Avoid overhead of heap ByteBuffer allocated by InputStreamReader | ||
// when processing small inputs up to 8KiB. | ||
Charset charset = Charset.forName(enc.getJavaName()); | ||
return new CharBufferReader(charset.decode(ByteBuffer.wrap(_inputBuffer, _inputPtr, _inputEnd))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how this compares with constructing a string:
return new StringReader(new String(_inputBuffer, _inputPtr, _inputEnd - _inputPtr, charset));
The string methods tend to perform very well due to extensive optimization in the jdk, however when we use a reader, the char[]
used by CharBufferReader
may perform better since it's not converting between bytes and chars on read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, will benchmark and test this out. For Latin-1/ASCII compressed strings, we'll eat the cost of a byte array copy on string creation and decode should be no-op, while non-Latin-1 case may be a bit more expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JsonEncoding class only supports Unicode charsets. UTF8, variants of UTF16, variants of UTF32. 7 bit ASCII is a subset of UTF8. Latin-1 is not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FasterXML/jackson-benchmarks#9 (comment) shows new String
to StringReader
is better, so I created #1081 as an alternative to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JsonEncoding class only supports Unicode charsets. UTF8, variants of UTF16, variants of UTF32. 7 bit ASCII is a sunset of UTF8. Latin-1 is not supported.
I should have been more precise. The Latin-1
was intended to reference the single byte encoding compact strings JEP 254, but I should have said non-ASCII UTF-8
will be a bit more expensive in terms of memory overhead.
I'm going to close this out in favor of the simpler, and more efficient #1081 |
Initial draft to address #593 (and #995 (comment) ) when deserializing from a
byte[]
as theInputStreamReader
code path triggers an 8KiBHeapByteBuffer
allocation forStreamDecoder
regardless of input byte array length. This allocation significantly penalizes smallerbyte[]
sources.I would appreciate thoughts on the approach here implementing a
Reader
via aCharBuffer
that is decoded via wrapping the sourcebyte[]
. This should avoid unnecessary 8KiB heap byte buffer allocation and leverage OpenJDK's continued charset decoding improvements (e.g. https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html ).Initial benchmarks from FasterXML/jackson-benchmarks#9 show
CharBufferReader
providing performance equivalent toByteArrayInputStream
source in worst case, and anywhere from ~2x to ~10x speedup in best case.@carterkozak for thoughts as well.