Issue 2314663002: Rework scanner-character-streams.

vogelheim

Description was changed from ========== Rework scanner-character-streams. - Smaller, more consistent streams API (Advance, Back, ...

4 years, 3 months ago (2016-09-06 09:02:01 UTC) #1

Description was changed from

==========
Rework scanner-character-streams.

- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.

Observe:
- Performance:
  - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
    body of only a few instructions. I expect most calls to end up there.
  - There used to be performance problems w/ bookmarking, particularly
    with copying too much data on SetBookmark w/ UTF-8 streaming streams.
    All those copies are gone.
  - The old streaming streams implementation used to copy data even for
    2-byte input. It no longer does.
  - The only remaining 'slow' method is the Seek(.) slow case for utf-8
    streaming streams. I don't expect this to be called a lot; and even if,
    I expect it to be offset by the gains in the (vastly more frequent)
    calls to the other methods or the 'fast path'.
  - If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
  - I want to remove the 'old' API in a follow-up CL, which should mostly
    delete code, or replace it 1:1.
  - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
    for streaming streams.
  - The "bookmark" is now always implemented (and mostly very fast), so we
    should be able to use it for more things.


BUG=v8:4947
==========

to

==========
Rework scanner-character-streams.

- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.

Observe:
- Performance:
  - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
    body of only a few instructions. I expect most calls to end up there.
  - There used to be performance problems w/ bookmarking, particularly
    with copying too much data on SetBookmark w/ UTF-8 streaming streams.
    All those copies are gone.
  - The old streaming streams implementation used to copy data even for
    2-byte input. It no longer does.
  - The only remaining 'slow' method is the Seek(.) slow case for utf-8
    streaming streams. I don't expect this to be called a lot; and even if,
    I expect it to be offset by the gains in the (vastly more frequent)
    calls to the other methods or the 'fast path'.
  - If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
  - I want to remove the 'old' API in a follow-up CL, which should mostly
    delete code, or replace it 1:1.
  - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
    for streaming streams.
  - The "bookmark" is now always implemented (and mostly very fast), so we
    should be able to use it for more things.
- Testing & correctness:
  - The unit tests now cover all stream implementations,
    and are pretty good and triggering all the edge cases.
  - Vastly more DCHECKs of the invariants.


BUG=v8:4947
==========

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 09:02:05 UTC) #2

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/1

4 years, 3 months ago (2016-09-06 09:02:10 UTC) #3

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 09:06:01 UTC) #4

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_linux_mips64el_compile_rel on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_mips64el_compile_rel/builds/20806) v8_linux_mipsel_compile_rel on ...

4 years, 3 months ago (2016-09-06 09:06:02 UTC) #5

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 11:10:10 UTC) #6

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/20001

4 years, 3 months ago (2016-09-06 11:10:14 UTC) #7

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 11:13:44 UTC) #8

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_presubmit on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_presubmit/builds/23383)

4 years, 3 months ago (2016-09-06 11:13:45 UTC) #9

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 11:21:32 UTC) #10

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/40001

4 years, 3 months ago (2016-09-06 11:21:35 UTC) #11

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 11:24:24 UTC) #12

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_win_nosnap_shared_rel_ng on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_ng/builds/8231) v8_win_rel_ng on ...

4 years, 3 months ago (2016-09-06 11:24:25 UTC) #13

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 11:41:06 UTC) #14

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/60001

4 years, 3 months ago (2016-09-06 11:41:09 UTC) #15

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 11:43:39 UTC) #16

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_linux_mipsel_compile_rel on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_mipsel_compile_rel/builds/20813) v8_win64_rel_ng on ...

4 years, 3 months ago (2016-09-06 11:43:40 UTC) #17

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 11:49:17 UTC) #18

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/80001

4 years, 3 months ago (2016-09-06 11:49:19 UTC) #19

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 11:51:49 UTC) #20

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_win64_rel_ng on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win64_rel_ng/builds/13729) v8_win_rel_ng on ...

4 years, 3 months ago (2016-09-06 11:51:50 UTC) #21

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 12:27:39 UTC) #22

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/100001

4 years, 3 months ago (2016-09-06 12:27:44 UTC) #23

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 12:30:54 UTC) #24

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_linux64_gyp_rel_ng on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_gyp_rel_ng/builds/3671) v8_linux_gcc_compile_rel on ...

4 years, 3 months ago (2016-09-06 12:30:54 UTC) #25

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 12:54:57 UTC) #26

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/120001

4 years, 3 months ago (2016-09-06 12:55:06 UTC) #27

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-06 13:03:49 UTC) #28

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/160001

4 years, 3 months ago (2016-09-06 13:03:56 UTC) #29

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-06 13:33:27 UTC) #36

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 3 months ago (2016-09-06 13:33:28 UTC) #37

vogelheim

vogelheim@chromium.org changed reviewers: + marja@chromium.org, nikolaos@chromium.org

4 years, 3 months ago (2016-09-06 13:42:26 UTC) #39

vogelheim

marja, nikolaos: Please have a look, and please let me know what you think.

4 years, 3 months ago (2016-09-06 13:42:26 UTC) #40

marja

"stuff I don't understand" overflowed, maybe continue offline? https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-character-streams.cc File src/parsing/scanner-character-streams.cc (right): https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-character-streams.cc#newcode202 src/parsing/scanner-character-streams.cc:202: size_t ...

4 years, 3 months ago (2016-09-07 09:17:57 UTC) #41

vogelheim

fyi, added comments to document Utf8ExternalStreamingStream's internal state, as a quick fix to Marja's comments. ...

4 years, 3 months ago (2016-09-07 09:43:52 UTC) #42

marja

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-character-streams.cc File src/parsing/scanner-character-streams.cc (right): https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-character-streams.cc#newcode234 src/parsing/scanner-character-streams.cc:234: unibrow::Utf8::Utf8IncrementalBuffer buffer; Clarify that buffer contains the characters from ...

4 years, 3 months ago (2016-09-07 11:55:11 UTC) #43

vogelheim

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-character-streams.cc File src/parsing/scanner-character-streams.cc (right): https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-character-streams.cc#newcode202 src/parsing/scanner-character-streams.cc:202: size_t byte_pos; On 2016/09/07 09:17:57, marja wrote: > This ...

4 years, 3 months ago (2016-09-07 12:32:56 UTC) #44

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
File src/parsing/scanner-character-streams.cc (right):

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:202: size_t byte_pos;
On 2016/09/07 09:17:57, marja wrote:
> This could use a comment: is byte_pos the position inside the chunk or from
the
> beginning? It's not clear at that point for somebody reading the code. Without
> further information, I'd assume it's the position inside the chunk (so, we'd
use
> it for indexing into a chunk), but then there is char_pos too, and that's
> probably a global char pos, so now I'm not sure about byte_pos either. Also
> maybe rename to global_byte_pos / byte_pos_inside_chunk depending on which it
is
> :)
> 
> ... and maybe my first interpretation of ChunkPos was incorrect too; I assumed
> it's a pair "chunk + position inside it" but maybe it's not?

Comments provided about the struct.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:212:
unibrow::Utf8::Utf8IncrementalBuffer buffer;
On 2016/09/07 09:17:57, marja wrote:
> This could use a comment, what is buffer? Why does every chunk need one?

Done. Renamed w/ clearer name (there are too many things called "buffer"..) +
comment about the struct & intended use.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:218: size_t
Utf8ExternalStreamingStream::FillBuffer(size_t position) {
On 2016/09/07 09:17:57, marja wrote:
> I'd also specify in the API which positions the params are; this is
> char_position, right? Would it make sense to call it char_position?

APIs are always about characters; the API has no notion of bytes or encodings.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:228: size_t iter = 0;
On 2016/09/07 09:17:57, marja wrote:
> Nit: "it" would be a more commonly used name.

Done.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:265:
unibrow::Utf8::Utf8IncrementalBuffer buffer =
On 2016/09/07 09:17:57, marja wrote:
> Umm, what? I don't understand what this if does.

Renamed, to hopefully make it more understandable.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:293: if (current_.char_pos - position
== 0) {
On 2016/09/07 09:17:57, marja wrote:
> Why not if (current_.char_pos == position) ?

Done.

https://codereview.chromium.org/2314663002/diff/160001/src/unicode.cc
File src/unicode.cc (right):

https://codereview.chromium.org/2314663002/diff/160001/src/unicode.cc#newcode313
src/unicode.cc:313: // All other cases:
On 2016/09/07 09:17:57, marja wrote:
> Why does it make sense to do NonAsciiSequenceLength(next) if buffer is not 0,
> that is, next is a continuation character?
> 
> I'd reorder this code like this:
> 
> fast path (which you have now)
> 
> if (buffer == 0) {
>   this is the start of a multi-byte character
>   NonAsciiSequenceLength(next)
>   etc
> } else {
>   expect a continuation character, update the buffer based on that (otherwise
> kBadChar)
> }

Done. Your version makes much more sense.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner.h
File src/parsing/scanner.h (right):

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner.h#...
src/parsing/scanner.h:35: class Utf16CharacterStream {
On 2016/09/07 11:55:11, marja wrote:
> General comment from offline discussion: going back one character fills the
> buffer again!
> 
> Back();
> Back();
> 
> is inefficient

Added Back2 / PushBack2 to fix this. Actual use of these would be in a future
CL.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner.h#...
src/parsing/scanner.h:110: // This shouldn't ever be called if new_pos is inside
the current buffer.
On 2016/09/07 11:55:11, marja wrote:
> Comment from nikolaos@: this is unclear.

Done.

nickie

I'm publishing my comments so far, because it turns out that they are in two ...

4 years, 3 months ago (2016-09-07 13:28:30 UTC) #46

I'm publishing my comments so far, because it turns out that they are in two
different patch sets (as this CL evolves) and none of them is the current one...
 :-)

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h
File src/parsing/scanner.h (right):

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:46: } else if (ReadBlock()) {
I suppose you did not like how it was before:
if (V8_LIKELY(...) || ReadBlock())

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:58: inline void Back() {
A comment explaining what this should do would be welcome.
It's straightforward in the most likely case, not in the other one.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:62: ReadBlockAt(pos() - 1);
We discussed this offline.  This can make a series of consecutive Back() pretty
expensive.  It could be "void Back(int num = 1)" and use "pos() - num" here.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:67: return buffer_pos_ + (buffer_cursor_ - buffer_start_);
This calculation is quite expensive and the pos() method is called very often
from the parser.  I suppose, the scanner delegates it here.

Maybe it would make sense to measure this against having a separate current_pos_
field (like the pos_ in old line 59) and making sure that this is consistent.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:86: Back();
Why not simply this?
DCHECK_EQ(code_unit, static_cast<uc32>(*(buffer_cursor_++)));
After Back() it is guaranteed that buffer_cursor_ < buffer_end_, isn't it?

In a quite unfrequent case, this Advance-Back will result in an extra buffer
loaded, right?

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:89: bool SetBookmark() {
I don't understand why the bookmark should be stored in the stream.  A simpler
interface would be to use "bookmark = pos()" directly to obtain a bookmark, and
something like "Seek(bookmark)" to reset to that.  If we still want to have
streams that do not support bookmarking (?) then we could have a SetBookmark
method returning either pos() or NO_BOOKMARK (see below).

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:94: DCHECK(bookmark_ != (size_t)-1);
How about something like this?  (with a better name, I guess)
static const size_t NO_BOOKMARK = numeric_limits<size_t>::max();

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:106: bookmark_((size_t)-1) {}
Again, NO_BOOKMARK here.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:114: buffer_cursor_ = buffer_start_;
I don't understand this.  Maybe related to my next comment.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:119: // position pos() of the input. Returns true if data
is available; false if
How can this not be true?
pos() is calculated in such a way to enforce that.
I suppose this comment should change and describe more clearly what ReadBlock is
expected to.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
File src/parsing/scanner-character-streams.cc (right):

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:22: // source (ReadBlock can be called
with pos_ pointing to any position,
What is pos_?  Is it buffer_pos_?

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:33: virtual size_t FillBuffer(size_t
position) = 0;
Add a comment explaining what FillBuffer is expected to do.
As I understand it, it reads up to KBufferSize elements (2-byte characters) from
the source, starting from 'position', and stores them in the buffer_, starting
from the beginning and overwriting its current contents.  It returns the number
of elements read.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:44: DCHECK_EQ(buffer_start_, buffer_); 
// Please nobody mess w/ buffer_start_.
Could this comment go to the top of the class, as an invariant that should be
respected?  Or if it is just meant to be respected when ReadBlock is called, add
the comment just before ReadBlock.

For some reason, I suspect that this is an invariant of Utf16CharacterStream
(that buffer_start_ is not expected to change), not this class.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:78: buffer_pos_ = start_position;
Although we keep end_position (in length_), we don't keep start_position.  It is
later possible that we "SeekAt" positions in the string that are before
start_position.  Do we care about this?

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:89: }
Essentially:
size_t length = std::min(length_ - from_pos, kBufferSize);
Or Min instead of std::min, I'm not sure why we have that.
PS: That's how you have it in
ExternalOneByteStringUtf16CharacterStream::FillBuffer.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:96: //
ExternalTwoByteStringUtf16CharacterStream.
Add a comment?  How is this different from the previous one?  (Obviously, no
buffer is needed.)

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:133: // UTF16 buffer to read characters
from an external latin1 string.
This looks very much like GenericStringUtf16CharacterStream.
Can they be merged, or is there a reason (efficiency?) why they must be
separate?  The only difference I saw was in FillBuffer (the for loop vs. the
call to String::WriteToFlat<uc16>.  I'd expect a similar call to work for uint8
too.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:139: int end_position);
These parameters are of type size_t in other classes, e.g.,
GenericStringUtf16CharacterStream.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:179: // Utf8ExternalStreamingStream -
chunked streaming of Utf-8 data.
nit: Make the format of the comment consistent.

Possibly for the future, as we discussed this offline.  In this class, and the
first of the two that follow, we could use the chunks themselves as buffers. 
Instead, we have an extra buffer in BufferedUtf16CharacterStream.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:195: chunks_.clear();
I think this line is redundant, as this method is private and only called from
the destructor.  I guess the whole method is redundant.  The for loop could go
directly in the destructor.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:241: size_t
Utf8ExternalStreamingStream::FillBuffer(size_t position) {
For the future: I think that this translation between the logical position
(i.e., the parameter, #of UTF8 characters) and the physical position (cpos)
should not be necessary.  The parameter should be something more abstract that,
in the case of the UTF8 stream, it would provide directly the physical position.
 The loop in lines 253-260 is fairly expensive.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:249: Chunk& current = chunks_[chunk];
Consider renaming this "current" to something else, not to confuse this with
"current_" (e.g., look at line 265).

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:337: };
Possibly for the future, as we discussed this offline.  This (and the following)
is very similar to (some of) the contents of Utf8ExternalStreamingStream.  Can
it be merged?  The common denominator is the "chunking" behavior.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:395: // OneByteExternalStreamingStream
Add a comment?

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:423: // TwoByteExternalStreamingStream
Add a comment?

nickie

My last comments for this round. In general, I think it looks good enough for ...

4 years, 3 months ago (2016-09-07 13:48:44 UTC) #47

marja

Getting better! I was focusing on the utf8 streaming case, where the complexity lies. I ...

4 years, 3 months ago (2016-09-08 09:46:23 UTC) #48

vogelheim

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h File src/parsing/scanner.h (right): https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#newcode46 src/parsing/scanner.h:46: } else if (ReadBlock()) { On 2016/09/07 13:28:28, nickie ...

4 years, 3 months ago (2016-09-08 13:09:04 UTC) #49

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h
File src/parsing/scanner.h (right):

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:46: } else if (ReadBlock()) {
On 2016/09/07 13:28:28, nickie wrote:
> I suppose you did not like how it was before:
> if (V8_LIKELY(...) || ReadBlock())

I liked it, but: V8_LIKELY supplies information about the if-else to the
compiler. I'm not sure the compiler can transfer this explicit likeli-hood
information on a sub-expression to the branches it might generate from the
surrounding expression.

Admittedly, I intuited that, and I can't find documentation that is explicit
about this. I've never seen a V8_(UN)LIKELY or __builtin_expect that did not
cover the entire condition, though.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:58: inline void Back() {
On 2016/09/07 13:28:28, nickie wrote:
> A comment explaining what this should do would be welcome.
> It's straightforward in the most likely case, not in the other one.

Done.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:60: buffer_cursor_--;
On 2016/09/07 09:17:57, marja wrote:
> Nit: --buffer_cursor_;

Why? [Here and elsewhere.]

------

I find the prefix-version odd to read.

The style guide has no opinion: V8 style references Chromium style which
references Google style, which says: "For simple scalar (non-object) values
there is no reason to prefer one form and we allow either."

[Ref:
https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement
]

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:62: ReadBlockAt(pos() - 1);
On 2016/09/07 13:28:28, nickie wrote:
> We discussed this offline.  This can make a series of consecutive Back()
pretty
> expensive.  It could be "void Back(int num = 1)" and use "pos() - num" here.

That's almost the same as Seek. :)  The Scanner only uses num = 1..2, so I added
a Back2().

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:67: return buffer_pos_ + (buffer_cursor_ - buffer_start_);
On 2016/09/07 13:28:28, nickie wrote:
> This calculation is quite expensive and the pos() method is called very often
> from the parser.  I suppose, the scanner delegates it here.
> 
> Maybe it would make sense to measure this against having a separate
current_pos_
> field (like the pos_ in old line 59) and making sure that this is consistent.

Hmm. I think I should try it both ways.


(My thinking was: pos() is fully inline-able, and should be only 2 instructions
on x64, on data that is already in cache (and oftentimes also in a register) at
that point. The alternative, an additional variable + 1 instruction to maintain
might not be faster.)

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:89: bool SetBookmark() {
On 2016/09/07 13:28:28, nickie wrote:
> I don't understand why the bookmark should be stored in the stream.  A simpler
> interface would be to use "bookmark = pos()" directly to obtain a bookmark,
and
> something like "Seek(bookmark)" to reset to that.  If we still want to have
> streams that do not support bookmarking (?) then we could have a SetBookmark
> method returning either pos() or NO_BOOKMARK (see below).

You're exactly right: The bookmark doesn't belong in the stream (any more). But
I was desperately trying to keep the CL small-ish, and hence wanted to avoid
touching the clients of the stream just yet, and  to remove the "// Legacy API"
in a separate, upcoming CL. That would exactly what you suggest.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:94: DCHECK(bookmark_ != (size_t)-1);
On 2016/09/07 13:28:28, nickie wrote:
> How about something like this?  (with a better name, I guess)
> static const size_t NO_BOOKMARK = numeric_limits<size_t>::max();

Done.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:106: bookmark_((size_t)-1) {}
On 2016/09/07 13:28:28, nickie wrote:
> Again, NO_BOOKMARK here.

Done.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:114: buffer_cursor_ = buffer_start_;
On 2016/09/07 13:28:28, nickie wrote:
> I don't understand this.  Maybe related to my next comment.

See below.

This wants to set the position for ReadBlock() (which reads it from pos()), so
it must change all the vars that make up pos() to accomplish this.

https://codereview.chromium.org/2314663002/diff/160001/src/parsing/scanner.h#...
src/parsing/scanner.h:119: // position pos() of the input. Returns true if data
is available; false if
On 2016/09/07 13:28:29, nickie wrote:
> How can this not be true?
> pos() is calculated in such a way to enforce that.
> I suppose this comment should change and describe more clearly what ReadBlock
is
> expected to.

It's meant to explain the post-condition for ReadBlock, and in particular remind
the implementer that they can/must update both buffer_pos and buffer_start_.

I think the source of confusion is that pos() is really a derivative of three
values, so any method that updates any of those needs to take care to consider
them jointly.

I updated the comments to reflect this.

------

Maybe the best way of fixing this is to drop the ReadBlock() signature, and
always use something like FillBuffer(size_t). If so, I'll do it in a separate
step.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
File src/parsing/scanner-character-streams.cc (right):

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:22: // source (ReadBlock can be called
with pos_ pointing to any position,
On 2016/09/07 13:28:29, nickie wrote:
> What is pos_?  Is it buffer_pos_?

No, pos(). Updated comment.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:33: virtual size_t FillBuffer(size_t
position) = 0;
On 2016/09/07 13:28:29, nickie wrote:
> Add a comment explaining what FillBuffer is expected to do.
> As I understand it, it reads up to KBufferSize elements (2-byte characters)
from
> the source, starting from 'position', and stores them in the buffer_, starting
> from the beginning and overwriting its current contents.  It returns the
number
> of elements read.

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:44: DCHECK_EQ(buffer_start_, buffer_); 
// Please nobody mess w/ buffer_start_.
On 2016/09/07 13:28:29, nickie wrote:
> Could this comment go to the top of the class, as an invariant that should be
> respected?  Or if it is just meant to be respected when ReadBlock is called,
add
> the comment just before ReadBlock.
> 
> For some reason, I suspect that this is an invariant of Utf16CharacterStream
> (that buffer_start_ is not expected to change), not this class.

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:78: buffer_pos_ = start_position;
On 2016/09/07 13:28:29, nickie wrote:
> Although we keep end_position (in length_), we don't keep start_position.  It
is
> later possible that we "SeekAt" positions in the string that are before
> start_position.  Do we care about this?

You're right, that's an error condition we don't check for. I adopted this from
the previous implementation. Not sure whether we should add a check.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:89: }
On 2016/09/07 13:28:29, nickie wrote:
> Essentially:
> size_t length = std::min(length_ - from_pos, kBufferSize);
> Or Min instead of std::min, I'm not sure why we have that.
> PS: That's how you have it in
> ExternalOneByteStringUtf16CharacterStream::FillBuffer.

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:96: //
ExternalTwoByteStringUtf16CharacterStream.
On 2016/09/07 13:28:29, nickie wrote:
> Add a comment?  How is this different from the previous one?  (Obviously, no
> buffer is needed.)

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:133: // UTF16 buffer to read characters
from an external latin1 string.
On 2016/09/07 13:28:29, nickie wrote:
> This looks very much like GenericStringUtf16CharacterStream.
> Can they be merged, or is there a reason (efficiency?) why they must be
> separate?  The only difference I saw was in FillBuffer (the for loop vs. the
> call to String::WriteToFlat<uc16>.  I'd expect a similar call to work for
uint8
> too.

Yes, this would work. I need to check the performance impact. (These were
separated out long ago; I haven't double-checked the impact.)

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:139: int end_position);
On 2016/09/07 13:28:29, nickie wrote:
> These parameters are of type size_t in other classes, e.g.,
> GenericStringUtf16CharacterStream.

Good point. Made them all size_t.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:179: // Utf8ExternalStreamingStream -
chunked streaming of Utf-8 data.
On 2016/09/07 13:28:29, nickie wrote:
> nit: Make the format of the comment consistent.
> 
> Possibly for the future, as we discussed this offline.  In this class, and the
> first of the two that follow, we could use the chunks themselves as buffers. 
> Instead, we have an extra buffer in BufferedUtf16CharacterStream.

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:195: chunks_.clear();
On 2016/09/07 13:28:29, nickie wrote:
> I think this line is redundant, as this method is private and only called from
> the destructor.  I guess the whole method is redundant.  The for loop could go
> directly in the destructor.

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:234:
unibrow::Utf8::Utf8IncrementalBuffer buffer;
On 2016/09/07 11:55:11, marja wrote:
> Clarify that buffer contains the characters from the *previous* chunk
> 
> Also rename to partial_character_buffer or sth
> 

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:241: size_t
Utf8ExternalStreamingStream::FillBuffer(size_t position) {
On 2016/09/07 13:28:29, nickie wrote:
> For the future: I think that this translation between the logical position
> (i.e., the parameter, #of UTF8 characters) and the physical position (cpos)
> should not be necessary.  The parameter should be something more abstract
that,
> in the case of the UTF8 stream, it would provide directly the physical
position.
>  The loop in lines 253-260 is fairly expensive.

Hmm. I agree on the "expensive" part.

I think parameterizing the 'position' would be a very logical API, but since
that would force the client to dynamically deal with different types of
positions, I doubt it's easily (or efficiently) translatable into C++.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:249: Chunk& current = chunks_[chunk];
On 2016/09/07 13:28:29, nickie wrote:
> Consider renaming this "current" to something else, not to confuse this with
> "current_" (e.g., look at line 265).

Done. (current_chunk)

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:337: };
On 2016/09/07 13:28:29, nickie wrote:
> Possibly for the future, as we discussed this offline.  This (and the
following)
> is very similar to (some of) the contents of Utf8ExternalStreamingStream.  Can
> it be merged?  The common denominator is the "chunking" behavior.

I don't see an easy way to merge this... they require different info per chunk,
and search the chunks according to difference criteria. We can make this a
templatized structure, but I doubt it's worth the effort.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:395: // OneByteExternalStreamingStream
On 2016/09/07 13:28:29, nickie wrote:
> Add a comment?

Done.

https://codereview.chromium.org/2314663002/diff/180001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:423: // TwoByteExternalStreamingStream
On 2016/09/07 13:28:29, nickie wrote:
> Add a comment?

Done.

vogelheim

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-character-streams.cc File src/parsing/scanner-character-streams.cc (right): https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-character-streams.cc#newcode44 src/parsing/scanner-character-streams.cc:44: DCHECK_EQ(buffer_start_, buffer_); // Please nobody mess w/ buffer_start_. On ...

4 years, 3 months ago (2016-09-08 15:02:18 UTC) #51

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
File src/parsing/scanner-character-streams.cc (right):

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:44: DCHECK_EQ(buffer_start_, buffer_); 
// Please nobody mess w/ buffer_start_.
On 2016/09/08 09:46:23, marja wrote:
> Hmm, nikolaos@ commented on this offline I think... so hmm, who updates
> buffer_start_?

None of BufferedUtf16CharacterStream's sub-classes. Updated comments.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:180: return length;
On 2016/09/07 13:48:44, nickie wrote:
> This for loop should be equivalent to an instance of CopyCharsUnsigned, which
I
> presume is what some instance of String::WriteToFlat does.  Compare with the
> method in line 84.

Yes, nice catch. Done.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:246: size_t byte_pos;
On 2016/09/08 09:46:23, marja wrote:
> It's confusing that you above use byte_pos and char_pos to index into a chunk
> (if I understood this correctly) and here you use byte_pos and char_pos to
> indicate where the chunk begins in the byte / char stream. Can you rename one
> set of them? Maybe keep the above byte_pos / char_pos and rename this one to
> chunk_beginning_byte_pos or something along those lines but less horrible.
> 
> (Also, when reading the code, if there's someobject.byte_pos, and byte_pos
means
> something different depending on which one someobject is, it's confusing to
the
> reader.)

I don't get the confusing part.

Essentially, a position in the UTF-8 world comprises both a byte_pos and a
char_pos (and a Utf8IncrementalBuffer, in case we allow our position to point
into the middle of an UTF-8 char). And so if I store a chunk, I need to store
it's start position (position == byte pos + char pos + buffer); if I want to
store the current position (also position == byte_pos + char_pos + buffer).


To clarify this, I could add seperate struct Position for this purpose. I'm just
not convinced that would really make things more readable.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:256: // If the desired position is
*not* at the current_ position, then we need
On 2016/09/08 09:46:23, marja wrote:
> Could this function be structured differently so that it would be easier to
> understand?
> 
> Idea 1: fast path first, what if the desired position is at the current
> position?
> 
> Idea 2: separate the chunk finding to a helper function?
> 
> (I can't tell in advance whether these would work...)

I'm skeptical.

The fast path is what happens after the non-fast path has run. Putting it first
would be awkward.

A separate chunk finding function would also be awkward, since the callee has to
do some of the utf-8 processing (to find the current position); or maybe the
caller has to, but then the helper function would return a partial result, e.g.
maybe the last block, but it can't be sure (since it didn't skip ahead to the
desired position.)


Also, as mentioned somewhere below to a similar comment: All the helper would do
is to move straight-line code out of the way, without either adding an
abstraction; or increasing re-use; or simplifying things. It would merely add an
indirection which - at least in my view - doesn't really clarify things all that
much.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:263: Chunk& current = chunks_[chunk];
On 2016/09/08 09:46:23, marja wrote:
> Could you rename current to current_chunk, just to be extra clear (related to
> the above whine that I'm losing track what current.byte_pos etc are)?

Done.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:264:
unibrow::Utf8::Utf8IncrementalBuffer b = current.incomplete_first_char;
On 2016/09/08 09:46:23, marja wrote:
> I had to think for a while why we need to do this here; maybe add a comment?
> 
> E.g.,:
> 
> Read through the chunk character by character to find the right position. The
> chunk might begin with an incomplete character, whose beginning is already
> stored in incomplete_first_char.

Done.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:276: // of the last block. That should
only happen if we seek forward (which
On 2016/09/08 09:46:23, marja wrote:
> ... of the last chunk?

Done.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:285: }
On 2016/09/08 09:46:23, marja wrote:
> Can finding the position be moved to a helper function, or will that mess up
> with the recursion thing? Clearly this function does two things: first it
finds
> the right position, then it reads data from that position.

See above.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:296: Chunk& current_chunk =
chunks_[current_.chunk_no];
On 2016/09/08 09:46:23, marja wrote:
> Oops, now we have current_chunk..
> 
> It's confusing that we have:
> current_
> current
> current_chunk in this function!

Done... I guess? It's now current_chunk and current_

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:305: :
unibrow::Utf8::Utf8IncrementalBuffer(0);
On 2016/09/08 09:46:23, marja wrote:
> Umm, what? Why?

As the method comment says: "and decoding from the beginning of the using
incomplete_first_char) will yield the stream's char_pos-th unicode character."

That's what we're dooing here. If we decode from the beginning of a chunk, we'll
use the incomplete_first_char from that chunk.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:428: Chunk& chunk =
chunks_[FindChunk(chunks_, source_, position)];
On 2016/09/08 09:46:23, marja wrote:
> Here you already have a FindChunk helper, so I think you should have one for
the
> hard case too. :)

Nack. Here, I had two identical uses of it, so this case removes unnecessary
code duplication. In the other case, the code is substantially different and is
only used once, to this would merely add an indirection. I guess to some extent
it's a matter of taste, but indirection (without any other benefit) oftentimes
makes things harder to read, at least in my opinion.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:476: 
On 2016/09/08 09:46:23, marja wrote:
> What happens if we have data where the last character is incomplete?

For the last chunk, the rounding (below: number_of_chars = ... / 2;) would
exclude the last half-char. A subsequent call to ReadBlock would try to read the
2nd char of that half-char, and would determine in the line below that we're out
of data.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:486: bool lonley_byte =
(chunks_[chunk_no].byte_pos == (2 * position + 1));
On 2016/09/08 09:46:23, marja wrote:
> typo: lonley

Done.

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner.h
File src/parsing/scanner.h (right):

https://codereview.chromium.org/2314663002/diff/220001/src/parsing/scanner.h#...
src/parsing/scanner.h:53: buffer_cursor_++;
On 2016/09/08 09:46:23, marja wrote:
> Hmm, I commented on this before, right? foo++ / foo-- -> ++foo / --foo in
> several places.

Why?

[Brief discussion in previous round of answers; admittedly only after you sent
this one. :) ]

https://codereview.chromium.org/2314663002/diff/220001/src/unicode.h
File src/unicode.h (right):

https://codereview.chromium.org/2314663002/diff/220001/src/unicode.h#newcode163
src/unicode.h:163: Utf8IncrementalBuffer& buffer);
On 2016/09/07 13:48:44, nickie wrote:
> I think this reference should be turned into a pointer, according to the code
> style.  I'm not sure if it's enforced by the CQ.

Done.

nickie

LGTM, modulo some old comments (left for subsequent CLs or as food for thought) and ...

4 years, 3 months ago (2016-09-09 11:21:49 UTC) #52

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-14 10:54:21 UTC) #53

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/280001

4 years, 3 months ago (2016-09-14 10:54:29 UTC) #54

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-14 10:59:59 UTC) #55

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_linux64_gyp_rel_ng on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_gyp_rel_ng/builds/4214)

4 years, 3 months ago (2016-09-14 11:00:00 UTC) #56

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-14 11:25:56 UTC) #57

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/300001

4 years, 3 months ago (2016-09-14 11:26:06 UTC) #58

vogelheim

PTAL. @marja: The utf-8 part is largely re-written. Now with many, smaller functions + commentary. ...

4 years, 3 months ago (2016-09-14 11:28:19 UTC) #59

nickie

Still LGTM, but I did not review carefully the UTF8 part. https://codereview.chromium.org/2314663002/diff/260001/src/parsing/scanner-character-streams.cc File src/parsing/scanner-character-streams.cc (right): ...

4 years, 3 months ago (2016-09-14 11:48:18 UTC) #60

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-14 11:52:37 UTC) #61

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: v8_linux64_gyp_rel_ng on master.tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_gyp_rel_ng/builds/4221)

4 years, 3 months ago (2016-09-14 11:52:38 UTC) #62

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-14 17:08:33 UTC) #63

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/320001

4 years, 3 months ago (2016-09-14 17:08:41 UTC) #64

vogelheim

On 2016/09/14 11:48:18, nickie wrote: > There has to be a virtual destructor but it ...

4 years, 3 months ago (2016-09-14 17:09:16 UTC) #65

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-14 17:40:13 UTC) #66

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 3 months ago (2016-09-14 17:40:15 UTC) #67

marja

scanner-character-streams.cc + test-scanner-streams.cc LGTM % comments This new version is much clearer, it's much easier ...

4 years, 3 months ago (2016-09-15 08:25:17 UTC) #68

vogelheim

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-character-streams.cc File src/parsing/scanner-character-streams.cc (right): https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-character-streams.cc#newcode208 src/parsing/scanner-character-streams.cc:208: struct ChunkPosition { On 2016/09/15 08:25:16, marja wrote: > ...

4 years, 3 months ago (2016-09-15 11:29:26 UTC) #70

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
File src/parsing/scanner-character-streams.cc (right):

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:208: struct ChunkPosition {
On 2016/09/15 08:25:16, marja wrote:
> Now this doesn't contain the chunk number any more, so "position" would be a
> better name.

Hmm.. I called it StreamPosition.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:217: struct Position {
On 2016/09/15 08:25:16, marja wrote:
> And this could be called PositionAndChunk or sth. Because the above struct
*is*
> Position and this is just adding information to it.

I'd prefer to keep this one as Position, since
- 1, it's the shorter name and it gets used a lot more, and
- 2, I do think of this as the 'true' position.

If this were an STL-like design, the ..::iterator would be this Position. The
other "position" is really more a helper so I can conveniently define Chunk +
Position, and to highlight where they are similar.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:243:
ScriptCompiler::ExternalSourceStream* source_stream_;
On 2016/09/15 08:25:16, marja wrote:
> I like this new version, now the structs and their members are clear to me
after
> first reading through it!

Acknowledged.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:282: // The buffer_ is writeable, but
buffer_*_ members are const. So we get a
On 2016/09/15 08:25:16, marja wrote:
> Typo: writable

Done.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:283: // non-const pointer into buffer
that points to the same char as buffer_end_.
On 2016/09/15 08:25:16, marja wrote:
> Hmm, what, why? Do we need to write into it? No?

Yes, we need to write into the buffer. See lines #296 and #298-299. Also, it's
called FillBufferFromCurrentChunk, because we really do want to write into the
buffer.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:333: return;
On 2016/09/15 08:25:16, marja wrote:
> Hmm? Why don't we need to get more data to find "position"?

Done.

Yeah... that's a left over from an older version, where I tried to write it so
there was only one location that would fetch new chunks. Changed.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:341: }
On 2016/09/15 08:25:16, marja wrote:
> Style nit: I'd write this like:
> size_t chunk_no = chunks_.size() - 1;
> while (chunk_no > 0 && chunks_[chunk_no].start.chars > position) {
>   --chunk_no;
> }

Done.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:366:
unibrow::Utf8::Utf8IncrementalBuffer(0)}};
On 2016/09/15 08:25:17, marja wrote:
> This is great!

Acknowledged.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:387: SkipToPosition(position);
On 2016/09/15 08:25:16, marja wrote:
> One thing I still find surprising in this CL is that SkipToPosition is not
> guaranteed to actually find the position, and it also doesn't return a bool,
but
> the caller is required to check whether it really skipped to the right
position.
> But I can somehow see why you did it like that... so idk.
> 
> An alternative would be:
> bool position_found = SkipToPosition(position);
> while (!position_found) {
>   if (FetchChunk()) {
>     position_found = SkipToPosition(position);
>   } else {
>     // out of data
>     break;
>   }
> }
> 
> In this version, also the out-of-data condition (which confused me) is
clearer.

Done, sort of.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:425: const uint8_t* chunk;
On 2016/09/15 08:25:16, marja wrote:
> I'd call it "data" instead of "chunk" (because the struct is already called
> "Chunk").

Done.

https://codereview.chromium.org/2314663002/diff/320001/src/parsing/scanner-ch...
src/parsing/scanner-character-streams.cc:506: Chunk& chunk =
chunks_[FindChunk(chunks_, source_, position)];
On 2016/09/15 08:25:16, marja wrote:
> Why not const Chunk& ?

Done.

https://codereview.chromium.org/2314663002/diff/320001/test/cctest/parsing/te...
File test/cctest/parsing/test-scanner-streams.cc (right):

https://codereview.chromium.org/2314663002/diff/320001/test/cctest/parsing/te...
test/cctest/parsing/test-scanner-streams.cc:280: // th cases.
On 2016/09/15 08:25:17, marja wrote:
> typo

Done.

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-15 11:29:54 UTC) #71

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/360001

4 years, 3 months ago (2016-09-15 11:30:02 UTC) #72

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-15 12:00:54 UTC) #73

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 3 months ago (2016-09-15 12:00:55 UTC) #74

vogelheim

The CQ bit was checked by vogelheim@chromium.org to run a CQ dry run

4 years, 3 months ago (2016-09-16 08:14:31 UTC) #75

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/360001

4 years, 3 months ago (2016-09-16 08:14:34 UTC) #76

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 3 months ago (2016-09-16 08:17:05 UTC) #77

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 3 months ago (2016-09-16 08:17:06 UTC) #78

vogelheim

The patchset sent to the CQ was uploaded after l-g-t-m from nikolaos@chromium.org, marja@chromium.org Link to ...

4 years, 3 months ago (2016-09-16 08:27:14 UTC) #80

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2314663002/360001

4 years, 3 months ago (2016-09-16 08:27:25 UTC) #81

commit-bot: I haz the power

Description was changed from ========== Rework scanner-character-streams. - Smaller, more consistent streams API (Advance, Back, ...

4 years, 3 months ago (2016-09-16 08:29:47 UTC) #82

Message was sent while issue was closed.

Description was changed from

==========
Rework scanner-character-streams.

- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.

Observe:
- Performance:
  - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
    body of only a few instructions. I expect most calls to end up there.
  - There used to be performance problems w/ bookmarking, particularly
    with copying too much data on SetBookmark w/ UTF-8 streaming streams.
    All those copies are gone.
  - The old streaming streams implementation used to copy data even for
    2-byte input. It no longer does.
  - The only remaining 'slow' method is the Seek(.) slow case for utf-8
    streaming streams. I don't expect this to be called a lot; and even if,
    I expect it to be offset by the gains in the (vastly more frequent)
    calls to the other methods or the 'fast path'.
  - If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
  - I want to remove the 'old' API in a follow-up CL, which should mostly
    delete code, or replace it 1:1.
  - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
    for streaming streams.
  - The "bookmark" is now always implemented (and mostly very fast), so we
    should be able to use it for more things.
- Testing & correctness:
  - The unit tests now cover all stream implementations,
    and are pretty good and triggering all the edge cases.
  - Vastly more DCHECKs of the invariants.


BUG=v8:4947
==========

to

==========
Rework scanner-character-streams.

- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.

Observe:
- Performance:
  - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
    body of only a few instructions. I expect most calls to end up there.
  - There used to be performance problems w/ bookmarking, particularly
    with copying too much data on SetBookmark w/ UTF-8 streaming streams.
    All those copies are gone.
  - The old streaming streams implementation used to copy data even for
    2-byte input. It no longer does.
  - The only remaining 'slow' method is the Seek(.) slow case for utf-8
    streaming streams. I don't expect this to be called a lot; and even if,
    I expect it to be offset by the gains in the (vastly more frequent)
    calls to the other methods or the 'fast path'.
  - If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
  - I want to remove the 'old' API in a follow-up CL, which should mostly
    delete code, or replace it 1:1.
  - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
    for streaming streams.
  - The "bookmark" is now always implemented (and mostly very fast), so we
    should be able to use it for more things.
- Testing & correctness:
  - The unit tests now cover all stream implementations,
    and are pretty good and triggering all the edge cases.
  - Vastly more DCHECKs of the invariants.


BUG=v8:4947
==========

commit-bot: I haz the power

Description was changed from ========== Rework scanner-character-streams. - Smaller, more consistent streams API (Advance, Back, ...

4 years, 3 months ago (2016-09-16 08:30:02 UTC) #84

Message was sent while issue was closed.

Description was changed from

==========
Rework scanner-character-streams.

- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.

Observe:
- Performance:
  - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
    body of only a few instructions. I expect most calls to end up there.
  - There used to be performance problems w/ bookmarking, particularly
    with copying too much data on SetBookmark w/ UTF-8 streaming streams.
    All those copies are gone.
  - The old streaming streams implementation used to copy data even for
    2-byte input. It no longer does.
  - The only remaining 'slow' method is the Seek(.) slow case for utf-8
    streaming streams. I don't expect this to be called a lot; and even if,
    I expect it to be offset by the gains in the (vastly more frequent)
    calls to the other methods or the 'fast path'.
  - If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
  - I want to remove the 'old' API in a follow-up CL, which should mostly
    delete code, or replace it 1:1.
  - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
    for streaming streams.
  - The "bookmark" is now always implemented (and mostly very fast), so we
    should be able to use it for more things.
- Testing & correctness:
  - The unit tests now cover all stream implementations,
    and are pretty good and triggering all the edge cases.
  - Vastly more DCHECKs of the invariants.


BUG=v8:4947
==========

to

==========
Rework scanner-character-streams.

- Smaller, more consistent streams API (Advance, Back, pos, Seek)
- Remove implementations from the header, in favor of creation functions.

Observe:
- Performance:
  - All Utf16CharacterStream methods have an inlinable V8_LIKELY w/ a
    body of only a few instructions. I expect most calls to end up there.
  - There used to be performance problems w/ bookmarking, particularly
    with copying too much data on SetBookmark w/ UTF-8 streaming streams.
    All those copies are gone.
  - The old streaming streams implementation used to copy data even for
    2-byte input. It no longer does.
  - The only remaining 'slow' method is the Seek(.) slow case for utf-8
    streaming streams. I don't expect this to be called a lot; and even if,
    I expect it to be offset by the gains in the (vastly more frequent)
    calls to the other methods or the 'fast path'.
  - If it still bothers us, there are several ways to speed it up.
- API & code cleanliness:
  - I want to remove the 'old' API in a follow-up CL, which should mostly
    delete code, or replace it 1:1.
  - In a 2nd follow-up I want to delete much of the UTF-8 handling in Blink
    for streaming streams.
  - The "bookmark" is now always implemented (and mostly very fast), so we
    should be able to use it for more things.
- Testing & correctness:
  - The unit tests now cover all stream implementations,
    and are pretty good and triggering all the edge cases.
  - Vastly more DCHECKs of the invariants.

BUG=v8:4947

Committed: https://crrev.com/642d6d314c7934a05cd2386e4b10fca3267769fc
Cr-Commit-Position: refs/heads/master@{#39464}
==========

commit-bot: I haz the power

4 years, 3 months ago (2016-09-16 08:30:03 UTC) #85

Message was sent while issue was closed.

Patchset 9 (id:??) landed as
https://crrev.com/642d6d314c7934a05cd2386e4b10fca3267769fc
Cr-Commit-Position: refs/heads/master@{#39464}

Issue 2314663002: Rework scanner-character-streams. (Closed)

Description

Patch Set 1 #

Patch Set 2 : Some fixes, and marching down the very long road to make all compilers happy. #

Patch Set 3 : Add comments to document Utf8ExternalStreamingStream's internal state. #

Patch Set 4 : Marja's feedback, round 1. #

Patch Set 5 : Feedback, round 2. #

Patch Set 6 : Behold! A re-write of utf-8 handling (+ improved tests + fixes) #

Patch Set 7 : Fix compiler issue. #

Patch Set 8 : Niko's feedback and fix compile even harder #

Patch Set 9 : Marja's feedback. #

Messages