Issue 9965010: Regexp: Improve the speed that we scan for an initial point where a non-anchored (Closed)

Created:
8 years, 8 months ago by Erik Corry

Modified:
8 years, 8 months ago

Reviewers:
ulan

CC:
v8-dev

Base URL:
http://v8.googlecode.com/svn/branches/bleeding_edge/

Visibility:
Public.

Description

Regexp: Improve the speed that we scan for an initial point where a non-anchored regexp can match by using a Boyer-Moore-like table.

This is done by identifying non-greedy non-capturing loops in the nodes that eat any character one at a time. For example in the middle of the regexp /foo[\s\S]*?bar/ we find such a loop. There is also such a loop implicitly inserted at the start of any non-anchored regexp.

When we have found such a loop we look ahead in the nodes to find the set of characters that can come at given distances. For example for the regexp /.?foo/ we know that there are at least 3 characters ahead of us, and the sets of characters that can occur are [any, [f, o], [o]]. We find a range in the lookahead info where the set of characters is reasonably constrained. In our example this is from index 1 to 2 (0 is not constrained). We can now look 3 characters ahead and if we don't find one of [f, o] (the union of [f, o] and [o]) then we can skip forwards by the range size (in this case 2).

For Unicode input strings we do the same, but modulo 128.

We also look at the first string fed to the regexp and use that to get a hint of the character frequencies in the inputs. This affects the assessment of whether the set of characters is 'reasonably constrained'.

We still have the old lookahead mechanism, which uses a wide load of multiple characters followed by a mask and compare to determine whether a match is possible at this point.

Committed: http://code.google.com/p/v8/source/detail?r=11204
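The scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this CL: `MakeUnionTable` and `ScanForCandidate` are hypothetical names, the table is a plain `std::bitset`, and the constrained range (1 to 2 for /.?foo/) is passed in by hand rather than derived from regexp nodes.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Table size taken from the "modulo 128" remark in the description.
constexpr int kTableSize = 128;

// Hypothetical helper: builds the union of the character sets in the
// constrained lookahead range, e.g. "fo" for /.?foo/.
std::bitset<kTableSize> MakeUnionTable(const std::string& chars) {
  std::bitset<kTableSize> table;
  for (unsigned char c : chars) table.set(c % kTableSize);
  return table;
}

// Hypothetical helper: returns the first position at which a match might
// start, or subject.size() if no candidate position exists. A returned
// position is only a candidate; the full regexp must still be run there.
size_t ScanForCandidate(const std::string& subject,
                        const std::bitset<kTableSize>& table,
                        size_t min_lookahead, size_t max_lookahead) {
  assert(min_lookahead <= max_lookahead);
  // If the character at pos + max_lookahead is not in the union, no match
  // can start at pos, pos + 1, ..., pos + (max_lookahead - min_lookahead),
  // so it is safe to advance by the size of the range.
  const size_t skip = max_lookahead - min_lookahead + 1;
  size_t pos = 0;
  while (pos + max_lookahead < subject.size()) {
    unsigned char c = subject[pos + max_lookahead];
    if (table.test(c % kTableSize)) return pos;
    pos += skip;
  }
  return subject.size();
}
```

For /.?foo/ the call would be `ScanForCandidate(subject, MakeUnionTable("fo"), 1, 2)`; on the subject "xxxxxfoo" it inspects indices 2, 4 and 6 and stops at position 4, where the match "xfoo" does begin.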

Patch Set 1 #

Patch Set 2 : '' #

Patch Set 3 : '' #

Total comments: 58

Patch Set 4 : '' #

Patch Set 5 : '' #

Total comments: 6

Created: 8 years, 8 months ago

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+611 lines, -37 lines)			Patch
M	src/arm/regexp-macro-assembler-arm.cc	View	1 2 3	2 chunks	+12 lines, -4 lines	0 comments	Download
M	src/ia32/regexp-macro-assembler-ia32.cc	View	1 2 3	3 chunks	+6 lines, -6 lines	0 comments	Download
M	src/jsregexp.h	View	1 2 3	14 chunks	+137 lines, -3 lines	0 comments	Download
M	src/jsregexp.cc	View	1 2 3 4	19 chunks	+438 lines, -17 lines	6 comments	Download
M	src/x64/regexp-macro-assembler-x64.cc	View	1 2 3	2 chunks	+14 lines, -6 lines	0 comments	Download
M	test/cctest/test-regexp.cc	View		1 chunk	+4 lines, -1 line	0 comments	Download

Messages

Total messages: 7 (0 generated)

ulan

First round of comments, mostly nits that you can fix while I am reviewing GetSkipTable ...

8 years, 8 months ago (2012-03-30 13:04:48 UTC) #3

ulan

Second round of comments. http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc File src/jsregexp.cc (right): http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3349 src/jsregexp.cc:3349: int skip = max_lookahead - ...

8 years, 8 months ago (2012-03-30 16:54:51 UTC) #4

Erik Corry

8 years, 8 months ago (2012-04-01 00:48:46 UTC) #5

New version uploaded.  More comments and clearer heuristics.   Performance is a
little better, partly due to better heuristics and also with a small
contribution from Lasse's instruction micro-optimization comments on the
previous CL.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc
File src/jsregexp.cc (right):

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode808
src/jsregexp.cc:808: &sorted_frequencies_[0], &frequencies_[0], sizeof(sorted_frequencies_));
On 2012/03/30 13:04:48, ulan wrote:
> A nit: for consistency I would prefer breaking memcpy() line as in qsort().

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode817
src/jsregexp.cc:817: int freq = frequencies_[in_character].counter();
On 2012/03/30 13:04:48, ulan wrote:
> Either mask the in_character or ASSERT(in_character < kTableSize).

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode819
src/jsregexp.cc:819: return popular;
On 2012/03/30 13:04:48, ulan wrote:
> Named magic numbers become less magic :)
> 
> int freq_in_percents = (freq * 100) / total_frequencies_
> return (freq_in_percents > kPopularityThresholdInPercents);

Done.
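A sketch of the suggested shape, combined with the bounds check requested for line 817. The threshold value, the raw `int` array, and the function name `IsPopular` are assumptions made for illustration; the CL's own constant and data layout are not shown on this page.

```cpp
#include <cassert>

constexpr int kTableSize = 128;                    // assumed table size
constexpr int kPopularityThresholdInPercents = 5;  // hypothetical value

// Hypothetical free function: reports whether in_character accounts for
// more than the threshold percentage of all sampled characters.
bool IsPopular(const int (&frequencies)[kTableSize], int total_frequencies,
               int in_character) {
  assert(in_character >= 0 && in_character < kTableSize);
  if (total_frequencies == 0) return false;  // nothing sampled yet
  int freq_in_percents =
      (frequencies[in_character] * 100) / total_frequencies;
  return freq_in_percents > kPopularityThresholdInPercents;
}
```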

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode826
src/jsregexp.cc:826: explicit CharacterFrequency(int character) : counter_(0), character_(character) { }
On 2012/03/30 13:04:48, ulan wrote:
> Long line.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode2126
src/jsregexp.cc:2126: int offset, BoyerMooreLookahead* bm, bool not_at_start) {
On 2012/03/30 13:04:48, ulan wrote:
> For consistency with the surrounding code consider breaking the line as in
> EatsAtLeast.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode2593
src/jsregexp.cc:2593: if (body_can_be_zero_length_) {  // || info()->visited) {
On 2012/03/30 13:04:48, ulan wrote:
> Forgotten comment.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode2597
src/jsregexp.cc:2597: // VisitMarker marker(info());
On 2012/03/30 13:04:48, ulan wrote:
> Comment.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3112
src/jsregexp.cc:3112: if (compiler->ascii()) {
On 2012/03/30 13:04:48, ulan wrote:
> A nit, I like
> 
> max_char = compiler->ascii() ? String::kMaxAsciiCharCode 
>                              : String::kMaxUtf16CodeUnit;

Yes, but a bug in gcc means you get a linker error because it takes the address,
but then fails to emit the integer.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3117
src/jsregexp.cc:3117: if (ranges->at(0).to() < max_char) return NULL;
On 2012/03/30 13:04:48, ulan wrote:
> Consider using ranges(0)->IsEverything()

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3187
src/jsregexp.cc:3187: int preload_characters = eats_at_least > 4 ? 4 : eats_at_least;
On 2012/03/30 13:04:48, ulan wrote:
> Min(4, eats_at_least);

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3256
src/jsregexp.cc:3256: map_length_(map_length),
On 2012/03/30 13:04:48, ulan wrote:
> the Set() method assumes that map_length is a power of 2.
> We need to check it here, e.g. ASSERT((map_length & (map_length-1)) == 0).

Done.
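To illustrate why the check matters: masking with `map_length - 1` only behaves like a modulo when `map_length` is a power of two. The class below is a hypothetical stand-in (named `CharacterMap` here, backed by `std::vector<bool>`), not the class under review.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

inline bool IsPowerOfTwo(int x) { return x > 0 && (x & (x - 1)) == 0; }

// Hypothetical stand-in for a per-position character map.
class CharacterMap {
 public:
  explicit CharacterMap(int map_length) : map_length_(map_length) {
    // Set() and Get() mask with map_length_ - 1, which is only equivalent
    // to "modulo map_length_" for powers of two.
    assert(IsPowerOfTwo(map_length));
    bits_.assign(static_cast<size_t>(map_length), false);
  }

  void Set(int character) { bits_[character & (map_length_ - 1)] = true; }
  bool Get(int character) const {
    return bits_[character & (map_length_ - 1)];
  }

 private:
  int map_length_;
  std::vector<bool> bits_;
};
```

With a length of 128, a code unit such as 0x100 + 'o' lands in the same slot as 'o', which matches the "modulo 128" behaviour mentioned in the description.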

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3263
src/jsregexp.cc:3263: bitmaps_ = new ZoneList<ZoneList<bool>*>(length);
On 2012/03/30 13:04:48, ulan wrote:
> I wonder if bit-packing (ZoneList<bool>) would improve performance.

We normally only do this once per regexp, so I don't think it will show up.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3302
src/jsregexp.cc:3302: int vagueness_factor = 10;
On 2012/03/30 13:04:48, ulan wrote:
> Could you please write a comment describing the mathematical function that this
> loop is trying to maximize? Even better move the loop (or its body) out to a
> separate function and write a postcondition for that function.
> 
> Also, why 10 and 4?

I refactored this function because it still reflected too many of the experiments
I made that did not work out.  Sorry about that!

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3306
src/jsregexp.cc:3306: int max = length_ - 1;
On 2012/03/30 13:04:48, ulan wrote:
> This 'max' can just be called 'i'. I think there are too many 'max' words in
> variable names, so they turn into noise.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3314
src/jsregexp.cc:3314: if (max_vagueness < kTooManyCharacters) max_vagueness *= 2;
On 2012/03/30 13:04:48, ulan wrote:
> What is the rationale behind this?
> 
> Nit: max_vagueness can become greater than kTooManyCharacters, probably
> max_vagueness = Min(max_vagueness * 2, kTooManyCharacters) is more precise.

All refactored away.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3320
src/jsregexp.cc:3320: biggest_interval = remembered_max - max;
On 2012/03/30 13:04:48, ulan wrote:
> Consider maintaining (left, right) instead of (biggest_interval,
> biggest_interval_max). This way you don't have to do all those conversions like
> remembered_max - max
> 1 + biggest_interval_max - biggest_interval

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3349
src/jsregexp.cc:3349: int skip = max_lookahead - i;
On 2012/03/30 16:54:51, ulan wrote:
> This is the crucial part. A comment explaining why it is safe to skip that many
> characters would be helpful.

Done.
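One way to see why a skip of `max_lookahead - i` is safe, as a sketch under assumptions: `BuildSkipTable` and the representation of the lookahead sets as strings are hypothetical and are not taken from the CL. If the character read at `pos + max_lookahead` can occur no later than lookahead index `i`, then the nearest start position that could still match is `pos + (max_lookahead - i)`.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

constexpr int kTableSize = 128;  // assumed table size

// Hypothetical helper: sets[i] lists the characters that may appear i
// positions ahead of a match start. Only indices in
// [min_lookahead, max_lookahead] are consulted.
std::array<int, kTableSize> BuildSkipTable(
    const std::vector<std::string>& sets, int min_lookahead,
    int max_lookahead) {
  assert(0 <= min_lookahead && min_lookahead <= max_lookahead);
  assert(max_lookahead < static_cast<int>(sets.size()));
  std::array<int, kTableSize> table;
  // A character that occurs nowhere in the range rules out every start
  // position the range covers, so the whole range can be skipped.
  table.fill(max_lookahead - min_lookahead + 1);
  // Ascending order means the largest index i wins for each character,
  // which gives the smallest (and therefore safe) skip distance.
  for (int i = min_lookahead; i <= max_lookahead; i++) {
    for (unsigned char c : sets[i]) {
      table[c % kTableSize] = max_lookahead - i;
    }
  }
  return table;
}
```

For /.?foo/ with the range 1 to 2 this yields a skip of 0 for 'o' (a match may start here), 1 for 'f', and 2 for every other character.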

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3369
src/jsregexp.cc:3369: int boolean_skip_distance = sorted_table[kSize / 2];
On 2012/03/30 16:54:51, ulan wrote:
> Instead of kSize / 2 we can take any i in [0..kSize-1], right?
> 
> Could you please put a comment here explaining why kSize / 2?
> 
> I am not fond of the name 'boolean_skip_distance'. Maybe just skip_distance?

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3371
src/jsregexp.cc:3371: boolean_skip_table->set(i, table[i] >= boolean_skip_distance ? 0 : 1);
On 2012/03/30 13:04:48, ulan wrote:
> Can (x >= y ? 0 : 1) be replaced with (x < y) ?

We don't normally allow implicit bool->int and int->bool conversions.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3383
src/jsregexp.cc:3383: bool found_single_character = false;
On 2012/03/30 16:54:51, ulan wrote:
> Consider using a counter: 
> 
> character_count = 0;
> ...
> for (int j = 0; j < map_length_ && character_count < 2; j++) {
>   if (map->at(j)) {
>     single_character = j;
>     character_count++;
>   }
> }
> if (character_count > 1) break;

I want to bail out when I find the second character, so it would be misleading
to call it a count.
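The flag-based shape defended in the reply might look like the sketch below. The function name `FindSingleCharacter` and the use of `std::vector<bool>` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the only set entry in map, or
// -1 if the map has no set entries or more than one.
int FindSingleCharacter(const std::vector<bool>& map) {
  bool found_single_character = false;
  int single_character = -1;
  for (size_t j = 0; j < map.size(); j++) {
    if (!map[j]) continue;
    // Second set entry: bail out immediately, no counting needed.
    if (found_single_character) return -1;
    found_single_character = true;
    single_character = static_cast<int>(j);
  }
  return single_character;
}
```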

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3420
src/jsregexp.cc:3420: masm->CheckCharacter(single_character, &cont);
On 2012/03/30 16:54:51, ulan wrote:
> I think this implicitly assumes that
> String::kMaxAsciiCharCode < map_length &&
> String::kMaxAsciiCharCode < RegExpMacroAssembler::kTableSize

Fixed.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3598
src/jsregexp.cc:3598: if (alt1.guards() == NULL ||
On 2012/03/30 16:54:51, ulan wrote:
> This fits in one line.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3606
src/jsregexp.cc:3606: // a pattern of the form ...abc... where we can look 6 characters ahead
On 2012/03/30 16:54:51, ulan wrote:
> Can we put a similar comment in description of EmitSkipInstructions and
> GetSkipTable.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode5792
src/jsregexp.cc:5792: if (alt.guards() != NULL &&
On 2012/03/30 13:04:48, ulan wrote:
> This fits in one line.

Done.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode5843
src/jsregexp.cc:5843: for (int m = range.from(); m <= to; m++) {
On 2012/03/30 16:54:51, ulan wrote:
> Why don't we handle ignore_case() here like in the ATOM branch?

The character classes have had case-independence ranges added, but this is not
done for atoms (there is nowhere to put them). 

The correct solution is perhaps to convert atoms into character classes, though
this is what JSC do, and they want to move away from it (it may make some
optimization opportunities harder to discover).

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode5852
src/jsregexp.cc:5852: on_success()->FillInBMInfo(offset,
On 2012/03/30 13:04:48, ulan wrote:
> This fits in line if you put the comment in another line.

But the comment only applies to the line it is on, so that would obscure the
meaning.

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode6029
src/jsregexp.cc:6029: int half_way = sample_subject->length() / 2;
On 2012/03/30 13:04:48, ulan wrote:
> Why do we consider only the upper half of the sample string?

We consider a 100-character chunk in the middle of the sample string.

Fixed to centre the sample, rather than have it start at the centre.
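A centred sample could be taken roughly as follows. This is a sketch only: `SampleMiddle`, the `Sample` struct, and the use of `std::string` are hypothetical, and the constant name `kSampleSize` is borrowed from a later patch set of this review.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

constexpr int kSampleSize = 100;  // "100-character chunk" from the reply
constexpr int kTableSize = 128;   // assumed table size

// Hypothetical result type: per-character counts plus the sample length.
struct Sample {
  std::array<int, kTableSize> counts{};
  int chars_sampled = 0;
};

// Hypothetical helper: counts characters in a window of at most
// kSampleSize characters centred on the middle of the subject.
Sample SampleMiddle(const std::string& subject) {
  Sample sample;
  const int length = static_cast<int>(subject.size());
  // Start half a window before the midpoint, clamped for short strings.
  const int start = std::max(0, (length - kSampleSize) / 2);
  for (int i = start;
       i < length && sample.chars_sampled < kSampleSize; i++) {
    unsigned char c = subject[i];
    sample.counts[c % kTableSize]++;
    sample.chars_sampled++;
  }
  assert(sample.chars_sampled <= kSampleSize);
  return sample;
}
```

Keeping the limit in the loop condition means `chars_sampled` always equals the number of characters actually counted.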

http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode6031
src/jsregexp.cc:6031: if (chars_sampled++ > 100) break;
On 2012/03/30 13:04:48, ulan wrote:
> A nit: consider moving chars_sampled++ either to the for loop header or to the
> end of the loop body. Otherwise, if the loop breaks, chars_sampled will be one
> more than the chars actually sampled.

Done.

ulan

LGTM http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc File src/jsregexp.cc (right): http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode3316 src/jsregexp.cc:3316: int probability = (in_quickcheck_range ? kSize / 2 ...

8 years, 8 months ago (2012-04-02 08:12:24 UTC) #6

Erik Corry

8 years, 8 months ago (2012-04-02 09:36:22 UTC) #7

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc
File src/jsregexp.cc (right):

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode3316
src/jsregexp.cc:3316: int probability = (in_quickcheck_range ? kSize / 2 : kSize) - frequency;
On 2012/04/02 08:12:24, ulan wrote:
> This shouldn't affect the correctness but:
> 
> if I read the code correctly, the probability can be negative here, because
> frequency can be > kSize. I think the upper bound on frequency is 2*kSize.
> 
> Also, did you mean (kSize - frequency) / 2 in the quickcheck case?

This code is as intended.  I added a few comments.

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode3329
src/jsregexp.cc:3329: // occur occur in the subject string in the range between min_lookahead and
On 2012/04/02 08:12:24, ulan wrote:
> 'occur' twice.

Done.

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode6018
src/jsregexp.cc:6018: if (chars_sampled > kSampleSize) break;
On 2012/04/02 08:12:24, ulan wrote:
> Now we can move this into the for loop condition.
> i < sample_subject->length() && chars_sampled < kSampleSize

Done.
